LKML Archive on lore.kernel.org
* [PATCH 00/22 -v7] mcount and latency tracing utility -v7
@ 2008-01-30  3:15 Steven Rostedt
  2008-01-30  3:15 ` [PATCH 01/22 -v7] printk - dont wakeup klogd with interrupts disabled Steven Rostedt
                   ` (21 more replies)
  0 siblings, 22 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven


[
  version 7  (and hopefully last) of mcount / trace patches:

  changes include:

   Ported to latest git 0ba6c33bcddc64a54b5f1c25a696c4767dc76292

   Moved the markers around so they would only be armed when used,
   this brings down the overhead dramatically.

   Added printing of the "to" process name in the sched switch output:
    ksoftirq-8     2d..3 120829us+:  8:49:S --> 11:115 group_balance
    group_ba-11    2d..3 120836us!:  11:115:S --> 0:140 <idle>

   Removed notrace from the NMI handlers. I've tested this a little with
   NMIs and function tracing, and it seems to work fine.

   Added a "disable" entry to available_tracers that will unregister all
   tracers when written into current_tracer.

   Ran new benchmarks and got better results! See below.
]

All released versions of these patches can be found at:

   http://people.redhat.com/srostedt/tracing/


The following patch series brings to vanilla Linux a bit of the RT kernel
trace facility. It uses the "-pg" profiling option of gcc, which
inserts a call to the "mcount" function at the start of every function
compiled into the kernel.

Note: I did investigate using -finstrument-functions, but that adds a call
to both the start and the end of a function, whereas mcount hooks only the
beginning. mcount alone adds ~13% overhead; -finstrument-functions added
~19%.  It also forced me to play tricks with inlining, because it adds the
function calls to inline functions as well.

This patch series implements the code for x86 (32 and 64 bit), but
other archs can easily be implemented as well (note: ARM and PPC are
already implemented in -rt)

Some Background:
----------------

A while back, Ingo Molnar and William Lee Irwin III created a latency tracer
to find problem latency areas in the kernel for the RT patch.  This tracer
became an integral part of the RT kernel for pinpointing latency hot
spots.  One of the features the latency tracer added was a
function trace.  This function tracer would record all functions that
were called (implemented via the gcc "-pg" option) and would show what was
called while interrupts or preemption were turned off.

This feature is also very helpful in normal debugging, so there has been
talk of taking bits and pieces from the RT latency tracer and bringing
them to LKML. But no one had the time to do it.

Arnaldo Carvalho de Melo took a crack at it. He pulled out the mcount
code as well as part of the tracing code, and made it generic from the
point of view of the tracing code.  I'm not sure why this stopped;
probably because Arnaldo is a very busy man, and his efforts were needed
elsewhere.

While I still maintain my own Logdev utility:

  http://rostedt.homelinux.com/logdev

I came across a need to use mcount with logdev too. I was successful,
but found that it became very dependent on a lot of code. One thing that
I like about my logdev utility is that it is very non-intrusive, and has
been easy to port since the Linux 2.0 days. I did not want to burden the
logdev patch with the intrusiveness of mcount (not that mcount is really
intrusive; it just needs a "notrace" annotation added to functions in the
kernel, which would cause more conflicts for me when applying patches).

With the holidays approaching, I grabbed Arnaldo's old patches and started
massaging them into something that could be useful for logdev. What I
found out (after talking it over with Arnaldo) was that this could be
much more useful for others as well.

The main thing I changed was making the mcount function itself
generic, removing its dependency on the tracing code.  That is, I added

register_mcount_function()
 and
clear_mcount_function()

So whenever mcount is enabled and a function is registered, that function
is called from every function in the kernel that is not labeled with the
"notrace" annotation.


The Simple Tracer:
------------------

To show the power of this, I also massaged the tracer code that Arnaldo
pulled from the RT patch and turned it into a nice example of what can be
done with this.

The function that is registered to mcount has the prototype:

 void func(unsigned long ip, unsigned long parent_ip);

The ip is the address of the function and parent_ip is the address of
the parent function that called it.

The x86_64 version has the assembly call the registered function directly
to save having to do a double function call.

To enable mcount, a sysctl is added:

   /proc/sys/kernel/mcount_enabled

Once mcount is enabled, when a function is registered, it will be called
from all functions. The tracer in this patch series shows how this is done.
It adds a directory in debugfs called mctracer, with a ctrl file that
lets the user have the tracer register its function.  Note, the order
of enabling mcount and registering a function is not important, but both
must be done to initiate the tracing. That is, you can disable tracing
by either disabling mcount or by clearing the registered function.

When one function is registered, it is called directly from the mcount
assembly. If more than one function is registered, a "loop" function
is called that calls all the registered functions.

Here's a simple example of the tracer output:

CPU 2: hackbench:11867 preempt_schedule+0xc/0x84 <-- avc_has_perm_noaudit+0x45d/0x52c
CPU 1: hackbench:12052 selinux_file_permission+0x10/0x11c <-- security_file_permission+0x16/0x18
CPU 3: hackbench:12017 update_curr+0xe/0x8b <-- put_prev_task_fair+0x24/0x4c
CPU 2: hackbench:11867 avc_audit+0x16/0x9e3 <-- avc_has_perm+0x51/0x63
CPU 0: hackbench:12019 socket_has_perm+0x16/0x7c <-- selinux_socket_sendmsg+0x27/0x3e
CPU 1: hackbench:12052 file_has_perm+0x16/0xbb <-- selinux_file_permission+0x104/0x11c

This is formatted like:

 CPU <CPU#>: <task-comm>:<task-pid> <function> <-- <parent-function>


Latency Tracer Format:
----------------------

The format used by the RT patch is a bit more complex. It is designed to
record a lot of information quickly and dump out a lot too.

There are two versions of the format: verbose and non-verbose.

verbose:

preemption latency trace v1.1.5 on 2.6.24-rc7-tst
--------------------------------------------------------------------
 latency: 89 us, #3/3, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
    -----------------
    | task: kjournald-600 (uid:0 nice:-5 policy:0 rt_prio:0)
    -----------------
 => started at: _spin_lock_irqsave+0x2a/0x63 <c06310d2>
 => ended at:   _spin_unlock_irqrestore+0x32/0x41 <c0631245>

       kjournald   600 1 1 00000000 00000000 [397408f1] 0.003ms (+0.079ms): _spin_lock_irqsave+0x2a/0x63 <c06310d2> (scsi_dispatch_cmd+0x155/0x234 [scsi_mod] <f8867c19>)
       kjournald   600 1 1 00000000 00000001 [39740940] 0.081ms (+0.005ms): _spin_unlock_irqrestore+0x32/0x41 <c0631245> (scsi_dispatch_cmd+0x1be/0x234 [scsi_mod] <f8867c82>)
       kjournald   600 1 1 00000000 00000002 [39740945] 0.087ms (+0.000ms): trace_hardirqs_on_caller+0x74/0x86 <c0508bdc> (_spin_unlock_irqrestore+0x32/0x41 <c0631245>)


non-verbose:

preemption latency trace v1.1.5 on 2.6.24-rc7-tst
--------------------------------------------------------------------
 latency: 89 us, #3/3, CPU#2 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
    -----------------
    | task: kjournald-600 (uid:0 nice:-5 policy:0 rt_prio:0)
    -----------------
 => started at: _spin_lock_irqsave+0x2a/0x63 <c06310d2>
 => ended at:   _spin_unlock_irqrestore+0x32/0x41 <c0631245>

                 _------=> CPU#            
                / _-----=> irqs-off        
               | / _----=> need-resched    
               || / _---=> hardirq/softirq 
               ||| / _--=> preempt-depth   
               |||| /                      
               |||||     delay             
   cmd     pid ||||| time  |   caller      
      \   /    |||||   \   |   /           
kjournal-600   1d...    3us+: _spin_lock_irqsave+0x2a/0x63 <c06310d2> (scsi_dispatch_cmd+0x155/0x234 [scsi_mod] <f8867c19>)
kjournal-600   1d...   81us+: _spin_unlock_irqrestore+0x32/0x41 <c0631245> (scsi_dispatch_cmd+0x1be/0x234 [scsi_mod] <f8867c82>)
kjournal-600   1d...   87us : trace_hardirqs_on_caller+0x74/0x86 <c0508bdc> (_spin_unlock_irqrestore+0x32/0x41 <c0631245>)


Debug FS:
---------

Although enabling and disabling mcount is done through the sysctl:

/proc/sys/kernel/mcount_enabled

The rest of the tracing uses debugfs.

/debugfs/tracing

Here are the available files:

 available_tracers
  Lists the tracers that are compiled into the kernel and can be
  echoed into current_tracer

 current_tracer
   Shows the current tracer that is registered. On bootup this is
   blank. To set a new tracer, simply echo into this file the name
   found in available_tracers. To unregister all tracers, echo
   "disable" into this file.

   e.g.

     echo wakeup > /debugfs/tracing/current_tracer

 trace_ctrl
   echo 1 to enable the current tracer.
   echo 0 to disable it.

    Note, the function trace also needs mcount_enabled set,
    otherwise it will not record.

 latency_trace
   Outputs the current trace in latency_trace format.

 tracing_thresh
   echo a number (in usecs) into this to record all traces that are
   greater than the threshold. This only matters for the tracers that use
   a max threshold (preemptoff, irqsoff and wakeup).

 iter_ctrl
   echo "symonly" to not show the instruction pointers in the trace
   echo "nosymonly" to disable symonly.
   echo "verbose" for verbose output from latency format.
   echo "noverbose" to disable verbose output.
   cat iter_ctrl to see the current settings.

 tracing_max_latency
   Holds the current max critical latency.
   echo 0 to reset and start tracing.
   Only holds for those tracers that use a max setting.
   (preemptoff, irqsoff and wakeup)

 trace
   simple output format of the current trace.

 latency_hist/
   This is a directory of histograms (when configured)
   This directory holds histograms for preemption off, irqs off,
   preemption and/or irqs off, and wakeup timings.
   (all numbers are in usecs)

Trace tool:
-----------

 The trace-cmd.c source can be found at
   http://people.redhat.com/srostedt/tracing/trace-cmd.c

 This tool will set up the tracer and allow you to run a program
 and trace it. (must be root)

  e.g.

   # ./trace-cmd -f echo hi
   # cat /debug/tracing/latency_trace | grep echo | head
    echo-4240  1d..1 1076us : ret_from_fork+0x6/0x1c (schedule_tail+0x9/0x5e)
    echo-4240  1d..1 1077us : schedule_tail+0x1e/0x5e (finish_task_switch+0xb/0x61)
    echo-4240  1d..1 1078us : finish_task_switch+0x24/0x61 (_spin_unlock_irq+0x8/0x40)
    echo-4240  1...1 1079us : _spin_unlock_irq+0x27/0x40 (sub_preempt_count+0xd/0x104)
    echo-4240  1...1 1080us+: sub_preempt_count+0x8d/0x104 (in_lock_functions+0x8/0x2c)
    echo-4240  1d... 1082us : error_code+0x72/0x78 (do_page_fault+0xe/0x7c0)
    echo-4240  1d... 1083us : do_page_fault+0x35b/0x7c0 (add_preempt_count+0xc/0x10f)
    echo-4240  1d..1 1085us : add_preempt_count+0x53/0x10f (in_lock_functions+0x8/0x2c)
    echo-4240  1d..1 1086us : do_page_fault+0x38f/0x7c0 (sub_preempt_count+0xd/0x104)
    echo-4240  1d..1 1087us : sub_preempt_count+0x8d/0x104 (in_lock_functions+0x8/0x2c)


  The available switches are:

    -s - context switch tracing
    -p - preemption off tracing
    -i - interrupts off tracing
    -b - both interrupts off and preemption off tracing
    -w - wakeup timings
    -e - event tracing
    -f - function trace

  Only -f (function trace) may be enabled with the other tracers,
  since most of the other tracers also have a function trace version
  if mcount is enabled (yes, -f will enable mcount).


Overhead:
---------

Note that having mcount compiled in seems to show a little overhead.

Here are the new runs: running "hackbench 50" 10 times each.

Without any of the patches:
  Avg: 4.1953 secs

With Event tracer, wakeup timings, context switch compiled in
but all turned off (marker overhead only).
  Avg: 4.2649 (1.66%)

Same config but event tracer registered but not enabled:
  Avg: 4.2717 (1.82%)

Event tracer enabled:
  Avg: 4.3249 (3.09%)

Wakeup registered but not enabled:
  Avg: 4.2942 (2.36%)

Wakeup enabled:
  Avg: 4.3124 (2.79%)

sched_switch registered but not enabled:
  Avg: 4.3118 (2.78%)

sched_switch enabled:
  Avg: 4.2552 (1.43%) /* seems to be in the noise */


Preempt off timings configured:
  preempt not enabled:
    Avg: 4.6768 (11.48%)

  preempt enabled:
    Avg: 9.1784 (118.78%) /* expected */

Irqs off timings configured:
  irqsoff not enabled:
    Avg: 4.7744 (13.80%)

  irqsoff enabled:
    Avg: 10.3651 (147.04%)

preempt off and irqs off configured:
  everything not enabled:
    Avg: 5.0823 (21.14%)

  irqsoff enabled:
    Avg: 16.0194 (281.84%)

  preemptoff enabled:
    Avg: 9.8112 (133.86%)

  preemptirqsoff enabled:
    Avg: 15.9543 (280.29%)

Mcount and function trace configured:

  mcount not enabled:
    4.8638 (15.93%) /* note - I removed a bit of notrace
                         which can increase this number */

  mcount enabled:
    6.2816 (49.74%)

  mcount and function tracing enabled:
    25.2035 (500.76%)  /* this may seem big, but this is down
                          from 113 secs! (2590%) */

Wakeup histogram configured (always on):
  Avg: 4.268 (1.73%)

irqs off histogram configured (always on):
  Avg: 6.9054 (64.60%)

preempt off histogram configured (always on):
  Avg: 5.8953 (40.52%)

irqs and preempt off histograms configured (always on):
  Avg: 14.808 (252.85%)

For full output of the above numbers see:
  http://people.redhat.com/srostedt/tracing/hack-results.cal

Wakeup timings, event tracing and context switch tracing add virtually no
overhead when configured.  The interrupts off and preemption off tracers,
as well as mcount, do add overhead, and should only be in debugging kernels.

Future:
-------
The way the mcount hook is done here, other utilities can easily add their
own functions. Care just needs to be taken not to call anything that is not
marked with notrace, or you will crash the box with recursion. But
even the simple tracer adds a "disabled" feature, so if it happens
to call something that is not marked with notrace, there is a safety net
to keep from killing the box.

I plan on looking into converting lockdep to use markers so that lockdep
and the preempt and irqs off tracers can be run time enabled without
adding much overhead when configured in.

SystemTap:
----------
One thing that Arnaldo and I discussed last year was using SystemTap to
add hooks into the kernel to start and stop tracing.  kprobes is too
heavy to use on all function calls, but it would be perfect for adding to
non-hot paths to start and stop the tracer.

So when debugging the kernel, instead of recompiling with printks
or other markers, you could simply use SystemTap to place trace start
and stop locations and trace the problem areas to see what is happening.



These are just some of the ideas we have with this. And we are sure others
could come up with more.

These patches are the underlying work. We'll see what happens next.






* [PATCH 01/22 -v7] printk - dont wakeup klogd with interrupts disabled
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation Steven Rostedt
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: printk-no-klogd-on-rq-locked.patch --]
[-- Type: text/plain, Size: 4145 bytes --]

[ This patch is added to the series since the wakeup timings trace
  may lockup without it. ]

I thought that one could place a printk anywhere without worrying.
But it seems that it is not wise to place a printk where the runqueue
lock is held.

I just spent two hours debugging why some of my code was locking up,
to find that the lockup was caused by some debugging printk's that
I had in the scheduler.  The printk's were only in rare paths, so
they shouldn't have been too much of a problem, but after I hit one
the system locked up.

Thinking that it was locking up on my code I went looking down the
wrong path. I finally found (after examining an NMI dump) that
the lockup happened because printk was trying to wake up the klogd
daemon, which caused a deadlock when the try_to_wake_up() code tried
to grab the runqueue lock.

This patch adds a runqueue_is_locked interface in sched.c for other
files to see if the current runqueue lock is held. This is used
in printk to determine whether it is safe or not to wakeup the klogd.

And with this patch, my code ran fine ;-)

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/sched.h |    2 ++
 kernel/printk.c       |   14 ++++++++++----
 kernel/sched.c        |   18 ++++++++++++++++++
 3 files changed, 30 insertions(+), 4 deletions(-)

Index: linux-mcount.git/kernel/printk.c
===================================================================
--- linux-mcount.git.orig/kernel/printk.c	2008-01-29 17:02:10.000000000 -0500
+++ linux-mcount.git/kernel/printk.c	2008-01-29 17:25:40.000000000 -0500
@@ -590,9 +590,11 @@ static int have_callable_console(void)
  * @fmt: format string
  *
  * This is printk().  It can be called from any context.  We want it to work.
- * Be aware of the fact that if oops_in_progress is not set, we might try to
- * wake klogd up which could deadlock on runqueue lock if printk() is called
- * from scheduler code.
+ *
+ * Note: if printk() is called with the runqueue lock held, it will not wake
+ * up the klogd. This is to avoid a deadlock from calling printk() in schedule
+ * with the runqueue lock held and having the wake_up grab the runqueue lock
+ * as well.
  *
  * We try to grab the console_sem.  If we succeed, it's easy - we log the output and
  * call the console drivers.  If we fail to get the semaphore we place the output
@@ -1001,7 +1003,11 @@ void release_console_sem(void)
 	console_locked = 0;
 	up(&console_sem);
 	spin_unlock_irqrestore(&logbuf_lock, flags);
-	if (wake_klogd)
+	/*
+	 * If we try to wake up klogd while printing with the runqueue lock
+	 * held, this will deadlock.
+	 */
+	if (wake_klogd && !runqueue_is_locked())
 		wake_up_klogd();
 }
 EXPORT_SYMBOL(release_console_sem);
Index: linux-mcount.git/include/linux/sched.h
===================================================================
--- linux-mcount.git.orig/include/linux/sched.h	2008-01-29 17:02:10.000000000 -0500
+++ linux-mcount.git/include/linux/sched.h	2008-01-29 17:25:40.000000000 -0500
@@ -222,6 +222,8 @@ extern void sched_init_smp(void);
 extern void init_idle(struct task_struct *idle, int cpu);
 extern void init_idle_bootup_task(struct task_struct *idle);
 
+extern int runqueue_is_locked(void);
+
 extern cpumask_t nohz_cpu_mask;
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
 extern int select_nohz_load_balancer(int cpu);
Index: linux-mcount.git/kernel/sched.c
===================================================================
--- linux-mcount.git.orig/kernel/sched.c	2008-01-29 16:59:15.000000000 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-29 17:25:40.000000000 -0500
@@ -621,6 +621,24 @@ unsigned long rt_needs_cpu(int cpu)
 # define const_debug static const
 #endif
 
+/**
+ * runqueue_is_locked
+ *
+ * Returns true if the current cpu runqueue is locked.
+ * This interface allows printk to be called with the runqueue lock
+ * held and know whether or not it is OK to wake up the klogd.
+ */
+int runqueue_is_locked(void)
+{
+	int cpu = get_cpu();
+	struct rq *rq = cpu_rq(cpu);
+	int ret;
+
+	ret = spin_is_locked(&rq->lock);
+	put_cpu();
+	return ret;
+}
+
 /*
  * Debugging: various feature bits
  */

-- 


* [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
  2008-01-30  3:15 ` [PATCH 01/22 -v7] printk - dont wakeup klogd with interrupts disabled Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  8:46   ` Peter Zijlstra
  2008-01-30 13:21   ` Jan Kiszka
  2008-01-30  3:15 ` [PATCH 03/22 -v7] Annotate core code that should not be traced Steven Rostedt
                   ` (19 subsequent siblings)
  21 siblings, 2 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: mcount-add-basic-support-for-gcc-profiler-instrum.patch --]
[-- Type: text/plain, Size: 12600 bytes --]

If CONFIG_MCOUNT is selected and /proc/sys/kernel/mcount_enabled is set to a
non-zero value, the mcount routine will be called every time we enter a
kernel function that is not marked with the "notrace" attribute.

The mcount routine will then call a registered function if a function
happens to be registered.

[This code has been highly hacked by Steven Rostedt, so don't
 blame Arnaldo for all of this ;-) ]

Update:
  It is now possible to register more than one mcount function.
  If only one mcount function is registered, that will be the
  function that mcount calls directly. If more than one function
  is registered, then mcount will call a function that will loop
  through the functions to call.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 Makefile                   |    3 
 arch/x86/Kconfig           |    1 
 arch/x86/kernel/entry_32.S |   25 +++++++
 arch/x86/kernel/entry_64.S |   36 +++++++++++
 include/linux/linkage.h    |    2 
 include/linux/mcount.h     |   38 ++++++++++++
 kernel/sysctl.c            |   11 +++
 lib/Kconfig.debug          |    1 
 lib/Makefile               |    2 
 lib/tracing/Kconfig        |   10 +++
 lib/tracing/Makefile       |    3 
 lib/tracing/mcount.c       |  141 +++++++++++++++++++++++++++++++++++++++++++++
 12 files changed, 273 insertions(+)

Index: linux-mcount.git/Makefile
===================================================================
--- linux-mcount.git.orig/Makefile	2008-01-29 17:01:56.000000000 -0500
+++ linux-mcount.git/Makefile	2008-01-29 17:26:17.000000000 -0500
@@ -509,6 +509,9 @@ endif
 
 include $(srctree)/arch/$(SRCARCH)/Makefile
 
+ifdef CONFIG_MCOUNT
+KBUILD_CFLAGS	+= -pg
+endif
 ifdef CONFIG_FRAME_POINTER
 KBUILD_CFLAGS	+= -fno-omit-frame-pointer -fno-optimize-sibling-calls
 else
Index: linux-mcount.git/arch/x86/Kconfig
===================================================================
--- linux-mcount.git.orig/arch/x86/Kconfig	2008-01-29 16:59:15.000000000 -0500
+++ linux-mcount.git/arch/x86/Kconfig	2008-01-29 17:26:18.000000000 -0500
@@ -19,6 +19,7 @@ config X86_64
 config X86
 	bool
 	default y
+	select HAVE_MCOUNT
 
 config GENERIC_TIME
 	bool
Index: linux-mcount.git/arch/x86/kernel/entry_32.S
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/entry_32.S	2008-01-29 16:59:15.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/entry_32.S	2008-01-29 17:26:18.000000000 -0500
@@ -75,6 +75,31 @@ DF_MASK		= 0x00000400 
 NT_MASK		= 0x00004000
 VM_MASK		= 0x00020000
 
+#ifdef CONFIG_MCOUNT
+.globl mcount
+mcount:
+	/* unlikely(mcount_enabled) */
+	cmpl $0, mcount_enabled
+	jnz trace
+	ret
+
+trace:
+	/* taken from glibc */
+	pushl %eax
+	pushl %ecx
+	pushl %edx
+	movl 0xc(%esp), %edx
+	movl 0x4(%ebp), %eax
+
+	call   *mcount_trace_function
+
+	popl %edx
+	popl %ecx
+	popl %eax
+
+	ret
+#endif
+
 #ifdef CONFIG_PREEMPT
 #define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
Index: linux-mcount.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/entry_64.S	2008-01-29 16:59:15.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/entry_64.S	2008-01-29 17:26:18.000000000 -0500
@@ -53,6 +53,42 @@
 
 	.code64
 
+#ifdef CONFIG_MCOUNT
+
+ENTRY(mcount)
+	/* unlikely(mcount_enabled) */
+	cmpl $0, mcount_enabled
+	jnz trace
+	retq
+
+trace:
+	/* taken from glibc */
+	subq $0x38, %rsp
+	movq %rax, (%rsp)
+	movq %rcx, 8(%rsp)
+	movq %rdx, 16(%rsp)
+	movq %rsi, 24(%rsp)
+	movq %rdi, 32(%rsp)
+	movq %r8, 40(%rsp)
+	movq %r9, 48(%rsp)
+
+	movq 0x38(%rsp), %rsi
+	movq 8(%rbp), %rdi
+
+	call   *mcount_trace_function
+
+	movq 48(%rsp), %r9
+	movq 40(%rsp), %r8
+	movq 32(%rsp), %rdi
+	movq 24(%rsp), %rsi
+	movq 16(%rsp), %rdx
+	movq 8(%rsp), %rcx
+	movq (%rsp), %rax
+	addq $0x38, %rsp
+
+	retq
+#endif
+
 #ifndef CONFIG_PREEMPT
 #define retint_kernel retint_restore_args
 #endif	
Index: linux-mcount.git/include/linux/linkage.h
===================================================================
--- linux-mcount.git.orig/include/linux/linkage.h	2008-01-29 16:59:15.000000000 -0500
+++ linux-mcount.git/include/linux/linkage.h	2008-01-29 17:26:18.000000000 -0500
@@ -3,6 +3,8 @@
 
 #include <asm/linkage.h>
 
+#define notrace __attribute__((no_instrument_function))
+
 #ifdef __cplusplus
 #define CPP_ASMLINKAGE extern "C"
 #else
Index: linux-mcount.git/include/linux/mcount.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git/include/linux/mcount.h	2008-01-29 17:26:18.000000000 -0500
@@ -0,0 +1,38 @@
+#ifndef _LINUX_MCOUNT_H
+#define _LINUX_MCOUNT_H
+
+#ifdef CONFIG_MCOUNT
+extern int mcount_enabled;
+
+#include <linux/linkage.h>
+
+#define CALLER_ADDR0 ((unsigned long)__builtin_return_address(0))
+#define CALLER_ADDR1 ((unsigned long)__builtin_return_address(1))
+#define CALLER_ADDR2 ((unsigned long)__builtin_return_address(2))
+
+typedef void (*mcount_func_t)(unsigned long ip, unsigned long parent_ip);
+
+struct mcount_ops {
+	mcount_func_t func;
+	struct mcount_ops *next;
+};
+
+/*
+ * The mcount_ops struct must be static and should also
+ * be read_mostly.  These functions do modify read_mostly variables
+ * so use them sparingly. Never free an mcount_op or modify the
+ * next pointer after it has been registered. Even after unregistering
+ * it, the next pointer may still be used internally.
+ */
+int register_mcount_function(struct mcount_ops *ops);
+int unregister_mcount_function(struct mcount_ops *ops);
+void clear_mcount_function(void);
+
+extern void mcount(void);
+
+#else /* !CONFIG_MCOUNT */
+# define register_mcount_function(ops) do { } while (0)
+# define unregister_mcount_function(ops) do { } while (0)
+# define clear_mcount_function() do { } while (0)
+#endif /* CONFIG_MCOUNT */
+#endif /* _LINUX_MCOUNT_H */
Index: linux-mcount.git/kernel/sysctl.c
===================================================================
--- linux-mcount.git.orig/kernel/sysctl.c	2008-01-29 17:02:10.000000000 -0500
+++ linux-mcount.git/kernel/sysctl.c	2008-01-29 17:26:18.000000000 -0500
@@ -46,6 +46,7 @@
 #include <linux/nfs_fs.h>
 #include <linux/acpi.h>
 #include <linux/reboot.h>
+#include <linux/mcount.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -514,6 +515,16 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+#ifdef CONFIG_MCOUNT
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "mcount_enabled",
+		.data		= &mcount_enabled,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
 #ifdef CONFIG_KMOD
 	{
 		.ctl_name	= KERN_MODPROBE,
Index: linux-mcount.git/lib/Kconfig.debug
===================================================================
--- linux-mcount.git.orig/lib/Kconfig.debug	2008-01-29 17:02:10.000000000 -0500
+++ linux-mcount.git/lib/Kconfig.debug	2008-01-29 17:26:18.000000000 -0500
@@ -562,5 +562,6 @@ config LATENCYTOP
 	  Enable this option if you want to use the LatencyTOP tool
 	  to find out which userspace is blocking on what kernel operations.
 
+source lib/tracing/Kconfig
 
 source "samples/Kconfig"
Index: linux-mcount.git/lib/Makefile
===================================================================
--- linux-mcount.git.orig/lib/Makefile	2008-01-29 17:02:10.000000000 -0500
+++ linux-mcount.git/lib/Makefile	2008-01-29 17:26:18.000000000 -0500
@@ -67,6 +67,8 @@ obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 
+obj-$(CONFIG_MCOUNT) += tracing/
+
 lib-$(CONFIG_GENERIC_BUG) += bug.o
 
 hostprogs-y	:= gen_crc32table
Index: linux-mcount.git/lib/tracing/Kconfig
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git/lib/tracing/Kconfig	2008-01-29 17:26:18.000000000 -0500
@@ -0,0 +1,10 @@
+
+# Archs that enable MCOUNT should select HAVE_MCOUNT
+config HAVE_MCOUNT
+       bool
+
+# MCOUNT itself is useless, or will just be added overhead.
+# It needs something to register a function with it.
+config MCOUNT
+	bool
+	select FRAME_POINTER
Index: linux-mcount.git/lib/tracing/Makefile
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git/lib/tracing/Makefile	2008-01-29 17:26:18.000000000 -0500
@@ -0,0 +1,3 @@
+obj-$(CONFIG_MCOUNT) += libmcount.o
+
+libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/mcount.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git/lib/tracing/mcount.c	2008-01-29 17:26:18.000000000 -0500
@@ -0,0 +1,141 @@
+/*
+ * Infrastructure for profiling code inserted by 'gcc -pg'.
+ *
+ * Copyright (C) 2007-2008 Steven Rostedt <srostedt@redhat.com>
+ *
+ * Originally ported from the -rt patch by:
+ *   Copyright (C) 2007 Arnaldo Carvalho de Melo <acme@redhat.com>
+ *
+ * Based on code in the latency_tracer, that is:
+ *
+ *  Copyright (C) 2004-2006 Ingo Molnar
+ *  Copyright (C) 2004 William Lee Irwin III
+ */
+
+#include <linux/module.h>
+#include <linux/mcount.h>
+
+/*
+ * Since we have nothing protecting between the test of
+ * mcount_trace_function and the call to it, we can't
+ * set it to NULL without risking a race that will have
+ * the kernel call the NULL pointer. Instead, we just
+ * set the function pointer to a dummy function.
+ */
+notrace void dummy_mcount_tracer(unsigned long ip,
+				 unsigned long parent_ip)
+{
+	/* do nothing */
+}
+
+static DEFINE_SPINLOCK(mcount_func_lock);
+static struct mcount_ops mcount_list_end __read_mostly =
+{
+	.func = dummy_mcount_tracer,
+};
+
+static struct mcount_ops *mcount_list __read_mostly = &mcount_list_end;
+mcount_func_t mcount_trace_function __read_mostly = dummy_mcount_tracer;
+int mcount_enabled __read_mostly;
+
+/* mcount is defined per arch in assembly */
+EXPORT_SYMBOL_GPL(mcount);
+
+notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
+{
+	struct mcount_ops *op = mcount_list;
+
+	while (op != &mcount_list_end) {
+		op->func(ip, parent_ip);
+		op = op->next;
+	}
+}
+
+/**
+ * register_mcount_function - register a function for profiling
+ * @ops: ops structure that holds the function for profiling.
+ *
+ * Register a function to be called by all functions in the
+ * kernel.
+ *
+ * Note: @ops->func and all the functions it calls must be labeled
+ *       with "notrace", otherwise it will go into a
+ *       recursive loop.
+ */
+int register_mcount_function(struct mcount_ops *ops)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&mcount_func_lock, flags);
+	ops->next = mcount_list;
+	/* ops->next must be visible before we update the list pointer */
+	smp_wmb();
+	mcount_list = ops;
+	/*
+	 * For one func, simply call it directly.
+	 * For more than one func, call the chain.
+	 */
+	if (ops->next == &mcount_list_end)
+		mcount_trace_function = ops->func;
+	else
+		mcount_trace_function = mcount_list_func;
+	spin_unlock_irqrestore(&mcount_func_lock, flags);
+
+	return 0;
+}
+
+/**
+ * unregister_mcount_function - unregister a function for profiling.
+ * @ops: ops structure that holds the function to unregister
+ *
+ * Unregister a function that was added to be called by mcount profiling.
+ */
+int unregister_mcount_function(struct mcount_ops *ops)
+{
+	unsigned long flags;
+	struct mcount_ops **p;
+	int ret = 0;
+
+	spin_lock_irqsave(&mcount_func_lock, flags);
+
+	/*
+	 * If we are the only function, then the mcount pointer is
+	 * pointing directly to that function.
+	 */
+	if (mcount_list == ops && ops->next == &mcount_list_end) {
+		mcount_trace_function = dummy_mcount_tracer;
+		mcount_list = &mcount_list_end;
+		goto out;
+	}
+
+	for (p = &mcount_list; *p != &mcount_list_end; p = &(*p)->next)
+		if (*p == ops)
+			break;
+
+	if (*p != ops) {
+		ret = -1;
+		goto out;
+	}
+
+	*p = (*p)->next;
+
+	/* If we only have one func left, then call that directly */
+	if (mcount_list->next == &mcount_list_end)
+		mcount_trace_function = mcount_list->func;
+
+ out:
+	spin_unlock_irqrestore(&mcount_func_lock, flags);
+
+	return ret;
+}
+
+/**
+ * clear_mcount_function - reset the mcount function
+ *
+ * This NULLs the mcount function and in essence stops
+ * tracing.  There may be lag
+ */
+void clear_mcount_function(void)
+{
+	mcount_trace_function = dummy_mcount_tracer;
+}

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 03/22 -v7] Annotate core code that should not be traced
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
  2008-01-30  3:15 ` [PATCH 01/22 -v7] printk - dont wakeup klogd with interrupts disabled Steven Rostedt
  2008-01-30  3:15 ` [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 04/22 -v7] x86_64: notrace annotations Steven Rostedt
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: mcount-annotate-generic-code.patch --]
[-- Type: text/plain, Size: 919 bytes --]

Mark functions in core code that should not be traced with the
"notrace" attribute, which prevents gcc from adding a call to
mcount on the annotated functions.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>

---
 lib/smp_processor_id.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-mcount.git/lib/smp_processor_id.c
===================================================================
--- linux-mcount.git.orig/lib/smp_processor_id.c	2008-01-25 21:46:50.000000000 -0500
+++ linux-mcount.git/lib/smp_processor_id.c	2008-01-25 21:47:03.000000000 -0500
@@ -7,7 +7,7 @@
 #include <linux/kallsyms.h>
 #include <linux/sched.h>
 
-unsigned int debug_smp_processor_id(void)
+notrace unsigned int debug_smp_processor_id(void)
 {
 	unsigned long preempt_count = preempt_count();
 	int this_cpu = raw_smp_processor_id();

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 04/22 -v7] x86_64: notrace annotations
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (2 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 03/22 -v7] Annotate core code that should not be traced Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 05/22 -v7] add notrace annotations to vsyscall Steven Rostedt
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: mcount-add-x86_64-notrace-annotations.patch --]
[-- Type: text/plain, Size: 2212 bytes --]

Add "notrace" annotation to x86_64 specific files.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/x86/kernel/head64.c     |    2 +-
 arch/x86/kernel/setup64.c    |    4 ++--
 arch/x86/kernel/smpboot_64.c |    2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/head64.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/head64.c	2008-01-25 21:46:50.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/head64.c	2008-01-25 21:47:05.000000000 -0500
@@ -46,7 +46,7 @@ static void __init copy_bootdata(char *r
 	}
 }
 
-void __init x86_64_start_kernel(char * real_mode_data)
+notrace void __init x86_64_start_kernel(char *real_mode_data)
 {
 	int i;
 
Index: linux-mcount.git/arch/x86/kernel/setup64.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/setup64.c	2008-01-25 21:46:50.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/setup64.c	2008-01-25 21:47:05.000000000 -0500
@@ -114,7 +114,7 @@ void __init setup_per_cpu_areas(void)
 	}
 } 
 
-void pda_init(int cpu)
+notrace void pda_init(int cpu)
 { 
 	struct x8664_pda *pda = cpu_pda(cpu);
 
@@ -197,7 +197,7 @@ DEFINE_PER_CPU(struct orig_ist, orig_ist
  * 'CPU state barrier', nothing should get across.
  * A lot of state is already set up in PDA init.
  */
-void __cpuinit cpu_init (void)
+notrace void __cpuinit cpu_init(void)
 {
 	int cpu = stack_smp_processor_id();
 	struct tss_struct *t = &per_cpu(init_tss, cpu);
Index: linux-mcount.git/arch/x86/kernel/smpboot_64.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/smpboot_64.c	2008-01-25 21:46:50.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/smpboot_64.c	2008-01-25 21:47:05.000000000 -0500
@@ -317,7 +317,7 @@ static inline void set_cpu_sibling_map(i
 /*
  * Setup code on secondary processor (after coming out of the trampoline)
  */
-void __cpuinit start_secondary(void)
+notrace __cpuinit void start_secondary(void)
 {
 	/*
 	 * Dont put anything before smp_callin(), SMP

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 05/22 -v7] add notrace annotations to vsyscall.
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (3 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 04/22 -v7] x86_64: notrace annotations Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  8:49   ` Peter Zijlstra
  2008-01-30  3:15 ` [PATCH 06/22 -v7] handle accurate time keeping over long delays Steven Rostedt
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: mcount-add-x86-vdso-notrace-annotations.patch --]
[-- Type: text/plain, Size: 4566 bytes --]

Add the notrace annotations to some of the vsyscall functions.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/x86/kernel/vsyscall_64.c  |    3 ++-
 arch/x86/vdso/vclock_gettime.c |   15 ++++++++-------
 arch/x86/vdso/vgetcpu.c        |    3 ++-
 include/asm-x86/vsyscall.h     |    3 ++-
 4 files changed, 14 insertions(+), 10 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/vsyscall_64.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/vsyscall_64.c	2008-01-25 21:46:50.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/vsyscall_64.c	2008-01-25 21:47:06.000000000 -0500
@@ -42,7 +42,8 @@
 #include <asm/topology.h>
 #include <asm/vgtod.h>
 
-#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))
+#define __vsyscall(nr) \
+		__attribute__ ((unused, __section__(".vsyscall_" #nr))) notrace
 #define __syscall_clobber "r11","rcx","memory"
 #define __pa_vsymbol(x)			\
 	({unsigned long v;  		\
Index: linux-mcount.git/arch/x86/vdso/vclock_gettime.c
===================================================================
--- linux-mcount.git.orig/arch/x86/vdso/vclock_gettime.c	2008-01-25 21:46:50.000000000 -0500
+++ linux-mcount.git/arch/x86/vdso/vclock_gettime.c	2008-01-25 21:47:06.000000000 -0500
@@ -24,7 +24,7 @@
 
 #define gtod vdso_vsyscall_gtod_data
 
-static long vdso_fallback_gettime(long clock, struct timespec *ts)
+notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
 	long ret;
 	asm("syscall" : "=a" (ret) :
@@ -32,7 +32,7 @@ static long vdso_fallback_gettime(long c
 	return ret;
 }
 
-static inline long vgetns(void)
+notrace static inline long vgetns(void)
 {
 	long v;
 	cycles_t (*vread)(void);
@@ -41,7 +41,7 @@ static inline long vgetns(void)
 	return (v * gtod->clock.mult) >> gtod->clock.shift;
 }
 
-static noinline int do_realtime(struct timespec *ts)
+notrace static noinline int do_realtime(struct timespec *ts)
 {
 	unsigned long seq, ns;
 	do {
@@ -55,7 +55,8 @@ static noinline int do_realtime(struct t
 }
 
 /* Copy of the version in kernel/time.c which we cannot directly access */
-static void vset_normalized_timespec(struct timespec *ts, long sec, long nsec)
+notrace static void
+vset_normalized_timespec(struct timespec *ts, long sec, long nsec)
 {
 	while (nsec >= NSEC_PER_SEC) {
 		nsec -= NSEC_PER_SEC;
@@ -69,7 +70,7 @@ static void vset_normalized_timespec(str
 	ts->tv_nsec = nsec;
 }
 
-static noinline int do_monotonic(struct timespec *ts)
+notrace static noinline int do_monotonic(struct timespec *ts)
 {
 	unsigned long seq, ns, secs;
 	do {
@@ -83,7 +84,7 @@ static noinline int do_monotonic(struct 
 	return 0;
 }
 
-int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
+notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
 	if (likely(gtod->sysctl_enabled && gtod->clock.vread))
 		switch (clock) {
@@ -97,7 +98,7 @@ int __vdso_clock_gettime(clockid_t clock
 int clock_gettime(clockid_t, struct timespec *)
 	__attribute__((weak, alias("__vdso_clock_gettime")));
 
-int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
+notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
 {
 	long ret;
 	if (likely(gtod->sysctl_enabled && gtod->clock.vread)) {
Index: linux-mcount.git/arch/x86/vdso/vgetcpu.c
===================================================================
--- linux-mcount.git.orig/arch/x86/vdso/vgetcpu.c	2008-01-25 21:46:50.000000000 -0500
+++ linux-mcount.git/arch/x86/vdso/vgetcpu.c	2008-01-25 21:47:06.000000000 -0500
@@ -13,7 +13,8 @@
 #include <asm/vgtod.h>
 #include "vextern.h"
 
-long __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
+notrace long
+__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
 {
 	unsigned int dummy, p;
 
Index: linux-mcount.git/include/asm-x86/vsyscall.h
===================================================================
--- linux-mcount.git.orig/include/asm-x86/vsyscall.h	2008-01-25 21:46:50.000000000 -0500
+++ linux-mcount.git/include/asm-x86/vsyscall.h	2008-01-25 21:47:06.000000000 -0500
@@ -24,7 +24,8 @@ enum vsyscall_num {
 	((unused, __section__ (".vsyscall_gtod_data"),aligned(16)))
 #define __section_vsyscall_clock __attribute__ \
 	((unused, __section__ (".vsyscall_clock"),aligned(16)))
-#define __vsyscall_fn __attribute__ ((unused,__section__(".vsyscall_fn")))
+#define __vsyscall_fn \
+	__attribute__ ((unused, __section__(".vsyscall_fn"))) notrace
 
 #define VGETCPU_RDTSCP	1
 #define VGETCPU_LSL	2

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 06/22 -v7] handle accurate time keeping over long delays
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (4 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 05/22 -v7] add notrace annotations to vsyscall Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 07/22 -v7] initialize the clock source to jiffies clock Steven Rostedt
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: rt-time-starvation-fix.patch --]
[-- Type: text/plain, Size: 10266 bytes --]

Keep time accurate even if there is a long delay between
accumulations of clock cycles.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/powerpc/kernel/time.c    |    3 +-
 arch/x86/kernel/vsyscall_64.c |    5 ++-
 include/asm-x86/vgtod.h       |    2 -
 include/linux/clocksource.h   |   58 ++++++++++++++++++++++++++++++++++++++++--
 kernel/time/timekeeping.c     |   36 +++++++++++++-------------
 5 files changed, 82 insertions(+), 22 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/vsyscall_64.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/vsyscall_64.c	2008-01-25 21:47:06.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/vsyscall_64.c	2008-01-25 21:47:09.000000000 -0500
@@ -86,6 +86,7 @@ void update_vsyscall(struct timespec *wa
 	vsyscall_gtod_data.clock.mask = clock->mask;
 	vsyscall_gtod_data.clock.mult = clock->mult;
 	vsyscall_gtod_data.clock.shift = clock->shift;
+	vsyscall_gtod_data.clock.cycle_accumulated = clock->cycle_accumulated;
 	vsyscall_gtod_data.wall_time_sec = wall_time->tv_sec;
 	vsyscall_gtod_data.wall_time_nsec = wall_time->tv_nsec;
 	vsyscall_gtod_data.wall_to_monotonic = wall_to_monotonic;
@@ -121,7 +122,7 @@ static __always_inline long time_syscall
 
 static __always_inline void do_vgettimeofday(struct timeval * tv)
 {
-	cycle_t now, base, mask, cycle_delta;
+	cycle_t now, base, accumulated, mask, cycle_delta;
 	unsigned seq;
 	unsigned long mult, shift, nsec;
 	cycle_t (*vread)(void);
@@ -135,6 +136,7 @@ static __always_inline void do_vgettimeo
 		}
 		now = vread();
 		base = __vsyscall_gtod_data.clock.cycle_last;
+		accumulated  = __vsyscall_gtod_data.clock.cycle_accumulated;
 		mask = __vsyscall_gtod_data.clock.mask;
 		mult = __vsyscall_gtod_data.clock.mult;
 		shift = __vsyscall_gtod_data.clock.shift;
@@ -145,6 +147,7 @@ static __always_inline void do_vgettimeo
 
 	/* calculate interval: */
 	cycle_delta = (now - base) & mask;
+	cycle_delta += accumulated;
 	/* convert to nsecs: */
 	nsec += (cycle_delta * mult) >> shift;
 
Index: linux-mcount.git/include/asm-x86/vgtod.h
===================================================================
--- linux-mcount.git.orig/include/asm-x86/vgtod.h	2008-01-25 21:46:50.000000000 -0500
+++ linux-mcount.git/include/asm-x86/vgtod.h	2008-01-25 21:47:09.000000000 -0500
@@ -15,7 +15,7 @@ struct vsyscall_gtod_data {
 	struct timezone sys_tz;
 	struct { /* extract of a clocksource struct */
 		cycle_t (*vread)(void);
-		cycle_t	cycle_last;
+		cycle_t	cycle_last, cycle_accumulated;
 		cycle_t	mask;
 		u32	mult;
 		u32	shift;
Index: linux-mcount.git/include/linux/clocksource.h
===================================================================
--- linux-mcount.git.orig/include/linux/clocksource.h	2008-01-25 21:46:50.000000000 -0500
+++ linux-mcount.git/include/linux/clocksource.h	2008-01-25 21:47:09.000000000 -0500
@@ -50,8 +50,12 @@ struct clocksource;
  * @flags:		flags describing special properties
  * @vread:		vsyscall based read
  * @resume:		resume function for the clocksource, if necessary
+ * @cycle_last:		Used internally by timekeeping core, please ignore.
+ * @cycle_accumulated:	Used internally by timekeeping core, please ignore.
  * @cycle_interval:	Used internally by timekeeping core, please ignore.
  * @xtime_interval:	Used internally by timekeeping core, please ignore.
+ * @xtime_nsec:		Used internally by timekeeping core, please ignore.
+ * @error:		Used internally by timekeeping core, please ignore.
  */
 struct clocksource {
 	/*
@@ -82,7 +86,10 @@ struct clocksource {
 	 * Keep it in a different cache line to dirty no
 	 * more than one cache line.
 	 */
-	cycle_t cycle_last ____cacheline_aligned_in_smp;
+	struct {
+		cycle_t cycle_last, cycle_accumulated;
+	} ____cacheline_aligned_in_smp;
+
 	u64 xtime_nsec;
 	s64 error;
 
@@ -168,11 +175,44 @@ static inline cycle_t clocksource_read(s
 }
 
 /**
+ * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * @cs:		pointer to clocksource being read
+ * @now:	current cycle value
+ *
+ * Uses the clocksource to return the current cycle_t value.
+ * NOTE!!!: This is different from clocksource_read, because it
+ * returns the accumulated cycle value! Must hold xtime lock!
+ */
+static inline cycle_t
+clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+{
+	cycle_t offset = (now - cs->cycle_last) & cs->mask;
+	offset += cs->cycle_accumulated;
+	return offset;
+}
+
+/**
+ * clocksource_accumulate: - Accumulates clocksource cycles
+ * @cs:		pointer to clocksource being read
+ * @now:	current cycle value
+ *
+ * Used to avoid clocksource hardware overflow by periodically
+ * accumulating the current cycle delta. Must hold xtime write lock!
+ */
+static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
+{
+	cycle_t offset = (now - cs->cycle_last) & cs->mask;
+	cs->cycle_last = now;
+	cs->cycle_accumulated += offset;
+}
+
+/**
  * cyc2ns - converts clocksource cycles to nanoseconds
  * @cs:		Pointer to clocksource
  * @cycles:	Cycles
  *
  * Uses the clocksource and ntp adjustment to convert cycle_ts to nanoseconds.
+ * Must hold xtime lock!
  *
  * XXX - This could use some mult_lxl_ll() asm optimization
  */
@@ -184,13 +224,27 @@ static inline s64 cyc2ns(struct clocksou
 }
 
 /**
+ * ns2cyc - converts nanoseconds to clocksource cycles
+ * @cs:		Pointer to clocksource
+ * @nsecs:	Nanoseconds
+ */
+static inline cycle_t ns2cyc(struct clocksource *cs, u64 nsecs)
+{
+	cycle_t ret = nsecs << cs->shift;
+
+	do_div(ret, cs->mult + 1);
+
+	return ret;
+}
+
+/**
  * clocksource_calculate_interval - Calculates a clocksource interval struct
  *
  * @c:		Pointer to clocksource.
  * @length_nsec: Desired interval length in nanoseconds.
  *
  * Calculates a fixed cycle/nsec interval for a given clocksource/adjustment
- * pair and interval request.
+ * pair and interval request. Must hold xtime_lock!
  *
  * Unless you're the timekeeping code, you should not be using this!
  */
Index: linux-mcount.git/kernel/time/timekeeping.c
===================================================================
--- linux-mcount.git.orig/kernel/time/timekeeping.c	2008-01-25 21:46:50.000000000 -0500
+++ linux-mcount.git/kernel/time/timekeeping.c	2008-01-25 21:47:09.000000000 -0500
@@ -66,16 +66,10 @@ static struct clocksource *clock; /* poi
  */
 static inline s64 __get_nsec_offset(void)
 {
-	cycle_t cycle_now, cycle_delta;
+	cycle_t cycle_delta;
 	s64 ns_offset;
 
-	/* read clocksource: */
-	cycle_now = clocksource_read(clock);
-
-	/* calculate the delta since the last update_wall_time: */
-	cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;
-
-	/* convert to nanoseconds: */
+	cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
 	ns_offset = cyc2ns(clock, cycle_delta);
 
 	return ns_offset;
@@ -195,7 +189,7 @@ static void change_clocksource(void)
 
 	clock = new;
 	clock->cycle_last = now;
-
+	clock->cycle_accumulated = 0;
 	clock->error = 0;
 	clock->xtime_nsec = 0;
 	clocksource_calculate_interval(clock, NTP_INTERVAL_LENGTH);
@@ -205,9 +199,15 @@ static void change_clocksource(void)
 	printk(KERN_INFO "Time: %s clocksource has been installed.\n",
 	       clock->name);
 }
+
+void timekeeping_accumulate(void)
+{
+	clocksource_accumulate(clock, clocksource_read(clock));
+}
 #else
 static inline void change_clocksource(void) { }
 static inline s64 __get_nsec_offset(void) { return 0; }
+void timekeeping_accumulate(void) { }
 #endif
 
 /**
@@ -302,6 +302,7 @@ static int timekeeping_resume(struct sys
 	timespec_add_ns(&xtime, timekeeping_suspend_nsecs);
 	/* re-base the last cycle value */
 	clock->cycle_last = clocksource_read(clock);
+	clock->cycle_accumulated = 0;
 	clock->error = 0;
 	timekeeping_suspended = 0;
 	write_sequnlock_irqrestore(&xtime_lock, flags);
@@ -448,27 +449,28 @@ static void clocksource_adjust(s64 offse
  */
 void update_wall_time(void)
 {
-	cycle_t offset;
+	cycle_t cycle_now;
 
 	/* Make sure we're fully resumed: */
 	if (unlikely(timekeeping_suspended))
 		return;
 
 #ifdef CONFIG_GENERIC_TIME
-	offset = (clocksource_read(clock) - clock->cycle_last) & clock->mask;
+	cycle_now = clocksource_read(clock);
 #else
-	offset = clock->cycle_interval;
+	cycle_now = clock->cycle_last + clock->cycle_interval;
 #endif
+	clocksource_accumulate(clock, cycle_now);
+
 	clock->xtime_nsec += (s64)xtime.tv_nsec << clock->shift;
 
 	/* normally this loop will run just once, however in the
 	 * case of lost or late ticks, it will accumulate correctly.
 	 */
-	while (offset >= clock->cycle_interval) {
+	while (clock->cycle_accumulated >= clock->cycle_interval) {
 		/* accumulate one interval */
 		clock->xtime_nsec += clock->xtime_interval;
-		clock->cycle_last += clock->cycle_interval;
-		offset -= clock->cycle_interval;
+		clock->cycle_accumulated -= clock->cycle_interval;
 
 		if (clock->xtime_nsec >= (u64)NSEC_PER_SEC << clock->shift) {
 			clock->xtime_nsec -= (u64)NSEC_PER_SEC << clock->shift;
@@ -482,13 +484,13 @@ void update_wall_time(void)
 	}
 
 	/* correct the clock when NTP error is too big */
-	clocksource_adjust(offset);
+	clocksource_adjust(clock->cycle_accumulated);
 
 	/* store full nanoseconds into xtime */
 	xtime.tv_nsec = (s64)clock->xtime_nsec >> clock->shift;
 	clock->xtime_nsec -= (s64)xtime.tv_nsec << clock->shift;
 
-	update_xtime_cache(cyc2ns(clock, offset));
+	update_xtime_cache(cyc2ns(clock, clock->cycle_accumulated));
 
 	/* check to see if there is a new clocksource to use */
 	change_clocksource();
Index: linux-mcount.git/arch/powerpc/kernel/time.c
===================================================================
--- linux-mcount.git.orig/arch/powerpc/kernel/time.c	2008-01-25 21:46:50.000000000 -0500
+++ linux-mcount.git/arch/powerpc/kernel/time.c	2008-01-25 21:47:09.000000000 -0500
@@ -773,7 +773,8 @@ void update_vsyscall(struct timespec *wa
 	stamp_xsec = (u64) xtime.tv_nsec * XSEC_PER_SEC;
 	do_div(stamp_xsec, 1000000000);
 	stamp_xsec += (u64) xtime.tv_sec * XSEC_PER_SEC;
-	update_gtod(clock->cycle_last, stamp_xsec, t2x);
+	update_gtod(clock->cycle_last-clock->cycle_accumulated,
+		    stamp_xsec, t2x);
 }
 
 void update_vsyscall_tz(void)

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 07/22 -v7] initialize the clock source to jiffies clock.
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (5 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 06/22 -v7] handle accurate time keeping over long delays Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 08/22 -v7] add get_monotonic_cycles Steven Rostedt
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven

[-- Attachment #1: initialize-clocksource-to-jiffies.patch --]
[-- Type: text/plain, Size: 2086 bytes --]

The latency tracer can call clocksource_read very early in bootup,
before the clock source variable has been initialized. This results
in a crash at boot up (even before earlyprintk is initialized),
since the clock->read pointer is still NULL.

This patch simply initializes the clock to use clocksource_jiffies, so
that any early user of clocksource_read will not crash.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: John Stultz <johnstul@us.ibm.com>
---
 include/linux/clocksource.h |    3 +++
 kernel/time/timekeeping.c   |    9 +++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

Index: linux-mcount.git/include/linux/clocksource.h
===================================================================
--- linux-mcount.git.orig/include/linux/clocksource.h	2008-01-25 21:47:09.000000000 -0500
+++ linux-mcount.git/include/linux/clocksource.h	2008-01-25 21:47:11.000000000 -0500
@@ -273,6 +273,9 @@ extern struct clocksource* clocksource_g
 extern void clocksource_change_rating(struct clocksource *cs, int rating);
 extern void clocksource_resume(void);
 
+/* used to initialize clock */
+extern struct clocksource clocksource_jiffies;
+
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL
 extern void update_vsyscall(struct timespec *ts, struct clocksource *c);
 extern void update_vsyscall_tz(void);
Index: linux-mcount.git/kernel/time/timekeeping.c
===================================================================
--- linux-mcount.git.orig/kernel/time/timekeeping.c	2008-01-25 21:47:09.000000000 -0500
+++ linux-mcount.git/kernel/time/timekeeping.c	2008-01-25 21:47:11.000000000 -0500
@@ -53,8 +53,13 @@ static inline void update_xtime_cache(u6
 	timespec_add_ns(&xtime_cache, nsec);
 }
 
-static struct clocksource *clock; /* pointer to current clocksource */
-
+/*
+ * Pointer to the current clocksource, initialized to the jiffies
+ * clock in case clocksource_read is called before the real clock
+ * source is set up. This way we never call through a NULL read
+ * pointer.
+ */
+static struct clocksource *clock = &clocksource_jiffies;
 
 #ifdef CONFIG_GENERIC_TIME
 /**

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 08/22 -v7] add get_monotonic_cycles
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (6 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 07/22 -v7] initialize the clock source to jiffies clock Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 09/22 -v7] add notrace annotations to timing events Steven Rostedt
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: get-monotonic-cycles.patch --]
[-- Type: text/plain, Size: 5342 bytes --]

The latency tracer needs a way to get an accurate time
without grabbing any locks. Locks themselves might call into
the latency tracer and cause, at best, a slowdown.

This patch adds get_monotonic_cycles that returns cycles
from a reliable clock source in a monotonic fashion.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/clocksource.h |   54 +++++++++++++++++++++++++++++++++++++-------
 kernel/time/timekeeping.c   |   26 +++++++++++++++++++--
 2 files changed, 70 insertions(+), 10 deletions(-)

Index: linux-mcount.git/include/linux/clocksource.h
===================================================================
--- linux-mcount.git.orig/include/linux/clocksource.h	2008-01-25 21:47:11.000000000 -0500
+++ linux-mcount.git/include/linux/clocksource.h	2008-01-25 21:47:13.000000000 -0500
@@ -88,8 +88,16 @@ struct clocksource {
 	 */
 	struct {
 		cycle_t cycle_last, cycle_accumulated;
-	} ____cacheline_aligned_in_smp;
 
+		/* base structure provides lock-free read
+		 * access to a virtualized 64bit counter
+		 * Uses RCU-like update.
+		 */
+		struct {
+			cycle_t cycle_base_last, cycle_base;
+		} base[2];
+		int base_num;
+	} ____cacheline_aligned_in_smp;
 	u64 xtime_nsec;
 	s64 error;
 
@@ -175,19 +183,30 @@ static inline cycle_t clocksource_read(s
 }
 
 /**
- * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
  * @cs:		pointer to clocksource being read
  * @now:	current cycle value
  *
  * Uses the clocksource to return the current cycle_t value.
  * NOTE!!!: This is different from clocksource_read, because it
- * returns the accumulated cycle value! Must hold xtime lock!
+ * returns a 64bit wide accumulated value.
  */
 static inline cycle_t
-clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+clocksource_get_basecycles(struct clocksource *cs)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
-	offset += cs->cycle_accumulated;
+	int num;
+	cycle_t now, offset;
+
+	preempt_disable();
+	num = cs->base_num;
+	/* base_num is shared, and some archs are wacky */
+	smp_read_barrier_depends();
+	now = clocksource_read(cs);
+	offset = (now - cs->base[num].cycle_base_last);
+	offset &= cs->mask;
+	offset += cs->base[num].cycle_base;
+	preempt_enable();
+
 	return offset;
 }
 
@@ -197,11 +216,27 @@ clocksource_get_cycles(struct clocksourc
  * @now:	current cycle value
  *
  * Used to avoid clocksource hardware overflow by periodically
- * accumulating the current cycle delta. Must hold xtime write lock!
+ * accumulating the current cycle delta. Uses RCU-like update, but
+ * ***still requires the xtime_lock is held for writing!***
  */
 static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
+	/*
+	 * First update the monotonic base portion.
+	 * The dual array update method allows for lock-free reading.
+	 * 'num' is always 1 or 0.
+	 */
+	int num = 1 - cs->base_num;
+	cycle_t offset = (now - cs->base[1-num].cycle_base_last);
+	offset &= cs->mask;
+	cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset;
+	cs->base[num].cycle_base_last = now;
+	/* make sure this array is visible to the world first */
+	smp_wmb();
+	cs->base_num = num;
+
+	/* Now update the cycle_accumulated portion */
+	offset = (now - cs->cycle_last) & cs->mask;
 	cs->cycle_last = now;
 	cs->cycle_accumulated += offset;
 }
@@ -272,6 +307,9 @@ extern int clocksource_register(struct c
 extern struct clocksource* clocksource_get_next(void);
 extern void clocksource_change_rating(struct clocksource *cs, int rating);
 extern void clocksource_resume(void);
+extern cycle_t get_monotonic_cycles(void);
+extern unsigned long cycles_to_usecs(cycle_t cycles);
+extern cycle_t usecs_to_cycles(unsigned long usecs);
 
 /* used to initialize clock */
 extern struct clocksource clocksource_jiffies;
Index: linux-mcount.git/kernel/time/timekeeping.c
===================================================================
--- linux-mcount.git.orig/kernel/time/timekeeping.c	2008-01-25 21:47:11.000000000 -0500
+++ linux-mcount.git/kernel/time/timekeeping.c	2008-01-25 21:47:13.000000000 -0500
@@ -71,10 +71,12 @@ static struct clocksource *clock = &cloc
  */
 static inline s64 __get_nsec_offset(void)
 {
-	cycle_t cycle_delta;
+	cycle_t now, cycle_delta;
 	s64 ns_offset;
 
-	cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
+	now = clocksource_read(clock);
+	cycle_delta = (now - clock->cycle_last) & clock->mask;
+	cycle_delta += clock->cycle_accumulated;
 	ns_offset = cyc2ns(clock, cycle_delta);
 
 	return ns_offset;
@@ -103,6 +105,26 @@ static inline void __get_realtime_clock_
 	timespec_add_ns(ts, nsecs);
 }
 
+cycle_t notrace get_monotonic_cycles(void)
+{
+	return clocksource_get_basecycles(clock);
+}
+
+unsigned long notrace cycles_to_usecs(cycle_t cycles)
+{
+	u64 ret = cyc2ns(clock, cycles);
+
+	ret += NSEC_PER_USEC/2; /* For rounding in do_div() */
+	do_div(ret, NSEC_PER_USEC);
+
+	return ret;
+}
+
+cycle_t notrace usecs_to_cycles(unsigned long usecs)
+{
+	return ns2cyc(clock, (u64)usecs * 1000);
+}
+
 /**
  * getnstimeofday - Returns the time of day in a timespec
  * @ts:		pointer to the timespec to be set

-- 
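The dual-array ("ping-pong") update used by clocksource_accumulate() and clocksource_get_basecycles() above can be sketched as stand-alone user-space C. This is a single-threaded illustration, not kernel code: the `struct clock_sketch` type and function names are invented here, and the real code's smp_wmb() / smp_read_barrier_depends() barriers are only noted in comments.

```c
#include <stdint.h>

typedef uint64_t cycle_t;

struct base_slot {
	cycle_t cycle_base;      /* accumulated cycles so far */
	cycle_t cycle_base_last; /* raw counter value at last accumulation */
};

struct clock_sketch {
	cycle_t mask;            /* counter width mask, e.g. 0xff for 8 bits */
	int base_num;            /* which slot readers should currently use */
	struct base_slot base[2];
};

/* Writer side: must be serialized (xtime_lock in the real code).
 * Fill the inactive slot completely, then flip base_num so readers
 * switch to the new slot atomically. */
static void accumulate(struct clock_sketch *cs, cycle_t now)
{
	int num = 1 - cs->base_num;
	cycle_t offset = (now - cs->base[1 - num].cycle_base_last) & cs->mask;

	cs->base[num].cycle_base = cs->base[1 - num].cycle_base + offset;
	cs->base[num].cycle_base_last = now;
	/* real code: smp_wmb() here, before publishing the new slot */
	cs->base_num = num;
}

/* Reader side: lock-free; a concurrent accumulate() only ever touches
 * the slot readers are not using until the base_num flip. */
static cycle_t get_basecycles(struct clock_sketch *cs, cycle_t now)
{
	int num = cs->base_num; /* real code: smp_read_barrier_depends() */
	cycle_t offset = (now - cs->base[num].cycle_base_last) & cs->mask;

	return cs->base[num].cycle_base + offset;
}
```

Note how the masked subtraction keeps the result correct even after the raw counter wraps, as long as accumulate() runs at least once per wrap period.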

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 09/22 -v7] add notrace annotations to timing events
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (7 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 08/22 -v7] add get_monotonic_cycles Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 10/22 -v7] mcount based trace in the form of a header file library Steven Rostedt
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: mcount-add-time-notrace-annotations.patch --]
[-- Type: text/plain, Size: 5179 bytes --]

This patch adds notrace annotations to timer functions
that are called by the tracing code itself. This avoids needless
overhead and keeps these functions from cluttering the trace output.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/x86/kernel/apic_32.c     |    2 +-
 arch/x86/kernel/hpet.c        |    2 +-
 arch/x86/kernel/time_32.c     |    2 +-
 arch/x86/kernel/tsc_32.c      |    2 +-
 arch/x86/kernel/tsc_64.c      |    4 ++--
 arch/x86/lib/delay_32.c       |    6 +++---
 drivers/clocksource/acpi_pm.c |    8 ++++----
 7 files changed, 13 insertions(+), 13 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/apic_32.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/apic_32.c	2008-01-29 11:35:35.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/apic_32.c	2008-01-29 11:49:47.000000000 -0500
@@ -577,7 +577,7 @@ static void local_apic_timer_interrupt(v
  *   interrupt as well. Thus we cannot inline the local irq ... ]
  */
 
-void fastcall smp_apic_timer_interrupt(struct pt_regs *regs)
+notrace fastcall void smp_apic_timer_interrupt(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
Index: linux-mcount.git/arch/x86/kernel/hpet.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/hpet.c	2008-01-29 11:35:35.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/hpet.c	2008-01-29 11:49:47.000000000 -0500
@@ -295,7 +295,7 @@ static int hpet_legacy_next_event(unsign
 /*
  * Clock source related code
  */
-static cycle_t read_hpet(void)
+static notrace cycle_t read_hpet(void)
 {
 	return (cycle_t)hpet_readl(HPET_COUNTER);
 }
Index: linux-mcount.git/arch/x86/kernel/time_32.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/time_32.c	2008-01-29 11:35:35.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/time_32.c	2008-01-29 11:49:47.000000000 -0500
@@ -122,7 +122,7 @@ static int set_rtc_mmss(unsigned long no
 
 int timer_ack;
 
-unsigned long profile_pc(struct pt_regs *regs)
+notrace unsigned long profile_pc(struct pt_regs *regs)
 {
 	unsigned long pc = instruction_pointer(regs);
 
Index: linux-mcount.git/arch/x86/kernel/tsc_32.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/tsc_32.c	2008-01-29 11:35:35.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/tsc_32.c	2008-01-29 11:49:47.000000000 -0500
@@ -269,7 +269,7 @@ core_initcall(cpufreq_tsc);
 
 static unsigned long current_tsc_khz = 0;
 
-static cycle_t read_tsc(void)
+static notrace cycle_t read_tsc(void)
 {
 	cycle_t ret;
 
Index: linux-mcount.git/arch/x86/kernel/tsc_64.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/tsc_64.c	2008-01-29 11:35:35.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/tsc_64.c	2008-01-29 11:49:47.000000000 -0500
@@ -248,13 +248,13 @@ __setup("notsc", notsc_setup);
 
 
 /* clock source code: */
-static cycle_t read_tsc(void)
+static notrace cycle_t read_tsc(void)
 {
 	cycle_t ret = (cycle_t)get_cycles_sync();
 	return ret;
 }
 
-static cycle_t __vsyscall_fn vread_tsc(void)
+static notrace cycle_t __vsyscall_fn vread_tsc(void)
 {
 	cycle_t ret = (cycle_t)get_cycles_sync();
 	return ret;
Index: linux-mcount.git/arch/x86/lib/delay_32.c
===================================================================
--- linux-mcount.git.orig/arch/x86/lib/delay_32.c	2008-01-29 11:35:35.000000000 -0500
+++ linux-mcount.git/arch/x86/lib/delay_32.c	2008-01-29 11:49:47.000000000 -0500
@@ -24,7 +24,7 @@
 #endif
 
 /* simple loop based delay: */
-static void delay_loop(unsigned long loops)
+static notrace void delay_loop(unsigned long loops)
 {
 	int d0;
 
@@ -39,7 +39,7 @@ static void delay_loop(unsigned long loo
 }
 
 /* TSC based delay: */
-static void delay_tsc(unsigned long loops)
+static notrace void delay_tsc(unsigned long loops)
 {
 	unsigned long bclock, now;
 
@@ -72,7 +72,7 @@ int read_current_timer(unsigned long *ti
 	return -1;
 }
 
-void __delay(unsigned long loops)
+notrace void __delay(unsigned long loops)
 {
 	delay_fn(loops);
 }
Index: linux-mcount.git/drivers/clocksource/acpi_pm.c
===================================================================
--- linux-mcount.git.orig/drivers/clocksource/acpi_pm.c	2008-01-29 11:35:35.000000000 -0500
+++ linux-mcount.git/drivers/clocksource/acpi_pm.c	2008-01-29 11:49:47.000000000 -0500
@@ -30,13 +30,13 @@
  */
 u32 pmtmr_ioport __read_mostly;
 
-static inline u32 read_pmtmr(void)
+static inline notrace u32 read_pmtmr(void)
 {
 	/* mask the output to 24 bits */
 	return inl(pmtmr_ioport) & ACPI_PM_MASK;
 }
 
-u32 acpi_pm_read_verified(void)
+notrace u32 acpi_pm_read_verified(void)
 {
 	u32 v1 = 0, v2 = 0, v3 = 0;
 
@@ -56,12 +56,12 @@ u32 acpi_pm_read_verified(void)
 	return v2;
 }
 
-static cycle_t acpi_pm_read_slow(void)
+static notrace cycle_t acpi_pm_read_slow(void)
 {
 	return (cycle_t)acpi_pm_read_verified();
 }
 
-static cycle_t acpi_pm_read(void)
+static notrace cycle_t acpi_pm_read(void)
 {
 	return (cycle_t)read_pmtmr();
 }

-- 
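As a rough illustration of what the `notrace` annotation used throughout this patch does: in this series it expands (approximately) to GCC's no_instrument_function attribute, so the compiler emits no profiling call into annotated functions and the tracer's own timing helpers cannot recurse back into the tracer. The sketch below is hypothetical stand-alone code, not from the patch; `read_fake_counter` is an invented stand-in for a clocksource read routine.

```c
/* Approximate expansion of the kernel's notrace annotation; the exact
 * definition lives in an earlier patch of this series. */
#define notrace __attribute__((no_instrument_function))

/* A counter-read style helper that must never be instrumented, since
 * the tracer calls functions like this to timestamp every event. */
static notrace unsigned long long read_fake_counter(void)
{
	static unsigned long long ticks;

	return ++ticks; /* stand-in for reading TSC/HPET/ACPI PM timer */
}
```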

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 10/22 -v7] mcount based trace in the form of a header file library
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (8 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 09/22 -v7] add notrace annotations to timing events Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 11/22 -v7] Add context switch marker to sched.c Steven Rostedt
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: mcount-function-tracer.patch --]
[-- Type: text/plain, Size: 38267 bytes --]

This is a simple tracer that uses the mcount infrastructure. It is
designed to be fast, small, and easy to use. It is meant for
recording events that happen over a very short period of time, not
for analyzing the system in general.

An interface is added to debugfs at

  /debugfs/tracing/

This patch adds the following files:

  available_tracers
     list of available tracers. Currently only "function" is
     available.

  current_tracer
     The tracer that is currently active. Empty on startup.
     To switch to a tracer simply echo one of the tracers that
     are listed in available_tracers:

      echo function > /debugfs/tracing/current_tracer

     To disable the tracer:

       echo disable > /debugfs/tracing/current_tracer


  trace_ctrl
     echoing "1" into this file starts the mcount function tracing
      (if sysctl kernel.mcount_enabled=1)
     echoing "0" turns it off.

  latency_trace
      This file is readonly and holds the result of the trace.

  trace
      This file outputs an easier-to-read version of the trace.

  iter_ctrl
      Controls the way the trace output looks.
      So far there are two controls:
        echoing in "symonly" will only show the kallsyms variables
            without the addresses (if kallsyms was configured)
        echoing in "verbose" will change the output to show
            a lot more data, in a form that is harder for
            humans to read.
        echoing in "nosymonly" turns off symonly.
        echoing in "noverbose" turns off verbose.

The output of the function_trace file is as follows

  "echo noverbose > /debugfs/tracing/iter_ctrl"

preemption latency trace v1.1.5 on 2.6.24-rc7-tst
--------------------------------------------------------------------
 latency: 0 us, #419428/4361791, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
    -----------------
    | task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
 swapper-0     0d.h. 1595128us+: set_normalized_timespec+0x8/0x2d <c043841d> (ktime_get_ts+0x4a/0x4e <c04499d4>)
 swapper-0     0d.h. 1595131us+: _spin_lock+0x8/0x18 <c0630690> (hrtimer_interrupt+0x6e/0x1b0 <c0449c56>)
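The five-character field after the pid (e.g. "0d.h." above) packs the CPU number and the flag columns explained in the header. A hypothetical stand-alone decoder, mirroring what lat_print_generic() in this patch emits (the `lat_field` helper and its buffer layout are invented for illustration; flag values match the patch's enum trace_flag_type):

```c
#include <string.h>

#define TRACE_FLAG_IRQS_OFF	0x01
#define TRACE_FLAG_NEED_RESCHED	0x02
#define TRACE_FLAG_HARDIRQ	0x04
#define TRACE_FLAG_SOFTIRQ	0x08

/* Build the 5-char latency field: CPU digit, irqs-off, need-resched,
 * hardirq/softirq, preempt depth. Assumes cpu < 10 for brevity. */
static void lat_field(char *buf, int cpu, unsigned flags, unsigned preempt)
{
	int hardirq = flags & TRACE_FLAG_HARDIRQ;
	int softirq = flags & TRACE_FLAG_SOFTIRQ;

	buf[0] = '0' + cpu;
	buf[1] = (flags & TRACE_FLAG_IRQS_OFF) ? 'd' : '.';
	buf[2] = (flags & TRACE_FLAG_NEED_RESCHED) ? 'N' : '.';
	buf[3] = (hardirq && softirq) ? 'H' :
		 hardirq ? 'h' : softirq ? 's' : '.';
	buf[4] = preempt ? "0123456789abcdef"[preempt & 0xf] : '.';
	buf[5] = '\0';
}
```

So "0d.h." reads as: CPU 0, interrupts disabled, no resched pending, in hardirq context, preempt depth 0.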

Or with verbose turned on:

  "echo verbose > /debugfs/tracing/iter_ctrl"

preemption latency trace v1.1.5 on 2.6.24-rc7-tst
--------------------------------------------------------------------
 latency: 0 us, #419428/4361791, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
    -----------------
    | task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------

         swapper     0 0 9 00000000 00000000 [f3675f41] 1595.128ms (+0.003ms): set_normalized_timespec+0x8/0x2d <c043841d> (ktime_get_ts+0x4a/0x4e <c04499d4>)
         swapper     0 0 9 00000000 00000001 [f3675f45] 1595.131ms (+0.003ms): _spin_lock+0x8/0x18 <c0630690> (hrtimer_interrupt+0x6e/0x1b0 <c0449c56>)
         swapper     0 0 9 00000000 00000002 [f3675f48] 1595.135ms (+0.003ms): _spin_lock+0x8/0x18 <c0630690> (hrtimer_interrupt+0x6e/0x1b0 <c0449c56>)


The "trace" file is not affected by verbose mode, but it is affected by symonly.

 echo "nosymonly" > /debugfs/tracing/iter_ctrl

tracer:
[   81.479967] CPU 0: bash:3154 register_mcount_function+0x5f/0x66 <ffffffff80337a4d> <-- _spin_unlock_irqrestore+0xe/0x5a <ffffffff8048cc8f>
[   81.479967] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a <ffffffff8048ccbf> <-- sub_preempt_count+0xc/0x7a <ffffffff80233d7b>
[   81.479968] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a <ffffffff80233d9f> <-- in_lock_functions+0x9/0x24 <ffffffff8025a75d>
[   81.479968] CPU 0: bash:3154 vfs_write+0x11d/0x155 <ffffffff8029a043> <-- dnotify_parent+0x12/0x78 <ffffffff802d54fb>
[   81.479968] CPU 0: bash:3154 dnotify_parent+0x2d/0x78 <ffffffff802d5516> <-- _spin_lock+0xe/0x70 <ffffffff8048c910>
[   81.479969] CPU 0: bash:3154 _spin_lock+0x1b/0x70 <ffffffff8048c91d> <-- add_preempt_count+0xe/0x77 <ffffffff80233df7>
[   81.479969] CPU 0: bash:3154 add_preempt_count+0x3e/0x77 <ffffffff80233e27> <-- in_lock_functions+0x9/0x24 <ffffffff8025a75d>


 echo "symonly" > /debugfs/tracing/iter_ctrl

tracer:
[   81.479913] CPU 0: bash:3154 register_mcount_function+0x5f/0x66 <-- _spin_unlock_irqrestore+0xe/0x5a
[   81.479913] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a <-- sub_preempt_count+0xc/0x7a
[   81.479913] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a <-- in_lock_functions+0x9/0x24
[   81.479914] CPU 0: bash:3154 vfs_write+0x11d/0x155 <-- dnotify_parent+0x12/0x78
[   81.479914] CPU 0: bash:3154 dnotify_parent+0x2d/0x78 <-- _spin_lock+0xe/0x70
[   81.479914] CPU 0: bash:3154 _spin_lock+0x1b/0x70 <-- add_preempt_count+0xe/0x77
[   81.479914] CPU 0: bash:3154 add_preempt_count+0x3e/0x77 <-- in_lock_functions+0x9/0x24


Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
---
 lib/Makefile                 |    1 
 lib/tracing/Kconfig          |   15 
 lib/tracing/Makefile         |    3 
 lib/tracing/trace_function.c |   72 ++
 lib/tracing/tracer.c         | 1160 +++++++++++++++++++++++++++++++++++++++++++
 lib/tracing/tracer.h         |   96 +++
 6 files changed, 1347 insertions(+)

Index: linux-mcount.git/lib/tracing/Kconfig
===================================================================
--- linux-mcount.git.orig/lib/tracing/Kconfig	2008-01-29 17:26:18.000000000 -0500
+++ linux-mcount.git/lib/tracing/Kconfig	2008-01-29 18:06:25.000000000 -0500
@@ -8,3 +8,18 @@ config HAVE_MCOUNT
 config MCOUNT
 	bool
 	select FRAME_POINTER
+
+config TRACING
+        bool
+	select DEBUG_FS
+
+config FUNCTION_TRACER
+	bool "Profiler instrumentation based tracer"
+	depends on DEBUG_KERNEL && HAVE_MCOUNT
+	select MCOUNT
+	select TRACING
+	help
+	  Use profiler instrumentation, adding -pg to CFLAGS. This will
+	  insert a call to an architecture-specific __mcount routine
+	  that the debugging mechanism using this facility will hook by
+	  providing a set of inline routines.
Index: linux-mcount.git/lib/tracing/Makefile
===================================================================
--- linux-mcount.git.orig/lib/tracing/Makefile	2008-01-29 17:26:18.000000000 -0500
+++ linux-mcount.git/lib/tracing/Makefile	2008-01-29 18:06:25.000000000 -0500
@@ -1,3 +1,6 @@
 obj-$(CONFIG_MCOUNT) += libmcount.o
 
+obj-$(CONFIG_TRACING) += tracer.o
+obj-$(CONFIG_FUNCTION_TRACER) += trace_function.o
+
 libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/tracer.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git/lib/tracing/tracer.c	2008-01-29 18:07:38.000000000 -0500
@@ -0,0 +1,1160 @@
+/*
+ * ring buffer based mcount tracer
+ *
+ * Copyright (C) 2007 Steven Rostedt <srostedt@redhat.com>
+ *
+ * Originally taken from the RT patch by:
+ *    Arnaldo Carvalho de Melo <acme@redhat.com>
+ *
+ * Based on code from the latency_tracer, that is:
+ *  Copyright (C) 2004-2006 Ingo Molnar
+ *  Copyright (C) 2004 William Lee Irwin III
+ */
+
+#include <linux/fs.h>
+#include <linux/gfp.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/linkage.h>
+#include <linux/seq_file.h>
+#include <linux/percpu.h>
+#include <linux/ctype.h>
+#include <linux/debugfs.h>
+#include <linux/kallsyms.h>
+#include <linux/utsrelease.h>
+#include <linux/uaccess.h>
+#include <linux/hardirq.h>
+#include <linux/mcount.h>
+
+#include "tracer.h"
+
+static struct tracing_trace tracer_trace __read_mostly;
+static DEFINE_PER_CPU(struct tracing_trace_cpu, tracer_trace_cpu);
+static int trace_enabled __read_mostly;
+static unsigned long trace_nr_entries = (65536UL);
+
+static struct trace_types_struct *trace_types __read_mostly;
+static struct trace_types_struct *current_trace __read_mostly;
+static int max_tracer_type_len;
+
+static DEFINE_MUTEX(trace_types_lock);
+
+static int __init set_nr_entries(char *str)
+{
+	if (!str)
+		return 0;
+	trace_nr_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+__setup("trace_entries=", set_nr_entries);
+
+enum trace_type {
+	__TRACE_FIRST_TYPE = 0,
+
+	TRACE_FN,
+
+	__TRACE_LAST_TYPE
+};
+
+enum trace_flag_type {
+	TRACE_FLAG_IRQS_OFF		= 0x01,
+	TRACE_FLAG_NEED_RESCHED		= 0x02,
+	TRACE_FLAG_HARDIRQ		= 0x04,
+	TRACE_FLAG_SOFTIRQ		= 0x08,
+};
+
+int register_trace(struct trace_types_struct *type)
+{
+	struct trace_types_struct *t;
+	int len;
+	int ret = 0;
+
+	if (!type->name) {
+		pr_info("Tracer must have a name\n");
+		return -1;
+	}
+
+	mutex_lock(&trace_types_lock);
+	for (t = trace_types; t; t = t->next) {
+		if (strcmp(type->name, t->name) == 0) {
+			/* already found */
+			pr_info("Trace %s already registered\n",
+				type->name);
+			ret = -1;
+			goto out;
+		}
+	}
+
+	type->next = trace_types;
+	trace_types = type;
+	len = strlen(type->name);
+	if (len > max_tracer_type_len)
+		max_tracer_type_len = len;
+ out:
+	mutex_unlock(&trace_types_lock);
+
+	return ret;
+}
+
+void unregister_trace(struct trace_types_struct *type)
+{
+	struct trace_types_struct **t;
+	int len;
+
+	mutex_lock(&trace_types_lock);
+	for (t = &trace_types; *t; t = &(*t)->next) {
+		if (*t == type)
+			goto found;
+	}
+	pr_info("Trace %s not registered\n", type->name);
+	goto out;
+
+ found:
+	*t = (*t)->next;
+	if (strlen(type->name) != max_tracer_type_len)
+		goto out;
+
+	max_tracer_type_len = 0;
+	for (t = &trace_types; *t; t = &(*t)->next) {
+		len = strlen((*t)->name);
+		if (len > max_tracer_type_len)
+			max_tracer_type_len = len;
+	}
+ out:
+	mutex_unlock(&trace_types_lock);
+}
+
+void notrace tracing_reset(struct tracing_trace_cpu *data)
+{
+	data->trace_idx = 0;
+	atomic_set(&data->underrun, 0);
+}
+
+#ifdef CONFIG_MCOUNT
+static void notrace function_trace_call(unsigned long ip,
+					unsigned long parent_ip)
+{
+	struct tracing_trace *tr = &tracer_trace;
+	struct tracing_trace_cpu *data;
+	unsigned long flags;
+	int cpu;
+
+	if (unlikely(!trace_enabled))
+		return;
+
+	raw_local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+	data = tr->data[cpu];
+	atomic_inc(&data->disabled);
+
+	if (likely(atomic_read(&data->disabled) == 1))
+		tracing_function_trace(tr, data, ip, parent_ip, flags);
+
+	atomic_dec(&data->disabled);
+	raw_local_irq_restore(flags);
+}
+
+static struct mcount_ops trace_ops __read_mostly =
+{
+	.func = function_trace_call,
+};
+#endif
+
+void tracing_start_function_trace(void)
+{
+	register_mcount_function(&trace_ops);
+}
+
+void tracing_stop_function_trace(void)
+{
+	unregister_mcount_function(&trace_ops);
+}
+
+static inline notrace struct tracing_entry *
+tracing_get_trace_entry(struct tracing_trace *tr,
+			struct tracing_trace_cpu *data)
+{
+	unsigned long idx, idx_next;
+	struct tracing_entry *entry;
+
+	idx = data->trace_idx;
+	idx_next = idx + 1;
+
+	if (unlikely(idx_next >= tr->entries)) {
+		atomic_inc(&data->underrun);
+		idx_next = 0;
+	}
+
+	data->trace_idx = idx_next;
+
+	if (unlikely(idx_next != 0 && atomic_read(&data->underrun)))
+		atomic_inc(&data->underrun);
+
+	entry = data->trace + idx * TRACING_ENTRY_SIZE;
+
+	return entry;
+}
+
+static inline notrace void
+tracing_generic_entry_update(struct tracing_entry *entry,
+			     unsigned long flags)
+{
+	struct task_struct *tsk = current;
+	unsigned long pc;
+
+	pc = preempt_count();
+
+	entry->preempt_count = pc & 0xff;
+	entry->pid	 = tsk->pid;
+	entry->t	 = now();
+	entry->flags = (irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) |
+		((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |
+		((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ : 0) |
+		(need_resched() ? TRACE_FLAG_NEED_RESCHED : 0);
+	memcpy(entry->comm, tsk->comm, TASK_COMM_LEN);
+}
+
+notrace void tracing_function_trace(struct tracing_trace *tr,
+				    struct tracing_trace_cpu *data,
+				    unsigned long ip,
+				    unsigned long parent_ip,
+				    unsigned long flags)
+{
+	struct tracing_entry *entry;
+
+	entry = tracing_get_trace_entry(tr, data);
+	tracing_generic_entry_update(entry, flags);
+	entry->type	    = TRACE_FN;
+	entry->fn.ip	    = ip;
+	entry->fn.parent_ip = parent_ip;
+}
+
+enum trace_iterator {
+	TRACE_ITER_SYM_ONLY	= 1,
+	TRACE_ITER_VERBOSE	= 2,
+};
+
+/* These must match the bit positions above */
+static const char *trace_options[] = {
+	"symonly",
+	"verbose",
+	NULL
+};
+
+static unsigned trace_flags;
+
+enum trace_file_type {
+	TRACE_FILE_LAT_FMT	= 1,
+};
+
+static struct tracing_entry *tracing_entry_idx(struct tracing_trace *tr,
+					       unsigned long idx,
+					       int cpu)
+{
+	struct tracing_entry *array = tr->data[cpu]->trace;
+	unsigned long underrun;
+
+	if (idx >= tr->entries)
+		return NULL;
+
+	underrun = atomic_read(&tr->data[cpu]->underrun);
+	if (underrun)
+		idx = ((underrun - 1) + idx) % tr->entries;
+	else if (idx >= tr->data[cpu]->trace_idx)
+		return NULL;
+
+	return &array[idx];
+}
+
+static notrace struct tracing_entry *
+find_next_entry(struct tracing_iterator *iter, int *ent_cpu)
+{
+	struct tracing_trace *tr = iter->tr;
+	struct tracing_entry *ent, *next = NULL;
+	int next_cpu = -1;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		if (!tr->data[cpu]->trace)
+			continue;
+		ent = tracing_entry_idx(tr, iter->next_idx[cpu], cpu);
+		if (ent && (!next || next->t > ent->t)) {
+			next = ent;
+			next_cpu = cpu;
+		}
+	}
+
+	if (ent_cpu)
+		*ent_cpu = next_cpu;
+
+	return next;
+}
+
+static void *find_next_entry_inc(struct tracing_iterator *iter)
+{
+	struct tracing_entry *next;
+	int next_cpu = -1;
+
+	next = find_next_entry(iter, &next_cpu);
+
+	if (next) {
+		iter->next_idx[next_cpu]++;
+		iter->idx++;
+	}
+	iter->ent = next;
+	iter->cpu = next_cpu;
+
+	return next ? iter : NULL;
+}
+
+static void notrace *
+s_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct tracing_iterator *iter = m->private;
+	void *ent;
+	void *last_ent = iter->ent;
+	int i = (int)*pos;
+
+	(*pos)++;
+
+	/* can't go backwards */
+	if (iter->idx > i)
+		return NULL;
+
+	if (iter->idx < 0)
+		ent = find_next_entry_inc(iter);
+	else
+		ent = iter;
+
+	while (ent && iter->idx < i)
+		ent = find_next_entry_inc(iter);
+
+	iter->pos = *pos;
+
+	if (last_ent && !ent)
+		seq_puts(m, "\n\nvim:ft=help\n");
+
+	return ent;
+}
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+	struct tracing_iterator *iter = m->private;
+	void *p = NULL;
+	loff_t l = 0;
+	int i;
+
+	mutex_lock(&trace_types_lock);
+
+	if (!current_trace || current_trace != iter->trace)
+		return NULL;
+
+	/* let the tracer grab locks here if needed */
+	if (current_trace->start)
+		current_trace->start(iter);
+
+	if (*pos != iter->pos) {
+		iter->ent = NULL;
+		iter->cpu = 0;
+		iter->idx = -1;
+
+		for (i = 0; i < NR_CPUS; i++)
+			iter->next_idx[i] = 0;
+
+		for (p = iter; p && l < *pos; p = s_next(m, p, &l))
+			;
+
+	} else {
+		l = *pos;
+		p = s_next(m, p, &l);
+	}
+
+	return p;
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+	struct tracing_iterator *iter = m->private;
+
+	/* let the tracer release locks here if needed */
+	if (current_trace && current_trace == iter->trace && iter->trace->stop)
+		iter->trace->stop(iter);
+
+	mutex_unlock(&trace_types_lock);
+}
+
+#ifdef CONFIG_KALLSYMS
+static void seq_print_symbol(struct seq_file *m,
+			     const char *fmt, unsigned long address)
+{
+	char buffer[KSYM_SYMBOL_LEN];
+
+	sprint_symbol(buffer, address);
+	seq_printf(m, fmt, buffer);
+}
+#else
+# define seq_print_symbol(m, fmt, address) do { } while (0)
+#endif
+
+#ifndef CONFIG_64BIT
+# define IP_FMT "%08lx"
+#else
+# define IP_FMT "%016lx"
+#endif
+
+static void notrace seq_print_ip_sym(struct seq_file *m,
+				     unsigned long ip, int sym_only)
+{
+	if (!ip) {
+		seq_printf(m, "0");
+		return;
+	}
+
+	seq_print_symbol(m, "%s", ip);
+	if (!sym_only)
+		seq_printf(m, " <" IP_FMT ">", ip);
+}
+
+static void notrace print_help_header(struct seq_file *m)
+{
+	seq_puts(m, "                 _------=> CPU#            \n");
+	seq_puts(m, "                / _-----=> irqs-off        \n");
+	seq_puts(m, "               | / _----=> need-resched    \n");
+	seq_puts(m, "               || / _---=> hardirq/softirq \n");
+	seq_puts(m, "               ||| / _--=> preempt-depth   \n");
+	seq_puts(m, "               |||| /                      \n");
+	seq_puts(m, "               |||||     delay             \n");
+	seq_puts(m, "   cmd     pid ||||| time  |   caller      \n");
+	seq_puts(m, "      \\   /    |||||   \\   |   /           \n");
+}
+
+static void notrace print_trace_header(struct seq_file *m,
+				       struct tracing_iterator *iter)
+{
+	struct tracing_trace *tr = iter->tr;
+	struct tracing_trace_cpu *data = tr->data[tr->cpu];
+	struct trace_types_struct *type = current_trace;
+	unsigned long underruns = 0;
+	unsigned long underrun;
+	unsigned long entries   = 0;
+	int sym_only = !!(trace_flags & TRACE_ITER_SYM_ONLY);
+	int cpu;
+	const char *name = "preemption";
+
+	if (type)
+		name = type->name;
+
+	for_each_possible_cpu(cpu) {
+		if (tr->data[cpu]->trace) {
+			underrun = atomic_read(&tr->data[cpu]->underrun);
+			if (underrun) {
+				underruns += underrun;
+				entries += tr->entries;
+			} else
+				entries += tr->data[cpu]->trace_idx;
+		}
+	}
+
+	seq_printf(m, "%s latency trace v1.1.5 on %s\n",
+		   name, UTS_RELEASE);
+	seq_puts(m, "-----------------------------------"
+		 "---------------------------------\n");
+	seq_printf(m, " latency: %lu us, #%lu/%lu, CPU#%d |"
+		   " (M:%s VP:%d, KP:%d, SP:%d HP:%d",
+		   cycles_to_usecs(data->saved_latency),
+		   entries,
+		   (entries + underruns),
+		   tr->cpu,
+#if defined(CONFIG_PREEMPT_NONE)
+		   "server",
+#elif defined(CONFIG_PREEMPT_VOLUNTARY)
+		   "desktop",
+#elif defined(CONFIG_PREEMPT_DESKTOP)
+		   "preempt",
+#else
+		   "unknown",
+#endif
+		   /* These are reserved for later use */
+		   0, 0, 0, 0);
+#ifdef CONFIG_SMP
+	seq_printf(m, " #P:%d)\n", num_online_cpus());
+#else
+	seq_puts(m, ")\n");
+#endif
+	seq_puts(m, "    -----------------\n");
+	seq_printf(m, "    | task: %.16s-%d "
+		   "(uid:%d nice:%ld policy:%ld rt_prio:%ld)\n",
+		   data->comm, data->pid, data->uid, data->nice,
+		   data->policy, data->rt_priority);
+	seq_puts(m, "    -----------------\n");
+
+	if (data->critical_start) {
+		seq_puts(m, " => started at: ");
+		seq_print_ip_sym(m, data->critical_start, sym_only);
+		seq_puts(m, "\n => ended at:   ");
+		seq_print_ip_sym(m, data->critical_end, sym_only);
+		seq_puts(m, "\n");
+	}
+
+	seq_puts(m, "\n");
+}
+
+
+static void notrace
+lat_print_generic(struct seq_file *m, struct tracing_entry *entry, int cpu)
+{
+	int hardirq, softirq;
+
+	seq_printf(m, "%8.8s-%-5d ", entry->comm, entry->pid);
+	seq_printf(m, "%d", cpu);
+	seq_printf(m, "%c%c",
+		   (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' : '.',
+		   ((entry->flags & TRACE_FLAG_NEED_RESCHED) ? 'N' : '.'));
+
+	hardirq = entry->flags & TRACE_FLAG_HARDIRQ;
+	softirq = entry->flags & TRACE_FLAG_SOFTIRQ;
+	if (hardirq && softirq)
+		seq_putc(m, 'H');
+	else {
+		if (hardirq)
+			seq_putc(m, 'h');
+		else {
+			if (softirq)
+				seq_putc(m, 's');
+			else
+				seq_putc(m, '.');
+		}
+	}
+
+	if (entry->preempt_count)
+		seq_printf(m, "%x", entry->preempt_count);
+	else
+		seq_puts(m, ".");
+}
+
+unsigned long preempt_mark_thresh = 100;
+
+static void notrace
+lat_print_timestamp(struct seq_file *m, unsigned long long abs_usecs,
+		    unsigned long rel_usecs)
+{
+	seq_printf(m, " %4lldus", abs_usecs);
+	if (rel_usecs > preempt_mark_thresh)
+		seq_puts(m, "!: ");
+	else if (rel_usecs > 1)
+		seq_puts(m, "+: ");
+	else
+		seq_puts(m, " : ");
+}
+
+static void notrace
+print_lat_fmt(struct seq_file *m, struct tracing_iterator *iter,
+	      unsigned int trace_idx, int cpu)
+{
+	struct tracing_entry *entry = iter->ent;
+	struct tracing_entry *next_entry = find_next_entry(iter, NULL);
+	unsigned long abs_usecs;
+	unsigned long rel_usecs;
+	int sym_only = !!(trace_flags & TRACE_ITER_SYM_ONLY);
+	int verbose = !!(trace_flags & TRACE_ITER_VERBOSE);
+
+	if (!next_entry)
+		next_entry = entry;
+	rel_usecs = cycles_to_usecs(next_entry->t - entry->t);
+	abs_usecs = cycles_to_usecs(entry->t - iter->tr->time_start);
+
+	if (verbose) {
+		seq_printf(m, "%16s %5d %d %d %08x %08x [%08lx]"
+			   " %ld.%03ldms (+%ld.%03ldms): ",
+			   entry->comm,
+			   entry->pid, cpu, entry->flags,
+			   entry->preempt_count, trace_idx,
+			   cycles_to_usecs(entry->t),
+			   abs_usecs/1000,
+			   abs_usecs % 1000, rel_usecs/1000, rel_usecs % 1000);
+	} else {
+		lat_print_generic(m, entry, cpu);
+		lat_print_timestamp(m, abs_usecs, rel_usecs);
+	}
+	switch (entry->type) {
+	case TRACE_FN:
+		seq_print_ip_sym(m, entry->fn.ip, sym_only);
+		seq_puts(m, " (");
+		seq_print_ip_sym(m, entry->fn.parent_ip, sym_only);
+		seq_puts(m, ")\n");
+		break;
+	}
+}
+
+static void notrace print_trace_fmt(struct seq_file *m,
+				    struct tracing_iterator *iter)
+{
+	struct tracing_entry *entry = iter->ent;
+	unsigned long usec_rem;
+	unsigned long secs;
+	int sym_only = !!(trace_flags & TRACE_ITER_SYM_ONLY);
+	unsigned long long t;
+
+	t = cycles_to_usecs(entry->t);
+	usec_rem = do_div(t, 1000000ULL);
+	secs = (unsigned long)t;
+
+	seq_printf(m, "[%5lu.%06lu] ", secs, usec_rem);
+	seq_printf(m, "CPU %d: ", iter->cpu);
+	seq_printf(m, "%s:%d ", entry->comm,
+		   entry->pid);
+	switch (entry->type) {
+	case TRACE_FN:
+		seq_print_ip_sym(m, entry->fn.ip, sym_only);
+		if (entry->fn.parent_ip) {
+			seq_printf(m, " <-- ");
+			seq_print_ip_sym(m, entry->fn.parent_ip,
+					 sym_only);
+		}
+		break;
+	}
+	seq_printf(m, "\n");
+}
+
+static int trace_empty(struct tracing_iterator *iter)
+{
+	struct tracing_trace_cpu *data;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		data = iter->tr->data[cpu];
+
+		if (data->trace &&
+		    (data->trace_idx ||
+		     atomic_read(&data->underrun)))
+			return 0;
+	}
+	return 1;
+}
+
+static int s_show(struct seq_file *m, void *v)
+{
+	struct tracing_iterator *iter = v;
+
+	if (iter->ent == NULL) {
+		if (iter->iter_flags & TRACE_FILE_LAT_FMT) {
+			/* print nothing if the buffers are empty */
+			if (trace_empty(iter))
+				return 0;
+			print_trace_header(m, iter);
+			if (!(trace_flags & TRACE_ITER_VERBOSE))
+				print_help_header(m);
+		} else
+			seq_printf(m, "tracer:\n");
+	} else {
+		if (iter->iter_flags & TRACE_FILE_LAT_FMT)
+			print_lat_fmt(m, iter, iter->idx, iter->cpu);
+		else
+			print_trace_fmt(m, iter);
+	}
+
+	return 0;
+}
+
+static struct seq_operations tracer_seq_ops = {
+	.start = s_start,
+	.next = s_next,
+	.stop = s_stop,
+	.show = s_show,
+};
+
+static struct tracing_iterator notrace *
+__tracing_open(struct inode *inode, struct file *file, int *ret)
+{
+	struct tracing_iterator *iter;
+
+	iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter) {
+		*ret = -ENOMEM;
+		goto out;
+	}
+
+	iter->tr = inode->i_private;
+	iter->trace = current_trace;
+	iter->pos = -1;
+
+	/* TODO stop tracer */
+	*ret = seq_open(file, &tracer_seq_ops);
+	if (!*ret) {
+		struct seq_file *m = file->private_data;
+		m->private = iter;
+
+		/* stop the trace while dumping */
+		if (iter->tr->ctrl)
+			trace_enabled = 0;
+
+		if (iter->trace && iter->trace->open)
+			iter->trace->open(iter);
+	} else {
+		kfree(iter);
+		iter = NULL;
+	}
+
+ out:
+	return iter;
+}
+
+int tracing_open_generic(struct inode *inode, struct file *filp)
+{
+	filp->private_data = inode->i_private;
+	return 0;
+}
+
+int tracing_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *m = (struct seq_file *)file->private_data;
+	struct tracing_iterator *iter = m->private;
+
+	if (iter->trace && iter->trace->close)
+		iter->trace->close(iter);
+
+	/* re-enable tracing if it was previously enabled */
+	if (iter->tr->ctrl)
+		trace_enabled = 1;
+
+	seq_release(inode, file);
+	kfree(iter);
+	return 0;
+}
+
+static int tracing_open(struct inode *inode, struct file *file)
+{
+	int ret;
+
+	__tracing_open(inode, file, &ret);
+
+	return ret;
+}
+
+static int tracing_lt_open(struct inode *inode, struct file *file)
+{
+	struct tracing_iterator *iter;
+	int ret;
+
+	iter = __tracing_open(inode, file, &ret);
+
+	if (!ret)
+		iter->iter_flags |= TRACE_FILE_LAT_FMT;
+
+	return ret;
+}
+
+
+static void notrace *
+t_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct trace_types_struct *t = m->private;
+
+	(*pos)++;
+
+	if (t)
+		t = t->next;
+
+	m->private = t;
+
+	return t;
+}
+
+static void *t_start(struct seq_file *m, loff_t *pos)
+{
+	struct trace_types_struct *t = m->private;
+	loff_t l = 0;
+
+	mutex_lock(&trace_types_lock);
+	for (; t && l < *pos; t = t_next(m, t, &l))
+		;
+
+	return t;
+}
+
+static void t_stop(struct seq_file *m, void *p)
+{
+	mutex_unlock(&trace_types_lock);
+}
+
+static int t_show(struct seq_file *m, void *v)
+{
+	struct trace_types_struct *t = v;
+
+	if (!t)
+		return 0;
+
+	seq_printf(m, "%s", t->name);
+	if (t->next)
+		seq_putc(m, ' ');
+	else
+		seq_putc(m, '\n');
+
+	return 0;
+}
+
+static struct seq_operations show_traces_seq_ops = {
+	.start = t_start,
+	.next = t_next,
+	.stop = t_stop,
+	.show = t_show,
+};
+
+static int show_traces_open(struct inode *inode, struct file *file)
+{
+	int ret;
+
+	ret = seq_open(file, &show_traces_seq_ops);
+	if (!ret) {
+		struct seq_file *m = file->private_data;
+		m->private = trace_types;
+	}
+
+	return ret;
+}
+
+static struct file_operations tracing_fops = {
+	.open = tracing_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = tracing_release,
+};
+
+static struct file_operations tracing_lt_fops = {
+	.open = tracing_lt_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = tracing_release,
+};
+
+static struct file_operations show_traces_fops = {
+	.open = show_traces_open,
+	.read = seq_read,
+	.release = seq_release,
+};
+
+static ssize_t tracing_iter_ctrl_read(struct file *filp, char __user *ubuf,
+				      size_t cnt, loff_t *ppos)
+{
+	char *buf;
+	int r = 0;
+	int len = 0;
+	int i;
+
+	/* calculate max size */
+	for (i = 0; trace_options[i]; i++) {
+		len += strlen(trace_options[i]);
+		len += 3; /* "no" and space */
+	}
+
+	/* +2 for \n and \0 */
+	buf = kmalloc(len + 2, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	for (i = 0; trace_options[i]; i++) {
+		if (trace_flags & (1 << i))
+			r += sprintf(buf + r, "%s ", trace_options[i]);
+		else
+			r += sprintf(buf + r, "no%s ", trace_options[i]);
+	}
+
+	r += sprintf(buf + r, "\n");
+	WARN_ON(r >= len + 2);
+
+	r = simple_read_from_buffer(ubuf, cnt, ppos,
+				    buf, r);
+
+	kfree(buf);
+
+	return r;
+}
+
+static ssize_t tracing_iter_ctrl_write(struct file *filp,
+				       const char __user *ubuf,
+				       size_t cnt, loff_t *ppos)
+{
+	char buf[64];
+	char *cmp = buf;
+	int neg = 0;
+	int i;
+
+	if (cnt > 63)
+		cnt = 63;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+
+	buf[cnt] = 0;
+
+	if (strncmp(buf, "no", 2) == 0) {
+		neg = 1;
+		cmp += 2;
+	}
+
+	for (i = 0; trace_options[i]; i++) {
+		int len = strlen(trace_options[i]);
+
+		if (strncmp(cmp, trace_options[i], len) == 0) {
+			if (neg)
+				trace_flags &= ~(1 << i);
+			else
+				trace_flags |= (1 << i);
+			break;
+		}
+	}
+
+	filp->f_pos += cnt;
+
+	return cnt;
+}
+
+static struct file_operations tracing_iter_fops = {
+	.open = tracing_open_generic,
+	.read = tracing_iter_ctrl_read,
+	.write = tracing_iter_ctrl_write,
+};
+
+static ssize_t tracing_ctrl_read(struct file *filp, char __user *ubuf,
+				 size_t cnt, loff_t *ppos)
+{
+	struct tracing_trace *tr = filp->private_data;
+	char buf[64];
+	int r;
+
+	r = sprintf(buf, "%ld\n", tr->ctrl);
+	return simple_read_from_buffer(ubuf, cnt, ppos,
+				       buf, r);
+}
+
+static ssize_t tracing_ctrl_write(struct file *filp,
+				  const char __user *ubuf,
+				  size_t cnt, loff_t *ppos)
+{
+	struct tracing_trace *tr = filp->private_data;
+	long val;
+	char buf[64];
+
+	if (cnt > 63)
+		cnt = 63;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+
+	buf[cnt] = 0;
+
+	val = simple_strtoul(buf, NULL, 10);
+
+	val = !!val;
+
+	if (tr->ctrl ^ val) {
+		if (val)
+			trace_enabled = 1;
+		else
+			trace_enabled = 0;
+
+		tr->ctrl = val;
+
+		mutex_lock(&trace_types_lock);
+		if (current_trace->ctrl_update)
+			current_trace->ctrl_update(tr);
+		mutex_unlock(&trace_types_lock);
+	}
+
+	filp->f_pos += cnt;
+
+	return cnt;
+}
+
+static ssize_t tracing_set_trace_read(struct file *filp, char __user *ubuf,
+				      size_t cnt, loff_t *ppos)
+{
+	char buf[max_tracer_type_len+2];
+	int r;
+
+	mutex_lock(&trace_types_lock);
+	if (current_trace)
+		r = sprintf(buf, "%s\n", current_trace->name);
+	else
+		r = sprintf(buf, "\n");
+	mutex_unlock(&trace_types_lock);
+
+	return simple_read_from_buffer(ubuf, cnt, ppos,
+				       buf, r);
+}
+
+static ssize_t tracing_set_trace_write(struct file *filp,
+				       const char __user *ubuf,
+				       size_t cnt, loff_t *ppos)
+{
+	struct tracing_trace *tr = &tracer_trace;
+	struct trace_types_struct *t;
+	char buf[max_tracer_type_len+1];
+	int i;
+
+	if (cnt > max_tracer_type_len)
+		cnt = max_tracer_type_len;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+
+	buf[cnt] = 0;
+
+	/* strip ending whitespace. */
+	for (i = cnt - 1; i > 0 && isspace(buf[i]); i--)
+		buf[i] = 0;
+
+	mutex_lock(&trace_types_lock);
+	for (t = trace_types; t; t = t->next) {
+		if (strcmp(t->name, buf) == 0)
+			break;
+	}
+	if (!t || t == current_trace)
+		goto out;
+
+	if (current_trace && current_trace->reset)
+		current_trace->reset(tr);
+
+	current_trace = t;
+	if (t->init)
+		t->init(tr);
+
+ out:
+	mutex_unlock(&trace_types_lock);
+
+	filp->f_pos += cnt;
+
+	return cnt;
+}
+
+static struct file_operations tracing_ctrl_fops = {
+	.open = tracing_open_generic,
+	.read = tracing_ctrl_read,
+	.write = tracing_ctrl_write,
+};
+
+static struct file_operations set_tracer_fops = {
+	.open = tracing_open_generic,
+	.read = tracing_set_trace_read,
+	.write = tracing_set_trace_write,
+};
+
+static struct dentry *d_tracer;
+
+struct dentry *tracing_init_dentry(void)
+{
+	static int once;
+
+	if (d_tracer)
+		return d_tracer;
+
+	d_tracer = debugfs_create_dir("tracing", NULL);
+
+	if (!d_tracer && !once) {
+		once = 1;
+		pr_warning("Could not create debugfs directory 'tracing'\n");
+		return NULL;
+	}
+
+	return d_tracer;
+}
+
+static __init void tracer_init_debugfs(void)
+{
+	struct dentry *d_tracer;
+	struct dentry *entry;
+
+	d_tracer = tracing_init_dentry();
+
+	entry = debugfs_create_file("trace_ctrl", 0644, d_tracer,
+				    &tracer_trace, &tracing_ctrl_fops);
+	if (!entry)
+		pr_warning("Could not create debugfs 'trace_ctrl' entry\n");
+
+	entry = debugfs_create_file("iter_ctrl", 0644, d_tracer,
+				    NULL, &tracing_iter_fops);
+	if (!entry)
+		pr_warning("Could not create debugfs 'iter_ctrl' entry\n");
+
+	entry = debugfs_create_file("latency_trace", 0444, d_tracer,
+				    &tracer_trace, &tracing_lt_fops);
+	if (!entry)
+		pr_warning("Could not create debugfs 'latency_trace' entry\n");
+
+	entry = debugfs_create_file("trace", 0444, d_tracer,
+				    &tracer_trace, &tracing_fops);
+	if (!entry)
+		pr_warning("Could not create debugfs 'trace' entry\n");
+
+	entry = debugfs_create_file("available_tracers", 0444, d_tracer,
+				    &tracer_trace, &show_traces_fops);
+	if (!entry)
+		pr_warning("Could not create debugfs 'available_tracers' entry\n");
+
+	entry = debugfs_create_file("current_tracer", 0644, d_tracer,
+				    &tracer_trace, &set_tracer_fops);
+	if (!entry)
+		pr_warning("Could not create debugfs 'current_tracer' entry\n");
+}
+
+/* dummy trace to disable tracing */
+static struct trace_types_struct disable_trace __read_mostly =
+{
+	.name = "disable",
+};
+
+static inline notrace int page_order(const unsigned long size)
+{
+	const unsigned long nr_pages = DIV_ROUND_UP(size, PAGE_SIZE);
+	return ilog2(roundup_pow_of_two(nr_pages));
+}
+
+__init static int tracer_alloc_buffers(void)
+{
+	const int order = page_order(trace_nr_entries * TRACING_ENTRY_SIZE);
+	const unsigned long size = (1UL << order) << PAGE_SHIFT;
+	struct tracing_entry *array;
+	int i;
+
+	for_each_possible_cpu(i) {
+		tracer_trace.data[i] = &per_cpu(tracer_trace_cpu, i);
+		array = (struct tracing_entry *)
+			  __get_free_pages(GFP_KERNEL, order);
+		if (array == NULL) {
+			printk(KERN_ERR "tracer: failed to allocate"
+			       " %ld bytes for trace buffer!\n", size);
+			goto free_buffers;
+		}
+		tracer_trace.data[i]->trace = array;
+	}
+
+	/*
+	 * Since we allocate by orders of pages, we may be able to
+	 * round up a bit.
+	 */
+	tracer_trace.entries = size / TRACING_ENTRY_SIZE;
+
+	pr_info("tracer: %ld bytes allocated for %ld",
+		size, trace_nr_entries);
+	pr_info(" entries of %ld bytes\n", (long)TRACING_ENTRY_SIZE);
+	pr_info("   actual entries %ld\n", tracer_trace.entries);
+
+	tracer_init_debugfs();
+
+	register_trace(&disable_trace);
+
+	return 0;
+
+ free_buffers:
+	for (i-- ; i >= 0; i--) {
+		struct tracing_trace_cpu *data = tracer_trace.data[i];
+
+		if (data && data->trace) {
+			free_pages((unsigned long)data->trace, order);
+			data->trace = NULL;
+		}
+	}
+	return -ENOMEM;
+}
+
+device_initcall(tracer_alloc_buffers);
Index: linux-mcount.git/lib/tracing/tracer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git/lib/tracing/tracer.h	2008-01-29 18:06:25.000000000 -0500
@@ -0,0 +1,96 @@
+#ifndef _LINUX_MCOUNT_TRACER_H
+#define _LINUX_MCOUNT_TRACER_H
+
+#include <asm/atomic.h>
+#include <linux/sched.h>
+#include <linux/clocksource.h>
+
+struct tracing_function {
+	unsigned long ip;
+	unsigned long parent_ip;
+};
+
+struct tracing_entry {
+	char type;
+	char cpu;  /* who will want to trace more than 256 CPUS? */
+	char flags;
+	char preempt_count; /* assumes PREEMPT_MASK is 8 bits or less */
+	int pid;
+	cycle_t t;
+	char comm[TASK_COMM_LEN];
+	struct tracing_function fn;
+};
+
+struct tracing_trace_cpu {
+	void *trace;
+	unsigned long trace_idx;
+	atomic_t      disabled;
+	atomic_t      underrun;
+	unsigned long saved_latency;
+	unsigned long critical_start;
+	unsigned long critical_end;
+	unsigned long critical_sequence;
+	unsigned long nice;
+	unsigned long policy;
+	unsigned long rt_priority;
+	cycle_t preempt_timestamp;
+	pid_t	      pid;
+	uid_t	      uid;
+	char comm[TASK_COMM_LEN];
+};
+
+struct tracing_iterator;
+
+struct tracing_trace {
+	unsigned long entries;
+	long	      ctrl;
+	int	      cpu;
+	cycle_t	      time_start;
+	struct tracing_trace_cpu *data[NR_CPUS];
+};
+
+struct trace_types_struct {
+	const char *name;
+	void (*init)(struct tracing_trace *tr);
+	void (*reset)(struct tracing_trace *tr);
+	void (*open)(struct tracing_iterator *iter);
+	void (*close)(struct tracing_iterator *iter);
+	void (*start)(struct tracing_iterator *iter);
+	void (*stop)(struct tracing_iterator *iter);
+	void (*ctrl_update)(struct tracing_trace *tr);
+	struct trace_types_struct *next;
+};
+
+struct tracing_iterator {
+	struct tracing_trace *tr;
+	struct trace_types_struct *trace;
+	struct tracing_entry *ent;
+	unsigned long iter_flags;
+	loff_t pos;
+	unsigned long next_idx[NR_CPUS];
+	int cpu;
+	int idx;
+};
+
+#define TRACING_ENTRY_SIZE sizeof(struct tracing_entry)
+
+void notrace tracing_reset(struct tracing_trace_cpu *data);
+int tracing_open_generic(struct inode *inode, struct file *filp);
+struct dentry *tracing_init_dentry(void);
+void tracing_function_trace(struct tracing_trace *tr,
+			    struct tracing_trace_cpu *data,
+			    unsigned long ip,
+			    unsigned long parent_ip,
+			    unsigned long flags);
+
+void tracing_start_function_trace(void);
+void tracing_stop_function_trace(void);
+int register_trace(struct trace_types_struct *type);
+void unregister_trace(struct trace_types_struct *type);
+
+static inline notrace cycle_t now(void)
+{
+	return get_monotonic_cycles();
+}
+
+#endif /* _LINUX_MCOUNT_TRACER_H */
Index: linux-mcount.git/lib/Makefile
===================================================================
--- linux-mcount.git.orig/lib/Makefile	2008-01-29 17:26:18.000000000 -0500
+++ linux-mcount.git/lib/Makefile	2008-01-29 17:26:43.000000000 -0500
@@ -68,6 +68,7 @@ obj-$(CONFIG_SWIOTLB) += swiotlb.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 
 obj-$(CONFIG_MCOUNT) += tracing/
+obj-$(CONFIG_TRACING) += tracing/
 
 lib-$(CONFIG_GENERIC_BUG) += bug.o
 
Index: linux-mcount.git/lib/tracing/trace_function.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git/lib/tracing/trace_function.c	2008-01-29 18:06:24.000000000 -0500
@@ -0,0 +1,72 @@
+/*
+ * ring buffer based mcount tracer
+ *
+ * Copyright (C) 2007 Steven Rostedt <srostedt@redhat.com>
+ *
+ * Based on code from the latency_tracer, that is:
+ *
+ *  Copyright (C) 2004-2006 Ingo Molnar
+ *  Copyright (C) 2004 William Lee Irwin III
+ */
+#include <linux/fs.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/mcount.h>
+
+#include "tracer.h"
+
+static notrace void function_reset(struct tracing_trace *tr)
+{
+	int cpu;
+
+	tr->time_start = now();
+
+	for_each_online_cpu(cpu)
+		tracing_reset(tr->data[cpu]);
+}
+
+static notrace void start_function_trace(struct tracing_trace *tr)
+{
+	function_reset(tr);
+	tracing_start_function_trace();
+}
+
+static notrace void stop_function_trace(struct tracing_trace *tr)
+{
+	tracing_stop_function_trace();
+}
+
+static notrace void function_trace_init(struct tracing_trace *tr)
+{
+	if (tr->ctrl)
+		start_function_trace(tr);
+}
+
+static notrace void function_trace_reset(struct tracing_trace *tr)
+{
+	if (tr->ctrl)
+		stop_function_trace(tr);
+}
+
+static notrace void function_trace_ctrl_update(struct tracing_trace *tr)
+{
+	if (tr->ctrl)
+		start_function_trace(tr);
+	else
+		stop_function_trace(tr);
+}
+
+static struct trace_types_struct function_trace __read_mostly =
+{
+	.name = "function",
+	.init = function_trace_init,
+	.reset = function_trace_reset,
+	.ctrl_update = function_trace_ctrl_update,
+};
+
+static __init int init_function_trace(void)
+{
+	return register_trace(&function_trace);
+}
+
+device_initcall(init_function_trace);

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 11/22 -v7] Add context switch marker to sched.c
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (9 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 10/22 -v7] mcount based trace in the form of a header file library Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 12/22 -v7] Make the task State char-string visible to all Steven Rostedt
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: add-trace-hooks-to-sched.patch --]
[-- Type: text/plain, Size: 703 bytes --]

Add a marker to context_switch() to record the prev and next tasks.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/sched.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-mcount.git/kernel/sched.c
===================================================================
--- linux-mcount.git.orig/kernel/sched.c	2008-01-25 21:46:55.000000000 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-25 21:47:19.000000000 -0500
@@ -2198,6 +2198,8 @@ context_switch(struct rq *rq, struct tas
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
+	trace_mark(kernel_sched_schedule,
+		   "prev %p next %p", prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 12/22 -v7] Make the task State char-string visible to all
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (10 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 11/22 -v7] Add context switch marker to sched.c Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 13/22 -v7] Add tracing of context switches Steven Rostedt
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: parse-out-task-state-to-char-string.patch --]
[-- Type: text/plain, Size: 1305 bytes --]

The tracer wants to be able to convert the state number
into a user-visible character. This patch pulls that conversion
string out of the scheduler into a header. This way, if it
ever changes, other parts of the kernel will pick up the change.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/sched.h |    2 ++
 kernel/sched.c        |    2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-mcount.git/include/linux/sched.h
===================================================================
--- linux-mcount.git.orig/include/linux/sched.h	2008-01-25 21:46:55.000000000 -0500
+++ linux-mcount.git/include/linux/sched.h	2008-01-25 21:47:21.000000000 -0500
@@ -2055,6 +2055,8 @@ static inline void migration_init(void)
 }
 #endif
 
+#define TASK_STATE_TO_CHAR_STR "RSDTtZX"
+
 #endif /* __KERNEL__ */
 
 #endif
Index: linux-mcount.git/kernel/sched.c
===================================================================
--- linux-mcount.git.orig/kernel/sched.c	2008-01-25 21:47:19.000000000 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-25 21:47:21.000000000 -0500
@@ -5149,7 +5149,7 @@ out_unlock:
 	return retval;
 }
 
-static const char stat_nam[] = "RSDTtZX";
+static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
 
 void sched_show_task(struct task_struct *p)
 {

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 13/22 -v7] Add tracing of context switches
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (11 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 12/22 -v7] Make the task State char-string visible to all Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 14/22 -v7] Generic command line storage Steven Rostedt
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: trace-add-cmdline-switch.patch --]
[-- Type: text/plain, Size: 10854 bytes --]

This patch adds context switch tracing, with output in the following format:

                 _------=> CPU#            
                / _-----=> irqs-off        
               | / _----=> need-resched    
               || / _---=> hardirq/softirq 
               ||| / _--=> preempt-depth   
               |||| /                      
               |||||     delay             
   cmd     pid ||||| time  |   caller      
      \   /    |||||   \   |   /           
 swapper-0     1d..3  137us+:  0:140:R --> 2912:120
    sshd-2912  1d..3  216us+:  2912:120:S --> 0:140
 swapper-0     1d..3  261us+:  0:140:R --> 2912:120
    bash-2920  0d..3  267us+:  2920:120:S --> 0:140
    sshd-2912  1d..3  330us!:  2912:120:S --> 0:140
 swapper-0     1d..3 2389us+:  0:140:R --> 2847:120
yum-upda-2847  1d..3 2411us!:  2847:120:S --> 0:140
 swapper-0     0d..3 11089us+:  0:140:R --> 3139:120
gdm-bina-3139  0d..3 11113us!:  3139:120:S --> 0:140
 swapper-0     1d..3 102328us+:  0:140:R --> 2847:120
yum-upda-2847  1d..3 102348us!:  2847:120:S --> 0:140


 "sched_switch" is added to /debugfs/tracing/available_tracers

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 lib/tracing/Kconfig              |    9 ++
 lib/tracing/Makefile             |    1 
 lib/tracing/trace_sched_switch.c |  165 +++++++++++++++++++++++++++++++++++++++
 lib/tracing/tracer.c             |   43 ++++++++++
 lib/tracing/tracer.h             |   23 +++++
 5 files changed, 240 insertions(+), 1 deletion(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===================================================================
--- linux-mcount.git.orig/lib/tracing/Kconfig	2008-01-29 18:06:25.000000000 -0500
+++ linux-mcount.git/lib/tracing/Kconfig	2008-01-29 18:08:06.000000000 -0500
@@ -23,3 +23,12 @@ config FUNCTION_TRACER
 	  insert a call to an architecture specific __mcount routine,
 	  that the debugging mechanism using this facility will hook by
 	  providing a set of inline routines.
+
+config CONTEXT_SWITCH_TRACER
+	bool "Trace process context switches"
+	depends on DEBUG_KERNEL
+	select TRACING
+	help
+	  This tracer hooks into the context switch and records
+	  all switching of tasks.
+
Index: linux-mcount.git/lib/tracing/Makefile
===================================================================
--- linux-mcount.git.orig/lib/tracing/Makefile	2008-01-29 18:06:25.000000000 -0500
+++ linux-mcount.git/lib/tracing/Makefile	2008-01-29 18:08:06.000000000 -0500
@@ -1,6 +1,7 @@
 obj-$(CONFIG_MCOUNT) += libmcount.o
 
 obj-$(CONFIG_TRACING) += tracer.o
+obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace_function.o
 
 libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/trace_sched_switch.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git/lib/tracing/trace_sched_switch.c	2008-01-29 18:08:06.000000000 -0500
@@ -0,0 +1,165 @@
+/*
+ * trace context switch
+ *
+ * Copyright (C) 2007 Steven Rostedt <srostedt@redhat.com>
+ *
+ */
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/debugfs.h>
+#include <linux/kallsyms.h>
+#include <linux/uaccess.h>
+#include <linux/marker.h>
+#include <linux/mcount.h>
+
+#include "tracer.h"
+
+static struct tracing_trace *tracer_trace;
+static int trace_enabled __read_mostly;
+static atomic_t sched_ref;
+int tracing_sched_switch_enabled __read_mostly;
+
+static notrace void sched_switch_callback(const struct marker *mdata,
+					  void *private_data,
+					  const char *format, ...)
+{
+	struct tracing_trace **p = mdata->private;
+	struct tracing_trace *tr = *p;
+	struct tracing_trace_cpu *data;
+	struct task_struct *prev;
+	struct task_struct *next;
+	unsigned long flags;
+	va_list ap;
+	int cpu;
+
+	if (!trace_enabled)
+		return;
+
+	va_start(ap, format);
+	prev = va_arg(ap, typeof(prev));
+	next = va_arg(ap, typeof(next));
+	va_end(ap);
+
+	raw_local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+	data = tr->data[cpu];
+	atomic_inc(&data->disabled);
+
+	if (likely(atomic_read(&data->disabled) == 1))
+		tracing_sched_switch_trace(tr, data, prev, next, flags);
+
+	atomic_dec(&data->disabled);
+	raw_local_irq_restore(flags);
+}
+
+static notrace void sched_switch_reset(struct tracing_trace *tr)
+{
+	int cpu;
+
+	tr->time_start = now();
+
+	for_each_online_cpu(cpu)
+		tracing_reset(tr->data[cpu]);
+}
+
+static notrace void start_sched_trace(struct tracing_trace *tr)
+{
+	sched_switch_reset(tr);
+	trace_enabled = 1;
+	tracing_start_sched_switch();
+}
+
+static notrace void stop_sched_trace(struct tracing_trace *tr)
+{
+	tracing_stop_sched_switch();
+	trace_enabled = 0;
+}
+
+static notrace void sched_switch_trace_init(struct tracing_trace *tr)
+{
+	tracer_trace = tr;
+
+	if (tr->ctrl)
+		start_sched_trace(tr);
+}
+
+static notrace void sched_switch_trace_reset(struct tracing_trace *tr)
+{
+	if (tr->ctrl)
+		stop_sched_trace(tr);
+}
+
+static void sched_switch_trace_ctrl_update(struct tracing_trace *tr)
+{
+	/* When starting a new trace, reset the buffers */
+	if (tr->ctrl)
+		start_sched_trace(tr);
+	else
+		stop_sched_trace(tr);
+}
+
+static struct trace_types_struct sched_switch_trace __read_mostly =
+{
+	.name = "sched_switch",
+	.init = sched_switch_trace_init,
+	.reset = sched_switch_trace_reset,
+	.ctrl_update = sched_switch_trace_ctrl_update,
+};
+
+static int tracing_sched_arm(void)
+{
+	int ret;
+
+	ret = marker_arm("kernel_sched_schedule");
+	if (ret)
+		pr_info("sched trace: Couldn't arm probe switch_to\n");
+
+	return ret;
+}
+
+void tracing_start_sched_switch(void)
+{
+	long ref;
+
+	ref = atomic_inc_return(&sched_ref);
+	if (tracing_sched_switch_enabled && ref == 1)
+		tracing_sched_arm();
+}
+
+void tracing_stop_sched_switch(void)
+{
+	long ref;
+
+	ref = atomic_dec_return(&sched_ref);
+	if (tracing_sched_switch_enabled && !ref)
+		marker_disarm("kernel_sched_schedule");
+}
+
+__init static int init_sched_switch_trace(void)
+{
+	int ret;
+
+	ret = register_trace(&sched_switch_trace);
+	if (ret)
+		return ret;
+
+	ret = marker_probe_register("kernel_sched_schedule",
+				    "prev %p next %p",
+				    sched_switch_callback,
+				    &tracer_trace);
+	if (ret) {
+		pr_info("sched trace: Couldn't add marker"
+			" probe to switch_to\n");
+		goto out;
+	}
+
+	if (atomic_read(&sched_ref))
+		ret = tracing_sched_arm();
+
+	tracing_sched_switch_enabled = 1;
+
+ out:
+	return ret;
+}
+
+device_initcall(init_sched_switch_trace);
Index: linux-mcount.git/lib/tracing/tracer.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/tracer.c	2008-01-29 18:07:38.000000000 -0500
+++ linux-mcount.git/lib/tracing/tracer.c	2008-01-29 18:08:06.000000000 -0500
@@ -52,6 +52,7 @@ enum trace_type {
 	__TRACE_FIRST_TYPE = 0,
 
 	TRACE_FN,
+	TRACE_CTX,
 
 	__TRACE_LAST_TYPE
 };
@@ -229,6 +230,24 @@ notrace void tracing_function_trace(stru
 	entry->fn.parent_ip = parent_ip;
 }
 
+notrace void tracing_sched_switch_trace(struct tracing_trace *tr,
+					struct tracing_trace_cpu *data,
+					struct task_struct *prev,
+					struct task_struct *next,
+					unsigned long flags)
+{
+	struct tracing_entry *entry;
+
+	entry = tracing_get_trace_entry(tr, data);
+	tracing_generic_entry_update(entry, flags);
+	entry->type		= TRACE_CTX;
+	entry->ctx.prev_pid	= prev->pid;
+	entry->ctx.prev_prio	= prev->prio;
+	entry->ctx.prev_state	= prev->state;
+	entry->ctx.next_pid	= next->pid;
+	entry->ctx.next_prio	= next->prio;
+}
+
 enum trace_iterator {
 	TRACE_ITER_SYM_ONLY	= 1,
 	TRACE_ITER_VERBOSE	= 2,
@@ -547,6 +566,8 @@ lat_print_timestamp(struct seq_file *m, 
 		seq_puts(m, " : ");
 }
 
+static const char state_to_char[] = TASK_STATE_TO_CHAR_STR;
+
 static void notrace
 print_lat_fmt(struct seq_file *m, struct tracing_iterator *iter,
 	      unsigned int trace_idx, int cpu)
@@ -557,6 +578,7 @@ print_lat_fmt(struct seq_file *m, struct
 	unsigned long rel_usecs;
 	int sym_only = !!(trace_flags & TRACE_ITER_SYM_ONLY);
 	int verbose = !!(trace_flags & TRACE_ITER_VERBOSE);
+	int S;
 
 	if (!next_entry)
 		next_entry = entry;
@@ -583,6 +605,16 @@ print_lat_fmt(struct seq_file *m, struct
 		seq_print_ip_sym(m, entry->fn.parent_ip, sym_only);
 		seq_puts(m, ")\n");
 		break;
+	case TRACE_CTX:
+		S = entry->ctx.prev_state < sizeof(state_to_char) ?
+			state_to_char[entry->ctx.prev_state] : 'X';
+		seq_printf(m, " %d:%d:%c --> %d:%d\n",
+			   entry->ctx.prev_pid,
+			   entry->ctx.prev_prio,
+			   S,
+			   entry->ctx.next_pid,
+			   entry->ctx.next_prio);
+		break;
 	}
 }
 
@@ -594,6 +626,7 @@ static void notrace print_trace_fmt(stru
 	unsigned long secs;
 	int sym_only = !!(trace_flags & TRACE_ITER_SYM_ONLY);
 	unsigned long long t;
+	int S;
 
 	t = cycles_to_usecs(entry->t);
 	usec_rem = do_div(t, 1000000ULL);
@@ -612,6 +645,16 @@ static void notrace print_trace_fmt(stru
 					 sym_only);
 		}
 		break;
+	case TRACE_CTX:
+		S = entry->ctx.prev_state < sizeof(state_to_char) ?
+			state_to_char[entry->ctx.prev_state] : 'X';
+		seq_printf(m, " %d:%d:%c ==> %d:%d\n",
+			   entry->ctx.prev_pid,
+			   entry->ctx.prev_prio,
+			   S,
+			   entry->ctx.next_pid,
+			   entry->ctx.next_prio);
+		break;
 	}
 	seq_printf(m, "\n");
 }
Index: linux-mcount.git/lib/tracing/tracer.h
===================================================================
--- linux-mcount.git.orig/lib/tracing/tracer.h	2008-01-29 18:06:25.000000000 -0500
+++ linux-mcount.git/lib/tracing/tracer.h	2008-01-29 18:08:06.000000000 -0500
@@ -10,6 +10,14 @@ struct tracing_function {
 	unsigned long parent_ip;
 };
 
+struct tracing_sched_switch {
+	unsigned int prev_pid;
+	unsigned char prev_prio;
+	unsigned char prev_state;
+	unsigned int next_pid;
+	unsigned char next_prio;
+};
+
 struct tracing_entry {
 	char type;
 	char cpu;  /* who will want to trace more than 256 CPUS? */
@@ -18,7 +26,10 @@ struct tracing_entry {
 	int pid;
 	cycle_t t;
 	char comm[TASK_COMM_LEN];
-	struct tracing_function fn;
+	union {
+		struct tracing_function fn;
+		struct tracing_sched_switch ctx;
+	};
 };
 
 struct tracing_trace_cpu {
@@ -82,11 +93,21 @@ void tracing_function_trace(struct traci
 			    unsigned long ip,
 			    unsigned long parent_ip,
 			    unsigned long flags);
+void tracing_sched_switch_trace(struct tracing_trace *tr,
+				struct tracing_trace_cpu *data,
+				struct task_struct *prev,
+				struct task_struct *next,
+				unsigned long flags);
+
 
 void tracing_start_function_trace(void);
 void tracing_stop_function_trace(void);
 int register_trace(struct trace_types_struct *type);
 void unregister_trace(struct trace_types_struct *type);
+void tracing_start_sched_switch(void);
+void tracing_stop_sched_switch(void);
+
+extern int tracing_sched_switch_enabled;
 
 static inline notrace cycle_t now(void)
 {

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 14/22 -v7] Generic command line storage
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (12 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 13/22 -v7] Add tracing of context switches Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 15/22 -v7] trace generic call to schedule switch Steven Rostedt
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: trace-generic-cmdline.patch --]
[-- Type: text/plain, Size: 8872 bytes --]

Saving the comm of tasks for each trace is very expensive.
This patch adds, in the context switch hook, a way to
store the command lines of the last 128 tasks. This table is
consulted when a trace is to be printed.

Note: The comm may be destroyed if other traces are performed.
Later (TBD) patches may simply store this information in the trace
itself.
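The pid-to-comm cache can be sketched in plain userspace C (a hedged
illustration only: the sizes, names, and the explicit update of the
reverse map on every save are illustrative, not the kernel patch
verbatim):

```c
#include <assert.h>
#include <string.h>

#define PID_MAX		4096
#define SAVED_CMDLINES	4	/* tiny, to make eviction visible */
#define COMM_LEN	16

static int map_pid_to_cmdline[PID_MAX + 1];
static int map_cmdline_to_pid[SAVED_CMDLINES];
static char saved_cmdlines[SAVED_CMDLINES][COMM_LEN];
static int cmdline_idx;

static void init_cmdlines(void)
{
	/* -1 in every byte yields -1 in every int slot */
	memset(map_pid_to_cmdline, -1, sizeof(map_pid_to_cmdline));
	memset(map_cmdline_to_pid, -1, sizeof(map_cmdline_to_pid));
	cmdline_idx = 0;
}

static void save_cmdline(int pid, const char *comm)
{
	int idx, old;

	if (pid <= 0 || pid > PID_MAX)
		return;

	idx = map_pid_to_cmdline[pid];
	if (idx < 0) {
		/* no slot yet: reuse the next ring slot, evicting its pid */
		idx = (cmdline_idx + 1) % SAVED_CMDLINES;
		old = map_cmdline_to_pid[idx];
		if (old >= 0)
			map_pid_to_cmdline[old] = -1;
		map_cmdline_to_pid[idx] = pid;
		map_pid_to_cmdline[pid] = idx;
		cmdline_idx = idx;
	}
	strncpy(saved_cmdlines[idx], comm, COMM_LEN - 1);
}

static const char *find_cmdline(int pid)
{
	int idx;

	if (!pid)
		return "<idle>";
	if (pid > PID_MAX)
		return "<...>";
	idx = map_pid_to_cmdline[pid];
	return idx < 0 ? "<...>" : saved_cmdlines[idx];
}
```

The point of the two maps is that lookup by pid is O(1) while eviction
only ever touches one ring slot, so the hook stays cheap on the context
switch path.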

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 lib/tracing/Kconfig              |    1 
 lib/tracing/trace_function.c     |    2 
 lib/tracing/trace_sched_switch.c |    5 +
 lib/tracing/tracer.c             |  108 ++++++++++++++++++++++++++++++++++++---
 lib/tracing/tracer.h             |    3 -
 5 files changed, 111 insertions(+), 8 deletions(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===================================================================
--- linux-mcount.git.orig/lib/tracing/Kconfig	2008-01-29 18:08:06.000000000 -0500
+++ linux-mcount.git/lib/tracing/Kconfig	2008-01-29 18:09:01.000000000 -0500
@@ -18,6 +18,7 @@ config FUNCTION_TRACER
 	depends on DEBUG_KERNEL && HAVE_MCOUNT
 	select MCOUNT
 	select TRACING
+	select CONTEXT_SWITCH_TRACER
 	help
 	  Use profiler instrumentation, adding -pg to CFLAGS. This will
 	  insert a call to an architecture specific __mcount routine,
Index: linux-mcount.git/lib/tracing/trace_function.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/trace_function.c	2008-01-29 18:06:24.000000000 -0500
+++ linux-mcount.git/lib/tracing/trace_function.c	2008-01-29 18:08:10.000000000 -0500
@@ -29,10 +29,12 @@ static notrace void start_function_trace
 {
 	function_reset(tr);
 	tracing_start_function_trace();
+	tracing_start_sched_switch();
 }
 
 static notrace void stop_function_trace(struct tracing_trace *tr)
 {
+	tracing_stop_sched_switch();
 	tracing_stop_function_trace();
 }
 
Index: linux-mcount.git/lib/tracing/trace_sched_switch.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/trace_sched_switch.c	2008-01-29 18:08:06.000000000 -0500
+++ linux-mcount.git/lib/tracing/trace_sched_switch.c	2008-01-29 18:09:03.000000000 -0500
@@ -32,6 +32,11 @@ static notrace void sched_switch_callbac
 	va_list ap;
 	int cpu;
 
+	if (!atomic_read(&sched_ref))
+		return;
+
+	tracing_record_cmdline(current);
+
 	if (!trace_enabled)
 		return;
 
Index: linux-mcount.git/lib/tracing/tracer.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/tracer.c	2008-01-29 18:08:06.000000000 -0500
+++ linux-mcount.git/lib/tracing/tracer.c	2008-01-29 18:10:04.000000000 -0500
@@ -171,6 +171,87 @@ void tracing_stop_function_trace(void)
 	unregister_mcount_function(&trace_ops);
 }
 
+#define SAVED_CMDLINES 128
+static unsigned map_pid_to_cmdline[PID_MAX_DEFAULT+1];
+static unsigned map_cmdline_to_pid[SAVED_CMDLINES];
+static char saved_cmdlines[SAVED_CMDLINES][TASK_COMM_LEN];
+static int cmdline_idx;
+static DEFINE_SPINLOCK(trace_cmdline_lock);
+atomic_t trace_record_cmdline_disabled;
+
+static void trace_init_cmdlines(void)
+{
+	memset(&map_pid_to_cmdline, -1, sizeof(map_pid_to_cmdline));
+	memset(&map_cmdline_to_pid, -1, sizeof(map_cmdline_to_pid));
+	cmdline_idx = 0;
+}
+
+notrace void trace_stop_cmdline_recording(void);
+
+static void notrace trace_save_cmdline(struct task_struct *tsk)
+{
+	unsigned map;
+	unsigned idx;
+
+	if (!tsk->pid || unlikely(tsk->pid > PID_MAX_DEFAULT))
+		return;
+
+	/*
+	 * It's not the end of the world if we don't get
+	 * the lock, but we also don't want to spin
+	 * nor do we want to disable interrupts,
+	 * so if we miss here, then better luck next time.
+	 */
+	if (!spin_trylock(&trace_cmdline_lock))
+		return;
+
+	idx = map_pid_to_cmdline[tsk->pid];
+	if (idx >= SAVED_CMDLINES) {
+		idx = (cmdline_idx + 1) % SAVED_CMDLINES;
+
+		map = map_cmdline_to_pid[idx];
+		if (map <= PID_MAX_DEFAULT)
+			map_pid_to_cmdline[map] = (unsigned)-1;
+
+		map_pid_to_cmdline[tsk->pid] = idx;
+
+		cmdline_idx = idx;
+	}
+
+	memcpy(&saved_cmdlines[idx], tsk->comm, TASK_COMM_LEN);
+
+	spin_unlock(&trace_cmdline_lock);
+}
+
+static notrace char *trace_find_cmdline(int pid)
+{
+	char *cmdline = "<...>";
+	unsigned map;
+
+	if (!pid)
+		return "<idle>";
+
+	if (pid > PID_MAX_DEFAULT)
+		goto out;
+
+	map = map_pid_to_cmdline[pid];
+	if (map >= SAVED_CMDLINES)
+		goto out;
+
+	cmdline = saved_cmdlines[map];
+
+ out:
+	return cmdline;
+}
+
+void tracing_record_cmdline(struct task_struct *tsk)
+{
+	if (atomic_read(&trace_record_cmdline_disabled))
+		return;
+
+	trace_save_cmdline(tsk);
+}
+
 static inline notrace struct tracing_entry *
 tracing_get_trace_entry(struct tracing_trace *tr,
 			struct tracing_trace_cpu *data)
@@ -212,7 +293,6 @@ tracing_generic_entry_update(struct trac
 		((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |
 		((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ : 0) |
 		(need_resched() ? TRACE_FLAG_NEED_RESCHED : 0);
-	memcpy(entry->comm, tsk->comm, TASK_COMM_LEN);
 }
 
 notrace void tracing_function_trace(struct tracing_trace *tr,
@@ -368,6 +448,8 @@ static void *s_start(struct seq_file *m,
 	if (!current_trace || current_trace != iter->trace)
 		return NULL;
 
+	atomic_inc(&trace_record_cmdline_disabled);
+
 	/* let the tracer grab locks here if needed */
 	if (current_trace->start)
 		current_trace->start(iter);
@@ -395,6 +477,8 @@ static void s_stop(struct seq_file *m, v
 {
 	struct tracing_iterator *iter = m->private;
 
+	atomic_dec(&trace_record_cmdline_disabled);
+
 	/* let the tracer release locks here if needed */
 	if (current_trace && current_trace == iter->trace && iter->trace->stop)
 		iter->trace->stop(iter);
@@ -523,8 +607,11 @@ static void notrace
 lat_print_generic(struct seq_file *m, struct tracing_entry *entry, int cpu)
 {
 	int hardirq, softirq;
+	char *comm;
+
+	comm = trace_find_cmdline(entry->pid);
 
-	seq_printf(m, "%8.8s-%-5d ", entry->comm, entry->pid);
+	seq_printf(m, "%8.8s-%-5d ", comm, entry->pid);
 	seq_printf(m, "%d", cpu);
 	seq_printf(m, "%c%c",
 		   (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' : '.',
@@ -576,6 +663,7 @@ print_lat_fmt(struct seq_file *m, struct
 	struct tracing_entry *next_entry = find_next_entry(iter, NULL);
 	unsigned long abs_usecs;
 	unsigned long rel_usecs;
+	char *comm;
 	int sym_only = !!(trace_flags & TRACE_ITER_SYM_ONLY);
 	int verbose = !!(trace_flags & TRACE_ITER_VERBOSE);
 	int S;
@@ -586,9 +674,10 @@ print_lat_fmt(struct seq_file *m, struct
 	abs_usecs = cycles_to_usecs(entry->t - iter->tr->time_start);
 
 	if (verbose) {
+		comm = trace_find_cmdline(entry->pid);
 		seq_printf(m, "%16s %5d %d %d %08x %08x [%08lx]"
 			   " %ld.%03ldms (+%ld.%03ldms): ",
-			   entry->comm,
+			   comm,
 			   entry->pid, cpu, entry->flags,
 			   entry->preempt_count, trace_idx,
 			   cycles_to_usecs(entry->t),
@@ -608,12 +697,14 @@ print_lat_fmt(struct seq_file *m, struct
 	case TRACE_CTX:
 		S = entry->ctx.prev_state < sizeof(state_to_char) ?
 			state_to_char[entry->ctx.prev_state] : 'X';
-		seq_printf(m, " %d:%d:%c --> %d:%d\n",
+		comm = trace_find_cmdline(entry->ctx.next_pid);
+		seq_printf(m, " %d:%d:%c --> %d:%d %s\n",
 			   entry->ctx.prev_pid,
 			   entry->ctx.prev_prio,
 			   S,
 			   entry->ctx.next_pid,
-			   entry->ctx.next_prio);
+			   entry->ctx.next_prio,
+			   comm);
 		break;
 	}
 }
@@ -626,15 +717,18 @@ static void notrace print_trace_fmt(stru
 	unsigned long secs;
 	int sym_only = !!(trace_flags & TRACE_ITER_SYM_ONLY);
 	unsigned long long t;
+	char *comm;
 	int S;
 
+	comm = trace_find_cmdline(iter->ent->pid);
+
 	t = cycles_to_usecs(entry->t);
 	usec_rem = do_div(t, 1000000ULL);
 	secs = (unsigned long)t;
 
 	seq_printf(m, "[%5lu.%06lu] ", secs, usec_rem);
 	seq_printf(m, "CPU %d: ", iter->cpu);
-	seq_printf(m, "%s:%d ", entry->comm,
+	seq_printf(m, "%s:%d ", comm,
 		   entry->pid);
 	switch (entry->type) {
 	case TRACE_FN:
@@ -1184,6 +1278,8 @@ __init static int tracer_alloc_buffers(v
 
 	tracer_init_debugfs();
 
+	trace_init_cmdlines();
+
 	register_trace(&disable_trace);
 
 	return 0;
Index: linux-mcount.git/lib/tracing/tracer.h
===================================================================
--- linux-mcount.git.orig/lib/tracing/tracer.h	2008-01-29 18:08:06.000000000 -0500
+++ linux-mcount.git/lib/tracing/tracer.h	2008-01-29 18:09:03.000000000 -0500
@@ -25,7 +25,6 @@ struct tracing_entry {
 	char preempt_count; /* assumes PREEMPT_MASK is 8 bits or less */
 	int pid;
 	cycle_t t;
-	char comm[TASK_COMM_LEN];
 	union {
 		struct tracing_function fn;
 		struct tracing_sched_switch ctx;
@@ -98,7 +97,7 @@ void tracing_sched_switch_trace(struct t
 				struct task_struct *prev,
 				struct task_struct *next,
 				unsigned long flags);
-
+void tracing_record_cmdline(struct task_struct *tsk);
 
 void tracing_start_function_trace(void);
 void tracing_stop_function_trace(void);

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 15/22 -v7] trace generic call to schedule switch
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (13 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 14/22 -v7] Generic command line storage Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 16/22 -v7] Add marker in try_to_wake_up Steven Rostedt
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: trace-sched-hooks.patch --]
[-- Type: text/plain, Size: 5094 bytes --]

This patch adds hooks into the context switch tracer to
allow other latency tracers to be notified of context switches.
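The registration scheme in this patch can be sketched as a singly
linked callback chain (a hedged userspace sketch: names mirror the
patch, but the locking, the single-callback fast path, and the marker
plumbing are omitted):

```c
#include <assert.h>
#include <stddef.h>

struct task_struct;	/* opaque in this sketch */

typedef void (*tracer_switch_func_t)(void *private,
				     struct task_struct *prev,
				     struct task_struct *next);

struct tracer_switch_ops {
	tracer_switch_func_t func;
	void *private;
	struct tracer_switch_ops *next;
};

static struct tracer_switch_ops *switch_list;

/* push a new callback onto the head of the chain */
static int register_tracer_switch(struct tracer_switch_ops *ops)
{
	ops->next = switch_list;
	switch_list = ops;
	return 0;
}

/* what the sched-switch hook would do: invoke every registered func */
static void call_switch_chain(struct task_struct *prev,
			      struct task_struct *next)
{
	struct tracer_switch_ops *ops;

	for (ops = switch_list; ops; ops = ops->next)
		ops->func(ops->private, prev, next);
}

/* demo callback: count invocations through its private pointer */
static void count_switch(void *private, struct task_struct *prev,
			 struct task_struct *next)
{
	(void)prev;
	(void)next;
	(*(int *)private)++;
}
```

In the real patch the common case (only the built-in sched-switch
callback registered) bypasses the loop entirely by pointing
`tracer_switch_func` straight at that one function.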

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 lib/tracing/trace_sched_switch.c |  123 +++++++++++++++++++++++++++++++++------
 lib/tracing/tracer.h             |   14 ++++
 2 files changed, 119 insertions(+), 18 deletions(-)

Index: linux-mcount.git/lib/tracing/tracer.h
===================================================================
--- linux-mcount.git.orig/lib/tracing/tracer.h	2008-01-29 12:35:27.000000000 -0500
+++ linux-mcount.git/lib/tracing/tracer.h	2008-01-29 14:22:15.000000000 -0500
@@ -113,4 +113,18 @@ static inline notrace cycle_t now(void)
 	return get_monotonic_cycles();
 }
 
+#ifdef CONFIG_CONTEXT_SWITCH_TRACER
+typedef void (*tracer_switch_func_t)(void *private,
+				     struct task_struct *prev,
+				     struct task_struct *next);
+struct tracer_switch_ops {
+	tracer_switch_func_t func;
+	void *private;
+	struct tracer_switch_ops *next;
+};
+
+extern int register_tracer_switch(struct tracer_switch_ops *ops);
+extern int unregister_tracer_switch(struct tracer_switch_ops *ops);
+#endif /* CONFIG_CONTEXT_SWITCH_TRACER */
+
 #endif /* _LINUX_MCOUNT_TRACER_H */
Index: linux-mcount.git/lib/tracing/trace_sched_switch.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/trace_sched_switch.c	2008-01-29 12:35:40.000000000 -0500
+++ linux-mcount.git/lib/tracing/trace_sched_switch.c	2008-01-29 14:24:35.000000000 -0500
@@ -18,33 +18,21 @@ static struct tracing_trace *tracer_trac
 static int trace_enabled __read_mostly;
 static atomic_t sched_ref;
 int tracing_sched_switch_enabled __read_mostly;
+static DEFINE_SPINLOCK(sched_switch_func_lock);
 
-static notrace void sched_switch_callback(const struct marker *mdata,
-					  void *private_data,
-					  const char *format, ...)
+static void notrace sched_switch_func(void *private,
+				      struct task_struct *prev,
+				      struct task_struct *next)
 {
-	struct tracing_trace **p = mdata->private;
-	struct tracing_trace *tr = *p;
+	struct tracing_trace **ptr = private;
+	struct tracing_trace *tr = *ptr;
 	struct tracing_trace_cpu *data;
-	struct task_struct *prev;
-	struct task_struct *next;
 	unsigned long flags;
-	va_list ap;
 	int cpu;
 
-	if (!atomic_read(&sched_ref))
-		return;
-
-	tracing_record_cmdline(current);
-
 	if (!trace_enabled)
 		return;
 
-	va_start(ap, format);
-	prev = va_arg(ap, typeof(prev));
-	next = va_arg(ap, typeof(next));
-	va_end(ap);
-
 	raw_local_irq_save(flags);
 	cpu = raw_smp_processor_id();
 	data = tr->data[cpu];
@@ -57,6 +45,105 @@ static notrace void sched_switch_callbac
 	raw_local_irq_restore(flags);
 }
 
+static struct tracer_switch_ops sched_switch_ops __read_mostly =
+{
+	.func = sched_switch_func,
+	.private = &tracer_trace,
+};
+
+static tracer_switch_func_t tracer_switch_func __read_mostly =
+	sched_switch_func;
+
+static struct tracer_switch_ops *tracer_switch_func_ops __read_mostly =
+	&sched_switch_ops;
+
+static void notrace sched_switch_func_loop(void *private,
+					   struct task_struct *prev,
+					   struct task_struct *next)
+{
+	struct tracer_switch_ops *ops = tracer_switch_func_ops;
+
+	for (; ops != NULL; ops = ops->next)
+		ops->func(ops->private, prev, next);
+}
+
+notrace int register_tracer_switch(struct tracer_switch_ops *ops)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&sched_switch_func_lock, flags);
+	ops->next = tracer_switch_func_ops;
+	smp_wmb();
+	tracer_switch_func_ops = ops;
+
+	if (ops->next == &sched_switch_ops)
+		tracer_switch_func = sched_switch_func_loop;
+
+	spin_unlock_irqrestore(&sched_switch_func_lock, flags);
+
+	return 0;
+}
+
+notrace int unregister_tracer_switch(struct tracer_switch_ops *ops)
+{
+	unsigned long flags;
+	struct tracer_switch_ops **p = &tracer_switch_func_ops;
+	int ret = 0;
+
+	spin_lock_irqsave(&sched_switch_func_lock, flags);
+
+	/*
+	 * If the sched_switch is the only one left, then
+	 *  only call that function.
+	 */
+	if (*p == ops && ops->next == &sched_switch_ops) {
+		tracer_switch_func = sched_switch_func;
+		tracer_switch_func_ops = &sched_switch_ops;
+		goto out;
+	}
+
+	for (; *p != &sched_switch_ops; p = &(*p)->next)
+		if (*p == ops)
+			break;
+
+	if (*p != ops) {
+		ret = -1;
+		goto out;
+	}
+
+	*p = (*p)->next;
+
+ out:
+	spin_unlock_irqrestore(&sched_switch_func_lock, flags);
+
+	return ret;
+}
+
+static notrace void sched_switch_callback(const struct marker *mdata,
+					  void *private_data,
+					  const char *format, ...)
+{
+	struct task_struct *prev;
+	struct task_struct *next;
+	va_list ap;
+
+	if (!atomic_read(&sched_ref))
+		return;
+
+	tracing_record_cmdline(current);
+
+	va_start(ap, format);
+	prev = va_arg(ap, typeof(prev));
+	next = va_arg(ap, typeof(next));
+	va_end(ap);
+
+	/*
+	 * If tracer_switch_func only points to the local
+	 * switch func, it still needs the ptr passed to it.
+	 */
+	tracer_switch_func(mdata->private, prev, next);
+}
+
 static notrace void sched_switch_reset(struct tracing_trace *tr)
 {
 	int cpu;

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 16/22 -v7] Add marker in try_to_wake_up
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (14 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 15/22 -v7] trace generic call to schedule switch Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 17/22 -v7] mcount tracer for wakeup latency timings Steven Rostedt
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: add-markers-to-wakeup.patch --]
[-- Type: text/plain, Size: 1040 bytes --]

Add markers into the wakeup code, to allow the tracer to
record wakeup timings.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/sched.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-mcount.git/kernel/sched.c
===================================================================
--- linux-mcount.git.orig/kernel/sched.c	2008-01-25 21:47:21.000000000 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-25 21:47:30.000000000 -0500
@@ -1885,6 +1885,10 @@ static int try_to_wake_up(struct task_st
 
 out_activate:
 #endif /* CONFIG_SMP */
+	trace_mark(kernel_sched_wakeup,
+		   "p %p rq->curr %p",
+		   p, rq->curr);
+
 	schedstat_inc(p, se.nr_wakeups);
 	if (sync)
 		schedstat_inc(p, se.nr_wakeups_sync);
@@ -2026,6 +2030,10 @@ void fastcall wake_up_new_task(struct ta
 		p->sched_class->task_new(rq, p);
 		inc_nr_running(p, rq);
 	}
+	trace_mark(kernel_sched_wakeup_new,
+		   "p %p rq->curr %p",
+		   p, rq->curr);
+
 	check_preempt_curr(rq, p);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_wake_up)

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 17/22 -v7] mcount tracer for wakeup latency timings.
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (15 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 16/22 -v7] Add marker in try_to_wake_up Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  9:31   ` Peter Zijlstra
  2008-01-30  3:15 ` [PATCH 18/22 -v7] Trace irq disabled critical timings Steven Rostedt
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: mcount-trace-wakeup-latency.patch --]
[-- Type: text/plain, Size: 21164 bytes --]

This patch adds hooks to trace the wake up latency of the highest
priority waking task.

  "wakeup" is added to /debugfs/tracing/available_tracers

Also added to /debugfs/tracing

  tracing_max_latency
     holds the current max latency for the wakeup

  wakeup_thresh
     if set to anything other than zero, a trace will be recorded
     for every wakeup that takes longer than this value
     (in usecs, as with all counters); each new trace
     replaces the previous one
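The controls above might be exercised like this (a hedged usage sketch,
not from the patch itself: paths assume debugfs is mounted at /debugfs
as in the description, on a kernel built with this tracer):

```shell
cat /debugfs/tracing/available_tracers        # should include "wakeup"
echo wakeup > /debugfs/tracing/current_tracer
echo 0  > /debugfs/tracing/tracing_max_latency  # reset the recorded max
echo 50 > /debugfs/tracing/wakeup_thresh        # report wakeups > 50 usecs
cat /debugfs/tracing/latency_trace              # read the recorded trace
```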

Examples:

  (with mcount_enabled = 0)

============
preemption latency trace v1.1.5 on 2.6.24-rc8
--------------------------------------------------------------------
 latency: 26 us, #2/2, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: migration/0-3 (uid:0 nice:-5 policy:1 rt_prio:99)
    -----------------

                 _------=> CPU#            
                / _-----=> irqs-off        
               | / _----=> need-resched    
               || / _---=> hardirq/softirq 
               ||| / _--=> preempt-depth   
               |||| /                      
               |||||     delay             
   cmd     pid ||||| time  |   caller      
      \   /    |||||   \   |   /           
   quilt-8551  0d..3    0us+: wake_up_process+0x15/0x17 <ffffffff80233e80> (sched_exec+0xc9/0x100 <ffffffff80235343>)
   quilt-8551  0d..4   26us : sched_switch_callback+0x73/0x81 <ffffffff80338d2f> (schedule+0x483/0x6d5 <ffffffff8048b3ee>)


vim:ft=help
============

    
  (with mcount_enabled = 1)

============
preemption latency trace v1.1.5 on 2.6.24-rc8
--------------------------------------------------------------------
 latency: 36 us, #45/45, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: migration/1-5 (uid:0 nice:-5 policy:1 rt_prio:99)
    -----------------

                 _------=> CPU#            
                / _-----=> irqs-off        
               | / _----=> need-resched    
               || / _---=> hardirq/softirq 
               ||| / _--=> preempt-depth   
               |||| /                      
               |||||     delay             
   cmd     pid ||||| time  |   caller      
      \   /    |||||   \   |   /           
    bash-10653 1d..3    0us : wake_up_process+0x15/0x17 <ffffffff80233e80> (sched_exec+0xc9/0x100 <ffffffff80235343>)
    bash-10653 1d..3    1us : try_to_wake_up+0x271/0x2e7 <ffffffff80233dcf> (sub_preempt_count+0xc/0x7a <ffffffff8023309e>)
    bash-10653 1d..2    2us : try_to_wake_up+0x296/0x2e7 <ffffffff80233df4> (update_rq_clock+0x9/0x20 <ffffffff802303f3>)
    bash-10653 1d..2    2us : update_rq_clock+0x1e/0x20 <ffffffff80230408> (__update_rq_clock+0xc/0x90 <ffffffff80230366>)
    bash-10653 1d..2    3us : __update_rq_clock+0x1b/0x90 <ffffffff80230375> (sched_clock+0x9/0x29 <ffffffff80214529>)
    bash-10653 1d..2    4us : try_to_wake_up+0x2a6/0x2e7 <ffffffff80233e04> (activate_task+0xc/0x3f <ffffffff8022ffca>)
    bash-10653 1d..2    4us : activate_task+0x2d/0x3f <ffffffff8022ffeb> (enqueue_task+0xe/0x66 <ffffffff8022ff66>)
    bash-10653 1d..2    5us : enqueue_task+0x5b/0x66 <ffffffff8022ffb3> (enqueue_task_rt+0x9/0x3c <ffffffff80233351>)
    bash-10653 1d..2    6us : try_to_wake_up+0x2ba/0x2e7 <ffffffff80233e18> (check_preempt_wakeup+0x12/0x99 <ffffffff80234f84>)
[...]
    bash-10653 1d..5   33us : tracing_record_cmdline+0xcf/0xd4 <ffffffff80338aad> (_spin_unlock+0x9/0x33 <ffffffff8048d3ec>)
    bash-10653 1d..5   34us : _spin_unlock+0x19/0x33 <ffffffff8048d3fc> (sub_preempt_count+0xc/0x7a <ffffffff8023309e>)
    bash-10653 1d..4   35us : wakeup_sched_switch+0x65/0x2ff <ffffffff80339f66> (_spin_lock_irqsave+0xc/0xa9 <ffffffff8048d08b>)
    bash-10653 1d..4   35us : _spin_lock_irqsave+0x19/0xa9 <ffffffff8048d098> (add_preempt_count+0xe/0x77 <ffffffff8023311a>)
    bash-10653 1d..4   36us : sched_switch_callback+0x73/0x81 <ffffffff80338d2f> (schedule+0x483/0x6d5 <ffffffff8048b3ee>)


vim:ft=help
============

The [...] was added here to avoid wasting your mailbox space.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 lib/tracing/Kconfig        |   14 +
 lib/tracing/Makefile       |    1 
 lib/tracing/trace_wakeup.c |  359 +++++++++++++++++++++++++++++++++++++++++++++
 lib/tracing/tracer.c       |  131 ++++++++++++++++
 lib/tracing/tracer.h       |    5 
 5 files changed, 508 insertions(+), 2 deletions(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===================================================================
--- linux-mcount.git.orig/lib/tracing/Kconfig	2008-01-29 18:09:01.000000000 -0500
+++ linux-mcount.git/lib/tracing/Kconfig	2008-01-29 18:10:17.000000000 -0500
@@ -9,6 +9,9 @@ config MCOUNT
 	bool
 	select FRAME_POINTER
 
+config TRACER_MAX_TRACE
+	bool
+
 config TRACING
         bool
 	select DEBUG_FS
@@ -25,6 +28,17 @@ config FUNCTION_TRACER
 	  that the debugging mechanism using this facility will hook by
 	  providing a set of inline routines.
 
+config WAKEUP_TRACER
+	bool "Trace wakeup latencies"
+	depends on DEBUG_KERNEL
+	select TRACING
+	select CONTEXT_SWITCH_TRACER
+	select TRACER_MAX_TRACE
+	help
+	  This tracer adds hooks into scheduling to time the latency
+	  of the highest priority task to be scheduled in
+	  after it has woken up.
+
 config CONTEXT_SWITCH_TRACER
 	bool "Trace process context switches"
 	depends on DEBUG_KERNEL
Index: linux-mcount.git/lib/tracing/Makefile
===================================================================
--- linux-mcount.git.orig/lib/tracing/Makefile	2008-01-29 18:09:01.000000000 -0500
+++ linux-mcount.git/lib/tracing/Makefile	2008-01-29 18:10:17.000000000 -0500
@@ -3,5 +3,6 @@ obj-$(CONFIG_MCOUNT) += libmcount.o
 obj-$(CONFIG_TRACING) += tracer.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace_function.o
+obj-$(CONFIG_WAKEUP_TRACER) += trace_wakeup.o
 
 libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/trace_wakeup.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git/lib/tracing/trace_wakeup.c	2008-01-29 18:10:17.000000000 -0500
@@ -0,0 +1,359 @@
+/*
+ * trace task wakeup timings
+ *
+ * Copyright (C) 2007 Steven Rostedt <srostedt@redhat.com>
+ *
+ * Based on code from the latency_tracer, that is:
+ *
+ *  Copyright (C) 2004-2006 Ingo Molnar
+ *  Copyright (C) 2004 William Lee Irwin III
+ */
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/debugfs.h>
+#include <linux/kallsyms.h>
+#include <linux/uaccess.h>
+#include <linux/mcount.h>
+
+#include "tracer.h"
+
+static struct tracing_trace *tracer_trace __read_mostly;
+static int trace_enabled __read_mostly;
+
+static struct task_struct *wakeup_task;
+static int wakeup_cpu;
+static unsigned wakeup_prio = -1;
+
+static DEFINE_SPINLOCK(wakeup_lock);
+
+static void notrace __wakeup_reset(struct tracing_trace *tr);
+/*
+ * Should this new latency be reported/recorded?
+ */
+static int notrace report_latency(cycle_t delta)
+{
+	if (tracing_thresh) {
+		if (delta < tracing_thresh)
+			return 0;
+	} else {
+		if (delta <= tracing_max_latency)
+			return 0;
+	}
+	return 1;
+}
+
+static void notrace wakeup_sched_switch(void *private,
+					struct task_struct *prev,
+					struct task_struct *next)
+{
+	struct tracing_trace **ptr = private;
+	struct tracing_trace *tr = *ptr;
+	struct tracing_trace_cpu *data;
+	unsigned long latency = 0, t0 = 0, t1 = 0;
+	cycle_t T0, T1, delta;
+	unsigned long flags;
+	int cpu;
+
+	if (unlikely(!trace_enabled) || next != wakeup_task)
+		return;
+
+	/* The task we are waiting for is waking up */
+	data = tr->data[wakeup_cpu];
+
+	if (unlikely(!data) || unlikely(!data->trace) ||
+	    unlikely(atomic_read(&data->disabled)))
+		return;
+
+	spin_lock_irqsave(&wakeup_lock, flags);
+
+	/* disable local data, not wakeup_cpu data */
+	cpu = raw_smp_processor_id();
+	atomic_inc(&tr->data[cpu]->disabled);
+
+	/* We could race with grabbing wakeup_lock */
+	if (unlikely(!trace_enabled || next != wakeup_task))
+		goto out;
+
+	tracing_function_trace(tr, data, CALLER_ADDR1, CALLER_ADDR2, flags);
+
+	/*
+	 * usecs conversion is slow so we try to delay the conversion
+	 * as long as possible:
+	 */
+	T0 = data->preempt_timestamp;
+	T1 = now();
+	delta = T1-T0;
+
+	if (!report_latency(delta))
+		goto out;
+
+	latency = cycles_to_usecs(delta);
+
+	tracing_max_latency = delta;
+	t0 = cycles_to_usecs(T0);
+	t1 = cycles_to_usecs(T1);
+
+	update_max_tr(tr, wakeup_task, wakeup_cpu);
+
+	if (tracing_thresh)
+		printk(KERN_WARNING "(%16s-%-5d|#%d): %lu us wakeup latency "
+		       "violates %lu us threshold.\n"
+		       " => started at timestamp %lu: ",
+				wakeup_task->comm, wakeup_task->pid,
+				raw_smp_processor_id(),
+				latency, cycles_to_usecs(tracing_thresh), t0);
+	else
+		printk(KERN_WARNING "(%16s-%-5d|#%d): new %lu us maximum "
+		       "wakeup latency.\n => started at timestamp %lu: ",
+				wakeup_task->comm, wakeup_task->pid,
+				cpu, latency, t0);
+
+	printk(KERN_CONT "   ended at timestamp %lu: ", t1);
+	dump_stack();
+	t1 = cycles_to_usecs(now());
+	printk(KERN_CONT "   dump-end timestamp %lu\n\n", t1);
+
+out:
+	__wakeup_reset(tr);
+	atomic_dec(&tr->data[cpu]->disabled);
+	spin_unlock_irqrestore(&wakeup_lock, flags);
+
+}
+
+static struct tracer_switch_ops switch_ops __read_mostly = {
+	.func = wakeup_sched_switch,
+	.private = &tracer_trace,
+};
+
+static void notrace __wakeup_reset(struct tracing_trace *tr)
+{
+	struct tracing_trace_cpu *data;
+	int cpu;
+
+	assert_spin_locked(&wakeup_lock);
+
+	for_each_possible_cpu(cpu) {
+		data = tr->data[cpu];
+		tracing_reset(data);
+	}
+
+	wakeup_cpu = -1;
+	wakeup_prio = -1;
+	if (wakeup_task) {
+		put_task_struct(wakeup_task);
+		tracing_stop_function_trace();
+	}
+
+	wakeup_task = NULL;
+
+	/*
+	 * Don't let the trace_enabled = 1 show up before
+	 * the wakeup_task is reset.
+	 */
+	smp_wmb();
+}
+
+static void notrace wakeup_reset(struct tracing_trace *tr)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&wakeup_lock, flags);
+	__wakeup_reset(tr);
+	spin_unlock_irqrestore(&wakeup_lock, flags);
+}
+
+static notrace void wakeup_check_start(struct tracing_trace *tr,
+				       struct task_struct *p,
+				       struct task_struct *curr)
+{
+	unsigned long flags;
+	int cpu = smp_processor_id();
+
+	if (likely(!rt_task(p)) ||
+	    p->prio >= wakeup_prio ||
+	    p->prio >= curr->prio)
+		return;
+
+	atomic_inc(&tr->data[cpu]->disabled);
+	if (unlikely(atomic_read(&tr->data[cpu]->disabled) != 1))
+		goto out;
+
+	/* interrupts should be off from try_to_wake_up */
+	spin_lock(&wakeup_lock);
+
+	/* check for races. */
+	if (!trace_enabled || p->prio >= wakeup_prio)
+		goto out_locked;
+
+	/* reset the trace */
+	__wakeup_reset(tr);
+
+	wakeup_cpu = task_cpu(p);
+	wakeup_prio = p->prio;
+
+	wakeup_task = p;
+	get_task_struct(wakeup_task);
+
+	local_save_flags(flags);
+
+	tr->data[wakeup_cpu]->preempt_timestamp = now();
+	tracing_start_function_trace();
+	tracing_function_trace(tr, tr->data[wakeup_cpu],
+			       CALLER_ADDR1, CALLER_ADDR2, flags);
+
+
+ out_locked:
+	spin_unlock(&wakeup_lock);
+ out:
+	atomic_dec(&tr->data[cpu]->disabled);
+}
+
+static notrace void wake_up_callback(const struct marker *mdata,
+				     void *private_data,
+				     const char *format, ...)
+{
+	struct tracing_trace **ptr = mdata->private;
+	struct tracing_trace *tr = *ptr;
+	struct task_struct *curr;
+	struct task_struct *p;
+	va_list ap;
+
+	if (likely(!trace_enabled))
+		return;
+
+	va_start(ap, format);
+
+	/* now get the meat: "p %p rq->curr %p" */
+	p = va_arg(ap, typeof(p));
+	curr = va_arg(ap, typeof(curr));
+
+	va_end(ap);
+
+	wakeup_check_start(tr, p, curr);
+}
+
+static notrace void start_wakeup_trace(struct tracing_trace *tr)
+{
+	int ret;
+
+	ret = marker_arm("kernel_sched_wakeup");
+	if (ret) {
+		pr_info("wakeup trace: Couldn't arm probe"
+			" kernel_sched_wakeup\n");
+		return;
+	}
+
+	ret = marker_arm("kernel_sched_wakeup_new");
+	if (ret) {
+		pr_info("wakeup trace: Couldn't arm probe"
+			" kernel_sched_wakeup_new\n");
+		goto out;
+	}
+
+	register_tracer_switch(&switch_ops);
+	tracing_start_sched_switch();
+
+	wakeup_reset(tr);
+	trace_enabled = 1;
+	return;
+
+ out:
+	marker_disarm("kernel_sched_wakeup");
+}
+
+static notrace void stop_wakeup_trace(struct tracing_trace *tr)
+{
+	trace_enabled = 0;
+	tracing_stop_sched_switch();
+	unregister_tracer_switch(&switch_ops);
+	marker_disarm("kernel_sched_wakeup");
+	marker_disarm("kernel_sched_wakeup_new");
+}
+
+static notrace void wakeup_trace_init(struct tracing_trace *tr)
+{
+	tracer_trace = tr;
+
+	if (tr->ctrl)
+		start_wakeup_trace(tr);
+}
+
+static notrace void wakeup_trace_reset(struct tracing_trace *tr)
+{
+	if (tr->ctrl) {
+		stop_wakeup_trace(tr);
+		/* make sure we put back any tasks we are tracing */
+		wakeup_reset(tr);
+	}
+}
+
+static void wakeup_trace_ctrl_update(struct tracing_trace *tr)
+{
+	if (tr->ctrl)
+		start_wakeup_trace(tr);
+	else
+		stop_wakeup_trace(tr);
+}
+
+static void notrace wakeup_trace_open(struct tracing_iterator *iter)
+{
+	/* stop the trace while dumping */
+	if (iter->tr->ctrl)
+		stop_wakeup_trace(iter->tr);
+}
+
+static void notrace wakeup_trace_close(struct tracing_iterator *iter)
+{
+	/* forget about any processes we were recording */
+	if (iter->tr->ctrl)
+		start_wakeup_trace(iter->tr);
+}
+
+static struct trace_types_struct wakeup_trace __read_mostly =
+{
+	.name = "wakeup",
+	.init = wakeup_trace_init,
+	.reset = wakeup_trace_reset,
+	.open = wakeup_trace_open,
+	.close = wakeup_trace_close,
+	.ctrl_update = wakeup_trace_ctrl_update,
+	.print_max = 1,
+};
+
+__init static int init_wakeup_trace(void)
+{
+	int ret;
+
+	ret = register_trace(&wakeup_trace);
+	if (ret)
+		return ret;
+
+	ret = marker_probe_register("kernel_sched_wakeup",
+				    "p %p rq->curr %p",
+				    wake_up_callback,
+				    &tracer_trace);
+	if (ret) {
+		pr_info("wakeup trace: Couldn't add marker"
+			" probe to kernel_sched_wakeup\n");
+		goto fail;
+	}
+
+	ret = marker_probe_register("kernel_sched_wakeup_new",
+				    "p %p rq->curr %p",
+				    wake_up_callback,
+				    &tracer_trace);
+	if (ret) {
+		pr_info("wakeup trace: Couldn't add marker"
+			" probe to kernel_sched_wakeup_new\n");
+		goto fail_deprobe;
+	}
+
+	return 0;
+ fail_deprobe:
+	marker_probe_unregister("kernel_sched_wakeup");
+ fail:
+	unregister_trace(&wakeup_trace);
+	return ret;
+}
+
+device_initcall(init_wakeup_trace);
Index: linux-mcount.git/lib/tracing/tracer.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/tracer.c	2008-01-29 18:10:04.000000000 -0500
+++ linux-mcount.git/lib/tracing/tracer.c	2008-01-29 18:10:17.000000000 -0500
@@ -28,8 +28,13 @@
 
 #include "tracer.h"
 
+unsigned long tracing_max_latency __read_mostly = (cycle_t)ULONG_MAX;
+unsigned long tracing_thresh __read_mostly;
+
 static struct tracing_trace tracer_trace __read_mostly;
 static DEFINE_PER_CPU(struct tracing_trace_cpu, tracer_trace_cpu);
+static struct tracing_trace max_tr __read_mostly;
+static DEFINE_PER_CPU(struct tracing_trace_cpu, max_data);
 static int trace_enabled __read_mostly;
 static unsigned long trace_nr_entries = (65536UL);
 
@@ -64,6 +69,43 @@ enum trace_flag_type {
 	TRACE_FLAG_SOFTIRQ		= 0x08,
 };
 
+/*
+ * Copy the new maximum trace into the separate maximum-trace
+ * structure. (this way the maximum trace is permanently saved,
+ * for later retrieval via /debugfs/tracing/latency_trace)
+ */
+void update_max_tr(struct tracing_trace *tr, struct task_struct *tsk, int cpu)
+{
+	struct tracing_trace_cpu *data = tr->data[cpu];
+	void *save_trace;
+	int i;
+
+	max_tr.cpu = cpu;
+	max_tr.time_start = data->preempt_timestamp;
+
+
+	/* clear out all the previous traces */
+	for_each_possible_cpu(i) {
+		data = tr->data[i];
+		save_trace = max_tr.data[i]->trace;
+		memcpy(max_tr.data[i], data, sizeof(*data));
+		data->trace = save_trace;
+	}
+
+	data = max_tr.data[cpu];
+	data->saved_latency = tracing_max_latency;
+
+	memcpy(data->comm, tsk->comm, TASK_COMM_LEN);
+	data->pid = tsk->pid;
+	data->uid = tsk->uid;
+	data->nice = tsk->static_prio - 20 - MAX_RT_PRIO;
+	data->policy = tsk->policy;
+	data->rt_priority = tsk->rt_priority;
+
+	/* record this task's comm */
+	tracing_record_cmdline(current);
+}
+
 int register_trace(struct trace_types_struct *type)
 {
 	struct trace_types_struct *t;
@@ -811,9 +853,14 @@ __tracing_open(struct inode *inode, stru
 		goto out;
 	}
 
-	iter->tr = inode->i_private;
+	mutex_lock(&trace_types_lock);
+	if (current_trace && current_trace->print_max)
+		iter->tr = &max_tr;
+	else
+		iter->tr = inode->i_private;
 	iter->trace = current_trace;
 	iter->pos = -1;
+	mutex_unlock(&trace_types_lock);
 
 	/* TODO stop tracer */
 	*ret = seq_open(file, &tracer_seq_ops);
@@ -1093,7 +1140,7 @@ static ssize_t tracing_ctrl_write(struct
 		tr->ctrl = val;
 
 		mutex_lock(&trace_types_lock);
-		if (current_trace->ctrl_update)
+		if (current_trace && current_trace->ctrl_update)
 			current_trace->ctrl_update(tr);
 		mutex_unlock(&trace_types_lock);
 	}
@@ -1164,6 +1211,50 @@ static ssize_t tracing_set_trace_write(s
 	return cnt;
 }
 
+static ssize_t tracing_max_lat_read(struct file *filp, char __user *ubuf,
+				    size_t cnt, loff_t *ppos)
+{
+	unsigned long *ptr = filp->private_data;
+	char buf[64];
+	int r;
+
+	r = snprintf(buf, 64, "%ld\n",
+		     *ptr == (unsigned long)-1 ? -1 : cycles_to_usecs(*ptr));
+	if (r > 64)
+		r = 64;
+	return simple_read_from_buffer(ubuf, cnt, ppos,
+				       buf, r);
+}
+
+static ssize_t tracing_max_lat_write(struct file *filp,
+				     const char __user *ubuf,
+				     size_t cnt, loff_t *ppos)
+{
+	long *ptr = filp->private_data;
+	long val;
+	char buf[64];
+
+	if (cnt > 63)
+		cnt = 63;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+
+	buf[cnt] = 0;
+
+	val = simple_strtoul(buf, NULL, 10);
+
+	*ptr = usecs_to_cycles(val);
+
+	return cnt;
+}
+
+static struct file_operations tracing_max_lat_fops = {
+	.open = tracing_open_generic,
+	.read = tracing_max_lat_read,
+	.write = tracing_max_lat_write,
+};
+
 static struct file_operations tracing_ctrl_fops = {
 	.open = tracing_open_generic,
 	.read = tracing_ctrl_read,
@@ -1232,6 +1323,19 @@ static __init void tracer_init_debugfs(v
 				    &tracer_trace, &set_tracer_fops);
 	if (!entry)
 		pr_warning("Could not create debugfs 'trace' entry\n");
+
+	entry = debugfs_create_file("tracing_max_latency", 0644, d_tracer,
+				    &tracing_max_latency,
+				    &tracing_max_lat_fops);
+	if (!entry)
+		pr_warning("Could not create debugfs "
+			   "'tracing_max_latency' entry\n");
+
+	entry = debugfs_create_file("tracing_thresh", 0644, d_tracer,
+				    &tracing_thresh, &tracing_max_lat_fops);
+	if (!entry)
+		pr_warning("Could not create debugfs "
+			   "'tracing_thresh' entry\n");
 }
 
 /* dummy trace to disable tracing */
@@ -1255,6 +1359,8 @@ __init static int tracer_alloc_buffers(v
 
 	for_each_possible_cpu(i) {
 		tracer_trace.data[i] = &per_cpu(tracer_trace_cpu, i);
+		max_tr.data[i] = &per_cpu(max_data, i);
+
 		array = (struct tracing_entry *)
 			  __get_free_pages(GFP_KERNEL, order);
 		if (array == NULL) {
@@ -1263,6 +1369,18 @@ __init static int tracer_alloc_buffers(v
 			goto free_buffers;
 		}
 		tracer_trace.data[i]->trace = array;
+
+/* Only allocate if we are actually using the max trace */
+#ifdef CONFIG_TRACER_MAX_TRACE
+		array = (struct tracing_entry *)
+			  __get_free_pages(GFP_KERNEL, order);
+		if (array == NULL) {
+			printk(KERN_ERR "wakeup tracer: failed to allocate"
+			       " %ld bytes for trace buffer!\n", size);
+			goto free_buffers;
+		}
+		max_tr.data[i]->trace = array;
+#endif
 	}
 
 	/*
@@ -1270,6 +1388,7 @@ __init static int tracer_alloc_buffers(v
 	 * round up a bit.
 	 */
 	tracer_trace.entries = size / TRACING_ENTRY_SIZE;
+	max_tr.entries = tracer_trace.entries;
 
 	pr_info("tracer: %ld bytes allocated for %ld",
 		size, trace_nr_entries);
@@ -1292,6 +1411,14 @@ __init static int tracer_alloc_buffers(v
 			free_pages((unsigned long)data->trace, order);
 			data->trace = NULL;
 		}
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+		data = max_tr.data[i];
+		if (data && data->trace) {
+			free_pages((unsigned long)data->trace, order);
+			data->trace = NULL;
+		}
+#endif
 	}
 	return -ENOMEM;
 }
Index: linux-mcount.git/lib/tracing/tracer.h
===================================================================
--- linux-mcount.git.orig/lib/tracing/tracer.h	2008-01-29 18:10:15.000000000 -0500
+++ linux-mcount.git/lib/tracing/tracer.h	2008-01-29 18:10:17.000000000 -0500
@@ -69,6 +69,7 @@ struct trace_types_struct {
 	void (*stop)(struct tracing_iterator *iter);
 	void (*ctrl_update)(struct tracing_trace *tr);
 	struct trace_types_struct *next;
+	int print_max;
 };
 
 struct tracing_iterator {
@@ -107,6 +108,10 @@ void tracing_start_sched_switch(void);
 void tracing_stop_sched_switch(void);
 
 extern int tracing_sched_switch_enabled;
+extern unsigned long tracing_max_latency;
+extern unsigned long tracing_thresh;
+
+void update_max_tr(struct tracing_trace *tr, struct task_struct *tsk, int cpu);
 
 static inline notrace cycle_t now(void)
 {

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 18/22 -v7] Trace irq disabled critical timings
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (16 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 17/22 -v7] mcount tracer for wakeup latency timings Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 19/22 -v7] trace preempt off " Steven Rostedt
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: mcount-tracer-latency-trace-irqs-off.patch --]
[-- Type: text/plain, Size: 28808 bytes --]

This patch adds latency tracing of irqs-off critical sections
(i.e. how long interrupts are kept disabled).

 "irqsoff" is added to /debugfs/tracing/available_tracers

Note:
  tracing_max_latency
    also holds the max latency for irqsoff (in usecs).
    (defaults to a large number, so latency tracing must be
    explicitly started by writing a lower value)

  tracing_thresh
    threshold (in usecs): an irqs-off section is always
    printed out if it is detected to be longer than this.
    If tracing_thresh is non-zero, then tracing_max_latency
    is ignored.

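Putting the notes above together, a typical session might look like the
following sketch. Only the file names shown elsewhere in this series
(available_tracers, current_tracer, tracing_max_latency, latency_trace)
are used; the /debugfs mount point matches the examples below.

```shell
# Sketch of a typical irqsoff session (assumes debugfs is mounted
# at /debugfs, as in the trace examples below).

# Check that the tracer is available, then select it:
cat /debugfs/tracing/available_tracers
echo irqsoff > /debugfs/tracing/current_tracer

# Start a fresh maximum search (tracing_max_latency defaults to a
# very large value, so nothing is recorded until it is lowered):
echo 0 > /debugfs/tracing/tracing_max_latency

# ... run the workload of interest ...

# Read back the worst irqs-off section seen so far:
cat /debugfs/tracing/tracing_max_latency
cat /debugfs/tracing/latency_trace
```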
Here's an example of a trace with mcount_enabled = 0

=======
preemption latency trace v1.1.5 on 2.6.24-rc7
--------------------------------------------------------------------
 latency: 100 us, #3/3, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: _spin_lock_irqsave+0x2a/0xb7
 => ended at:   _spin_unlock_irqrestore+0x32/0x5f

                 _------=> CPU#            
                / _-----=> irqs-off        
               | / _----=> need-resched    
               || / _---=> hardirq/softirq 
               ||| / _--=> preempt-depth   
               |||| /                      
               |||||     delay             
   cmd     pid ||||| time  |   caller      
      \   /    |||||   \   |   /           
 swapper-0     1d.s3    0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000])
 swapper-0     1d.s3  100us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000])
 swapper-0     1d.s3  100us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f)


vim:ft=help
=======


And this is a trace with mcount_enabled = 1


=======
preemption latency trace v1.1.5 on 2.6.24-rc7
--------------------------------------------------------------------
 latency: 102 us, #12/12, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: _spin_lock_irqsave+0x2a/0xb7
 => ended at:   _spin_unlock_irqrestore+0x32/0x5f

                 _------=> CPU#            
                / _-----=> irqs-off        
               | / _----=> need-resched    
               || / _---=> hardirq/softirq 
               ||| / _--=> preempt-depth   
               |||| /                      
               |||||     delay             
   cmd     pid ||||| time  |   caller      
      \   /    |||||   \   |   /           
 swapper-0     1dNs3    0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000])
 swapper-0     1dNs3   46us : e1000_read_phy_reg+0x16/0x225 [e1000] (e1000_update_stats+0x5e2/0x64c [e1000])
 swapper-0     1dNs3   46us : e1000_swfw_sync_acquire+0x10/0x99 [e1000] (e1000_read_phy_reg+0x49/0x225 [e1000])
 swapper-0     1dNs3   46us : e1000_get_hw_eeprom_semaphore+0x12/0xa6 [e1000] (e1000_swfw_sync_acquire+0x36/0x99 [e1000])
 swapper-0     1dNs3   47us : __const_udelay+0x9/0x47 (e1000_read_phy_reg+0x116/0x225 [e1000])
 swapper-0     1dNs3   47us+: __delay+0x9/0x50 (__const_udelay+0x45/0x47)
 swapper-0     1dNs3   97us : preempt_schedule+0xc/0x84 (__delay+0x4e/0x50)
 swapper-0     1dNs3   98us : e1000_swfw_sync_release+0xc/0x55 [e1000] (e1000_read_phy_reg+0x211/0x225 [e1000])
 swapper-0     1dNs3   99us+: e1000_put_hw_eeprom_semaphore+0x9/0x35 [e1000] (e1000_swfw_sync_release+0x50/0x55 [e1000])
 swapper-0     1dNs3  101us : _spin_unlock_irqrestore+0xe/0x5f (e1000_update_stats+0x641/0x64c [e1000])
 swapper-0     1dNs3  102us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000])
 swapper-0     1dNs3  102us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f)


vim:ft=help
=======


Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/x86/kernel/process_64.c  |    3 
 arch/x86/lib/thunk_64.S       |   18 +
 include/asm-x86/irqflags_32.h |    4 
 include/asm-x86/irqflags_64.h |    4 
 include/linux/irqflags.h      |   37 ++-
 include/linux/mcount.h        |   31 ++-
 kernel/fork.c                 |    2 
 kernel/lockdep.c              |   16 +
 lib/tracing/Kconfig           |   18 +
 lib/tracing/Makefile          |    1 
 lib/tracing/trace_irqsoff.c   |  415 ++++++++++++++++++++++++++++++++++++++++++
 lib/tracing/tracer.c          |   59 ++++-
 lib/tracing/tracer.h          |    2 
 13 files changed, 575 insertions(+), 35 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/process_64.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/process_64.c	2008-01-29 18:06:20.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/process_64.c	2008-01-29 18:10:56.000000000 -0500
@@ -233,7 +233,10 @@ void cpu_idle (void)
 			 */
 			local_irq_disable();
 			enter_idle();
+			/* Don't trace irqs off for idle */
+			stop_critical_timings();
 			idle();
+			start_critical_timings();
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
 			   loops can be woken up without interrupt. */
Index: linux-mcount.git/arch/x86/lib/thunk_64.S
===================================================================
--- linux-mcount.git.orig/arch/x86/lib/thunk_64.S	2008-01-29 18:06:20.000000000 -0500
+++ linux-mcount.git/arch/x86/lib/thunk_64.S	2008-01-29 18:10:56.000000000 -0500
@@ -47,8 +47,22 @@
 	thunk __up_wakeup,__up
 
 #ifdef CONFIG_TRACE_IRQFLAGS
-	thunk trace_hardirqs_on_thunk,trace_hardirqs_on
-	thunk trace_hardirqs_off_thunk,trace_hardirqs_off
+	/* put return address in rdi (arg1) */
+	.macro thunk_ra name,func
+	.globl \name
+\name:
+	CFI_STARTPROC
+	SAVE_ARGS
+	/* SAVE_ARGS pushes 9 elements */
+	/* the next element would be the rip */
+	movq 9*8(%rsp), %rdi
+	call \func
+	jmp  restore
+	CFI_ENDPROC
+	.endm
+
+	thunk_ra trace_hardirqs_on_thunk,trace_hardirqs_on_caller
+	thunk_ra trace_hardirqs_off_thunk,trace_hardirqs_off_caller
 #endif
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
Index: linux-mcount.git/include/asm-x86/irqflags_32.h
===================================================================
--- linux-mcount.git.orig/include/asm-x86/irqflags_32.h	2008-01-29 18:06:20.000000000 -0500
+++ linux-mcount.git/include/asm-x86/irqflags_32.h	2008-01-29 18:10:56.000000000 -0500
@@ -139,9 +139,9 @@ static inline int raw_irqs_disabled(void
 static inline void trace_hardirqs_fixup_flags(unsigned long flags)
 {
 	if (raw_irqs_disabled_flags(flags))
-		trace_hardirqs_off();
+		__trace_hardirqs_off();
 	else
-		trace_hardirqs_on();
+		__trace_hardirqs_on();
 }
 
 static inline void trace_hardirqs_fixup(void)
Index: linux-mcount.git/include/asm-x86/irqflags_64.h
===================================================================
--- linux-mcount.git.orig/include/asm-x86/irqflags_64.h	2008-01-29 18:06:20.000000000 -0500
+++ linux-mcount.git/include/asm-x86/irqflags_64.h	2008-01-29 18:10:56.000000000 -0500
@@ -120,9 +120,9 @@ static inline int raw_irqs_disabled(void
 static inline void trace_hardirqs_fixup_flags(unsigned long flags)
 {
 	if (raw_irqs_disabled_flags(flags))
-		trace_hardirqs_off();
+		__trace_hardirqs_off();
 	else
-		trace_hardirqs_on();
+		__trace_hardirqs_on();
 }
 
 static inline void trace_hardirqs_fixup(void)
Index: linux-mcount.git/include/linux/irqflags.h
===================================================================
--- linux-mcount.git.orig/include/linux/irqflags.h	2008-01-29 18:06:20.000000000 -0500
+++ linux-mcount.git/include/linux/irqflags.h	2008-01-29 18:10:56.000000000 -0500
@@ -12,10 +12,21 @@
 #define _LINUX_TRACE_IRQFLAGS_H
 
 #ifdef CONFIG_TRACE_IRQFLAGS
-  extern void trace_hardirqs_on(void);
-  extern void trace_hardirqs_off(void);
+# include <linux/mcount.h>
+  extern void trace_hardirqs_on_caller(unsigned long ip);
+  extern void trace_hardirqs_off_caller(unsigned long ip);
   extern void trace_softirqs_on(unsigned long ip);
   extern void trace_softirqs_off(unsigned long ip);
+  extern void trace_hardirqs_on(void);
+  extern void trace_hardirqs_off(void);
+  static inline void notrace __trace_hardirqs_on(void)
+  {
+	trace_hardirqs_on_caller(CALLER_ADDR0);
+  }
+  static inline void notrace __trace_hardirqs_off(void)
+  {
+	trace_hardirqs_off_caller(CALLER_ADDR0);
+  }
 # define trace_hardirq_context(p)	((p)->hardirq_context)
 # define trace_softirq_context(p)	((p)->softirq_context)
 # define trace_hardirqs_enabled(p)	((p)->hardirqs_enabled)
@@ -28,6 +39,8 @@
 #else
 # define trace_hardirqs_on()		do { } while (0)
 # define trace_hardirqs_off()		do { } while (0)
+# define __trace_hardirqs_on()		do { } while (0)
+# define __trace_hardirqs_off()		do { } while (0)
 # define trace_softirqs_on(ip)		do { } while (0)
 # define trace_softirqs_off(ip)		do { } while (0)
 # define trace_hardirq_context(p)	0
@@ -41,24 +54,32 @@
 # define INIT_TRACE_IRQFLAGS
 #endif
 
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+ extern void stop_critical_timings(void);
+ extern void start_critical_timings(void);
+#else
+# define stop_critical_timings() do { } while (0)
+# define start_critical_timings() do { } while (0)
+#endif
+
 #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
 
 #include <asm/irqflags.h>
 
 #define local_irq_enable() \
-	do { trace_hardirqs_on(); raw_local_irq_enable(); } while (0)
+	do { __trace_hardirqs_on(); raw_local_irq_enable(); } while (0)
 #define local_irq_disable() \
-	do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
+	do { raw_local_irq_disable(); __trace_hardirqs_off(); } while (0)
 #define local_irq_save(flags) \
-	do { raw_local_irq_save(flags); trace_hardirqs_off(); } while (0)
+	do { raw_local_irq_save(flags); __trace_hardirqs_off(); } while (0)
 
 #define local_irq_restore(flags)				\
 	do {							\
 		if (raw_irqs_disabled_flags(flags)) {		\
 			raw_local_irq_restore(flags);		\
-			trace_hardirqs_off();			\
+			__trace_hardirqs_off();			\
 		} else {					\
-			trace_hardirqs_on();			\
+			__trace_hardirqs_on();			\
 			raw_local_irq_restore(flags);		\
 		}						\
 	} while (0)
@@ -76,7 +97,7 @@
 #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
 #define safe_halt()						\
 	do {							\
-		trace_hardirqs_on();				\
+		__trace_hardirqs_on();				\
 		raw_safe_halt();				\
 	} while (0)
 
Index: linux-mcount.git/include/linux/mcount.h
===================================================================
--- linux-mcount.git.orig/include/linux/mcount.h	2008-01-29 18:06:20.000000000 -0500
+++ linux-mcount.git/include/linux/mcount.h	2008-01-29 18:10:56.000000000 -0500
@@ -6,10 +6,6 @@ extern int mcount_enabled;
 
 #include <linux/linkage.h>
 
-#define CALLER_ADDR0 ((unsigned long)__builtin_return_address(0))
-#define CALLER_ADDR1 ((unsigned long)__builtin_return_address(1))
-#define CALLER_ADDR2 ((unsigned long)__builtin_return_address(2))
-
 typedef void (*mcount_func_t)(unsigned long ip, unsigned long parent_ip);
 
 struct mcount_ops {
@@ -35,4 +31,31 @@ extern void mcount(void);
 # define unregister_mcount_function(ops) do { } while (0)
 # define clear_mcount_function(ops) do { } while (0)
 #endif /* CONFIG_MCOUNT */
+
+
+#ifdef CONFIG_FRAME_POINTER
+/* TODO: need to fix this for ARM */
+# define CALLER_ADDR0 ((unsigned long)__builtin_return_address(0))
+# define CALLER_ADDR1 ((unsigned long)__builtin_return_address(1))
+# define CALLER_ADDR2 ((unsigned long)__builtin_return_address(2))
+# define CALLER_ADDR3 ((unsigned long)__builtin_return_address(3))
+# define CALLER_ADDR4 ((unsigned long)__builtin_return_address(4))
+# define CALLER_ADDR5 ((unsigned long)__builtin_return_address(5))
+#else
+# define CALLER_ADDR0 ((unsigned long)__builtin_return_address(0))
+# define CALLER_ADDR1 0UL
+# define CALLER_ADDR2 0UL
+# define CALLER_ADDR3 0UL
+# define CALLER_ADDR4 0UL
+# define CALLER_ADDR5 0UL
+#endif
+
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+  extern void notrace time_hardirqs_on(unsigned long a0, unsigned long a1);
+  extern void notrace time_hardirqs_off(unsigned long a0, unsigned long a1);
+#else
+# define time_hardirqs_on(a0, a1)		do { } while (0)
+# define time_hardirqs_off(a0, a1)		do { } while (0)
+#endif
+
 #endif /* _LINUX_MCOUNT_H */
Index: linux-mcount.git/kernel/fork.c
===================================================================
--- linux-mcount.git.orig/kernel/fork.c	2008-01-29 18:06:20.000000000 -0500
+++ linux-mcount.git/kernel/fork.c	2008-01-29 18:10:56.000000000 -0500
@@ -1036,7 +1036,7 @@ static struct task_struct *copy_process(
 
 	rt_mutex_init_task(p);
 
-#ifdef CONFIG_TRACE_IRQFLAGS
+#if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_LOCKDEP)
 	DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
 	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
 #endif
Index: linux-mcount.git/kernel/lockdep.c
===================================================================
--- linux-mcount.git.orig/kernel/lockdep.c	2008-01-29 18:06:20.000000000 -0500
+++ linux-mcount.git/kernel/lockdep.c	2008-01-29 18:10:56.000000000 -0500
@@ -39,6 +39,7 @@
 #include <linux/irqflags.h>
 #include <linux/utsname.h>
 #include <linux/hash.h>
+#include <linux/mcount.h>
 
 #include <asm/sections.h>
 
@@ -2009,7 +2010,7 @@ void early_boot_irqs_on(void)
 /*
  * Hardirqs will be enabled:
  */
-void trace_hardirqs_on(void)
+void notrace trace_hardirqs_on_caller(unsigned long a0)
 {
 	struct task_struct *curr = current;
 	unsigned long ip;
@@ -2050,14 +2051,19 @@ void trace_hardirqs_on(void)
 	curr->hardirq_enable_ip = ip;
 	curr->hardirq_enable_event = ++curr->irq_events;
 	debug_atomic_inc(&hardirqs_on_events);
+	time_hardirqs_on(CALLER_ADDR0, a0);
 }
+EXPORT_SYMBOL(trace_hardirqs_on_caller);
 
+void notrace trace_hardirqs_on(void) {
+	trace_hardirqs_on_caller(CALLER_ADDR0);
+}
 EXPORT_SYMBOL(trace_hardirqs_on);
 
 /*
  * Hardirqs were disabled:
  */
-void trace_hardirqs_off(void)
+void notrace trace_hardirqs_off_caller(unsigned long a0)
 {
 	struct task_struct *curr = current;
 
@@ -2075,9 +2081,15 @@ void trace_hardirqs_off(void)
 		curr->hardirq_disable_ip = _RET_IP_;
 		curr->hardirq_disable_event = ++curr->irq_events;
 		debug_atomic_inc(&hardirqs_off_events);
+		time_hardirqs_off(CALLER_ADDR0, a0);
 	} else
 		debug_atomic_inc(&redundant_hardirqs_off);
 }
+EXPORT_SYMBOL(trace_hardirqs_off_caller);
+
+void notrace trace_hardirqs_off(void) {
+	trace_hardirqs_off_caller(CALLER_ADDR0);
+}
 
 EXPORT_SYMBOL(trace_hardirqs_off);
 
Index: linux-mcount.git/lib/tracing/Kconfig
===================================================================
--- linux-mcount.git.orig/lib/tracing/Kconfig	2008-01-29 18:10:17.000000000 -0500
+++ linux-mcount.git/lib/tracing/Kconfig	2008-01-29 18:10:56.000000000 -0500
@@ -28,6 +28,24 @@ config FUNCTION_TRACER
 	  that the debugging mechanism using this facility will hook by
 	  providing a set of inline routines.
 
+config CRITICAL_IRQSOFF_TIMING
+	bool "Interrupts-off critical section latency timing"
+	default n
+	depends on TRACE_IRQFLAGS_SUPPORT
+	depends on GENERIC_TIME
+	select TRACE_IRQFLAGS
+	select TRACING
+	select TRACER_MAX_TRACE
+	help
+	  This option measures the time spent in irqs-off critical
+	  sections, with microsecond accuracy.
+
+	  The default measurement method is a maximum search, which is
+	  disabled by default and can be runtime (re-)started
+	  via:
+
+	      echo 0 > /debugfs/tracing/tracing_max_latency
+
 config WAKEUP_TRACER
 	bool "Trace wakeup latencies"
 	depends on DEBUG_KERNEL
Index: linux-mcount.git/lib/tracing/Makefile
===================================================================
--- linux-mcount.git.orig/lib/tracing/Makefile	2008-01-29 18:10:17.000000000 -0500
+++ linux-mcount.git/lib/tracing/Makefile	2008-01-29 18:10:56.000000000 -0500
@@ -3,6 +3,7 @@ obj-$(CONFIG_MCOUNT) += libmcount.o
 obj-$(CONFIG_TRACING) += tracer.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace_function.o
+obj-$(CONFIG_CRITICAL_IRQSOFF_TIMING) += trace_irqsoff.o
 obj-$(CONFIG_WAKEUP_TRACER) += trace_wakeup.o
 
 libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/trace_irqsoff.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git/lib/tracing/trace_irqsoff.c	2008-01-29 18:10:56.000000000 -0500
@@ -0,0 +1,415 @@
+/*
+ * trace irqs off critical timings
+ *
+ * Copyright (C) 2007 Steven Rostedt <srostedt@redhat.com>
+ *
+ * From code in the latency_tracer, that is:
+ *
+ *  Copyright (C) 2004-2006 Ingo Molnar
+ *  Copyright (C) 2004 William Lee Irwin III
+ */
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/debugfs.h>
+#include <linux/kallsyms.h>
+#include <linux/uaccess.h>
+#include <linux/mcount.h>
+
+#include "tracer.h"
+
+static struct tracing_trace *tracer_trace __read_mostly;
+static __cacheline_aligned_in_smp DEFINE_MUTEX(max_mutex);
+static int trace_enabled __read_mostly;
+
+/*
+ * Sequence count - we record it when starting a measurement and
+ * skip the latency if the sequence has changed - some other section
+ * did a maximum and could disturb our measurement with serial console
+ * printouts, etc. Truly coinciding maximum latencies should be rare
+ * and what happens together happens separately as well, so this doesnt
+ * decrease the validity of the maximum found:
+ */
+static __cacheline_aligned_in_smp unsigned long max_sequence;
+
+#ifdef CONFIG_MCOUNT
+/* irqsoff uses its own function trace to keep the overhead down */
+static void notrace irqsoff_trace_call(unsigned long ip,
+				       unsigned long parent_ip)
+{
+	struct tracing_trace *tr = tracer_trace;
+	struct tracing_trace_cpu *data;
+	unsigned long flags;
+	int cpu;
+
+	if (likely(!trace_enabled))
+		return;
+
+	local_save_flags(flags);
+
+	if (!irqs_disabled_flags(flags))
+		return;
+
+	cpu = raw_smp_processor_id();
+	data = tr->data[cpu];
+	atomic_inc(&data->disabled);
+
+	if (likely(atomic_read(&data->disabled) == 1))
+		tracing_function_trace(tr, data, ip, parent_ip, flags);
+
+	atomic_dec(&data->disabled);
+}
+
+static struct mcount_ops trace_ops __read_mostly =
+{
+	.func = irqsoff_trace_call,
+};
+#endif /* CONFIG_MCOUNT */
+
+/*
+ * Should this new latency be reported/recorded?
+ */
+static int notrace report_latency(cycle_t delta)
+{
+	if (tracing_thresh) {
+		if (delta < tracing_thresh)
+			return 0;
+	} else {
+		if (delta <= tracing_max_latency)
+			return 0;
+	}
+	return 1;
+}
+
+static void notrace
+check_critical_timing(struct tracing_trace *tr,
+		      struct tracing_trace_cpu *data,
+		      unsigned long parent_ip,
+		      int cpu)
+{
+	unsigned long latency, t0, t1;
+	cycle_t T0, T1, T2, delta;
+	unsigned long flags;
+
+	/*
+	 * usecs conversion is slow so we try to delay the conversion
+	 * as long as possible:
+	 */
+	T0 = data->preempt_timestamp;
+	T1 = now();
+	delta = T1-T0;
+
+	local_save_flags(flags);
+
+	if (!report_latency(delta))
+		goto out;
+
+	tracing_function_trace(tr, data, CALLER_ADDR0, parent_ip, flags);
+	/*
+	 * Update the timestamp, because the trace entry above
+	 * might change it (it can only get larger so the latency
+	 * is fair to be reported):
+	 */
+	T2 = now();
+
+	delta = T2-T0;
+
+	latency = cycles_to_usecs(delta);
+
+	if (data->critical_sequence != max_sequence ||
+	    !mutex_trylock(&max_mutex))
+		goto out;
+
+	tracing_max_latency = delta;
+	t0 = cycles_to_usecs(T0);
+	t1 = cycles_to_usecs(T1);
+
+	data->critical_end = parent_ip;
+
+	update_max_tr_single(tr, current, cpu);
+
+	if (tracing_thresh)
+		printk(KERN_WARNING "(%16s-%-5d|#%d): %lu us critical section "
+		       "violates %lu us threshold.\n"
+		       " => started at timestamp %lu: ",
+				current->comm, current->pid,
+				raw_smp_processor_id(),
+				latency, cycles_to_usecs(tracing_thresh), t0);
+	else
+		printk(KERN_WARNING "(%16s-%-5d|#%d):"
+		       " new %lu us maximum-latency "
+		       "critical section.\n => started at timestamp %lu: ",
+				current->comm, current->pid,
+				raw_smp_processor_id(),
+				latency, t0);
+
+	print_symbol(KERN_CONT "<%s>\n", data->critical_start);
+	printk(KERN_CONT " =>   ended at timestamp %lu: ", t1);
+	print_symbol(KERN_CONT "<%s>\n", data->critical_end);
+	dump_stack();
+	t1 = cycles_to_usecs(now());
+	printk(KERN_CONT " =>   dump-end timestamp %lu\n\n", t1);
+
+	max_sequence++;
+
+	mutex_unlock(&max_mutex);
+
+out:
+	data->critical_sequence = max_sequence;
+	data->preempt_timestamp = now();
+	tracing_reset(data);
+	tracing_function_trace(tr, data, CALLER_ADDR0, parent_ip, flags);
+}
+
+static inline void notrace
+start_critical_timing(unsigned long ip, unsigned long parent_ip)
+{
+	int cpu;
+	struct tracing_trace *tr = tracer_trace;
+	struct tracing_trace_cpu *data;
+	unsigned long flags;
+
+	if (likely(!trace_enabled))
+		return;
+
+	cpu = raw_smp_processor_id();
+	data = tr->data[cpu];
+
+	if (unlikely(!data) || unlikely(!data->trace) ||
+	    data->critical_start || atomic_read(&data->disabled))
+		return;
+
+	atomic_inc(&data->disabled);
+
+	data->critical_sequence = max_sequence;
+	data->preempt_timestamp = now();
+	data->critical_start = parent_ip;
+	tracing_reset(data);
+
+	local_save_flags(flags);
+	tracing_function_trace(tr, data, ip, parent_ip, flags);
+
+	atomic_dec(&data->disabled);
+}
+
+static inline void notrace
+stop_critical_timing(unsigned long ip, unsigned long parent_ip)
+{
+	int cpu;
+	struct tracing_trace *tr = tracer_trace;
+	struct tracing_trace_cpu *data;
+	unsigned long flags;
+
+	if (likely(!trace_enabled))
+		return;
+
+	cpu = raw_smp_processor_id();
+	data = tr->data[cpu];
+
+	if (unlikely(!data) || unlikely(!data->trace) ||
+	    !data->critical_start || atomic_read(&data->disabled))
+		return;
+
+	atomic_inc(&data->disabled);
+	local_save_flags(flags);
+	tracing_function_trace(tr, data, ip, parent_ip, flags);
+	check_critical_timing(tr, data, parent_ip, cpu);
+	data->critical_start = 0;
+	atomic_dec(&data->disabled);
+}
+
+void notrace start_critical_timings(void)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+
+	if (irqs_disabled_flags(flags))
+		start_critical_timing(CALLER_ADDR0, 0);
+}
+
+void notrace stop_critical_timings(void)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+
+	if (irqs_disabled_flags(flags))
+		stop_critical_timing(CALLER_ADDR0, 0);
+}
+
+#ifdef CONFIG_LOCKDEP
+void notrace time_hardirqs_on(unsigned long a0, unsigned long a1)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+
+	if (irqs_disabled_flags(flags))
+		stop_critical_timing(a0, a1);
+}
+
+void notrace time_hardirqs_off(unsigned long a0, unsigned long a1)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+
+	if (irqs_disabled_flags(flags))
+		start_critical_timing(a0, a1);
+}
+
+#else /* !CONFIG_LOCKDEP */
+
+/*
+ * Stubs:
+ */
+
+void early_boot_irqs_off(void)
+{
+}
+
+void early_boot_irqs_on(void)
+{
+}
+
+void trace_softirqs_on(unsigned long ip)
+{
+}
+
+void trace_softirqs_off(unsigned long ip)
+{
+}
+
+inline void print_irqtrace_events(struct task_struct *curr)
+{
+}
+
+/*
+ * We are only interested in hardirq on/off events:
+ */
+void notrace trace_hardirqs_on(void)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+
+	if (irqs_disabled_flags(flags))
+		stop_critical_timing(CALLER_ADDR0, 0);
+}
+EXPORT_SYMBOL(trace_hardirqs_on);
+
+void notrace trace_hardirqs_off(void)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+
+	if (irqs_disabled_flags(flags))
+		start_critical_timing(CALLER_ADDR0, 0);
+}
+EXPORT_SYMBOL(trace_hardirqs_off);
+
+void notrace trace_hardirqs_on_caller(unsigned long caller_addr)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+
+	if (irqs_disabled_flags(flags))
+		stop_critical_timing(CALLER_ADDR0, caller_addr);
+}
+EXPORT_SYMBOL(trace_hardirqs_on_caller);
+
+void notrace trace_hardirqs_off_caller(unsigned long caller_addr)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+
+	if (irqs_disabled_flags(flags))
+		start_critical_timing(CALLER_ADDR0, caller_addr);
+}
+EXPORT_SYMBOL(trace_hardirqs_off_caller);
+
+#endif /* CONFIG_LOCKDEP */
+
+static void start_irqsoff_trace(struct tracing_trace *tr)
+{
+	trace_enabled = 1;
+	register_mcount_function(&trace_ops);
+}
+
+static void stop_irqsoff_trace(struct tracing_trace *tr)
+{
+	unregister_mcount_function(&trace_ops);
+	trace_enabled = 0;
+}
+
+static void irqsoff_trace_init(struct tracing_trace *tr)
+{
+	tracer_trace = tr;
+	/* make sure that the tracer is visible */
+	smp_wmb();
+
+	if (tr->ctrl)
+		start_irqsoff_trace(tr);
+}
+
+static void irqsoff_trace_reset(struct tracing_trace *tr)
+{
+	if (tr->ctrl)
+		stop_irqsoff_trace(tr);
+}
+
+static void irqsoff_trace_ctrl_update(struct tracing_trace *tr)
+{
+	if (tr->ctrl)
+		start_irqsoff_trace(tr);
+	else
+		stop_irqsoff_trace(tr);
+}
+
+static void notrace irqsoff_trace_open(struct tracing_iterator *iter)
+{
+	/* stop the trace while dumping */
+	if (iter->tr->ctrl)
+		stop_irqsoff_trace(iter->tr);
+}
+
+static void notrace irqsoff_trace_close(struct tracing_iterator *iter)
+{
+	if (iter->tr->ctrl)
+		start_irqsoff_trace(iter->tr);
+}
+
+static void irqsoff_trace_start(struct tracing_iterator *iter)
+{
+	mutex_lock(&max_mutex);
+}
+
+static void irqsoff_trace_stop(struct tracing_iterator *iter)
+{
+	mutex_unlock(&max_mutex);
+}
+
+static struct trace_types_struct irqsoff_trace __read_mostly =
+{
+	.name = "irqsoff",
+	.init = irqsoff_trace_init,
+	.reset = irqsoff_trace_reset,
+	.open = irqsoff_trace_open,
+	.close = irqsoff_trace_close,
+	.start = irqsoff_trace_start,
+	.stop = irqsoff_trace_stop,
+	.ctrl_update = irqsoff_trace_ctrl_update,
+	.print_max = 1,
+};
+
+__init static int init_irqsoff_trace(void)
+{
+	register_trace(&irqsoff_trace);
+
+	return 0;
+}
+
+device_initcall(init_irqsoff_trace);
Index: linux-mcount.git/lib/tracing/tracer.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/tracer.c	2008-01-29 18:10:17.000000000 -0500
+++ linux-mcount.git/lib/tracing/tracer.c	2008-01-29 18:11:18.000000000 -0500
@@ -74,24 +74,14 @@ enum trace_flag_type {
  * structure. (this way the maximum trace is permanently saved,
  * for later retrieval via /debugfs/tracing/latency_trace)
  */
-void update_max_tr(struct tracing_trace *tr, struct task_struct *tsk, int cpu)
+static void notrace __update_max_tr(struct tracing_trace *tr,
+				    struct task_struct *tsk, int cpu)
 {
 	struct tracing_trace_cpu *data = tr->data[cpu];
-	void *save_trace;
-	int i;
 
 	max_tr.cpu = cpu;
 	max_tr.time_start = data->preempt_timestamp;
 
-
-	/* clear out all the previous traces */
-	for_each_possible_cpu(i) {
-		data = tr->data[i];
-		save_trace = max_tr.data[i]->trace;
-		memcpy(max_tr.data[i], data, sizeof(*data));
-		data->trace = save_trace;
-	}
-
 	data = max_tr.data[cpu];
 	data->saved_latency = tracing_max_latency;
 
@@ -106,6 +96,47 @@ void update_max_tr(struct tracing_trace 
 	tracing_record_cmdline(current);
 }
 
+notrace void update_max_tr(struct tracing_trace *tr,
+			   struct task_struct *tsk, int cpu)
+{
+	struct tracing_trace_cpu *data;
+	void *save_trace;
+	int i;
+
+	/* clear out all the previous traces */
+	for_each_possible_cpu(i) {
+		data = tr->data[i];
+		save_trace = max_tr.data[i]->trace;
+		memcpy(max_tr.data[i], data, sizeof(*data));
+		data->trace = save_trace;
+	}
+
+	__update_max_tr(tr, tsk, cpu);
+}
+
+/**
+ * update_max_tr_single - only copy one trace over, and reset the rest
+ * @tr - tracer
+ * @tsk - task with the latency
+ * @cpu - the cpu of the buffer to copy.
+ */
+notrace void update_max_tr_single(struct tracing_trace *tr,
+				  struct task_struct *tsk, int cpu)
+{
+	struct tracing_trace_cpu *data = tr->data[cpu];
+	void *save_trace;
+	int i;
+
+	for_each_possible_cpu(i)
+		tracing_reset(max_tr.data[i]);
+
+	save_trace = max_tr.data[cpu]->trace;
+	memcpy(max_tr.data[cpu], data, sizeof(*data));
+	data->trace = save_trace;
+
+	__update_max_tr(tr, tsk, cpu);
+}
+
 int register_trace(struct trace_types_struct *type)
 {
 	struct trace_types_struct *t;
@@ -203,12 +234,12 @@ static struct mcount_ops trace_ops __rea
 };
 #endif
 
-void tracing_start_function_trace(void)
+notrace void tracing_start_function_trace(void)
 {
 	register_mcount_function(&trace_ops);
 }
 
-void tracing_stop_function_trace(void)
+notrace void tracing_stop_function_trace(void)
 {
 	unregister_mcount_function(&trace_ops);
 }
Index: linux-mcount.git/lib/tracing/tracer.h
===================================================================
--- linux-mcount.git.orig/lib/tracing/tracer.h	2008-01-29 18:10:17.000000000 -0500
+++ linux-mcount.git/lib/tracing/tracer.h	2008-01-29 18:10:56.000000000 -0500
@@ -112,6 +112,8 @@ extern unsigned long tracing_max_latency
 extern unsigned long tracing_thresh;
 
 void update_max_tr(struct tracing_trace *tr, struct task_struct *tsk, int cpu);
+void update_max_tr_single(struct tracing_trace *tr,
+			  struct task_struct *tsk, int cpu);
 
 static inline notrace cycle_t now(void)
 {

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 19/22 -v7] trace preempt off critical timings
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (17 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 18/22 -v7] Trace irq disabled critical timings Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  9:40   ` Peter Zijlstra
  2008-01-30  3:15 ` [PATCH 20/22 -v7] Add markers to various events Steven Rostedt
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: mcount-trace-latency-trace-preempt-off.patch --]
[-- Type: text/plain, Size: 15657 bytes --]

Add preempt-off timings. Much of the core kernel code is taken from the
latency tracer in the RT patch, written by Ingo Molnar.

This adds "preemptoff" and "preemptirqsoff" to /debugfs/tracing/available_tracers

Now, instead of tracing only irqs off, preemption off can be selected
for recording.

When selected, it shares the same output files as the irqs-off timings.
One can trace preemption off, irqs off, or both.

Echoing "preemptoff" into /debugfs/tracing/current_tracer records the
preempt-off times only. "irqsoff" records only the time irqs are
disabled, while "preemptirqsoff" records the total time either irqs or
preemption are disabled. Switching between these options at runtime is
supported by simply echoing the appropriate tracer name into
/debugfs/tracing/current_tracer.
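
The runtime selection described above can be sketched as a shell
session. This assumes a kernel built with both timing options and
debugfs mounted at /debugfs, as in the paths used throughout this
series:

```shell
# Pick which critical sections are timed (valid names are listed
# in available_tracers):
cat /debugfs/tracing/available_tracers
echo preemptoff     > /debugfs/tracing/current_tracer  # preempt-off only
echo irqsoff        > /debugfs/tracing/current_tracer  # irqs-off only
echo preemptirqsoff > /debugfs/tracing/current_tracer  # either one off

# Reset the maximum-latency search and read the worst trace so far:
echo 0 > /debugfs/tracing/tracing_max_latency
cat /debugfs/tracing/latency_trace
```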

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/x86/kernel/process_32.c |    3 
 include/linux/irqflags.h     |    3 
 include/linux/mcount.h       |    8 +
 include/linux/preempt.h      |    2 
 kernel/sched.c               |   24 +++++
 lib/tracing/Kconfig          |   25 +++++
 lib/tracing/Makefile         |    1 
 lib/tracing/trace_irqsoff.c  |  183 +++++++++++++++++++++++++++++++------------
 8 files changed, 196 insertions(+), 53 deletions(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===================================================================
--- linux-mcount.git.orig/lib/tracing/Kconfig	2008-01-29 15:05:34.000000000 -0500
+++ linux-mcount.git/lib/tracing/Kconfig	2008-01-29 15:22:31.000000000 -0500
@@ -46,6 +46,31 @@ config CRITICAL_IRQSOFF_TIMING
 
 	      echo 0 > /debugfs/tracing/tracing_max_latency
 
+	  (Note that kernel size and overhead increases with this option
+	  enabled. This option and the preempt-off timing option can be
+	  used together or separately.)
+
+config CRITICAL_PREEMPT_TIMING
+	bool "Preemption-off critical section latency timing"
+	default n
+	depends on GENERIC_TIME
+	depends on PREEMPT
+	select TRACING
+	select TRACER_MAX_TRACE
+	help
+	  This option measures the time spent in preemption off critical
+	  sections, with microsecond accuracy.
+
+	  The default measurement method is a maximum search, which is
+	  disabled by default and can be runtime (re-)started
+	  via:
+
+	      echo 0 > /debugfs/tracing/tracing_max_latency
+
+	  (Note that kernel size and overhead increases with this option
+	  enabled. This option and the irqs-off timing option can be
+	  used together or separately.)
+
 config WAKEUP_TRACER
 	bool "Trace wakeup latencies"
 	depends on DEBUG_KERNEL
Index: linux-mcount.git/lib/tracing/Makefile
===================================================================
--- linux-mcount.git.orig/lib/tracing/Makefile	2008-01-29 15:05:34.000000000 -0500
+++ linux-mcount.git/lib/tracing/Makefile	2008-01-29 15:22:31.000000000 -0500
@@ -4,6 +4,7 @@ obj-$(CONFIG_TRACING) += tracer.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace_function.o
 obj-$(CONFIG_CRITICAL_IRQSOFF_TIMING) += trace_irqsoff.o
+obj-$(CONFIG_CRITICAL_PREEMPT_TIMING) += trace_irqsoff.o
 obj-$(CONFIG_WAKEUP_TRACER) += trace_wakeup.o
 
 libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/trace_irqsoff.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/trace_irqsoff.c	2008-01-29 15:05:34.000000000 -0500
+++ linux-mcount.git/lib/tracing/trace_irqsoff.c	2008-01-29 15:25:28.000000000 -0500
@@ -21,6 +21,34 @@ static struct tracing_trace *tracer_trac
 static __cacheline_aligned_in_smp DEFINE_MUTEX(max_mutex);
 static int trace_enabled __read_mostly;
 
+static DEFINE_PER_CPU(int, tracing_cpu);
+
+enum {
+	TRACER_IRQS_OFF		= (1 << 1),
+	TRACER_PREEMPT_OFF	= (1 << 2),
+};
+
+static int trace_type __read_mostly;
+
+#ifdef CONFIG_CRITICAL_PREEMPT_TIMING
+# define preempt_trace() \
+	((trace_type & TRACER_PREEMPT_OFF) && preempt_count())
+#else
+# define preempt_trace() (0)
+#endif
+
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+# define irq_trace()				\
+	((trace_type & TRACER_IRQS_OFF) &&	\
+	 ({					\
+		 unsigned long __flags;		\
+		 local_save_flags(__flags);	\
+		 irqs_disabled_flags(__flags);	\
+	 }))
+#else
+# define irq_trace() (0)
+#endif
+
 /*
  * Sequence count - we record it when starting a measurement and
  * skip the latency if the sequence has changed - some other section
@@ -41,14 +69,11 @@ static void notrace irqsoff_trace_call(u
 	unsigned long flags;
 	int cpu;
 
-	if (likely(!trace_enabled))
+	if (likely(!__get_cpu_var(tracing_cpu)))
 		return;
 
 	local_save_flags(flags);
 
-	if (!irqs_disabled_flags(flags))
-		return;
-
 	cpu = raw_smp_processor_id();
 	data = tr->data[cpu];
 	atomic_inc(&data->disabled);
@@ -171,23 +196,29 @@ start_critical_timing(unsigned long ip, 
 	if (likely(!trace_enabled))
 		return;
 
+	if (__get_cpu_var(tracing_cpu))
+		return;
+
 	cpu = raw_smp_processor_id();
 	data = tr->data[cpu];
 
 	if (unlikely(!data) || unlikely(!data->trace) ||
-	    data->critical_start || atomic_read(&data->disabled))
+	    atomic_read(&data->disabled))
 		return;
 
 	atomic_inc(&data->disabled);
 
 	data->critical_sequence = max_sequence;
 	data->preempt_timestamp = now();
-	data->critical_start = parent_ip;
+	data->critical_start = parent_ip ? : ip;
 	tracing_reset(data);
 
 	local_save_flags(flags);
+
 	tracing_function_trace(tr, data, ip, parent_ip, flags);
 
+	__get_cpu_var(tracing_cpu) = 1;
+
 	atomic_dec(&data->disabled);
 }
 
@@ -199,7 +230,13 @@ stop_critical_timing(unsigned long ip, u
 	struct tracing_trace_cpu *data;
 	unsigned long flags;
 
-	if (likely(!trace_enabled))
+	/* Always clear the tracing cpu on stopping the trace */
+	if (unlikely(__get_cpu_var(tracing_cpu)))
+		__get_cpu_var(tracing_cpu) = 0;
+	else
+		return;
+
+	if (!trace_enabled)
 		return;
 
 	cpu = raw_smp_processor_id();
@@ -212,49 +249,35 @@ stop_critical_timing(unsigned long ip, u
 	atomic_inc(&data->disabled);
 	local_save_flags(flags);
 	tracing_function_trace(tr, data, ip, parent_ip, flags);
-	check_critical_timing(tr, data, parent_ip, cpu);
+	check_critical_timing(tr, data, parent_ip ? : ip, cpu);
 	data->critical_start = 0;
 	atomic_dec(&data->disabled);
 }
 
+/* start and stop critical timings, used to stop tracing (in idle) */
 void notrace start_critical_timings(void)
 {
-	unsigned long flags;
-
-	local_save_flags(flags);
-
-	if (irqs_disabled_flags(flags))
+	if (preempt_trace() || irq_trace())
 		start_critical_timing(CALLER_ADDR0, 0);
 }
 
 void notrace stop_critical_timings(void)
 {
-	unsigned long flags;
-
-	local_save_flags(flags);
-
-	if (irqs_disabled_flags(flags))
+	if (preempt_trace() || irq_trace())
 		stop_critical_timing(CALLER_ADDR0, 0);
 }
 
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
 #ifdef CONFIG_LOCKDEP
 void notrace time_hardirqs_on(unsigned long a0, unsigned long a1)
 {
-	unsigned long flags;
-
-	local_save_flags(flags);
-
-	if (irqs_disabled_flags(flags))
+	if (!preempt_trace() && irq_trace())
 		stop_critical_timing(a0, a1);
 }
 
 void notrace time_hardirqs_off(unsigned long a0, unsigned long a1)
 {
-	unsigned long flags;
-
-	local_save_flags(flags);
-
-	if (irqs_disabled_flags(flags))
+	if (!preempt_trace() && irq_trace())
 		start_critical_timing(a0, a1);
 }
 
@@ -289,49 +312,46 @@ inline void print_irqtrace_events(struct
  */
 void notrace trace_hardirqs_on(void)
 {
-	unsigned long flags;
-
-	local_save_flags(flags);
-
-	if (irqs_disabled_flags(flags))
+	if (!preempt_trace() && irq_trace())
 		stop_critical_timing(CALLER_ADDR0, 0);
 }
 EXPORT_SYMBOL(trace_hardirqs_on);
 
 void notrace trace_hardirqs_off(void)
 {
-	unsigned long flags;
-
-	local_save_flags(flags);
-
-	if (irqs_disabled_flags(flags))
+	if (!preempt_trace() && irq_trace())
 		start_critical_timing(CALLER_ADDR0, 0);
 }
 EXPORT_SYMBOL(trace_hardirqs_off);
 
 void notrace trace_hardirqs_on_caller(unsigned long caller_addr)
 {
-	unsigned long flags;
-
-	local_save_flags(flags);
-
-	if (irqs_disabled_flags(flags))
+	if (!preempt_trace() && irq_trace())
 		stop_critical_timing(CALLER_ADDR0, caller_addr);
 }
 EXPORT_SYMBOL(trace_hardirqs_on_caller);
 
 void notrace trace_hardirqs_off_caller(unsigned long caller_addr)
 {
-	unsigned long flags;
-
-	local_save_flags(flags);
-
-	if (irqs_disabled_flags(flags))
+	if (!preempt_trace() && irq_trace())
 		start_critical_timing(CALLER_ADDR0, caller_addr);
 }
 EXPORT_SYMBOL(trace_hardirqs_off_caller);
 
 #endif /* CONFIG_LOCKDEP */
+#endif /*  CONFIG_CRITICAL_IRQSOFF_TIMING */
+
+#ifdef CONFIG_CRITICAL_PREEMPT_TIMING
+void notrace trace_preempt_on(unsigned long a0, unsigned long a1)
+{
+	stop_critical_timing(a0, a1);
+}
+
+void notrace trace_preempt_off(unsigned long a0, unsigned long a1)
+{
+	start_critical_timing(a0, a1);
+}
+#endif /* CONFIG_CRITICAL_PREEMPT_TIMING */
 
 static void start_irqsoff_trace(struct tracing_trace *tr)
 {
@@ -345,7 +365,7 @@ static void stop_irqsoff_trace(struct tr
 	trace_enabled = 0;
 }
 
-static void irqsoff_trace_init(struct tracing_trace *tr)
+static void __irqsoff_trace_init(struct tracing_trace *tr)
 {
 	tracer_trace = tr;
 	/* make sure that the tracer is visible */
@@ -392,6 +412,13 @@ static void irqsoff_trace_stop(struct tr
 	mutex_unlock(&max_mutex);
 }
 
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+static void irqsoff_trace_init(struct tracing_trace *tr)
+{
+	trace_type = TRACER_IRQS_OFF;
+
+	__irqsoff_trace_init(tr);
+}
 static struct trace_types_struct irqsoff_trace __read_mostly =
 {
 	.name = "irqsoff",
@@ -404,10 +431,66 @@ static struct trace_types_struct irqsoff
 	.ctrl_update = irqsoff_trace_ctrl_update,
 	.print_max = 1,
 };
+# define register_irqsoff(trace) register_trace(&trace)
+#else
+# define register_irqsoff(trace) do { } while (0)
+#endif
+
+#ifdef CONFIG_CRITICAL_PREEMPT_TIMING
+static void preemptoff_trace_init(struct tracing_trace *tr)
+{
+	trace_type = TRACER_PREEMPT_OFF;
+
+	__irqsoff_trace_init(tr);
+}
+static struct trace_types_struct preemptoff_trace __read_mostly =
+{
+	.name = "preemptoff",
+	.init = preemptoff_trace_init,
+	.reset = irqsoff_trace_reset,
+	.open = irqsoff_trace_open,
+	.close = irqsoff_trace_close,
+	.start = irqsoff_trace_start,
+	.stop = irqsoff_trace_stop,
+	.ctrl_update = irqsoff_trace_ctrl_update,
+	.print_max = 1,
+};
+# define register_preemptoff(trace) register_trace(&trace)
+#else
+# define register_preemptoff(trace) do { } while (0)
+#endif
+
+
+#if defined(CONFIG_CRITICAL_IRQSOFF_TIMING) && \
+	defined(CONFIG_CRITICAL_PREEMPT_TIMING)
+static void preemptirqsoff_trace_init(struct tracing_trace *tr)
+{
+	trace_type = TRACER_IRQS_OFF | TRACER_PREEMPT_OFF;
+
+	__irqsoff_trace_init(tr);
+}
+static struct trace_types_struct preemptirqsoff_trace __read_mostly =
+{
+	.name = "preemptirqsoff",
+	.init = preemptirqsoff_trace_init,
+	.reset = irqsoff_trace_reset,
+	.open = irqsoff_trace_open,
+	.close = irqsoff_trace_close,
+	.start = irqsoff_trace_start,
+	.stop = irqsoff_trace_stop,
+	.ctrl_update = irqsoff_trace_ctrl_update,
+	.print_max = 1,
+};
+# define register_preemptirqsoff(trace) register_trace(&trace)
+#else
+# define register_preemptirqsoff(trace) do { } while (0)
+#endif
 
 __init static int init_irqsoff_trace(void)
 {
-	register_trace(&irqsoff_trace);
+	register_irqsoff(irqsoff_trace);
+	register_preemptoff(preemptoff_trace);
+	register_preemptirqsoff(preemptirqsoff_trace);
 
 	return 0;
 }
Index: linux-mcount.git/include/linux/preempt.h
===================================================================
--- linux-mcount.git.orig/include/linux/preempt.h	2008-01-28 13:29:31.000000000 -0500
+++ linux-mcount.git/include/linux/preempt.h	2008-01-29 15:22:31.000000000 -0500
@@ -10,7 +10,7 @@
 #include <linux/linkage.h>
 #include <linux/list.h>
 
-#ifdef CONFIG_DEBUG_PREEMPT
+#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_CRITICAL_PREEMPT_TIMING)
   extern void fastcall add_preempt_count(int val);
   extern void fastcall sub_preempt_count(int val);
 #else
Index: linux-mcount.git/kernel/sched.c
===================================================================
--- linux-mcount.git.orig/kernel/sched.c	2008-01-29 14:33:10.000000000 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-29 15:22:31.000000000 -0500
@@ -66,6 +66,7 @@
 #include <linux/unistd.h>
 #include <linux/pagemap.h>
 #include <linux/hrtimer.h>
+#include <linux/mcount.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -3780,26 +3781,44 @@ void scheduler_tick(void)
 #endif
 }
 
-#if defined(CONFIG_PREEMPT) && defined(CONFIG_DEBUG_PREEMPT)
+#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
+				defined(CONFIG_CRITICAL_PREEMPT_TIMING))
+
+static inline unsigned long get_parent_ip(unsigned long addr)
+{
+	if (in_lock_functions(addr)) {
+		addr = CALLER_ADDR2;
+		if (in_lock_functions(addr))
+			addr = CALLER_ADDR3;
+	}
+	return addr;
+}
 
 void fastcall add_preempt_count(int val)
 {
+#ifdef CONFIG_DEBUG_PREEMPT
 	/*
 	 * Underflow?
 	 */
 	if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
 		return;
+#endif
 	preempt_count() += val;
+#ifdef CONFIG_DEBUG_PREEMPT
 	/*
 	 * Spinlock count overflowing soon?
 	 */
 	DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
 				PREEMPT_MASK - 10);
+#endif
+	if (preempt_count() == val)
+		trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
 }
 EXPORT_SYMBOL(add_preempt_count);
 
 void fastcall sub_preempt_count(int val)
 {
+#ifdef CONFIG_DEBUG_PREEMPT
 	/*
 	 * Underflow?
 	 */
@@ -3811,7 +3830,10 @@ void fastcall sub_preempt_count(int val)
 	if (DEBUG_LOCKS_WARN_ON((val < PREEMPT_MASK) &&
 			!(preempt_count() & PREEMPT_MASK)))
 		return;
+#endif
 
+	if (preempt_count() == val)
+		trace_preempt_on(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
 	preempt_count() -= val;
 }
 EXPORT_SYMBOL(sub_preempt_count);
Index: linux-mcount.git/arch/x86/kernel/process_32.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/process_32.c	2008-01-28 13:29:31.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/process_32.c	2008-01-29 15:22:31.000000000 -0500
@@ -195,7 +195,10 @@ void cpu_idle(void)
 				play_dead();
 
 			__get_cpu_var(irq_stat).idle_timestamp = jiffies;
+			/* Don't trace irqs off for idle */
+			stop_critical_timings();
 			idle();
+			start_critical_timings();
 		}
 		tick_nohz_restart_sched_tick();
 		preempt_enable_no_resched();
Index: linux-mcount.git/include/linux/mcount.h
===================================================================
--- linux-mcount.git.orig/include/linux/mcount.h	2008-01-29 15:05:34.000000000 -0500
+++ linux-mcount.git/include/linux/mcount.h	2008-01-29 15:22:31.000000000 -0500
@@ -58,4 +58,12 @@ extern void mcount(void);
 # define time_hardirqs_off(a0, a1)		do { } while (0)
 #endif
 
+#ifdef CONFIG_CRITICAL_PREEMPT_TIMING
+  extern void notrace trace_preempt_on(unsigned long a0, unsigned long a1);
+  extern void notrace trace_preempt_off(unsigned long a0, unsigned long a1);
+#else
+# define trace_preempt_on(a0, a1)		do { } while (0)
+# define trace_preempt_off(a0, a1)		do { } while (0)
+#endif
+
 #endif /* _LINUX_MCOUNT_H */
Index: linux-mcount.git/include/linux/irqflags.h
===================================================================
--- linux-mcount.git.orig/include/linux/irqflags.h	2008-01-29 15:05:34.000000000 -0500
+++ linux-mcount.git/include/linux/irqflags.h	2008-01-29 15:22:31.000000000 -0500
@@ -54,7 +54,8 @@
 # define INIT_TRACE_IRQFLAGS
 #endif
 
-#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+#if defined(CONFIG_CRITICAL_IRQSOFF_TIMING) || \
+	defined(CONFIG_CRITICAL_PREEMPT_TIMING)
  extern void stop_critical_timings(void);
  extern void start_critical_timings(void);
 #else

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 20/22 -v7] Add markers to various events
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (18 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 19/22 -v7] trace preempt off " Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 21/22 -v7] Add event tracer Steven Rostedt
  2008-01-30  3:15 ` [PATCH 22/22 -v7] Critical latency timings histogram Steven Rostedt
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: tracer-add-event-markers.patch --]
[-- Type: text/plain, Size: 6569 bytes --]

This patch adds markers to various events in the kernel
(interrupts, task activation, and hrtimers).

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/x86/kernel/apic_32.c  |    2 ++
 arch/x86/kernel/irq_32.c   |    1 +
 arch/x86/kernel/irq_64.c   |    2 ++
 arch/x86/kernel/traps_32.c |    2 ++
 arch/x86/kernel/traps_64.c |    2 ++
 arch/x86/mm/fault_32.c     |    3 +++
 arch/x86/mm/fault_64.c     |    3 +++
 kernel/hrtimer.c           |    7 +++++++
 kernel/sched.c             |   11 +++++++++++
 9 files changed, 33 insertions(+)

Index: linux-mcount.git/arch/x86/kernel/apic_32.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/apic_32.c	2008-01-28 08:37:49.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/apic_32.c	2008-01-28 09:54:49.000000000 -0500
@@ -581,6 +581,8 @@ notrace fastcall void smp_apic_timer_int
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
+	trace_mark(arch_apic_timer, "ip %lx", regs->eip);
+
 	/*
 	 * NOTE! We'd better ACK the irq immediately,
 	 * because timer handling can be slow.
Index: linux-mcount.git/arch/x86/kernel/irq_32.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/irq_32.c	2008-01-28 08:37:14.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/irq_32.c	2008-01-28 09:54:49.000000000 -0500
@@ -85,6 +85,7 @@ fastcall unsigned int do_IRQ(struct pt_r
 
 	old_regs = set_irq_regs(regs);
 	irq_enter();
+	trace_mark(arch_do_irq, "ip %lx irq %d", regs->eip, irq);
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
 	/* Debugging check for stack overflow: is there less than 1KB free? */
 	{
Index: linux-mcount.git/arch/x86/kernel/irq_64.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/irq_64.c	2008-01-28 08:37:14.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/irq_64.c	2008-01-28 09:54:49.000000000 -0500
@@ -149,6 +149,8 @@ asmlinkage unsigned int do_IRQ(struct pt
 	irq_enter();
 	irq = __get_cpu_var(vector_irq)[vector];
 
+	trace_mark(arch_do_irq, "ip %lx irq %d", regs->rip, irq);
+
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
 	stack_overflow_check(regs);
 #endif
Index: linux-mcount.git/arch/x86/kernel/traps_32.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/traps_32.c	2008-01-28 08:37:17.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/traps_32.c	2008-01-28 09:54:49.000000000 -0500
@@ -769,6 +769,8 @@ fastcall __kprobes void do_nmi(struct pt
 
 	nmi_enter();
 
+	trace_mark(arch_do_nmi, "ip %lx flags %lx", regs->eip, regs->eflags);
+
 	cpu = smp_processor_id();
 
 	++nmi_count(cpu);
Index: linux-mcount.git/arch/x86/kernel/traps_64.c
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/traps_64.c	2008-01-28 08:37:14.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/traps_64.c	2008-01-28 09:54:49.000000000 -0500
@@ -782,6 +782,8 @@ asmlinkage __kprobes void default_do_nmi
 
 	cpu = smp_processor_id();
 
+	trace_mark(arch_do_nmi, "ip %lx flags %lx", regs->rip, regs->eflags);
+
 	/* Only the BSP gets external NMIs from the system.  */
 	if (!cpu)
 		reason = get_nmi_reason();
Index: linux-mcount.git/arch/x86/mm/fault_32.c
===================================================================
--- linux-mcount.git.orig/arch/x86/mm/fault_32.c	2008-01-28 08:37:14.000000000 -0500
+++ linux-mcount.git/arch/x86/mm/fault_32.c	2008-01-28 09:54:49.000000000 -0500
@@ -311,6 +311,9 @@ fastcall void __kprobes do_page_fault(st
 	/* get the address */
         address = read_cr2();
 
+	trace_mark(arch_do_page_fault, "ip %lx err %lx addr %lx",
+		   regs->eip, error_code, address);
+
 	tsk = current;
 
 	si_code = SEGV_MAPERR;
Index: linux-mcount.git/arch/x86/mm/fault_64.c
===================================================================
--- linux-mcount.git.orig/arch/x86/mm/fault_64.c	2008-01-28 08:37:14.000000000 -0500
+++ linux-mcount.git/arch/x86/mm/fault_64.c	2008-01-28 09:54:49.000000000 -0500
@@ -316,6 +316,9 @@ asmlinkage void __kprobes do_page_fault(
 	/* get the address */
 	address = read_cr2();
 
+	trace_mark(arch_do_page_fault, "ip %lx err %lx addr %lx",
+		   regs->rip, error_code, address);
+
 	info.si_code = SEGV_MAPERR;
 
 
Index: linux-mcount.git/kernel/hrtimer.c
===================================================================
--- linux-mcount.git.orig/kernel/hrtimer.c	2008-01-28 08:37:14.000000000 -0500
+++ linux-mcount.git/kernel/hrtimer.c	2008-01-28 09:54:49.000000000 -0500
@@ -709,6 +709,8 @@ static void enqueue_hrtimer(struct hrtim
 	struct hrtimer *entry;
 	int leftmost = 1;
 
+	trace_mark(kernel_hrtimer_enqueue,
+		   "expires %p timer %p", &timer->expires, timer);
 	/*
 	 * Find the right place in the rbtree:
 	 */
@@ -1130,6 +1132,7 @@ void hrtimer_interrupt(struct clock_even
 
  retry:
 	now = ktime_get();
+	trace_mark(kernel_hrtimer_interrupt, "now %p", &now);
 
 	expires_next.tv64 = KTIME_MAX;
 
@@ -1168,6 +1171,10 @@ void hrtimer_interrupt(struct clock_even
 				continue;
 			}
 
+			trace_mark(kernel_hrtimer_interrupt_expire,
+				   "expires %p timer %p",
+				   &timer->expires, timer);
+
 			__run_hrtimer(timer);
 		}
 		spin_unlock(&cpu_base->lock);
Index: linux-mcount.git/kernel/sched.c
===================================================================
--- linux-mcount.git.orig/kernel/sched.c	2008-01-28 09:54:40.000000000 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-28 09:54:49.000000000 -0500
@@ -90,6 +90,11 @@ unsigned long long __attribute__((weak))
 #define PRIO_TO_NICE(prio)	((prio) - MAX_RT_PRIO - 20)
 #define TASK_NICE(p)		PRIO_TO_NICE((p)->static_prio)
 
+#define __PRIO(prio) \
+	((prio) <= 99 ? 199 - (prio) : (prio) - 120)
+
+#define PRIO(p) __PRIO((p)->prio)
+
 /*
  * 'User priority' is the nice value converted to something we
  * can work with better when scaling various scheduler parameters,
@@ -1372,6 +1377,9 @@ static void activate_task(struct rq *rq,
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible--;
 
+	trace_mark(kernel_sched_activate_task,
+		   "pid %d prio %d nr_running %ld",
+		   p->pid, PRIO(p), rq->nr_running);
 	enqueue_task(rq, p, wakeup);
 	inc_nr_running(p, rq);
 }
@@ -1385,6 +1393,9 @@ static void deactivate_task(struct rq *r
 		rq->nr_uninterruptible++;
 
 	dequeue_task(rq, p, sleep);
+	trace_mark(kernel_sched_deactivate_task,
+		   "pid %d prio %d nr_running %ld",
+		   p->pid, PRIO(p), rq->nr_running);
 	dec_nr_running(p, rq);
 }
 

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 21/22 -v7] Add event tracer.
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (19 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 20/22 -v7] Add markers to various events Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  2008-01-30  3:15 ` [PATCH 22/22 -v7] Critical latency timings histogram Steven Rostedt
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: tracer-event-trace.patch --]
[-- Type: text/plain, Size: 25871 bytes --]

This patch adds an event tracer that hooks into various events
in the kernel. Although it can be used separately, it mainly
helps the other tracers (wakeup and preempt-off) see various
events in their traces without having to enable the heavy mcount
hooks.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 lib/tracing/Kconfig         |   12 +
 lib/tracing/Makefile        |    1 
 lib/tracing/trace_events.c  |  475 ++++++++++++++++++++++++++++++++++++++++++++
 lib/tracing/trace_irqsoff.c |    6 
 lib/tracing/trace_wakeup.c  |   55 ++++-
 lib/tracing/tracer.c        |  159 ++++++++++----
 lib/tracing/tracer.h        |   64 +++++
 7 files changed, 721 insertions(+), 51 deletions(-)

Index: linux-mcount.git/lib/tracing/trace_events.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git/lib/tracing/trace_events.c	2008-01-29 18:11:37.000000000 -0500
@@ -0,0 +1,475 @@
+/*
+ * trace task events
+ *
+ * Copyright (C) 2007 Steven Rostedt <srostedt@redhat.com>
+ *
+ * Based on code from the latency_tracer, that is:
+ *
+ *  Copyright (C) 2004-2006 Ingo Molnar
+ *  Copyright (C) 2004 William Lee Irwin III
+ */
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/debugfs.h>
+#include <linux/kallsyms.h>
+#include <linux/uaccess.h>
+#include <linux/mcount.h>
+
+#include "tracer.h"
+
+static struct tracing_trace *tracer_trace __read_mostly;
+static int trace_enabled __read_mostly;
+
+static void notrace event_reset(struct tracing_trace *tr)
+{
+	struct tracing_trace_cpu *data;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		data = tr->data[cpu];
+		tracing_reset(data);
+	}
+
+	tr->time_start = now();
+}
+
+static void notrace event_trace_sched_switch(void *private,
+					     struct task_struct *prev,
+					     struct task_struct *next)
+{
+	struct tracing_trace **ptr = private;
+	struct tracing_trace *tr = *ptr;
+	struct tracing_trace_cpu *data;
+	unsigned long flags;
+	int cpu;
+
+	if (!trace_enabled || !tr)
+		return;
+
+	local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+	data = tr->data[cpu];
+
+	atomic_inc(&data->disabled);
+	if (atomic_read(&data->disabled) != 1)
+		goto out;
+
+	tracing_sched_switch_trace(tr, data, prev, next, flags);
+
+ out:
+	atomic_dec(&data->disabled);
+	local_irq_restore(flags);
+}
+
+static struct tracer_switch_ops switch_ops __read_mostly = {
+	.func = event_trace_sched_switch,
+	.private = &tracer_trace,
+};
+
+notrace int trace_event_enabled(void)
+{
+	return trace_enabled && tracer_trace;
+}
+
+/* Taken from sched.c */
+#define __PRIO(prio) \
+	((prio) <= 99 ? 199 - (prio) : (prio) - 120)
+
+#define PRIO(p) __PRIO((p)->prio)
+
+notrace void trace_event_wakeup(unsigned long ip,
+				struct task_struct *p,
+				struct task_struct *curr)
+{
+	struct tracing_trace *tr = tracer_trace;
+	struct tracing_trace_cpu *data;
+	unsigned long flags;
+	int cpu;
+
+	if (!trace_enabled || !tr)
+		return;
+
+	local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+	data = tr->data[cpu];
+
+	atomic_inc(&data->disabled);
+	if (atomic_read(&data->disabled) != 1)
+		goto out;
+
+	/* record process's command line */
+	tracing_record_cmdline(p);
+	tracing_record_cmdline(curr);
+	tracing_trace_pid(tr, data, flags, ip, p->pid, PRIO(p), PRIO(curr));
+
+ out:
+	atomic_dec(&data->disabled);
+	local_irq_restore(flags);
+}
+
+struct event_probes {
+	const char *name;
+	const char *fmt;
+	void (*func)(const struct event_probes *probe,
+		     struct tracing_trace *tr,
+		     struct tracing_trace_cpu *data,
+		     unsigned long flags,
+		     unsigned long ip,
+		     va_list ap);
+	int active;
+	int armed;
+};
+
+#define getarg(arg, ap) arg = va_arg(ap, typeof(arg))
+
+static notrace void event_trace_apic_timer(const struct event_probes *probe,
+					   struct tracing_trace *tr,
+					   struct tracing_trace_cpu *data,
+					   unsigned long flags,
+					   unsigned long ip,
+					   va_list ap)
+{
+	unsigned long parent_ip;
+
+	getarg(parent_ip, ap);
+
+	tracing_trace_special(tr, data, flags, ip, parent_ip, 0, 0);
+}
+
+static notrace void event_trace_do_irq(const struct event_probes *probe,
+				       struct tracing_trace *tr,
+				       struct tracing_trace_cpu *data,
+				       unsigned long flags,
+				       unsigned long ip,
+				       va_list ap)
+{
+	unsigned long parent_ip;
+	int irq;
+
+	getarg(parent_ip, ap);
+	getarg(irq, ap);
+
+	tracing_trace_special(tr, data, flags, ip, parent_ip, irq, 0);
+}
+
+static notrace void event_trace_do_page_fault(const struct event_probes *probe,
+					      struct tracing_trace *tr,
+					      struct tracing_trace_cpu *data,
+					      unsigned long flags,
+					      unsigned long ip,
+					      va_list ap)
+{
+	unsigned long parent_ip;
+	unsigned long err;
+	unsigned long addr;
+
+	getarg(parent_ip, ap);
+	getarg(err, ap);
+	getarg(addr, ap);
+
+	tracing_trace_special(tr, data, flags, ip, parent_ip, err, addr);
+}
+
+static notrace void event_trace_hrtimer_timer(const struct event_probes *probe,
+					      struct tracing_trace *tr,
+					      struct tracing_trace_cpu *data,
+					      unsigned long flags,
+					      unsigned long ip,
+					      va_list ap)
+{
+	ktime_t *expires;
+	unsigned long timer;
+
+	getarg(expires, ap);
+	getarg(timer, ap);
+
+	tracing_trace_timer(tr, data, flags, ip, ktime_to_ns(*expires), timer);
+}
+
+static notrace void event_trace_hrtimer_time(const struct event_probes *probe,
+					     struct tracing_trace *tr,
+					     struct tracing_trace_cpu *data,
+					     unsigned long flags,
+					     unsigned long ip,
+					     va_list ap)
+{
+	ktime_t *now;
+
+	getarg(now, ap);
+
+	tracing_trace_timer(tr, data, flags, ip, ktime_to_ns(*now), 0);
+}
+
+static notrace void event_trace_sched3(const struct event_probes *probe,
+				       struct tracing_trace *tr,
+				       struct tracing_trace_cpu *data,
+				       unsigned long flags,
+				       unsigned long ip,
+				       va_list ap)
+{
+	int pid;
+	int prio;
+	int running;
+
+	getarg(pid, ap);
+	getarg(prio, ap);
+	getarg(running, ap);
+
+	tracing_trace_pid(tr, data, flags, ip, pid, prio, running);
+}
+
+#ifndef CONFIG_WAKEUP_TRACER
+static notrace void event_trace_sched_wakeup(const struct event_probes *probe,
+					struct tracing_trace *tr,
+					struct tracing_trace_cpu *data,
+					unsigned long flags,
+					unsigned long ip,
+					va_list ap)
+{
+	struct task_struct *p;
+	struct task_struct *curr;
+
+	getarg(p, ap);
+	getarg(curr, ap);
+
+	tracing_record_cmdline(p);
+	tracing_record_cmdline(curr);
+	tracing_trace_pid(tr, data, flags, ip, p->pid, PRIO(p), PRIO(curr));
+}
+#endif
+
+static struct event_probes event_probes[] __read_mostly = {
+	{
+		.name = "arch_apic_timer",
+		.fmt = "ip %lx",
+		.func = event_trace_apic_timer,
+	},
+	{
+		.name = "arch_do_irq",
+		.fmt = "ip %lx irq %d",
+		.func = event_trace_do_irq,
+	},
+	{
+		.name = "arch_do_page_fault",
+		.fmt = "ip %lx err %lx addr %lx",
+		.func = event_trace_do_page_fault,
+	},
+	{
+		.name = "kernel_hrtimer_enqueue",
+		.fmt = "expires %p timer %p",
+		.func = event_trace_hrtimer_timer,
+	},
+	{
+		.name = "kernel_hrtimer_interrupt",
+		.fmt = "now %p",
+		.func = event_trace_hrtimer_time,
+	},
+	{
+		.name = "kernel_hrtimer_interrupt_expire",
+		.fmt = "expires %p timer %p",
+		.func = event_trace_hrtimer_timer,
+	},
+	{
+		.name = "kernel_sched_activate_task",
+		.fmt = "pid %d prio %d nr_running %ld",
+		.func = event_trace_sched3,
+	},
+	{
+		.name = "kernel_sched_deactivate_task",
+		.fmt = "pid %d prio %d nr_running %ld",
+		.func = event_trace_sched3,
+	},
+#ifndef CONFIG_WAKEUP_TRACER
+	{
+		.name = "kernel_sched_wakeup",
+		.fmt = "p %p rq->curr %p",
+		.func = event_trace_sched_wakeup,
+	},
+#endif
+	{
+		.name = NULL,
+	}
+};
+
+static notrace void event_trace_callback(const struct marker *mdata,
+					 void *private_data,
+					 const char *format, ...)
+{
+	struct event_probes *probe = mdata->private;
+	struct tracing_trace *tr = tracer_trace;
+	struct tracing_trace_cpu *data;
+	unsigned long flags;
+	unsigned long ip;
+	int cpu;
+	va_list ap;
+
+	if (!trace_enabled || !tracer_trace)
+		return;
+
+	local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+	data = tr->data[cpu];
+
+	atomic_inc(&data->disabled);
+	if (atomic_read(&data->disabled) != 1)
+		goto out;
+
+	ip = CALLER_ADDR0;
+
+	va_start(ap, format);
+	probe->func(probe, tr, data, flags, ip, ap);
+	va_end(ap);
+
+ out:
+	atomic_dec(&data->disabled);
+	local_irq_restore(flags);
+}
+
+static notrace void event_trace_arm(void)
+{
+	struct event_probes *probe;
+	int ret;
+
+	for (probe = event_probes; probe->name; probe++) {
+		if (!probe->active || probe->armed)
+			continue;
+
+		ret = marker_arm(probe->name);
+		if (ret)
+			pr_info("event trace: Couldn't arm probe %s\n",
+				probe->name);
+		else
+			probe->armed = 1;
+	}
+
+	tracing_start_wakeup();
+}
+
+static notrace void event_trace_disarm(void)
+{
+	struct event_probes *probe;
+
+	tracing_stop_wakeup();
+
+	for (probe = event_probes; probe->name; probe++) {
+		if (!probe->active || !probe->armed)
+			continue;
+
+		marker_disarm(probe->name);
+		probe->armed = 0;
+	}
+}
+
+static notrace void start_event_trace(struct tracing_trace *tr)
+{
+	event_trace_arm();
+
+	event_reset(tr);
+	trace_enabled = 1;
+
+	register_tracer_switch(&switch_ops);
+	tracing_start_sched_switch();
+	tracing_start_function_trace();
+}
+
+static notrace void stop_event_trace(struct tracing_trace *tr)
+{
+	tracing_stop_function_trace();
+	trace_enabled = 0;
+	unregister_tracer_switch(&switch_ops);
+	tracing_stop_sched_switch();
+	event_trace_disarm();
+}
+
+static notrace void event_trace_init(struct tracing_trace *tr)
+{
+	tracer_trace = tr;
+
+	if (tr->ctrl)
+		start_event_trace(tr);
+}
+
+static notrace void event_trace_reset(struct tracing_trace *tr)
+{
+	if (tr->ctrl)
+		stop_event_trace(tr);
+}
+
+static void event_trace_ctrl_update(struct tracing_trace *tr)
+{
+	if (tr->ctrl)
+		start_event_trace(tr);
+	else
+		stop_event_trace(tr);
+}
+
+static void notrace event_trace_open(struct tracing_iterator *iter)
+{
+	/* stop the trace while dumping */
+	if (iter->tr->ctrl)
+		stop_event_trace(iter->tr);
+}
+
+static void notrace event_trace_close(struct tracing_iterator *iter)
+{
+	if (iter->tr->ctrl)
+		start_event_trace(iter->tr);
+}
+
+static struct trace_types_struct event_trace __read_mostly =
+{
+	.name = "events",
+	.init = event_trace_init,
+	.reset = event_trace_reset,
+	.open = event_trace_open,
+	.close = event_trace_close,
+	.ctrl_update = event_trace_ctrl_update,
+};
+
+notrace void trace_event_register(struct tracing_trace *tr)
+{
+	tracer_trace = tr;
+	event_trace_arm();
+}
+
+notrace void trace_event_unregister(struct tracing_trace *tr)
+{
+	event_trace_disarm();
+}
+
+notrace void trace_start_events(void)
+{
+	trace_enabled = 1;
+}
+
+notrace void trace_stop_events(void)
+{
+	trace_enabled = 0;
+}
+
+__init static int init_event_trace(void)
+{
+	struct event_probes *probe;
+	int ret;
+
+	ret = register_trace(&event_trace);
+	if (ret)
+		return ret;
+
+	for (probe = event_probes; probe->name; probe++) {
+
+		ret = marker_probe_register(probe->name,
+					    probe->fmt,
+					    event_trace_callback,
+					    probe);
+		if (ret)
+			pr_info("event trace: Couldn't add marker"
+				" probe to %s\n", probe->name);
+		else
+			probe->active = 1;
+	}
+
+	return 0;
+}
+
+device_initcall(init_event_trace);
Index: linux-mcount.git/lib/tracing/Kconfig
===================================================================
--- linux-mcount.git.orig/lib/tracing/Kconfig	2008-01-29 18:11:26.000000000 -0500
+++ linux-mcount.git/lib/tracing/Kconfig	2008-01-29 19:06:54.000000000 -0500
@@ -28,6 +28,18 @@ config FUNCTION_TRACER
 	  that the debugging mechanism using this facility will hook by
 	  providing a set of inline routines.
 
+config EVENT_TRACER
+	bool "trace kernel events"
+	depends on DEBUG_KERNEL
+	select TRACING
+	select CONTEXT_SWITCH_TRACER
+	select MARKERS
+	help
+	  This option activates the event tracer of the latency_tracer.
+	  It activates markers throughout the kernel for tracing.
+	  This option has a fairly low overhead when enabled.
+
+
 config CRITICAL_IRQSOFF_TIMING
 	bool "Interrupts-off critical section latency timing"
 	default n
Index: linux-mcount.git/lib/tracing/Makefile
===================================================================
--- linux-mcount.git.orig/lib/tracing/Makefile	2008-01-29 18:11:26.000000000 -0500
+++ linux-mcount.git/lib/tracing/Makefile	2008-01-29 19:06:54.000000000 -0500
@@ -6,5 +6,6 @@ obj-$(CONFIG_FUNCTION_TRACER) += trace_f
 obj-$(CONFIG_CRITICAL_IRQSOFF_TIMING) += trace_irqsoff.o
 obj-$(CONFIG_CRITICAL_PREEMPT_TIMING) += trace_irqsoff.o
 obj-$(CONFIG_WAKEUP_TRACER) += trace_wakeup.o
+obj-$(CONFIG_EVENT_TRACER) += trace_events.o
 
 libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/trace_wakeup.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/trace_wakeup.c	2008-01-29 18:10:17.000000000 -0500
+++ linux-mcount.git/lib/tracing/trace_wakeup.c	2008-01-29 19:06:54.000000000 -0500
@@ -23,6 +23,7 @@ static int trace_enabled __read_mostly;
 static struct task_struct *wakeup_task;
 static int wakeup_cpu;
 static unsigned wakeup_prio = -1;
+static atomic_t wakeup_ref;
 
 static DEFINE_SPINLOCK(wakeup_lock);
 
@@ -218,7 +219,7 @@ static notrace void wake_up_callback(con
 	struct task_struct *p;
 	va_list ap;
 
-	if (likely(!trace_enabled))
+	if (likely(!trace_enabled) && !trace_event_enabled())
 		return;
 
 	va_start(ap, format);
@@ -229,18 +230,26 @@ static notrace void wake_up_callback(con
 
 	va_end(ap);
 
-	wakeup_check_start(tr, p, curr);
+	trace_event_wakeup(CALLER_ADDR0, p, curr);
+
+	if (trace_enabled)
+		wakeup_check_start(tr, p, curr);
 }
 
-static notrace void start_wakeup_trace(struct tracing_trace *tr)
+notrace int tracing_start_wakeup(void)
 {
+	long ref;
 	int ret;
 
+	ref = atomic_inc_return(&wakeup_ref);
+	if (ref != 1)
+		return 0;
+
 	ret = marker_arm("kernel_sched_wakeup");
 	if (ret) {
 		pr_info("wakeup trace: Couldn't arm probe"
 			" kernel_sched_wakeup\n");
-		return;
+		return -1;
 	}
 
 	ret = marker_arm("kernel_sched_wakeup_new");
@@ -250,38 +259,66 @@ static notrace void start_wakeup_trace(s
 		goto out;
 	}
 
+	return 0;
+
+ out:
+	marker_disarm("kernel_sched_wakeup");
+	return -1;
+}
+
+notrace void tracing_stop_wakeup(void)
+{
+	long ref;
+
+	ref = atomic_dec_return(&wakeup_ref);
+	if (!ref) {
+		marker_disarm("kernel_sched_wakeup");
+		marker_disarm("kernel_sched_wakeup_new");
+	}
+}
+
+static notrace void start_wakeup_trace(struct tracing_trace *tr)
+{
+	int ret;
+
+	ret = tracing_start_wakeup();
+	if (ret)
+		return;
+
 	register_tracer_switch(&switch_ops);
 	tracing_start_sched_switch();
+	trace_start_events();
 
 	wakeup_reset(tr);
 	trace_enabled = 1;
 	return;
-
- out:
-	marker_disarm("kernel_sched_wakeup");
 }
 
 static notrace void stop_wakeup_trace(struct tracing_trace *tr)
 {
 	trace_enabled = 0;
+	tracing_stop_wakeup();
 	tracing_stop_sched_switch();
 	unregister_tracer_switch(&switch_ops);
-	marker_disarm("kernel_sched_wakeup");
-	marker_disarm("kernel_sched_wakeup_new");
 }
 
 static notrace void wakeup_trace_init(struct tracing_trace *tr)
 {
 	tracer_trace = tr;
 
+	trace_event_register(tr);
+
 	if (tr->ctrl)
 		start_wakeup_trace(tr);
 }
 
 static notrace void wakeup_trace_reset(struct tracing_trace *tr)
 {
+	trace_event_unregister(tr);
+
 	if (tr->ctrl) {
 		stop_wakeup_trace(tr);
+		trace_stop_events();
 		/* make sure we put back any tasks we are tracing */
 		wakeup_reset(tr);
 	}
Index: linux-mcount.git/lib/tracing/tracer.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/tracer.c	2008-01-29 18:11:18.000000000 -0500
+++ linux-mcount.git/lib/tracing/tracer.c	2008-01-29 18:14:24.000000000 -0500
@@ -58,6 +58,9 @@ enum trace_type {
 
 	TRACE_FN,
 	TRACE_CTX,
+	TRACE_PID,
+	TRACE_SPECIAL,
+	TRACE_TMR,
 
 	__TRACE_LAST_TYPE
 };
@@ -401,6 +404,61 @@ notrace void tracing_sched_switch_trace(
 	entry->ctx.next_prio	= next->prio;
 }
 
+void tracing_trace_pid(struct tracing_trace *tr,
+		       struct tracing_trace_cpu *data,
+		       unsigned long flags,
+		       unsigned long ip,
+		       unsigned long pid,
+		       unsigned long a,
+		       unsigned long b)
+{
+	struct tracing_entry *entry;
+
+	entry = tracing_get_trace_entry(tr, data);
+	tracing_generic_entry_update(entry, flags);
+	entry->type		= TRACE_PID;
+	entry->id.ip		= ip;
+	entry->id.pid		= pid;
+	entry->id.a		= a;
+	entry->id.b		= b;
+}
+
+void tracing_trace_special(struct tracing_trace *tr,
+			   struct tracing_trace_cpu *data,
+			   unsigned long flags,
+			   unsigned long ip,
+			   unsigned long parent_ip,
+			   unsigned long a,
+			   unsigned long b)
+{
+	struct tracing_entry *entry;
+
+	entry = tracing_get_trace_entry(tr, data);
+	tracing_generic_entry_update(entry, flags);
+	entry->type			= TRACE_SPECIAL;
+	entry->special.ip		= ip;
+	entry->special.parent_ip	= parent_ip;
+	entry->special.a		= a;
+	entry->special.b		= b;
+}
+
+void tracing_trace_timer(struct tracing_trace *tr,
+			 struct tracing_trace_cpu *data,
+			 unsigned long flags,
+			 unsigned long ip,
+			 unsigned long long time,
+			 unsigned long a)
+{
+	struct tracing_entry *entry;
+
+	entry = tracing_get_trace_entry(tr, data);
+	tracing_generic_entry_update(entry, flags);
+	entry->type			= TRACE_TMR;
+	entry->tmr.ip		= ip;
+	entry->tmr.time		= time;
+	entry->tmr.a		= a;
+}
+
 enum trace_iterator {
 	TRACE_ITER_SYM_ONLY	= 1,
 	TRACE_ITER_VERBOSE	= 2,
@@ -729,6 +787,62 @@ lat_print_timestamp(struct seq_file *m, 
 static const char state_to_char[] = TASK_STATE_TO_CHAR_STR;
 
 static void notrace
+print_generic_output(struct seq_file *m,
+		     struct tracing_entry *entry, int sym_only)
+{
+	char *comm;
+	int S;
+
+	switch (entry->type) {
+	case TRACE_FN:
+		seq_print_ip_sym(m, entry->fn.ip, sym_only);
+		seq_puts(m, " (");
+		seq_print_ip_sym(m, entry->fn.parent_ip, sym_only);
+		seq_puts(m, ")");
+		break;
+	case TRACE_CTX:
+		S = entry->ctx.prev_state < sizeof(state_to_char) ?
+			state_to_char[entry->ctx.prev_state] : 'X';
+		comm = trace_find_cmdline(entry->ctx.next_pid);
+		seq_printf(m, " %d:%d:%c --> %d:%d %s",
+			   entry->ctx.prev_pid,
+			   entry->ctx.prev_prio,
+			   S,
+			   entry->ctx.next_pid,
+			   entry->ctx.next_prio,
+			   comm);
+		break;
+	case TRACE_PID:
+		seq_print_ip_sym(m, entry->id.ip, sym_only);
+		seq_printf(m, " <%.8s-%ld> (%ld %ld)",
+			   trace_find_cmdline(entry->id.pid),
+			   entry->id.pid,
+			   entry->id.a, entry->id.b);
+		break;
+	case TRACE_SPECIAL:
+		seq_print_ip_sym(m, entry->special.ip, sym_only);
+		seq_putc(m, ' ');
+		seq_print_ip_sym(m, entry->special.parent_ip, 0);
+		/* For convenience, print small numbers in decimal */
+		if (abs((int)entry->special.a) < 100000)
+			seq_printf(m, "(%ld ", entry->special.a);
+		else
+			seq_printf(m, "(%lx ", entry->special.a);
+		if (abs((int)entry->special.b) < 100000)
+			seq_printf(m, "%ld)", entry->special.b);
+		else
+			seq_printf(m, "%lx)", entry->special.b);
+		break;
+	case TRACE_TMR:
+		seq_print_ip_sym(m, entry->tmr.ip, sym_only);
+		seq_printf(m, "(%13Lx %lx)",
+			   entry->tmr.time,
+			   entry->tmr.a);
+		break;
+	}
+}
+
+static void notrace
 print_lat_fmt(struct seq_file *m, struct tracing_iterator *iter,
 	      unsigned int trace_idx, int cpu)
 {
@@ -739,7 +853,6 @@ print_lat_fmt(struct seq_file *m, struct
 	char *comm;
 	int sym_only = !!(trace_flags & TRACE_ITER_SYM_ONLY);
 	int verbose = !!(trace_flags & TRACE_ITER_VERBOSE);
-	int S;
 
 	if (!next_entry)
 		next_entry = entry;
@@ -760,26 +873,8 @@ print_lat_fmt(struct seq_file *m, struct
 		lat_print_generic(m, entry, cpu);
 		lat_print_timestamp(m, abs_usecs, rel_usecs);
 	}
-	switch (entry->type) {
-	case TRACE_FN:
-		seq_print_ip_sym(m, entry->fn.ip, sym_only);
-		seq_puts(m, " (");
-		seq_print_ip_sym(m, entry->fn.parent_ip, sym_only);
-		seq_puts(m, ")\n");
-		break;
-	case TRACE_CTX:
-		S = entry->ctx.prev_state < sizeof(state_to_char) ?
-			state_to_char[entry->ctx.prev_state] : 'X';
-		comm = trace_find_cmdline(entry->ctx.next_pid);
-		seq_printf(m, " %d:%d:%c --> %d:%d %s\n",
-			   entry->ctx.prev_pid,
-			   entry->ctx.prev_prio,
-			   S,
-			   entry->ctx.next_pid,
-			   entry->ctx.next_prio,
-			   comm);
-		break;
-	}
+	print_generic_output(m, entry, sym_only);
+	seq_putc(m, '\n');
 }
 
 static void notrace print_trace_fmt(struct seq_file *m,
@@ -791,7 +886,6 @@ static void notrace print_trace_fmt(stru
 	int sym_only = !!(trace_flags & TRACE_ITER_SYM_ONLY);
 	unsigned long long t;
 	char *comm;
-	int S;
 
 	comm = trace_find_cmdline(iter->ent->pid);
 
@@ -803,26 +897,7 @@ static void notrace print_trace_fmt(stru
 	seq_printf(m, "CPU %d: ", iter->cpu);
 	seq_printf(m, "%s:%d ", comm,
 		   entry->pid);
-	switch (entry->type) {
-	case TRACE_FN:
-		seq_print_ip_sym(m, entry->fn.ip, sym_only);
-		if (entry->fn.parent_ip) {
-			seq_printf(m, " <-- ");
-			seq_print_ip_sym(m, entry->fn.parent_ip,
-					 sym_only);
-		}
-		break;
-	case TRACE_CTX:
-		S = entry->ctx.prev_state < sizeof(state_to_char) ?
-			state_to_char[entry->ctx.prev_state] : 'X';
-		seq_printf(m, " %d:%d:%c ==> %d:%d\n",
-			   entry->ctx.prev_pid,
-			   entry->ctx.prev_prio,
-			   S,
-			   entry->ctx.next_pid,
-			   entry->ctx.next_prio);
-		break;
-	}
+	print_generic_output(m, entry, sym_only);
 	seq_printf(m, "\n");
 }
 
Index: linux-mcount.git/lib/tracing/tracer.h
===================================================================
--- linux-mcount.git.orig/lib/tracing/tracer.h	2008-01-29 18:10:56.000000000 -0500
+++ linux-mcount.git/lib/tracing/tracer.h	2008-01-29 19:07:19.000000000 -0500
@@ -18,6 +18,27 @@ struct tracing_sched_switch {
 	unsigned char next_prio;
 };
 
+struct tracing_pid {
+	unsigned long ip;
+	unsigned long pid;
+	unsigned long a;
+	unsigned long b;
+};
+
+struct tracing_special {
+	unsigned long ip;
+	unsigned long parent_ip;
+	unsigned long pid;
+	unsigned long a;
+	unsigned long b;
+};
+
+struct tracing_timer {
+	unsigned long ip;
+	unsigned long long time;
+	unsigned long a;
+};
+
 struct tracing_entry {
 	char type;
 	char cpu;  /* who will want to trace more than 256 CPUS? */
@@ -28,6 +49,9 @@ struct tracing_entry {
 	union {
 		struct tracing_function fn;
 		struct tracing_sched_switch ctx;
+		struct tracing_pid id;
+		struct tracing_special special;
+		struct tracing_timer tmr;
 	};
 };
 
@@ -99,6 +123,26 @@ void tracing_sched_switch_trace(struct t
 				struct task_struct *next,
 				unsigned long flags);
 void tracing_record_cmdline(struct task_struct *tsk);
+void tracing_trace_pid(struct tracing_trace *tr,
+		       struct tracing_trace_cpu *data,
+		       unsigned long flags,
+		       unsigned long ip,
+		       unsigned long pid,
+		       unsigned long a,
+		       unsigned long b);
+void tracing_trace_special(struct tracing_trace *tr,
+			   struct tracing_trace_cpu *data,
+			   unsigned long flags,
+			   unsigned long ip,
+			   unsigned long parent_ip,
+			   unsigned long a,
+			   unsigned long b);
+void tracing_trace_timer(struct tracing_trace *tr,
+			 struct tracing_trace_cpu *data,
+			 unsigned long flags,
+			 unsigned long ip,
+			 unsigned long long time,
+			 unsigned long a);
 
 void tracing_start_function_trace(void);
 void tracing_stop_function_trace(void);
@@ -106,6 +150,8 @@ int register_trace(struct trace_types_st
 void unregister_trace(struct trace_types_struct *type);
 void tracing_start_sched_switch(void);
 void tracing_stop_sched_switch(void);
+int tracing_start_wakeup(void);
+void tracing_stop_wakeup(void);
 
 extern int tracing_sched_switch_enabled;
 extern unsigned long tracing_max_latency;
@@ -134,4 +180,22 @@ extern int register_tracer_switch(struct
 extern int unregister_tracer_switch(struct tracer_switch_ops *ops);
 #endif /* CONFIG_CONTEXT_SWITCH_TRACER */
 
+#ifdef CONFIG_EVENT_TRACER
+extern int trace_event_enabled(void);
+extern void trace_event_wakeup(unsigned long ip,
+			       struct task_struct *p,
+			       struct task_struct *curr);
+extern void trace_event_register(struct tracing_trace *tr);
+extern void trace_event_unregister(struct tracing_trace *tr);
+extern void trace_start_events(void);
+extern void trace_stop_events(void);
+#else
+# define trace_event_enabled() 0
+# define trace_event_wakeup(ip, p, curr) do { } while (0)
+# define trace_event_register(tr) do { } while (0)
+# define trace_event_unregister(tr) do { } while (0)
+# define trace_start_events() do { } while (0)
+# define trace_stop_events() do { } while (0)
+#endif
+
 #endif /* _LINUX_MCOUNT_TRACER_H */
Index: linux-mcount.git/lib/tracing/trace_irqsoff.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/trace_irqsoff.c	2008-01-29 18:11:26.000000000 -0500
+++ linux-mcount.git/lib/tracing/trace_irqsoff.c	2008-01-29 19:06:54.000000000 -0500
@@ -216,6 +216,7 @@ start_critical_timing(unsigned long ip, 
 	local_save_flags(flags);
 
 	tracing_function_trace(tr, data, ip, parent_ip, flags);
+	trace_start_events();
 
 	__get_cpu_var(tracing_cpu) = 1;
 
@@ -248,6 +249,7 @@ stop_critical_timing(unsigned long ip, u
 
 	atomic_inc(&data->disabled);
 	local_save_flags(flags);
+	trace_stop_events();
 	tracing_function_trace(tr, data, ip, parent_ip, flags);
 	check_critical_timing(tr, data, parent_ip ? : ip, cpu);
 	data->critical_start = 0;
@@ -371,12 +373,16 @@ static void __irqsoff_trace_init(struct 
 	/* make sure that the tracer is visibel */
 	smp_wmb();
 
+	trace_event_register(tr);
+
 	if (tr->ctrl)
 		start_irqsoff_trace(tr);
 }
 
 static void irqsoff_trace_reset(struct tracing_trace *tr)
 {
+	trace_event_unregister(tr);
+
 	if (tr->ctrl)
 		stop_irqsoff_trace(tr);
 }

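The tracing_start_wakeup()/tracing_stop_wakeup() pair above uses a
reference count so the wakeup markers stay armed while any tracer needs
them and are disarmed only when the last user is gone. A single-threaded
userspace sketch of the same pattern, with plain ints standing in for
the kernel's atomic_t and a flag modeling marker_arm()/marker_disarm():

```c
#include <assert.h>

/*
 * Plain ints stand in for the kernel's atomic_t; this sketch is
 * single-threaded.  markers_armed models marker_arm()/marker_disarm().
 */
static int wakeup_ref;
static int markers_armed;

/* First caller in arms the markers; later callers only take a reference. */
static int tracing_start_wakeup(void)
{
	if (++wakeup_ref != 1)
		return 0;	/* someone already armed them */
	markers_armed = 1;	/* marker_arm("kernel_sched_wakeup"), ... */
	return 0;
}

/* Last caller out disarms them. */
static void tracing_stop_wakeup(void)
{
	if (--wakeup_ref == 0)
		markers_armed = 0;	/* marker_disarm(...) */
}
```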
-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 22/22 -v7] Critical latency timings histogram
  2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
                   ` (20 preceding siblings ...)
  2008-01-30  3:15 ` [PATCH 21/22 -v7] Add event tracer Steven Rostedt
@ 2008-01-30  3:15 ` Steven Rostedt
  21 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30  3:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

[-- Attachment #1: trace-histograms.patch --]
[-- Type: text/plain, Size: 23024 bytes --]

This patch adds hooks into the latency tracer to give
us histograms of interrupts off, preemption off and
wakeup timings.

This code is based on work done by Yi Yang <yyang@ch.mvista.com>,
but heavily modified to work with the new tracer, with some
clean-ups by Steven Rostedt <srostedt@redhat.com>.

This adds the following to /debugfs/tracing

  latency_hist/ - root dir for histograms.

  Under latency_hist there are (depending on what's configured):

    interrupt_off_latency/ - latency histograms of interrupts off.

    preempt_interrupts_off_latency/ - latency histograms of
                              preemption and/or interrupts off.

    preempt_off_latency/ - latency histograms of preemption off.

    wakeup_latency/ - latency histograms of wakeup timings.

  Under each of the above are files labeled:

    CPU# for each possible CPU, where # is the CPU number.

    reset - writing into this file will reset the histogram
            back to zeros and start again.
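A latency histogram of this kind just counts samples into fixed-width
buckets, with an overflow counter for latencies past the last bucket
and a running maximum. A minimal userspace sketch; the bucket count and
field names are hypothetical, not taken from tracer_hist.c:

```c
#include <assert.h>

#define HIST_BUCKETS 128	/* hypothetical: one bucket per microsecond */

struct hist_data {
	unsigned long long counts[HIST_BUCKETS];
	unsigned long long beyond;	/* samples past the last bucket */
	unsigned long long max_lat;	/* worst latency seen */
};

/* Account one latency sample, given in microseconds. */
static void hist_update(struct hist_data *h, unsigned long long lat_us)
{
	if (lat_us >= HIST_BUCKETS)
		h->beyond++;
	else
		h->counts[lat_us]++;
	if (lat_us > h->max_lat)
		h->max_lat = lat_us;
}
```

Reading a CPU# file then amounts to dumping the non-zero buckets; the
reset file zeroes the whole structure.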


Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 lib/tracing/Kconfig         |   20 +
 lib/tracing/Makefile        |    4 
 lib/tracing/trace_irqsoff.c |   19 +
 lib/tracing/trace_wakeup.c  |   21 +
 lib/tracing/tracer_hist.c   |  514 ++++++++++++++++++++++++++++++++++++++++++++
 lib/tracing/tracer_hist.h   |   39 +++
 6 files changed, 613 insertions(+), 4 deletions(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===================================================================
--- linux-mcount.git.orig/lib/tracing/Kconfig	2008-01-29 21:34:14.000000000 -0500
+++ linux-mcount.git/lib/tracing/Kconfig	2008-01-29 21:34:30.000000000 -0500
@@ -102,3 +102,23 @@ config CONTEXT_SWITCH_TRACER
 	  This tracer hooks into the context switch and records
 	  all switching of tasks.
 
+config INTERRUPT_OFF_HIST
+	bool "Interrupts off critical timings histogram"
+	depends on CRITICAL_IRQSOFF_TIMING
+	help
+	  This option uses the infrastructure of the critical
+	  irqs off timings to create a histogram of latencies.
+
+config PREEMPT_OFF_HIST
+	bool "Preempt off critical timings histogram"
+	depends on CRITICAL_PREEMPT_TIMING
+	help
+	  This option uses the infrastructure of the critical
+	  preemption off timings to create a histogram of latencies.
+
+config WAKEUP_LATENCY_HIST
+	bool "Wakeup latency timings histogram"
+	depends on WAKEUP_TRACER
+	help
+	  This option uses the infrastructure of the wakeup tracer
+	  to create a histogram of latencies.
Index: linux-mcount.git/lib/tracing/Makefile
===================================================================
--- linux-mcount.git.orig/lib/tracing/Makefile	2008-01-29 21:34:14.000000000 -0500
+++ linux-mcount.git/lib/tracing/Makefile	2008-01-29 21:34:30.000000000 -0500
@@ -8,4 +8,8 @@ obj-$(CONFIG_CRITICAL_PREEMPT_TIMING) +=
 obj-$(CONFIG_WAKEUP_TRACER) += trace_wakeup.o
 obj-$(CONFIG_EVENT_TRACER) += trace_events.o
 
+obj-$(CONFIG_INTERRUPT_OFF_HIST) += tracer_hist.o
+obj-$(CONFIG_PREEMPT_OFF_HIST) += tracer_hist.o
+obj-$(CONFIG_WAKEUP_LATENCY_HIST) += tracer_hist.o
+
 libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/trace_irqsoff.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/trace_irqsoff.c	2008-01-29 21:34:14.000000000 -0500
+++ linux-mcount.git/lib/tracing/trace_irqsoff.c	2008-01-29 21:34:30.000000000 -0500
@@ -16,6 +16,7 @@
 #include <linux/mcount.h>
 
 #include "tracer.h"
+#include "tracer_hist.h"
 
 static struct tracing_trace *tracer_trace __read_mostly;
 static __cacheline_aligned_in_smp DEFINE_MUTEX(max_mutex);
@@ -261,10 +262,14 @@ void notrace start_critical_timings(void
 {
 	if (preempt_trace() || irq_trace())
 		start_critical_timing(CALLER_ADDR0, 0);
+
+	tracing_hist_preempt_start();
 }
 
 void notrace stop_critical_timings(void)
 {
+	tracing_hist_preempt_stop(TRACE_STOP);
+
 	if (preempt_trace() || irq_trace())
 		stop_critical_timing(CALLER_ADDR0, 0);
 }
@@ -273,6 +278,8 @@ void notrace stop_critical_timings(void)
 #ifdef CONFIG_LOCKDEP
 void notrace time_hardirqs_on(unsigned long a0, unsigned long a1)
 {
+	tracing_hist_preempt_stop(1);
+
 	if (!preempt_trace() && irq_trace())
 		stop_critical_timing(a0, a1);
 }
@@ -281,6 +288,8 @@ void notrace time_hardirqs_off(unsigned 
 {
 	if (!preempt_trace() && irq_trace())
 		start_critical_timing(a0, a1);
+
+	tracing_hist_preempt_start();
 }
 
 #else /* !CONFIG_LOCKDEP */
@@ -314,6 +323,8 @@ inline void print_irqtrace_events(struct
  */
 void notrace trace_hardirqs_on(void)
 {
+	tracing_hist_preempt_stop(1);
+
 	if (!preempt_trace() && irq_trace())
 		stop_critical_timing(CALLER_ADDR0, 0);
 }
@@ -323,11 +334,15 @@ void notrace trace_hardirqs_off(void)
 {
 	if (!preempt_trace() && irq_trace())
 		start_critical_timing(CALLER_ADDR0, 0);
+
+	tracing_hist_preempt_start();
 }
 EXPORT_SYMBOL(trace_hardirqs_off);
 
 void notrace trace_hardirqs_on_caller(unsigned long caller_addr)
 {
+	tracing_hist_preempt_stop(1);
+
 	if (!preempt_trace() && irq_trace())
 		stop_critical_timing(CALLER_ADDR0, caller_addr);
 }
@@ -337,6 +352,8 @@ void notrace trace_hardirqs_off_caller(u
 {
 	if (!preempt_trace() && irq_trace())
 		start_critical_timing(CALLER_ADDR0, caller_addr);
+
+	tracing_hist_preempt_start();
 }
 EXPORT_SYMBOL(trace_hardirqs_off_caller);
 
@@ -346,12 +363,14 @@ EXPORT_SYMBOL(trace_hardirqs_off_caller)
 #ifdef CONFIG_CRITICAL_PREEMPT_TIMING
 void notrace trace_preempt_on(unsigned long a0, unsigned long a1)
 {
+	tracing_hist_preempt_stop(0);
 	stop_critical_timing(a0, a1);
 }
 
 void notrace trace_preempt_off(unsigned long a0, unsigned long a1)
 {
 	start_critical_timing(a0, a1);
+	tracing_hist_preempt_start();
 }
 #endif /* CONFIG_CRITICAL_PREEMPT_TIMING */
 
Index: linux-mcount.git/lib/tracing/trace_wakeup.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/trace_wakeup.c	2008-01-29 21:34:14.000000000 -0500
+++ linux-mcount.git/lib/tracing/trace_wakeup.c	2008-01-29 21:34:30.000000000 -0500
@@ -16,6 +16,7 @@
 #include <linux/mcount.h>
 
 #include "tracer.h"
+#include "tracer_hist.h"
 
 static struct tracing_trace *tracer_trace __read_mostly;
 static int trace_enabled __read_mostly;
@@ -55,7 +56,9 @@ static void notrace wakeup_sched_switch(
 	unsigned long flags;
 	int cpu;
 
-	if (unlikely(!trace_enabled) || next != wakeup_task)
+	tracing_hist_wakeup_stop(next);
+
+	if (!trace_enabled || next != wakeup_task)
 		return;
 
 	/* The task we are waitng for is waking up */
@@ -219,7 +222,8 @@ static notrace void wake_up_callback(con
 	struct task_struct *p;
 	va_list ap;
 
-	if (likely(!trace_enabled) && !trace_event_enabled())
+	if (likely(!trace_enabled) &&
+	    !trace_event_enabled() && !tracing_wakeup_hist)
 		return;
 
 	va_start(ap, format);
@@ -231,6 +235,7 @@ static notrace void wake_up_callback(con
 	va_end(ap);
 
 	trace_event_wakeup(CALLER_ADDR0, p, curr);
+	tracing_hist_wakeup_start(p, curr);
 
 	if (trace_enabled)
 		wakeup_check_start(tr, p, curr);
@@ -285,7 +290,8 @@ static notrace void start_wakeup_trace(s
 	if (ret)
 		return;
 
-	register_tracer_switch(&switch_ops);
+	if (!tracing_wakeup_hist)
+		register_tracer_switch(&switch_ops);
 	tracing_start_sched_switch();
 	trace_start_events();
 
@@ -299,7 +305,8 @@ static notrace void stop_wakeup_trace(st
 	trace_enabled = 0;
 	tracing_stop_wakeup();
 	tracing_stop_sched_switch();
-	unregister_tracer_switch(&switch_ops);
+	if (!tracing_wakeup_hist)
+		unregister_tracer_switch(&switch_ops);
 }
 
 static notrace void wakeup_trace_init(struct tracing_trace *tr)
@@ -385,6 +392,12 @@ __init static int init_wakeup_trace(void
 		goto fail_deprobe;
 	}
 
+	if (tracing_wakeup_hist) {
+		register_tracer_switch(&switch_ops);
+		tracing_start_wakeup();
+		tracing_start_sched_switch();
+	}
+
 	return 0;
  fail_deprobe:
 	marker_probe_unregister("kernel_sched_wakeup");
Index: linux-mcount.git/lib/tracing/tracer_hist.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git/lib/tracing/tracer_hist.c	2008-01-29 21:34:30.000000000 -0500
@@ -0,0 +1,514 @@
+/*
+ * lib/tracing/tracer_hist.c
+ *
+ * Add support for histograms of preemption-off latency,
+ * interrupt-off latency, and wakeup latency. Depends on
+ * Real-Time Preemption support.
+ *
+ *  Copyright (C) 2005 MontaVista Software, Inc.
+ *  Yi Yang <yyang@ch.mvista.com>
+ *
+ *  Converted to work with the new latency tracer.
+ *  Copyright (C) 2008 Red Hat, Inc.
+ *    Steven Rostedt <srostedt@redhat.com>
+ *
+ */
+#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+#include <linux/percpu.h>
+#include <linux/spinlock.h>
+#include <asm/atomic.h>
+#include <asm/div64.h>
+#include <asm/uaccess.h>
+
+#include "tracer.h"
+#include "tracer_hist.h"
+
+enum {
+	INTERRUPT_LATENCY = 0,
+	PREEMPT_LATENCY,
+	PREEMPT_INTERRUPT_LATENCY,
+	WAKEUP_LATENCY,
+};
+
+#define MAX_ENTRY_NUM 10240
+
+struct hist_data {
+	atomic_t hist_mode; /* 0 log, 1 don't log */
+	unsigned long min_lat;
+	unsigned long avg_lat;
+	unsigned long max_lat;
+	unsigned long long beyond_hist_bound_samples;
+	unsigned long long accumulate_lat;
+	unsigned long long total_samples;
+	unsigned long long hist_array[MAX_ENTRY_NUM];
+};
+
+static char *latency_hist_dir_root = "latency_hist";
+
+#ifdef CONFIG_INTERRUPT_OFF_HIST
+static DEFINE_PER_CPU(struct hist_data, interrupt_off_hist);
+static char *interrupt_off_hist_dir = "interrupt_off_latency";
+#endif
+
+#ifdef CONFIG_PREEMPT_OFF_HIST
+static DEFINE_PER_CPU(struct hist_data, preempt_off_hist);
+static char *preempt_off_hist_dir = "preempt_off_latency";
+#endif
+
+#if defined(CONFIG_PREEMPT_OFF_HIST) && defined(CONFIG_INTERRUPT_OFF_HIST)
+static DEFINE_PER_CPU(struct hist_data, preempt_irqs_off_hist);
+static char *preempt_irqs_off_hist_dir = "preempt_interrupts_off_latency";
+#endif
+
+#ifdef CONFIG_WAKEUP_LATENCY_HIST
+static DEFINE_PER_CPU(struct hist_data, wakeup_latency_hist);
+static char *wakeup_latency_hist_dir = "wakeup_latency";
+#endif
+
+static inline u64 u64_div(u64 x, u64 y)
+{
+	do_div(x, y);
+	return x;
+}
+
+void notrace latency_hist(int latency_type, int cpu, unsigned long latency)
+{
+	struct hist_data *my_hist;
+
+	if ((cpu < 0) || (cpu >= NR_CPUS) || (latency_type < INTERRUPT_LATENCY)
+			|| (latency_type > WAKEUP_LATENCY) || (latency < 0))
+		return;
+
+	switch (latency_type) {
+#ifdef CONFIG_INTERRUPT_OFF_HIST
+	case INTERRUPT_LATENCY:
+		my_hist = &per_cpu(interrupt_off_hist, cpu);
+		break;
+#endif
+
+#ifdef CONFIG_PREEMPT_OFF_HIST
+	case PREEMPT_LATENCY:
+		my_hist = &per_cpu(preempt_off_hist, cpu);
+		break;
+#endif
+
+#if defined(CONFIG_PREEMPT_OFF_HIST) && defined(CONFIG_INTERRUPT_OFF_HIST)
+	case PREEMPT_INTERRUPT_LATENCY:
+		my_hist = &per_cpu(preempt_irqs_off_hist, cpu);
+		break;
+#endif
+
+#ifdef CONFIG_WAKEUP_LATENCY_HIST
+	case WAKEUP_LATENCY:
+		my_hist = &per_cpu(wakeup_latency_hist, cpu);
+		break;
+#endif
+	default:
+		return;
+	}
+
+	if (atomic_read(&my_hist->hist_mode) == 0)
+		return;
+
+	if (latency >= MAX_ENTRY_NUM)
+		my_hist->beyond_hist_bound_samples++;
+	else
+		my_hist->hist_array[latency]++;
+
+	if (latency < my_hist->min_lat)
+		my_hist->min_lat = latency;
+	else if (latency > my_hist->max_lat)
+		my_hist->max_lat = latency;
+
+	my_hist->total_samples++;
+	my_hist->accumulate_lat += latency;
+	my_hist->avg_lat = (unsigned long) u64_div(my_hist->accumulate_lat,
+						  my_hist->total_samples);
+	return;
+}
+
+static void *l_start(struct seq_file *m, loff_t *pos)
+{
+	loff_t *index_ptr = kmalloc(sizeof(loff_t), GFP_KERNEL);
+	loff_t index = *pos;
+	struct hist_data *my_hist = m->private;
+
+	if (!index_ptr)
+		return NULL;
+
+	if (index == 0) {
+		atomic_dec(&my_hist->hist_mode);
+		seq_printf(m, "#Minimum latency: %lu microseconds.\n"
+			   "#Average latency: %lu microseconds.\n"
+			   "#Maximum latency: %lu microseconds.\n"
+			   "#Total samples: %llu\n"
+			   "#There are %llu samples greater or equal"
+			   " than %d microseconds\n"
+			   "#usecs\t%16s\n"
+			   , my_hist->min_lat
+			   , my_hist->avg_lat
+			   , my_hist->max_lat
+			   , my_hist->total_samples
+			   , my_hist->beyond_hist_bound_samples
+			   , MAX_ENTRY_NUM, "samples");
+	}
+	if (index >= MAX_ENTRY_NUM)
+		return NULL;
+
+	*index_ptr = index;
+	return index_ptr;
+}
+
+static void *l_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	loff_t *index_ptr = p;
+	struct hist_data *my_hist = m->private;
+
+	if (++*pos >= MAX_ENTRY_NUM) {
+		atomic_inc(&my_hist->hist_mode);
+		return NULL;
+	}
+	*index_ptr = *pos;
+	return index_ptr;
+}
+
+static void l_stop(struct seq_file *m, void *p)
+{
+	kfree(p);
+}
+
+static int l_show(struct seq_file *m, void *p)
+{
+	int index = *(loff_t *) p;
+	struct hist_data *my_hist = m->private;
+
+	seq_printf(m, "%5d\t%16llu\n", index, my_hist->hist_array[index]);
+	return 0;
+}
+
+static struct seq_operations latency_hist_seq_op = {
+	.start = l_start,
+	.next  = l_next,
+	.stop  = l_stop,
+	.show  = l_show
+};
+
+static int latency_hist_open(struct inode *inode, struct file *file)
+{
+	int ret;
+
+	ret = seq_open(file, &latency_hist_seq_op);
+	if (!ret) {
+		struct seq_file *seq = file->private_data;
+		seq->private = inode->i_private;
+	}
+	return ret;
+}
+
+static struct file_operations latency_hist_fops = {
+	.open = latency_hist_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+static void hist_reset(struct hist_data *hist)
+{
+	atomic_dec(&hist->hist_mode);
+
+	memset(hist->hist_array, 0, sizeof(hist->hist_array));
+	hist->beyond_hist_bound_samples = 0UL;
+	hist->min_lat = 0xFFFFFFFFUL;
+	hist->max_lat = 0UL;
+	hist->total_samples = 0UL;
+	hist->accumulate_lat = 0UL;
+	hist->avg_lat = 0UL;
+
+	atomic_inc(&hist->hist_mode);
+}
+
+ssize_t latency_hist_reset(struct file *file, const char __user *a,
+			   size_t size, loff_t *off)
+{
+	int cpu;
+	struct hist_data *hist;
+	int latency_type = (long)file->private_data;
+
+	switch (latency_type) {
+
+#ifdef CONFIG_WAKEUP_LATENCY_HIST
+	case WAKEUP_LATENCY:
+		for_each_online_cpu(cpu) {
+			hist = &per_cpu(wakeup_latency_hist, cpu);
+			hist_reset(hist);
+		}
+		break;
+#endif
+
+#ifdef CONFIG_PREEMPT_OFF_HIST
+	case PREEMPT_LATENCY:
+		for_each_online_cpu(cpu) {
+			hist = &per_cpu(preempt_off_hist, cpu);
+			hist_reset(hist);
+		}
+		break;
+#endif
+
+#ifdef CONFIG_INTERRUPT_OFF_HIST
+	case INTERRUPT_LATENCY:
+		for_each_online_cpu(cpu) {
+			hist = &per_cpu(interrupt_off_hist, cpu);
+			hist_reset(hist);
+		}
+		break;
+#endif
+	}
+
+	return size;
+}
+
+static struct file_operations latency_hist_reset_fops = {
+	.open = tracing_open_generic,
+	.write = latency_hist_reset,
+};
+
+#if defined(CONFIG_INTERRUPT_OFF_HIST) || defined(CONFIG_PREEMPT_OFF_HIST)
+#ifdef CONFIG_INTERRUPT_OFF_HIST
+static DEFINE_PER_CPU(cycles_t, hist_irqsoff_start);
+static DEFINE_PER_CPU(int, hist_irqsoff_tracing);
+#endif
+#ifdef CONFIG_PREEMPT_OFF_HIST
+static DEFINE_PER_CPU(cycles_t, hist_preemptoff_start);
+static DEFINE_PER_CPU(int, hist_preemptoff_tracing);
+#endif
+#if defined(CONFIG_INTERRUPT_OFF_HIST) && defined(CONFIG_PREEMPT_OFF_HIST)
+static DEFINE_PER_CPU(cycles_t, hist_preemptirqsoff_start);
+static DEFINE_PER_CPU(int, hist_preemptirqsoff_tracing);
+#endif
+
+notrace void tracing_hist_preempt_start(void)
+{
+	cycle_t start;
+	int cpu;
+
+	start = now();
+	cpu = smp_processor_id();
+
+#ifdef CONFIG_INTERRUPT_OFF_HIST
+	if (!per_cpu(hist_irqsoff_tracing, cpu) &&
+	    irqs_disabled()) {
+		per_cpu(hist_irqsoff_tracing, cpu) = 1;
+		per_cpu(hist_irqsoff_start, cpu) = start;
+	}
+#endif
+
+#ifdef CONFIG_PREEMPT_OFF_HIST
+	if (!per_cpu(hist_preemptoff_tracing, cpu) &&
+	    preempt_count()) {
+		per_cpu(hist_preemptoff_tracing, cpu) = 1;
+		per_cpu(hist_preemptoff_start, cpu) = start;
+	}
+#endif
+
+#if defined(CONFIG_INTERRUPT_OFF_HIST) && defined(CONFIG_PREEMPT_OFF_HIST)
+	if (!per_cpu(hist_preemptirqsoff_tracing, cpu) &&
+	    (preempt_count() || irqs_disabled())) {
+		per_cpu(hist_preemptirqsoff_tracing, cpu) = 1;
+		per_cpu(hist_preemptirqsoff_start, cpu) = start;
+	}
+#endif
+}
+
+notrace void tracing_hist_preempt_stop(int irqs_on)
+{
+	long latency;
+	cycle_t start;
+	cycle_t stop;
+	int cpu;
+
+	stop = now();
+	cpu = smp_processor_id();
+
+	/* irqs_on == TRACE_STOP if we must stop tracing. */
+
+#ifdef CONFIG_INTERRUPT_OFF_HIST
+	if (per_cpu(hist_irqsoff_tracing, cpu) && irqs_on) {
+		per_cpu(hist_irqsoff_tracing, cpu) = 0;
+		start = per_cpu(hist_irqsoff_start, cpu);
+		latency = (long)cycles_to_usecs(stop - start);
+		latency_hist(INTERRUPT_LATENCY, smp_processor_id(), latency);
+	}
+#endif
+
+#ifdef CONFIG_PREEMPT_OFF_HIST
+	if (per_cpu(hist_preemptoff_tracing, cpu) &&
+	    (!preempt_count() || irqs_on == TRACE_STOP)) {
+		per_cpu(hist_preemptoff_tracing, cpu) = 0;
+		start = per_cpu(hist_preemptoff_start, cpu);
+		latency = (long)cycles_to_usecs(stop - start);
+		latency_hist(PREEMPT_LATENCY, smp_processor_id(), latency);
+	}
+#endif
+
+#if defined(CONFIG_INTERRUPT_OFF_HIST) && defined(CONFIG_PREEMPT_OFF_HIST)
+	if (per_cpu(hist_preemptirqsoff_tracing, cpu) &&
+	    ((!preempt_count() && irqs_on) || irqs_on == TRACE_STOP)) {
+		per_cpu(hist_preemptirqsoff_tracing, cpu) = 0;
+		start = per_cpu(hist_preemptirqsoff_start, cpu);
+		latency = (long)cycles_to_usecs(stop - start);
+		latency_hist(PREEMPT_INTERRUPT_LATENCY,
+			     smp_processor_id(), latency);
+	}
+#endif
+}
+#endif
+
+#ifdef CONFIG_WAKEUP_LATENCY_HIST
+int tracing_wakeup_hist __read_mostly = 1;
+
+static unsigned wakeup_prio = (unsigned)-1;
+static struct task_struct *wakeup_task;
+static cycle_t wakeup_start;
+static DEFINE_SPINLOCK(wakeup_lock);
+
+notrace void tracing_hist_wakeup_start(struct task_struct *p,
+				       struct task_struct *curr)
+{
+	unsigned long flags;
+
+	if (likely(!rt_task(p)) ||
+	    p->prio >= wakeup_prio ||
+	    p->prio >= curr->prio)
+		return;
+
+	spin_lock_irqsave(&wakeup_lock, flags);
+	if (wakeup_task)
+		put_task_struct(wakeup_task);
+
+	get_task_struct(p);
+	wakeup_task = p;
+	wakeup_prio = p->prio;
+	wakeup_start = now();
+	spin_unlock_irqrestore(&wakeup_lock, flags);
+}
+
+notrace void tracing_hist_wakeup_stop(struct task_struct *next)
+{
+	unsigned long flags;
+	long latency;
+	cycle_t stop;
+
+	if (next != wakeup_task)
+		return;
+
+	stop = now();
+
+	spin_lock_irqsave(&wakeup_lock, flags);
+	if (wakeup_task != next)
+		goto out;
+
+	latency = (long)cycles_to_usecs(stop - wakeup_start);
+
+	latency_hist(WAKEUP_LATENCY, smp_processor_id(), latency);
+
+	put_task_struct(wakeup_task);
+	wakeup_task = NULL;
+	wakeup_prio = (unsigned)-1;
+ out:
+	spin_unlock_irqrestore(&wakeup_lock, flags);
+
+}
+#endif
+
+static __init int latency_hist_init(void)
+{
+	struct dentry *latency_hist_root = NULL;
+	struct dentry *dentry;
+	struct dentry *entry;
+	int i = 0, len = 0;
+	struct hist_data *my_hist;
+	char name[64];
+
+	dentry = tracing_init_dentry();
+
+	latency_hist_root =
+		debugfs_create_dir(latency_hist_dir_root, dentry);
+
+#ifdef CONFIG_INTERRUPT_OFF_HIST
+	dentry = debugfs_create_dir(interrupt_off_hist_dir,
+				    latency_hist_root);
+	for_each_possible_cpu(i) {
+		len = sprintf(name, "CPU%d", i);
+		name[len] = '\0';
+		entry = debugfs_create_file(name, 0444, dentry,
+					    &per_cpu(interrupt_off_hist, i),
+					    &latency_hist_fops);
+		my_hist = &per_cpu(interrupt_off_hist, i);
+		atomic_set(&my_hist->hist_mode, 1);
+		my_hist->min_lat = 0xFFFFFFFFUL;
+	}
+	entry = debugfs_create_file("reset", 0444, dentry,
+				    (void *)INTERRUPT_LATENCY,
+				    &latency_hist_reset_fops);
+#endif
+
+#ifdef CONFIG_PREEMPT_OFF_HIST
+	dentry = debugfs_create_dir(preempt_off_hist_dir,
+				    latency_hist_root);
+	for_each_possible_cpu(i) {
+		len = sprintf(name, "CPU%d", i);
+		name[len] = '\0';
+		entry = debugfs_create_file(name, 0444, dentry,
+					    &per_cpu(preempt_off_hist, i),
+					    &latency_hist_fops);
+		my_hist = &per_cpu(preempt_off_hist, i);
+		atomic_set(&my_hist->hist_mode, 1);
+		my_hist->min_lat = 0xFFFFFFFFUL;
+	}
+	entry = debugfs_create_file("reset", 0444, dentry,
+				    (void *)PREEMPT_LATENCY,
+				    &latency_hist_reset_fops);
+#endif
+
+#if defined(CONFIG_INTERRUPT_OFF_HIST) && defined(CONFIG_PREEMPT_OFF_HIST)
+	dentry = debugfs_create_dir(preempt_irqs_off_hist_dir,
+				    latency_hist_root);
+	for_each_possible_cpu(i) {
+		len = sprintf(name, "CPU%d", i);
+		name[len] = '\0';
+		entry = debugfs_create_file(name, 0444, dentry,
+					    &per_cpu(preempt_off_hist, i),
+					    &latency_hist_fops);
+		my_hist = &per_cpu(preempt_irqs_off_hist, i);
+		atomic_set(&my_hist->hist_mode, 1);
+		my_hist->min_lat = 0xFFFFFFFFUL;
+	}
+	entry = debugfs_create_file("reset", 0444, dentry,
+				    (void *)PREEMPT_INTERRUPT_LATENCY,
+				    &latency_hist_reset_fops);
+#endif
+
+#ifdef CONFIG_WAKEUP_LATENCY_HIST
+	dentry = debugfs_create_dir(wakeup_latency_hist_dir,
+				    latency_hist_root);
+	for_each_possible_cpu(i) {
+		len = sprintf(name, "CPU%d", i);
+		name[len] = '\0';
+		entry = debugfs_create_file(name, 0444, dentry,
+					    &per_cpu(wakeup_latency_hist, i),
+					    &latency_hist_fops);
+		my_hist = &per_cpu(wakeup_latency_hist, i);
+		atomic_set(&my_hist->hist_mode, 1);
+		my_hist->min_lat = 0xFFFFFFFFUL;
+	}
+	entry = debugfs_create_file("reset", 0444, dentry,
+				    (void *)WAKEUP_LATENCY,
+				    &latency_hist_reset_fops);
+#endif
+	return 0;
+
+}
+
+__initcall(latency_hist_init);
Index: linux-mcount.git/lib/tracing/tracer_hist.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git/lib/tracing/tracer_hist.h	2008-01-29 21:34:30.000000000 -0500
@@ -0,0 +1,39 @@
+/*
+ * lib/tracing/tracer_hist.h
+ *
+ * Add support for histograms of preemption-off latency,
+ * interrupt-off latency, and wakeup latency. Depends on
+ * Real-Time Preemption support.
+ *
+ *  Copyright (C) 2005 MontaVista Software, Inc.
+ *  Yi Yang <yyang@ch.mvista.com>
+ *
+ *  Converted to work with the new latency tracer.
+ *  Copyright (C) 2008 Red Hat, Inc.
+ *    Steven Rostedt <srostedt@redhat.com>
+ *
+ */
+#ifndef _LIB_TRACING_TRACER_HIST_H_
+#define _LIB_TRACING_TRACER_HIST_H_
+
+#if defined(CONFIG_INTERRUPT_OFF_HIST) || defined(CONFIG_PREEMPT_OFF_HIST)
+# define TRACE_STOP 2
+void tracing_hist_preempt_start(void);
+void tracing_hist_preempt_stop(int irqs_on);
+#else
+# define tracing_hist_preempt_start() do { } while (0)
+# define tracing_hist_preempt_stop(irqs_off) do { } while (0)
+#endif
+
+#ifdef CONFIG_WAKEUP_LATENCY_HIST
+void tracing_hist_wakeup_start(struct task_struct *p,
+			       struct task_struct *curr);
+void tracing_hist_wakeup_stop(struct task_struct *next);
+extern int tracing_wakeup_hist;
+#else
+# define tracing_hist_wakeup_start(p, curr) do { } while (0)
+# define tracing_hist_wakeup_stop(next) do { } while (0)
+# define tracing_wakeup_hist 0
+#endif
+
+#endif /* ifndef _LIB_TRACING_TRACER_HIST_H_ */

-- 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-01-30  3:15 ` [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation Steven Rostedt
@ 2008-01-30  8:46   ` Peter Zijlstra
  2008-01-30 13:08     ` Steven Rostedt
  2008-01-30 13:21   ` Jan Kiszka
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2008-01-30  8:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt


On Tue, 2008-01-29 at 22:15 -0500, Steven Rostedt wrote:

> +int register_mcount_function(struct mcount_ops *ops)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&mcount_func_lock, flags);
> +	ops->next = mcount_list;
> +	/* must have next seen before we update the list pointer */
> +	smp_wmb();

That comment does not explain which race it closes; this is esp
important as there is no paired barrier to give hints.

> +	mcount_list = ops;
> +	/*
> +	 * For one func, simply call it directly.
> +	 * For more than one func, call the chain.
> +	 */
> +	if (ops->next == &mcount_list_end)
> +		mcount_trace_function = ops->func;
> +	else
> +		mcount_trace_function = mcount_list_func;
> +	spin_unlock_irqrestore(&mcount_func_lock, flags);
> +
> +	return 0;
> +}




* Re: [PATCH 05/22 -v7] add notrace annotations to vsyscall.
  2008-01-30  3:15 ` [PATCH 05/22 -v7] add notrace annotations to vsyscall Steven Rostedt
@ 2008-01-30  8:49   ` Peter Zijlstra
  2008-01-30 13:15     ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2008-01-30  8:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt


On Tue, 2008-01-29 at 22:15 -0500, Steven Rostedt wrote:
> plain text document attachment
> (mcount-add-x86-vdso-notrace-annotations.patch)
> Add the notrace annotations to some of the vsyscall functions.

Would the VDSO stuff crash a kernel without these annotations and
CONFIG_MCOUNT=y?

If so, I think it would be best if you placed these annotations before
adding the core code; this would improve bisectability.



* Re: [PATCH 17/22 -v7] mcount tracer for wakeup latency timings.
  2008-01-30  3:15 ` [PATCH 17/22 -v7] mcount tracer for wakeup latency timings Steven Rostedt
@ 2008-01-30  9:31   ` Peter Zijlstra
  2008-01-30 13:18     ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2008-01-30  9:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt


On Tue, 2008-01-29 at 22:15 -0500, Steven Rostedt wrote:

> +static void notrace __wakeup_reset(struct tracing_trace *tr)
> +{
> +	struct tracing_trace_cpu *data;
> +	int cpu;
> +
> +	assert_spin_locked(&wakeup_lock);
> +
> +	for_each_possible_cpu(cpu) {
> +		data = tr->data[cpu];
> +		tracing_reset(data);
> +	}
> +
> +	wakeup_cpu = -1;
> +	wakeup_prio = -1;
> +	if (wakeup_task) {
> +		put_task_struct(wakeup_task);
> +		tracing_stop_function_trace();
> +	}
> +
> +	wakeup_task = NULL;
> +
> +	/*
> +	 * Don't let the trace_enabled = 1 show up before
> +	 * the wakeup_task is reset.
> +	 */
> +	smp_wmb();
> +}

Another un-balanced barrier.



* Re: [PATCH 19/22 -v7] trace preempt off critical timings
  2008-01-30  3:15 ` [PATCH 19/22 -v7] trace preempt off " Steven Rostedt
@ 2008-01-30  9:40   ` Peter Zijlstra
  2008-01-30 13:40     ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2008-01-30  9:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt


On Tue, 2008-01-29 at 22:15 -0500, Steven Rostedt wrote:

> +static DEFINE_PER_CPU(int, tracing_cpu);

Is the per-cpu tracing not also needed for irq off tracing?

Also, its not mentioned in the changelog.



* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-01-30  8:46   ` Peter Zijlstra
@ 2008-01-30 13:08     ` Steven Rostedt
  2008-01-30 14:09       ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30 13:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt



On Wed, 30 Jan 2008, Peter Zijlstra wrote:

>
> On Tue, 2008-01-29 at 22:15 -0500, Steven Rostedt wrote:
>
> > +int register_mcount_function(struct mcount_ops *ops)
> > +{
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&mcount_func_lock, flags);
> > +	ops->next = mcount_list;
> > +	/* must have next seen before we update the list pointer */
> > +	smp_wmb();
>
> That comment does not explain which race it closes; this is esp
> important as there is no paired barrier to give hints.

OK, fair enough. I'll explain it a bit more.

How's this:

 /*
  * We are entering ops into the mcount_list but another
  * CPU might be walking that list. We need to make sure
  * the ops->next pointer is valid before another CPU sees
  * the ops pointer included into the mcount_list.
  */

-- Steve

>
> > +	mcount_list = ops;
> > +	/*
> > +	 * For one func, simply call it directly.
> > +	 * For more than one func, call the chain.
> > +	 */
> > +	if (ops->next == &mcount_list_end)
> > +		mcount_trace_function = ops->func;
> > +	else
> > +		mcount_trace_function = mcount_list_func;
> > +	spin_unlock_irqrestore(&mcount_func_lock, flags);
> > +
> > +	return 0;
> > +}
>
>
>


* Re: [PATCH 05/22 -v7] add notrace annotations to vsyscall.
  2008-01-30  8:49   ` Peter Zijlstra
@ 2008-01-30 13:15     ` Steven Rostedt
  0 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30 13:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt


On Wed, 30 Jan 2008, Peter Zijlstra wrote:

>
> On Tue, 2008-01-29 at 22:15 -0500, Steven Rostedt wrote:
> > plain text document attachment
> > (mcount-add-x86-vdso-notrace-annotations.patch)
> > Add the notrace annotations to some of the vsyscall functions.
>
> Would the VDSO stuff crash a kernel without these annotations and
> CONFIG_MCOUNT=y?
>
> If so, I think it would be best if you placed these annotations before
> adding the core code; this would improve bisectability.

The thing is that MCOUNT needs something to select it; it's not a visible
option:

config MCOUNT
        bool
        select FRAME_POINTER

Nothing selects MCOUNT before the annotations are in, which really makes
the mcount code a nop until the tracer code comes in. A bisect (as long as
make oldconfig is done) should not be harmed by this patch ordering.
Note: the mcount patch introduces the notrace annotation, which means
reversing the current order would break the build.

-- Steve



* Re: [PATCH 17/22 -v7] mcount tracer for wakeup latency timings.
  2008-01-30  9:31   ` Peter Zijlstra
@ 2008-01-30 13:18     ` Steven Rostedt
  0 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30 13:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt


On Wed, 30 Jan 2008, Peter Zijlstra wrote:

>
> On Tue, 2008-01-29 at 22:15 -0500, Steven Rostedt wrote:
>
> > +static void notrace __wakeup_reset(struct tracing_trace *tr)
[...]
> > +
> > +	wakeup_task = NULL;
> > +
> > +	/*
> > +	 * Don't let the trace_enabled = 1 show up before
> > +	 * the wakeup_task is reset.
> > +	 */
> > +	smp_wmb();
> > +}
>
> Another un-balanced barrier.


This one is probably me being paranoid, but even so, it should have a
pair.  Will fix.

-- Steve



* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-01-30  3:15 ` [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation Steven Rostedt
  2008-01-30  8:46   ` Peter Zijlstra
@ 2008-01-30 13:21   ` Jan Kiszka
  2008-01-30 13:53     ` Steven Rostedt
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2008-01-30 13:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, John Stultz, Arjan van de Ven,
	Steven Rostedt

Steven Rostedt wrote:

> --- linux-mcount.git.orig/arch/x86/kernel/entry_32.S	2008-01-29 16:59:15.000000000 -0500
> +++ linux-mcount.git/arch/x86/kernel/entry_32.S	2008-01-29 17:26:18.000000000 -0500
> @@ -75,6 +75,31 @@ DF_MASK		= 0x00000400 
>  NT_MASK		= 0x00004000
>  VM_MASK		= 0x00020000
>  
> +#ifdef CONFIG_MCOUNT
> +.globl mcount
> +mcount:
> +	/* unlikely(mcount_enabled) */
> +	cmpl $0, mcount_enabled
> +	jnz trace
> +	ret

(and the corresponding 64-bit version)

Is the impact of this change on the (already expensive) mcount_enabled
case negligible? I worry about use cases where we want to gain some
(relative) worst-case numbers via these instrumentations.

In my personal priority scheme, CONFIG_MCOUNT=y && !mcount_enabled comes
after mcount_enabled.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


* Re: [PATCH 19/22 -v7] trace preempt off critical timings
  2008-01-30  9:40   ` Peter Zijlstra
@ 2008-01-30 13:40     ` Steven Rostedt
  0 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30 13:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt


On Wed, 30 Jan 2008, Peter Zijlstra wrote:

>
> On Tue, 2008-01-29 at 22:15 -0500, Steven Rostedt wrote:
>
> > +static DEFINE_PER_CPU(int, tracing_cpu);
>
> Is the per-cpu tracing not also needed for irq off tracing?

The preempt-off code shares the irq-off code. With irq-off tracing only
(before this patch) all that was needed was to check whether trace_enabled
was set. Now that the preempt-off code is coupled with the irqs-off code, we
may hit some paths where tracing is enabled but we are not currently
tracing. This variable was added to track that state.

>
> Also, its not mentioned in the changelog.

It was more of an internal design change than a feature.

-- Steve


* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-01-30 13:21   ` Jan Kiszka
@ 2008-01-30 13:53     ` Steven Rostedt
  2008-01-30 14:28       ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30 13:53 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: LKML, Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, John Stultz, Arjan van de Ven,
	Steven Rostedt


On Wed, 30 Jan 2008, Jan Kiszka wrote:

> Steven Rostedt wrote:
>
> > --- linux-mcount.git.orig/arch/x86/kernel/entry_32.S	2008-01-29 16:59:15.000000000 -0500
> > +++ linux-mcount.git/arch/x86/kernel/entry_32.S	2008-01-29 17:26:18.000000000 -0500
> > @@ -75,6 +75,31 @@ DF_MASK		= 0x00000400
> >  NT_MASK		= 0x00004000
> >  VM_MASK		= 0x00020000
> >
> > +#ifdef CONFIG_MCOUNT
> > +.globl mcount
> > +mcount:
> > +	/* unlikely(mcount_enabled) */
> > +	cmpl $0, mcount_enabled
> > +	jnz trace
> > +	ret
>
> (and the corresponding 64-bit version)
>
> Is the impact of this change on the (already expensive) mcount_enabled
> case negligible? I worry about use cases where we want to gain some
> (relative) worst-case numbers via these instrumentations.

The goal here was to limit the instruction cache hit that we take when
mcount_enabled = 0.
>
> In my personal priority scheme, CONFIG_MCOUNT=y && !mcount_enabled comes
> after mcount_enabled.

Well, actually, I disagree. I only set mcount_enabled=1 when I'm about to
test something. You're right that we want the impact of the test least
affected, but when we have mcount_enabled=1 we usually also have a
function attached, and in that case this change is negligible. But in
the normal case, where mcount_enabled=0, this change may have a bigger
impact.

Remember CONFIG_MCOUNT=y && mcount_enabled=0 is (15% overhead)
         CONFIG_MCOUNT=y && mcount_enabled=1 dummy func (49% overhead)
         CONFIG_MCOUNT=y && mcount_enabled=1 trace func (500% overhead)

The trace func is the one that will most likely be used when analyzing. It
gives hackbench a 500% overhead, so I'm expecting this change to be
negligible in that case. But after I find what's wrong, I'd rather keep
running the same kernel without rebuilding and rebooting, so I like to have
mcount_enabled=0 have the smallest impact ;-)

I'll put back the original code and run some new numbers.

-- Steve



* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-01-30 13:08     ` Steven Rostedt
@ 2008-01-30 14:09       ` Steven Rostedt
  2008-01-30 14:25         ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30 14:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Peter Zijlstra, Arjan van de Ven, Steven Rostedt


Paul,

Peter and I are having a discussion on the craziness of archs and memory
barriers. You seem to understand crazy archs pretty well, and we would
like some advice. :-)

See below:

On Wed, 30 Jan 2008, Steven Rostedt wrote:

>
>
> On Wed, 30 Jan 2008, Peter Zijlstra wrote:
>
> >
> > On Tue, 2008-01-29 at 22:15 -0500, Steven Rostedt wrote:
> >
> > > +int register_mcount_function(struct mcount_ops *ops)
> > > +{
> > > +	unsigned long flags;
> > > +
> > > +	spin_lock_irqsave(&mcount_func_lock, flags);
> > > +	ops->next = mcount_list;
> > > +	/* must have next seen before we update the list pointer */
> > > +	smp_wmb();
> >
> > That comment does not explain which race it closes; this is esp
> > important as there is no paired barrier to give hints.
>
> OK, fair enough. I'll explain it a bit more.
>
> How's this:
>
>  /*
>   * We are entering ops into the mcount_list but another
>   * CPU might be walking that list. We need to make sure
>   * the ops->next pointer is valid before another CPU sees
>   * the ops pointer included into the mcount_list.
>   */
>

The above is my new comment. But Peter says that it's still not good
enough and that all write memory barriers need read barriers. Let me
explain the situation here.

We have a single link list called mcount_list that is walked when more
than one function is registered by mcount. Mcount is called at the start
of all C functions that are not annotated with "notrace". When more than
one function is registered, mcount calls a loop function that does the
following:

notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
{
        struct mcount_ops *op = mcount_list;

        while (op != &mcount_list_end) {
                op->func(ip, parent_ip);
                op = op->next;
        };
}

A registered function must already have a "func" filled, and the mcount
register code takes care of "next".  It is documented that the calling
function should "never" change next and always expect that the func can be
called after it is unregistered. That's not the issue here.

The issue is how to insert the ops into the list. I've done the following,
as you can see in the code this text is inserted between.

   ops->next = mcount_list;
   smp_wmb();
   mcount_list = ops;

The read side pair is the reading of ops to ops->next, which should imply
a smp_rmb() just by the logic. But Peter tells me things like alpha is
crazy enough to do better than that! Thus, I'm asking you.

Can some arch have a reader where it receives ops->next before it receives
ops? This seems to me to be a psychic arch, to know where ops->next is
before it knows ops!

Remember, that the ops that is being registered, is not viewable by any
other CPU until mcount_list = ops. I don't see the need for a read barrier
in this case. But I could very well be wrong.
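(The pattern under discussion, reduced to a self-contained C sketch for
reference. The names follow the patch; the mcount_func_lock is omitted and
smp_wmb() is stubbed with a compiler barrier, since this toy runs
single-threaded and only illustrates the intended publish order, not a
real cross-CPU race.)

```c
#include <assert.h>
#include <stddef.h>

struct mcount_ops {
	void (*func)(unsigned long ip, unsigned long parent_ip);
	struct mcount_ops *next;
};

static struct mcount_ops mcount_list_end;	/* sentinel node */
static struct mcount_ops *mcount_list = &mcount_list_end;

/* stand-in for the arch smp_wmb(); only a compiler barrier here */
#define smp_wmb() __asm__ __volatile__("" ::: "memory")

static int register_mcount_function(struct mcount_ops *ops)
{
	/* spin_lock_irqsave(&mcount_func_lock, flags) omitted in sketch */
	ops->next = mcount_list;
	smp_wmb();		/* publish ->next before the list head */
	mcount_list = ops;
	return 0;
}

static int ncalls;

static void count_func(unsigned long ip, unsigned long parent_ip)
{
	(void)ip; (void)parent_ip;
	ncalls++;
}

static void mcount_list_func(unsigned long ip, unsigned long parent_ip)
{
	struct mcount_ops *op = mcount_list;

	while (op != &mcount_list_end) {
		op->func(ip, parent_ip);
		op = op->next;
	}
}

int run_demo(void)
{
	static struct mcount_ops a = { count_func, NULL };
	static struct mcount_ops b = { count_func, NULL };

	register_mcount_function(&a);
	register_mcount_function(&b);
	mcount_list_func(0, 0);
	return ncalls;		/* both registered funcs were walked */
}
```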

Help!

-- Steve


>
> >
> > > +	mcount_list = ops;
> > > +	/*
> > > +	 * For one func, simply call it directly.
> > > +	 * For more than one func, call the chain.
> > > +	 */
> > > +	if (ops->next == &mcount_list_end)
> > > +		mcount_trace_function = ops->func;
> > > +	else
> > > +		mcount_trace_function = mcount_list_func;
> > > +	spin_unlock_irqrestore(&mcount_func_lock, flags);
> > > +
> > > +	return 0;
> > > +}
> >
> >
> >
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-01-30 14:09       ` Steven Rostedt
@ 2008-01-30 14:25         ` Peter Zijlstra
  2008-02-01 22:34           ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2008-01-30 14:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, LKML, Ingo Molnar, Linus Torvalds,
	Andrew Morton, Christoph Hellwig, Mathieu Desnoyers,
	Gregory Haskins, Arnaldo Carvalho de Melo, Thomas Gleixner,
	Tim Bird, Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka,
	John Stultz, Arjan van de Ven, Steven Rostedt


On Wed, 2008-01-30 at 09:09 -0500, Steven Rostedt wrote:
> Paul,
> 
> Peter and I are having a discussion on craziness of archs and memory
> barriers. You seem to understand crazy archs pretty well, and we would
> like some advice. :-)
> 
> See below:
> 
> On Wed, 30 Jan 2008, Steven Rostedt wrote:
> 
> >
> >
> > On Wed, 30 Jan 2008, Peter Zijlstra wrote:
> >
> > >
> > > On Tue, 2008-01-29 at 22:15 -0500, Steven Rostedt wrote:
> > >
> > > > +int register_mcount_function(struct mcount_ops *ops)
> > > > +{
> > > > +	unsigned long flags;
> > > > +
> > > > +	spin_lock_irqsave(&mcount_func_lock, flags);
> > > > +	ops->next = mcount_list;
> > > > +	/* must have next seen before we update the list pointer */
> > > > +	smp_wmb();
> > >
> > > That comment does not explain which race it closes; this is esp
> > > important as there is no paired barrier to give hints.
> >
> > OK, fair enough. I'll explain it a bit more.
> >
> > How's this:
> >
> >  /*
> >   * We are entering ops into the mcount_list but another
> >   * CPU might be walking that list. We need to make sure
> >   * the ops->next pointer is valid before another CPU sees
> >   * the ops pointer included into the mcount_list.
> >   */
> >
> 
> The above is my new comment. But Peter says that it's still not good
> enough and that all write memory barriers need read barriers.

To clarify, either: full mb, rmb or read depend.

> Let me explain the situation here.
> 
> We have a single link list called mcount_list that is walked when more
> than one function is registered by mcount. Mcount is called at the start
> of all C functions that are not annotated with "notrace". When more than
> one function is registered, mcount calls a loop function that does the
> following:
> 
> notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
> {
>         struct mcount_ops *op = mcount_list;

When thinking RCU, this would be rcu_dereference and imply a read
barrier.

>         while (op != &mcount_list_end) {
>                 op->func(ip, parent_ip);
>                 op = op->next;

Same here; the rcu_dereference() would do the read depend barrier.

>         };
> }
> 
> A registered function must already have a "func" filled, and the mcount
> register code takes care of "next".  It is documented that the calling
> function should "never" change next and always expect that the func can be
> called after it is unregistered. That's not the issue here.
> 
> The issue is how to insert the ops into the list. I've done the following,
> as you can see in the code this text is inserted between.
> 
>    ops->next = mcount_list;
>    smp_wmb();
>    mcount_list = ops;
> 
> The read side pair is the reading of ops to ops->next, which should imply
> a smp_rmb() just by the logic. But Peter tells me things like alpha is
> crazy enough to do better than that! Thus, I'm asking you.
> 
> Can some arch have a reader where it receives ops->next before it receives
> ops? This seems to me to be a psychic arch, to know where ops->next is
> before it knows ops!
> 
> Remember, that the ops that is being registered, is not viewable by any
> other CPU until mcount_list = ops. I don't see the need for a read barrier
> in this case. But I could very well be wrong.
> 
> Help!
> 
> -- Steve
> 
> 
> >
> > >
> > > > +	mcount_list = ops;
> > > > +	/*
> > > > +	 * For one func, simply call it directly.
> > > > +	 * For more than one func, call the chain.
> > > > +	 */
> > > > +	if (ops->next == &mcount_list_end)
> > > > +		mcount_trace_function = ops->func;
> > > > +	else
> > > > +		mcount_trace_function = mcount_list_func;
> > > > +	spin_unlock_irqrestore(&mcount_func_lock, flags);
> > > > +
> > > > +	return 0;
> > > > +}
> > >
> > >
> > >
> >


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-01-30 13:53     ` Steven Rostedt
@ 2008-01-30 14:28       ` Steven Rostedt
  0 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-01-30 14:28 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: LKML, Ingo Molnar, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, John Stultz, Arjan van de Ven,
	Steven Rostedt


On Wed, 30 Jan 2008, Steven Rostedt wrote:
> well, actually, I disagree. I only set mcount_enabled=1 when I'm about to
> test something. You're right that we want the impact of the test least
> affected, but when we have mcount_enabled=1 we usually also have a
> function that's attached and in that case this change is negligible. But
> on the normal case where mcount_enabled=0, this change may have a bigger
> impact.
>
> Remember CONFIG_MCOUNT=y && mcount_enabled=0 is (15% overhead)
>          CONFIG_MCOUNT=y && mcount_enabled=1 dummy func (49% overhead)
>          CONFIG_MCOUNT=y && mcount_enabled=1 trace func (500% overhead)
>
> The trace func is the one that will be most likely used when analyzing. It
> gives hackbench a 500% overhead, so I'm expecting this change to be
> negligible in that case. But after I find what's wrong, I like to rebuild
> the kernel without rebooting so I like to have mcount_enabled=0 have the
> smallest impact ;-)
>
> I'll put back the original code and run some new numbers.

I just ran with the original version of that test (on x86_64, the same box
as the previous tests were done, with the same kernel and config except
for this change)

Here's the numbers with the new design (the one that was used in this
patch):

mcount disabled:
 Avg: 4.8638 (15.934498% overhead)

mcount enabled:
 Avg: 6.2819 (49.736610% overhead)

function tracing:
 Avg: 25.2035 (500.755607% overhead)

Now changing the code to:

ENTRY(mcount)
        /* likely(mcount_enabled) */
        cmpl $0, mcount_enabled
        jz out

        /* taken from glibc */
        subq $0x38, %rsp
        movq %rax, (%rsp)
        movq %rcx, 8(%rsp)
        movq %rdx, 16(%rsp)
        movq %rsi, 24(%rsp)
        movq %rdi, 32(%rsp)
        movq %r8, 40(%rsp)
        movq %r9, 48(%rsp)

        movq 0x38(%rsp), %rsi
        movq 8(%rbp), %rdi

        call   *mcount_trace_function

        movq 48(%rsp), %r9
        movq 40(%rsp), %r8
        movq 32(%rsp), %rdi
        movq 24(%rsp), %rsi
        movq 16(%rsp), %rdx
        movq 8(%rsp), %rcx
        movq (%rsp), %rax
        addq $0x38, %rsp

out:
        retq


mcount disabled:
 Avg: 4.908 (16.988058% overhead)

mcount enabled:
 Avg: 6.244 (48.840369% overhead)

function tracing:
 Avg: 25.1963 (500.583987% overhead)


The change seems to cause a 1% overhead difference. With mcount disabled,
the newer code has a 1% performance benefit. With mcount enabled as well
as with tracing on, the old code has the 1% benefit.

But 1% has a bigger impact on something that is 15% than it does on
something that is 48% or 500%, so I'm keeping the newer version.

-- Steve


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-01-30 14:25         ` Peter Zijlstra
@ 2008-02-01 22:34           ` Paul E. McKenney
  2008-02-02  1:56             ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2008-02-01 22:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

On Wed, Jan 30, 2008 at 03:25:00PM +0100, Peter Zijlstra wrote:
> 
> On Wed, 2008-01-30 at 09:09 -0500, Steven Rostedt wrote:
> > Paul,
> > 
> > Peter and I are having a discussion on craziness of archs and memory
> > barriers. You seem to understand crazy archs pretty well, and we would
> > like some advice. :-)

OK, let's see what we have here...

> > See below:
> > 
> > On Wed, 30 Jan 2008, Steven Rostedt wrote:
> > 
> > >
> > >
> > > On Wed, 30 Jan 2008, Peter Zijlstra wrote:
> > >
> > > >
> > > > On Tue, 2008-01-29 at 22:15 -0500, Steven Rostedt wrote:
> > > >
> > > > > +int register_mcount_function(struct mcount_ops *ops)
> > > > > +{
> > > > > +	unsigned long flags;
> > > > > +
> > > > > +	spin_lock_irqsave(&mcount_func_lock, flags);
> > > > > +	ops->next = mcount_list;
> > > > > +	/* must have next seen before we update the list pointer */
> > > > > +	smp_wmb();
> > > >
> > > > That comment does not explain which race it closes; this is esp
> > > > important as there is no paired barrier to give hints.
> > >
> > > OK, fair enough. I'll explain it a bit more.
> > >
> > > How's this:
> > >
> > >  /*
> > >   * We are entering ops into the mcount_list but another
> > >   * CPU might be walking that list. We need to make sure
> > >   * the ops->next pointer is valid before another CPU sees
> > >   * the ops pointer included into the mcount_list.
> > >   */
> > >
> > 
> > The above is my new comment. But Peter says that it's still not good
> > enough and that all write memory barriers need read barriers.
> 
> To clarify, either: full mb, rmb or read depend.

This is true.  A write barrier ensures that the writes remain ordered,
but unless the reads are also ordered, the reader can still get confused.
For example (assuming all variables are initially zero):

writer:

	a = 1;
	smp_wmb();  /* or smp_mb() */
	b = 1;

reader:

	tb = b;
	ta = a;

The writer will (roughly speaking) execute the assignments in order,
but the reader might not.  If the reader executes the assignment from
"a" first, it might see tb==1&&ta==0.  To prevent this, we do:

reader:

	tb = b;
	smp_rmb();  /* or smp_mb() */
	ta = a;

There are a lot of variations on this theme.
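(A runnable illustration of that pairing, expressed with C11 atomics
rather than the kernel macros: the release store plays the role of the
write plus smp_wmb(), and the acquire load plays the role of the read
plus smp_rmb(). The mapping to stdatomic is mine, not code from the
patch.)

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int a;		/* plain data, published via the flag below */
static atomic_int b;	/* the flag; initially 0 */

static void *writer(void *arg)
{
	(void)arg;
	a = 1;
	/* release: orders the store to a before the store to b */
	atomic_store_explicit(&b, 1, memory_order_release);
	return NULL;
}

/* Spin until b == 1 is observed, then read a. */
int read_after_flag(void)
{
	pthread_t t;
	int ta;

	pthread_create(&t, NULL, writer, NULL);
	/* acquire: no later read may be satisfied before this load */
	while (atomic_load_explicit(&b, memory_order_acquire) == 0)
		;
	ta = a;		/* guaranteed 1 once b == 1 has been seen */
	pthread_join(t, NULL);
	return ta;
}
```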

> > Let me explain the situation here.
> > 
> > We have a single link list called mcount_list that is walked when more
> > than one function is registered by mcount. Mcount is called at the start
> > of all C functions that are not annotated with "notrace". When more than
> > one function is registered, mcount calls a loop function that does the
> > following:
> > 
> > notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
> > {
> >         struct mcount_ops *op = mcount_list;
> 
> When thinking RCU, this would be rcu_dereference and imply a read
> barrier.
> 
> >         while (op != &mcount_list_end) {
> >                 op->func(ip, parent_ip);
> >                 op = op->next;
> 
> Same here; the rcu_dereference() would do the read depend barrier.

Specifically:

notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
{
        struct mcount_ops *op = rcu_dereference(mcount_list);

        while (op != &mcount_list_end) {
                op->func(ip, parent_ip);
                op = rcu_dereference(op->next);

This assumes that you are using call_rcu(), synchronize_rcu(), or
whatever to defer freeing/reuse of the ops structure.

> >         };
> > }
> > 
> > A registered function must already have a "func" filled, and the mcount
> > register code takes care of "next".  It is documented that the calling
> > function should "never" change next and always expect that the func can be
> > called after it is unregistered. That's not the issue here.
> > 
> > The issue is how to insert the ops into the list. I've done the following,
> > as you can see in the code this text is inserted between.
> > 
> >    ops->next = mcount_list;
> >    smp_wmb();
> >    mcount_list = ops;
> > 
> > The read side pair is the reading of ops to ops->next, which should imply
> > a smp_rmb() just by the logic. But Peter tells me things like alpha is
> > crazy enough to do better than that! Thus, I'm asking you.

Peter is correct when he says that Alpha does not necessarily respect data
dependencies.  See the following URL for the official story:

	http://www.openvms.compaq.com/wizard/wiz_2637.html

And I give an example hardware cache design that can result in this
situation here:

	http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf

See the discussion starting with the "Why Reorder Memory Accesses?"
heading in the second column of the first page.

Strange, but true.  It took an Alpha architect quite some time to
convince me of this back in the late 90s.  ;-)

> > Can some arch have a reader where it receives ops->next before it receives
> > ops? This seems to me to be a psychic arch, to know where ops->next is
> > before it knows ops!

The trick is that the machine might have a split cache, with (say)
odd-numbered cache lines being processed by one half and even-numbered
lines processed by the other half.  If reading CPU has one half of the
cache extremely busy (e.g., processing invalidation requests from other
CPUs) and the other half idle, memory misordering can happen in the
receiving CPU -- if the pointer is processed by the idle half, and
the pointed-to struct by the busy half, you might see the uninitialized
contents of the pointed-to structure.  The reading CPU must execute
a memory barrier to force ordering in this case.

> > Remember, that the ops that is being registered, is not viewable by any
> > other CPU until mcount_list = ops. I don't see the need for a read barrier
> > in this case. But I could very well be wrong.

And I was right there with you before my extended discussions with the
aforementioned Alpha architect!

						Thanx, Paul

> > Help!
> > 
> > -- Steve
> > 
> > 
> > >
> > > >
> > > > > +	mcount_list = ops;
> > > > > +	/*
> > > > > +	 * For one func, simply call it directly.
> > > > > +	 * For more than one func, call the chain.
> > > > > +	 */
> > > > > +	if (ops->next == &mcount_list_end)
> > > > > +		mcount_trace_function = ops->func;
> > > > > +	else
> > > > > +		mcount_trace_function = mcount_list_func;
> > > > > +	spin_unlock_irqrestore(&mcount_func_lock, flags);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > >
> > > >
> > > >
> > >
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-02-01 22:34           ` Paul E. McKenney
@ 2008-02-02  1:56             ` Steven Rostedt
  2008-02-02 21:41               ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2008-02-02  1:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt



On Fri, 1 Feb 2008, Paul E. McKenney wrote:

> > > > OK, fair enough. I'll explain it a bit more.
> > > >
> > > > How's this:
> > > >
> > > >  /*
> > > >   * We are entering ops into the mcount_list but another
> > > >   * CPU might be walking that list. We need to make sure
> > > >   * the ops->next pointer is valid before another CPU sees
> > > >   * the ops pointer included into the mcount_list.
> > > >   */
> > > >
> > >
> > > The above is my new comment. But Peter says that it's still not good
> > > enough and that all write memory barriers need read barriers.
> >
> > To clarify, either: full mb, rmb or read depend.
>
> This is true.  A write barrier ensures that the writes remain ordered,
> but unless the reads are also ordered, the reader can still get confused.
> For example (assuming all variables are initially zero):
>
> writer:
>
> 	a = 1;
> 	smp_wmb();  /* or smp_mb() */
> 	b = 1;
>
> reader:
>
> 	tb = b;
> 	ta = a;
>
> The writer will (roughly speaking) execute the assignments in order,
> but the reader might not.  If the reader executes the assignment from
> "a" first, it might see tb==1&&ta==0.  To prevent this, we do:
>
> reader:
>
> 	tb = b;
> 	smp_rmb();  /* or smp_mb() */
> 	ta = a;
>
> There are a lot of variations on this theme.

Yep, this is all clear, but not quite what this code does.

>
> > > Let me explain the situation here.
> > >
> > > We have a single link list called mcount_list that is walked when more
> > > than one function is registered by mcount. Mcount is called at the start
> > > of all C functions that are not annotated with "notrace". When more than
> > > one function is registered, mcount calls a loop function that does the
> > > following:
> > >
> > > notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
> > > {
> > >         struct mcount_ops *op = mcount_list;
> >
> > When thinking RCU, this would be rcu_dereference and imply a read
> > barrier.
> >
> > >         while (op != &mcount_list_end) {
> > >                 op->func(ip, parent_ip);
> > >                 op = op->next;
> >
> > Same here; the rcu_dereference() would do the read depend barrier.
>
> Specifically:
>
> notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
> {
>         struct mcount_ops *op = rcu_dereference(mcount_list);
>
>         while (op != &mcount_list_end) {
>                 op->func(ip, parent_ip);
>                 op = rcu_dereference(op->next);
>
> This assumes that you are using call_rcu(), synchronize_rcu(), or
> whatever to defer freeing/reuse of the ops structure.

One special part of this is that the ops structure is never to be freed
(this is documented). It should be a static read-mostly structure.
Since it is not to be freed, I did not export the registered functions to
keep modules from using it. I may later add an export that will cause the
module to increment its usage count so that it may never be freed.

There's no guarantees that prevent the func from being called after it was
unregistered, nor should the users of this, ever touch the "next" pointer.

This makes things easy when you don't need to free ;-)

>
> > >         };
> > > }
> > >
> > > A registered function must already have a "func" filled, and the mcount
> > > register code takes care of "next".  It is documented that the calling
> > > function should "never" change next and always expect that the func can be
> > > called after it is unregistered. That's not the issue here.
> > >
> > > The issue is how to insert the ops into the list. I've done the following,
> > > as you can see in the code this text is inserted between.
> > >
> > >    ops->next = mcount_list;
> > >    smp_wmb();
> > >    mcount_list = ops;
> > >
> > > The read side pair is the reading of ops to ops->next, which should imply
> > > a smp_rmb() just by the logic. But Peter tells me things like alpha is
> > > crazy enough to do better than that! Thus, I'm asking you.
>
> Peter is correct when he says that Alpha does not necessarily respect data
> dependencies.  See the following URL for the official story:
>
> 	http://www.openvms.compaq.com/wizard/wiz_2637.html
>
> And I give an example hardware cache design that can result in this
> situation here:
>
> 	http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
>
> See the discussion starting with the "Why Reorder Memory Accesses?"
> heading in the second column of the first page.
>
> Strange, but true.  It took an Alpha architect quite some time to
> convince me of this back in the late 90s.  ;-)
>
> > > Can some arch have a reader where it receives ops->next before it receives
> > > ops? This seems to me to be a psychic arch, to know where ops->next is
> > > before it knows ops!
>
> The trick is that the machine might have a split cache, with (say)
> odd-numbered cache lines being processed by one half and even-numbered
> lines processed by the other half.  If reading CPU has one half of the
> cache extremely busy (e.g., processing invalidation requests from other
> CPUs) and the other half idle, memory misordering can happen in the
> receiving CPU -- if the pointer is processed by the idle half, and
> the pointed-to struct by the busy half, you might see the uninitialized
> contents of the pointed-to structure.  The reading CPU must execute
> a memory barrier to force ordering in this case.
>
> > > Remember, that the ops that is being registered, is not viewable by any
> > > other CPU until mcount_list = ops. I don't see the need for a read barrier
> > > in this case. But I could very well be wrong.
>
> And I was right there with you before my extended discussions with the
> aforementioned Alpha architect!
>

hmm, I'm still not convinced ;-)

This is a unique situation. We don't need to worry about items being freed
because there's too many races to allow that. The items are only to
register functions and are not to be dynamically allocated or freed. In
this situation we do not need to worry about deletions.

The smp_wmb is only for initialization of something that is about to enter
the list. It is not to protect against freeing.

Specifically:

   ops->next = mcount_list;
   smp_wmb();
   mcount_list = ops;


What this is to prevent is a new item that has next = NULL being viewable
to other CPUs before next is initialized.

On another cpu we have (simplified by removing loop):

  op = mcount_list;
  op->func();
  op = op->next;
  if (op->next != NULL)
     op->func();

What we want to prevent is reading of the new ops before ops->next is set.

What you are saying is that on alpha, even though the write to ops->next
has completed before mcount_list is set, we can still get a reversed
order?

  ops->next = mcount_list;  -- in one cache line
  smp_wmb();
  mcount_list = ops;       -- in another cache line

Even though the ops->next is completed, we can have on another cpu:

   op = mcount_list; (which is the ops from above)
   op = op->next;  -- still see the old ops->next?

I just want to understand this. I already put in the read_barrier_depends
because it doesn't hurt on most archs anyway (nops).

Thanks,

-- Steve



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-02-02  1:56             ` Steven Rostedt
@ 2008-02-02 21:41               ` Paul E. McKenney
  2008-02-04 17:09                 ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2008-02-02 21:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

On Fri, Feb 01, 2008 at 08:56:12PM -0500, Steven Rostedt wrote:
> 
> 
> On Fri, 1 Feb 2008, Paul E. McKenney wrote:
> 
> > > > > OK, fair enough. I'll explain it a bit more.
> > > > >
> > > > > How's this:
> > > > >
> > > > >  /*
> > > > >   * We are entering ops into the mcount_list but another
> > > > >   * CPU might be walking that list. We need to make sure
> > > > >   * the ops->next pointer is valid before another CPU sees
> > > > >   * the ops pointer included into the mcount_list.
> > > > >   */
> > > > >
> > > >
> > > > The above is my new comment. But Peter says that it's still not good
> > > > enough and that all write memory barriers need read barriers.
> > >
> > > To clarify, either: full mb, rmb or read depend.
> >
> > This is true.  A write barrier ensures that the writes remain ordered,
> > but unless the reads are also ordered, the reader can still get confused.
> > For example (assuming all variables are initially zero):
> >
> > writer:
> >
> > 	a = 1;
> > 	smp_wmb();  /* or smp_mb() */
> > 	b = 1;
> >
> > reader:
> >
> > 	tb = b;
> > 	ta = a;
> >
> > The writer will (roughly speaking) execute the assignments in order,
> > but the reader might not.  If the reader executes the assignment from
> > "a" first, it might see tb==1&&ta==0.  To prevent this, we do:
> >
> > reader:
> >
> > 	tb = b;
> > 	smp_rmb();  /* or smp_mb() */
> > 	ta = a;
> >
> > There are a lot of variations on this theme.
> 
> Yep, this is all clear, but not quite what this code does.

Yep, you have dependencies, so something like the following:

initial state:

	struct foo {
		int a;
	};
	struct foo x = { 0 };
	struct foo y = { 0 };
	struct foo *global_p = &y;
	/* other variables are appropriately declared auto variables */

	/* No kmalloc() or kfree(), hence no RCU grace periods. */
	/* In the terminology of http://lwn.net/Articles/262464/, we */
	/* are doing only publish-subscribe, nothing else. */

writer:

	x.a = 1;
	smp_wmb();  /* or smp_mb() */
	global_p = &x;

reader:

	p = global_p;
	ta = p->a;

Both Alpha and aggressive compiler optimizations can result in the reader
seeing the new value of the pointer (&x) but the old value of the field
(0).  Strange but true.  The fix is as follows:

reader:

	p = global_p;
	smp_read_barrier_depends();  /* or use rcu_dereference() */
	ta = p->a;

So how can this happen?  First note that if smp_read_barrier_depends()
was unnecessary in this case, it would be unnecessary in all cases.

Second, let's start with the compiler.  Suppose that a highly optimizing
compiler notices that in almost all cases, the reader finds p==global_p.
Suppose that this compiler also notices that one of the registers (say
r1) almost always contains this expected value of global_p, and that
cache pressure ensures that an actual load from global_p almost always
generates an expensive cache miss.  Such a compiler would be within its
rights (as defined by the C standard) to generate code assuming that r1
already had the right value, while also generating code to validate this
assumption, perhaps as follows:

	r2 = global_p;  /* high latency, other things complete meanwhile */
	ta = r1->a;
	if (r1 != r2)
		ta = r2->a;

Now consider the following sequence of events on a superscalar CPU:

	reader: r2 = global_p; /* issued, has not yet completed. */
	reader: ta = r1->a; /* which gives zero. */
	writer: x.a = 1;
	writer: smp_wmb();
	writer: global_p = &x;
	reader: r2 = global_p; /* this instruction now completes */
	reader: if (r1 != r2) /* and these are equal, so we keep bad ta! */

I have great sympathy with the argument that this level of optimization
is simply insane, but the fact is that there are real-world compilers
that actually do this sort of thing.  In addition, there are cases where
the compiler might be able to figure out that a value is constant, thus
breaking the dependency chain.  This is most common for array references
where the compiler might be able to figure out that a given array index
is always zero, thus optimizing away the load and the dependency that
the programmer might expect to enforce ordering.  (I have an example
of this down at the end.)

This sort of misordering is also done by DEC Alpha hardware, assuming
split caches.  This can happen if the variable x is in an odd-numbered
cache line and the variable global_p is in an even-numbered cache line.
In this case, the smp_wmb() affects the memory order, but only within
the writing CPU.  The ordering can be defeated in the reading CPU as
follows:

	writer: x.a = 1;
	writer: smp_wmb();
	writer: global_p = &x;
	reader: p = global_p;
	reader: ta = p->a;

		But the reader's odd-numbered cache shard is loaded
		down with many queued cacheline invalidation requests,
		so the old cached version of x.a==0 remains in the
		reader's cache, so that the reader sees ta==0.

In contrast:

	writer: x.a = 1;
	writer: smp_wmb();
	writer: global_p = &x;
	reader: p = global_p;
	reader: smp_read_barrier_depends();

		The above barrier forces all cacheline invalidation
		requests that have arrived at the reading CPU to be
		processed  before any subsequent reads, including
		the pending invalidation request for the variable x.

	reader: ta = p->a;

		So ta is now guaranteed to be 1, as desired.
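(The same guarantee, written out with C11 atomics: a release publish of
the pointer, and a consume load standing in for rcu_dereference() with
its implied smp_read_barrier_depends(). Current compilers promote
memory_order_consume to acquire, which is safe but stronger. This
mapping is my own sketch, not kernel code.)

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct foo { int a; };

static struct foo x;			/* initially { 0 } */
static struct foo y;
static _Atomic(struct foo *) global_p = &y;

static void *pub(void *arg)
{
	(void)arg;
	x.a = 1;
	/* release publish: pairs with the consume load in subscribe() */
	atomic_store_explicit(&global_p, &x, memory_order_release);
	return NULL;
}

int subscribe(void)
{
	pthread_t t;
	struct foo *p;

	pthread_create(&t, NULL, pub, NULL);
	/* spin until the newly published pointer is visible */
	do {
		p = atomic_load_explicit(&global_p, memory_order_consume);
	} while (p != &x);
	pthread_join(t, NULL);
	return p->a;	/* dependency-ordered load: sees 1, never 0 */
}
```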

> > > > Let me explain the situation here.
> > > >
> > > > We have a single link list called mcount_list that is walked when more
> > > > than one function is registered by mcount. Mcount is called at the start
> > > > of all C functions that are not annotated with "notrace". When more than
> > > > one function is registered, mcount calls a loop function that does the
> > > > following:
> > > >
> > > > notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
> > > > {
> > > >         struct mcount_ops *op = mcount_list;
> > >
> > > When thinking RCU, this would be rcu_dereference and imply a read
> > > barrier.
> > >
> > > >         while (op != &mcount_list_end) {
> > > >                 op->func(ip, parent_ip);
> > > >                 op = op->next;
> > >
> > > Same here; the rcu_dereference() would do the read depend barrier.
> >
> > Specifically:
> >
> > notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
> > {
> >         struct mcount_ops *op = rcu_dereference(mcount_list);
> >
> >         while (op != &mcount_list_end) {
> >                 op->func(ip, parent_ip);
> >                 op = rcu_dereference(op->next);
> >
> > This assumes that you are using call_rcu(), synchronize_rcu(), or
> > whatever to defer freeing/reuse of the ops structure.
> 
> One special part of this is that the ops structure is never to be freed
> (this is documented). It should be a static read-mostly structure.
> Since it is not to be freed, I did not export the registered functions to
> keep modules from using it. I may later add an export that will cause the
> module to increment its usage count so that it may never be freed.
> 
> There are no guarantees that prevent the func from being called after it was
> unregistered, nor should the users of this ever touch the "next" pointer.
> 
> This makes things easy when you don't need to free ;-)

It can indeed make things easier, but it does not help in this case.
This memory-ordering problem appears even if you never free anything, as
described above.  Again, in the terminology laid out in the LWN article
at http://lwn.net/Articles/262464/, you are doing a publish-subscribe
operation, and it still must be protected.

But yes, my comment above about using call_rcu() and friends did in fact
incorrectly assume that you were freeing (or otherwise re-using) the
data structures.

> > > >         };
> > > > }
> > > >
> > > > A registered function must already have a "func" filled, and the mcount
> > > > register code takes care of "next".  It is documented that the calling
> > > > function should "never" change next and always expect that the func can be
> > > > called after it is unregistered. That's not the issue here.
> > > >
> > > > The issue is how to insert the ops into the list. I've done the following,
> > > > as you can see in the code this text is inserted between.
> > > >
> > > >    ops->next = mcount_list;
> > > >    smp_wmb();
> > > >    mcount_list = ops;
> > > >
> > > > The read side pair is the reading of ops to ops->next, which should imply
> > > > a smp_rmb() just by the logic. But Peter tells me things like alpha is
> > > > crazy enough to do better than that! Thus, I'm asking you.
> >
> > Peter is correct when he says that Alpha does not necessarily respect data
> > dependencies.  See the following URL for the official story:
> >
> > 	http://www.openvms.compaq.com/wizard/wiz_2637.html
> >
> > And I give an example hardware cache design that can result in this
> > situation here:
> >
> > 	http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
> >
> > See the discussion starting with the "Why Reorder Memory Accesses?"
> > heading in the second column of the first page.
> >
> > Strange, but true.  It took an Alpha architect quite some time to
> > convince me of this back in the late 90s.  ;-)
> >
> > > > Can some arch have a reader where it receives ops->next before it received
> > > > ops? This seems to me to be a psychic arch, to know where ops->next is
> > > > before it knows ops!
> >
> > The trick is that the machine might have a split cache, with (say)
> > odd-numbered cache lines being processed by one half and even-numbered
> > lines processed by the other half.  If the reading CPU has one half of the
> > cache extremely busy (e.g., processing invalidation requests from other
> > CPUs) and the other half idle, memory misordering can happen in the
> > receiving CPU -- if the pointer is processed by the idle half, and
> > the pointed-to struct by the busy half, you might see the uninitialized
> > contents of the pointed-to structure.  The reading CPU must execute
> > a memory barrier to force ordering in this case.
> >
> > > > Remember, that the ops that is being registered, is not viewable by any
> > > > other CPU until mcount_list = ops. I don't see the need for a read barrier
> > > > in this case. But I could very well be wrong.
> >
> > And I was right there with you before my extended discussions with the
> > aforementioned Alpha architect!
> 
> hmm, I'm still not convinced ;-)
> 
> This is a unique situation. We don't need to worry about items being freed
> because there are too many races to allow that. The items are only to
> register functions and are not to be dynamically allocated or freed. In
> this situation we do not need to worry about deletions.
> 
> The smp_wmb is only for initialization of something that is about to enter
> the list. It is not to protect against freeing.

Similarly, the smp_read_barrier_depends() is only for initialization
of something that is about to enter the list.  As with the smp_wmb()
primitive, smp_read_barrier_depends() also is not to protect against
freeing.  Instead, it is rcu_read_lock() and rcu_read_unlock() that
protect against freeing.

> Specifically:
> 
>    ops->next = mcount_list;
>    smp_wmb();
>    mcount_list = ops;
> 
> What this is to prevent is a new item that has next = NULL being viewable
> to other CPUs before next is initialized.

Were it not for aggressive compiler optimizations and DEC Alpha, you would
be correct.  What this instead does is to do the writer's part of the job
of preventing such new items from being visible to other CPUs before ->next
is initialized.  These other CPUs must do their part as well, and that
part is smp_read_barrier_depends() -- or rcu_dereference(), whichever is
most appropriate.
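
Putting both halves together for the mcount list itself, here is a
user-space sketch with C11 stand-ins (the names mirror the patch, but
nothing below is the actual kernel implementation; the consume loads
stand in for smp_read_barrier_depends(), the release store for smp_wmb()
plus the assignment):

```c
#include <stdatomic.h>
#include <stddef.h>

struct mcount_ops {
	void (*func)(unsigned long ip, unsigned long parent_ip);
	struct mcount_ops *_Atomic next;
};

static struct mcount_ops mcount_list_end;	/* sentinel; func unused */
static struct mcount_ops *_Atomic mcount_list = &mcount_list_end;

/* Writer's half: ->next must be initialized before ops is published.
 * The release store stands in for smp_wmb(); mcount_list = ops;. */
static void register_mcount_function(struct mcount_ops *ops)
{
	atomic_store_explicit(&ops->next,
			      atomic_load_explicit(&mcount_list,
						   memory_order_relaxed),
			      memory_order_relaxed);
	atomic_store_explicit(&mcount_list, ops, memory_order_release);
}

/* Reader's half: every pointer load in the walk is a consume load,
 * standing in for the load plus smp_read_barrier_depends(). */
static void mcount_list_func(unsigned long ip, unsigned long parent_ip)
{
	struct mcount_ops *op =
		atomic_load_explicit(&mcount_list, memory_order_consume);

	while (op != &mcount_list_end) {
		op->func(ip, parent_ip);
		op = atomic_load_explicit(&op->next, memory_order_consume);
	}
}

/* Trivial callback for exercising the walk. */
static int ncalls;
static void count_calls(unsigned long ip, unsigned long parent_ip)
{
	(void)ip; (void)parent_ip;
	ncalls++;
}
```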

> On another cpu we have (simplified by removing loop):
> 
>   op = mcount_list;
>   op->func();
>   op = op->next;
>   if (op->next != NULL)
>      op->func();
> 
> What we want to prevent is reading of the new ops before ops->next is set.

Understood.

> What you are saying is that on alpha, even though the write to ops->next
> has completed before mcount_list is set, we can still get a reversed
> order?

That is exactly what I am saying.  In addition, I am saying that
aggressive compiler optimizations can have this same effect, even on
non-Alpha CPUs.

>   ops->next = mcount_list;  -- in one cache line
>   smp_wmb();
>   mcount_list = ops;       -- in another cache line
> 
> Even though the ops->next is completed, we can have on another cpu:
> 
>    op = mcount_list; (which is the ops from above)
>    op = op->next;  -- still see the old ops->next?

Yes, this bizarre sequence of events really can happen.  The fix is to
do the following:

   op = mcount_list; (which is the ops from above)
   smp_read_barrier_depends();
   op = op->next;  -- no longer see the old ops->next

> I just want to understand this. I already put in the read_barrier_depends
> because it doesn't hurt on most archs anyway (nops).

Very good!!!

And here is the example using array indexes.

initial state:

	struct foo {
		int a;
	};
	struct foo x[ARRAY_SIZE] = { 0 };
	struct foo *global_p = &x[0];
	/* other variables are appropriately declared auto variables */

	/* No kmalloc() or kfree(), hence no RCU grace periods. */
	/* In the terminology of http://lwn.net/Articles/262464/, we */
	/* are doing only publish-subscribe, nothing else. */

writer:

	x[cur_idx].a = 1;
	smp_wmb();  /* or smp_mb() */
	global_idx = cur_idx;

reader:

	i = global_idx;
	ta = x[i].a;

Suppose we have ARRAY_SIZE of 1.  Then the standard states that the
results of indexing x[] with a non-zero index are undefined.  Since they
are undefined, the compiler is within its rights to assume that the
index will always be zero, so that the reader code would be as follows:

reader:

	ta = x[0].a;

No dependency, no ordering.  So this totally reasonable generated code
could see the pre-initialized value of field a.  The job of both
smp_read_barrier_depends() and rcu_dereference() is to tell both the
CPU and the compiler that such assumptions are ill-advised.
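
One counter-measure for the compiler half of the problem is to force the
index load through a volatile access, so the compiler cannot prove the
index constant and delete the dependency.  The macro name below is
hypothetical, but the idiom is the same volatile-cast trick the kernel
uses elsewhere (in the spirit of ACCESS_ONCE()):

```c
struct foo {
	int a;
};

static struct foo x[2];
static int global_idx;

/* Hypothetical helper: the volatile cast forces the compiler to emit
 * the load of global_idx rather than assume its value. */
#define LOAD_IDX() (*(volatile int *)&global_idx)

static int reader(void)
{
	int i = LOAD_IDX();	/* load cannot be optimized away */

	return x[i].a;		/* dependency on i is preserved */
}
```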

							Thanx, Paul

> Thanks,
> 
> -- Steve
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-02-02 21:41               ` Paul E. McKenney
@ 2008-02-04 17:09                 ` Steven Rostedt
  2008-02-04 21:40                   ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2008-02-04 17:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt


Hi Paul,

First I want to say, "Thank you", for taking the time to explain this in
considerable detail. But I still have some minor questions.

 (Even though you already convinced me, I still want full
  understanding ;-)


On Sat, 2 Feb 2008, Paul E. McKenney wrote:

> Yep, you have dependencies, so something like the following:
>
> initial state:
>
> 	struct foo {
> 		int a;
> 	};
> 	struct foo x = { 0 };
> 	struct foo y = { 0 };
> 	struct foo *global_p = &y;
> 	/* other variables are appropriately declared auto variables */
>
> 	/* No kmalloc() or kfree(), hence no RCU grace periods. */
> 	/* In the terminology of http://lwn.net/Articles/262464/, we */
> 	/* are doing only publish-subscribe, nothing else. */
>
> writer:
>
> 	x.a = 1;
> 	smp_wmb();  /* or smp_mb() */
> 	global_p = &x;
>
> reader:
>
> 	p = global_p;
> 	ta = p->a;
>
> Both Alpha and aggressive compiler optimizations can result in the reader
> seeing the new value of the pointer (&x) but the old value of the field
> (0).  Strange but true.  The fix is as follows:
>
> reader:
>
> 	p = global_p;
> 	smp_read_barrier_depends();  /* or use rcu_dereference() */
> 	ta = p->a;
>
> So how can this happen?  First note that if smp_read_barrier_depends()
> was unnecessary in this case, it would be unnecessary in all cases.
>
> Second, let's start with the compiler.  Suppose that a highly optimizing
> compiler notices that in almost all cases, the reader finds p==global_p.
> Suppose that this compiler also notices that one of the registers (say
> r1) almost always contains this expected value of global_p, and that
> cache pressure ensures that an actual load from global_p almost always
> generates an expensive cache miss.  Such a compiler would be within its
> rights (as defined by the C standard) to generate code assuming that r1
> already had the right value, while also generating code to validate this
> assumption, perhaps as follows:
>
> 	r2 = global_p;  /* high latency, other things complete meanwhile */
> 	ta = r1->a;
> 	if (r1 != r2)
> 		ta = r2->a;
>
> Now consider the following sequence of events on a superscalar CPU:

I think you missed one step here (causing my confusion). I don't want to
assume so I'll try to put in the missing step:

	writer: r1 = p;  /* happens to use r1 to store parameter p */

> 	reader: r2 = global_p; /* issued, has not yet completed. */
> 	reader: ta = r1->a; /* which gives zero. */
> 	writer: x.a = 1;
> 	writer: smp_wmb();
> 	writer: global_p = &x;
> 	reader: r2 = global_p; /* this instruction now completes */
> 	reader: if (r1 != r2) /* and these are equal, so we keep bad ta! */

Is that the case?

>
> I have great sympathy with the argument that this level of optimization
> is simply insane, but the fact is that there are real-world compilers
> that actually do this sort of thing.  In addition, there are cases where
> the compiler might be able to figure out that a value is constant, thus
> breaking the dependency chain.  This is most common for array references
> where the compiler might be able to figure out that a given array index
> is always zero, thus optimizing away the load and the dependency that
> the programmer might expect to enforce ordering.  (I have an example
> of this down at the end.)
>
> This sort of misordering is also done by DEC Alpha hardware, assuming
> split caches.  This can happen if the variable x is in an odd-numbered
> cache line and the variable global_p is in an even-numbered cache line.
> In this case, the smp_wmb() affects the memory order, but only within
> the writing CPU.  The ordering can be defeated in the reading CPU as
> follows:
>
> 	writer: x.a = 1;
> 	writer: smp_wmb();
> 	writer: global_p = &x;
> 	reader: p = global_p;
> 	reader: ta = p->a;
>
> 		But the reader's odd-numbered cache shard is loaded
> 		down with many queued cacheline invalidation requests,
> 		so the old cached version of x.a==0 remains in the
> 		reader's cache, so that the reader sees ta==0.
>
> In contrast:
>
> 	writer: x.a = 1;
> 	writer: smp_wmb();
> 	writer: global_p = &x;
> 	reader: p = global_p;
> 	reader: smp_read_barrier_depends();
>
> 		The above barrier forces all cacheline invalidation
> 		requests that have arrived at the reading CPU to be
> 		processed  before any subsequent reads, including
> 		the pending invalidation request for the variable x.
>
> 	reader: ta = p->a;
>
> 		So ta is now guaranteed to be 1, as desired.

Thanks, this is starting to clear things up for me (And scare me away from
Alpha's)

>
> > > > > Let me explain the situation here.
> > > > >
> > > > > We have a single link list called mcount_list that is walked when more
> > > > > than one function is registered by mcount. Mcount is called at the start
> > > > > of all C functions that are not annotated with "notrace". When more than
> > > > > one function is registered, mcount calls a loop function that does the
> > > > > following:
> > > > >
> > > > > notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
> > > > > {
> > > > >         struct mcount_ops *op = mcount_list;
> > > >
> > > > When thinking RCU, this would be rcu_dereference and imply a read
> > > > barrier.
> > > >
> > > > >         while (op != &mcount_list_end) {
> > > > >                 op->func(ip, parent_ip);
> > > > >                 op = op->next;
> > > >
> > > > Same here; the rcu_dereference() would do the read depend barrier.
> > >
> > > Specifically:
> > >
> > > notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
> > > {
> > >         struct mcount_ops *op = rcu_dereference(mcount_list);
> > >
> > >         while (op != &mcount_list_end) {
> > >                 op->func(ip, parent_ip);
> > >                 op = rcu_dereference(op->next);
> > >
> > > This assumes that you are using call_rcu(), synchronize_rcu(), or
> > > whatever to defer freeing/reuse of the ops structure.
> >
> > One special part of this is that the ops structure is never to be freed
> > (this is documented). It should be a static read-mostly structure.
> > Since it is not to be freed, I did not export the registered functions to
> > keep modules from using it. I may later add an export that will cause the
> > module to increment its usage count so that it may never be freed.
> >
> > There are no guarantees that prevent the func from being called after it was
> > unregistered, nor should the users of this ever touch the "next" pointer.
> >
> > This makes things easy when you don't need to free ;-)
>
> It can indeed make things easier, but it does not help in this case.
> This memory-ordering problem appears even if you never free anything, as
> described above.  Again, in the terminology laid out in the LWN article
> at http://lwn.net/Articles/262464/, you are doing a publish-subscribe
> operation, and it still must be protected.
>
> But yes, my comment above about using call_rcu() and friends did in fact
> incorrectly assume that you were freeing (or otherwise re-using) the
> data structures.
>
> > > > >         };
> > > > > }
> > > > >
> > > > > A registered function must already have a "func" filled, and the mcount
> > > > > register code takes care of "next".  It is documented that the calling
> > > > > function should "never" change next and always expect that the func can be
> > > > > called after it is unregistered. That's not the issue here.
> > > > >
> > > > > The issue is how to insert the ops into the list. I've done the following,
> > > > > as you can see in the code this text is inserted between.
> > > > >
> > > > >    ops->next = mcount_list;
> > > > >    smp_wmb();
> > > > >    mcount_list = ops;
> > > > >
> > > > > The read side pair is the reading of ops to ops->next, which should imply
> > > > > a smp_rmb() just by the logic. But Peter tells me things like alpha is
> > > > > crazy enough to do better than that! Thus, I'm asking you.
> > >
> > > Peter is correct when he says that Alpha does not necessarily respect data
> > > dependencies.  See the following URL for the official story:
> > >
> > > 	http://www.openvms.compaq.com/wizard/wiz_2637.html
> > >
> > > And I give an example hardware cache design that can result in this
> > > situation here:
> > >
> > > 	http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
> > >
> > > See the discussion starting with the "Why Reorder Memory Accesses?"
> > > heading in the second column of the first page.
> > >
> > > Strange, but true.  It took an Alpha architect quite some time to
> > > convince me of this back in the late 90s.  ;-)
> > >
> > > > > Can some arch have a reader where it receives ops->next before it received
> > > > > ops? This seems to me to be a psychic arch, to know where ops->next is
> > > > > before it knows ops!
> > >
> > > The trick is that the machine might have a split cache, with (say)
> > > odd-numbered cache lines being processed by one half and even-numbered
> > > lines processed by the other half.  If the reading CPU has one half of the
> > > cache extremely busy (e.g., processing invalidation requests from other
> > > CPUs) and the other half idle, memory misordering can happen in the
> > > receiving CPU -- if the pointer is processed by the idle half, and
> > > the pointed-to struct by the busy half, you might see the uninitialized
> > > contents of the pointed-to structure.  The reading CPU must execute
> > > a memory barrier to force ordering in this case.
> > >
> > > > > Remember, that the ops that is being registered, is not viewable by any
> > > > > other CPU until mcount_list = ops. I don't see the need for a read barrier
> > > > > in this case. But I could very well be wrong.
> > >
> > > And I was right there with you before my extended discussions with the
> > > aforementioned Alpha architect!
> >
> > hmm, I'm still not convinced ;-)
> >
> > This is a unique situation. We don't need to worry about items being freed
> > because there are too many races to allow that. The items are only to
> > register functions and are not to be dynamically allocated or freed. In
> > this situation we do not need to worry about deletions.
> >
> > The smp_wmb is only for initialization of something that is about to enter
> > the list. It is not to protect against freeing.
>
> Similarly, the smp_read_barrier_depends() is only for initialization
> of something that is about to enter the list.  As with the smp_wmb()
> primitive, smp_read_barrier_depends() also is not to protect against
> freeing.  Instead, it is rcu_read_lock() and rcu_read_unlock() that
> protect against freeing.
>
> > Specifically:
> >
> >    ops->next = mcount_list;
> >    smp_wmb();
> >    mcount_list = ops;
> >
> > What this is to prevent is a new item that has next = NULL being viewable
> > to other CPUs before next is initialized.
>
> Were it not for aggressive compiler optimizations and DEC Alpha, you would
> be correct.  What this instead does is to do the writer's part of the job
> of preventing such new items from being visible to other CPUs before ->next
> is initialized.  These other CPUs must do their part as well, and that
> part is smp_read_barrier_depends() -- or rcu_dereference(), whichever is
> most appropriate.

Since the code doesn't use RCU, I'll keep with the
smp_read_barrier_depends().
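
(For what it's worth, rcu_dereference() is essentially just the dependent
load followed by this same barrier, so either spelling should generate the
same code.  A user-space sketch of the equivalence, using stand-in macros
rather than the kernel's exact definitions:)

```c
/* Stand-in: on most architectures smp_read_barrier_depends() is a
 * no-op, so a pure compiler barrier models it here. */
#define smp_read_barrier_depends() __asm__ __volatile__("" ::: "memory")

/* rcu_dereference() is (modulo debug annotations) just the pointer
 * load followed by the data-dependency barrier. */
#define my_rcu_dereference(p) ({			\
	__typeof__(p) _p = (p);				\
	smp_read_barrier_depends();			\
	_p;						\
})
```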

>
> > On another cpu we have (simplified by removing loop):
> >
> >   op = mcount_list;
> >   op->func();
> >   op = op->next;
> >   if (op->next != NULL)
> >      op->func();
> >
> > What we want to prevent is reading of the new ops before ops->next is set.
>
> Understood.
>
> > What you are saying is that on alpha, even though the write to ops->next
> > has completed before mcount_list is set, we can still get a reversed
> > order?
>
> That is exactly what I am saying.  In addition, I am saying that
> aggressive compiler optimizations can have this same effect, even on
> non-Alpha CPUs.
>
> >   ops->next = mcount_list;  -- in one cache line
> >   smp_wmb();
> >   mcount_list = ops;       -- in another cache line
> >
> > Even though the ops->next is completed, we can have on another cpu:
> >
> >    op = mcount_list; (which is the ops from above)
> >    op = op->next;  -- still see the old ops->next?
>
> Yes, this bizarre sequence of events really can happen.  The fix is to
> do the following:
>
>    op = mcount_list; (which is the ops from above)
>    smp_read_barrier_depends();
>    op = op->next;  -- no longer see the old ops->next
>
> > I just want to understand this. I already put in the read_barrier_depends
> > because it doesn't hurt on most archs anyway (nops).
>
> Very good!!!
>
> And here is the example using array indexes.
>
> initial state:
>
> 	struct foo {
> 		int a;
> 	};
> 	struct foo x[ARRAY_SIZE] = { 0 };
> 	struct foo *global_p = &x[0];
> 	/* other variables are appropriately declared auto variables */
>
> 	/* No kmalloc() or kfree(), hence no RCU grace periods. */
> 	/* In the terminology of http://lwn.net/Articles/262464/, we */
> 	/* are doing only publish-subscribe, nothing else. */
>
> writer:
>
> 	x[cur_idx].a = 1;
> 	smp_wmb();  /* or smp_mb() */
> 	global_idx = cur_idx;
>
> reader:
>
> 	i = global_idx;
> 	ta = x[i].a;
>
> Suppose we have ARRAY_SIZE of 1.  Then the standard states that the
> results of indexing x[] with a non-zero index are undefined.  Since they
> are undefined, the compiler is within its rights to assume that the
> index will always be zero, so that the reader code would be as follows:
>
> reader:
>
> 	ta = x[0].a;
>
> No dependency, no ordering.  So this totally reasonable generated code
> could see the pre-initialized value of field a.  The job of both
> smp_read_barrier_depends() and rcu_dereference() is to tell both the
> CPU and the compiler that such assumptions are ill-advised.
>

Paul,

Thanks again for this lengthy email. It took me several readings to absorb
it all.

I recommend that someone have a pointer to this email because it really
does explain why read_barrier_depends is needed.

Excellent job of explaining this!!! Much appreciated.

-- Steve


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-02-04 17:09                 ` Steven Rostedt
@ 2008-02-04 21:40                   ` Paul E. McKenney
  2008-02-04 22:03                     ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2008-02-04 21:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

On Mon, Feb 04, 2008 at 12:09:00PM -0500, Steven Rostedt wrote:
> 
> Hi Paul,
> 
> First I want to say, "Thank you", for taking the time to explain this in
> considerable detail. But I still have some minor questions.
> 
>  (Even though you already convinced me, I still want full
>   understanding ;-)

OK, will see what I can do...

> On Sat, 2 Feb 2008, Paul E. McKenney wrote:
> 
> > Yep, you have dependencies, so something like the following:
> >
> > initial state:
> >
> > 	struct foo {
> > 		int a;
> > 	};
> > 	struct foo x = { 0 };
> > 	struct foo y = { 0 };
> > 	struct foo *global_p = &y;
> > 	/* other variables are appropriately declared auto variables */
> >
> > 	/* No kmalloc() or kfree(), hence no RCU grace periods. */
> > 	/* In the terminology of http://lwn.net/Articles/262464/, we */
> > 	/* are doing only publish-subscribe, nothing else. */
> >
> > writer:
> >
> > 	x.a = 1;
> > 	smp_wmb();  /* or smp_mb() */
> > 	global_p = &x;
> >
> > reader:
> >
> > 	p = global_p;
> > 	ta = p->a;
> >
> > Both Alpha and aggressive compiler optimizations can result in the reader
> > seeing the new value of the pointer (&x) but the old value of the field
> > (0).  Strange but true.  The fix is as follows:
> >
> > reader:
> >
> > 	p = global_p;
> > 	smp_read_barrier_depends();  /* or use rcu_dereference() */
> > 	ta = p->a;
> >
> > So how can this happen?  First note that if smp_read_barrier_depends()
> > was unnecessary in this case, it would be unnecessary in all cases.
> >
> > Second, let's start with the compiler.  Suppose that a highly optimizing
> > compiler notices that in almost all cases, the reader finds p==global_p.
> > Suppose that this compiler also notices that one of the registers (say
> > r1) almost always contains this expected value of global_p, and that
> > cache pressure ensures that an actual load from global_p almost always
> > generates an expensive cache miss.  Such a compiler would be within its
> > rights (as defined by the C standard) to generate code assuming that r1
> > already had the right value, while also generating code to validate this
> > assumption, perhaps as follows:
> >
> > 	r2 = global_p;  /* high latency, other things complete meanwhile */
> > 	ta = r1->a;
> > 	if (r1 != r2)
> > 		ta = r2->a;
> >
> > Now consider the following sequence of events on a superscalar CPU:
> 
> I think you missed one step here (causing my confusion). I don't want to
> assume so I'll try to put in the missing step:
> 
> 	writer: r1 = p;  /* happens to use r1 to store parameter p */

You lost me on this one...  The writer has only the following three steps:

writer:

	x.a = 1;
	smp_wmb();  /* or smp_mb() */
	global_p = &x;

Where did the "r1 = p" come from?  For that matter, where did "p" come
from?
 
> > 	reader: r2 = global_p; /* issued, has not yet completed. */
> > 	reader: ta = r1->a; /* which gives zero. */
> > 	writer: x.a = 1;
> > 	writer: smp_wmb();
> > 	writer: global_p = &x;
> > 	reader: r2 = global_p; /* this instruction now completes */
> > 	reader: if (r1 != r2) /* and these are equal, so we keep bad ta! */
> 
> Is that the case?

Ah!  Please note that I am doing something unusual here in that I am
working with global variables, as opposed to the normal RCU practice of
dynamically allocating memory.  So "x" is just a global struct, not a
pointer to a struct.

> > I have great sympathy with the argument that this level of optimization
> > is simply insane, but the fact is that there are real-world compilers
> > that actually do this sort of thing.  In addition, there are cases where
> > the compiler might be able to figure out that a value is constant, thus
> > breaking the dependency chain.  This is most common for array references
> > where the compiler might be able to figure out that a given array index
> > is always zero, thus optimizing away the load and the dependency that
> > the programmer might expect to enforce ordering.  (I have an example
> > of this down at the end.)
> >
> > This sort of misordering is also done by DEC Alpha hardware, assuming
> > split caches.  This can happen if the variable x is in an odd-numbered
> > cache line and the variable global_p is in an even-numbered cache line.
> > In this case, the smp_wmb() affects the memory order, but only within
> > the writing CPU.  The ordering can be defeated in the reading CPU as
> > follows:
> >
> > 	writer: x.a = 1;
> > 	writer: smp_wmb();
> > 	writer: global_p = &x;
> > 	reader: p = global_p;
> > 	reader: ta = p->a;
> >
> > 		But the reader's odd-numbered cache shard is loaded
> > 		down with many queued cacheline invalidation requests,
> > 		so the old cached version of x.a==0 remains in the
> > 		reader's cache, so that the reader sees ta==0.
> >
> > In contrast:
> >
> > 	writer: x.a = 1;
> > 	writer: smp_wmb();
> > 	writer: global_p = &x;
> > 	reader: p = global_p;
> > 	reader: smp_read_barrier_depends();
> >
> > 		The above barrier forces all cacheline invalidation
> > 		requests that have arrived at the reading CPU to be
> > 		processed  before any subsequent reads, including
> > 		the pending invalidation request for the variable x.
> >
> > 	reader: ta = p->a;
> >
> > 		So ta is now guaranteed to be 1, as desired.
> 
> Thanks, this is starting to clear things up for me (And scare me away from
> Alpha's)

Of course, fairness would require that it also scare you away from
value-speculation optimizations in compilers.  ;-)

							Thanx, Paul

> > > > > > Let me explain the situation here.
> > > > > >
> > > > > > We have a single link list called mcount_list that is walked when more
> > > > > > than one function is registered by mcount. Mcount is called at the start
> > > > > > of all C functions that are not annotated with "notrace". When more than
> > > > > > one function is registered, mcount calls a loop function that does the
> > > > > > following:
> > > > > >
> > > > > > notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
> > > > > > {
> > > > > >         struct mcount_ops *op = mcount_list;
> > > > >
> > > > > When thinking RCU, this would be rcu_dereference and imply a read
> > > > > barrier.
> > > > >
> > > > > >         while (op != &mcount_list_end) {
> > > > > >                 op->func(ip, parent_ip);
> > > > > >                 op = op->next;
> > > > >
> > > > > Same here; the rcu_dereference() would do the read depend barrier.
> > > >
> > > > Specifically:
> > > >
> > > > notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
> > > > {
> > > >         struct mcount_ops *op = rcu_dereference(mcount_list);
> > > >
> > > >         while (op != &mcount_list_end) {
> > > >                 op->func(ip, parent_ip);
> > > >                 op = rcu_dereference(op->next);
> > > >
> > > > This assumes that you are using call_rcu(), synchronize_rcu(), or
> > > > whatever to defer freeing/reuse of the ops structure.
> > >
> > > One special part of this is that the ops structure is never to be freed
> > > (this is documented). It should be a static read-mostly structure.
> > > Since it is not to be freed, I did not export the registered functions to
> > > keep modules from using it. I may later add an export that will cause the
> > > module to increment its usage count so that it may never be freed.
> > >
> > > There are no guarantees that prevent the func from being called after it was
> > > unregistered, nor should the users of this ever touch the "next" pointer.
> > >
> > > This makes things easy when you don't need to free ;-)
> >
> > It can indeed make things easier, but it does not help in this case.
> > This memory-ordering problem appears even if you never free anything, as
> > described above.  Again, in the terminology laid out in the LWN article
> > at http://lwn.net/Articles/262464/, you are doing a publish-subscribe
> > operation, and it still must be protected.
> >
> > But yes, my comment above about using call_rcu() and friends did in fact
> > incorrectly assume that you were freeing (or otherwise re-using) the
> > data structures.
> >
> > > > > >         };
> > > > > > }
> > > > > >
> > > > > > A registered function must already have a "func" filled, and the mcount
> > > > > > register code takes care of "next".  It is documented that the calling
> > > > > > function should "never" change next and always expect that the func can be
> > > > > > called after it is unregistered. That's not the issue here.
> > > > > >
> > > > > > The issue is how to insert the ops into the list. I've done the following,
> > > > > > as you can see in the code this text is inserted between.
> > > > > >
> > > > > >    ops->next = mcount_list;
> > > > > >    smp_wmb();
> > > > > >    mcount_list = ops;
> > > > > >
> > > > > > The read side pair is the reading of ops to ops->next, which should imply
> > > > > > a smp_rmb() just by the logic. But Peter tells me things like Alpha are
> > > > > > crazy enough to do better than that! Thus, I'm asking you.
> > > >
> > > > Peter is correct when he says that Alpha does not necessarily respect data
> > > > dependencies.  See the following URL for the official story:
> > > >
> > > > 	http://www.openvms.compaq.com/wizard/wiz_2637.html
> > > >
> > > > And I give an example hardware cache design that can result in this
> > > > situation here:
> > > >
> > > > 	http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
> > > >
> > > > See the discussion starting with the "Why Reorder Memory Accesses?"
> > > > heading in the second column of the first page.
> > > >
> > > > Strange, but true.  It took an Alpha architect quite some time to
> > > > convince me of this back in the late 90s.  ;-)
> > > >
> > > > > > Can some arch have a reader where it receives ops->next before it received
> > > > > > ops? This seems to me to be a psychic arch, to know where ops->next is
> > > > > > before it knows ops!
> > > >
> > > > The trick is that the machine might have a split cache, with (say)
> > > > odd-numbered cache lines being processed by one half and even-numbered
> > > > lines processed by the other half.  If the reading CPU has one half of the
> > > > cache extremely busy (e.g., processing invalidation requests from other
> > > > CPUs) and the other half idle, memory misordering can happen in the
> > > > receiving CPU -- if the pointer is processed by the idle half, and
> > > > the pointed-to struct by the busy half, you might see the uninitialized
> > > > contents of the pointed-to structure.  The reading CPU must execute
> > > > a memory barrier to force ordering in this case.
> > > >
> > > > > > Remember, that the ops that is being registered, is not viewable by any
> > > > > > other CPU until mcount_list = ops. I don't see the need for a read barrier
> > > > > > in this case. But I could very well be wrong.
> > > >
> > > > And I was right there with you before my extended discussions with the
> > > > aforementioned Alpha architect!
> > >
> > > hmm, I'm still not convinced ;-)
> > >
> > > This is a unique situation. We don't need to worry about items being freed
> > > because there are too many races to allow that. The items exist only to
> > > register functions and are not to be dynamically allocated or freed. In
> > > this situation we do not need to worry about deletions.
> > >
> > > The smp_wmb is only for initialization of something that is about to enter
> > > the list. It is not to protect against freeing.
> >
> > Similarly, the smp_read_barrier_depends() is only for initialization
> > of something that is about to enter the list.  As with the smp_wmb()
> > primitive, smp_read_barrier_depends() also is not to protect against
> > freeing.  Instead, it is rcu_read_lock() and rcu_read_unlock() that
> > protect against freeing.
> >
> > > Specifically:
> > >
> > >    ops->next = mcount_list;
> > >    smp_wmb();
> > >    mcount_list = ops;
> > >
> > > What this is to prevent is a new item that has next = NULL being viewable
> > > to other CPUs before next is initialized.
> >
> > Were it not for aggressive compiler optimizations and DEC Alpha, you would
> > be correct.  What this instead does is to do the writer's part of the job
> > of preventing such new items from being visible to other CPUs before ->next
> > is initialized.  These other CPUs must do their part as well, and that
> > part is smp_read_barrier_depends() -- or rcu_dereference(), whichever is
> > most appropriate.
> 
> Since the code doesn't use RCU, I'll keep with the
> smp_read_barrier_depends().
> 
> >
> > > On another cpu we have (simplified by removing loop):
> > >
> > >   op = mcount_list;
> > >   op->func();
> > >   op = op->next;
> > >   if (op != NULL)
> > >      op->func();
> > >
> > > What we want to prevent is reading of the new ops before ops->next is set.
> >
> > Understood.
> >
> > > What you are saying is that on alpha, even though the write to ops->next
> > > has completed before mcount_list is set, we can still get a reversed
> > > order?
> >
> > That is exactly what I am saying.  In addition, I am saying that
> > aggressive compiler optimizations can have this same effect, even on
> > non-Alpha CPUs.
> >
> > >   ops->next = mcount_list;  -- in one cache line
> > >   smp_wmb();
> > >   mcount_list = ops;       -- in another cache line
> > >
> > > Even though the ops->next is completed, we can have on another cpu:
> > >
> > >    op = mcount_list; (which is the ops from above)
> > >    op = op->next;  -- still see the old ops->next?
> >
> > Yes, this bizarre sequence of events really can happen.  The fix is to
> > do the following:
> >
> >    op = mcount_list; (which is the ops from above)
> >    smp_read_barrier_depends();
> >    op = op->next;  -- no longer see the old ops->next
> >
> > > I just want to understand this. I already put in the read_barrier_depends
> > > because it doesn't hurt on most archs anyway (nops).
> >
> > Very good!!!
> >
> > And here is the example using array indexes.
> >
> > initial state:
> >
> > 	struct foo {
> > 		int a;
> > 	};
> > 	struct foo x[ARRAY_SIZE] = { 0 };
> > 	struct foo *global_p = &x[0];
> > 	/* other variables are appropriately declared auto variables */
> >
> > 	/* No kmalloc() or kfree(), hence no RCU grace periods. */
> > 	/* In the terminology of http://lwn.net/Articles/262464/, we */
> > 	/* are doing only publish-subscribe, nothing else. */
> >
> > writer:
> >
> > 	x[cur_idx].a = 1;
> > 	smp_wmb();  /* or smp_mb() */
> > 	global_idx = cur_idx;
> >
> > reader:
> >
> > 	i = global_idx;
> > 	ta = x[i].a;
> >
> > Suppose we have ARRAY_SIZE of 1.  Then the standard states that the
> > results of indexing x[] with a non-zero index are undefined.  Since they
> > are undefined, the compiler is within its rights to assume that the
> > index will always be zero, so that the reader code would be as follows:
> >
> > reader:
> >
> > 	ta = x[0].a;
> >
> > No dependency, no ordering.  So this totally reasonable generated code
> > could see the pre-initialized value of field a.  The job of both
> > smp_read_barrier_depends() and rcu_dereference() is to tell both the
> > CPU and the compiler that such assumptions are ill-advised.
> >
> 
> Paul,
> 
> Thanks again for this lengthy email. It took me several readings to take
> it all in.
> 
> I recommend that someone have a pointer to this email because it really
> does explain why read_barrier_depends is needed.
> 
> Excellent job of explaining this!!! Much appreciated.

Glad you liked it!!!

						Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-02-04 21:40                   ` Paul E. McKenney
@ 2008-02-04 22:03                     ` Steven Rostedt
  2008-02-04 22:41                       ` Mathieu Desnoyers
  2008-02-05  5:13                       ` Paul E. McKenney
  0 siblings, 2 replies; 45+ messages in thread
From: Steven Rostedt @ 2008-02-04 22:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt


On Mon, 4 Feb 2008, Paul E. McKenney wrote:
> OK, will see what I can do...
>
> > On Sat, 2 Feb 2008, Paul E. McKenney wrote:
> >
> > > Yep, you have dependencies, so something like the following:
> > >
> > > initial state:
> > >
> > > 	struct foo {
> > > 		int a;
> > > 	};
> > > 	struct foo x = { 0 };
> > > 	struct foo y = { 0 };
> > > 	struct foo *global_p = &y;
> > > 	/* other variables are appropriately declared auto variables */
> > >
> > > 	/* No kmalloc() or kfree(), hence no RCU grace periods. */
> > > 	/* In the terminology of http://lwn.net/Articles/262464/, we */
> > > 	/* are doing only publish-subscribe, nothing else. */
> > >
> > > writer:
> > >
> > > 	x.a = 1;
> > > 	smp_wmb();  /* or smp_mb() */
> > > 	global_p = &x;
> > >
> > > reader:
> > >
> > > 	p = global_p;
> > > 	ta = p->a;
> > >
> > > Both Alpha and aggressive compiler optimizations can result in the reader
> > > seeing the new value of the pointer (&x) but the old value of the field
> > > (0).  Strange but true.  The fix is as follows:
> > >
> > > reader:
> > >
> > > 	p = global_p;
> > > 	smp_read_barrier_depends();  /* or use rcu_dereference() */
> > > 	ta = p->a;
> > >
> > > So how can this happen?  First note that if smp_read_barrier_depends()
> > > was unnecessary in this case, it would be unnecessary in all cases.
> > >
> > > Second, let's start with the compiler.  Suppose that a highly optimizing
> > > compiler notices that in almost all cases, the reader finds p==global_p.
> > > Suppose that this compiler also notices that one of the registers (say
> > > r1) almost always contains this expected value of global_p, and that
> > > cache pressure ensures that an actual load from global_p almost always
> > > generates an expensive cache miss.  Such a compiler would be within its
> > > rights (as defined by the C standard) to generate code assuming that r1
> > > already had the right value, while also generating code to validate this
> > > assumption, perhaps as follows:
> > >
> > > 	r2 = global_p;  /* high latency, other things complete meanwhile */
> > > 	ta = r1->a;
> > > 	if (r1 != r2)
> > > 		ta = r2->a;
> > >
> > > Now consider the following sequence of events on a superscalar CPU:
> >
> > I think you missed one step here (causing my confusion). I don't want to
> > assume so I'll try to put in the missing step:
> >
> > 	writer: r1 = p;  /* happens to use r1 to store parameter p */
>
> You lost me on this one...  The writer has only the following three steps:

You're right. I meant "writer:  r1 = x;"

>
> writer:
>
> 	x.a = 1;
> 	smp_wmb();  /* or smp_mb() */
> 	global_p = &x;
>
> Where did the "r1 = p" come from?  For that matter, where did "p" come
> from?
>
> > > 	reader: r2 = global_p; /* issued, has not yet completed. */
> > > 	reader: ta = r1->a; /* which gives zero. */
> > > 	writer: x.a = 1;
> > > 	writer: smp_wmb();
> > > 	writer: global_p = &x;
> > > 	reader: r2 = global_p; /* this instruction now completes */
> > > 	reader: if (r1 != r2) /* and these are equal, so we keep bad ta! */
> >
> > Is that the case?
>
> Ah!  Please note that I am doing something unusual here in that I am
> working with global variables, as opposed to the normal RCU practice of
> dynamically allocating memory.  So "x" is just a global struct, not a
> pointer to a struct.
>

But let's look at a simple version of my original code anyway ;-)

Writer:

void add_op(struct myops *x) {
	/* x->next may be garbage here */
	x->next = global_p;
	smp_wmb();
	global_p = x;
}

Reader:

void read_op(void)
{
	struct myops *p = global_p;

	while (p != NULL) {
		p->func();
		p = next;
		/* if p->next is garbage we crash */
	}
}


Here, we are missing the read_barrier_depends(). Let's look at the Alpha
cache issue:


The reader reads the new version of global_p, and then reads the next
pointer. But since the next pointer is on a different cacheline than
global_p, the CPU may somehow still have had that line in its cache. So it
uses the old next pointer, which contains the garbage.

Is that correct?

But I will have to admit that I can't see how an aggressive compiler
might have screwed this up, given that x is a parameter and the function
add_op is not in a header file.

-- Steve


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-02-04 22:03                     ` Steven Rostedt
@ 2008-02-04 22:41                       ` Mathieu Desnoyers
  2008-02-05  6:11                         ` Paul E. McKenney
  2008-02-05  5:13                       ` Paul E. McKenney
  1 sibling, 1 reply; 45+ messages in thread
From: Mathieu Desnoyers @ 2008-02-04 22:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Peter Zijlstra, LKML, Ingo Molnar,
	Linus Torvalds, Andrew Morton, Christoph Hellwig,
	Gregory Haskins, Arnaldo Carvalho de Melo, Thomas Gleixner,
	Tim Bird, Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka,
	John Stultz, Arjan van de Ven, Steven Rostedt

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Mon, 4 Feb 2008, Paul E. McKenney wrote:
> > OK, will see what I can do...
> >
> > > On Sat, 2 Feb 2008, Paul E. McKenney wrote:
> > >
> > > > Yep, you have dependencies, so something like the following:
> > > >
> > > > initial state:
> > > >
> > > > 	struct foo {
> > > > 		int a;
> > > > 	};
> > > > 	struct foo x = { 0 };
> > > > 	struct foo y = { 0 };
> > > > 	struct foo *global_p = &y;
> > > > 	/* other variables are appropriately declared auto variables */
> > > >
> > > > 	/* No kmalloc() or kfree(), hence no RCU grace periods. */
> > > > 	/* In the terminology of http://lwn.net/Articles/262464/, we */
> > > > 	/* are doing only publish-subscribe, nothing else. */
> > > >
> > > > writer:
> > > >
> > > > 	x.a = 1;
> > > > 	smp_wmb();  /* or smp_mb() */
> > > > 	global_p = &x;
> > > >
> > > > reader:
> > > >
> > > > 	p = global_p;
> > > > 	ta = p->a;
> > > >
> > > > Both Alpha and aggressive compiler optimizations can result in the reader
> > > > seeing the new value of the pointer (&x) but the old value of the field
> > > > (0).  Strange but true.  The fix is as follows:
> > > >
> > > > reader:
> > > >
> > > > 	p = global_p;
> > > > 	smp_read_barrier_depends();  /* or use rcu_dereference() */
> > > > 	ta = p->a;
> > > >
> > > > So how can this happen?  First note that if smp_read_barrier_depends()
> > > > was unnecessary in this case, it would be unnecessary in all cases.
> > > >
> > > > Second, let's start with the compiler.  Suppose that a highly optimizing
> > > > compiler notices that in almost all cases, the reader finds p==global_p.
> > > > Suppose that this compiler also notices that one of the registers (say
> > > > r1) almost always contains this expected value of global_p, and that
> > > > cache pressure ensures that an actual load from global_p almost always
> > > > generates an expensive cache miss.  Such a compiler would be within its
> > > > rights (as defined by the C standard) to generate code assuming that r1
> > > > already had the right value, while also generating code to validate this
> > > > assumption, perhaps as follows:
> > > >
> > > > 	r2 = global_p;  /* high latency, other things complete meanwhile */
> > > > 	ta = r1->a;
> > > > 	if (r1 != r2)
> > > > 		ta = r2->a;
> > > >
> > > > Now consider the following sequence of events on a superscalar CPU:
> > >
> > > I think you missed one step here (causing my confusion). I don't want to
> > > assume so I'll try to put in the missing step:
> > >
> > > 	writer: r1 = p;  /* happens to use r1 to store parameter p */
> >
> > You lost me on this one...  The writer has only the following three steps:
> 
> You're right. I meant "writer:  r1 = x;"
> 
> >
> > writer:
> >
> > 	x.a = 1;
> > 	smp_wmb();  /* or smp_mb() */
> > 	global_p = &x;
> >
> > Where did the "r1 = p" come from?  For that matter, where did "p" come
> > from?
> >
> > > > 	reader: r2 = global_p; /* issued, has not yet completed. */
> > > > 	reader: ta = r1->a; /* which gives zero. */
> > > > 	writer: x.a = 1;
> > > > 	writer: smp_wmb();
> > > > 	writer: global_p = &x;
> > > > 	reader: r2 = global_p; /* this instruction now completes */
> > > > 	reader: if (r1 != r2) /* and these are equal, so we keep bad ta! */
> > >
> > > Is that the case?
> >
> > Ah!  Please note that I am doing something unusual here in that I am
> > working with global variables, as opposed to the normal RCU practice of
> > dynamically allocating memory.  So "x" is just a global struct, not a
> > pointer to a struct.
> >
> 
> But let's look at a simple version of my original code anyway ;-)
> 
> Writer:
> 
> void add_op(struct myops *x) {
> 	/* x->next may be garbage here */
> 	x->next = global_p;
> 	smp_wmb();
> 	global_p = x;
> }
> 
> Reader:
> 
> void read_op(void)
> {
> 	struct myops *p = global_p;
> 
> 	while (p != NULL) {
> 		p->func();
> 		p = p->next;
> 		/* if p->next is garbage we crash */
> 	}
> }
> 
> 
> Here, we are missing the read_barrier_depends(). Let's look at the Alpha
> cache issue:
> 
> 
> The reader reads the new version of global_p, and then reads the next
> pointer. But since the next pointer is on a different cacheline than
> global_p, the CPU may somehow still have had that line in its cache. So it
> uses the old next pointer, which contains the garbage.
> 
> Is that correct?
> 
> But I will have to admit that I can't see how an aggressive compiler
> might have screwed this up, given that x is a parameter and the function
> add_op is not in a header file.
> 

Tell me if I am mistaken, but applying Paul's explanation to your
example would give (I unroll the loop for clarity):

Writer:

void add_op(struct myops *x) {
	/* x->next may be garbage here */
	x->next = global_p;
	smp_wmb();
	global_p = x;
}

Reader:

void read_op(void)
{
	struct myops *p = global_p;

  if (p != NULL) {
		p->func();
		p = p->next;
  /*
   * Suppose the compiler expects that p->next is likely to be equal to
   * p + sizeof(struct myops), uses r1 to store previous p, r2 to store the
   * next p and r3 to store the expected value. Let's look at what the
   * compiler could do for the next loop iteration.
   */
  r2 = r1->next   (1)
  r3 = r1 + sizeof(struct myops)
  r4 = r3->func   (2)
  if (r3 == r2 && r3 != NULL)
    call r4

		/* if p->next is garbage we crash */
	} else
    return;

  if (p != NULL) {
		p->func();
		p = p->next;
		/* if p->next is garbage we crash */
	} else
    return;
  .....
}

In this example, we would be reading the expected "r3->func" (2) before
reading the real r1->next (1) value if reads are issued out of order.

Paul, am I correct? And... does the specific loop optimization I just
described actually exist?

Thanks for your enlightenment :)

Mathieu

> -- Steve
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-02-04 22:03                     ` Steven Rostedt
  2008-02-04 22:41                       ` Mathieu Desnoyers
@ 2008-02-05  5:13                       ` Paul E. McKenney
  1 sibling, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2008-02-05  5:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Linus Torvalds, Andrew Morton,
	Christoph Hellwig, Mathieu Desnoyers, Gregory Haskins,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Tim Bird,
	Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka, John Stultz,
	Arjan van de Ven, Steven Rostedt

On Mon, Feb 04, 2008 at 05:03:47PM -0500, Steven Rostedt wrote:
> 
> On Mon, 4 Feb 2008, Paul E. McKenney wrote:
> > OK, will see what I can do...
> >
> > > On Sat, 2 Feb 2008, Paul E. McKenney wrote:
> > >
> > > > Yep, you have dependencies, so something like the following:
> > > >
> > > > initial state:
> > > >
> > > > 	struct foo {
> > > > 		int a;
> > > > 	};
> > > > 	struct foo x = { 0 };
> > > > 	struct foo y = { 0 };
> > > > 	struct foo *global_p = &y;
> > > > 	/* other variables are appropriately declared auto variables */
> > > >
> > > > 	/* No kmalloc() or kfree(), hence no RCU grace periods. */
> > > > 	/* In the terminology of http://lwn.net/Articles/262464/, we */
> > > > 	/* are doing only publish-subscribe, nothing else. */
> > > >
> > > > writer:
> > > >
> > > > 	x.a = 1;
> > > > 	smp_wmb();  /* or smp_mb() */
> > > > 	global_p = &x;
> > > >
> > > > reader:
> > > >
> > > > 	p = global_p;
> > > > 	ta = p->a;
> > > >
> > > > Both Alpha and aggressive compiler optimizations can result in the reader
> > > > seeing the new value of the pointer (&x) but the old value of the field
> > > > (0).  Strange but true.  The fix is as follows:
> > > >
> > > > reader:
> > > >
> > > > 	p = global_p;
> > > > 	smp_read_barrier_depends();  /* or use rcu_dereference() */
> > > > 	ta = p->a;
> > > >
> > > > So how can this happen?  First note that if smp_read_barrier_depends()
> > > > was unnecessary in this case, it would be unnecessary in all cases.
> > > >
> > > > Second, let's start with the compiler.  Suppose that a highly optimizing
> > > > compiler notices that in almost all cases, the reader finds p==global_p.
> > > > Suppose that this compiler also notices that one of the registers (say
> > > > r1) almost always contains this expected value of global_p, and that
> > > > cache pressure ensures that an actual load from global_p almost always
> > > > generates an expensive cache miss.  Such a compiler would be within its
> > > > rights (as defined by the C standard) to generate code assuming that r1
> > > > already had the right value, while also generating code to validate this
> > > > assumption, perhaps as follows:
> > > >
> > > > 	r2 = global_p;  /* high latency, other things complete meanwhile */
> > > > 	ta = r1->a;
> > > > 	if (r1 != r2)
> > > > 		ta = r2->a;
> > > >
> > > > Now consider the following sequence of events on a superscalar CPU:
> > >
> > > I think you missed one step here (causing my confusion). I don't want to
> > > assume so I'll try to put in the missing step:
> > >
> > > 	writer: r1 = p;  /* happens to use r1 to store parameter p */
> >
> > You lost me on this one...  The writer has only the following three steps:
> 
> You're right. I meant "writer:  r1 = x;"

OK, I understand.  You are correct, it would make more sense at the machine
level for the writer to do something like:

writer:

	r1 = &x;
	r1->a = 1;
	smp_wmb();  /* or smp_mb() */
	global_p = r1;

> > writer:
> >
> > 	x.a = 1;
> > 	smp_wmb();  /* or smp_mb() */
> > 	global_p = &x;
> >
> > Where did the "r1 = p" come from?  For that matter, where did "p" come
> > from?
> >
> > > > 	reader: r2 = global_p; /* issued, has not yet completed. */
> > > > 	reader: ta = r1->a; /* which gives zero. */
> > > > 	writer: x.a = 1;
> > > > 	writer: smp_wmb();
> > > > 	writer: global_p = &x;
> > > > 	reader: r2 = global_p; /* this instruction now completes */
> > > > 	reader: if (r1 != r2) /* and these are equal, so we keep bad ta! */
> > >
> > > Is that the case?
> >
> > Ah!  Please note that I am doing something unusual here in that I am
> > working with global variables, as opposed to the normal RCU practice of
> > dynamically allocating memory.  So "x" is just a global struct, not a
> > pointer to a struct.
> 
> But let's look at a simple version of my original code anyway ;-)

Fair enough!  ;-)

> Writer:
> 
> void add_op(struct myops *x) {
> 	/* x->next may be garbage here */
> 	x->next = global_p;
> 	smp_wmb();
> 	global_p = x;
> }
> 
> Reader:
> 
> void read_op(void)
> {
> 	struct myops *p = global_p;
> 
> 	while (p != NULL) {
> 		p->func();
> 		p = p->next;
> 		/* if p->next is garbage we crash */
> 	}
> }
> 
> 
> Here, we are missing the read_barrier_depends(). Let's look at the Alpha
> cache issue:
> 
> 
> The reader reads the new version of global_p, and then reads the next
> pointer. But since the next pointer is on a different cacheline than
> global_p, the CPU may somehow still have had that line in its cache. So it
> uses the old next pointer, which contains the garbage.
> 
> Is that correct?

Indeed!  Changing the reader to be as follows should fix it:

Reader:

void read_op(void)
{
	struct myops *p = global_p;

	while (p != NULL) {
		smp_read_barrier_depends();
		p->func();
		p = p->next;
		/* if p->next is garbage we crash */
	}
}

> But I will have to admit that I can't see how an aggressive compiler
> might have screwed this up, given that x is a parameter and the function
> add_op is not in a header file.

Ah...

Suppose that we have a compiler that uses profile-based feedback.
It compiles the kernel with profiling code that tracks the values of
pointers, whether function arguments or global variables.  All it need
do is look for repeated values, and then track the fraction of time
that the value occurs.  If a given value occurs (say) 99.999% of the
time, it might make sense for the compiler to simply guess the value
ahead of time.  Even more to the point, if the compiler determines
that an existing register already has the correct value 99.999% of
the time, we can simply use that register, then check that the value
was correct.

This might appear as follows:

	Reader:

	void read_op(void)
	{
		struct myops *p = global_p;

		while (p != NULL) {
			p->func();
			/* do stuff with r1 assuming r1==p->next. */
			r2 = p->next;
			if (r2 != r1) {
				/* compensate somehow for guessing wrong. */
			}
			p = r2;
		}
	}

Insane?  Probably so.  But there are compiler guys who swear by it.

						Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
  2008-02-04 22:41                       ` Mathieu Desnoyers
@ 2008-02-05  6:11                         ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2008-02-05  6:11 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Peter Zijlstra, LKML, Ingo Molnar,
	Linus Torvalds, Andrew Morton, Christoph Hellwig,
	Gregory Haskins, Arnaldo Carvalho de Melo, Thomas Gleixner,
	Tim Bird, Sam Ravnborg, Frank Ch. Eigler, Jan Kiszka,
	John Stultz, Arjan van de Ven, Steven Rostedt

On Mon, Feb 04, 2008 at 05:41:40PM -0500, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > 
> > On Mon, 4 Feb 2008, Paul E. McKenney wrote:
> > > OK, will see what I can do...
> > >
> > > > On Sat, 2 Feb 2008, Paul E. McKenney wrote:
> > > >
> > > > > Yep, you have dependencies, so something like the following:
> > > > >
> > > > > initial state:
> > > > >
> > > > > 	struct foo {
> > > > > 		int a;
> > > > > 	};
> > > > > 	struct foo x = { 0 };
> > > > > 	struct foo y = { 0 };
> > > > > 	struct foo *global_p = &y;
> > > > > 	/* other variables are appropriately declared auto variables */
> > > > >
> > > > > 	/* No kmalloc() or kfree(), hence no RCU grace periods. */
> > > > > 	/* In the terminology of http://lwn.net/Articles/262464/, we */
> > > > > 	/* are doing only publish-subscribe, nothing else. */
> > > > >
> > > > > writer:
> > > > >
> > > > > 	x.a = 1;
> > > > > 	smp_wmb();  /* or smp_mb() */
> > > > > 	global_p = &x;
> > > > >
> > > > > reader:
> > > > >
> > > > > 	p = global_p;
> > > > > 	ta = p->a;
> > > > >
> > > > > Both Alpha and aggressive compiler optimizations can result in the reader
> > > > > seeing the new value of the pointer (&x) but the old value of the field
> > > > > (0).  Strange but true.  The fix is as follows:
> > > > >
> > > > > reader:
> > > > >
> > > > > 	p = global_p;
> > > > > 	smp_read_barrier_depends();  /* or use rcu_dereference() */
> > > > > 	ta = p->a;
> > > > >
> > > > > So how can this happen?  First note that if smp_read_barrier_depends()
> > > > > was unnecessary in this case, it would be unnecessary in all cases.
> > > > >
> > > > > Second, let's start with the compiler.  Suppose that a highly optimizing
> > > > > compiler notices that in almost all cases, the reader finds p==global_p.
> > > > > Suppose that this compiler also notices that one of the registers (say
> > > > > r1) almost always contains this expected value of global_p, and that
> > > > > cache pressure ensures that an actual load from global_p almost always
> > > > > generates an expensive cache miss.  Such a compiler would be within its
> > > > > rights (as defined by the C standard) to generate code assuming that r1
> > > > > already had the right value, while also generating code to validate this
> > > > > assumption, perhaps as follows:
> > > > >
> > > > > 	r2 = global_p;  /* high latency, other things complete meanwhile */
> > > > > 	ta = r1->a;
> > > > > 	if (r1 != r2)
> > > > > 		ta = r2->a;
> > > > >
> > > > > Now consider the following sequence of events on a superscalar CPU:
> > > >
> > > > I think you missed one step here (causing my confusion). I don't want to
> > > > assume so I'll try to put in the missing step:
> > > >
> > > > 	writer: r1 = p;  /* happens to use r1 to store parameter p */
> > >
> > > You lost me on this one...  The writer has only the following three steps:
> > 
> > You're right. I meant "writer:  r1 = x;"
> > 
> > >
> > > writer:
> > >
> > > 	x.a = 1;
> > > 	smp_wmb();  /* or smp_mb() */
> > > 	global_p = &x;
> > >
> > > Where did the "r1 = p" come from?  For that matter, where did "p" come
> > > from?
> > >
> > > > > 	reader: r2 = global_p; /* issued, has not yet completed. */
> > > > > 	reader: ta = r1->a; /* which gives zero. */
> > > > > 	writer: x.a = 1;
> > > > > 	writer: smp_wmb();
> > > > > 	writer: global_p = &x;
> > > > > 	reader: r2 = global_p; /* this instruction now completes */
> > > > > 	reader: if (r1 != r2) /* and these are equal, so we keep bad ta! */
> > > >
> > > > Is that the case?
> > >
> > > Ah!  Please note that I am doing something unusual here in that I am
> > > working with global variables, as opposed to the normal RCU practice of
> > > dynamically allocating memory.  So "x" is just a global struct, not a
> > > pointer to a struct.
> > >
> > 
> > But let's look at a simple version of my original code anyway ;-)
> > 
> > Writer:
> > 
> > void add_op(struct myops *x) {
> > 	/* x->next may be garbage here */
> > 	x->next = global_p;
> > 	smp_wmb();
> > 	global_p = x;
> > }
> > 
> > Reader:
> > 
> > void read_op(void)
> > {
> > 	struct myops *p = global_p;
> > 
> > 	while (p != NULL) {
> > 		p->func();
> > 		p = p->next;
> > 		/* if p->next is garbage we crash */
> > 	}
> > }
> > 
> > 
> > Here, we are missing the read_barrier_depends(). Let's look at the Alpha
> > cache issue:
> > 
> > 
> > The reader reads the new version of global_p, and then reads the next
> > pointer. But since the next pointer is on a different cacheline than
> > global_p, the CPU may somehow still have had that line in its cache. So it
> > uses the old next pointer, which contains the garbage.
> > 
> > Is that correct?
> > 
> > But I will have to admit that I can't see how an aggressive compiler
> > might have screwed this up, given that x is a parameter and the function
> > add_op is not in a header file.
> > 
> 
> Tell me if I am mistaken, but applying Paul's explanation to your
> example would give (I unroll the loop for clarity):
> 
> Writer:
> 
> void add_op(struct myops *x) {
> 	/* x->next may be garbage here */
> 	x->next = global_p;
> 	smp_wmb();
> 	global_p = x;
> }
> 
> Reader:
> 
> void read_op(void)
> {
> 	struct myops *p = global_p;
> 
>   if (p != NULL) {
> 		p->func();
> 		p = p->next;
>   /*
>    * Suppose the compiler expects that p->next is likely to be equal to
>    * p + sizeof(struct myops), uses r1 to store previous p, r2 to store the
>    * next p and r3 to store the expected value. Let's look at what the
>    * compiler could do for the next loop iteration.
>    */
>   r2 = r1->next   (1)
>   r3 = r1 + sizeof(struct myops)
>   r4 = r3->func   (2)
>   if (r3 == r2 && r3 != NULL)
>     call r4
> 
> 		/* if p->next is garbage we crash */
> 	} else
>     return;
> 
>   if (p != NULL) {
> 		p->func();
> 		p = p->next;
> 		/* if p->next is garbage we crash */
> 	} else
>     return;
>   .....
> }
> 
> In this example, we would be reading the expected "r3->func" (2) before
> reading the real r1->next (1) value if reads are issued out of order.
> 
> Paul, am I correct? And... does the specific loop optimization I just
> described actually exist?

This is indeed another form of value prediction.  Perhaps more common
in scientific applications, but one could imagine it occurring in the
kernel as well.

In some cases, the read from the real r1->next might be deferred until
after the computation so as to overlap the speculative computation with
the memory latency.  Border-line insane, perhaps, but some compiler
folks like this sort of approach...

> Thanks for your enlightenment :)

;-)

						Thanx, Paul

Thread overview: 45+ messages
2008-01-30  3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
2008-01-30  3:15 ` [PATCH 01/22 -v7] printk - dont wakeup klogd with interrupts disabled Steven Rostedt
2008-01-30  3:15 ` [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation Steven Rostedt
2008-01-30  8:46   ` Peter Zijlstra
2008-01-30 13:08     ` Steven Rostedt
2008-01-30 14:09       ` Steven Rostedt
2008-01-30 14:25         ` Peter Zijlstra
2008-02-01 22:34           ` Paul E. McKenney
2008-02-02  1:56             ` Steven Rostedt
2008-02-02 21:41               ` Paul E. McKenney
2008-02-04 17:09                 ` Steven Rostedt
2008-02-04 21:40                   ` Paul E. McKenney
2008-02-04 22:03                     ` Steven Rostedt
2008-02-04 22:41                       ` Mathieu Desnoyers
2008-02-05  6:11                         ` Paul E. McKenney
2008-02-05  5:13                       ` Paul E. McKenney
2008-01-30 13:21   ` Jan Kiszka
2008-01-30 13:53     ` Steven Rostedt
2008-01-30 14:28       ` Steven Rostedt
2008-01-30  3:15 ` [PATCH 03/22 -v7] Annotate core code that should not be traced Steven Rostedt
2008-01-30  3:15 ` [PATCH 04/22 -v7] x86_64: notrace annotations Steven Rostedt
2008-01-30  3:15 ` [PATCH 05/22 -v7] add notrace annotations to vsyscall Steven Rostedt
2008-01-30  8:49   ` Peter Zijlstra
2008-01-30 13:15     ` Steven Rostedt
2008-01-30  3:15 ` [PATCH 06/22 -v7] handle accurate time keeping over long delays Steven Rostedt
2008-01-30  3:15 ` [PATCH 07/22 -v7] initialize the clock source to jiffies clock Steven Rostedt
2008-01-30  3:15 ` [PATCH 08/22 -v7] add get_monotonic_cycles Steven Rostedt
2008-01-30  3:15 ` [PATCH 09/22 -v7] add notrace annotations to timing events Steven Rostedt
2008-01-30  3:15 ` [PATCH 10/22 -v7] mcount based trace in the form of a header file library Steven Rostedt
2008-01-30  3:15 ` [PATCH 11/22 -v7] Add context switch marker to sched.c Steven Rostedt
2008-01-30  3:15 ` [PATCH 12/22 -v7] Make the task State char-string visible to all Steven Rostedt
2008-01-30  3:15 ` [PATCH 13/22 -v7] Add tracing of context switches Steven Rostedt
2008-01-30  3:15 ` [PATCH 14/22 -v7] Generic command line storage Steven Rostedt
2008-01-30  3:15 ` [PATCH 15/22 -v7] trace generic call to schedule switch Steven Rostedt
2008-01-30  3:15 ` [PATCH 16/22 -v7] Add marker in try_to_wake_up Steven Rostedt
2008-01-30  3:15 ` [PATCH 17/22 -v7] mcount tracer for wakeup latency timings Steven Rostedt
2008-01-30  9:31   ` Peter Zijlstra
2008-01-30 13:18     ` Steven Rostedt
2008-01-30  3:15 ` [PATCH 18/22 -v7] Trace irq disabled critical timings Steven Rostedt
2008-01-30  3:15 ` [PATCH 19/22 -v7] trace preempt off " Steven Rostedt
2008-01-30  9:40   ` Peter Zijlstra
2008-01-30 13:40     ` Steven Rostedt
2008-01-30  3:15 ` [PATCH 20/22 -v7] Add markers to various events Steven Rostedt
2008-01-30  3:15 ` [PATCH 21/22 -v7] Add event tracer Steven Rostedt
2008-01-30  3:15 ` [PATCH 22/22 -v7] Critical latency timings histogram Steven Rostedt
