LKML Archive on lore.kernel.org
* [PATCH 0/4] ftrace: add userspace stacktrace support and semaphore-latency tracer
@ 2008-10-12 13:11 Török Edwin
  2008-10-12 13:12 ` Török Edwin
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Török Edwin @ 2008-10-12 13:11 UTC (permalink / raw)
  To: mingo, srostedt
  Cc: a.p.zijlstra, sandmann, linux-kernel, Török Edwin


Hi,

This patch series is a draft implementation of a semaphore-latency tracer.
I would appreciate feedback on these patches so that they can be merged into
mainline.

The patches apply on top of tip/master (commit aaf9631d60b82182c7e9890b5ffec70f38131a50).

The first two patches add userspace stacktrace support.
Unfortunately, if the userspace application (in practice, libc) is compiled without
frame pointers on x86-64, the userspace trace will contain only one entry: the return address.
You can either request raw IP addresses in the trace, or ask for them to be
resolved to an object file plus relative address; a tool such as addr2line can then be
used to obtain file:line information.
This is currently x86-only, but other architectures could implement it.
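To illustrate the object + relative-address resolution, here is a minimal userspace
shell sketch that performs the equivalent lookup from a /proc/<pid>/maps line (both
the maps line and the IP value below are hypothetical; the kernel-side option does
the same lookup against the task's vmas):

```shell
# Resolve a raw userspace IP to object[+offset] using one (hypothetical)
# /proc/<pid>/maps line.
ip=0x7f5a12c0d370
map='7f5a12c00000-7f5a12c1a000 r-xp 00000000 08:01 1234 /lib/libpthread-2.7.so'
range=${map%% *}                 # "start-end" address field of the mapping
lo=0x${range%-*}
hi=0x${range#*-}
obj=${map##* }                   # pathname field of the mapping
if (( ip >= lo && ip < hi )); then
    # prints: /lib/libpthread-2.7.so[+0xd370]
    printf '%s[+0x%x]\n' "$obj" $((ip - lo))
fi
```

The printed offset (0xd370) is exactly what addr2line expects for a shared object.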

TODO: It would be worth investigating whether we could avoid duplicating code with
oprofile by factoring out a common function for userspace traces and converting oprofile to use it.

The last two patches add a new tracer; they are independent of the first two patches.
This tracer attempts to help in tracking down rwsem latency issues.
Each latency instance gets an ID, and a userspace program can be used to group
the stacktraces by latency ID, thus obtaining a list of all tasks involved
in a latency: the task(s) holding the semaphore, and the tasks
that had to wait because of it.

See the patches themselves for more detail on how to use them.

Shortlog below:

Török Edwin (4):
  Add support for userspace stacktraces in tracing/iter_ctrl
  Identify which executable object the userspace address belongs to.
  add tracepoints in rwsem
  Implement semaphore latency tracer

 Documentation/ftrace.txt     |    5 +-
 arch/x86/kernel/stacktrace.c |   57 +++++++++++++
 include/linux/stacktrace.h   |    8 ++
 include/trace/rwsem.h        |   35 ++++++++
 kernel/trace/Kconfig         |   12 +++
 kernel/trace/Makefile        |    1 +
 kernel/trace/trace.c         |  188 ++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h         |   35 ++++++++
 kernel/trace/trace_semlat.c  |  186 +++++++++++++++++++++++++++++++++++++++++
 lib/rwsem-spinlock.c         |   16 ++--
 lib/rwsem.c                  |   13 +--
 11 files changed, 539 insertions(+), 17 deletions(-)
 create mode 100644 include/trace/rwsem.h
 create mode 100644 kernel/trace/trace_semlat.c


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl
  2008-10-12 13:11 [PATCH 0/4] ftrace: add userspace stacktrace support and semaphore-latency tracer Török Edwin
  2008-10-12 13:12 ` Török Edwin
@ 2008-10-12 13:12 ` Török Edwin
  2008-10-12 13:31   ` Frédéric Weisbecker
                     ` (2 more replies)
  2008-10-12 13:12 ` [PATCH 2/4] Identify which executable object the userspace address belongs to Török Edwin
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 26+ messages in thread
From: Török Edwin @ 2008-10-12 13:12 UTC (permalink / raw)
  To: mingo, srostedt
  Cc: a.p.zijlstra, sandmann, linux-kernel, Török Edwin

Usage example:
mount -t debugfs nodev /sys/kernel/debug
cd /sys/kernel/debug/tracing
echo userstacktrace >iter_ctrl
echo sched_switch >current_tracer
echo 1 >tracing_enabled
.... run application ...
echo 0 >tracing_enabled

Then read one of 'trace', 'latency_trace', 'trace_pipe'

Signed-off-by: Török Edwin <edwintorok@gmail.com>
---
 Documentation/ftrace.txt     |    5 ++-
 arch/x86/kernel/stacktrace.c |   57 +++++++++++++++++++++++++++++++++++
 include/linux/stacktrace.h   |    8 +++++
 kernel/trace/trace.c         |   67 ++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h         |    2 +
 5 files changed, 138 insertions(+), 1 deletions(-)

diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt
index d330fe3..f39cb68 100644
--- a/Documentation/ftrace.txt
+++ b/Documentation/ftrace.txt
@@ -327,7 +327,7 @@ output. To see what is available, simply cat the file:
 
   cat /debug/tracing/iter_ctrl
   print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
- noblock nostacktrace nosched-tree
+ noblock nostacktrace nosched-tree nouserstacktrace
 
 To disable one of the options, echo in the option prepended with "no".
 
@@ -381,6 +381,9 @@ Here are the available options:
 		When a trace is recorded, so is the stack of functions.
 		This allows for back traces of trace sites.
 
+  userstacktrace - This option changes the trace.
+		   It records a stacktrace of the current userspace thread.
+
   sched-tree - TBD (any users??)
 
 
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index d1d850a..000a965 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -6,6 +6,7 @@
 #include <linux/sched.h>
 #include <linux/stacktrace.h>
 #include <linux/module.h>
+#include <linux/uaccess.h>
 #include <asm/stacktrace.h>
 
 static void save_stack_warning(void *data, char *msg)
@@ -90,3 +91,59 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
 		trace->entries[trace->nr_entries++] = ULONG_MAX;
 }
 EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
+
+/* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
+
+struct stack_frame {
+	const void __user	*next_fp;
+	unsigned long		return_address;
+};
+
+static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
+{
+	int ret;
+
+	if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
+		return 0;
+
+	ret = 1;
+	pagefault_disable();
+	if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
+		ret = 0;
+	pagefault_enable();
+
+	return ret;
+}
+
+void save_stack_trace_user(struct stack_trace *trace)
+{
+	/*
+	 * Trace user stack if we are not a kernel thread
+	 */
+	if (current->mm) {
+		const struct pt_regs *regs = task_pt_regs(current);
+		const void __user *fp = (const void __user *)regs->bp;
+
+		if (trace->nr_entries < trace->max_entries)
+			trace->entries[trace->nr_entries++] = regs->ip;
+
+		while (trace->nr_entries < trace->max_entries) {
+			struct stack_frame frame;
+			frame.next_fp = NULL;
+			frame.return_address = 0;
+			if (!copy_stack_frame(fp, &frame))
+				break;
+			if ((unsigned long)fp < regs->sp)
+				break;
+			if (frame.return_address)
+				trace->entries[trace->nr_entries++] =
+					frame.return_address;
+			if (fp == frame.next_fp)
+				break;
+			fp = frame.next_fp;
+		}
+	}
+	if (trace->nr_entries < trace->max_entries)
+		trace->entries[trace->nr_entries++] = ULONG_MAX;
+}
+
diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
index 6b8e54a..46704a5 100644
--- a/include/linux/stacktrace.h
+++ b/include/linux/stacktrace.h
@@ -18,9 +18,17 @@ extern void save_stack_trace_tsk(struct task_struct *tsk,
 				struct stack_trace *trace);
 
 extern void print_stack_trace(struct stack_trace *trace, int spaces);
+
+#ifdef CONFIG_X86
+extern void save_stack_trace_user(struct stack_trace *trace);
+#else
+# define save_stack_trace_user(trace)              do { } while (0)
+#endif
+
 #else
 # define save_stack_trace(trace)			do { } while (0)
 # define save_stack_trace_tsk(tsk, trace)		do { } while (0)
+# define save_stack_trace_user(trace)              do { } while (0)
 # define print_stack_trace(trace, spaces)		do { } while (0)
 #endif
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index d345d64..4c17453 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -212,6 +212,7 @@ static const char *trace_options[] = {
 	"stacktrace",
 	"sched-tree",
 	"ftrace_printk",
+	"userstacktrace",
 	NULL
 };
 
@@ -735,6 +736,28 @@ void __trace_stack(struct trace_array *tr,
 	ftrace_trace_stack(tr, data, flags, skip, preempt_count());
 }
 
+static void __trace_userstack(struct trace_array *tr,
+		   struct trace_array_cpu *data,
+		   unsigned long flags)
+{
+	struct trace_entry *entry;
+	struct stack_trace trace;
+
+	if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
+		return;
+
+	entry			= tracing_get_trace_entry(tr, data);
+	tracing_generic_entry_update(entry, flags);
+	entry->type		= TRACE_USER_STACK;
+
+	memset(&entry->field.stack, 0, sizeof(entry->field.stack));
+	trace.nr_entries	= 0;
+	trace.max_entries	= FTRACE_STACK_ENTRIES;
+	trace.skip		= 0;
+	trace.entries		= entry->field.stack.caller;
+	save_stack_trace_user(&trace);
+}
+
 static void
 ftrace_trace_special(void *__tr, void *__data,
 		     unsigned long arg1, unsigned long arg2, unsigned long arg3,
@@ -758,6 +781,7 @@ ftrace_trace_special(void *__tr, void *__data,
 	entry->arg3			= arg3;
 	ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
 	ftrace_trace_stack(tr, data, irq_flags, 4, pc);
+	__trace_userstack(tr, data, irq_flags);
 
 	trace_wake_up();
 }
@@ -1151,6 +1175,31 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
 	return ret;
 }
 
+static int
+seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
+		unsigned long sym_flags)
+{
+	int ret = 1;
+	unsigned i;
+
+	for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
+		unsigned long ip = entry->field.stack.caller[i];
+
+		if (ip == ULONG_MAX || !ret)
+			break;
+		if (i)
+			ret = trace_seq_puts(s, " <- ");
+		if (!ip) {
+			ret = trace_seq_puts(s, "??");
+			continue;
+		}
+		if (sym_flags & TRACE_ITER_SYM_ADDR)
+			ret = trace_seq_printf(s, " <" IP_FMT ">", ip);
+	}
+
+	return ret;
+}
+
 static void print_lat_help_header(struct seq_file *m)
 {
 	seq_puts(m, "#                  _------=> CPU#            \n");
@@ -1437,6 +1486,10 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
 			trace_seq_print_cont(s, iter);
 		break;
 	}
+	case TRACE_USER_STACK:
+		seq_print_userip_objs(entry, s, sym_flags);
+		trace_seq_putc(s, '\n');
+		break;
 	default:
 		trace_seq_printf(s, "Unknown type %d\n", entry->type);
 	}
@@ -1572,6 +1625,14 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
 		if (entry->flags & TRACE_FLAG_CONT)
 			trace_seq_print_cont(s, iter);
 		break;
+	case TRACE_USER_STACK:
+		ret = seq_print_userip_objs(entry, s, sym_flags);
+		if (!ret)
+			return 0;
+		ret = trace_seq_putc(s, '\n');
+		if (!ret)
+			return 0;
+		break;
 	}
 	}
 	return TRACE_TYPE_HANDLED;
@@ -1632,6 +1693,7 @@ static enum print_line_t print_raw_fmt(struct trace_iterator *iter)
 		break;
 	}
 	case TRACE_SPECIAL:
+	case TRACE_USER_STACK:
 	case TRACE_STACK: {
 		struct special_entry *field;
 
@@ -1720,6 +1782,7 @@ static enum print_line_t print_hex_fmt(struct trace_iterator *iter)
 		break;
 	}
 	case TRACE_SPECIAL:
+	case TRACE_USER_STACK:
 	case TRACE_STACK: {
 		struct special_entry *field;
 
@@ -1774,6 +1837,7 @@ static enum print_line_t print_bin_fmt(struct trace_iterator *iter)
 		break;
 	}
 	case TRACE_SPECIAL:
+	case TRACE_USER_STACK:
 	case TRACE_STACK: {
 		struct special_entry *field;
 
@@ -3103,6 +3167,9 @@ void ftrace_dump(void)
 		atomic_inc(&global_trace.data[cpu]->disabled);
 	}
 
+	/* don't look at user memory in panic mode */
+	trace_flags &= ~TRACE_ITER_SYM_USEROBJ;
+
 	printk(KERN_TRACE "Dumping ftrace buffer:\n");
 
 	iter.tr = &global_trace;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index f1f9957..3ad6343 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -22,6 +22,7 @@ enum trace_type {
 	TRACE_MMIO_RW,
 	TRACE_MMIO_MAP,
 	TRACE_BOOT,
+	TRACE_USER_STACK,
 
 	__TRACE_LAST_TYPE
 };
@@ -413,6 +414,7 @@ enum trace_iterator_flags {
 	TRACE_ITER_STACKTRACE		= 0x100,
 	TRACE_ITER_SCHED_TREE		= 0x200,
 	TRACE_ITER_PRINTK		= 0x400,
+	TRACE_ITER_USERSTACKTRACE       = 0x800
 };
 
 extern struct tracer nop_trace;
-- 
1.5.6.5


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/4] Identify which executable object the userspace address belongs to.
  2008-10-12 13:11 [PATCH 0/4] ftrace: add userspace stacktrace support and semaphore-latency tracer Török Edwin
  2008-10-12 13:12 ` Török Edwin
  2008-10-12 13:12 ` [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl Török Edwin
@ 2008-10-12 13:12 ` Török Edwin
  2008-10-12 13:12 ` [PATCH 3/4] add tracepoints in rwsem Török Edwin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Török Edwin @ 2008-10-12 13:12 UTC (permalink / raw)
  To: mingo, srostedt
  Cc: a.p.zijlstra, sandmann, linux-kernel, Török Edwin

Example usage:
mount -t debugfs nodev /sys/kernel/debug
cd /sys/kernel/debug/tracing
echo userstacktrace >iter_ctrl
echo sym-userobj >iter_ctrl
echo sched_switch >current_tracer
echo 1 >tracing_enabled
cat trace_pipe >/tmp/trace&
.... run application ...
echo 0 >tracing_enabled
cat /tmp/trace

You'll see stack entries like:
 /lib/libpthread-2.7.so[+0xd370]
You can convert them to function/line using:
 addr2line -fie /lib/libpthread-2.7.so 0xd370
Or
 addr2line -fie /usr/lib/debug/libpthread-2.7.so 0xd370

For non-PIC/PIE executables this won't work:
 a.out[+0x73b]
You need to run the following: addr2line -fie a.out 0x40073b
(where 0x400000 is the default load address of a.out)
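The arithmetic above can be sketched in shell (a minimal sketch; 0x400000 is the
conventional non-PIE load address on x86-64, and 0x73b is the offset from the
example above):

```shell
# Compute the absolute address for a non-PIE binary: add the default
# load address to the object-relative offset shown in the trace.
off=0x73b            # offset as printed in the trace: a.out[+0x73b]
base=0x400000        # default load address of a non-PIE x86-64 executable
cmd=$(printf 'addr2line -fie a.out 0x%x' $((base + off)))
echo "$cmd"          # prints: addr2line -fie a.out 0x40073b
```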

Signed-off-by: Török Edwin <edwintorok@gmail.com>
---
 kernel/trace/trace.c |   48 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace.h |    3 ++-
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4c17453..722ab74 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -213,6 +213,7 @@ static const char *trace_options[] = {
 	"sched-tree",
 	"ftrace_printk",
 	"userstacktrace",
+	"sym-userobj",
 	NULL
 };
 
@@ -359,6 +360,20 @@ trace_seq_putmem_hex(struct trace_seq *s, void *mem, size_t len)
 	return trace_seq_putmem(s, hex, j);
 }
 
+static int
+trace_seq_path(struct trace_seq *s, struct path *path)
+{
+	int ret;
+	struct seq_file m;
+	m.count = s->len;
+	m.size = PAGE_SIZE;
+	m.buf = s->buffer;
+	ret = seq_path(&m, path, "\n");
+	if (ret)
+		s->len = m.count;
+	return ret;
+}
+
 static void
 trace_seq_reset(struct trace_seq *s)
 {
@@ -655,6 +670,7 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
 
 	entry->preempt_count		= pc & 0xff;
 	entry->pid			= (tsk) ? tsk->pid : 0;
+	entry->tgid               	= (tsk) ? tsk->tgid : 0;
 	entry->flags =
 		(irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) |
 		((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |
@@ -1179,11 +1195,26 @@ static int
 seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
 		unsigned long sym_flags)
 {
+	struct mm_struct *mm = NULL;
 	int ret = 1;
 	unsigned i;
 
+	if (trace_flags & TRACE_ITER_SYM_USEROBJ) {
+		struct task_struct *task;
+		/* we do the lookup on the thread group leader,
+		 * since individual threads might have already quit! */
+		rcu_read_lock();
+		task = find_task_by_vpid(entry->field.tgid);
+		rcu_read_unlock();
+
+		if (task)
+			mm = get_task_mm(task);
+	}
+
 	for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
 		unsigned long ip = entry->field.stack.caller[i];
+		unsigned long vmstart = 0;
+		struct file *file = NULL;
 
 		if (ip == ULONG_MAX || !ret)
 			break;
@@ -1193,10 +1224,25 @@ seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
 			ret = trace_seq_puts(s, "??");
 			continue;
 		}
-		if (sym_flags & TRACE_ITER_SYM_ADDR)
+		if (mm) {
+			const struct vm_area_struct *vma = find_vma(mm, ip);
+			if (vma) {
+				file = vma->vm_file;
+				vmstart = vma->vm_start;
+			}
+		}
+		if (file) {
+			ret = trace_seq_path(s, &file->f_path);
+			if (ret)
+				ret = trace_seq_printf(s, "[+0x%lx]",
+						ip - vmstart);
+		}
+		if (ret && ((sym_flags & TRACE_ITER_SYM_ADDR) || !file))
 			ret = trace_seq_printf(s, " <" IP_FMT ">", ip);
 	}
 
+	if (mm)
+		mmput(mm);
 	return ret;
 }
 
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 3ad6343..59d3b17 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -414,7 +414,8 @@ enum trace_iterator_flags {
 	TRACE_ITER_STACKTRACE		= 0x100,
 	TRACE_ITER_SCHED_TREE		= 0x200,
 	TRACE_ITER_PRINTK		= 0x400,
-	TRACE_ITER_USERSTACKTRACE       = 0x800
+	TRACE_ITER_USERSTACKTRACE       = 0x800,
+	TRACE_ITER_SYM_USEROBJ          = 0x1000
 };
 
 extern struct tracer nop_trace;
-- 
1.5.6.5


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 3/4] add tracepoints in rwsem
  2008-10-12 13:11 [PATCH 0/4] ftrace: add userspace stacktrace support and semaphore-latency tracer Török Edwin
                   ` (2 preceding siblings ...)
  2008-10-12 13:12 ` [PATCH 2/4] Identify which executable object the userspace address belongs to Török Edwin
@ 2008-10-12 13:12 ` Török Edwin
  2008-10-12 13:12 ` [PATCH 4/4] Implement semaphore latency tracer Török Edwin
  2008-10-12 18:25 ` [PATCH 0/4] ftrace: add userspace stacktrace support and semaphore-latency tracer Steven Rostedt
  5 siblings, 0 replies; 26+ messages in thread
From: Török Edwin @ 2008-10-12 13:12 UTC (permalink / raw)
  To: mingo, srostedt
  Cc: a.p.zijlstra, sandmann, linux-kernel, Török Edwin

This allows a tracing framework such as ftrace to measure the latency
during lock contention and, unlike latencytop, to show who is to blame for holding
the lock for too long.

The tracepoints consist of three events:
- a down_read() or down_write() failed: a tracer can use this as a point of reference
- the semaphore is granted after a failed down_read() or down_write(): usable to measure latency
- one writer, or all readers, are woken: a tracer can see who was/is holding the lock,
 and blame it for the latencies of all reader/writer tasks on the waiter list

Signed-off-by: Török Edwin <edwintorok@gmail.com>
---
 include/trace/rwsem.h |   35 +++++++++++++++++++++++++++++++++++
 lib/rwsem-spinlock.c  |   16 ++++++++--------
 lib/rwsem.c           |   13 +++++--------
 3 files changed, 48 insertions(+), 16 deletions(-)
 create mode 100644 include/trace/rwsem.h

diff --git a/include/trace/rwsem.h b/include/trace/rwsem.h
new file mode 100644
index 0000000..8afb773
--- /dev/null
+++ b/include/trace/rwsem.h
@@ -0,0 +1,35 @@
+#ifndef _TRACE_RWSEM_H
+#define _TRACE_RWSEM_H
+
+#include <linux/tracepoint.h>
+
+struct rwsem_waiter {
+	struct list_head list;
+	struct task_struct *task;
+	unsigned int flags;
+#define RWSEM_WAITING_FOR_READ	0x00000001
+#define RWSEM_WAITING_FOR_WRITE	0x00000002
+};
+
+/* some waiters from the waiter list are woken,
+ * some remain sleeping */
+DEFINE_TRACE(rwsem_wake,
+		TPPROTO(const struct rw_semaphore *sem,
+			const struct rwsem_waiter *waiter),
+		TPARGS(sem, waiter));
+
+/* a down_read() failed because a writer already has the lock,
+ * or a down_write() failed because another reader or writer
+ * already has the lock.
+ * tsk indicates this task, and waiter->list contains
+ * all other tasks that wait for this lock */
+DEFINE_TRACE(rwsem_lock_failed,
+		TPPROTO(const struct rw_semaphore *sem),
+		TPARGS(sem));
+
+/* the tsk was granted the lock, after previously failing to
+ * acquire the lock due to contention */
+DEFINE_TRACE(rwsem_lock_ok,
+		TPPROTO(const struct task_struct *tsk),
+		TPARGS(tsk));
+#endif
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
index 9df3ca5..9e32ce4 100644
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -8,14 +8,7 @@
 #include <linux/rwsem.h>
 #include <linux/sched.h>
 #include <linux/module.h>
-
-struct rwsem_waiter {
-	struct list_head list;
-	struct task_struct *task;
-	unsigned int flags;
-#define RWSEM_WAITING_FOR_READ	0x00000001
-#define RWSEM_WAITING_FOR_WRITE	0x00000002
-};
+#include <trace/rwsem.h>
 
 /*
  * initialise the semaphore
@@ -69,6 +62,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
 		list_del(&waiter->list);
 		tsk = waiter->task;
 		/* Don't touch waiter after ->task has been NULLed */
+		trace_rwsem_wake(sem, waiter);
 		smp_mb();
 		waiter->task = NULL;
 		wake_up_process(tsk);
@@ -78,6 +72,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
 
 	/* grant an infinite number of read locks to the front of the queue */
  dont_wake_writers:
+	trace_rwsem_wake(sem, waiter);
 	woken = 0;
 	while (waiter->flags & RWSEM_WAITING_FOR_READ) {
 		struct list_head *next = waiter->list.next;
@@ -115,6 +110,7 @@ __rwsem_wake_one_writer(struct rw_semaphore *sem)
 	list_del(&waiter->list);
 
 	tsk = waiter->task;
+	trace_rwsem_wake(sem, waiter);
 	smp_mb();
 	waiter->task = NULL;
 	wake_up_process(tsk);
@@ -152,6 +148,7 @@ void __sched __down_read(struct rw_semaphore *sem)
 	/* we don't need to touch the semaphore struct anymore */
 	spin_unlock_irq(&sem->wait_lock);
 
+	trace_rwsem_lock_failed(sem);
 	/* wait to be given the lock */
 	for (;;) {
 		if (!waiter.task)
@@ -160,6 +157,7 @@ void __sched __down_read(struct rw_semaphore *sem)
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
 
+	trace_rwsem_lock_ok(tsk);
 	tsk->state = TASK_RUNNING;
  out:
 	;
@@ -218,6 +216,7 @@ void __sched __down_write_nested(struct rw_semaphore *sem, int subclass)
 	/* we don't need to touch the semaphore struct anymore */
 	spin_unlock_irq(&sem->wait_lock);
 
+	trace_rwsem_lock_failed(sem);
 	/* wait to be given the lock */
 	for (;;) {
 		if (!waiter.task)
@@ -226,6 +225,7 @@ void __sched __down_write_nested(struct rw_semaphore *sem, int subclass)
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
 
+	trace_rwsem_lock_ok(tsk);
 	tsk->state = TASK_RUNNING;
  out:
 	;
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 3e3365e..a9c5571 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -7,6 +7,7 @@
 #include <linux/sched.h>
 #include <linux/init.h>
 #include <linux/module.h>
+#include <trace/rwsem.h>
 
 /*
  * Initialize an rwsem:
@@ -28,14 +29,6 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 
 EXPORT_SYMBOL(__init_rwsem);
 
-struct rwsem_waiter {
-	struct list_head list;
-	struct task_struct *task;
-	unsigned int flags;
-#define RWSEM_WAITING_FOR_READ	0x00000001
-#define RWSEM_WAITING_FOR_WRITE	0x00000002
-};
-
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then:
@@ -81,6 +74,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	 */
 	list_del(&waiter->list);
 	tsk = waiter->task;
+	trace_rwsem_wake(sem, waiter);
 	smp_mb();
 	waiter->task = NULL;
 	wake_up_process(tsk);
@@ -99,6 +93,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 	 *   readers before waking any processes up
 	 */
  readers_only:
+	trace_rwsem_wake(sem, waiter);
 	woken = 0;
 	do {
 		woken++;
@@ -171,6 +166,7 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 
 	spin_unlock_irq(&sem->wait_lock);
 
+	trace_rwsem_lock_failed(sem);
 	/* wait to be given the lock */
 	for (;;) {
 		if (!waiter->task)
@@ -179,6 +175,7 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
 
+	trace_rwsem_lock_ok(tsk);
 	tsk->state = TASK_RUNNING;
 
 	return sem;
-- 
1.5.6.5


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 4/4] Implement semaphore latency tracer
  2008-10-12 13:11 [PATCH 0/4] ftrace: add userspace stacktrace support and semaphore-latency tracer Török Edwin
                   ` (3 preceding siblings ...)
  2008-10-12 13:12 ` [PATCH 3/4] add tracepoints in rwsem Török Edwin
@ 2008-10-12 13:12 ` Török Edwin
  2008-10-12 19:13   ` Peter Zijlstra
  2008-10-12 18:25 ` [PATCH 0/4] ftrace: add userspace stacktrace support and semaphore-latency tracer Steven Rostedt
  5 siblings, 1 reply; 26+ messages in thread
From: Török Edwin @ 2008-10-12 13:12 UTC (permalink / raw)
  To: mingo, srostedt
  Cc: a.p.zijlstra, sandmann, linux-kernel, Török Edwin

Each time a down_read or down_write fails, a unique latency ID is generated.
Later, when someone releases the semaphore, it is blamed for the latency of all
tasks on the wait_list of the semaphore.
If you group the output of latency_trace by latency_id, you get all the tasks
that were contending on a lock, together with the tasks that were holding it.
An entry in latency_trace has the format:
(latency_id) [semaphore_id] read|write
stacktrace <= stacktrace
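The grouping step can be sketched with standard tools; the trace lines below are
hypothetical examples in the format just described (real latency_trace output also
interleaves the stacktraces, which would be carried along with their "(id)" line):

```shell
# Group entries sharing a latency ID; a stable sort on the "(id)"
# prefix brings all tasks involved in one latency instance together.
grouped=$(printf '%s\n' \
    '(2) [ffff88003f61c0a8] write' \
    '(1) [ffff88003f61c0a8] read' \
    '(1) [ffff88003f61c0a8] write' |
  sort -s -t ')' -k 1,1)
echo "$grouped"
# prints the two "(1)" lines (in input order) followed by the "(2)" line
```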

Signed-off-by: Török Edwin <edwintorok@gmail.com>
---
 kernel/trace/Kconfig        |   12 +++
 kernel/trace/Makefile       |    1 +
 kernel/trace/trace.c        |  105 +++++++++++++++++++++----
 kernel/trace/trace.h        |   32 ++++++++
 kernel/trace/trace_semlat.c |  186 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 321 insertions(+), 15 deletions(-)
 create mode 100644 kernel/trace/trace_semlat.c

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 396aea1..8faac14 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -144,6 +144,18 @@ config STACK_TRACER
 	  This tracer records the max stack of the kernel, and displays
 	  it in debugfs/tracing/stack_trace
 
+config SEMLAT_TRACER
+	bool "Semaphore latency tracer"
+	depends on HAVE_FTRACE
+	depends on DEBUG_KERNEL
+	select TRACING
+	select TRACEPOINTS
+	select TRACER_MAX_TRACE
+	help
+	  This tracer records the reason for semaphore contention, showing
+	  who is holding a lock for too long and causing other tasks to contend
+	  on the semaphore. Currently it instruments rwsem and rwsem-spinlock with tracepoints.
+
 config DYNAMIC_FTRACE
 	bool "enable/disable ftrace tracepoints dynamically"
 	depends on FTRACE
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index a85dfba..d037df3 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -24,5 +24,6 @@ obj-$(CONFIG_NOP_TRACER) += trace_nop.o
 obj-$(CONFIG_STACK_TRACER) += trace_stack.o
 obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
 obj-$(CONFIG_BOOT_TRACER) += trace_boot.o
+obj-$(CONFIG_SEMLAT_TRACER) += trace_semlat.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 722ab74..bd9145a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -752,26 +752,42 @@ void __trace_stack(struct trace_array *tr,
 	ftrace_trace_stack(tr, data, flags, skip, preempt_count());
 }
 
-static void __trace_userstack(struct trace_array *tr,
+static void ftrace_trace_userstack(struct trace_array *tr,
 		   struct trace_array_cpu *data,
-		   unsigned long flags)
+		   unsigned long flags, int pc)
 {
-	struct trace_entry *entry;
+	struct userstack_entry *entry;
 	struct stack_trace trace;
+	struct ring_buffer_event *event;
+	unsigned long irq_flags;
 
 	if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
 		return;
 
-	entry			= tracing_get_trace_entry(tr, data);
-	tracing_generic_entry_update(entry, flags);
-	entry->type		= TRACE_USER_STACK;
+	event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry),
+					 &irq_flags);
+	if (!event)
+		return;
+	entry	= ring_buffer_event_data(event);
+	tracing_generic_entry_update(&entry->ent, flags, pc);
+	entry->ent.type		= TRACE_USER_STACK;
+
+	memset(&entry->caller, 0, sizeof(entry->caller));
 
-	memset(&entry->field.stack, 0, sizeof(entry->field.stack));
 	trace.nr_entries	= 0;
 	trace.max_entries	= FTRACE_STACK_ENTRIES;
 	trace.skip		= 0;
-	trace.entries		= entry->field.stack.caller;
+	trace.entries		= entry->caller;
+
 	save_stack_trace_user(&trace);
+	ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
+}
+
+void __trace_userstack(struct trace_array *tr,
+		   struct trace_array_cpu *data,
+		   unsigned long flags)
+{
+	ftrace_trace_userstack(tr, data, flags, preempt_count());
 }
 
 static void
@@ -810,6 +826,36 @@ __trace_special(void *__tr, void *__data,
 }
 
 void
+trace_semlat(struct trace_array *tr, struct trace_array_cpu *data,
+		const struct semlat_data *sdata, int samestack)
+{
+	struct ring_buffer_event *event;
+	struct semlat_entry *entry;
+	unsigned long irq_flags;
+	int pc = preempt_count();
+
+	event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry),
+					 &irq_flags);
+	if (!event)
+		return;
+	entry	= ring_buffer_event_data(event);
+	tracing_generic_entry_update(&entry->ent, 0, pc);
+	entry->ent.type			= TRACE_SEM;
+	entry->data 			= *sdata;
+	entry->data.pid			= current->pid;
+	ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
+	if (!samestack) {
+		/* show a stack trace only if different from  previous */
+		__trace_stack(tr, data, irq_flags, 3);
+		__trace_userstack(tr, data, irq_flags);
+	}
+	ftrace_trace_stack(tr, data, irq_flags, 4, pc);
+	__trace_userstack(tr, data, irq_flags);
+
+	trace_wake_up();
+}
+
+void
 tracing_sched_switch_trace(struct trace_array *tr,
 			   struct trace_array_cpu *data,
 			   struct task_struct *prev,
@@ -1192,7 +1238,7 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
 }
 
 static int
-seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
+seq_print_userip_objs(const struct userstack_entry *entry, struct trace_seq *s,
 		unsigned long sym_flags)
 {
 	struct mm_struct *mm = NULL;
@@ -1204,7 +1250,7 @@ seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
 		/* we do the lookup on the thread group leader,
 		 * since individual threads might have already quit! */
 		rcu_read_lock();
-		task = find_task_by_vpid(entry->field.tgid);
+		task = find_task_by_vpid(entry->ent.tgid);
 		rcu_read_unlock();
 
 		if (task)
@@ -1212,7 +1258,7 @@ seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
 	}
 
 	for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
-		unsigned long ip = entry->field.stack.caller[i];
+		unsigned long ip = entry->caller[i];
 		unsigned long vmstart = 0;
 		struct file *file = NULL;
 
@@ -1532,10 +1578,34 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
 			trace_seq_print_cont(s, iter);
 		break;
 	}
-	case TRACE_USER_STACK:
-		seq_print_userip_objs(entry, s, sym_flags);
+	case TRACE_USER_STACK: {
+		struct userstack_entry *field;
+
+		trace_assign_type(field, entry);
+
+		seq_print_userip_objs(field, s, sym_flags);
 		trace_seq_putc(s, '\n');
 		break;
+	}
+	case TRACE_SEM: {
+		struct semlat_entry *field;
+
+		trace_assign_type(field, entry);
+
+		if (field->data.pid == field->ent.pid) {
+			/* task that suffered the latency */
+			comm = trace_find_cmdline(field->data.pid);
+			trace_seq_printf(s, " (%u) %s["IP_FMT"] %s %s\n",
+				field->data.latency_id,
+				field->data.name ? field->data.name : "",
+				field->data.semaphore,
+				field->data.kind_read ? "read" : "write",
+				comm);
+		} else
+			trace_seq_printf(s, " (%u)\n",
+					 field->data.latency_id);
+		break;
+	}
 	default:
 		trace_seq_printf(s, "Unknown type %d\n", entry->type);
 	}
@@ -1671,8 +1741,13 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
 		if (entry->flags & TRACE_FLAG_CONT)
 			trace_seq_print_cont(s, iter);
 		break;
-	case TRACE_USER_STACK:
-		ret = seq_print_userip_objs(entry, s, sym_flags);
+	}
+	case TRACE_USER_STACK: {
+		struct userstack_entry *field;
+
+		trace_assign_type(field, entry);
+
+		ret = seq_print_userip_objs(field, s, sym_flags);
 		if (!ret)
 			return 0;
 		ret = trace_seq_putc(s, '\n');
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 59d3b17..d734561 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -23,6 +23,7 @@ enum trace_type {
 	TRACE_MMIO_MAP,
 	TRACE_BOOT,
 	TRACE_USER_STACK,
+	TRACE_SEM,
 
 	__TRACE_LAST_TYPE
 };
@@ -39,6 +40,7 @@ struct trace_entry {
 	unsigned char		flags;
 	unsigned char		preempt_count;
 	int			pid;
+	int			tgid;
 };
 
 /*
@@ -86,6 +88,11 @@ struct stack_entry {
 	unsigned long		caller[FTRACE_STACK_ENTRIES];
 };
 
+struct userstack_entry {
+	struct trace_entry	ent;
+	unsigned long		caller[FTRACE_STACK_ENTRIES];
+};
+
 /*
  * ftrace_printk entry:
  */
@@ -119,6 +126,25 @@ struct trace_boot {
 };
 
 /*
+ * semaphore-latency entry
+ */
+
+struct semlat_data {
+	const void	*semaphore;
+	const char	*name;
+	unsigned int	latency_id;	/* unique id for this latency instance */
+	unsigned int	kind_read;
+	unsigned int	pid;		/* pid of task that suffered the latency */
+	cycle_t		t;		/* start of latency - lock contention */
+	int		cpu;
+};
+
+struct semlat_entry {
+	struct trace_entry	ent;
+	struct semlat_data      data;
+};
+
+/*
  * trace_flag_type is an enumeration that holds different
  * states when a trace occurs. These are:
  *  IRQS_OFF	- interrupts were disabled
@@ -211,6 +237,7 @@ extern void __ftrace_bad_type(void);
 		IF_ASSIGN(var, ent, struct ctx_switch_entry, 0);	\
 		IF_ASSIGN(var, ent, struct trace_field_cont, TRACE_CONT); \
 		IF_ASSIGN(var, ent, struct stack_entry, TRACE_STACK);	\
+		IF_ASSIGN(var, ent, struct userstack_entry, TRACE_USER_STACK);\
 		IF_ASSIGN(var, ent, struct print_entry, TRACE_PRINT);	\
 		IF_ASSIGN(var, ent, struct special_entry, 0);		\
 		IF_ASSIGN(var, ent, struct trace_mmiotrace_rw,		\
@@ -218,6 +245,7 @@ extern void __ftrace_bad_type(void);
 		IF_ASSIGN(var, ent, struct trace_mmiotrace_map,		\
 			  TRACE_MMIO_MAP);				\
 		IF_ASSIGN(var, ent, struct trace_boot, TRACE_BOOT);	\
+		IF_ASSIGN(var, ent, struct semlat_entry, TRACE_SEM);	\
 		__ftrace_bad_type();					\
 	} while (0)
 
@@ -314,6 +342,10 @@ void trace_special(struct trace_array *tr,
 		   unsigned long arg1,
 		   unsigned long arg2,
 		   unsigned long arg3, int pc);
+
+void trace_semlat(struct trace_array *tr, struct trace_array_cpu *data,
+		  const struct semlat_data *sdata, int samestack);
+
 void trace_function(struct trace_array *tr,
 		    struct trace_array_cpu *data,
 		    unsigned long ip,
diff --git a/kernel/trace/trace_semlat.c b/kernel/trace/trace_semlat.c
new file mode 100644
index 0000000..d72e3e3
--- /dev/null
+++ b/kernel/trace/trace_semlat.c
@@ -0,0 +1,186 @@
+/*
+ * Semaphore latency tracer
+ *
+ * Copyright (C) 2008 Török Edwin <edwintorok@gmail.com>
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/debugfs.h>
+#include <linux/ftrace.h>
+#include <trace/rwsem.h>
+#include "trace.h"
+
+static struct trace_array	*ctx_trace;
+static int __read_mostly	tracer_enabled;
+
+static struct semlat_data map_pid_to_latency[PID_MAX_DEFAULT+1];
+
+static inline int report_latency(unsigned long delta)
+{
+	if (tracing_thresh) {
+		if (delta < tracing_thresh)
+			return 0;
+	} else {
+		if (delta <= tracing_max_latency)
+			return 0;
+	}
+	return 1;
+}
+
+static inline void probe_rwsem_wake(const struct rw_semaphore *sem,
+				    const struct rwsem_waiter *waiter)
+{
+	/* first task in waitlist is oldest, so calculate delta against it */
+	const struct semlat_data *data;
+	struct trace_array_cpu *cpudata;
+	unsigned long T0, T1, delta;
+	int cpu, first;
+
+	if (!tracer_enabled)
+		return;
+	data = &map_pid_to_latency[waiter->task->pid];
+
+	if (data->cpu == -1)
+		return;
+	T0 = data->t;
+	T1 = ftrace_now(data->cpu);
+	delta = T1 - T0;
+
+	if (!report_latency(delta))
+		return;
+	cpu = raw_smp_processor_id();
+	cpudata = ctx_trace->data[cpu];
+	/*
+	 * Blame the current task for the latency of all tasks on the wait
+	 * list. Note that some of them are being woken now, but that does
+	 * not matter: we want to blame this task for their latency too.
+	 */
+	if (list_empty(&waiter->list)) {
+		/* waiter is not part of wait_list, so process it here */
+		trace_semlat(ctx_trace, cpudata, data, 1);
+		first = 0;
+	} else {
+		first = 1;
+	}
+	list_for_each_entry(waiter, &sem->wait_list, list) {
+		data = &map_pid_to_latency[waiter->task->pid];
+		if (data->cpu != -1)
+			trace_semlat(ctx_trace, cpudata, data, first);
+		first = 0;
+	}
+}
+
+static atomic_t unique_latency_id = ATOMIC_INIT(0);
+
+static void probe_rwsem_lock_failed(const struct rw_semaphore *sem)
+{
+	const struct rwsem_waiter *waiter;
+	struct semlat_data *data;
+	int cpu = raw_smp_processor_id();
+
+	if (!tracer_enabled)
+		return;
+	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
+	data = &map_pid_to_latency[waiter->task->pid];
+	data->semaphore = sem;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	data->name = sem->dep_map.name;
+#else
+	data->name = NULL;
+#endif
+
+	data->latency_id = atomic_inc_return(&unique_latency_id);
+	data->kind_read = waiter->flags == RWSEM_WAITING_FOR_READ;
+	data->t = ftrace_now(cpu);
+	data->cpu = cpu;
+}
+
+static void probe_rwsem_lock_ok(const struct task_struct *tsk)
+{
+	struct semlat_data *data = &map_pid_to_latency[tsk->pid];
+	struct trace_array_cpu *cpudata;
+	int cpu;
+
+	if (!tracer_enabled)
+		return;
+	cpu = raw_smp_processor_id();
+	cpudata = ctx_trace->data[cpu];
+	trace_semlat(ctx_trace, cpudata, data, 1);
+	/* invalidate entry */
+	data->cpu = -1;
+}
+
+static void start_semlat_trace(struct trace_array *tr)
+{
+	int ret;
+	ret = register_trace_rwsem_wake(probe_rwsem_wake);
+	WARN_ON(ret);
+
+	ret = register_trace_rwsem_lock_failed(probe_rwsem_lock_failed);
+	WARN_ON(ret);
+
+	ret = register_trace_rwsem_lock_ok(probe_rwsem_lock_ok);
+	WARN_ON(ret);
+	printk(KERN_INFO "registered semlat\n");
+	tracer_enabled = 1;
+}
+
+static void stop_semlat_trace(struct trace_array *tr)
+{
+	tracer_enabled = 0;
+	unregister_trace_rwsem_wake(probe_rwsem_wake);
+	unregister_trace_rwsem_lock_failed(probe_rwsem_lock_failed);
+	unregister_trace_rwsem_lock_ok(probe_rwsem_lock_ok);
+	printk(KERN_INFO "unregistered semlat\n");
+}
+
+static void semlat_trace_init(struct trace_array *tr)
+{
+	unsigned int i;
+	int cpu;
+
+	ctx_trace = tr;
+
+	atomic_set(&unique_latency_id, 0);
+	for (i = 0; i <= PID_MAX_DEFAULT; i++)
+		map_pid_to_latency[i].cpu = -1;
+
+	for_each_online_cpu(cpu)
+		tracing_reset(tr, cpu);
+
+	if (tr->ctrl)
+		start_semlat_trace(tr);
+}
+
+static void semlat_trace_reset(struct trace_array *tr)
+{
+	if (tr->ctrl)
+		stop_semlat_trace(tr);
+}
+
+static void semlat_trace_ctrl_update(struct trace_array *tr)
+{
+	/* When starting a new trace, reset the buffers */
+	if (tr->ctrl)
+		start_semlat_trace(tr);
+	else
+		stop_semlat_trace(tr);
+}
+
+static struct tracer semlat_trace __read_mostly =
+{
+	.name		= "semlat",
+	.init		= semlat_trace_init,
+	.reset		= semlat_trace_reset,
+	.ctrl_update	= semlat_trace_ctrl_update,
+#ifdef CONFIG_FTRACE_SELFTEST
+/*	.selftest	= trace_selftest_startup_semlat,*/
+#endif
+};
+
+__init static int init_semlat_trace(void)
+{
+	return register_tracer(&semlat_trace);
+}
+device_initcall(init_semlat_trace);
-- 
1.5.6.5


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl
  2008-10-12 13:12 ` [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl Török Edwin
@ 2008-10-12 13:31   ` Frédéric Weisbecker
  2008-10-12 13:53     ` Török Edwin
  2008-10-26  4:05   ` Frédéric Weisbecker
  2008-10-27 16:03   ` Ingo Molnar
  2 siblings, 1 reply; 26+ messages in thread
From: Frédéric Weisbecker @ 2008-10-12 13:31 UTC (permalink / raw)
  To: Török Edwin
  Cc: mingo, srostedt, a.p.zijlstra, sandmann, linux-kernel

2008/10/12 Török Edwin <edwintorok@gmail.com>:
> +static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
> +{
> +       int ret;
> +
> +       if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
> +               return 0;
> +
> +       ret = 1;
> +       pagefault_disable();
> +       if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
> +               ret = 0;
> +       pagefault_enable();
> +
> +       return ret;
> +}
> +

Hello Török,

I don't understand why you are using pagefault disabling and an
atomic copy here. Wouldn't a simple copy_from_user take care of
this? You don't seem to be in atomic context.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl
  2008-10-12 13:31   ` Frédéric Weisbecker
@ 2008-10-12 13:53     ` Török Edwin
  2008-10-13  8:02       ` Frédéric Weisbecker
  0 siblings, 1 reply; 26+ messages in thread
From: Török Edwin @ 2008-10-12 13:53 UTC (permalink / raw)
  To: Frédéric Weisbecker
  Cc: mingo, srostedt, a.p.zijlstra, sandmann, linux-kernel

On 2008-10-12 16:31, Frédéric Weisbecker wrote:
> 2008/10/12 Török Edwin <edwintorok@gmail.com>:
>   
>> +static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
>> +{
>> +       int ret;
>> +
>> +       if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
>> +               return 0;
>> +
>> +       ret = 1;
>> +       pagefault_disable();
>> +       if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
>> +               ret = 0;
>> +       pagefault_enable();
>> +
>> +       return ret;
>> +}
>> +
>>     
>
> Hello Török,
>
> I don't understand why you are using pagefault disabling and an
> atomic copy here. Wouldn't a simple copy_from_user take care of
> this? You don't seem to be in atomic context.
>   

Hello,

This function is called from ftrace, so I could be holding a
spinlock. It can also be called from __rwsem_do_wake, which takes a
spinlock. This code is based on sysprof's copy_stack_frame; maybe
pagefault_disable() is not needed.

Best regards,
--Edwin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] ftrace: add userspace stacktrace support and semaphore-latency tracer
  2008-10-12 13:11 [PATCH 0/4] ftrace: add userspace stacktrace support and semaphore-latency tracer Török Edwin
                   ` (4 preceding siblings ...)
  2008-10-12 13:12 ` [PATCH 4/4] Implement semaphore latency tracer Török Edwin
@ 2008-10-12 18:25 ` Steven Rostedt
  5 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2008-10-12 18:25 UTC (permalink / raw)
  To: Török Edwin; +Cc: mingo, a.p.zijlstra, sandmann, linux-kernel

Török Edwin wrote:
> Hi,
>
> This patch series is a draft implementation of a semaphore-latency tracer.
> I would appreciate feedback on these patches so that they can be merged into
> mainline.
>   
Hi!

You caught me at a bad time. I'm just about to leave for NYC for the 
Linux End Users conference. I'll be back on Wednesday. Please send any 
new emails to rostedt@goodmis.org. I will not be checking my Red Hat 
account when I'm away.

I'll try to look at the patches this week, but I can't promise anything. 
I'll forward your patches to my goodmis account, and if I have time, 
I'll look at them.

-- Steve



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] Implement semaphore latency tracer
  2008-10-12 13:12 ` [PATCH 4/4] Implement semaphore latency tracer Török Edwin
@ 2008-10-12 19:13   ` Peter Zijlstra
  2008-10-12 20:10     ` Török Edwin
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-10-12 19:13 UTC (permalink / raw)
  To: Török Edwin
  Cc: mingo, sandmann, linux-kernel, Steven Rostedt, Arjan van de Ven

On Sun, 2008-10-12 at 16:12 +0300, Török Edwin wrote:
> Each time a down_read or down_write fails, a unique latency id is generated.
> Later when someone releases the semaphore, it is blamed for the latency of all
> tasks on the wait_list of the semaphore.
> If you would group the output from latency_trace by the latency_id you get all those
> who were contending on a lock, and the tasks that were holding the lock.
> An entry in latency_trace has the format:
> (latency_id) [semaphore_id] read|write
> stacktrace <= stacktrace

What can this tracer do that latencytop cannot already do?



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] Implement semaphore latency tracer
  2008-10-12 19:13   ` Peter Zijlstra
@ 2008-10-12 20:10     ` Török Edwin
  2008-10-22 15:28       ` Ingo Molnar
  0 siblings, 1 reply; 26+ messages in thread
From: Török Edwin @ 2008-10-12 20:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, sandmann, linux-kernel, Steven Rostedt, Arjan van de Ven

On 2008-10-12 22:13, Peter Zijlstra wrote:
> On Sun, 2008-10-12 at 16:12 +0300, Török Edwin wrote:
>   
>> Each time a down_read or down_write fails, a unique latency id is generated.
>> Later when someone releases the semaphore, it is blamed for the latency of all
>> tasks on the wait_list of the semaphore.
>> If you would group the output from latency_trace by the latency_id you get all those
>> who were contending on a lock, and the tasks that were holding the lock.
>> An entry in latency_trace has the format:
>> (latency_id) [semaphore_id] read|write
>> stacktrace <= stacktrace
>>     
>
> What can this tracer do that latencytop cannot already do?

Latencytop can show latencies in down_read or down_write (and is very
useful at doing that), but it cannot show who else was holding the
semaphore, i.e. the true cause of the latency.

Consider: process A holds a semaphore for reading, and process B tries
to acquire it for writing and fails. Latencytop shows the latency in
process B, but doesn't show anything about process A.

The semlat tracer does something closer to lockstat, but doesn't need
lockdep, and it adds tracepoints on the slowpath only (lock failed,
wakeup).

Best regards,
--Edwin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl
  2008-10-12 13:53     ` Török Edwin
@ 2008-10-13  8:02       ` Frédéric Weisbecker
  0 siblings, 0 replies; 26+ messages in thread
From: Frédéric Weisbecker @ 2008-10-13  8:02 UTC (permalink / raw)
  To: Török Edwin
  Cc: mingo, srostedt, a.p.zijlstra, sandmann, linux-kernel

2008/10/12 Török Edwin <edwintorok@gmail.com>:
> Hello,
>
> This function is called from ftrace, as such I could be holding a
> spinlock. It can also be called from
> __rwsem_do_wake, which takes a spinlock.
> This code is from sysprof's copy_stack_frame. Maybe pagefault_disable()
> is not needed.
>
> Best regards,
> --Edwin

Ok, I didn't see the context that way.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] Implement semaphore latency tracer
  2008-10-12 20:10     ` Török Edwin
@ 2008-10-22 15:28       ` Ingo Molnar
  2008-10-22 15:41         ` Török Edwin
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2008-10-22 15:28 UTC (permalink / raw)
  To: Török Edwin
  Cc: Peter Zijlstra, sandmann, linux-kernel, Steven Rostedt, Arjan van de Ven


* Török Edwin <edwintorok@gmail.com> wrote:

> On 2008-10-12 22:13, Peter Zijlstra wrote:
> > On Sun, 2008-10-12 at 16:12 +0300, Török Edwin wrote:
> >   
> >> Each time a down_read or down_write fails, a unique latency id is generated.
> >> Later when someone releases the semaphore, it is blamed for the latency of all
> >> tasks on the wait_list of the semaphore.
> >> If you would group the output from latency_trace by the latency_id you get all those
> >> who were contending on a lock, and the tasks that were holding the lock.
> >> An entry in latency_trace has the format:
> >> (latency_id) [semaphore_id] read|write
> >> stacktrace <= stacktrace
> >>     
> >
> > What can this tracer do that latencytop cannot already do?
> 
> Latencytop can show latencies in down_read or down_write (and is very
> useful at doing that), but it cannot show who else was holding the
> semaphore,
> i.e. the true cause of the latency.
> 
> Consider:
> process A holds a semaphore for reading, process B tries to acquire it
> for writing and fails. Latencytop shows the latency in process B, but
> doesn't
> show anything about process A.
> 
> The semlat tracer is doing something more similar to lockstat, but
> doesn't need lockdep, and it adds tracepoints on the slowpath only (lock
> failed, wakeup).

hm, but the most common synchronization primitives are mutexes - and
those are not covered by your patchset.

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] Implement semaphore latency tracer
  2008-10-22 15:28       ` Ingo Molnar
@ 2008-10-22 15:41         ` Török Edwin
  2008-10-22 15:48           ` Ingo Molnar
  0 siblings, 1 reply; 26+ messages in thread
From: Török Edwin @ 2008-10-22 15:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, sandmann, linux-kernel, Steven Rostedt, Arjan van de Ven

On 2008-10-22 18:28, Ingo Molnar wrote:
> hm, but the most common synchronization primitive are mutexes - and 
> those are not covered by your patchset.
>   

Indeed. I've seen a patch from Jason Baron to introduce tracepoints
for mutexes, but the conclusion was that the tracepoints should be in
lockstat instead.

And if lockstat is enabled, Peter Zijlstra's 'contend with points'
patch seems to do exactly what I want.

However, I think it would be useful to have a (tracepoint-based?)
latency tracker that can be enabled/disabled at runtime and that
doesn't add any data to the mutex/semaphore structures. My patchset
was a first attempt towards that, but it seems that such use of
tracepoints is not welcome at this time?

Please tell me if I should continue working on this, or if my patches
are designed in entirely the wrong way.

Best regards,
--Edwin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] Implement semaphore latency tracer
  2008-10-22 15:41         ` Török Edwin
@ 2008-10-22 15:48           ` Ingo Molnar
  2008-10-22 17:22             ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2008-10-22 15:48 UTC (permalink / raw)
  To: Török Edwin
  Cc: Peter Zijlstra, sandmann, linux-kernel, Steven Rostedt, Arjan van de Ven


* Török Edwin <edwintorok@gmail.com> wrote:

> On 2008-10-22 18:28, Ingo Molnar wrote:
> > hm, but the most common synchronization primitive are mutexes - and 
> > those are not covered by your patchset.
> >   
> 
> Indeed. I've seen a patch from Jason Baron to introduce tracepoints 
> for mutexes, but the conclusion was that the tracepoints should be in 
> lockstat instead.
> 
> And if lockstat is enabled Peter Zijlstra's 'contend with points' 
> patch seems to do exactly what I want to.
> 
> However I think it would be useful to have (a tracepoints based?) 
> latency tracker, which can be enabled/disabled at runtime, and which 
> doesn't add any data to the mutex/semaphore structures. My patchset 
> was a first attempt towards that, but it seems that such use of 
> tracepoints is not welcome at this time?
> 
> Please tell me if I should continue working on this, or if my
> patches are designed in entirely the wrong way.

i think if you hook into Peter's lockstat APIs that should give us a 
pretty good tracer, with no ugliness introduced. That would be rather 
interesting. Peter, do you concur?

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] Implement semaphore latency tracer
  2008-10-22 15:48           ` Ingo Molnar
@ 2008-10-22 17:22             ` Peter Zijlstra
  2008-10-22 17:25               ` Török Edwin
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-10-22 17:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Török Edwin, sandmann, linux-kernel, Steven Rostedt,
	Arjan van de Ven

On Wed, 2008-10-22 at 17:48 +0200, Ingo Molnar wrote:
> * Török Edwin <edwintorok@gmail.com> wrote:
> 
> > On 2008-10-22 18:28, Ingo Molnar wrote:
> > > hm, but the most common synchronization primitive are mutexes - and 
> > > those are not covered by your patchset.
> > >   
> > 
> > Indeed. I've seen a patch from Jason Baron to introduce tracepoints 
> > for mutexes, but the conclusion was that the tracepoints should be in 
> > lockstat instead.
> > 
> > And if lockstat is enabled Peter Zijlstra's 'contend with points' 
> > patch seems to do exactly what I want to.
> > 
> > However I think it would be useful to have (a tracepoints based?) 
> > latency tracker, which can be enabled/disabled at runtime, and which 
> > doesn't add any data to the mutex/semaphore structures. My patchset 
> > was a first attempt towards that, but it seems that such use of 
> > tracepoints is not welcome at this time?
> > 
> > Please tell me if I should continue working on this, or if my
> > patches are designed in entirely the wrong way.
> 
> i think if you hook into Peter's lockstat APIs that should give us a 
> pretty good tracer, with no ugliness introduced. That would be rather 
> interesting. Peter, do you concur?

Yes, I've already suggested this. Use the exact same hooks that
lockdep/lockstat use.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] Implement semaphore latency tracer
  2008-10-22 17:22             ` Peter Zijlstra
@ 2008-10-22 17:25               ` Török Edwin
  0 siblings, 0 replies; 26+ messages in thread
From: Török Edwin @ 2008-10-22 17:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, sandmann, linux-kernel, Steven Rostedt, Arjan van de Ven

On 2008-10-22 20:22, Peter Zijlstra wrote:
> On Wed, 2008-10-22 at 17:48 +0200, Ingo Molnar wrote:
>   
>> * Török Edwin <edwintorok@gmail.com> wrote:
>>
>>     
>>> On 2008-10-22 18:28, Ingo Molnar wrote:
>>>       
>>>> hm, but the most common synchronization primitive are mutexes - and 
>>>> those are not covered by your patchset.
>>>>   
>>>>         
>>> Indeed. I've seen a patch from Jason Baron to introduce tracepoints 
>>> for mutexes, but the conclusion was that the tracepoints should be in 
>>> lockstat instead.
>>>
>>> And if lockstat is enabled Peter Zijlstra's 'contend with points' 
>>> patch seems to do exactly what I want to.
>>>
>>> However I think it would be useful to have (a tracepoints based?) 
>>> latency tracker, which can be enabled/disabled at runtime, and which 
>>> doesn't add any data to the mutex/semaphore structures. My patchset 
>>> was a first attempt towards that, but it seems that such use of 
>>> tracepoints is not welcome at this time?
>>>
>>> Please tell me if I should continue working on this, or if my
>>> patches are designed in entirely the wrong way.
>>>       
>> i think if you hook into Peter's lockstat APIs that should give us a 
>> pretty good tracer, with no ugliness introduced. That would be rather 
>> interesting. Peter, do you concur?
>>     
>
> Yes, I've already suggested this. Use the exact same hooks that
> lockdep/lockstat use.

Ok, I'll work on this when I get some time :)
[hopefully test the 'contend with points' patch this weekend, and
write the new tracepoints next weekend]

Best regards,
--Edwin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl
  2008-10-12 13:12 ` [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl Török Edwin
  2008-10-12 13:31   ` Frédéric Weisbecker
@ 2008-10-26  4:05   ` Frédéric Weisbecker
  2008-10-26  7:03     ` Török Edwin
  2008-10-26 13:15     ` Frank Ch. Eigler
  2008-10-27 16:03   ` Ingo Molnar
  2 siblings, 2 replies; 26+ messages in thread
From: Frédéric Weisbecker @ 2008-10-26  4:05 UTC (permalink / raw)
  To: Török Edwin
  Cc: mingo, srostedt, a.p.zijlstra, sandmann, linux-kernel

2008/10/12 Török Edwin <edwintorok@gmail.com>:
> Usage example:
> mount -t debugfs nodev /sys/kernel/debug
> cd /sys/kernel/debug/tracing
> echo userstacktrace >iter_ctrl
> echo sched_switch >current_tracer
> echo 1 >tracing_enabled
> .... run application ...
> echo 0 >tracing_enabled
>
> Then read one of 'trace','latency_trace','trace_pipe'
>
> Signed-off-by: Török Edwin <edwintorok@gmail.com>
> ---
>  Documentation/ftrace.txt     |    5 ++-
>  arch/x86/kernel/stacktrace.c |   57 +++++++++++++++++++++++++++++++++++
>  include/linux/stacktrace.h   |    8 +++++
>  kernel/trace/trace.c         |   67 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/trace/trace.h         |    2 +
>  5 files changed, 138 insertions(+), 1 deletions(-)
>
> diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt
> index d330fe3..f39cb68 100644
> --- a/Documentation/ftrace.txt
> +++ b/Documentation/ftrace.txt
> @@ -327,7 +327,7 @@ output. To see what is available, simply cat the file:
>
>   cat /debug/tracing/iter_ctrl
>   print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
> - noblock nostacktrace nosched-tree
> + noblock nostacktrace nosched-tree nouserstacktrace
>
>  To disable one of the options, echo in the option prepended with "no".
>
> @@ -381,6 +381,9 @@ Here are the available options:
>                When a trace is recorded, so is the stack of functions.
>                This allows for back traces of trace sites.
>
> +  userstacktrace - This option changes the trace.
> +                  It records a stacktrace of the current userspace thread.
> +
>   sched-tree - TBD (any users??)
>
>
> diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
> index d1d850a..000a965 100644
> --- a/arch/x86/kernel/stacktrace.c
> +++ b/arch/x86/kernel/stacktrace.c
> @@ -6,6 +6,7 @@
>  #include <linux/sched.h>
>  #include <linux/stacktrace.h>
>  #include <linux/module.h>
> +#include <linux/uaccess.h>
>  #include <asm/stacktrace.h>
>
>  static void save_stack_warning(void *data, char *msg)
> @@ -90,3 +91,59 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
>                trace->entries[trace->nr_entries++] = ULONG_MAX;
>  }
>  EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
> +
> +/* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
> +
> +struct stack_frame {
> +       const void __user       *next_fp;
> +       unsigned long           return_address;
> +};
> +
> +static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
> +{
> +       int ret;
> +
> +       if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
> +               return 0;
> +
> +       ret = 1;
> +       pagefault_disable();
> +       if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
> +               ret = 0;
> +       pagefault_enable();
> +
> +       return ret;
> +}
> +
> +void save_stack_trace_user(struct stack_trace *trace)
> +{
> +       /*
> +        * Trace user stack if we are not a kernel thread
> +        */
> +       if (current->mm) {
> +               const struct pt_regs *regs = task_pt_regs(current);
> +               const void __user *fp = (const void __user *)regs->bp;
> +
> +               if (trace->nr_entries < trace->max_entries)
> +                       trace->entries[trace->nr_entries++] = regs->ip;
> +
> +               while (trace->nr_entries < trace->max_entries) {
> +                       struct stack_frame frame;
> +                       frame.next_fp = NULL;
> +                       frame.return_address = 0;
> +                       if (!copy_stack_frame(fp, &frame))
> +                               break;
> +                       if ((unsigned long)fp < regs->sp)
> +                               break;
> +                       if (frame.return_address)
> +                               trace->entries[trace->nr_entries++] =
> +                                       frame.return_address;
> +                       if (fp == frame.next_fp)
> +                               break;
> +                       fp = frame.next_fp;
> +               }
> +       }
> +       if (trace->nr_entries < trace->max_entries)
> +               trace->entries[trace->nr_entries++] = ULONG_MAX;
> +}
> +
> diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
> index 6b8e54a..46704a5 100644
> --- a/include/linux/stacktrace.h
> +++ b/include/linux/stacktrace.h
> @@ -18,9 +18,17 @@ extern void save_stack_trace_tsk(struct task_struct *tsk,
>                                struct stack_trace *trace);
>
>  extern void print_stack_trace(struct stack_trace *trace, int spaces);
> +
> +#ifdef CONFIG_X86
> +extern void save_stack_trace_user(struct stack_trace *trace);
> +#else
> +# define save_stack_trace_user(trace)              do { } while (0)
> +#endif
> +
>  #else
>  # define save_stack_trace(trace)                       do { } while (0)
>  # define save_stack_trace_tsk(tsk, trace)              do { } while (0)
> +# define save_stack_trace_user(trace)              do { } while (0)
>  # define print_stack_trace(trace, spaces)              do { } while (0)
>  #endif
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index d345d64..4c17453 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -212,6 +212,7 @@ static const char *trace_options[] = {
>        "stacktrace",
>        "sched-tree",
>        "ftrace_printk",
> +       "userstacktrace",
>        NULL
>  };
>
> @@ -735,6 +736,28 @@ void __trace_stack(struct trace_array *tr,
>        ftrace_trace_stack(tr, data, flags, skip, preempt_count());
>  }
>
> +static void __trace_userstack(struct trace_array *tr,
> +                  struct trace_array_cpu *data,
> +                  unsigned long flags)
> +{
> +       struct trace_entry *entry;
> +       struct stack_trace trace;
> +
> +       if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
> +               return;
> +
> +       entry                   = tracing_get_trace_entry(tr, data);
> +       tracing_generic_entry_update(entry, flags);
> +       entry->type             = TRACE_USER_STACK;
> +
> +       memset(&entry->field.stack, 0, sizeof(entry->field.stack));
> +       trace.nr_entries        = 0;
> +       trace.max_entries       = FTRACE_STACK_ENTRIES;
> +       trace.skip              = 0;
> +       trace.entries           = entry->field.stack.caller;
> +       save_stack_trace_user(&trace);
> +}
> +
>  static void
>  ftrace_trace_special(void *__tr, void *__data,
>                     unsigned long arg1, unsigned long arg2, unsigned long arg3,
> @@ -758,6 +781,7 @@ ftrace_trace_special(void *__tr, void *__data,
>        entry->arg3                     = arg3;
>        ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
>        ftrace_trace_stack(tr, data, irq_flags, 4, pc);
> +       __trace_userstack(tr, data, irq_flags);
>
>        trace_wake_up();
>  }
> @@ -1151,6 +1175,31 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
>        return ret;
>  }
>
> +static int
> +seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
> +               unsigned long sym_flags)
> +{
> +       int ret = 1;
> +       unsigned i;
> +
> +       for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
> +               unsigned long ip = entry->field.stack.caller[i];
> +
> +               if (ip == ULONG_MAX || !ret)
> +                       break;
> +               if (i)
> +                       ret = trace_seq_puts(s, " <- ");
> +               if (!ip) {
> +                       ret = trace_seq_puts(s, "??");
> +                       continue;
> +               }
> +               if (sym_flags & TRACE_ITER_SYM_ADDR)
> +                       ret = trace_seq_printf(s, " <" IP_FMT ">", ip);
> +       }
> +
> +       return ret;
> +}
> +
>  static void print_lat_help_header(struct seq_file *m)
>  {
>        seq_puts(m, "#                  _------=> CPU#            \n");
> @@ -1437,6 +1486,10 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
>                        trace_seq_print_cont(s, iter);
>                break;
>        }
> +       case TRACE_USER_STACK:
> +               seq_print_userip_objs(entry, s, sym_flags);
> +               trace_seq_putc(s, '\n');
> +               break;
>        default:
>                trace_seq_printf(s, "Unknown type %d\n", entry->type);
>        }
> @@ -1572,6 +1625,14 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
>                if (entry->flags & TRACE_FLAG_CONT)
>                        trace_seq_print_cont(s, iter);
>                break;
> +       case TRACE_USER_STACK:
> +               ret = seq_print_userip_objs(entry, s, sym_flags);
> +               if (!ret)
> +                       return 0;
> +               ret = trace_seq_putc(s, '\n');
> +               if (!ret)
> +                       return 0;
> +               break;
>        }
>        }
>        return TRACE_TYPE_HANDLED;
> @@ -1632,6 +1693,7 @@ static enum print_line_t print_raw_fmt(struct trace_iterator *iter)
>                break;
>        }
>        case TRACE_SPECIAL:
> +       case TRACE_USER_STACK:
>        case TRACE_STACK: {
>                struct special_entry *field;
>
> @@ -1720,6 +1782,7 @@ static enum print_line_t print_hex_fmt(struct trace_iterator *iter)
>                break;
>        }
>        case TRACE_SPECIAL:
> +       case TRACE_USER_STACK:
>        case TRACE_STACK: {
>                struct special_entry *field;
>
> @@ -1774,6 +1837,7 @@ static enum print_line_t print_bin_fmt(struct trace_iterator *iter)
>                break;
>        }
>        case TRACE_SPECIAL:
> +       case TRACE_USER_STACK:
>        case TRACE_STACK: {
>                struct special_entry *field;
>
> @@ -3103,6 +3167,9 @@ void ftrace_dump(void)
>                atomic_inc(&global_trace.data[cpu]->disabled);
>        }
>
> +       /* don't look at user memory in panic mode */
> +       trace_flags &= ~TRACE_ITER_SYM_USEROBJ;
> +
>        printk(KERN_TRACE "Dumping ftrace buffer:\n");
>
>        iter.tr = &global_trace;
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index f1f9957..3ad6343 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -22,6 +22,7 @@ enum trace_type {
>        TRACE_MMIO_RW,
>        TRACE_MMIO_MAP,
>        TRACE_BOOT,
> +       TRACE_USER_STACK,
>
>        __TRACE_LAST_TYPE
>  };
> @@ -413,6 +414,7 @@ enum trace_iterator_flags {
>        TRACE_ITER_STACKTRACE           = 0x100,
>        TRACE_ITER_SCHED_TREE           = 0x200,
>        TRACE_ITER_PRINTK               = 0x400,
> +       TRACE_ITER_USERSTACKTRACE       = 0x800
>  };
>
>  extern struct tracer nop_trace;
> --


Hi Török,

I just read those two patches about user stack tracing a bit more
closely. Note that despite the recent changes to the tracing API
(the ring buffer and other things), the series still applies fairly
well, with a few small conflicts, but it needs some updates to
support the latest changes.

I adapted the patch and it builds fine, but I haven't tested it yet.
However, some parts are architecture-dependent; I guess those pieces
would find a better home in the arch directory, which would allow a
clean, separate per-architecture implementation.

Also, most architectures already implement stack tracing; even though
yours involves special operations such as copy_from_user(), I wonder
whether the already established code couldn't be reused for your needs.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl
  2008-10-26  4:05   ` Frédéric Weisbecker
@ 2008-10-26  7:03     ` Török Edwin
  2008-10-26 15:06       ` Frédéric Weisbecker
  2008-10-26 13:15     ` Frank Ch. Eigler
  1 sibling, 1 reply; 26+ messages in thread
From: Török Edwin @ 2008-10-26  7:03 UTC (permalink / raw)
  To: Frédéric Weisbecker
  Cc: mingo, srostedt, a.p.zijlstra, sandmann, linux-kernel

On 2008-10-26 06:05, Frédéric Weisbecker wrote:
> Hi Török,
>
> I just read those two patches about user stack tracing a bit more
> closely. Note that despite the recent changes to the tracing API
> (the ring buffer and other things), the series still applies fairly
> well, with a few small conflicts, but it needs some updates to
> support the latest changes.
>
> I adapted the patch and it builds fine, but I haven't tested it yet.
> However, some parts are architecture-dependent; I guess those pieces
> would find a better home in the arch directory, which would allow a
> clean, separate per-architecture implementation.
>   

Hi,

The stacktrace code itself is in the arch directory, did I miss anything
else that should belong there?

arch/x86/kernel/stacktrace.c |   57 +++++++++++++++++++++++++++++++++++
include/linux/stacktrace.h   |    8 +++++


> And most arch implement a stacktrace implementation, even if yours
> involve special operations such as copy_from_user, I wonder if the
> already established code couldn't be reused for your needs.
>   

I think it would make sense to reuse oprofile's user stack tracing
code; there is a backtrace operation in 'struct oprofile_operations'.
Perhaps that could be extended so that non-oprofile code can call it?
[Currently it uses oprofile_add_trace() to store the stacktrace.]

Best regards,
--Edwin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl
  2008-10-26  4:05   ` Frédéric Weisbecker
  2008-10-26  7:03     ` Török Edwin
@ 2008-10-26 13:15     ` Frank Ch. Eigler
  2008-10-26 13:29       ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Frank Ch. Eigler @ 2008-10-26 13:15 UTC (permalink / raw)
  To: Frédéric Weisbecker
  Cc: Török Edwin,
	mingo, srostedt, a.p.zijlstra, sandmann, linux-kernel

Frédéric Weisbecker <fweisbec@gmail.com> writes:

> [...]
>> +/* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
>> +
>> +struct stack_frame {
>> +       const void __user       *next_fp;
>> +       unsigned long           return_address;
>> +};
> [...]

To what extent does this actually work, and on which architectures?
While the kernel appears willing to sacrifice some performance for
functionality by building with frame pointers at the drop of a tracing
hat, userspace is not like that.

- FChE

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl
  2008-10-26 13:15     ` Frank Ch. Eigler
@ 2008-10-26 13:29       ` Peter Zijlstra
  2008-10-26 13:38         ` Török Edwin
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-10-26 13:29 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: fweisbec, Török Edwin, mingo, srostedt, sandmann, linux-kernel

On Sun, 2008-10-26 at 09:15 -0400, Frank Ch. Eigler wrote:
> Frédéric Weisbecker <fweisbec@gmail.com> writes:
> 
> > [...]
> >> +/* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
> >> +
> >> +struct stack_frame {
> >> +       const void __user       *next_fp;
> >> +       unsigned long           return_address;
> >> +};
> > [...]
> 
> To what extent does this actually work, and on which architectures?
> While the kernel appears willing to sacrifice some performance for
> functionality by building with frame pointers at the drop of a tracing
> hat, userspace is not like that.

Aww, c'mon, Gentoo is brilliant for that :-)

CFLAGS+=-fno-omit-frame-pointer

emerge -uDNe world

Then again, you'd better not have all that desktop bloat installed,
otherwise that will take ages.. KDE/OOo compile times anyone?

/me runs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl
  2008-10-26 13:29       ` Peter Zijlstra
@ 2008-10-26 13:38         ` Török Edwin
  2008-10-26 13:49           ` Frank Ch. Eigler
  0 siblings, 1 reply; 26+ messages in thread
From: Török Edwin @ 2008-10-26 13:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frank Ch. Eigler, fweisbec, mingo, srostedt, sandmann, linux-kernel

On 2008-10-26 15:29, Peter Zijlstra wrote:
> On Sun, 2008-10-26 at 09:15 -0400, Frank Ch. Eigler wrote:
>   
>> Frédéric Weisbecker <fweisbec@gmail.com> writes:
>>
>>     
>>> [...]
>>>       
>>>> +/* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
>>>> +
>>>> +struct stack_frame {
>>>> +       const void __user       *next_fp;
>>>> +       unsigned long           return_address;
>>>> +};
>>>>         
>>> [...]
>>>       
>> To what extent does this actually work, and on which architectures?
>> While the kernel appears willing to sacrifice some performance for
>> functionality by building with frame pointers at the drop of a tracing
>> hat, userspace is not like that.
>>     
>
> Aww, c'mon, Gentoo is brilliant for that :-)
>
> CFLAGS+=-fno-omit-frame-pointer
>
> emerge -uDNe world
>
> Then again, you'd better not have all that desktop bloat installed,
> otherwise that will take ages.. KDE/OOo compile times anyone?
>
> /me runs

It should be enough to rebuild the application you are tracing[1] and
its libraries, or at least libc.
In userspace I can get a stacktrace using DWARF unwind info, but doing
that in the kernel would be too expensive, right?

[1] assuming you are tracing latencies in a single application

Best regards,
--Edwin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl
  2008-10-26 13:38         ` Török Edwin
@ 2008-10-26 13:49           ` Frank Ch. Eigler
  0 siblings, 0 replies; 26+ messages in thread
From: Frank Ch. Eigler @ 2008-10-26 13:49 UTC (permalink / raw)
  To: Török Edwin
  Cc: Peter Zijlstra, fweisbec, mingo, srostedt, sandmann, linux-kernel

Hi -

On Sun, Oct 26, 2008 at 03:38:45PM +0200, Török Edwin wrote:

> [...]  In userspace I can get a stacktrace using DWARF unwind info,
> but doing that in the kernel would be too expensive, right? [...]

We're experimenting with just that in systemtap.  The expenses
(memory: megabytes of unwind data, time: running code to process it)
would be paid only for the duration of a systemtap script that needs
that particular sort of information.

- FChE

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl
  2008-10-26  7:03     ` Török Edwin
@ 2008-10-26 15:06       ` Frédéric Weisbecker
  0 siblings, 0 replies; 26+ messages in thread
From: Frédéric Weisbecker @ 2008-10-26 15:06 UTC (permalink / raw)
  To: Török Edwin
  Cc: mingo, srostedt, a.p.zijlstra, sandmann, linux-kernel

2008/10/26 Török Edwin <edwintorok@gmail.com>:
> Hi,
>
> The stacktrace code itself is in the arch directory, did I miss anything
> else that should belong there?

No, that's my bad, I misread the patch. Sorry :)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl
  2008-10-12 13:12 ` [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl Török Edwin
  2008-10-12 13:31   ` Frédéric Weisbecker
  2008-10-26  4:05   ` Frédéric Weisbecker
@ 2008-10-27 16:03   ` Ingo Molnar
  2008-10-27 16:16     ` Török Edwin
  2 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2008-10-27 16:03 UTC (permalink / raw)
  To: Török Edwin; +Cc: srostedt, a.p.zijlstra, sandmann, linux-kernel


* Török Edwin <edwintorok@gmail.com> wrote:

> Usage example:
> mount -t debugfs nodev /sys/kernel/debug
> cd /sys/kernel/debug/tracing
> echo userstacktrace >iter_ctrl
> echo sched_switch >current_tracer
> echo 1 >tracing_enabled
> .... run application ...
> echo 0 >tracing_enabled
> 
> Then read one of 'trace','latency_trace','trace_pipe'

okay, this makes quite a bit of sense - and sysprof already kind of 
walks down into the user-space stack. (and so does oprofile, if asked)

Could you send this independently of the lock contention tracing 
patches perhaps?

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl
  2008-10-27 16:03   ` Ingo Molnar
@ 2008-10-27 16:16     ` Török Edwin
  0 siblings, 0 replies; 26+ messages in thread
From: Török Edwin @ 2008-10-27 16:16 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: srostedt, a.p.zijlstra, sandmann, linux-kernel

On 2008-10-27 18:03, Ingo Molnar wrote:
> * Török Edwin <edwintorok@gmail.com> wrote:
>
>   
>> Usage example:
>> mount -t debugfs nodev /sys/kernel/debug
>> cd /sys/kernel/debug/tracing
>> echo userstacktrace >iter_ctrl
>> echo sched_switch >current_tracer
>> echo 1 >tracing_enabled
>> .... run application ...
>> echo 0 >tracing_enabled
>>
>> Then read one of 'trace','latency_trace','trace_pipe'
>>     
>
> okay, this makes quite a bit of sense - and sysprof already kind of 
> walks down into the user-space stack. (and so does oprofile, if asked)
>
> Could you send this independently of the lock contention tracing 
> patches perhaps?

Sure, and "[PATCH 2/4] Identify which executable object the userspace
address belongs to" is independent of the lock tracing part too.

Perhaps I should send these 2 patches as 3 separate patches:
- introduce save_stack_trace_user in arch/
- the ftrace parts for user stack tracing (userstacktrace >iter_ctrl)
- the sym-userobj part (which is useful when ASLR is on; without it
you have no way to resolve the symbols later, once the app is gone)

I promised the lock contention tracepoints for this weekend, I'll send
out these patches at the same time too.

Best regards,
--Edwin

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2008-10-27 16:16 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-12 13:11 [PATCH 0/4] ftrace: add userspace stacktrace support and semaphore-latency tracer Török Edwin
2008-10-12 13:12 ` Török Edwin
2008-10-12 13:12 ` [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl Török Edwin
2008-10-12 13:31   ` Frédéric Weisbecker
2008-10-12 13:53     ` Török Edwin
2008-10-13  8:02       ` Frédéric Weisbecker
2008-10-26  4:05   ` Frédéric Weisbecker
2008-10-26  7:03     ` Török Edwin
2008-10-26 15:06       ` Frédéric Weisbecker
2008-10-26 13:15     ` Frank Ch. Eigler
2008-10-26 13:29       ` Peter Zijlstra
2008-10-26 13:38         ` Török Edwin
2008-10-26 13:49           ` Frank Ch. Eigler
2008-10-27 16:03   ` Ingo Molnar
2008-10-27 16:16     ` Török Edwin
2008-10-12 13:12 ` [PATCH 2/4] Identify which executable object the userspace address belongs to Török Edwin
2008-10-12 13:12 ` [PATCH 3/4] add tracepoints in rwsem Török Edwin
2008-10-12 13:12 ` [PATCH 4/4] Implement semaphore latency tracer Török Edwin
2008-10-12 19:13   ` Peter Zijlstra
2008-10-12 20:10     ` Török Edwin
2008-10-22 15:28       ` Ingo Molnar
2008-10-22 15:41         ` Török Edwin
2008-10-22 15:48           ` Ingo Molnar
2008-10-22 17:22             ` Peter Zijlstra
2008-10-22 17:25               ` Török Edwin
2008-10-12 18:25 ` [PATCH 0/4] ftrace: add userspace stacktrace support and semaphore-latency tracer Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).