LKML Archive on lore.kernel.org
* [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2]
@ 2009-05-22  4:17 Paul Mackerras
  2009-05-22  4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Paul Mackerras @ 2009-05-22  4:17 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, linux-kernel, Corey Ashford, Thomas Gleixner

This replaces the struct perf_counter_context in the task_struct with
a pointer to a dynamically allocated perf_counter_context struct.  The
main reason for doing this is to allow us to transfer a
perf_counter_context from one task to another when we do lazy PMU
switching in a later patch.

This has a few side-benefits: the task_struct becomes a little smaller,
we save some memory because only tasks that have perf_counters attached
get a perf_counter_context allocated for them, and we can remove the
inclusion of <linux/perf_counter.h> in sched.h, meaning that we don't
end up recompiling nearly everything whenever perf_counter.h changes.

The perf_counter_context structures are reference-counted and freed
when the last reference is dropped.  A context can have references
from its task and the counters on its task.  Counters can outlive the
task so it is possible that a context will be freed well after its
task has exited.

Contexts are allocated on fork if the parent had a context, or
otherwise the first time that a per-task counter is created on a task.
In the latter case, we set the context pointer in the task struct
locklessly using an atomic compare-and-exchange operation in case we
raced with some other task in creating a context for the subject task.
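
In outline (a simplified sketch of the find_get_context() change below,
with error handling and the task refcounting omitted), the lockless
install looks like this:

	ctx = task->perf_counter_ctxp;
	if (!ctx) {
		ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
		if (!ctx)
			return ERR_PTR(-ENOMEM);
		__perf_counter_init_context(ctx, task);
		/* another task may install a context first; use theirs if so */
		tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx);
		if (tctx) {
			kfree(ctx);	/* we lost the race */
			ctx = tctx;
		}
	}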

This also removes the task pointer from the perf_counter struct.  The
task pointer was not used anywhere and would make it harder to move a
context from one task to another.  Anything that needed to know which
task a counter was attached to was already using counter->ctx->task.

The __perf_counter_init_context function moves up in perf_counter.c
so that it can be called from find_get_context, and now initializes
the refcount, but is otherwise unchanged.

We were potentially calling list_del_counter twice: once from
__perf_counter_exit_task when the task exits and once from
__perf_counter_remove_from_context when the counter's fd gets closed.
This adds a check in list_del_counter so it doesn't do anything if
the counter has already been removed from the lists.

Since perf_counter_task_sched_in doesn't do anything if the task doesn't
have a context, and leaves cpuctx->task_ctx = NULL, this adds code to
__perf_install_in_context to set cpuctx->task_ctx if necessary, i.e. in
the case where the current task adds the first counter to itself and
thus creates a context for itself.

This also adds similar code to __perf_counter_enable to handle a
similar situation which can arise when the counters have been disabled
using prctl; that also leaves cpuctx->task_ctx = NULL.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/x86/kernel/apic/apic.c  |    1 +
 include/linux/init_task.h    |   13 ---
 include/linux/perf_counter.h |    4 +-
 include/linux/sched.h        |    6 +-
 kernel/exit.c                |    3 +-
 kernel/fork.c                |    1 +
 kernel/perf_counter.c        |  218 ++++++++++++++++++++++++++----------------
 7 files changed, 145 insertions(+), 101 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index e9021a9..b4f6440 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -14,6 +14,7 @@
  *	Mikael Pettersson	:	PM converted to driver model.
  */
 
+#include <linux/perf_counter.h>
 #include <linux/kernel_stat.h>
 #include <linux/mc146818rtc.h>
 #include <linux/acpi_pmtmr.h>
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 503afaa..d87247d 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -108,18 +108,6 @@ extern struct group_info init_groups;
 
 extern struct cred init_cred;
 
-#ifdef CONFIG_PERF_COUNTERS
-# define INIT_PERF_COUNTERS(tsk)					\
-	.perf_counter_ctx.counter_list =				\
-		LIST_HEAD_INIT(tsk.perf_counter_ctx.counter_list),	\
-	.perf_counter_ctx.event_list =					\
-		LIST_HEAD_INIT(tsk.perf_counter_ctx.event_list),	\
-	.perf_counter_ctx.lock =					\
-		__SPIN_LOCK_UNLOCKED(tsk.perf_counter_ctx.lock),
-#else
-# define INIT_PERF_COUNTERS(tsk)
-#endif
-
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -183,7 +171,6 @@ extern struct cred init_cred;
 	},								\
 	.dirties = INIT_PROP_LOCAL_SINGLE(dirties),			\
 	INIT_IDS							\
-	INIT_PERF_COUNTERS(tsk)						\
 	INIT_TRACE_IRQFLAGS						\
 	INIT_LOCKDEP							\
 	INIT_FTRACE_GRAPH						\
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index f612941..0713090 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -449,7 +449,6 @@ struct perf_counter {
 	struct hw_perf_counter		hw;
 
 	struct perf_counter_context	*ctx;
-	struct task_struct		*task;
 	struct file			*filp;
 
 	struct perf_counter		*parent;
@@ -498,7 +497,6 @@ struct perf_counter {
  * Used as a container for task counters and CPU counters as well:
  */
 struct perf_counter_context {
-#ifdef CONFIG_PERF_COUNTERS
 	/*
 	 * Protect the states of the counters in the list,
 	 * nr_active, and the list:
@@ -516,6 +514,7 @@ struct perf_counter_context {
 	int			nr_counters;
 	int			nr_active;
 	int			is_active;
+	atomic_t		refcount;
 	struct task_struct	*task;
 
 	/*
@@ -523,7 +522,6 @@ struct perf_counter_context {
 	 */
 	u64			time;
 	u64			timestamp;
-#endif
 };
 
 /**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ff59d12..9714d45 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -71,7 +71,6 @@ struct sched_param {
 #include <linux/path.h>
 #include <linux/compiler.h>
 #include <linux/completion.h>
-#include <linux/perf_counter.h>
 #include <linux/pid.h>
 #include <linux/percpu.h>
 #include <linux/topology.h>
@@ -99,6 +98,7 @@ struct robust_list_head;
 struct bio;
 struct bts_tracer;
 struct fs_struct;
+struct perf_counter_context;
 
 /*
  * List of flags we want to share for kernel threads,
@@ -1387,7 +1387,9 @@ struct task_struct {
 	struct list_head pi_state_list;
 	struct futex_pi_state *pi_state_cache;
 #endif
-	struct perf_counter_context perf_counter_ctx;
+#ifdef CONFIG_PERF_COUNTERS
+	struct perf_counter_context *perf_counter_ctxp;
+#endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *mempolicy;
 	short il_next;
diff --git a/kernel/exit.c b/kernel/exit.c
index f9dfedd..99ad406 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -48,6 +48,7 @@
 #include <linux/tracehook.h>
 #include <linux/fs_struct.h>
 #include <linux/init_task.h>
+#include <linux/perf_counter.h>
 #include <trace/sched.h>
 
 #include <asm/uaccess.h>
@@ -159,7 +160,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
 
 #ifdef CONFIG_PERF_COUNTERS
-	WARN_ON_ONCE(!list_empty(&tsk->perf_counter_ctx.counter_list));
+	WARN_ON_ONCE(tsk->perf_counter_ctxp);
 #endif
 	trace_sched_process_free(tsk);
 	put_task_struct(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index d32fef4..e72a09f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -63,6 +63,7 @@
 #include <linux/fs_struct.h>
 #include <trace/sched.h>
 #include <linux/magic.h>
+#include <linux/perf_counter.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 08584c1..06ea3ea 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -97,6 +97,17 @@ void perf_enable(void)
 		hw_perf_enable();
 }
 
+static void get_ctx(struct perf_counter_context *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+static void put_ctx(struct perf_counter_context *ctx)
+{
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+}
+
 static void
 list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
 {
@@ -118,11 +129,17 @@ list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
 	ctx->nr_counters++;
 }
 
+/*
+ * Remove a counter from the lists for its context.
+ * Must be called with counter->mutex and ctx->mutex held.
+ */
 static void
 list_del_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
 {
 	struct perf_counter *sibling, *tmp;
 
+	if (list_empty(&counter->list_entry))
+		return;
 	ctx->nr_counters--;
 
 	list_del_init(&counter->list_entry);
@@ -216,8 +233,6 @@ static void __perf_counter_remove_from_context(void *info)
 
 	counter_sched_out(counter, cpuctx, ctx);
 
-	counter->task = NULL;
-
 	list_del_counter(counter, ctx);
 
 	if (!ctx->task) {
@@ -279,7 +294,6 @@ retry:
 	 */
 	if (!list_empty(&counter->list_entry)) {
 		list_del_counter(counter, ctx);
-		counter->task = NULL;
 	}
 	spin_unlock_irq(&ctx->lock);
 }
@@ -568,11 +582,17 @@ static void __perf_install_in_context(void *info)
 	 * If this is a task context, we need to check whether it is
 	 * the current task context of this cpu. If not it has been
 	 * scheduled out before the smp call arrived.
+	 * Or possibly this is the right context but it isn't
+	 * on this cpu because it had no counters.
 	 */
-	if (ctx->task && cpuctx->task_ctx != ctx)
-		return;
+	if (ctx->task && cpuctx->task_ctx != ctx) {
+		if (cpuctx->task_ctx || ctx->task != current)
+			return;
+		cpuctx->task_ctx = ctx;
+	}
 
 	spin_lock_irqsave(&ctx->lock, flags);
+	ctx->is_active = 1;
 	update_context_time(ctx);
 
 	/*
@@ -653,7 +673,6 @@ perf_install_in_context(struct perf_counter_context *ctx,
 		return;
 	}
 
-	counter->task = task;
 retry:
 	task_oncpu_function_call(task, __perf_install_in_context,
 				 counter);
@@ -693,10 +712,14 @@ static void __perf_counter_enable(void *info)
 	 * If this is a per-task counter, need to check whether this
 	 * counter's task is the current task on this cpu.
 	 */
-	if (ctx->task && cpuctx->task_ctx != ctx)
-		return;
+	if (ctx->task && cpuctx->task_ctx != ctx) {
+		if (cpuctx->task_ctx || ctx->task != current)
+			return;
+		cpuctx->task_ctx = ctx;
+	}
 
 	spin_lock_irqsave(&ctx->lock, flags);
+	ctx->is_active = 1;
 	update_context_time(ctx);
 
 	counter->prev_state = counter->state;
@@ -852,10 +875,10 @@ void __perf_counter_sched_out(struct perf_counter_context *ctx,
 void perf_counter_task_sched_out(struct task_struct *task, int cpu)
 {
 	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
-	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+	struct perf_counter_context *ctx = task->perf_counter_ctxp;
 	struct pt_regs *regs;
 
-	if (likely(!cpuctx->task_ctx))
+	if (likely(!ctx || !cpuctx->task_ctx))
 		return;
 
 	update_context_time(ctx);
@@ -871,6 +894,8 @@ static void __perf_counter_task_sched_out(struct perf_counter_context *ctx)
 {
 	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
 
+	if (!cpuctx->task_ctx)
+		return;
 	__perf_counter_sched_out(ctx, cpuctx);
 	cpuctx->task_ctx = NULL;
 }
@@ -969,8 +994,10 @@ __perf_counter_sched_in(struct perf_counter_context *ctx,
 void perf_counter_task_sched_in(struct task_struct *task, int cpu)
 {
 	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
-	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+	struct perf_counter_context *ctx = task->perf_counter_ctxp;
 
+	if (likely(!ctx))
+		return;
 	__perf_counter_sched_in(ctx, cpuctx, cpu);
 	cpuctx->task_ctx = ctx;
 }
@@ -985,11 +1012,11 @@ static void perf_counter_cpu_sched_in(struct perf_cpu_context *cpuctx, int cpu)
 int perf_counter_task_disable(void)
 {
 	struct task_struct *curr = current;
-	struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+	struct perf_counter_context *ctx = curr->perf_counter_ctxp;
 	struct perf_counter *counter;
 	unsigned long flags;
 
-	if (likely(!ctx->nr_counters))
+	if (!ctx || !ctx->nr_counters)
 		return 0;
 
 	local_irq_save(flags);
@@ -1020,12 +1047,12 @@ int perf_counter_task_disable(void)
 int perf_counter_task_enable(void)
 {
 	struct task_struct *curr = current;
-	struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+	struct perf_counter_context *ctx = curr->perf_counter_ctxp;
 	struct perf_counter *counter;
 	unsigned long flags;
 	int cpu;
 
-	if (likely(!ctx->nr_counters))
+	if (!ctx || !ctx->nr_counters)
 		return 0;
 
 	local_irq_save(flags);
@@ -1128,19 +1155,23 @@ void perf_counter_task_tick(struct task_struct *curr, int cpu)
 		return;
 
 	cpuctx = &per_cpu(perf_cpu_context, cpu);
-	ctx = &curr->perf_counter_ctx;
+	ctx = curr->perf_counter_ctxp;
 
 	perf_adjust_freq(&cpuctx->ctx);
-	perf_adjust_freq(ctx);
+	if (ctx)
+		perf_adjust_freq(ctx);
 
 	perf_counter_cpu_sched_out(cpuctx);
-	__perf_counter_task_sched_out(ctx);
+	if (ctx)
+		__perf_counter_task_sched_out(ctx);
 
 	rotate_ctx(&cpuctx->ctx);
-	rotate_ctx(ctx);
+	if (ctx)
+		rotate_ctx(ctx);
 
 	perf_counter_cpu_sched_in(cpuctx, cpu);
-	perf_counter_task_sched_in(curr, cpu);
+	if (ctx)
+		perf_counter_task_sched_in(curr, cpu);
 }
 
 /*
@@ -1176,6 +1207,22 @@ static u64 perf_counter_read(struct perf_counter *counter)
 	return atomic64_read(&counter->count);
 }
 
+/*
+ * Initialize the perf_counter context in a task_struct:
+ */
+static void
+__perf_counter_init_context(struct perf_counter_context *ctx,
+			    struct task_struct *task)
+{
+	memset(ctx, 0, sizeof(*ctx));
+	spin_lock_init(&ctx->lock);
+	mutex_init(&ctx->mutex);
+	INIT_LIST_HEAD(&ctx->counter_list);
+	INIT_LIST_HEAD(&ctx->event_list);
+	atomic_set(&ctx->refcount, 1);
+	ctx->task = task;
+}
+
 static void put_context(struct perf_counter_context *ctx)
 {
 	if (ctx->task)
@@ -1186,6 +1233,7 @@ static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
 {
 	struct perf_cpu_context *cpuctx;
 	struct perf_counter_context *ctx;
+	struct perf_counter_context *tctx;
 	struct task_struct *task;
 
 	/*
@@ -1225,15 +1273,36 @@ static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
 	if (!task)
 		return ERR_PTR(-ESRCH);
 
-	ctx = &task->perf_counter_ctx;
-	ctx->task = task;
-
 	/* Reuse ptrace permission checks for now. */
 	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
-		put_context(ctx);
+		put_task_struct(task);
 		return ERR_PTR(-EACCES);
 	}
 
+	ctx = task->perf_counter_ctxp;
+	if (!ctx) {
+		ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL);
+		if (!ctx) {
+			put_task_struct(task);
+			return ERR_PTR(-ENOMEM);
+		}
+		__perf_counter_init_context(ctx, task);
+		/*
+		 * Make sure other cpus see correct values for *ctx
+		 * once task->perf_counter_ctxp is visible to them.
+		 */
+		smp_wmb();
+		tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx);
+		if (tctx) {
+			/*
+			 * We raced with some other task; use
+			 * the context they set.
+			 */
+			kfree(ctx);
+			ctx = tctx;
+		}
+	}
+
 	return ctx;
 }
 
@@ -1242,6 +1311,7 @@ static void free_counter_rcu(struct rcu_head *head)
 	struct perf_counter *counter;
 
 	counter = container_of(head, struct perf_counter, rcu_head);
+	put_ctx(counter->ctx);
 	kfree(counter);
 }
 
@@ -2247,7 +2317,7 @@ static void perf_counter_comm_event(struct perf_comm_event *comm_event)
 	perf_counter_comm_ctx(&cpuctx->ctx, comm_event);
 	put_cpu_var(perf_cpu_context);
 
-	perf_counter_comm_ctx(&current->perf_counter_ctx, comm_event);
+	perf_counter_comm_ctx(current->perf_counter_ctxp, comm_event);
 }
 
 void perf_counter_comm(struct task_struct *task)
@@ -2256,7 +2326,9 @@ void perf_counter_comm(struct task_struct *task)
 
 	if (!atomic_read(&nr_comm_tracking))
 		return;
-       
+	if (!current->perf_counter_ctxp)
+		return;
+
 	comm_event = (struct perf_comm_event){
 		.task	= task,
 		.event  = {
@@ -2372,7 +2444,7 @@ got_name:
 	perf_counter_mmap_ctx(&cpuctx->ctx, mmap_event);
 	put_cpu_var(perf_cpu_context);
 
-	perf_counter_mmap_ctx(&current->perf_counter_ctx, mmap_event);
+	perf_counter_mmap_ctx(current->perf_counter_ctxp, mmap_event);
 
 	kfree(buf);
 }
@@ -2384,6 +2456,8 @@ void perf_counter_mmap(unsigned long addr, unsigned long len,
 
 	if (!atomic_read(&nr_mmap_tracking))
 		return;
+	if (!current->perf_counter_ctxp)
+		return;
 
 	mmap_event = (struct perf_mmap_event){
 		.file   = file,
@@ -2985,6 +3059,7 @@ perf_counter_alloc(struct perf_counter_hw_event *hw_event,
 	counter->group_leader		= group_leader;
 	counter->pmu			= NULL;
 	counter->ctx			= ctx;
+	get_ctx(ctx);
 
 	counter->state = PERF_COUNTER_STATE_INACTIVE;
 	if (hw_event->disabled)
@@ -3150,21 +3225,6 @@ err_put_context:
 }
 
 /*
- * Initialize the perf_counter context in a task_struct:
- */
-static void
-__perf_counter_init_context(struct perf_counter_context *ctx,
-			    struct task_struct *task)
-{
-	memset(ctx, 0, sizeof(*ctx));
-	spin_lock_init(&ctx->lock);
-	mutex_init(&ctx->mutex);
-	INIT_LIST_HEAD(&ctx->counter_list);
-	INIT_LIST_HEAD(&ctx->event_list);
-	ctx->task = task;
-}
-
-/*
  * inherit a counter from parent task to child task:
  */
 static struct perf_counter *
@@ -3195,7 +3255,6 @@ inherit_counter(struct perf_counter *parent_counter,
 	/*
 	 * Link it up in the child's context:
 	 */
-	child_counter->task = child;
 	add_counter_to_ctx(child_counter, child_ctx);
 
 	child_counter->parent = parent_counter;
@@ -3294,40 +3353,15 @@ __perf_counter_exit_task(struct task_struct *child,
 	struct perf_counter *parent_counter;
 
 	/*
-	 * If we do not self-reap then we have to wait for the
-	 * child task to unschedule (it will happen for sure),
-	 * so that its counter is at its final count. (This
-	 * condition triggers rarely - child tasks usually get
-	 * off their CPU before the parent has a chance to
-	 * get this far into the reaping action)
+	 * Protect against concurrent operations on child_counter
+	 * due its fd getting closed, etc.
 	 */
-	if (child != current) {
-		wait_task_inactive(child, 0);
-		update_counter_times(child_counter);
-		list_del_counter(child_counter, child_ctx);
-	} else {
-		struct perf_cpu_context *cpuctx;
-		unsigned long flags;
-
-		/*
-		 * Disable and unlink this counter.
-		 *
-		 * Be careful about zapping the list - IRQ/NMI context
-		 * could still be processing it:
-		 */
-		local_irq_save(flags);
-		perf_disable();
-
-		cpuctx = &__get_cpu_var(perf_cpu_context);
+	mutex_lock(&child_counter->mutex);
 
-		group_sched_out(child_counter, cpuctx, child_ctx);
-		update_counter_times(child_counter);
+	update_counter_times(child_counter);
+	list_del_counter(child_counter, child_ctx);
 
-		list_del_counter(child_counter, child_ctx);
-
-		perf_enable();
-		local_irq_restore(flags);
-	}
+	mutex_unlock(&child_counter->mutex);
 
 	parent_counter = child_counter->parent;
 	/*
@@ -3346,19 +3380,29 @@ __perf_counter_exit_task(struct task_struct *child,
  *
  * Note: we may be running in child context, but the PID is not hashed
  * anymore so new counters will not be added.
+ * (XXX not sure that is true when we get called from flush_old_exec.
+ *  -- paulus)
  */
 void perf_counter_exit_task(struct task_struct *child)
 {
 	struct perf_counter *child_counter, *tmp;
 	struct perf_counter_context *child_ctx;
+	unsigned long flags;
 
 	WARN_ON_ONCE(child != current);
 
-	child_ctx = &child->perf_counter_ctx;
+	child_ctx = child->perf_counter_ctxp;
 
-	if (likely(!child_ctx->nr_counters))
+	if (likely(!child_ctx))
 		return;
 
+	local_irq_save(flags);
+	__perf_counter_task_sched_out(child_ctx);
+	child->perf_counter_ctxp = NULL;
+	local_irq_restore(flags);
+
+	mutex_lock(&child_ctx->mutex);
+
 again:
 	list_for_each_entry_safe(child_counter, tmp, &child_ctx->counter_list,
 				 list_entry)
@@ -3371,6 +3415,10 @@ again:
 	 */
 	if (!list_empty(&child_ctx->counter_list))
 		goto again;
+
+	mutex_unlock(&child_ctx->mutex);
+
+	put_ctx(child_ctx);
 }
 
 /*
@@ -3382,19 +3430,25 @@ void perf_counter_init_task(struct task_struct *child)
 	struct perf_counter *counter;
 	struct task_struct *parent = current;
 
-	child_ctx  =  &child->perf_counter_ctx;
-	parent_ctx = &parent->perf_counter_ctx;
-
-	__perf_counter_init_context(child_ctx, child);
+	child->perf_counter_ctxp = NULL;
 
 	/*
 	 * This is executed from the parent task context, so inherit
-	 * counters that have been marked for cloning:
+	 * counters that have been marked for cloning.
+	 * First allocate and initialize a context for the child.
 	 */
 
-	if (likely(!parent_ctx->nr_counters))
+	child_ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL);
+	if (!child_ctx)
+		return;
+
+	parent_ctx = parent->perf_counter_ctxp;
+	if (likely(!parent_ctx || !parent_ctx->nr_counters))
 		return;
 
+	__perf_counter_init_context(child_ctx, child);
+	child->perf_counter_ctxp = child_ctx;
+
 	/*
 	 * Lock the parent list. No need to lock the child - not PID
 	 * hashed yet and not running, so nobody can access it.
-- 
1.6.0.4



* [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  4:17 [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Paul Mackerras
@ 2009-05-22  4:27 ` Paul Mackerras
  2009-05-22  8:16   ` Peter Zijlstra
                     ` (6 more replies)
  2009-05-22  8:06 ` [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Peter Zijlstra
  2009-05-22 10:27 ` [tip:perfcounters/core] perf_counter: Dynamically allocate tasks' perf_counter_context struct tip-bot for Paul Mackerras
  2 siblings, 7 replies; 32+ messages in thread
From: Paul Mackerras @ 2009-05-22  4:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, linux-kernel, Corey Ashford, Thomas Gleixner

When monitoring a process and its descendants with a set of inherited
counters, we can often get the situation in a context switch where
both the old (outgoing) and new (incoming) process have the same set
of counters, and their values are ultimately going to be added together.
In that situation it doesn't matter which set of counters is used to
count the activity for the new process, so there is really no need to
read the hardware counters, update the old task's counters, and then
set up the PMU for the new task.

This optimizes the context switch in this situation.  Instead of
scheduling out the perf_counter_context for the old task and
scheduling in the new context, we simply transfer the old context
to the new task and keep using it without interruption.  The new
context gets transferred to the old task.  This means that both
tasks still have a valid perf_counter_context, so no special case
is introduced when the old task gets scheduled in again, either on
this CPU or another CPU.
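
Schematically (a simplified sketch of the sched_out path added below,
with the locking and cpuctx bookkeeping elided):

	next_ctx = next->perf_counter_ctxp;
	if (next_ctx && context_equiv(ctx, next_ctx)) {
		/* leave the PMU programmed as-is; just swap ownership */
		task->perf_counter_ctxp = next_ctx;
		next->perf_counter_ctxp = ctx;
		ctx->task = next;
		next_ctx->task = task;
		return;
	}
	/* otherwise do the full sched-out of the old context as before */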

The equivalence of contexts is detected by keeping a pointer in
each cloned context pointing to the context it was cloned from.
To cope with the situation where a context is changed by adding
or removing counters after it has been cloned, we also keep a
generation number on each context which is incremented every time
a context is changed.  When a context is cloned we take a copy
of the parent's generation number, and two cloned contexts are
equivalent only if they have the same parent and the same
generation number.  In order that the parent context pointer
remains valid (and is not reused), we increment the parent
context's reference count for each context cloned from it.

Since we don't have individual fds for the counters in a cloned
context, the only thing that can make two clones of a given parent
different after they have been cloned is enabling or disabling all
counters with prctl.  To account for this, we keep a count of the
number of enabled counters in each context.  Two contexts must have
the same number of enabled counters to be considered equivalent.
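
Putting those three pieces together, the equivalence check (as added
in the patch below) is simply:

	static int context_equiv(struct perf_counter_context *ctx1,
				 struct perf_counter_context *ctx2)
	{
		return ctx1->parent_ctx && ctx1->parent_ctx == ctx2->parent_ctx
			&& ctx1->parent_gen == ctx2->parent_gen
			&& ctx1->nr_enabled == ctx2->nr_enabled;
	}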

Here are some measurements of the context switch time as measured with
the lat_ctx benchmark from lmbench, comparing the times obtained with
and without this patch series:

		-----Unmodified-----		With this patch series
Counters:	none	2 HW	4H+4S	none	2 HW	4H+4S

2 processes:
Average		3.44	6.45	11.24	3.12	3.39	3.60
St dev		0.04	0.04	0.13	0.05	0.17	0.19

8 processes:
Average		6.45	8.79	14.00	5.57	6.23	7.57
St dev		1.27	1.04	0.88	1.42	1.46	1.42

32 processes:
Average		5.56	8.43	13.78	5.28	5.55	7.15
St dev		0.41	0.47	0.53	0.54	0.57	0.81

The numbers are the mean and standard deviation of 20 runs of
lat_ctx.  The "none" columns are lat_ctx run directly without any
counters.  The "2 HW" columns are with lat_ctx run under perfstat,
counting cycles and instructions.  The "4H+4S" columns are lat_ctx run
under perfstat with 4 hardware counters and 4 software counters
(cycles, instructions, cache references, cache misses, task
clock, context switch, cpu migrations, and page faults).

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 include/linux/perf_counter.h |   12 ++++-
 kernel/perf_counter.c        |  109 ++++++++++++++++++++++++++++++++++++-----
 kernel/sched.c               |    2 +-
 3 files changed, 107 insertions(+), 16 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 0713090..4cae01a 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -513,6 +513,7 @@ struct perf_counter_context {
 	struct list_head	event_list;
 	int			nr_counters;
 	int			nr_active;
+	int			nr_enabled;
 	int			is_active;
 	atomic_t		refcount;
 	struct task_struct	*task;
@@ -522,6 +523,14 @@ struct perf_counter_context {
 	 */
 	u64			time;
 	u64			timestamp;
+
+	/*
+	 * These fields let us detect when two contexts have both
+	 * been cloned (inherited) from a common ancestor.
+	 */
+	struct perf_counter_context *parent_ctx;
+	u32			parent_gen;
+	u32			generation;
 };
 
 /**
@@ -552,7 +561,8 @@ extern int perf_max_counters;
 extern const struct pmu *hw_perf_counter_init(struct perf_counter *counter);
 
 extern void perf_counter_task_sched_in(struct task_struct *task, int cpu);
-extern void perf_counter_task_sched_out(struct task_struct *task, int cpu);
+extern void perf_counter_task_sched_out(struct task_struct *task,
+					struct task_struct *next, int cpu);
 extern void perf_counter_task_tick(struct task_struct *task, int cpu);
 extern void perf_counter_init_task(struct task_struct *child);
 extern void perf_counter_exit_task(struct task_struct *child);
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 06ea3ea..c100554 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -104,8 +104,11 @@ static void get_ctx(struct perf_counter_context *ctx)
 
 static void put_ctx(struct perf_counter_context *ctx)
 {
-	if (atomic_dec_and_test(&ctx->refcount))
+	if (atomic_dec_and_test(&ctx->refcount)) {
+		if (ctx->parent_ctx)
+			put_ctx(ctx->parent_ctx);
 		kfree(ctx);
+	}
 }
 
 static void
@@ -127,6 +130,8 @@ list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
 
 	list_add_rcu(&counter->event_entry, &ctx->event_list);
 	ctx->nr_counters++;
+	if (counter->state >= PERF_COUNTER_STATE_INACTIVE)
+		ctx->nr_enabled++;
 }
 
 /*
@@ -141,6 +146,8 @@ list_del_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
 	if (list_empty(&counter->list_entry))
 		return;
 	ctx->nr_counters--;
+	if (counter->state >= PERF_COUNTER_STATE_INACTIVE)
+		ctx->nr_enabled--;
 
 	list_del_init(&counter->list_entry);
 	list_del_rcu(&counter->event_entry);
@@ -204,6 +211,22 @@ group_sched_out(struct perf_counter *group_counter,
 }
 
 /*
+ * Mark this context as not being a clone of another.
+ * Called when counters are added to or removed from this context.
+ * We also increment our generation number so that anything that
+ * was cloned from this context before this will not match anything
+ * cloned from this context after this.
+ */
+static void unclone_ctx(struct perf_counter_context *ctx)
+{
+	++ctx->generation;
+	if (!ctx->parent_ctx)
+		return;
+	put_ctx(ctx->parent_ctx);
+	ctx->parent_ctx = NULL;
+}
+
+/*
  * Cross CPU call to remove a performance counter
  *
  * We disable the counter on the hardware level first. After that we
@@ -263,6 +286,7 @@ static void perf_counter_remove_from_context(struct perf_counter *counter)
 	struct perf_counter_context *ctx = counter->ctx;
 	struct task_struct *task = ctx->task;
 
+	unclone_ctx(ctx);
 	if (!task) {
 		/*
 		 * Per cpu counters are removed via an smp call and
@@ -378,6 +402,7 @@ static void __perf_counter_disable(void *info)
 		else
 			counter_sched_out(counter, cpuctx, ctx);
 		counter->state = PERF_COUNTER_STATE_OFF;
+		ctx->nr_enabled--;
 	}
 
 	spin_unlock_irqrestore(&ctx->lock, flags);
@@ -419,6 +444,7 @@ static void perf_counter_disable(struct perf_counter *counter)
 	if (counter->state == PERF_COUNTER_STATE_INACTIVE) {
 		update_counter_times(counter);
 		counter->state = PERF_COUNTER_STATE_OFF;
+		ctx->nr_enabled--;
 	}
 
 	spin_unlock_irq(&ctx->lock);
@@ -727,6 +753,7 @@ static void __perf_counter_enable(void *info)
 		goto unlock;
 	counter->state = PERF_COUNTER_STATE_INACTIVE;
 	counter->tstamp_enabled = ctx->time - counter->total_time_enabled;
+	ctx->nr_enabled++;
 
 	/*
 	 * If the counter is in a group and isn't the group leader,
@@ -817,6 +844,7 @@ static void perf_counter_enable(struct perf_counter *counter)
 		counter->state = PERF_COUNTER_STATE_INACTIVE;
 		counter->tstamp_enabled =
 			ctx->time - counter->total_time_enabled;
+		ctx->nr_enabled++;
 	}
  out:
 	spin_unlock_irq(&ctx->lock);
@@ -862,6 +890,25 @@ void __perf_counter_sched_out(struct perf_counter_context *ctx,
 }
 
 /*
+ * Test whether two contexts are equivalent, i.e. whether they
+ * have both been cloned from the same version of the same context
+ * and they both have the same number of enabled counters.
+ * If the number of enabled counters is the same, then the set
+ * of enabled counters should be the same, because these are both
+ * inherited contexts, therefore we can't access individual counters
+ * in them directly with an fd; we can only enable/disable all
+ * counters via prctl, or enable/disable all counters in a family
+ * via ioctl, which will have the same effect on both contexts.
+ */
+static int context_equiv(struct perf_counter_context *ctx1,
+			 struct perf_counter_context *ctx2)
+{
+	return ctx1->parent_ctx && ctx1->parent_ctx == ctx2->parent_ctx
+		&& ctx1->parent_gen == ctx2->parent_gen
+		&& ctx1->nr_enabled == ctx2->nr_enabled;
+}
+
+/*
  * Called from scheduler to remove the counters of the current task,
  * with interrupts disabled.
  *
@@ -872,10 +919,12 @@ void __perf_counter_sched_out(struct perf_counter_context *ctx,
  * accessing the counter control register. If a NMI hits, then it will
  * not restart the counter.
  */
-void perf_counter_task_sched_out(struct task_struct *task, int cpu)
+void perf_counter_task_sched_out(struct task_struct *task,
+				 struct task_struct *next, int cpu)
 {
 	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
 	struct perf_counter_context *ctx = task->perf_counter_ctxp;
+	struct perf_counter_context *next_ctx;
 	struct pt_regs *regs;
 
 	if (likely(!ctx || !cpuctx->task_ctx))
@@ -885,6 +934,16 @@ void perf_counter_task_sched_out(struct task_struct *task, int cpu)
 
 	regs = task_pt_regs(task);
 	perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0);
+
+	next_ctx = next->perf_counter_ctxp;
+	if (next_ctx && context_equiv(ctx, next_ctx)) {
+		task->perf_counter_ctxp = next_ctx;
+		next->perf_counter_ctxp = ctx;
+		ctx->task = next;
+		next_ctx->task = task;
+		return;
+	}
+
 	__perf_counter_sched_out(ctx, cpuctx);
 
 	cpuctx->task_ctx = NULL;
@@ -998,6 +1057,8 @@ void perf_counter_task_sched_in(struct task_struct *task, int cpu)
 
 	if (likely(!ctx))
 		return;
+	if (cpuctx->task_ctx == ctx)
+		return;
 	__perf_counter_sched_in(ctx, cpuctx, cpu);
 	cpuctx->task_ctx = ctx;
 }
@@ -3253,6 +3314,16 @@ inherit_counter(struct perf_counter *parent_counter,
 		return child_counter;
 
 	/*
+	 * Make the child state follow the state of the parent counter,
+	 * not its hw_event.disabled bit.  We hold the parent's mutex,
+	 * so we won't race with perf_counter_{en,dis}able_family.
+	 */
+	if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE)
+		child_counter->state = PERF_COUNTER_STATE_INACTIVE;
+	else
+		child_counter->state = PERF_COUNTER_STATE_OFF;
+
+	/*
 	 * Link it up in the child's context:
 	 */
 	add_counter_to_ctx(child_counter, child_ctx);
@@ -3277,16 +3348,6 @@ inherit_counter(struct perf_counter *parent_counter,
 	mutex_lock(&parent_counter->mutex);
 	list_add_tail(&child_counter->child_list, &parent_counter->child_list);
 
-	/*
-	 * Make the child state follow the state of the parent counter,
-	 * not its hw_event.disabled bit.  We hold the parent's mutex,
-	 * so we won't race with perf_counter_{en,dis}able_family.
-	 */
-	if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE)
-		child_counter->state = PERF_COUNTER_STATE_INACTIVE;
-	else
-		child_counter->state = PERF_COUNTER_STATE_OFF;
-
 	mutex_unlock(&parent_counter->mutex);
 
 	return child_counter;
@@ -3429,6 +3490,7 @@ void perf_counter_init_task(struct task_struct *child)
 	struct perf_counter_context *child_ctx, *parent_ctx;
 	struct perf_counter *counter;
 	struct task_struct *parent = current;
+	int inherited_all = 1;
 
 	child->perf_counter_ctxp = NULL;
 
@@ -3463,12 +3525,31 @@ void perf_counter_init_task(struct task_struct *child)
 		if (counter != counter->group_leader)
 			continue;
 
-		if (!counter->hw_event.inherit)
+		if (!counter->hw_event.inherit) {
+			inherited_all = 0;
 			continue;
+		}
 
 		if (inherit_group(counter, parent,
-				  parent_ctx, child, child_ctx))
+				  parent_ctx, child, child_ctx)) {
+			inherited_all = 0;
 			break;
+		}
+	}
+
+	if (inherited_all) {
+		/*
+		 * Mark the child context as a clone of the parent
+		 * context, or of whatever the parent is a clone of.
+		 */
+		if (parent_ctx->parent_ctx) {
+			child_ctx->parent_ctx = parent_ctx->parent_ctx;
+			child_ctx->parent_gen = parent_ctx->parent_gen;
+		} else {
+			child_ctx->parent_ctx = parent_ctx;
+			child_ctx->parent_gen = parent_ctx->generation;
+		}
+		get_ctx(child_ctx->parent_ctx);
 	}
 
 	mutex_unlock(&parent_ctx->mutex);
diff --git a/kernel/sched.c b/kernel/sched.c
index 419a39d..4c0d58b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5091,7 +5091,7 @@ need_resched_nonpreemptible:
 
 	if (likely(prev != next)) {
 		sched_info_switch(prev, next);
-		perf_counter_task_sched_out(prev, cpu);
+		perf_counter_task_sched_out(prev, next, cpu);
 
 		rq->nr_switches++;
 		rq->curr = next;
-- 
1.6.0.4



* Re: [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2]
  2009-05-22  4:17 [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Paul Mackerras
  2009-05-22  4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras
@ 2009-05-22  8:06 ` Peter Zijlstra
  2009-05-22  9:30   ` Paul Mackerras
  2009-05-22 10:27 ` [tip:perfcounters/core] perf_counter: Dynamically allocate tasks' perf_counter_context struct tip-bot for Paul Mackerras
  2 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2009-05-22  8:06 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 14:17 +1000, Paul Mackerras wrote:
> +               /*
> +                * Make sure other cpus see correct values for *ctx
> +                * once task->perf_counter_ctxp is visible to them.
> +                */
> +               smp_wmb();
> +               tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx);


Documentation/memory-barriers.txt:

> Any atomic operation that modifies some state in memory and returns information
> about the state (old or new) implies an SMP-conditional general memory barrier
> (smp_mb()) on each side of the actual operation (with the exception of
> explicit lock operations, described later).  These include:

>         cmpxchg();
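
In other words, the explicit smp_wmb() is redundant; the install could
read simply (a sketch of the simplification being suggested here, not
the committed code):

	__perf_counter_init_context(ctx, task);
	/* cmpxchg() already implies a full memory barrier on both sides */
	tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx);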






* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras
@ 2009-05-22  8:16   ` Peter Zijlstra
  2009-05-22  9:56     ` Paul Mackerras
  2009-05-22  8:32   ` Peter Zijlstra
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2009-05-22  8:16 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> Since we don't have individual fds for the counters in a cloned
> context, the only thing that can make two clones of a given parent
> different after they have been cloned is enabling or disabling all
> counters with prctl.  To account for this, we keep a count of the
> number of enabled counters in each context.  Two contexts must have
> the same number of enabled counters to be considered equivalent.

Curious point that.. so prctl() can disable counters it doesn't own.

Shouldn't we instead fix that?



* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras
  2009-05-22  8:16   ` Peter Zijlstra
@ 2009-05-22  8:32   ` Peter Zijlstra
  2009-05-22  8:57     ` Ingo Molnar
  2009-05-22  9:29     ` Paul Mackerras
  2009-05-22  9:22   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  6 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-05-22  8:32 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:

> The equivalence of contexts is detected by keeping a pointer in
> each cloned context pointing to the context it was cloned from.
> To cope with the situation where a context is changed by adding
> or removing counters after it has been cloned, we also keep a
> generation number on each context which is incremented every time
> a context is changed.  When a context is cloned we take a copy
> of the parent's generation number, and two cloned contexts are
> equivalent only if they have the same parent and the same
> generation number.  In order that the parent context pointer
> remains valid (and is not reused), we increment the parent
> context's reference count for each context cloned from it.

> +	u32			generation;

Suppose someone writes a malicious proglet that inherits the counters,
puts the child to sleep, does 2^32 mods on the counter set, and then
wakes up the child.

Would that merely corrupt the results, or make the kernel explode?



* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  8:32   ` Peter Zijlstra
@ 2009-05-22  8:57     ` Ingo Molnar
  2009-05-22  9:02       ` Peter Zijlstra
  2009-05-22  9:29     ` Paul Mackerras
  1 sibling, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2009-05-22  8:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Mackerras, linux-kernel, Corey Ashford, Thomas Gleixner


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> 
> > The equivalence of contexts is detected by keeping a pointer in
> > each cloned context pointing to the context it was cloned from.
> > To cope with the situation where a context is changed by adding
> > or removing counters after it has been cloned, we also keep a
> > generation number on each context which is incremented every time
> > a context is changed.  When a context is cloned we take a copy
> > of the parent's generation number, and two cloned contexts are
> > equivalent only if they have the same parent and the same
> > generation number.  In order that the parent context pointer
> > remains valid (and is not reused), we increment the parent
> > context's reference count for each context cloned from it.
> 
> > +	u32			generation;
> 
> Suppose someone writes a malicious proglet that inherits the 
> counters, puts the child to sleep, does 2^32 mods on the counter 
> set, and then wakes up the child.
> 
> Would that merely corrupt the results, or make the kernel explode?

i think it should be an u64.

	Ingo


* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  8:57     ` Ingo Molnar
@ 2009-05-22  9:02       ` Peter Zijlstra
  2009-05-22 10:14         ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2009-05-22  9:02 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Paul Mackerras, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 10:57 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > 
> > > The equivalence of contexts is detected by keeping a pointer in
> > > each cloned context pointing to the context it was cloned from.
> > > To cope with the situation where a context is changed by adding
> > > or removing counters after it has been cloned, we also keep a
> > > generation number on each context which is incremented every time
> > > a context is changed.  When a context is cloned we take a copy
> > > of the parent's generation number, and two cloned contexts are
> > > equivalent only if they have the same parent and the same
> > > generation number.  In order that the parent context pointer
> > > remains valid (and is not reused), we increment the parent
> > > context's reference count for each context cloned from it.
> > 
> > > +	u32			generation;
> > 
> > Suppose someone writes a malicious proglet that inherits the 
> > counters, puts the child to sleep, does 2^32 mods on the counter 
> > set, and then wakes up the child.
> > 
> > Would that merely corrupt the results, or make the kernel explode?
> 
> i think it should be an u64.

So as to make the attack infeasible by death of the attacker and burnout
of the solar-system? :-)



* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras
  2009-05-22  8:16   ` Peter Zijlstra
  2009-05-22  8:32   ` Peter Zijlstra
@ 2009-05-22  9:22   ` Peter Zijlstra
  2009-05-22  9:42     ` Peter Zijlstra
  2009-05-22 10:05     ` Paul Mackerras
  2009-05-22 10:11   ` Ingo Molnar
                     ` (3 subsequent siblings)
  6 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-05-22  9:22 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> 
>                 -----Unmodified-----            With this patch series
> Counters:       none    2 HW    4H+4S   none    2 HW    4H+4S
> 
> 2 processes:
> Average         3.44    6.45    11.24   3.12    3.39    3.60
> St dev          0.04    0.04    0.13    0.05    0.17    0.19
> 
> 8 processes:
> Average         6.45    8.79    14.00   5.57    6.23    7.57
> St dev          1.27    1.04    0.88    1.42    1.46    1.42
> 
> 32 processes:
> Average         5.56    8.43    13.78   5.28    5.55    7.15
> St dev          0.41    0.47    0.53    0.54    0.57    0.81

Any clues as to why the time is still dependent on the number of
counters in the context? The approach seems to be O(1) in that it
does a simple counter context swap on sched_out and nothing on sched_in.





* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  8:32   ` Peter Zijlstra
  2009-05-22  8:57     ` Ingo Molnar
@ 2009-05-22  9:29     ` Paul Mackerras
  1 sibling, 0 replies; 32+ messages in thread
From: Paul Mackerras @ 2009-05-22  9:29 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

Peter Zijlstra writes:

> Suppose someone writes a malicious proglet that inherits the counters,
> puts the child to sleep, does 2^32 mods on the counter set, and then
> wakes up the child.
> 
> Would that merely corrupt the results, or make the kernel explode?

You'd have to do something like create some counters on yourself,
fork, do the 2^32 counter creations and deletions, then do another
fork.  All that would happen is that some of the counters would count
on one of the child processes when they should be counting on the
other.

But it would be easy and cheap to make the generation count be 64
bits, and then it won't overflow in my lifetime at least, and after
that I don't care. :)  So I agree with Ingo that we should just do
that.

Paul.


* Re: [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2]
  2009-05-22  8:06 ` [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Peter Zijlstra
@ 2009-05-22  9:30   ` Paul Mackerras
  0 siblings, 0 replies; 32+ messages in thread
From: Paul Mackerras @ 2009-05-22  9:30 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

Peter Zijlstra writes:

> On Fri, 2009-05-22 at 14:17 +1000, Paul Mackerras wrote:
> > +               /*
> > +                * Make sure other cpus see correct values for *ctx
> > +                * once task->perf_counter_ctxp is visible to them.
> > +                */
> > +               smp_wmb();
> > +               tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx);
> 
> 
> Documentation/memory-barriers.txt:
> 
> > Any atomic operation that modifies some state in memory and returns information
> > about the state (old or new) implies an SMP-conditional general memory barrier
> > (smp_mb()) on each side of the actual operation (with the exception of
> > explicit lock operations, described later).  These include:
> 
> >         cmpxchg();

OK, fair enough, I'll take it out.

Paul.


* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  9:22   ` Peter Zijlstra
@ 2009-05-22  9:42     ` Peter Zijlstra
  2009-05-22 10:07       ` Paul Mackerras
  2009-05-22 10:05     ` Paul Mackerras
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2009-05-22  9:42 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 11:22 +0200, Peter Zijlstra wrote:
> On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > 
> >                 -----Unmodified-----            With this patch series
> > Counters:       none    2 HW    4H+4S   none    2 HW    4H+4S
> > 
> > 2 processes:
> > Average         3.44    6.45    11.24   3.12    3.39    3.60
> > St dev          0.04    0.04    0.13    0.05    0.17    0.19
> > 
> > 8 processes:
> > Average         6.45    8.79    14.00   5.57    6.23    7.57
> > St dev          1.27    1.04    0.88    1.42    1.46    1.42
> > 
> > 32 processes:
> > Average         5.56    8.43    13.78   5.28    5.55    7.15
> > St dev          0.41    0.47    0.53    0.54    0.57    0.81
> 
> Any clues as to why the time is still dependent on the number of
> counters in the context? The approach seems to be O(1) in that it
> does a simple counter context swap on sched_out and nothing on sched_in.

Ah, should I read this as 'n' lmbench instances on a single cpu? So that
we get multiple inheritance sets mixed and have to switch between them?



* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  8:16   ` Peter Zijlstra
@ 2009-05-22  9:56     ` Paul Mackerras
  2009-05-22 10:08       ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2009-05-22  9:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

Peter Zijlstra writes:

> On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > Since we don't have individual fds for the counters in a cloned
> > context, the only thing that can make two clones of a given parent
> > different after they have been cloned is enabling or disabling all
> > counters with prctl.  To account for this, we keep a count of the
> > number of enabled counters in each context.  Two contexts must have
> > the same number of enabled counters to be considered equivalent.
> 
> Curious point that.. so prctl() can disable counters it doesn't own.
> 
> Shouldn't we instead fix that?

Well, prctl enables/disables the counters that are counting on the
current process, regardless of who or what created them.  I always
thought that was a little strange; maybe it is useful to be able to
disable all the counters that any other process might have put on to
you, but I can't think of a scenario where you'd really need to do
that, particularly since the disable is a one-shot operation, and
doesn't prevent new (enabled) counters being attached to you.

On the other hand, what does "all the counters I own" mean?  Does it
mean all the ones that I have fds open for?  Or does it mean all the
ones that I created?  Either way we don't have a good way to enumerate
them.

Paul.


* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  9:22   ` Peter Zijlstra
  2009-05-22  9:42     ` Peter Zijlstra
@ 2009-05-22 10:05     ` Paul Mackerras
  1 sibling, 0 replies; 32+ messages in thread
From: Paul Mackerras @ 2009-05-22 10:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

Peter Zijlstra writes:

> On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > 
> >                 -----Unmodified-----            With this patch series
> > Counters:       none    2 HW    4H+4S   none    2 HW    4H+4S
> > 
> > 2 processes:
> > Average         3.44    6.45    11.24   3.12    3.39    3.60
> > St dev          0.04    0.04    0.13    0.05    0.17    0.19
> > 
> > 8 processes:
> > Average         6.45    8.79    14.00   5.57    6.23    7.57
> > St dev          1.27    1.04    0.88    1.42    1.46    1.42
> > 
> > 32 processes:
> > Average         5.56    8.43    13.78   5.28    5.55    7.15
> > St dev          0.41    0.47    0.53    0.54    0.57    0.81
> 
> Any clues as to why the time is still dependent on the number of
> counters in the context? The approach seems to be O(1) in that it
> does a simple counter context swap on sched_out and nothing on sched_in.

Only the context switches between the lat_ctx processes will be
optimized; switches between them and other processes will still do the
full PMU switch.  The CPU would be switching to other processes
from time to time during the run (e.g. to run various kernel daemons,
which seem to continue to proliferate) so some fraction of the context
switches would be as expensive as before.  Probably I should measure
what that fraction is, but it's Friday night and I'm feeling lazy. :)

Paul.


* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  9:42     ` Peter Zijlstra
@ 2009-05-22 10:07       ` Paul Mackerras
  0 siblings, 0 replies; 32+ messages in thread
From: Paul Mackerras @ 2009-05-22 10:07 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

Peter Zijlstra writes:

> Ah, should I read this as 'n' lmbench instances on a single cpu? So that
> we get multiple inheritance sets mixed and have to switch between them?

No, it's one lat_ctx instance, which creates N processes connected in
a ring with pipes, and passes a token around some number of times and
measures the overall time to do that.  It subtracts off the time it
takes to do all the pipe reads and writes in a single process to get
an estimate of how much the context switches are costing.
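
For illustration, a toy version of that pipe ring (not the lmbench
code itself, just a self-contained sketch with made-up NPROC/ROUNDS
values) looks like:

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

#define NPROC	4		/* processes in the ring */
#define ROUNDS	10000		/* laps of the token */

int main(void)
{
	int pfd[NPROC][2];	/* pipe i carries the token towards process i */
	int i, j, r;
	char tok = 'x';

	for (i = 0; i < NPROC; i++)
		if (pipe(pfd[i]) < 0) {
			perror("pipe");
			return 1;
		}

	for (i = 1; i < NPROC; i++) {
		if (fork() == 0) {
			/* child i keeps only pipe i-1's read end and pipe i's write end */
			for (j = 0; j < NPROC; j++) {
				if (j != i - 1)
					close(pfd[j][0]);
				if (j != i)
					close(pfd[j][1]);
			}
			for (;;) {
				if (read(pfd[i - 1][0], &tok, 1) != 1)
					_exit(0);	/* EOF: ring is being torn down */
				write(pfd[i][1], &tok, 1);
			}
		}
	}

	/* parent keeps only pipe 0's write end and the last pipe's read end */
	for (j = 0; j < NPROC; j++) {
		if (j != NPROC - 1)
			close(pfd[j][0]);
		if (j != 0)
			close(pfd[j][1]);
	}

	/* each lap forces a context switch into every process in the ring */
	for (r = 0; r < ROUNDS; r++) {
		write(pfd[0][1], &tok, 1);
		read(pfd[NPROC - 1][0], &tok, 1);
	}

	close(pfd[0][1]);	/* EOF cascades around the ring */
	for (i = 1; i < NPROC; i++)
		wait(NULL);
	return 0;
}

Timing the token-passing loop (and subtracting the cost of the same
reads and writes done in a single process) gives the per-switch
estimate that lat_ctx reports.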

Paul.


* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  9:56     ` Paul Mackerras
@ 2009-05-22 10:08       ` Peter Zijlstra
  2009-05-23 12:38         ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2009-05-22 10:08 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 19:56 +1000, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > > Since we don't have individual fds for the counters in a cloned
> > > context, the only thing that can make two clones of a given parent
> > > different after they have been cloned is enabling or disabling all
> > > counters with prctl.  To account for this, we keep a count of the
> > > number of enabled counters in each context.  Two contexts must have
> > > the same number of enabled counters to be considered equivalent.
> > 
> > Curious point that.. so prctl() can disable counters it doesn't own.
> > 
> > Shouldn't we instead fix that?
> 
> Well, prctl enables/disables the counters that are counting on the
> current process, regardless of who or what created them.  I always
> thought that was a little strange; maybe it is useful to be able to
> disable all the counters that any other process might have put on to
> you, but I can't think of a scenario where you'd really need to do
> that, particularly since the disable is a one-shot operation, and
> doesn't prevent new (enabled) counters being attached to you.
> 
> On the other hand, what does "all the counters I own" mean?  Does it
> mean all the ones that I have fds open for?  Or does it mean all the
> ones that I created?  Either way we don't have a good way to enumerate
> them.

I'm for all counters you created (ie have a fd for). Being able to
disable counters others created on you just sounds wrong.

If we can settle on a semantic, I'm sure we can implement it :-)

Ingo, Corey, any opinions?



* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras
                     ` (2 preceding siblings ...)
  2009-05-22  9:22   ` Peter Zijlstra
@ 2009-05-22 10:11   ` Ingo Molnar
  2009-05-22 10:27   ` [tip:perfcounters/core] perf_counter: Optimize " tip-bot for Paul Mackerras
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2009-05-22 10:11 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Peter Zijlstra, linux-kernel, Corey Ashford, Thomas Gleixner


* Paul Mackerras <paulus@samba.org> wrote:

> When monitoring a process and its descendants with a set of 
> inherited counters, we can often get the situation in a context 
> switch where both the old (outgoing) and new (incoming) process 
> have the same set of counters, and their values are ultimately 
> going to be added together. In that situation it doesn't matter 
> which set of counters are used to count the activity for the new 
> process, so there is really no need to go through the process of 
> reading the hardware counters and updating the old task's counters 
> and then setting up the PMU for the new task.
> 
> This optimizes the context switch in this situation.  Instead of 
> scheduling out the perf_counter_context for the old task and 
> scheduling in the new context, we simply transfer the old context 
> to the new task and keep using it without interruption.  The new 
> context gets transferred to the old task.  This means that both 
> tasks still have a valid perf_counter_context, so no special case 
> is introduced when the old task gets scheduled in again, either on 
> this CPU or another CPU.
> 
> The equivalence of contexts is detected by keeping a pointer in 
> each cloned context pointing to the context it was cloned from. To 
> cope with the situation where a context is changed by adding or 
> removing counters after it has been cloned, we also keep a 
> generation number on each context which is incremented every time 
> a context is changed.  When a context is cloned we take a copy of 
> the parent's generation number, and two cloned contexts are 
> equivalent only if they have the same parent and the same 
> generation number.  In order that the parent context pointer 
> remains valid (and is not reused), we increment the parent 
> context's reference count for each context cloned from it.
> 
> Since we don't have individual fds for the counters in a cloned
> context, the only thing that can make two clones of a given parent
> different after they have been cloned is enabling or disabling all
> counters with prctl.  To account for this, we keep a count of the
> number of enabled counters in each context.  Two contexts must have
> the same number of enabled counters to be considered equivalent.
> 
> Here are some measurements of the context switch time as measured with
> the lat_ctx benchmark from lmbench, comparing the times obtained with
> and without this patch series:
> 
> 		        -----Unmodified-----	With this patch series
> Counters:	        none	2 HW	4H+4S	none	2 HW	4H+4S
>
> 2 processes:
> Average		3.44	6.45	11.24	3.12	3.39	3.60
> St dev		0.04	0.04	0.13	0.05	0.17	0.19
>
> 8 processes:
> Average		6.45	8.79	14.00	5.57	6.23	7.57
> St dev		1.27	1.04	0.88	1.42	1.46	1.42
> 
> 32 processes:
> Average		5.56	8.43	13.78	5.28	5.55	7.15
> St dev		0.41	0.47	0.53	0.54	0.57	0.81
> 
> The numbers are the mean and standard deviation of 20 runs of 
> lat_ctx.  The "none" columns are lat_ctx run directly without any 
> counters.  The "2 HW" columns are with lat_ctx run under perfstat, 
> counting cycles and instructions.  The "4H+4S" columns are lat_ctx 
> run under perfstat with 4 hardware counters and 4 software 
> counters (cycles, instructions, cache references, cache misses, 
> task clock, context switch, cpu migrations, and page faults).
> 
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
>  include/linux/perf_counter.h |   12 ++++-
>  kernel/perf_counter.c        |  109 ++++++++++++++++++++++++++++++++++++-----
>  kernel/sched.c               |    2 +-
>  3 files changed, 107 insertions(+), 16 deletions(-)

Impressive!

I'm wondering where the sensitivity of lat_ctx to the number of 
counters comes from. I'd expect there to be constant (and very low) 
overhead. It could be measurement noise - lat_ctx is very sensitive 
to L2 layout and memory allocation patterns - those are very hard to 
eliminate and are not measured via the stddev numbers. (Many of 
those effects are per-bootup specific, and bootups don't randomize 
them - so there's no easy way to measure their statistical impact.)

	Ingo
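
To make the swap discussed above concrete, here is a compact userspace
model of the equivalence check and the context exchange.  Field and
function names follow the patch, but the locking and reference counting
the real code needs are left out, so treat it strictly as a sketch:

/*
 * Toy model of the lazy switch: if the outgoing and incoming tasks'
 * contexts were cloned from the same parent at the same generation
 * and have the same number of enabled counters, swap the contexts
 * instead of touching the PMU.
 */
#include <stdbool.h>
#include <stdio.h>

struct task;

struct ctx {
	struct ctx	*parent_ctx;	/* context we were cloned from */
	unsigned long	parent_gen;	/* parent's generation at clone time */
	unsigned long	generation;	/* bumped when counters are added/removed */
	int		nr_enabled;	/* number of enabled counters */
	struct task	*task;		/* back-pointer to the owning task */
};

struct task {
	const char	*name;
	struct ctx	*ctx;
};

static bool context_equiv(const struct ctx *a, const struct ctx *b)
{
	return a->parent_ctx && a->parent_ctx == b->parent_ctx &&
	       a->parent_gen == b->parent_gen &&
	       a->nr_enabled == b->nr_enabled;
}

/* Called when switching from @prev to @next. */
static void sched_out(struct task *prev, struct task *next)
{
	struct ctx *ctx = prev->ctx, *next_ctx = next->ctx;

	if (ctx && next_ctx && context_equiv(ctx, next_ctx)) {
		/* Lazy path: exchange contexts and fix the back-pointers;
		 * the hardware counters keep running untouched. */
		prev->ctx = next_ctx;
		next->ctx = ctx;
		ctx->task = next;
		next_ctx->task = prev;
		return;
	}
	printf("full PMU switch: %s -> %s\n", prev->name, next->name);
}

int main(void)
{
	struct ctx parent = { .generation = 7 };
	struct ctx ca = { .parent_ctx = &parent, .parent_gen = 7, .nr_enabled = 2 };
	struct ctx cb = { .parent_ctx = &parent, .parent_gen = 7, .nr_enabled = 2 };
	struct task a = { "A", &ca }, b = { "B", &cb };

	ca.task = &a;
	cb.task = &b;

	sched_out(&a, &b);
	printf("A now carries B's old context: %s\n",
	       a.ctx == &cb ? "yes" : "no");
	return 0;
}

The point of fixing up both ctx->task pointers as well as both
task->ctx pointers is that either task can later be scheduled in, on
any CPU, without a special case.
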

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  9:02       ` Peter Zijlstra
@ 2009-05-22 10:14         ` Ingo Molnar
  0 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2009-05-22 10:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Mackerras, linux-kernel, Corey Ashford, Thomas Gleixner


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2009-05-22 at 10:57 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > 
> > > On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > > 
> > > > The equivalence of contexts is detected by keeping a pointer in
> > > > each cloned context pointing to the context it was cloned from.
> > > > To cope with the situation where a context is changed by adding
> > > > or removing counters after it has been cloned, we also keep a
> > > > generation number on each context which is incremented every time
> > > > a context is changed.  When a context is cloned we take a copy
> > > > of the parent's generation number, and two cloned contexts are
> > > > equivalent only if they have the same parent and the same
> > > > generation number.  In order that the parent context pointer
> > > > remains valid (and is not reused), we increment the parent
> > > > context's reference count for each context cloned from it.
> > > 
> > > > +	u32			generation;
> > > 
> > > Suppose someone writes a malicious proglet that inherits the 
> > > counters, puts the child to sleep, does 2^32 mods on the counter 
> > > set, and then wakes up the child.
> > > 
> > > Would that merely corrupt the results, or make the kernel explode?
> > 
> > i think it should be an u64.
> 
> So at to make the attack infeasible by death of the attacker and 
> burnout of the solar-system? :-)

Yeah. We only have to make it for the next 1 billion years anyway - 
then due to the sun being ~50% brighter the planet as we know it 
will be scorched up anyway. (Unless some entity puts it under 
massive space based shades until then, to preserve the ancient 
biosphere for nostalgic or museologic reasons or so.)

	Ingo
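
For a rough sense of the numbers behind the u32-versus-u64 exchange
above - the 10 million modifications per second below is an assumed
attack rate, not a measurement:

#include <stdio.h>

int main(void)
{
	const double rate = 1e7;			/* assumed mods per second */
	double secs32 = 4294967296.0 / rate;		/* 2^32 wraps */
	double secs64 = 18446744073709551616.0 / rate;	/* 2^64 wraps */

	printf("u32 wrap: ~%.0f seconds (~%.1f minutes)\n",
	       secs32, secs32 / 60.0);
	printf("u64 wrap: ~%.0f years\n",
	       secs64 / (3600.0 * 24.0 * 365.0));
	return 0;
}

At that rate a 32-bit generation wraps in minutes, so the scenario
Peter describes is within reach; a 64-bit one takes tens of thousands
of years even before any rate limiting.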

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [tip:perfcounters/core] perf_counter: Dynamically allocate tasks' perf_counter_context struct
  2009-05-22  4:17 [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Paul Mackerras
  2009-05-22  4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras
  2009-05-22  8:06 ` [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Peter Zijlstra
@ 2009-05-22 10:27 ` tip-bot for Paul Mackerras
  2 siblings, 0 replies; 32+ messages in thread
From: tip-bot for Paul Mackerras @ 2009-05-22 10:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, acme, paulus, hpa, mingo, a.p.zijlstra, mtosatti,
	tglx, cjashfor, mingo

Commit-ID:  a63eaf34ae60bdb067a354cc8def2e8f4a01f5f4
Gitweb:     http://git.kernel.org/tip/a63eaf34ae60bdb067a354cc8def2e8f4a01f5f4
Author:     Paul Mackerras <paulus@samba.org>
AuthorDate: Fri, 22 May 2009 14:17:31 +1000
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 22 May 2009 12:18:19 +0200

perf_counter: Dynamically allocate tasks' perf_counter_context struct

This replaces the struct perf_counter_context in the task_struct with
a pointer to a dynamically allocated perf_counter_context struct.  The
main reason for doing this is to allow us to transfer a
perf_counter_context from one task to another when we do lazy PMU
switching in a later patch.

This has a few side-benefits: the task_struct becomes a little smaller,
we save some memory because only tasks that have perf_counters attached
get a perf_counter_context allocated for them, and we can remove the
inclusion of <linux/perf_counter.h> in sched.h, meaning that we don't
end up recompiling nearly everything whenever perf_counter.h changes.

The perf_counter_context structures are reference-counted and freed
when the last reference is dropped.  A context can have references
from its task and the counters on its task.  Counters can outlive the
task so it is possible that a context will be freed well after its
task has exited.

Contexts are allocated on fork if the parent had a context, or
otherwise the first time that a per-task counter is created on a task.
In the latter case, we set the context pointer in the task struct
locklessly using an atomic compare-and-exchange operation in case we
raced with some other task in creating a context for the subject task.

This also removes the task pointer from the perf_counter struct.  The
task pointer was not used anywhere and would make it harder to move a
context from one task to another.  Anything that needed to know which
task a counter was attached to was already using counter->ctx->task.

The __perf_counter_init_context function moves up in perf_counter.c
so that it can be called from find_get_context, and now initializes
the refcount, but is otherwise unchanged.

We were potentially calling list_del_counter twice: once from
__perf_counter_exit_task when the task exits and once from
__perf_counter_remove_from_context when the counter's fd gets closed.
This adds a check in list_del_counter so it doesn't do anything if
the counter has already been removed from the lists.

Since perf_counter_task_sched_in doesn't do anything if the task doesn't
have a context, and leaves cpuctx->task_ctx = NULL, this adds code to
__perf_install_in_context to set cpuctx->task_ctx if necessary, i.e. in
the case where the current task adds the first counter to itself and
thus creates a context for itself.

This also adds similar code to __perf_counter_enable to handle a
similar situation which can arise when the counters have been disabled
using prctl; that also leaves cpuctx->task_ctx = NULL.

[ Impact: refactor counter context management to prepare for new feature ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <18966.10075.781053.231153@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 arch/x86/kernel/apic/apic.c  |    1 +
 include/linux/init_task.h    |   13 ---
 include/linux/perf_counter.h |    4 +-
 include/linux/sched.h        |    6 +-
 kernel/exit.c                |    3 +-
 kernel/fork.c                |    1 +
 kernel/perf_counter.c        |  218 ++++++++++++++++++++++++++----------------
 7 files changed, 145 insertions(+), 101 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index e9021a9..b4f6440 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -14,6 +14,7 @@
  *	Mikael Pettersson	:	PM converted to driver model.
  */
 
+#include <linux/perf_counter.h>
 #include <linux/kernel_stat.h>
 #include <linux/mc146818rtc.h>
 #include <linux/acpi_pmtmr.h>
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 503afaa..d87247d 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -108,18 +108,6 @@ extern struct group_info init_groups;
 
 extern struct cred init_cred;
 
-#ifdef CONFIG_PERF_COUNTERS
-# define INIT_PERF_COUNTERS(tsk)					\
-	.perf_counter_ctx.counter_list =				\
-		LIST_HEAD_INIT(tsk.perf_counter_ctx.counter_list),	\
-	.perf_counter_ctx.event_list =					\
-		LIST_HEAD_INIT(tsk.perf_counter_ctx.event_list),	\
-	.perf_counter_ctx.lock =					\
-		__SPIN_LOCK_UNLOCKED(tsk.perf_counter_ctx.lock),
-#else
-# define INIT_PERF_COUNTERS(tsk)
-#endif
-
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -183,7 +171,6 @@ extern struct cred init_cred;
 	},								\
 	.dirties = INIT_PROP_LOCAL_SINGLE(dirties),			\
 	INIT_IDS							\
-	INIT_PERF_COUNTERS(tsk)						\
 	INIT_TRACE_IRQFLAGS						\
 	INIT_LOCKDEP							\
 	INIT_FTRACE_GRAPH						\
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index f612941..0713090 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -449,7 +449,6 @@ struct perf_counter {
 	struct hw_perf_counter		hw;
 
 	struct perf_counter_context	*ctx;
-	struct task_struct		*task;
 	struct file			*filp;
 
 	struct perf_counter		*parent;
@@ -498,7 +497,6 @@ struct perf_counter {
  * Used as a container for task counters and CPU counters as well:
  */
 struct perf_counter_context {
-#ifdef CONFIG_PERF_COUNTERS
 	/*
 	 * Protect the states of the counters in the list,
 	 * nr_active, and the list:
@@ -516,6 +514,7 @@ struct perf_counter_context {
 	int			nr_counters;
 	int			nr_active;
 	int			is_active;
+	atomic_t		refcount;
 	struct task_struct	*task;
 
 	/*
@@ -523,7 +522,6 @@ struct perf_counter_context {
 	 */
 	u64			time;
 	u64			timestamp;
-#endif
 };
 
 /**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ff59d12..9714d45 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -71,7 +71,6 @@ struct sched_param {
 #include <linux/path.h>
 #include <linux/compiler.h>
 #include <linux/completion.h>
-#include <linux/perf_counter.h>
 #include <linux/pid.h>
 #include <linux/percpu.h>
 #include <linux/topology.h>
@@ -99,6 +98,7 @@ struct robust_list_head;
 struct bio;
 struct bts_tracer;
 struct fs_struct;
+struct perf_counter_context;
 
 /*
  * List of flags we want to share for kernel threads,
@@ -1387,7 +1387,9 @@ struct task_struct {
 	struct list_head pi_state_list;
 	struct futex_pi_state *pi_state_cache;
 #endif
-	struct perf_counter_context perf_counter_ctx;
+#ifdef CONFIG_PERF_COUNTERS
+	struct perf_counter_context *perf_counter_ctxp;
+#endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *mempolicy;
 	short il_next;
diff --git a/kernel/exit.c b/kernel/exit.c
index f9dfedd..99ad406 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -48,6 +48,7 @@
 #include <linux/tracehook.h>
 #include <linux/fs_struct.h>
 #include <linux/init_task.h>
+#include <linux/perf_counter.h>
 #include <trace/sched.h>
 
 #include <asm/uaccess.h>
@@ -159,7 +160,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
 
 #ifdef CONFIG_PERF_COUNTERS
-	WARN_ON_ONCE(!list_empty(&tsk->perf_counter_ctx.counter_list));
+	WARN_ON_ONCE(tsk->perf_counter_ctxp);
 #endif
 	trace_sched_process_free(tsk);
 	put_task_struct(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index d32fef4..e72a09f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -63,6 +63,7 @@
 #include <linux/fs_struct.h>
 #include <trace/sched.h>
 #include <linux/magic.h>
+#include <linux/perf_counter.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 08584c1..06ea3ea 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -97,6 +97,17 @@ void perf_enable(void)
 		hw_perf_enable();
 }
 
+static void get_ctx(struct perf_counter_context *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+static void put_ctx(struct perf_counter_context *ctx)
+{
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+}
+
 static void
 list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
 {
@@ -118,11 +129,17 @@ list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
 	ctx->nr_counters++;
 }
 
+/*
+ * Remove a counter from the lists for its context.
+ * Must be called with counter->mutex and ctx->mutex held.
+ */
 static void
 list_del_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
 {
 	struct perf_counter *sibling, *tmp;
 
+	if (list_empty(&counter->list_entry))
+		return;
 	ctx->nr_counters--;
 
 	list_del_init(&counter->list_entry);
@@ -216,8 +233,6 @@ static void __perf_counter_remove_from_context(void *info)
 
 	counter_sched_out(counter, cpuctx, ctx);
 
-	counter->task = NULL;
-
 	list_del_counter(counter, ctx);
 
 	if (!ctx->task) {
@@ -279,7 +294,6 @@ retry:
 	 */
 	if (!list_empty(&counter->list_entry)) {
 		list_del_counter(counter, ctx);
-		counter->task = NULL;
 	}
 	spin_unlock_irq(&ctx->lock);
 }
@@ -568,11 +582,17 @@ static void __perf_install_in_context(void *info)
 	 * If this is a task context, we need to check whether it is
 	 * the current task context of this cpu. If not it has been
 	 * scheduled out before the smp call arrived.
+	 * Or possibly this is the right context but it isn't
+	 * on this cpu because it had no counters.
 	 */
-	if (ctx->task && cpuctx->task_ctx != ctx)
-		return;
+	if (ctx->task && cpuctx->task_ctx != ctx) {
+		if (cpuctx->task_ctx || ctx->task != current)
+			return;
+		cpuctx->task_ctx = ctx;
+	}
 
 	spin_lock_irqsave(&ctx->lock, flags);
+	ctx->is_active = 1;
 	update_context_time(ctx);
 
 	/*
@@ -653,7 +673,6 @@ perf_install_in_context(struct perf_counter_context *ctx,
 		return;
 	}
 
-	counter->task = task;
 retry:
 	task_oncpu_function_call(task, __perf_install_in_context,
 				 counter);
@@ -693,10 +712,14 @@ static void __perf_counter_enable(void *info)
 	 * If this is a per-task counter, need to check whether this
 	 * counter's task is the current task on this cpu.
 	 */
-	if (ctx->task && cpuctx->task_ctx != ctx)
-		return;
+	if (ctx->task && cpuctx->task_ctx != ctx) {
+		if (cpuctx->task_ctx || ctx->task != current)
+			return;
+		cpuctx->task_ctx = ctx;
+	}
 
 	spin_lock_irqsave(&ctx->lock, flags);
+	ctx->is_active = 1;
 	update_context_time(ctx);
 
 	counter->prev_state = counter->state;
@@ -852,10 +875,10 @@ void __perf_counter_sched_out(struct perf_counter_context *ctx,
 void perf_counter_task_sched_out(struct task_struct *task, int cpu)
 {
 	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
-	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+	struct perf_counter_context *ctx = task->perf_counter_ctxp;
 	struct pt_regs *regs;
 
-	if (likely(!cpuctx->task_ctx))
+	if (likely(!ctx || !cpuctx->task_ctx))
 		return;
 
 	update_context_time(ctx);
@@ -871,6 +894,8 @@ static void __perf_counter_task_sched_out(struct perf_counter_context *ctx)
 {
 	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
 
+	if (!cpuctx->task_ctx)
+		return;
 	__perf_counter_sched_out(ctx, cpuctx);
 	cpuctx->task_ctx = NULL;
 }
@@ -969,8 +994,10 @@ __perf_counter_sched_in(struct perf_counter_context *ctx,
 void perf_counter_task_sched_in(struct task_struct *task, int cpu)
 {
 	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
-	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+	struct perf_counter_context *ctx = task->perf_counter_ctxp;
 
+	if (likely(!ctx))
+		return;
 	__perf_counter_sched_in(ctx, cpuctx, cpu);
 	cpuctx->task_ctx = ctx;
 }
@@ -985,11 +1012,11 @@ static void perf_counter_cpu_sched_in(struct perf_cpu_context *cpuctx, int cpu)
 int perf_counter_task_disable(void)
 {
 	struct task_struct *curr = current;
-	struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+	struct perf_counter_context *ctx = curr->perf_counter_ctxp;
 	struct perf_counter *counter;
 	unsigned long flags;
 
-	if (likely(!ctx->nr_counters))
+	if (!ctx || !ctx->nr_counters)
 		return 0;
 
 	local_irq_save(flags);
@@ -1020,12 +1047,12 @@ int perf_counter_task_disable(void)
 int perf_counter_task_enable(void)
 {
 	struct task_struct *curr = current;
-	struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+	struct perf_counter_context *ctx = curr->perf_counter_ctxp;
 	struct perf_counter *counter;
 	unsigned long flags;
 	int cpu;
 
-	if (likely(!ctx->nr_counters))
+	if (!ctx || !ctx->nr_counters)
 		return 0;
 
 	local_irq_save(flags);
@@ -1128,19 +1155,23 @@ void perf_counter_task_tick(struct task_struct *curr, int cpu)
 		return;
 
 	cpuctx = &per_cpu(perf_cpu_context, cpu);
-	ctx = &curr->perf_counter_ctx;
+	ctx = curr->perf_counter_ctxp;
 
 	perf_adjust_freq(&cpuctx->ctx);
-	perf_adjust_freq(ctx);
+	if (ctx)
+		perf_adjust_freq(ctx);
 
 	perf_counter_cpu_sched_out(cpuctx);
-	__perf_counter_task_sched_out(ctx);
+	if (ctx)
+		__perf_counter_task_sched_out(ctx);
 
 	rotate_ctx(&cpuctx->ctx);
-	rotate_ctx(ctx);
+	if (ctx)
+		rotate_ctx(ctx);
 
 	perf_counter_cpu_sched_in(cpuctx, cpu);
-	perf_counter_task_sched_in(curr, cpu);
+	if (ctx)
+		perf_counter_task_sched_in(curr, cpu);
 }
 
 /*
@@ -1176,6 +1207,22 @@ static u64 perf_counter_read(struct perf_counter *counter)
 	return atomic64_read(&counter->count);
 }
 
+/*
+ * Initialize the perf_counter context in a task_struct:
+ */
+static void
+__perf_counter_init_context(struct perf_counter_context *ctx,
+			    struct task_struct *task)
+{
+	memset(ctx, 0, sizeof(*ctx));
+	spin_lock_init(&ctx->lock);
+	mutex_init(&ctx->mutex);
+	INIT_LIST_HEAD(&ctx->counter_list);
+	INIT_LIST_HEAD(&ctx->event_list);
+	atomic_set(&ctx->refcount, 1);
+	ctx->task = task;
+}
+
 static void put_context(struct perf_counter_context *ctx)
 {
 	if (ctx->task)
@@ -1186,6 +1233,7 @@ static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
 {
 	struct perf_cpu_context *cpuctx;
 	struct perf_counter_context *ctx;
+	struct perf_counter_context *tctx;
 	struct task_struct *task;
 
 	/*
@@ -1225,15 +1273,36 @@ static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
 	if (!task)
 		return ERR_PTR(-ESRCH);
 
-	ctx = &task->perf_counter_ctx;
-	ctx->task = task;
-
 	/* Reuse ptrace permission checks for now. */
 	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
-		put_context(ctx);
+		put_task_struct(task);
 		return ERR_PTR(-EACCES);
 	}
 
+	ctx = task->perf_counter_ctxp;
+	if (!ctx) {
+		ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL);
+		if (!ctx) {
+			put_task_struct(task);
+			return ERR_PTR(-ENOMEM);
+		}
+		__perf_counter_init_context(ctx, task);
+		/*
+		 * Make sure other cpus see correct values for *ctx
+		 * once task->perf_counter_ctxp is visible to them.
+		 */
+		smp_wmb();
+		tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx);
+		if (tctx) {
+			/*
+			 * We raced with some other task; use
+			 * the context they set.
+			 */
+			kfree(ctx);
+			ctx = tctx;
+		}
+	}
+
 	return ctx;
 }
 
@@ -1242,6 +1311,7 @@ static void free_counter_rcu(struct rcu_head *head)
 	struct perf_counter *counter;
 
 	counter = container_of(head, struct perf_counter, rcu_head);
+	put_ctx(counter->ctx);
 	kfree(counter);
 }
 
@@ -2247,7 +2317,7 @@ static void perf_counter_comm_event(struct perf_comm_event *comm_event)
 	perf_counter_comm_ctx(&cpuctx->ctx, comm_event);
 	put_cpu_var(perf_cpu_context);
 
-	perf_counter_comm_ctx(&current->perf_counter_ctx, comm_event);
+	perf_counter_comm_ctx(current->perf_counter_ctxp, comm_event);
 }
 
 void perf_counter_comm(struct task_struct *task)
@@ -2256,7 +2326,9 @@ void perf_counter_comm(struct task_struct *task)
 
 	if (!atomic_read(&nr_comm_tracking))
 		return;
-       
+	if (!current->perf_counter_ctxp)
+		return;
+
 	comm_event = (struct perf_comm_event){
 		.task	= task,
 		.event  = {
@@ -2372,7 +2444,7 @@ got_name:
 	perf_counter_mmap_ctx(&cpuctx->ctx, mmap_event);
 	put_cpu_var(perf_cpu_context);
 
-	perf_counter_mmap_ctx(&current->perf_counter_ctx, mmap_event);
+	perf_counter_mmap_ctx(current->perf_counter_ctxp, mmap_event);
 
 	kfree(buf);
 }
@@ -2384,6 +2456,8 @@ void perf_counter_mmap(unsigned long addr, unsigned long len,
 
 	if (!atomic_read(&nr_mmap_tracking))
 		return;
+	if (!current->perf_counter_ctxp)
+		return;
 
 	mmap_event = (struct perf_mmap_event){
 		.file   = file,
@@ -2985,6 +3059,7 @@ perf_counter_alloc(struct perf_counter_hw_event *hw_event,
 	counter->group_leader		= group_leader;
 	counter->pmu			= NULL;
 	counter->ctx			= ctx;
+	get_ctx(ctx);
 
 	counter->state = PERF_COUNTER_STATE_INACTIVE;
 	if (hw_event->disabled)
@@ -3150,21 +3225,6 @@ err_put_context:
 }
 
 /*
- * Initialize the perf_counter context in a task_struct:
- */
-static void
-__perf_counter_init_context(struct perf_counter_context *ctx,
-			    struct task_struct *task)
-{
-	memset(ctx, 0, sizeof(*ctx));
-	spin_lock_init(&ctx->lock);
-	mutex_init(&ctx->mutex);
-	INIT_LIST_HEAD(&ctx->counter_list);
-	INIT_LIST_HEAD(&ctx->event_list);
-	ctx->task = task;
-}
-
-/*
  * inherit a counter from parent task to child task:
  */
 static struct perf_counter *
@@ -3195,7 +3255,6 @@ inherit_counter(struct perf_counter *parent_counter,
 	/*
 	 * Link it up in the child's context:
 	 */
-	child_counter->task = child;
 	add_counter_to_ctx(child_counter, child_ctx);
 
 	child_counter->parent = parent_counter;
@@ -3294,40 +3353,15 @@ __perf_counter_exit_task(struct task_struct *child,
 	struct perf_counter *parent_counter;
 
 	/*
-	 * If we do not self-reap then we have to wait for the
-	 * child task to unschedule (it will happen for sure),
-	 * so that its counter is at its final count. (This
-	 * condition triggers rarely - child tasks usually get
-	 * off their CPU before the parent has a chance to
-	 * get this far into the reaping action)
+	 * Protect against concurrent operations on child_counter
+	 * due its fd getting closed, etc.
 	 */
-	if (child != current) {
-		wait_task_inactive(child, 0);
-		update_counter_times(child_counter);
-		list_del_counter(child_counter, child_ctx);
-	} else {
-		struct perf_cpu_context *cpuctx;
-		unsigned long flags;
-
-		/*
-		 * Disable and unlink this counter.
-		 *
-		 * Be careful about zapping the list - IRQ/NMI context
-		 * could still be processing it:
-		 */
-		local_irq_save(flags);
-		perf_disable();
-
-		cpuctx = &__get_cpu_var(perf_cpu_context);
+	mutex_lock(&child_counter->mutex);
 
-		group_sched_out(child_counter, cpuctx, child_ctx);
-		update_counter_times(child_counter);
+	update_counter_times(child_counter);
+	list_del_counter(child_counter, child_ctx);
 
-		list_del_counter(child_counter, child_ctx);
-
-		perf_enable();
-		local_irq_restore(flags);
-	}
+	mutex_unlock(&child_counter->mutex);
 
 	parent_counter = child_counter->parent;
 	/*
@@ -3346,19 +3380,29 @@ __perf_counter_exit_task(struct task_struct *child,
  *
  * Note: we may be running in child context, but the PID is not hashed
  * anymore so new counters will not be added.
+ * (XXX not sure that is true when we get called from flush_old_exec.
+ *  -- paulus)
  */
 void perf_counter_exit_task(struct task_struct *child)
 {
 	struct perf_counter *child_counter, *tmp;
 	struct perf_counter_context *child_ctx;
+	unsigned long flags;
 
 	WARN_ON_ONCE(child != current);
 
-	child_ctx = &child->perf_counter_ctx;
+	child_ctx = child->perf_counter_ctxp;
 
-	if (likely(!child_ctx->nr_counters))
+	if (likely(!child_ctx))
 		return;
 
+	local_irq_save(flags);
+	__perf_counter_task_sched_out(child_ctx);
+	child->perf_counter_ctxp = NULL;
+	local_irq_restore(flags);
+
+	mutex_lock(&child_ctx->mutex);
+
 again:
 	list_for_each_entry_safe(child_counter, tmp, &child_ctx->counter_list,
 				 list_entry)
@@ -3371,6 +3415,10 @@ again:
 	 */
 	if (!list_empty(&child_ctx->counter_list))
 		goto again;
+
+	mutex_unlock(&child_ctx->mutex);
+
+	put_ctx(child_ctx);
 }
 
 /*
@@ -3382,19 +3430,25 @@ void perf_counter_init_task(struct task_struct *child)
 	struct perf_counter *counter;
 	struct task_struct *parent = current;
 
-	child_ctx  =  &child->perf_counter_ctx;
-	parent_ctx = &parent->perf_counter_ctx;
-
-	__perf_counter_init_context(child_ctx, child);
+	child->perf_counter_ctxp = NULL;
 
 	/*
 	 * This is executed from the parent task context, so inherit
-	 * counters that have been marked for cloning:
+	 * counters that have been marked for cloning.
+	 * First allocate and initialize a context for the child.
 	 */
 
-	if (likely(!parent_ctx->nr_counters))
+	child_ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL);
+	if (!child_ctx)
+		return;
+
+	parent_ctx = parent->perf_counter_ctxp;
+	if (likely(!parent_ctx || !parent_ctx->nr_counters))
 		return;
 
+	__perf_counter_init_context(child_ctx, child);
+	child->perf_counter_ctxp = child_ctx;
+
 	/*
 	 * Lock the parent list. No need to lock the child - not PID
 	 * hashed yet and not running, so nobody can access it.

^ permalink raw reply related	[flat|nested] 32+ messages in thread
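
The allocation path in find_get_context() in the patch above follows a
publish-with-cmpxchg pattern: allocate and initialize a context off to
the side, then try to install it; whoever loses the race frees their
copy and uses the winner's.  A userspace analogue using C11 atomics
(plain seq_cst operations, which is stronger than the kernel's
smp_wmb() + cmpxchg() pair) - a sketch, not the kernel code:

#include <stdatomic.h>
#include <stdlib.h>

struct ctx {
	int	nr_counters;
	/* spinlock, lists, refcount ... omitted */
};

struct task {
	_Atomic(struct ctx *)	ctxp;
};

static struct ctx *get_context(struct task *t)
{
	struct ctx *ctx, *expected = NULL;

	ctx = atomic_load(&t->ctxp);
	if (ctx)
		return ctx;

	ctx = calloc(1, sizeof(*ctx));	/* zeroed, like the memset in
					 * __perf_counter_init_context */
	if (!ctx)
		return NULL;

	if (!atomic_compare_exchange_strong(&t->ctxp, &expected, ctx)) {
		/* Someone else installed a context first: use theirs. */
		free(ctx);
		ctx = expected;
	}
	return ctx;
}

int main(void)
{
	struct task t = { .ctxp = NULL };

	/* Single-threaded demo: the second lookup finds the first's ctx. */
	struct ctx *c1 = get_context(&t);
	struct ctx *c2 = get_context(&t);

	free(c1);
	return c1 == c2 ? 0 : 1;
}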

* [tip:perfcounters/core] perf_counter: Optimize context switch between identical inherited contexts
  2009-05-22  4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras
                     ` (3 preceding siblings ...)
  2009-05-22 10:11   ` Ingo Molnar
@ 2009-05-22 10:27   ` tip-bot for Paul Mackerras
  2009-05-24 11:33     ` Ingo Molnar
  2009-05-22 10:36   ` [tip:perfcounters/core] perf_counter: fix !PERF_COUNTERS build failure tip-bot for Ingo Molnar
  2009-05-22 13:46   ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Peter Zijlstra
  6 siblings, 1 reply; 32+ messages in thread
From: tip-bot for Paul Mackerras @ 2009-05-22 10:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, acme, paulus, hpa, mingo, a.p.zijlstra, mtosatti,
	tglx, cjashfor, mingo

Commit-ID:  564c2b210add41df9a3a5aaa365c1d97cff6110d
Gitweb:     http://git.kernel.org/tip/564c2b210add41df9a3a5aaa365c1d97cff6110d
Author:     Paul Mackerras <paulus@samba.org>
AuthorDate: Fri, 22 May 2009 14:27:22 +1000
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 22 May 2009 12:18:20 +0200

perf_counter: Optimize context switch between identical inherited contexts

When monitoring a process and its descendants with a set of inherited
counters, we can often get the situation in a context switch where
both the old (outgoing) and new (incoming) process have the same set
of counters, and their values are ultimately going to be added together.
In that situation it doesn't matter which set of counters are used to
count the activity for the new process, so there is really no need to
go through the process of reading the hardware counters and updating
the old task's counters and then setting up the PMU for the new task.

This optimizes the context switch in this situation.  Instead of
scheduling out the perf_counter_context for the old task and
scheduling in the new context, we simply transfer the old context
to the new task and keep using it without interruption.  The new
context gets transferred to the old task.  This means that both
tasks still have a valid perf_counter_context, so no special case
is introduced when the old task gets scheduled in again, either on
this CPU or another CPU.

The equivalence of contexts is detected by keeping a pointer in
each cloned context pointing to the context it was cloned from.
To cope with the situation where a context is changed by adding
or removing counters after it has been cloned, we also keep a
generation number on each context which is incremented every time
a context is changed.  When a context is cloned we take a copy
of the parent's generation number, and two cloned contexts are
equivalent only if they have the same parent and the same
generation number.  In order that the parent context pointer
remains valid (and is not reused), we increment the parent
context's reference count for each context cloned from it.

Since we don't have individual fds for the counters in a cloned
context, the only thing that can make two clones of a given parent
different after they have been cloned is enabling or disabling all
counters with prctl.  To account for this, we keep a count of the
number of enabled counters in each context.  Two contexts must have
the same number of enabled counters to be considered equivalent.

Here are some measurements of the context switch time as measured with
the lat_ctx benchmark from lmbench, comparing the times obtained with
and without this patch series:

		-----Unmodified-----		With this patch series
Counters:	none	2 HW	4H+4S	none	2 HW	4H+4S

2 processes:
Average		3.44	6.45	11.24	3.12	3.39	3.60
St dev		0.04	0.04	0.13	0.05	0.17	0.19

8 processes:
Average		6.45	8.79	14.00	5.57	6.23	7.57
St dev		1.27	1.04	0.88	1.42	1.46	1.42

32 processes:
Average		5.56	8.43	13.78	5.28	5.55	7.15
St dev		0.41	0.47	0.53	0.54	0.57	0.81

The numbers are the mean and standard deviation of 20 runs of
lat_ctx.  The "none" columns are lat_ctx run directly without any
counters.  The "2 HW" columns are with lat_ctx run under perfstat,
counting cycles and instructions.  The "4H+4S" columns are lat_ctx run
under perfstat with 4 hardware counters and 4 software counters
(cycles, instructions, cache references, cache misses, task
clock, context switch, cpu migrations, and page faults).

[ Impact: performance optimization of counter context-switches ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <18966.10666.517218.332164@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 include/linux/perf_counter.h |   12 ++++-
 kernel/perf_counter.c        |  109 ++++++++++++++++++++++++++++++++++++-----
 kernel/sched.c               |    2 +-
 3 files changed, 107 insertions(+), 16 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 0713090..4cae01a 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -513,6 +513,7 @@ struct perf_counter_context {
 	struct list_head	event_list;
 	int			nr_counters;
 	int			nr_active;
+	int			nr_enabled;
 	int			is_active;
 	atomic_t		refcount;
 	struct task_struct	*task;
@@ -522,6 +523,14 @@ struct perf_counter_context {
 	 */
 	u64			time;
 	u64			timestamp;
+
+	/*
+	 * These fields let us detect when two contexts have both
+	 * been cloned (inherited) from a common ancestor.
+	 */
+	struct perf_counter_context *parent_ctx;
+	u32			parent_gen;
+	u32			generation;
 };
 
 /**
@@ -552,7 +561,8 @@ extern int perf_max_counters;
 extern const struct pmu *hw_perf_counter_init(struct perf_counter *counter);
 
 extern void perf_counter_task_sched_in(struct task_struct *task, int cpu);
-extern void perf_counter_task_sched_out(struct task_struct *task, int cpu);
+extern void perf_counter_task_sched_out(struct task_struct *task,
+					struct task_struct *next, int cpu);
 extern void perf_counter_task_tick(struct task_struct *task, int cpu);
 extern void perf_counter_init_task(struct task_struct *child);
 extern void perf_counter_exit_task(struct task_struct *child);
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 06ea3ea..c100554 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -104,8 +104,11 @@ static void get_ctx(struct perf_counter_context *ctx)
 
 static void put_ctx(struct perf_counter_context *ctx)
 {
-	if (atomic_dec_and_test(&ctx->refcount))
+	if (atomic_dec_and_test(&ctx->refcount)) {
+		if (ctx->parent_ctx)
+			put_ctx(ctx->parent_ctx);
 		kfree(ctx);
+	}
 }
 
 static void
@@ -127,6 +130,8 @@ list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
 
 	list_add_rcu(&counter->event_entry, &ctx->event_list);
 	ctx->nr_counters++;
+	if (counter->state >= PERF_COUNTER_STATE_INACTIVE)
+		ctx->nr_enabled++;
 }
 
 /*
@@ -141,6 +146,8 @@ list_del_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
 	if (list_empty(&counter->list_entry))
 		return;
 	ctx->nr_counters--;
+	if (counter->state >= PERF_COUNTER_STATE_INACTIVE)
+		ctx->nr_enabled--;
 
 	list_del_init(&counter->list_entry);
 	list_del_rcu(&counter->event_entry);
@@ -204,6 +211,22 @@ group_sched_out(struct perf_counter *group_counter,
 }
 
 /*
+ * Mark this context as not being a clone of another.
+ * Called when counters are added to or removed from this context.
+ * We also increment our generation number so that anything that
+ * was cloned from this context before this will not match anything
+ * cloned from this context after this.
+ */
+static void unclone_ctx(struct perf_counter_context *ctx)
+{
+	++ctx->generation;
+	if (!ctx->parent_ctx)
+		return;
+	put_ctx(ctx->parent_ctx);
+	ctx->parent_ctx = NULL;
+}
+
+/*
  * Cross CPU call to remove a performance counter
  *
  * We disable the counter on the hardware level first. After that we
@@ -263,6 +286,7 @@ static void perf_counter_remove_from_context(struct perf_counter *counter)
 	struct perf_counter_context *ctx = counter->ctx;
 	struct task_struct *task = ctx->task;
 
+	unclone_ctx(ctx);
 	if (!task) {
 		/*
 		 * Per cpu counters are removed via an smp call and
@@ -378,6 +402,7 @@ static void __perf_counter_disable(void *info)
 		else
 			counter_sched_out(counter, cpuctx, ctx);
 		counter->state = PERF_COUNTER_STATE_OFF;
+		ctx->nr_enabled--;
 	}
 
 	spin_unlock_irqrestore(&ctx->lock, flags);
@@ -419,6 +444,7 @@ static void perf_counter_disable(struct perf_counter *counter)
 	if (counter->state == PERF_COUNTER_STATE_INACTIVE) {
 		update_counter_times(counter);
 		counter->state = PERF_COUNTER_STATE_OFF;
+		ctx->nr_enabled--;
 	}
 
 	spin_unlock_irq(&ctx->lock);
@@ -727,6 +753,7 @@ static void __perf_counter_enable(void *info)
 		goto unlock;
 	counter->state = PERF_COUNTER_STATE_INACTIVE;
 	counter->tstamp_enabled = ctx->time - counter->total_time_enabled;
+	ctx->nr_enabled++;
 
 	/*
 	 * If the counter is in a group and isn't the group leader,
@@ -817,6 +844,7 @@ static void perf_counter_enable(struct perf_counter *counter)
 		counter->state = PERF_COUNTER_STATE_INACTIVE;
 		counter->tstamp_enabled =
 			ctx->time - counter->total_time_enabled;
+		ctx->nr_enabled++;
 	}
  out:
 	spin_unlock_irq(&ctx->lock);
@@ -862,6 +890,25 @@ void __perf_counter_sched_out(struct perf_counter_context *ctx,
 }
 
 /*
+ * Test whether two contexts are equivalent, i.e. whether they
+ * have both been cloned from the same version of the same context
+ * and they both have the same number of enabled counters.
+ * If the number of enabled counters is the same, then the set
+ * of enabled counters should be the same, because these are both
+ * inherited contexts, therefore we can't access individual counters
+ * in them directly with an fd; we can only enable/disable all
+ * counters via prctl, or enable/disable all counters in a family
+ * via ioctl, which will have the same effect on both contexts.
+ */
+static int context_equiv(struct perf_counter_context *ctx1,
+			 struct perf_counter_context *ctx2)
+{
+	return ctx1->parent_ctx && ctx1->parent_ctx == ctx2->parent_ctx
+		&& ctx1->parent_gen == ctx2->parent_gen
+		&& ctx1->nr_enabled == ctx2->nr_enabled;
+}
+
+/*
  * Called from scheduler to remove the counters of the current task,
  * with interrupts disabled.
  *
@@ -872,10 +919,12 @@ void __perf_counter_sched_out(struct perf_counter_context *ctx,
  * accessing the counter control register. If a NMI hits, then it will
  * not restart the counter.
  */
-void perf_counter_task_sched_out(struct task_struct *task, int cpu)
+void perf_counter_task_sched_out(struct task_struct *task,
+				 struct task_struct *next, int cpu)
 {
 	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
 	struct perf_counter_context *ctx = task->perf_counter_ctxp;
+	struct perf_counter_context *next_ctx;
 	struct pt_regs *regs;
 
 	if (likely(!ctx || !cpuctx->task_ctx))
@@ -885,6 +934,16 @@ void perf_counter_task_sched_out(struct task_struct *task, int cpu)
 
 	regs = task_pt_regs(task);
 	perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0);
+
+	next_ctx = next->perf_counter_ctxp;
+	if (next_ctx && context_equiv(ctx, next_ctx)) {
+		task->perf_counter_ctxp = next_ctx;
+		next->perf_counter_ctxp = ctx;
+		ctx->task = next;
+		next_ctx->task = task;
+		return;
+	}
+
 	__perf_counter_sched_out(ctx, cpuctx);
 
 	cpuctx->task_ctx = NULL;
@@ -998,6 +1057,8 @@ void perf_counter_task_sched_in(struct task_struct *task, int cpu)
 
 	if (likely(!ctx))
 		return;
+	if (cpuctx->task_ctx == ctx)
+		return;
 	__perf_counter_sched_in(ctx, cpuctx, cpu);
 	cpuctx->task_ctx = ctx;
 }
@@ -3253,6 +3314,16 @@ inherit_counter(struct perf_counter *parent_counter,
 		return child_counter;
 
 	/*
+	 * Make the child state follow the state of the parent counter,
+	 * not its hw_event.disabled bit.  We hold the parent's mutex,
+	 * so we won't race with perf_counter_{en,dis}able_family.
+	 */
+	if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE)
+		child_counter->state = PERF_COUNTER_STATE_INACTIVE;
+	else
+		child_counter->state = PERF_COUNTER_STATE_OFF;
+
+	/*
 	 * Link it up in the child's context:
 	 */
 	add_counter_to_ctx(child_counter, child_ctx);
@@ -3277,16 +3348,6 @@ inherit_counter(struct perf_counter *parent_counter,
 	mutex_lock(&parent_counter->mutex);
 	list_add_tail(&child_counter->child_list, &parent_counter->child_list);
 
-	/*
-	 * Make the child state follow the state of the parent counter,
-	 * not its hw_event.disabled bit.  We hold the parent's mutex,
-	 * so we won't race with perf_counter_{en,dis}able_family.
-	 */
-	if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE)
-		child_counter->state = PERF_COUNTER_STATE_INACTIVE;
-	else
-		child_counter->state = PERF_COUNTER_STATE_OFF;
-
 	mutex_unlock(&parent_counter->mutex);
 
 	return child_counter;
@@ -3429,6 +3490,7 @@ void perf_counter_init_task(struct task_struct *child)
 	struct perf_counter_context *child_ctx, *parent_ctx;
 	struct perf_counter *counter;
 	struct task_struct *parent = current;
+	int inherited_all = 1;
 
 	child->perf_counter_ctxp = NULL;
 
@@ -3463,12 +3525,31 @@ void perf_counter_init_task(struct task_struct *child)
 		if (counter != counter->group_leader)
 			continue;
 
-		if (!counter->hw_event.inherit)
+		if (!counter->hw_event.inherit) {
+			inherited_all = 0;
 			continue;
+		}
 
 		if (inherit_group(counter, parent,
-				  parent_ctx, child, child_ctx))
+				  parent_ctx, child, child_ctx)) {
+			inherited_all = 0;
 			break;
+		}
+	}
+
+	if (inherited_all) {
+		/*
+		 * Mark the child context as a clone of the parent
+		 * context, or of whatever the parent is a clone of.
+		 */
+		if (parent_ctx->parent_ctx) {
+			child_ctx->parent_ctx = parent_ctx->parent_ctx;
+			child_ctx->parent_gen = parent_ctx->parent_gen;
+		} else {
+			child_ctx->parent_ctx = parent_ctx;
+			child_ctx->parent_gen = parent_ctx->generation;
+		}
+		get_ctx(child_ctx->parent_ctx);
 	}
 
 	mutex_unlock(&parent_ctx->mutex);
diff --git a/kernel/sched.c b/kernel/sched.c
index 419a39d..4c0d58b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5091,7 +5091,7 @@ need_resched_nonpreemptible:
 
 	if (likely(prev != next)) {
 		sched_info_switch(prev, next);
-		perf_counter_task_sched_out(prev, cpu);
+		perf_counter_task_sched_out(prev, next, cpu);
 
 		rq->nr_switches++;
 		rq->curr = next;

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [tip:perfcounters/core] perf_counter: fix !PERF_COUNTERS build failure
  2009-05-22  4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras
                     ` (4 preceding siblings ...)
  2009-05-22 10:27   ` [tip:perfcounters/core] perf_counter: Optimize " tip-bot for Paul Mackerras
@ 2009-05-22 10:36   ` tip-bot for Ingo Molnar
  2009-05-22 13:46   ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Peter Zijlstra
  6 siblings, 0 replies; 32+ messages in thread
From: tip-bot for Ingo Molnar @ 2009-05-22 10:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, acme, paulus, hpa, mingo, a.p.zijlstra, mtosatti,
	tglx, cjashfor, mingo

Commit-ID:  910431c7f2e963017d767b29c80ae706421e569f
Gitweb:     http://git.kernel.org/tip/910431c7f2e963017d767b29c80ae706421e569f
Author:     Ingo Molnar <mingo@elte.hu>
AuthorDate: Fri, 22 May 2009 12:32:15 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 22 May 2009 12:33:14 +0200

perf_counter: fix !PERF_COUNTERS build failure

Update the !CONFIG_PERF_COUNTERS prototype too, for
perf_counter_task_sched_out().

[ Impact: build fix ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <18966.10666.517218.332164@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 include/linux/perf_counter.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 4cae01a..2eedae8 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -625,7 +625,8 @@ extern void perf_counter_init(void);
 static inline void
 perf_counter_task_sched_in(struct task_struct *task, int cpu)		{ }
 static inline void
-perf_counter_task_sched_out(struct task_struct *task, int cpu)		{ }
+perf_counter_task_sched_out(struct task_struct *task,
+			    struct task_struct *next, int cpu)		{ }
 static inline void
 perf_counter_task_tick(struct task_struct *task, int cpu)		{ }
 static inline void perf_counter_init_task(struct task_struct *child)	{ }

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22  4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras
                     ` (5 preceding siblings ...)
  2009-05-22 10:36   ` [tip:perfcounters/core] perf_counter: fix !PERF_COUNTERS build failure tip-bot for Ingo Molnar
@ 2009-05-22 13:46   ` Peter Zijlstra
  2009-05-25  0:15     ` Paul Mackerras
  6 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2009-05-22 13:46 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> +       next_ctx = next->perf_counter_ctxp;
> +       if (next_ctx && context_equiv(ctx, next_ctx)) {
> +               task->perf_counter_ctxp = next_ctx;
> +               next->perf_counter_ctxp = ctx;
> +               ctx->task = next;
> +               next_ctx->task = task;
> +               return;
> +       }

Ingo just pointed out that there is nothing there to close the race with
attaching a counter.

That is, you could end up attaching your counter to the wrong task.
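
To see the window being pointed at here, a small pthread analogue
(purely illustrative - the names and setup are invented for the sketch,
and none of the kernel's real locking is modelled):

/*
 * Build with: cc -pthread race.c
 * One thread keeps swapping two tasks' context pointers, as the
 * optimized context switch does; the other looks up task 0's context
 * and then checks who owns it by the time it would "attach".
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct ctx { _Atomic(int) owner; };	/* 0 = task A, 1 = task B */

static struct ctx ctx_a = { 0 }, ctx_b = { 1 };
static _Atomic(struct ctx *) task_ctx[2] = { &ctx_a, &ctx_b };
static _Atomic(int) stop = 0;

static void *scheduler(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop)) {
		struct ctx *a = atomic_load(&task_ctx[0]);
		struct ctx *b = atomic_load(&task_ctx[1]);

		atomic_store(&task_ctx[0], b);
		atomic_store(&task_ctx[1], a);
		atomic_store(&b->owner, 0);
		atomic_store(&a->owner, 1);
	}
	return NULL;
}

int main(void)
{
	pthread_t sched;
	long wrong = 0;

	pthread_create(&sched, NULL, scheduler, NULL);

	for (int i = 0; i < 1000000; i++) {
		/* "find_get_context(task 0)" looks up the context ... */
		struct ctx *ctx = atomic_load(&task_ctx[0]);
		/* ... but by attach time it may belong to task 1. */
		if (atomic_load(&ctx->owner) != 0)
			wrong++;
	}

	atomic_store(&stop, 1);
	pthread_join(sched, NULL);
	printf("attached to the wrong task %ld times\n", wrong);
	return 0;
}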


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22 10:08       ` Peter Zijlstra
@ 2009-05-23 12:38         ` Ingo Molnar
  2009-05-23 13:06           ` Peter Zijlstra
  2009-05-24 23:55           ` Paul Mackerras
  0 siblings, 2 replies; 32+ messages in thread
From: Ingo Molnar @ 2009-05-23 12:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Mackerras, linux-kernel, Corey Ashford, Thomas Gleixner


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2009-05-22 at 19:56 +1000, Paul Mackerras wrote:
> > Peter Zijlstra writes:
> > 
> > > On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > > > Since we don't have individual fds for the counters in a cloned
> > > > context, the only thing that can make two clones of a given parent
> > > > different after they have been cloned is enabling or disabling all
> > > > counters with prctl.  To account for this, we keep a count of the
> > > > number of enabled counters in each context.  Two contexts must have
> > > > the same number of enabled counters to be considered equivalent.
> > > 
> > > Curious point that.. so prctl() can disable counters it doesn't own.
> > > 
> > > Shouldn't we instead fix that?
> > 
> > Well, prctl enables/disables the counters that are counting on the
> > current process, regardless of who or what created them.  I always
> > thought that was a little strange; maybe it is useful to be able to
> > disable all the counters that any other process might have put on to
> > you, but I can't think of a scenario where you'd really need to do
> > that, particularly since the disable is a one-shot operation, and
> > doesn't prevent new (enabled) counters being attached to you.
> > 
> > On the other hand, what does "all the counters I own" mean?  
> > Does it mean all the ones that I have fds open for?  Or does it 
> > mean all the ones that I created?  Either way we don't have a 
> > good way to enumerate them.
> 
> I'm for all counters you created (ie have a fd for). Being able to 
> disable counters others created on you just sounds wrong.
> 
> If we can settle on a semantic, I'm sure we can implement it :-)
> 
> Ingo, Corey, any opinions?

It indeed doesn't sound correct that we can disable counters others 
created on us - especially if they are in a different (higher 
privileged) security context than us.

OTOH, enabling/disabling counters in specific functions of a library 
might be a valid use-case. So perhaps make this an attribute: 
.transparent or so, with perf stat defaulting on it to be 
transparent (i.e. not child context disable-able).

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-23 12:38         ` Ingo Molnar
@ 2009-05-23 13:06           ` Peter Zijlstra
  2009-05-24 23:55           ` Paul Mackerras
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-05-23 13:06 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Paul Mackerras, linux-kernel, Corey Ashford, Thomas Gleixner

On Sat, 2009-05-23 at 14:38 +0200, Ingo Molnar wrote:
> > I'm for all counters you created (ie have a fd for). Being able to 
> > disable counters others created on you just sounds wrong.
> > 
> > If we can settle on a semantic, I'm sure we can implement it :-)
> > 
> > Ingo, Corey, any opinions?
> 
> It indeed doesn't sound correct that we can disable counters others 
> created on us - especially if they are in a different (higher 
> privileged) security context than us.
> 
> OTOH, enabling/disabling counters in specific functions of a library 
> might be a valid use-case. So perhaps make this an attribute: 
> .transparent or so, with perf stat defaulting on it to be 
> transparent (i.e. not child context disable-able).

I'm not sure that's something we want to do. Furthermore, if we do want
it, the current implementation is not sufficient, because, as Paul
noted, we can attach a new counter right after the disable.

I really think such limitations should come from whatever security
policy there is on attaching counters. Eg. using selinux, label gnupg as
non-countable so that you simply cannot attach (incl inherit) counters
to it.

Allowing such hooks into libraries will destroy transparency for
developers and I don't think that's something we'd want to promote.

I'll implement my suggestion so we can take it from there.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [tip:perfcounters/core] perf_counter: Optimize context switch between identical inherited contexts
  2009-05-22 10:27   ` [tip:perfcounters/core] perf_counter: Optimize " tip-bot for Paul Mackerras
@ 2009-05-24 11:33     ` Ingo Molnar
  2009-05-25  6:18       ` Paul Mackerras
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2009-05-24 11:33 UTC (permalink / raw)
  To: mingo, hpa, paulus, acme, linux-kernel, a.p.zijlstra, mtosatti,
	tglx, cjashfor
  Cc: linux-tip-commits


* tip-bot for Paul Mackerras <paulus@samba.org> wrote:

> @@ -885,6 +934,16 @@ void perf_counter_task_sched_out(struct task_struct *task, int cpu)
>  
>  	regs = task_pt_regs(task);
>  	perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0);
> +
> +	next_ctx = next->perf_counter_ctxp;
> +	if (next_ctx && context_equiv(ctx, next_ctx)) {
> +		task->perf_counter_ctxp = next_ctx;
> +		next->perf_counter_ctxp = ctx;
> +		ctx->task = next;
> +		next_ctx->task = task;
> +		return;
> +	}

there's one complication that this trick is causing - the migration 
counter relies on ctx->task to get per task migration stats:

 static inline u64 get_cpu_migrations(struct perf_counter *counter)
 {
        struct task_struct *curr = counter->ctx->task;

        if (curr)
                return curr->se.nr_migrations;
        return cpu_nr_migrations(smp_processor_id());
 }

as ctx->task is now jumping (while we keep the context), the 
migration stats are out of whack.

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-23 12:38         ` Ingo Molnar
  2009-05-23 13:06           ` Peter Zijlstra
@ 2009-05-24 23:55           ` Paul Mackerras
  1 sibling, 0 replies; 32+ messages in thread
From: Paul Mackerras @ 2009-05-24 23:55 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, linux-kernel, Corey Ashford, Thomas Gleixner

Ingo Molnar writes:

> OTOH, enabling/disabling counters in specific functions of a library 
> might be a valid use-case.

If that's the use case, wouldn't we want the "enable" to restore
the previous state, rather than enabling all counters?

Paul.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-22 13:46   ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Peter Zijlstra
@ 2009-05-25  0:15     ` Paul Mackerras
  2009-05-25 10:38       ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2009-05-25  0:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

Peter Zijlstra writes:

> Ingo just pointed out that there is nothing there to close the race with
> attaching a counter.
> 
> That is, you could end up attaching your counter to the wrong task.

Good point.  Doing the unclone in find_get_context would be a start,
but the locking could get tricky (in fact we don't have any way to
remove an inherited counter from a context, so we only have to worry
about counters being attached).  I'll work out a solution after I have
digested your recent batch of patches.

Paul.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [tip:perfcounters/core] perf_counter: Optimize context switch between identical inherited contexts
  2009-05-24 11:33     ` Ingo Molnar
@ 2009-05-25  6:18       ` Paul Mackerras
  2009-05-25  6:54         ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2009-05-25  6:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: mingo, hpa, acme, linux-kernel, a.p.zijlstra, mtosatti, tglx,
	cjashfor, linux-tip-commits

Ingo Molnar writes:

> * tip-bot for Paul Mackerras <paulus@samba.org> wrote:
> 
> > @@ -885,6 +934,16 @@ void perf_counter_task_sched_out(struct task_struct *task, int cpu)
> >  
> >  	regs = task_pt_regs(task);
> >  	perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0);
> > +
> > +	next_ctx = next->perf_counter_ctxp;
> > +	if (next_ctx && context_equiv(ctx, next_ctx)) {
> > +		task->perf_counter_ctxp = next_ctx;
> > +		next->perf_counter_ctxp = ctx;
> > +		ctx->task = next;
> > +		next_ctx->task = task;
> > +		return;
> > +	}
> 
> there's one complication that this trick is causing - the migration 
> counter relies on ctx->task to get per task migration stats:
> 
>  static inline u64 get_cpu_migrations(struct perf_counter *counter)
>  {
>         struct task_struct *curr = counter->ctx->task;
> 
>         if (curr)
>                 return curr->se.nr_migrations;
>         return cpu_nr_migrations(smp_processor_id());
>  }
> 
> as ctx->task is now jumping (while we keep the context), the 
> migration stats are out of whack.

How did you notice this?  The overall sum over all children should
still be correct, though some individual children's counters could go
negative, so the result of a read on the counter when some children
have exited and others haven't could look a bit strange.  Reading the
counter after all children have exited should be fine, though.

One of the effects of optimizing the context switch is that in
general, reading the value of an inheritable counter when some
children have exited but some are still running might produce results
that include some of the activity of the still-running children and
might not include all of the activity of the children that have
exited.  If that's a concern then we need to implement the "sync child
counters" ioctl that has been suggested.

As for the migration counter, it is the only software counter that is
still using the "old" approach, i.e. it doesn't generate interrupts
and it uses the counter->prev_state field (which I hope to
eliminate one day).  It's also the only software counter which counts
events that happen while the task is not scheduled in.  The cleanest
thing would be to rewrite the migration counter code to have a callin
from the scheduler when migrations happen.

Paul.
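
A concrete instance of the effect described above, with made-up
numbers: each inherited counter snapshots its previous value from the
task its context belonged to at sched-in; after a context swap it
reads the other task's se.nr_migrations, so one child's delta can go
negative while the pair still sums to the true total.

#include <stdio.h>

int main(void)
{
	/* prev values snapshotted before the contexts were swapped */
	long prev_a = 5, prev_b = 2;	/* nr_migrations at sched-in */
	/* current values, read through the swapped ctx->task pointers */
	long now_b = 3, now_a = 6;

	long delta_1 = now_b - prev_a;	/* reads B, remembers A: -2 */
	long delta_2 = now_a - prev_b;	/* reads A, remembers B: +4 */

	printf("per-child deltas: %ld and %ld, sum %ld\n",
	       delta_1, delta_2, delta_1 + delta_2);
	/* the true total, (6 - 5) + (3 - 2) = 2, matches the sum */
	return 0;
}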

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [tip:perfcounters/core] perf_counter: Optimize context switch between identical inherited contexts
  2009-05-25  6:18       ` Paul Mackerras
@ 2009-05-25  6:54         ` Ingo Molnar
  0 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2009-05-25  6:54 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: mingo, hpa, acme, linux-kernel, a.p.zijlstra, mtosatti, tglx,
	cjashfor, linux-tip-commits


* Paul Mackerras <paulus@samba.org> wrote:

> Ingo Molnar writes:
> 
> > * tip-bot for Paul Mackerras <paulus@samba.org> wrote:
> > 
> > > @@ -885,6 +934,16 @@ void perf_counter_task_sched_out(struct task_struct *task, int cpu)
> > >  
> > >  	regs = task_pt_regs(task);
> > >  	perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0);
> > > +
> > > +	next_ctx = next->perf_counter_ctxp;
> > > +	if (next_ctx && context_equiv(ctx, next_ctx)) {
> > > +		task->perf_counter_ctxp = next_ctx;
> > > +		next->perf_counter_ctxp = ctx;
> > > +		ctx->task = next;
> > > +		next_ctx->task = task;
> > > +		return;
> > > +	}
> > 
> > there's one complication that this trick is causing - the migration 
> > counter relies on ctx->task to get per task migration stats:
> > 
> >  static inline u64 get_cpu_migrations(struct perf_counter *counter)
> >  {
> >         struct task_struct *curr = counter->ctx->task;
> > 
> >         if (curr)
> >                 return curr->se.nr_migrations;
> >         return cpu_nr_migrations(smp_processor_id());
> >  }
> > 
> > as ctx->task is now jumping (while we keep the context), the 
> > migration stats are out of whack.
> 
> How did you notice this?  The overall sum over all children should 
> still be correct, though some individual children's counters could 
> go negative, so the result of a read on the counter when some 
> children have exited and others haven't could look a bit strange.  
> Reading the counter after all children have exited should be fine, 
> though.

I've noticed a few weirdnesses, then added a debug check and saw the
negative delta values.

> One of the effects of optimizing the context switch is that in 
> general, reading the value of an inheritable counter when some 
> children have exited but some are still running might produce 
> results that include some of the activity of the still-running 
> children and might not include all of the activity of the children 
> that have exited.  If that's a concern then we need to implement 
> the "sync child counters" ioctl that has been suggested.
> 
> As for the migration counter, it is the only software counter that 
> is still using the "old" approach, i.e. it doesn't generate 
> interrupts and it uses the counter->prev_state field (which I hope 
> to eliminate one day).  It's also the only software counter which 
> counts events that happen while the task is not scheduled in.  The 
> cleanest thing would be to rewrite the migration counter code to 
> have a callin from the scheduler when migrations happen.

I'll check again with the debug check removed. If the end result is
OK then I don't think we need to worry much about this at this
stage.

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-25  0:15     ` Paul Mackerras
@ 2009-05-25 10:38       ` Peter Zijlstra
  2009-05-25 10:50         ` Peter Zijlstra
  2009-05-25 11:06         ` Paul Mackerras
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-05-25 10:38 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Mon, 2009-05-25 at 10:15 +1000, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > Ingo just pointed out that there is nothing there to close the race with
> > attaching a counter.
> > 
> > That is, you could end up attaching your counter to the wrong task.
> 
> Good point.  Doing the unclone in find_get_context would be a start,
> but the locking could get tricky (in fact we don't have any way to
> remove an inherited counter from a context, so we only have to worry
> about counters being attached).  I'll work out a solution after I have
> digested your recent batch of patches.

I'm currently staring at something like the below, trying to find races
etc.. ;-)

attach vs destroy vs flip

---

Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -102,13 +102,29 @@ static void get_ctx(struct perf_counter_
 	atomic_inc(&ctx->refcount);
 }
 
-static void put_ctx(struct perf_counter_context *ctx)
+static void free_ctx_rcu(struct rcu_head *head)
+{
+	struct perf_counter_context *ctx;
+
+	ctx = container_of(head, struct perf_counter_context, rcu_head);
+	kfree(ctx);
+}
+
+static bool __put_ctx(struct perf_counter_context *ctx)
 {
 	if (atomic_dec_and_test(&ctx->refcount)) {
 		if (ctx->parent_ctx)
 			put_ctx(ctx->parent_ctx);
-		kfree(ctx);
+		return true;
 	}
+
+	return false;
+}
+
+static void put_ctx(struct perf_counter_context *ctx)
+{
+	if (__put_ctx(ctx))
+		call_rcu(&ctx->rcu_head, free_ctx_rcu);
 }
 
 /*
@@ -934,8 +950,16 @@ void perf_counter_task_sched_out(struct 
 
 	next_ctx = next->perf_counter_ctxp;
 	if (next_ctx && context_equiv(ctx, next_ctx)) {
+		ctx->task = NULL;
+		next_ctx->task = NULL;
+
+		smp_wmb();
+
 		task->perf_counter_ctxp = next_ctx;
 		next->perf_counter_ctxp = ctx;
+
+		smp_wmb();
+
 		ctx->task = next;
 		next_ctx->task = task;
 		return;
@@ -1284,19 +1308,31 @@ static struct perf_counter_context *find
 		return ERR_PTR(-EACCES);
 	}
 
+	rcu_read_lock();
+again:
 	ctx = task->perf_counter_ctxp;
+	/*
+	 * matched against the xchg() in perf_counter_exit_task() setting
+	 * ctx to NULL and the cmpxchg() below.
+	 */
+	smp_read_barrier_depends();
 	if (!ctx) {
+		rcu_read_unlock();
+		/*
+		 * cannot attach counters to a dying task.
+		 */
+		if (task->flags & PF_EXITING) {
+			put_task_struct(task);
+			return ERR_PTR(-ESRCH);
+		}
+
 		ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL);
 		if (!ctx) {
 			put_task_struct(task);
 			return ERR_PTR(-ENOMEM);
 		}
 		__perf_counter_init_context(ctx, task);
-		/*
-		 * Make sure other cpus see correct values for *ctx
-		 * once task->perf_counter_ctxp is visible to them.
-		 */
-		smp_wmb();
+		rcu_read_lock();
 		tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx);
 		if (tctx) {
 			/*
@@ -1308,6 +1344,16 @@ static struct perf_counter_context *find
 		}
 	}
 
+	if (!atomic_inc_not_zero(&ctx->refcount))
+		goto again;
+
+	if (rcu_dereference(ctx->task) != task) {
+		put_ctx(ctx);
+		goto again;
+	}
+
+	rcu_read_unlock();
+
 	return ctx;
 }
 
@@ -1316,7 +1362,6 @@ static void free_counter_rcu(struct rcu_
 	struct perf_counter *counter;
 
 	counter = container_of(head, struct perf_counter, rcu_head);
-	put_ctx(counter->ctx);
 	kfree(counter);
 }
 
@@ -1337,6 +1382,7 @@ static void free_counter(struct perf_cou
 	if (counter->destroy)
 		counter->destroy(counter);
 
+	put_ctx(counter->ctx);
 	call_rcu(&counter->rcu_head, free_counter_rcu);
 }
 
@@ -3231,6 +3277,7 @@ SYSCALL_DEFINE5(perf_counter_open,
 
 out_fput:
 	fput_light(group_file, fput_needed);
+	put_ctx(ctx);
 
 	return ret;
 
@@ -3390,25 +3437,25 @@ __perf_counter_exit_task(struct task_str
  *
  * Note: we may be running in child context, but the PID is not hashed
  * anymore so new counters will not be added.
- * (XXX not sure that is true when we get called from flush_old_exec.
- *  -- paulus)
  */
 void perf_counter_exit_task(struct task_struct *child)
 {
 	struct perf_counter *child_counter, *tmp;
 	struct perf_counter_context *child_ctx;
 	unsigned long flags;
+	bool free_ctx;
 
 	WARN_ON_ONCE(child != current);
 
-	child_ctx = child->perf_counter_ctxp;
+	child_ctx = xchg(&child->perf_counter_ctxp, NULL);
 
 	if (likely(!child_ctx))
 		return;
 
+	free_ctx = __put_ctx(child_ctx);
+
 	local_irq_save(flags);
 	__perf_counter_task_sched_out(child_ctx);
-	child->perf_counter_ctxp = NULL;
 	local_irq_restore(flags);
 
 	mutex_lock(&child_ctx->mutex);
@@ -3428,7 +3475,8 @@ again:
 
 	mutex_unlock(&child_ctx->mutex);
 
-	put_ctx(child_ctx);
+	if (free_ctx)
+		call_rcu(&child_ctx->rcu_head, free_ctx_rcu);
 }
 
 /*


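The lifetime scheme this sketch relies on is the usual RCU-plus-refcount
"get" idiom: look the context up under rcu_read_lock(), take a reference
only if the refcount is still non-zero, and re-check that the context
still belongs to the task we started from.  Roughly (names mirror the
sketch above; this is an illustration of the pattern, not tested code):

	static struct perf_counter_context *
	get_task_ctx(struct task_struct *task)
	{
		struct perf_counter_context *ctx;

		rcu_read_lock();
	again:
		ctx = rcu_dereference(task->perf_counter_ctxp);
		if (!ctx) {
			rcu_read_unlock();
			return NULL;
		}
		/* a zero refcount means the context is already on its way
		 * to free_ctx_rcu(), so 0 -> 1 transitions are refused */
		if (!atomic_inc_not_zero(&ctx->refcount))
			goto again;
		/* the context may have been swapped to another task in the
		 * meantime; if so, drop the reference and retry */
		if (rcu_dereference(ctx->task) != task) {
			put_ctx(ctx);
			goto again;
		}
		rcu_read_unlock();
		return ctx;
	}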

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-25 10:38       ` Peter Zijlstra
@ 2009-05-25 10:50         ` Peter Zijlstra
  2009-05-25 11:06         ` Paul Mackerras
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-05-25 10:50 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Mon, 2009-05-25 at 12:38 +0200, Peter Zijlstra wrote:
> On Mon, 2009-05-25 at 10:15 +1000, Paul Mackerras wrote:
> > Peter Zijlstra writes:
> > 
> > > Ingo just pointed out that there is nothing there to close the race with
> > > attaching a counter.
> > > 
> > > That is, you could end up attaching your counter to the wrong task.
> > 
> > Good point.  Doing the unclone in find_get_context would be a start,
> > but the locking could get tricky (in fact we don't have any way to
> > remove an inherited counter from a context, so we only have to worry
> > about counters being attached).  I'll work out a solution after I have
> > digested your recent batch of patches.
> 
> I'm currently staring at something like the below, trying to find races
> etc.. ;-)
> 
> attach vs destroy vs flip

Hmm, it appears we only unclone the ctx on remove, not on attach... how odd.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-25 10:38       ` Peter Zijlstra
  2009-05-25 10:50         ` Peter Zijlstra
@ 2009-05-25 11:06         ` Paul Mackerras
  2009-05-25 11:27           ` Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2009-05-25 11:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

Peter Zijlstra writes:

> I'm currently staring at something like the below, trying to find races
> etc.. ;-)
[snip]
>  	next_ctx = next->perf_counter_ctxp;
>  	if (next_ctx && context_equiv(ctx, next_ctx)) {
> +		ctx->task = NULL;
> +		next_ctx->task = NULL;

Trouble is, ctx->task == NULL is used as an indication that this is a
per-cpu context in various places.

Also, in find_get_context, we have a lifetime problem because *ctx
could get swapped and then freed underneath us immediately after we
read task->perf_counter_ctxp.  So we need a lock in the task_struct
that stops sched_out from swapping the context underneath us.  That
led me to the patch below, which I'm about to test... :)  It does the
unclone in find_get_context; we don't actually need it on remove
because we have no way to remove an inherited counter from a task
without the task exiting.

Paul.

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 353c0ac..96b0a73 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -110,6 +110,8 @@ extern struct cred init_cred;
 
 #ifdef CONFIG_PERF_COUNTERS
 # define INIT_PERF_COUNTERS(tsk)					\
+	.perf_counter_ctx_lock =					\
+		__SPIN_LOCK_UNLOCKED(tsk.perf_counter_ctx_lock),	\
 	.perf_counter_mutex = 						\
 		 __MUTEX_INITIALIZER(tsk.perf_counter_mutex),		\
 	.perf_counter_list = LIST_HEAD_INIT(tsk.perf_counter_list),
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 2ddf5e3..77cbbe7 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -531,8 +531,8 @@ struct perf_counter_context {
 	 * been cloned (inherited) from a common ancestor.
 	 */
 	struct perf_counter_context *parent_ctx;
-	u32			parent_gen;
-	u32			generation;
+	u64			parent_gen;
+	u64			generation;
 };
 
 /**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bc9326d..717830e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1389,6 +1389,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_PERF_COUNTERS
 	struct perf_counter_context *perf_counter_ctxp;
+	spinlock_t perf_counter_ctx_lock;
 	struct mutex perf_counter_mutex;
 	struct list_head perf_counter_list;
 #endif
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index cb40625..4c19404 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -211,22 +211,6 @@ group_sched_out(struct perf_counter *group_counter,
 }
 
 /*
- * Mark this context as not being a clone of another.
- * Called when counters are added to or removed from this context.
- * We also increment our generation number so that anything that
- * was cloned from this context before this will not match anything
- * cloned from this context after this.
- */
-static void unclone_ctx(struct perf_counter_context *ctx)
-{
-	++ctx->generation;
-	if (!ctx->parent_ctx)
-		return;
-	put_ctx(ctx->parent_ctx);
-	ctx->parent_ctx = NULL;
-}
-
-/*
  * Cross CPU call to remove a performance counter
  *
  * We disable the counter on the hardware level first. After that we
@@ -280,13 +264,16 @@ static void __perf_counter_remove_from_context(void *info)
  *
  * CPU counters are removed with a smp call. For task counters we only
  * call when the task is on a CPU.
+ *
+ * If counter->ctx is a cloned context, callers must make sure that
+ * every task struct that counter->ctx->task could possibly point to
+ * remains valid.
  */
 static void perf_counter_remove_from_context(struct perf_counter *counter)
 {
 	struct perf_counter_context *ctx = counter->ctx;
 	struct task_struct *task = ctx->task;
 
-	unclone_ctx(ctx);
 	if (!task) {
 		/*
 		 * Per cpu counters are removed via an smp call and
@@ -409,6 +396,10 @@ static void __perf_counter_disable(void *info)
 
 /*
  * Disable a counter.
+ *
+ * If counter->ctx is a cloned context, callers must make sure that
+ * every task struct that counter->ctx->task could possibly point to
+ * remains valid.
  */
 static void perf_counter_disable(struct perf_counter *counter)
 {
@@ -793,6 +784,10 @@ static void __perf_counter_enable(void *info)
 
 /*
  * Enable a counter.
+ *
+ * If counter->ctx is a cloned context, callers must make sure that
+ * every task struct that counter->ctx->task could possibly point to
+ * remains valid.
  */
 static void perf_counter_enable(struct perf_counter *counter)
 {
@@ -932,12 +927,32 @@ void perf_counter_task_sched_out(struct task_struct *task,
 	regs = task_pt_regs(task);
 	perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0);
 
+	/*
+	 * Note: task->perf_counter_ctxp and next->perf_counter_ctxp
+	 * can't change underneath us here if we see them non-NULL,
+	 * because this is the only place where we change which
+	 * context a task points to, and the scheduler will ensure
+	 * that this code isn't being called for either of these tasks
+	 * on any other cpu at the same time.
+	 */
 	next_ctx = next->perf_counter_ctxp;
 	if (next_ctx && context_equiv(ctx, next_ctx)) {
+		/*
+		 * Lock the contexts of both the old and new tasks so that we
+		 * don't change things underneath find_get_context etc.
+		 * We don't need to be careful about the order in which we
+		 * lock the tasks because no other cpu could be trying to lock
+		 * both of these tasks -- this is the only place where we lock
+		 * two tasks.
+		 */
+		spin_lock(&task->perf_counter_ctx_lock);
+		spin_lock(&next->perf_counter_ctx_lock);
 		task->perf_counter_ctxp = next_ctx;
 		next->perf_counter_ctxp = ctx;
 		ctx->task = next;
 		next_ctx->task = task;
+		spin_unlock(&next->perf_counter_ctx_lock);
+		spin_unlock(&task->perf_counter_ctx_lock);
 		return;
 	}
 
@@ -1067,6 +1082,7 @@ static void perf_counter_cpu_sched_in(struct perf_cpu_context *cpuctx, int cpu)
 	__perf_counter_sched_in(ctx, cpuctx, cpu);
 }
 
+// XXX doesn't do inherited counters too?
 int perf_counter_task_enable(void)
 {
 	struct perf_counter *counter;
@@ -1079,6 +1095,7 @@ int perf_counter_task_enable(void)
 	return 0;
 }
 
+// XXX doesn't do inherited counters too?
 int perf_counter_task_disable(void)
 {
 	struct perf_counter *counter;
@@ -1108,6 +1125,7 @@ static void perf_adjust_freq(struct perf_counter_context *ctx)
 		if (!counter->hw_event.freq || !counter->hw_event.irq_freq)
 			continue;
 
+		// XXX does this assume we got a whole tick worth of counts?
 		events = HZ * counter->hw.interrupts * counter->hw.irq_period;
 		period = div64_u64(events, counter->hw_event.irq_freq);
 
@@ -1238,7 +1256,6 @@ static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
 {
 	struct perf_cpu_context *cpuctx;
 	struct perf_counter_context *ctx;
-	struct perf_counter_context *tctx;
 	struct task_struct *task;
 
 	/*
@@ -1284,30 +1301,42 @@ static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
 		return ERR_PTR(-EACCES);
 	}
 
+	/*
+	 * Taking this lock stops the context-switch code from switching
+	 * our context to another task.  We also use the lock to solve
+	 * the race where two tasks allocate a context at the same time.
+	 */
+	spin_lock(&task->perf_counter_ctx_lock);
 	ctx = task->perf_counter_ctxp;
 	if (!ctx) {
+		spin_unlock(&task->perf_counter_ctx_lock);
 		ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL);
 		if (!ctx) {
 			put_task_struct(task);
 			return ERR_PTR(-ENOMEM);
 		}
 		__perf_counter_init_context(ctx, task);
-		/*
-		 * Make sure other cpus see correct values for *ctx
-		 * once task->perf_counter_ctxp is visible to them.
-		 */
-		smp_wmb();
-		tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx);
-		if (tctx) {
+		spin_lock(&task->perf_counter_ctx_lock);
+		if (task->perf_counter_ctxp) {
 			/*
 			 * We raced with some other task; use
 			 * the context they set.
 			 */
 			kfree(ctx);
-			ctx = tctx;
+			ctx = task->perf_counter_ctxp;
 		}
 	}
 
+	/*
+	 * Since we're about to add another counter to this context,
+	 * it won't be a clone of another context any longer.
+	 */
+	++ctx->generation;
+	if (ctx->parent_ctx) {
+		put_ctx(ctx->parent_ctx);
+		ctx->parent_ctx = NULL;
+	}
+
 	return ctx;
 }
 
@@ -1360,7 +1389,7 @@ static int perf_release(struct inode *inode, struct file *file)
 	put_task_struct(counter->owner);
 
 	free_counter(counter);
-	put_context(ctx);
+	put_context(ctx);		// XXX use after free?
 
 	return 0;
 }
@@ -1382,7 +1411,7 @@ perf_read_hw(struct perf_counter *counter, char __user *buf, size_t count)
 	if (counter->state == PERF_COUNTER_STATE_ERROR)
 		return 0;
 
-	mutex_lock(&counter->child_mutex);
+	mutex_lock(&counter->child_mutex);	// XXX doesn't exclude sync_child_counter, so why?
 	values[0] = perf_counter_read(counter);
 	n = 1;
 	if (counter->hw_event.read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
@@ -1450,6 +1479,12 @@ static void perf_counter_for_each_sibling(struct perf_counter *counter,
 	mutex_unlock(&ctx->mutex);
 }
 
+/*
+ * Holding the top-level counter's child_mutex means that any
+ * descendant process that has inherited this counter will block
+ * in sync_child_counter, thus satisfying the task existence requirements
+ * of perf_counter_enable/disable.
+ */
 static void perf_counter_for_each_child(struct perf_counter *counter,
 					void (*func)(struct perf_counter *))
 {
@@ -3395,15 +3430,14 @@ void perf_counter_exit_task(struct task_struct *child)
 
 	WARN_ON_ONCE(child != current);
 
-	child_ctx = child->perf_counter_ctxp;
-
-	if (likely(!child_ctx))
+	if (likely(!child->perf_counter_ctxp))
 		return;
 
-	local_irq_save(flags);
+	spin_lock_irqsave(&child->perf_counter_ctx_lock, flags);
+	child_ctx = child->perf_counter_ctxp;
 	__perf_counter_task_sched_out(child_ctx);
 	child->perf_counter_ctxp = NULL;
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&child->perf_counter_ctx_lock, flags);
 
 	mutex_lock(&child_ctx->mutex);
 

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
  2009-05-25 11:06         ` Paul Mackerras
@ 2009-05-25 11:27           ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-05-25 11:27 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Mon, 2009-05-25 at 21:06 +1000, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > I'm currently staring at something like the below, trying to find races
> > etc.. ;-)
> [snip]
> >  	next_ctx = next->perf_counter_ctxp;
> >  	if (next_ctx && context_equiv(ctx, next_ctx)) {
> > +		ctx->task = NULL;
> > +		next_ctx->task = NULL;
> 
> Trouble is, ctx->task == NULL is used as an indication that this is a
> per-cpu context in various places.

Yeah, already realized that; it's enough to simply point them at their
new tasks before flipping the context pointers.
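
In code, the ordering described here would look roughly like this (a
sketch of the revised swap, not a posted patch; the barrier mirrors the
smp_wmb() from the earlier sketch):

	next_ctx = next->perf_counter_ctxp;
	if (next_ctx && context_equiv(ctx, next_ctx)) {
		/* re-point the contexts at their new owners first ... */
		ctx->task = next;
		next_ctx->task = task;

		smp_wmb();

		/* ... and only then publish the swapped context pointers */
		task->perf_counter_ctxp = next_ctx;
		next->perf_counter_ctxp = ctx;
		return;
	}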

> Also, in find_get_context, we have a lifetime problem because *ctx
> could get swapped and then freed underneath us immediately after we
> read task->perf_counter_ctxp.  So we need a lock in the task_struct
> that stops sched_out from swapping the context underneath us.  That
> led me to the patch below, which I'm about to test... :)  It does the
> unclone in find_get_context; we don't actually need it on remove
> because we have no way to remove an inherited counter from a task
> without the task exiting.

Right, I was trying to solve the lifetime issues with RCU and refcount
tricks and the races with ordering, instead of adding another lock.

> @@ -932,12 +927,32 @@ void perf_counter_task_sched_out(struct task_struct *task,
>  	regs = task_pt_regs(task);
>  	perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0);
>  
> +	/*
> +	 * Note: task->perf_counter_ctxp and next->perf_counter_ctxp
> +	 * can't change underneath us here if we see them non-NULL,
> +	 * because this is the only place where we change which
> +	 * context a task points to, and the scheduler will ensure
> +	 * that this code isn't being called for either of these tasks
> +	 * on any other cpu at the same time.
> +	 */
>  	next_ctx = next->perf_counter_ctxp;
>  	if (next_ctx && context_equiv(ctx, next_ctx)) {
> +		/*
> +		 * Lock the contexts of both the old and new tasks so that we
> +		 * don't change things underneath find_get_context etc.
> +		 * We don't need to be careful about the order in which we
> +		 * lock the tasks because no other cpu could be trying to lock
> +		 * both of these tasks -- this is the only place where we lock
> +		 * two tasks.
> +		 */
> +		spin_lock(&task->perf_counter_ctx_lock);
> +		spin_lock(&next->perf_counter_ctx_lock);
>  		task->perf_counter_ctxp = next_ctx;
>  		next->perf_counter_ctxp = ctx;
>  		ctx->task = next;
>  		next_ctx->task = task;
> +		spin_unlock(&next->perf_counter_ctx_lock);
> +		spin_unlock(&task->perf_counter_ctx_lock);
>  		return;
>  	}

FWIW, that nested lock will make lockdep complain. It can't actually
deadlock, since we're under rq->lock and the tasks can't be stolen from
the rq in that case, so you can silence lockdep by using
spin_lock_nest_lock()

> @@ -1067,6 +1082,7 @@ static void perf_counter_cpu_sched_in(struct perf_cpu_context *cpuctx, int cpu)
>  	__perf_counter_sched_in(ctx, cpuctx, cpu);
>  }
>  
> +// XXX doesn't do inherited counters too?
>  int perf_counter_task_enable(void)
>  {
>  	struct perf_counter *counter;

Good point... perf_counter_for_each_child(counter, perf_counter_disable)
should fix that, I think.
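
Concretely, that would turn perf_counter_task_enable() (and its disable
twin) into something like the sketch below.  It is based on the
perf_counter_mutex/perf_counter_list fields visible in the init_task.h
hunk quoted earlier; the owner_entry list field name is an assumption:

	int perf_counter_task_enable(void)
	{
		struct perf_counter *counter;

		mutex_lock(&current->perf_counter_mutex);
		list_for_each_entry(counter, &current->perf_counter_list,
				    owner_entry)
			/* walk the inherited children too, not just the
			 * top-level counters */
			perf_counter_for_each_child(counter,
						    perf_counter_enable);
		mutex_unlock(&current->perf_counter_mutex);

		return 0;
	}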

> @@ -1360,7 +1389,7 @@ static int perf_release(struct inode *inode, struct file *file)
>  	put_task_struct(counter->owner);
>  
>  	free_counter(counter);
> -	put_context(ctx);
> +	put_context(ctx);		// XXX use after free?
>  
>  	return 0;
>  }

Don't think so, but I'll have a look.



^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2009-05-25 11:27 UTC | newest]

Thread overview: 32+ messages
2009-05-22  4:17 [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Paul Mackerras
2009-05-22  4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras
2009-05-22  8:16   ` Peter Zijlstra
2009-05-22  9:56     ` Paul Mackerras
2009-05-22 10:08       ` Peter Zijlstra
2009-05-23 12:38         ` Ingo Molnar
2009-05-23 13:06           ` Peter Zijlstra
2009-05-24 23:55           ` Paul Mackerras
2009-05-22  8:32   ` Peter Zijlstra
2009-05-22  8:57     ` Ingo Molnar
2009-05-22  9:02       ` Peter Zijlstra
2009-05-22 10:14         ` Ingo Molnar
2009-05-22  9:29     ` Paul Mackerras
2009-05-22  9:22   ` Peter Zijlstra
2009-05-22  9:42     ` Peter Zijlstra
2009-05-22 10:07       ` Paul Mackerras
2009-05-22 10:05     ` Paul Mackerras
2009-05-22 10:11   ` Ingo Molnar
2009-05-22 10:27   ` [tip:perfcounters/core] perf_counter: Optimize " tip-bot for Paul Mackerras
2009-05-24 11:33     ` Ingo Molnar
2009-05-25  6:18       ` Paul Mackerras
2009-05-25  6:54         ` Ingo Molnar
2009-05-22 10:36   ` [tip:perfcounters/core] perf_counter: fix !PERF_COUNTERS build failure tip-bot for Ingo Molnar
2009-05-22 13:46   ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Peter Zijlstra
2009-05-25  0:15     ` Paul Mackerras
2009-05-25 10:38       ` Peter Zijlstra
2009-05-25 10:50         ` Peter Zijlstra
2009-05-25 11:06         ` Paul Mackerras
2009-05-25 11:27           ` Peter Zijlstra
2009-05-22  8:06 ` [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Peter Zijlstra
2009-05-22  9:30   ` Paul Mackerras
2009-05-22 10:27 ` [tip:perfcounters/core] perf_counter: Dynamically allocate tasks' perf_counter_context struct tip-bot for Paul Mackerras
