LKML Archive on
help / color / mirror / Atom feed
* [PATCH V6 1/6] perf: Save PMU specific data in task_struct
@ 2021-07-20 13:40 kan.liang
  2021-07-20 13:40 ` [PATCH V6 2/6] perf: attach/detach PMU specific data kan.liang
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: kan.liang @ 2021-07-20 13:40 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, linux-kernel
  Cc: eranian, namhyung, ak, Kan Liang

From: Kan Liang <>

Some PMU specific data has to be saved/restored during context switch,
e.g. LBR call stack data. Currently, the data is saved in event context
structure, but only for per-process event. For system-wide event,
because of missing the LBR call stack data after context switch, LBR
callstacks are always shorter in comparison to per-process mode.

For example,
  Per-process mode:
  $perf record --call-graph lbr -- taskset -c 0 ./tchain_edit

  -   99.90%    99.86%  tchain_edit  tchain_edit       [.] f3
       99.86% _start
        - f2

  System-wide mode:
  $perf record --call-graph lbr -a -- taskset -c 0 ./tchain_edit

  -   99.88%    99.82%  tchain_edit  tchain_edit        [.] f3
   - 62.02% main
   - 28.83% f1
      - f2
   - 28.83% f1
      - f2
   - 8.88% generic_start_main

It isn't practical to simply allocate the data for system-wide event in
CPU context structure for all tasks. We have no idea which CPU a task
will be scheduled to. The duplicated LBR data has to be maintained on
every CPU context structure. That's a huge waste. Otherwise, the LBR
data still lost if the task is scheduled to another CPU.

Save the pmu specific data in task_struct. The size of pmu specific data
is 788 bytes for LBR call stack. Usually, the overall amount of threads
doesn't exceed a few thousands. For 10K threads, keeping LBR data would
consume additional ~8MB. The additional space will only be allocated
during LBR call stack monitoring. It will be released when the
monitoring is finished.

Furthermore, moving task_ctx_data from perf_event_context to task_struct
can reduce complexity and make things clearer. E.g. perf doesn't need to
swap task_ctx_data on optimized context switch path.
This patch set is just the first step. There could be other
optimization/extension on top of this patch set. E.g. for cgroup
profiling, perf just needs to save/store the LBR call stack information
for tasks in specific cgroup. That could reduce the additional space.
Also, the LBR call stack can be available for software events, or allow
even debugging use cases, like LBRs on crash later.

The Kmem cache of pmu specific data is saved in struct perf_ctx_data.
It's required when child task allocates the space.
The refcount in struct perf_ctx_data is used to track the users of pmu
specific data.

Reviewed-by: Alexey Budankov <>
Signed-off-by: Kan Liang <>

No changes since V5

The V4 can be found here.

Changes since V4:
- Add global to track system-wide users
- Remove spinlock perf_ctx_data_lock

The V3 can be found here.

Changes since V3:
- Rebase for the Arch LBR. Use Kmem cache to replace the data_size.

Changes since V2:
- Cannot use mutex inside rcu_read_lock().
  Restore the pin lock perf_ctx_data_lock

 include/linux/perf_event.h | 30 ++++++++++++++++++++++++++++++
 include/linux/sched.h      |  2 ++
 kernel/events/core.c       |  1 +
 3 files changed, 33 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f5a6a2f..ece4035d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -851,6 +851,36 @@ struct perf_event_context {
 	struct rcu_head			rcu_head;
+ * struct perf_ctx_data - PMU specific data for a task
+ * @rcu_head:  To avoid the race on free PMU specific data
+ * @refcount:  To track users
+ * @global:    To track system-wide users
+ * @ctx_cache: Kmem cache of PMU specific data
+ * @data:      PMU specific data
+ *
+ * Currently, the struct is only used in Intel LBR call stack mode to
+ * save/restore the call stack of a task on context switches.
+ * The data only be allocated when Intel LBR call stack mode is enabled.
+ * The data will be freed when the mode is disabled. The rcu_head is
+ * used to prevent the race on free the data.
+ * The content of the data will only be accessed in context switch, which
+ * should be protected by rcu_read_lock().
+ *
+ * Careful: Struct perf_ctx_data is added as a pointor in struct task_struct.
+ * When system-wide Intel LBR call stack mode is enabled, a buffer with
+ * constant size will be allocated for each task.
+ * Also, system memory consumption can further grow when the size of
+ * struct perf_ctx_data enlarges.
+ */
+struct perf_ctx_data {
+	struct rcu_head			rcu_head;
+	refcount_t			refcount;
+	int				global;
+	struct kmem_cache		*ctx_cache;
+	void				*data;
  * Number of contexts where an event can trigger:
  *	task, softirq, hardirq, nmi.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2c8813..4b4d746b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -52,6 +52,7 @@ struct mempolicy;
 struct nameidata;
 struct nsproxy;
 struct perf_event_context;
+struct perf_ctx_data;
 struct pid_namespace;
 struct pipe_inode_info;
 struct rcu_node;
@@ -1135,6 +1136,7 @@ struct task_struct {
 	struct perf_event_context	*perf_event_ctxp[perf_nr_task_contexts];
 	struct mutex			perf_event_mutex;
 	struct list_head		perf_event_list;
+	struct perf_ctx_data __rcu	*perf_ctx_data;
 	unsigned long			preempt_disable_ip;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0e125ae..dcdd164 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13142,6 +13142,7 @@ int perf_event_init_task(struct task_struct *child, u64 clone_flags)
 	memset(child->perf_event_ctxp, 0, sizeof(child->perf_event_ctxp));
+	child->perf_ctx_data = NULL;
 	for_each_task_context_nr(ctxn) {
 		ret = perf_event_init_context(child, ctxn, clone_flags);

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-07-20 13:51 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-20 13:40 [PATCH V6 1/6] perf: Save PMU specific data in task_struct kan.liang
2021-07-20 13:40 ` [PATCH V6 2/6] perf: attach/detach PMU specific data kan.liang
2021-07-20 13:40 ` [PATCH V6 3/6] perf: Supply task information to sched_task() kan.liang
2021-07-20 13:40 ` [PATCH V6 4/6] perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode kan.liang
2021-07-20 13:40 ` [PATCH V6 5/6] perf/x86: Remove swap_task_ctx() kan.liang
2021-07-20 13:40 ` [PATCH V6 6/6] perf: Clean up pmu specific data kan.liang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).