LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
@ 2021-03-12  2:02 Song Liu
  2021-03-12  8:36 ` Namhyung Kim
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Song Liu @ 2021-03-12  2:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: kernel-team, acme, acme, namhyung, jolsa, Song Liu

perf uses performance monitoring counters (PMCs) to monitor system
performance. The PMCs are limited hardware resources. For example,
Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways:
system level monitoring, (maybe nested) container level monitoring, per
process monitoring, profiling (in sample mode), etc. In some cases,
there are more active perf_events than available hardware PMCs. To allow
all perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics (cycles,
instructions). It is a waste to have multiple tools create multiple
perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
bperf uses a BPF hashmap to share information about BPF programs and maps
used by bperf. This map is pinned to bpffs. The default address is
/sys/fs/bpf/bperf_attr_map. The user could change the address with option
--attr-map.

---
Known limitations:
1. Do not support per cgroup events;
2. Do not support monitoring of BPF program (perf-stat -b);
3. Do not support event groups.

The following commands have been tested:

   perf stat --use-bpf -e cycles -a
   perf stat --use-bpf -e cycles -C 1,3,4
   perf stat --use-bpf -e cycles -p 123
   perf stat --use-bpf -e cycles -t 100,101

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 tools/perf/Makefile.perf                      |   1 +
 tools/perf/builtin-stat.c                     |  20 +-
 tools/perf/util/bpf_counter.c                 | 552 +++++++++++++++++-
 tools/perf/util/bpf_skel/bperf.h              |  14 +
 tools/perf/util/bpf_skel/bperf_follower.bpf.c |  65 +++
 tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
 tools/perf/util/evsel.h                       |  20 +-
 tools/perf/util/target.h                      |   4 +-
 8 files changed, 712 insertions(+), 10 deletions(-)
 create mode 100644 tools/perf/util/bpf_skel/bperf.h
 create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index f6e609673de2b..ca9aa08e85a1f 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -1007,6 +1007,7 @@ python-clean:
 SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
 SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
 SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
+SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
 
 ifdef BUILD_BPF_SKEL
 BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 2e2e4a8345ea2..34df713a8eea9 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -792,6 +792,12 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	}
 
 	evlist__for_each_cpu (evsel_list, i, cpu) {
+		/*
+		 * bperf calls evsel__open_per_cpu() in bperf__load(), so
+		 * no need to call it again here.
+		 */
+		if (target.use_bpf)
+			break;
 		affinity__set(&affinity, cpu);
 
 		evlist__for_each_entry(evsel_list, counter) {
@@ -925,15 +931,15 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	/*
 	 * Enable counters and exec the command:
 	 */
-	t0 = rdclock();
-	clock_gettime(CLOCK_MONOTONIC, &ref_time);
-
 	if (forks) {
 		evlist__start_workload(evsel_list);
 		err = enable_counters();
 		if (err)
 			return -1;
 
+		t0 = rdclock();
+		clock_gettime(CLOCK_MONOTONIC, &ref_time);
+
 		if (interval || timeout || evlist__ctlfd_initialized(evsel_list))
 			status = dispatch_events(forks, timeout, interval, &times);
 		if (child_pid != -1) {
@@ -954,6 +960,10 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 		err = enable_counters();
 		if (err)
 			return -1;
+
+		t0 = rdclock();
+		clock_gettime(CLOCK_MONOTONIC, &ref_time);
+
 		status = dispatch_events(forks, timeout, interval, &times);
 	}
 
@@ -1146,6 +1156,10 @@ static struct option stat_options[] = {
 #ifdef HAVE_BPF_SKEL
 	OPT_STRING('b', "bpf-prog", &target.bpf_str, "bpf-prog-id",
 		   "stat events on existing bpf program id"),
+	OPT_BOOLEAN(0, "use-bpf", &target.use_bpf,
+		    "use bpf program to count events"),
+	OPT_STRING(0, "attr-map", &target.attr_map, "attr-map-path",
+		   "path to perf_event_attr map"),
 #endif
 	OPT_BOOLEAN('a', "all-cpus", &target.system_wide,
 		    "system-wide collection from all CPUs"),
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 04f89120b3232..d232011ec8f26 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -5,6 +5,7 @@
 #include <assert.h>
 #include <limits.h>
 #include <unistd.h>
+#include <sys/file.h>
 #include <sys/time.h>
 #include <sys/resource.h>
 #include <linux/err.h>
@@ -18,8 +19,37 @@
 #include "debug.h"
 #include "evsel.h"
 #include "target.h"
+#include "cpumap.h"
+#include "thread_map.h"
 
 #include "bpf_skel/bpf_prog_profiler.skel.h"
+#include "bpf_skel/bperf_u.h"
+#include "bpf_skel/bperf_leader.skel.h"
+#include "bpf_skel/bperf_follower.skel.h"
+
+/*
+ * bperf uses a hashmap, the attr_map, to track all the leader programs.
+ * The hashmap is pinned in bpffs. flock() on this file is used to ensure
+ * no concurrent access to the attr_map.  The key of attr_map is struct
+ * perf_event_attr, and the value is struct bperf_attr_map_entry.
+ *
+ * struct bperf_attr_map_entry contains two __u32 IDs, bpf_link of the
+ * leader prog, and the diff_map. Each perf-stat session holds a reference
+ * to the bpf_link to make sure the leader prog is attached to sched_switch
+ * tracepoint.
+ *
+ * Since the hashmap only contains IDs of the bpf_link and diff_map, it
+ * does not hold any references to the leader program. Once all perf-stat
+ * sessions of these events exit, the leader prog, its maps, and the
+ * perf_events will be freed.
+ */
+struct bperf_attr_map_entry {
+	__u32 link_id;
+	__u32 diff_map_id;
+};
+
+#define DEFAULT_ATTR_MAP_PATH "/sys/fs/bpf/bperf_attr_map"
+#define ATTR_MAP_SIZE 16
 
 static inline void *u64_to_ptr(__u64 ptr)
 {
@@ -274,17 +304,529 @@ struct bpf_counter_ops bpf_program_profiler_ops = {
 	.install_pe = bpf_program_profiler__install_pe,
 };
 
+static __u32 bpf_link_get_id(int fd)
+{
+	struct bpf_link_info link_info;
+	__u32 link_info_len;
+
+	link_info_len = sizeof(link_info);
+	memset(&link_info, 0, link_info_len);
+	bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
+	return link_info.id;
+}
+
+static __u32 bpf_link_get_prog_id(int fd)
+{
+	struct bpf_link_info link_info;
+	__u32 link_info_len;
+
+	link_info_len = sizeof(link_info);
+	memset(&link_info, 0, link_info_len);
+	bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
+	return link_info.prog_id;
+}
+
+static __u32 bpf_map_get_id(int fd)
+{
+	struct bpf_map_info map_info;
+	__u32 map_info_len;
+
+	map_info_len = sizeof(map_info);
+	memset(&map_info, 0, map_info_len);
+	bpf_obj_get_info_by_fd(fd, &map_info, &map_info_len);
+	return map_info.id;
+}
+
+static int bperf_lock_attr_map(struct target *target)
+{
+	const char *path = target->attr_map ? : DEFAULT_ATTR_MAP_PATH;
+	int map_fd, err;
+
+	if (access(path, F_OK)) {
+		map_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
+					sizeof(struct perf_event_attr),
+					sizeof(struct bperf_attr_map_entry),
+					ATTR_MAP_SIZE, 0);
+		if (map_fd < 0)
+			return -1;
+
+		err = bpf_obj_pin(map_fd, path);
+		if (err) {
+			/* someone pinned the map in parallel? */
+			close(map_fd);
+			map_fd = bpf_obj_get(path);
+			if (map_fd < 0)
+				return -1;
+		}
+	} else {
+		map_fd = bpf_obj_get(path);
+		if (map_fd < 0)
+			return -1;
+	}
+
+	err = flock(map_fd, LOCK_EX);
+	if (err) {
+		close(map_fd);
+		return -1;
+	}
+	return map_fd;
+}
+
+/* trigger the leader program on a cpu */
+static int bperf_trigger_reading(int prog_fd, int cpu)
+{
+	DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
+			    .ctx_in = NULL,
+			    .ctx_size_in = 0,
+			    .flags = BPF_F_TEST_RUN_ON_CPU,
+			    .cpu = cpu,
+			    .retval = 0,
+		);
+
+	return bpf_prog_test_run_opts(prog_fd, &opts);
+}
+
+static int bperf_check_target(struct evsel *evsel,
+			      struct target *target,
+			      enum bperf_filter_type *filter_type,
+			      __u32 *filter_entry_cnt)
+{
+	if (evsel->leader->core.nr_members > 1) {
+		pr_err("bpf managed perf events do not yet support groups.\n");
+		return -1;
+	}
+
+	/* determine filter type based on target */
+	if (target->system_wide) {
+		*filter_type = BPERF_FILTER_GLOBAL;
+		*filter_entry_cnt = 1;
+	} else if (target->cpu_list) {
+		*filter_type = BPERF_FILTER_CPU;
+		*filter_entry_cnt = perf_cpu_map__nr(evsel__cpus(evsel));
+	} else if (target->tid) {
+		*filter_type = BPERF_FILTER_PID;
+		*filter_entry_cnt = perf_thread_map__nr(evsel->core.threads);
+	} else if (target->pid) {
+		*filter_type = BPERF_FILTER_TGID;
+		*filter_entry_cnt = perf_thread_map__nr(evsel->core.threads);
+	} else {
+		pr_err("bpf managed perf events do not yet support these targets.\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd,
+				       struct bperf_attr_map_entry *entry)
+{
+	struct bperf_leader_bpf *skel = NULL;
+	struct perf_cpu_map *all_cpus;
+	int link_fd, diff_map_fd, err;
+	struct bpf_link *link = NULL;
+
+	skel = bperf_leader_bpf__open();
+	if (!skel) {
+		pr_err("Failed to open leader skeleton\n");
+		return -1;
+	}
+
+	bpf_map__resize(skel->maps.events, libbpf_num_possible_cpus());
+	err = bperf_leader_bpf__load(skel);
+	if (err) {
+		pr_err("Failed to load leader skeleton\n");
+		goto out;
+	}
+
+	err = -1;
+	link = bpf_program__attach(skel->progs.on_switch);
+	if (!link) {
+		pr_err("Failed to attach leader program\n");
+		goto out;
+	}
+
+	all_cpus = perf_cpu_map__new(NULL);
+	if (!all_cpus) {
+		pr_err("Failed to open cpu_map for all cpus\n");
+		goto out;
+	}
+
+	link_fd = bpf_link__fd(link);
+	diff_map_fd = bpf_map__fd(skel->maps.diff_readings);
+	entry->link_id = bpf_link_get_id(link_fd);
+	entry->diff_map_id = bpf_map_get_id(diff_map_fd);
+	err = bpf_map_update_elem(attr_map_fd, &evsel->core.attr, entry, BPF_ANY);
+	assert(err == 0);
+
+	evsel->bperf_leader_link_fd = bpf_link_get_fd_by_id(entry->link_id);
+	assert(evsel->bperf_leader_link_fd >= 0);
+
+	/* save leader_skel for install_pe */
+	evsel->leader_skel = skel;
+	evsel__open_per_cpu(evsel, all_cpus, -1);
+	perf_cpu_map__put(all_cpus);
+
+out:
+	bperf_leader_bpf__destroy(skel);
+	bpf_link__destroy(link);
+	return err;
+}
+
+static int bperf__load(struct evsel *evsel, struct target *target)
+{
+	struct bperf_attr_map_entry entry = {0xffffffff, 0xffffffff};
+	int attr_map_fd, diff_map_fd = -1, err;
+	enum bperf_filter_type filter_type;
+	__u32 filter_entry_cnt, i;
+
+	if (bperf_check_target(evsel, target, &filter_type, &filter_entry_cnt))
+		return -1;
+
+	evsel->bperf_leader_prog_fd = -1;
+	evsel->bperf_leader_link_fd = -1;
+
+	/*
+	 * Step 1: hold a fd on the leader program and the bpf_link, if
+	 * the program is not already gone, reload the program.
+	 * Use flock() to ensure exclusive access to the perf_event_attr
+	 * map.
+	 */
+	attr_map_fd = bperf_lock_attr_map(target);
+	if (attr_map_fd < 0) {
+		pr_err("Failed to lock perf_event_attr map\n");
+		return -1;
+	}
+
+	err = bpf_map_lookup_elem(attr_map_fd, &evsel->core.attr, &entry);
+	if (err) {
+		err = bpf_map_update_elem(attr_map_fd, &evsel->core.attr, &entry, BPF_ANY);
+		if (err)
+			goto out;
+	}
+
+	evsel->bperf_leader_link_fd = bpf_link_get_fd_by_id(entry.link_id);
+	if (evsel->bperf_leader_link_fd < 0 &&
+	    bperf_reload_leader_program(evsel, attr_map_fd, &entry))
+		goto out;
+
+	/*
+	 * The bpf_link holds reference to the leader program, and the
+	 * leader program holds reference to the maps. Therefore, if
+	 * link_id is valid, diff_map_id should also be valid.
+	 */
+	evsel->bperf_leader_prog_fd = bpf_prog_get_fd_by_id(
+		bpf_link_get_prog_id(evsel->bperf_leader_link_fd));
+	assert(evsel->bperf_leader_prog_fd >= 0);
+
+	diff_map_fd = bpf_map_get_fd_by_id(entry.diff_map_id);
+	assert(diff_map_fd >= 0);
+
+	/*
+	 * bperf uses BPF_PROG_TEST_RUN to get accurate reading. Check
+	 * whether the kernel support it
+	 */
+	err = bperf_trigger_reading(evsel->bperf_leader_prog_fd, 0);
+	if (err) {
+		pr_err("The kernel does not support test_run for raw_tp BPF programs.\n"
+		       "Therefore, --use-bpf might show inaccurate readings\n");
+		goto out;
+	}
+
+	/* Step 2: load the follower skeleton */
+	evsel->follower_skel = bperf_follower_bpf__open();
+	if (!evsel->follower_skel) {
+		pr_err("Failed to open follower skeleton\n");
+		goto out;
+	}
+
+	/* attach fexit program to the leader program */
+	bpf_program__set_attach_target(evsel->follower_skel->progs.fexit_XXX,
+				       evsel->bperf_leader_prog_fd, "on_switch");
+
+	/* connect to leader diff_reading map */
+	bpf_map__reuse_fd(evsel->follower_skel->maps.diff_readings, diff_map_fd);
+
+	/* set up reading map */
+	bpf_map__set_max_entries(evsel->follower_skel->maps.accum_readings,
+				 filter_entry_cnt);
+	/* set up follower filter based on target */
+	bpf_map__set_max_entries(evsel->follower_skel->maps.filter,
+				 filter_entry_cnt);
+	err = bperf_follower_bpf__load(evsel->follower_skel);
+	if (err) {
+		pr_err("Failed to load follower skeleton\n");
+		bperf_follower_bpf__destroy(evsel->follower_skel);
+		evsel->follower_skel = NULL;
+		goto out;
+	}
+
+	for (i = 0; i < filter_entry_cnt; i++) {
+		int filter_map_fd;
+		__u32 key;
+
+		if (filter_type == BPERF_FILTER_PID ||
+		    filter_type == BPERF_FILTER_TGID)
+			key = evsel->core.threads->map[i].pid;
+		else if (filter_type == BPERF_FILTER_CPU)
+			key = evsel->core.cpus->map[i];
+		else
+			break;
+
+		filter_map_fd = bpf_map__fd(evsel->follower_skel->maps.filter);
+		bpf_map_update_elem(filter_map_fd, &key, &i, BPF_ANY);
+	}
+
+	evsel->follower_skel->bss->type = filter_type;
+
+out:
+	if (err && evsel->bperf_leader_link_fd >= 0)
+		close(evsel->bperf_leader_link_fd);
+	if (err && evsel->bperf_leader_prog_fd >= 0)
+		close(evsel->bperf_leader_prog_fd);
+	if (diff_map_fd >= 0)
+		close(diff_map_fd);
+
+	flock(attr_map_fd, LOCK_UN);
+	close(attr_map_fd);
+
+	return err;
+}
+
+static int bperf__install_pe(struct evsel *evsel, int cpu, int fd)
+{
+	struct bperf_leader_bpf *skel = evsel->leader_skel;
+
+	return bpf_map_update_elem(bpf_map__fd(skel->maps.events),
+				   &cpu, &fd, BPF_ANY);
+}
+
+/*
+ * trigger the leader prog on each cpu, so the accum_reading map could get
+ * the latest readings.
+ */
+static int bperf_sync_counters(struct evsel *evsel)
+{
+	struct perf_cpu_map *all_cpus = perf_cpu_map__new(NULL);
+	int num_cpu, i, cpu;
+
+	if (!all_cpus)
+		return -1;
+
+	num_cpu = all_cpus->nr;
+	for (i = 0; i < num_cpu; i++) {
+		cpu = all_cpus->map[i];
+		bperf_trigger_reading(evsel->bperf_leader_prog_fd, cpu);
+	}
+	return 0;
+}
+
+static int bperf__enable(struct evsel *evsel)
+{
+	struct bperf_follower_bpf *skel = evsel->follower_skel;
+	__u32 num_cpu_bpf = libbpf_num_possible_cpus();
+	struct bpf_perf_event_value values[num_cpu_bpf];
+	int err, i, entry_cnt;
+
+	err = bperf_follower_bpf__attach(evsel->follower_skel);
+	if (err)
+		return -1;
+
+	/*
+	 * Attahcing the skeleton takes non-trivial time (0.2s+ on a kernel
+	 * with some debug options enabled). This shows as a longer first
+	 * interval:
+	 *
+	 * # perf stat -e cycles -a -I 1000
+	 * #           time             counts unit events
+	 *      1.267634674     26,259,166,523      cycles
+	 *      2.271637827     22,550,822,286      cycles
+	 *      3.275406553     22,852,583,744      cycles
+	 *
+	 * Fix this by zeroing accum_readings after attaching the program.
+	 */
+	bperf_sync_counters(evsel);
+	entry_cnt = bpf_map__max_entries(skel->maps.accum_readings);
+	memset(values, 0, sizeof(struct bpf_perf_event_value) * num_cpu_bpf);
+
+	for (i = 0; i < entry_cnt; i++) {
+		bpf_map_update_elem(bpf_map__fd(skel->maps.accum_readings),
+				    &i, values, BPF_ANY);
+	}
+	return 0;
+}
+
+static int bperf__read(struct evsel *evsel)
+{
+	struct bperf_follower_bpf *skel = evsel->follower_skel;
+	__u32 num_cpu_bpf = libbpf_num_possible_cpus();
+	struct bpf_perf_event_value values[num_cpu_bpf];
+	static struct perf_cpu_map *all_cpus;
+	int reading_map_fd, err = 0;
+	__u32 i, j, num_cpu;
+
+	if (!all_cpus) {
+		all_cpus = perf_cpu_map__new(NULL);
+		if (!all_cpus)
+			return -1;
+	}
+
+	bperf_sync_counters(evsel);
+	reading_map_fd = bpf_map__fd(skel->maps.accum_readings);
+
+	for (i = 0; i < bpf_map__max_entries(skel->maps.accum_readings); i++) {
+		__u32 cpu;
+
+		err = bpf_map_lookup_elem(reading_map_fd, &i, values);
+		if (err)
+			goto out;
+		switch (evsel->follower_skel->bss->type) {
+		case BPERF_FILTER_GLOBAL:
+			assert(i == 0);
+
+			num_cpu = all_cpus->nr;
+			for (j = 0; j < num_cpu; j++) {
+				cpu = all_cpus->map[j];
+				perf_counts(evsel->counts, cpu, 0)->val = values[cpu].counter;
+				perf_counts(evsel->counts, cpu, 0)->ena = values[cpu].enabled;
+				perf_counts(evsel->counts, cpu, 0)->run = values[cpu].running;
+			}
+			break;
+		case BPERF_FILTER_CPU:
+			cpu = evsel->core.cpus->map[i];
+			perf_counts(evsel->counts, i, 0)->val = values[cpu].counter;
+			perf_counts(evsel->counts, i, 0)->ena = values[cpu].enabled;
+			perf_counts(evsel->counts, i, 0)->run = values[cpu].running;
+			break;
+		case BPERF_FILTER_PID:
+		case BPERF_FILTER_TGID:
+			perf_counts(evsel->counts, 0, i)->val = 0;
+			perf_counts(evsel->counts, 0, i)->ena = 0;
+			perf_counts(evsel->counts, 0, i)->run = 0;
+
+			for (cpu = 0; cpu < num_cpu_bpf; cpu++) {
+				perf_counts(evsel->counts, 0, i)->val += values[cpu].counter;
+				perf_counts(evsel->counts, 0, i)->ena += values[cpu].enabled;
+				perf_counts(evsel->counts, 0, i)->run += values[cpu].running;
+			}
+			break;
+		default:
+			break;
+		}
+	}
+out:
+	return err;
+}
+
+static int bperf__destroy(struct evsel *evsel)
+{
+	bperf_follower_bpf__destroy(evsel->follower_skel);
+	close(evsel->bperf_leader_prog_fd);
+	close(evsel->bperf_leader_link_fd);
+	return 0;
+}
+
+/*
+ * bperf: share hardware PMCs with BPF
+ *
+ * perf uses performance monitoring counters (PMC) to monitor system
+ * performance. The PMCs are limited hardware resources. For example,
+ * Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
+ *
+ * Modern data center systems use these PMCs in many different ways:
+ * system level monitoring, (maybe nested) container level monitoring, per
+ * process monitoring, profiling (in sample mode), etc. In some cases,
+ * there are more active perf_events than available hardware PMCs. To allow
+ * all perf_events to have a chance to run, it is necessary to do expensive
+ * time multiplexing of events.
+ *
+ * On the other hand, many monitoring tools count the common metrics
+ * (cycles, instructions). It is a waste to have multiple tools create
+ * multiple perf_events of "cycles" and occupy multiple PMCs.
+ *
+ * bperf tries to reduce such wastes by allowing multiple perf_events of
+ * "cycles" or "instructions" (at different scopes) to share PMUs. Instead
+ * of having each perf-stat session to read its own perf_events, bperf uses
+ * BPF programs to read the perf_events and aggregate readings to BPF maps.
+ * Then, the perf-stat session(s) reads the values from these BPF maps.
+ *
+ *                                ||
+ *       shared progs and maps <- || -> per session progs and maps
+ *                                ||
+ *   ---------------              ||
+ *   | perf_events |              ||
+ *   ---------------       fexit  ||      -----------------
+ *          |             --------||----> | follower prog |
+ *       --------------- /        || ---  -----------------
+ * cs -> | leader prog |/         ||/        |         |
+ *   --> ---------------         /||  --------------  ------------------
+ *  /       |         |         / ||  | filter map |  | accum_readings |
+ * /  ------------  ------------  ||  --------------  ------------------
+ * |  | prev map |  | diff map |  ||                        |
+ * |  ------------  ------------  ||                        |
+ *  \                             ||                        |
+ * = \ ==================================================== | ============
+ *    \                                                    /   user space
+ *     \                                                  /
+ *      \                                                /
+ *    BPF_PROG_TEST_RUN                    BPF_MAP_LOOKUP_ELEM
+ *        \                                            /
+ *         \                                          /
+ *          \------  perf-stat ----------------------/
+ *
+ * The figure above shows the architecture of bperf. Note that the figure
+ * is divided into 3 regions: shared progs and maps (top left), per session
+ * progs and maps (top right), and user space (bottom).
+ *
+ * The leader prog is triggered on each context switch (cs). The leader
+ * prog reads perf_events and stores the difference (current_reading -
+ * previous_reading) to the diff map. For the same metric, e.g. "cycles",
+ * multiple perf-stat sessions share the same leader prog.
+ *
+ * Each perf-stat session creates a follower prog as fexit program to the
+ * leader prog. It is possible to attach up to BPF_MAX_TRAMP_PROGS (38)
+ * follower progs to the same leader prog. The follower prog checks current
+ * task and processor ID to decide whether to add the value from the diff
+ * map to its accumulated reading map (accum_readings).
+ *
+ * Finally, perf-stat user space reads the value from accum_reading map.
+ *
+ * Besides context switch, it is also necessary to trigger the leader prog
+ * before perf-stat reads the value. Otherwise, the accum_reading map may
+ * not have the latest reading from the perf_events. This is achieved by
+ * triggering the event via sys_bpf(BPF_PROG_TEST_RUN) to each CPU.
+ *
+ * Comment before the definition of struct bperf_attr_map_entry describes
+ * how different sessions of perf-stat share information about the leader
+ * prog.
+ */
+
+struct bpf_counter_ops bperf_ops = {
+	.load       = bperf__load,
+	.enable     = bperf__enable,
+	.read       = bperf__read,
+	.install_pe = bperf__install_pe,
+	.destroy    = bperf__destroy,
+};
+
+static inline bool bpf_counter_skip(struct evsel *evsel)
+{
+	return list_empty(&evsel->bpf_counter_list) &&
+		evsel->follower_skel == NULL;
+}
+
 int bpf_counter__install_pe(struct evsel *evsel, int cpu, int fd)
 {
-	if (list_empty(&evsel->bpf_counter_list))
+	if (bpf_counter_skip(evsel))
 		return 0;
 	return evsel->bpf_counter_ops->install_pe(evsel, cpu, fd);
 }
 
 int bpf_counter__load(struct evsel *evsel, struct target *target)
 {
-	if (target__has_bpf(target))
+	if (target->bpf_str)
 		evsel->bpf_counter_ops = &bpf_program_profiler_ops;
+	else if (target->use_bpf)
+		evsel->bpf_counter_ops = &bperf_ops;
 
 	if (evsel->bpf_counter_ops)
 		return evsel->bpf_counter_ops->load(evsel, target);
@@ -293,21 +835,21 @@ int bpf_counter__load(struct evsel *evsel, struct target *target)
 
 int bpf_counter__enable(struct evsel *evsel)
 {
-	if (list_empty(&evsel->bpf_counter_list))
+	if (bpf_counter_skip(evsel))
 		return 0;
 	return evsel->bpf_counter_ops->enable(evsel);
 }
 
 int bpf_counter__read(struct evsel *evsel)
 {
-	if (list_empty(&evsel->bpf_counter_list))
+	if (bpf_counter_skip(evsel))
 		return -EAGAIN;
 	return evsel->bpf_counter_ops->read(evsel);
 }
 
 void bpf_counter__destroy(struct evsel *evsel)
 {
-	if (list_empty(&evsel->bpf_counter_list))
+	if (bpf_counter_skip(evsel))
 		return;
 	evsel->bpf_counter_ops->destroy(evsel);
 	evsel->bpf_counter_ops = NULL;
diff --git a/tools/perf/util/bpf_skel/bperf.h b/tools/perf/util/bpf_skel/bperf.h
new file mode 100644
index 0000000000000..186a5551ddb9d
--- /dev/null
+++ b/tools/perf/util/bpf_skel/bperf.h
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+// Copyright (c) 2021 Facebook
+
+#ifndef __BPERF_STAT_H
+#define __BPERF_STAT_H
+
+typedef struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct bpf_perf_event_value));
+	__uint(max_entries, 1);
+} reading_map;
+
+#endif /* __BPERF_STAT_H */
diff --git a/tools/perf/util/bpf_skel/bperf_follower.bpf.c b/tools/perf/util/bpf_skel/bperf_follower.bpf.c
new file mode 100644
index 0000000000000..7535d9557ab9a
--- /dev/null
+++ b/tools/perf/util/bpf_skel/bperf_follower.bpf.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+// Copyright (c) 2021 Facebook
+#include <linux/bpf.h>
+#include <linux/perf_event.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bperf.h"
+#include "bperf_u.h"
+
+reading_map diff_readings SEC(".maps");
+reading_map accum_readings SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} filter SEC(".maps");
+
+enum bperf_filter_type type = 0;
+
+SEC("fexit/XXX")
+int BPF_PROG(fexit_XXX)
+{
+	struct bpf_perf_event_value *diff_val, *accum_val;
+	__u32 filter_key, zero = 0;
+	__u32 *accum_key;
+
+	switch (type) {
+	case BPERF_FILTER_GLOBAL:
+		accum_key = &zero;
+		goto do_add;
+	case BPERF_FILTER_CPU:
+		filter_key = bpf_get_smp_processor_id();
+		break;
+	case BPERF_FILTER_PID:
+		filter_key = bpf_get_current_pid_tgid() & 0xffffffff;
+		break;
+	case BPERF_FILTER_TGID:
+		filter_key = bpf_get_current_pid_tgid() >> 32;
+		break;
+	default:
+		return 0;
+	}
+
+	accum_key = bpf_map_lookup_elem(&filter, &filter_key);
+	if (!accum_key)
+		return 0;
+
+do_add:
+	diff_val = bpf_map_lookup_elem(&diff_readings, &zero);
+	if (!diff_val)
+		return 0;
+
+	accum_val = bpf_map_lookup_elem(&accum_readings, accum_key);
+	if (!accum_val)
+		return 0;
+
+	accum_val->counter += diff_val->counter;
+	accum_val->enabled += diff_val->enabled;
+	accum_val->running += diff_val->running;
+
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
diff --git a/tools/perf/util/bpf_skel/bperf_leader.bpf.c b/tools/perf/util/bpf_skel/bperf_leader.bpf.c
new file mode 100644
index 0000000000000..4f70d1459e86c
--- /dev/null
+++ b/tools/perf/util/bpf_skel/bperf_leader.bpf.c
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+// Copyright (c) 2021 Facebook
+#include <linux/bpf.h>
+#include <linux/perf_event.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bperf.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(int));
+	__uint(map_flags, BPF_F_PRESERVE_ELEMS);
+} events SEC(".maps");
+
+reading_map prev_readings SEC(".maps");
+reading_map diff_readings SEC(".maps");
+
+SEC("raw_tp/sched_switch")
+int BPF_PROG(on_switch)
+{
+	struct bpf_perf_event_value val, *prev_val, *diff_val;
+	__u32 key = bpf_get_smp_processor_id();
+	__u32 zero = 0;
+	long err;
+
+	prev_val = bpf_map_lookup_elem(&prev_readings, &zero);
+	if (!prev_val)
+		return 0;
+
+	diff_val = bpf_map_lookup_elem(&diff_readings, &zero);
+	if (!diff_val)
+		return 0;
+
+	err = bpf_perf_event_read_value(&events, key, &val, sizeof(val));
+	if (err)
+		return 0;
+
+	diff_val->counter = val.counter - prev_val->counter;
+	diff_val->enabled = val.enabled - prev_val->enabled;
+	diff_val->running = val.running - prev_val->running;
+	*prev_val = val;
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 6026487353dd8..dd4f56f9cfdf5 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -20,6 +20,8 @@ union perf_event;
 struct bpf_counter_ops;
 struct target;
 struct hashmap;
+struct bperf_leader_bpf;
+struct bperf_follower_bpf;
 
 typedef int (evsel__sb_cb_t)(union perf_event *event, void *data);
 
@@ -130,8 +132,24 @@ struct evsel {
 	 * See also evsel__has_callchain().
 	 */
 	__u64			synth_sample_type;
-	struct list_head	bpf_counter_list;
+
+	/*
+	 * bpf_counter_ops serves two use cases:
+	 *   1. perf-stat -b          counting events used byBPF programs
+	 *   2. perf-stat --use-bpf   use BPF programs to aggregate counts
+	 */
 	struct bpf_counter_ops	*bpf_counter_ops;
+
+	/* for perf-stat -b */
+	struct list_head	bpf_counter_list;
+
+	/* for perf-stat --use-bpf */
+	int			bperf_leader_prog_fd;
+	int			bperf_leader_link_fd;
+	union {
+		struct bperf_leader_bpf *leader_skel;
+		struct bperf_follower_bpf *follower_skel;
+	};
 };
 
 struct perf_missing_features {
diff --git a/tools/perf/util/target.h b/tools/perf/util/target.h
index f132c6c2eef81..1bce3eb28ef25 100644
--- a/tools/perf/util/target.h
+++ b/tools/perf/util/target.h
@@ -16,6 +16,8 @@ struct target {
 	bool	     uses_mmap;
 	bool	     default_per_cpu;
 	bool	     per_thread;
+	bool	     use_bpf;
+	const char   *attr_map;
 };
 
 enum target_errno {
@@ -66,7 +68,7 @@ static inline bool target__has_cpu(struct target *target)
 
 static inline bool target__has_bpf(struct target *target)
 {
-	return target->bpf_str;
+	return target->bpf_str || target->use_bpf;
 }
 
 static inline bool target__none(struct target *target)
-- 
2.24.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-12  2:02 [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF Song Liu
@ 2021-03-12  8:36 ` Namhyung Kim
  2021-03-12 15:38   ` Song Liu
  2021-03-12 12:12 ` Jiri Olsa
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Namhyung Kim @ 2021-03-12  8:36 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, kernel-team, Arnaldo Carvalho de Melo,
	Arnaldo Carvalho de Melo, Jiri Olsa

Hi,

On Fri, Mar 12, 2021 at 11:03 AM Song Liu <songliubraving@fb.com> wrote:
>
> perf uses performance monitoring counters (PMCs) to monitor system
> performance. The PMCs are limited hardware resources. For example,
> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>
> Modern data center systems use these PMCs in many different ways:
> system level monitoring, (maybe nested) container level monitoring, per
> process monitoring, profiling (in sample mode), etc. In some cases,
> there are more active perf_events than available hardware PMCs. To allow
> all perf_events to have a chance to run, it is necessary to do expensive
> time multiplexing of events.
>
> On the other hand, many monitoring tools count the common metrics (cycles,
> instructions). It is a waste to have multiple tools create multiple
> perf_events of "cycles" and occupy multiple PMCs.
>
> bperf tries to reduce such wastes by allowing multiple perf_events of
> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> of having each perf-stat session to read its own perf_events, bperf uses
> BPF programs to read the perf_events and aggregate readings to BPF maps.
> Then, the perf-stat session(s) reads the values from these BPF maps.
>
> Please refer to the comment before the definition of bperf_ops for the
> description of bperf architecture.

Interesting!  Actually I thought about something similar before,
but my BPF knowledge is outdated.  So I need to catch up but
failed to have some time for it so far. ;-)

>
> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
> bperf uses a BPF hashmap to share information about BPF programs and maps
> used by bperf. This map is pinned to bpffs. The default address is
> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
> --attr-map.
>
> ---
> Known limitations:
> 1. Do not support per cgroup events;
> 2. Do not support monitoring of BPF program (perf-stat -b);
> 3. Do not support event groups.

In my case, per cgroup event counting is very important.
And I'd like to do that with lots of cpus and cgroups.
So I'm working on an in-kernel solution (without BPF),
I hope to share it soon.

And for event groups, it seems the current implementation
cannot handle more than one event (not even in a group).
That could be a serious limitation..

>
> The following commands have been tested:
>
>    perf stat --use-bpf -e cycles -a
>    perf stat --use-bpf -e cycles -C 1,3,4
>    perf stat --use-bpf -e cycles -p 123
>    perf stat --use-bpf -e cycles -t 100,101

Hmm... so it loads both leader and follower programs if needed, right?
Does it support multiple followers with different targets at the same time?

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-12  2:02 [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF Song Liu
  2021-03-12  8:36 ` Namhyung Kim
@ 2021-03-12 12:12 ` Jiri Olsa
  2021-03-12 15:45   ` Song Liu
  2021-03-12 14:24 ` Arnaldo Carvalho de Melo
  2021-03-14 22:48 ` Jiri Olsa
  3 siblings, 1 reply; 21+ messages in thread
From: Jiri Olsa @ 2021-03-12 12:12 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-kernel, kernel-team, acme, acme, namhyung, jolsa

On Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu wrote:
> perf uses performance monitoring counters (PMCs) to monitor system
> performance. The PMCs are limited hardware resources. For example,
> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> 
> Modern data center systems use these PMCs in many different ways:
> system level monitoring, (maybe nested) container level monitoring, per
> process monitoring, profiling (in sample mode), etc. In some cases,
> there are more active perf_events than available hardware PMCs. To allow
> all perf_events to have a chance to run, it is necessary to do expensive
> time multiplexing of events.
> 
> On the other hand, many monitoring tools count the common metrics (cycles,
> instructions). It is a waste to have multiple tools create multiple
> perf_events of "cycles" and occupy multiple PMCs.
> 
> bperf tries to reduce such wastes by allowing multiple perf_events of
> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> of having each perf-stat session to read its own perf_events, bperf uses
> BPF programs to read the perf_events and aggregate readings to BPF maps.
> Then, the perf-stat session(s) reads the values from these BPF maps.
> 
> Please refer to the comment before the definition of bperf_ops for the
> description of bperf architecture.
> 
> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
> bperf uses a BPF hashmap to share information about BPF programs and maps
> used by bperf. This map is pinned to bpffs. The default address is
> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
> --attr-map.

nice, I recall the presentation about that and was wondering
when this will come up ;-)

> 
> ---
> Known limitations:
> 1. Do not support per cgroup events;
> 2. Do not support monitoring of BPF program (perf-stat -b);
> 3. Do not support event groups.
> 
> The following commands have been tested:
> 
>    perf stat --use-bpf -e cycles -a
>    perf stat --use-bpf -e cycles -C 1,3,4
>    perf stat --use-bpf -e cycles -p 123
>    perf stat --use-bpf -e cycles -t 100,101

I assume the output is same as standard perf?

jirka


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-12  2:02 [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF Song Liu
  2021-03-12  8:36 ` Namhyung Kim
  2021-03-12 12:12 ` Jiri Olsa
@ 2021-03-12 14:24 ` Arnaldo Carvalho de Melo
  2021-03-12 16:07   ` Song Liu
  2021-03-12 18:52   ` Song Liu
  2021-03-14 22:48 ` Jiri Olsa
  3 siblings, 2 replies; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-12 14:24 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, kernel-team, acme, namhyung, jolsa, linux-perf-users

Em Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu escreveu:
> perf uses performance monitoring counters (PMCs) to monitor system
> performance. The PMCs are limited hardware resources. For example,
> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> 
> Modern data center systems use these PMCs in many different ways:
> system level monitoring, (maybe nested) container level monitoring, per
> process monitoring, profiling (in sample mode), etc. In some cases,
> there are more active perf_events than available hardware PMCs. To allow
> all perf_events to have a chance to run, it is necessary to do expensive
> time multiplexing of events.
> 
> On the other hand, many monitoring tools count the common metrics (cycles,
> instructions). It is a waste to have multiple tools create multiple
> perf_events of "cycles" and occupy multiple PMCs.
> 
> bperf tries to reduce such wastes by allowing multiple perf_events of
> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> of having each perf-stat session to read its own perf_events, bperf uses
> BPF programs to read the perf_events and aggregate readings to BPF maps.
> Then, the perf-stat session(s) reads the values from these BPF maps.
> 
> Please refer to the comment before the definition of bperf_ops for the
> description of bperf architecture.
> 
> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
> bperf uses a BPF hashmap to share information about BPF programs and maps
> used by bperf. This map is pinned to bpffs. The default address is
> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
> --attr-map.
> 
> ---
> Known limitations:
> 1. Do not support per cgroup events;
> 2. Do not support monitoring of BPF program (perf-stat -b);
> 3. Do not support event groups.

Cool stuff, but I think you can break this up into more self contained
patches, see below.

Apart from that, some suggestions/requests:

We need a shell 'perf test' that uses some synthetic workload so that we
can count events with/without --use-bpf (--bpf-counters is my
alternative name, see below), and then compare if the difference is
under some acceptable range.

As a followup patch we could have something like:

 perf config stat.bpf-counters=yes

That would make 'perf stat' use BPF counters for what it can, using the
default method for the non-supported targets, emitting some 'perf stat
-v' visible warning (i.e. a debug message), i.e. make it opt-in that the
user wants to use BPF counters for all that is possible at that point in
time.o

Thanks for working on this,

- Arnaldo
 
> The following commands have been tested:
> 
>    perf stat --use-bpf -e cycles -a
>    perf stat --use-bpf -e cycles -C 1,3,4
>    perf stat --use-bpf -e cycles -p 123
>    perf stat --use-bpf -e cycles -t 100,101
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  tools/perf/Makefile.perf                      |   1 +
>  tools/perf/builtin-stat.c                     |  20 +-
>  tools/perf/util/bpf_counter.c                 | 552 +++++++++++++++++-
>  tools/perf/util/bpf_skel/bperf.h              |  14 +
>  tools/perf/util/bpf_skel/bperf_follower.bpf.c |  65 +++
>  tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
>  tools/perf/util/evsel.h                       |  20 +-
>  tools/perf/util/target.h                      |   4 +-
>  8 files changed, 712 insertions(+), 10 deletions(-)
>  create mode 100644 tools/perf/util/bpf_skel/bperf.h
>  create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
>  create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
> 
> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> index f6e609673de2b..ca9aa08e85a1f 100644
> --- a/tools/perf/Makefile.perf
> +++ b/tools/perf/Makefile.perf
> @@ -1007,6 +1007,7 @@ python-clean:
>  SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
>  SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
>  SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
> +SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
>  
>  ifdef BUILD_BPF_SKEL
>  BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 2e2e4a8345ea2..34df713a8eea9 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -792,6 +792,12 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>  	}
>  
>  	evlist__for_each_cpu (evsel_list, i, cpu) {
> +		/*
> +		 * bperf calls evsel__open_per_cpu() in bperf__load(), so
> +		 * no need to call it again here.
> +		 */
> +		if (target.use_bpf)
> +			break;
>  		affinity__set(&affinity, cpu);
>  
>  		evlist__for_each_entry(evsel_list, counter) {
> @@ -925,15 +931,15 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>  	/*
>  	 * Enable counters and exec the command:
>  	 */
> -	t0 = rdclock();
> -	clock_gettime(CLOCK_MONOTONIC, &ref_time);
> -
>  	if (forks) {
>  		evlist__start_workload(evsel_list);
>  		err = enable_counters();
>  		if (err)
>  			return -1;
>  
> +		t0 = rdclock();
> +		clock_gettime(CLOCK_MONOTONIC, &ref_time);
> +
>  		if (interval || timeout || evlist__ctlfd_initialized(evsel_list))
>  			status = dispatch_events(forks, timeout, interval, &times);
>  		if (child_pid != -1) {
> @@ -954,6 +960,10 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>  		err = enable_counters();
>  		if (err)
>  			return -1;
> +
> +		t0 = rdclock();
> +		clock_gettime(CLOCK_MONOTONIC, &ref_time);
> +
>  		status = dispatch_events(forks, timeout, interval, &times);
>  	}
>  

The above two hunks seems out of place, i.e. can they go to a different
patch and with an explanation about why this is needed?

> @@ -1146,6 +1156,10 @@ static struct option stat_options[] = {
>  #ifdef HAVE_BPF_SKEL
>  	OPT_STRING('b', "bpf-prog", &target.bpf_str, "bpf-prog-id",
>  		   "stat events on existing bpf program id"),
> +	OPT_BOOLEAN(0, "use-bpf", &target.use_bpf,
> +		    "use bpf program to count events"),
> +	OPT_STRING(0, "attr-map", &target.attr_map, "attr-map-path",
> +		   "path to perf_event_attr map"),

These need to be documented in tools/perf/Documentation/perf-stat.txt

Also please rename it to:

  --bpf-counters
  --bpf-attr-map

Andrii suggested prefixing with --bpf BPF related options in pahole, I
think this applies here as well.

>  #endif
>  	OPT_BOOLEAN('a', "all-cpus", &target.system_wide,
>  		    "system-wide collection from all CPUs"),
> diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
> index 04f89120b3232..d232011ec8f26 100644
> --- a/tools/perf/util/bpf_counter.c
> +++ b/tools/perf/util/bpf_counter.c
> @@ -5,6 +5,7 @@
>  #include <assert.h>
>  #include <limits.h>
>  #include <unistd.h>
> +#include <sys/file.h>
>  #include <sys/time.h>
>  #include <sys/resource.h>
>  #include <linux/err.h>
> @@ -18,8 +19,37 @@
>  #include "debug.h"
>  #include "evsel.h"
>  #include "target.h"
> +#include "cpumap.h"
> +#include "thread_map.h"
>  
>  #include "bpf_skel/bpf_prog_profiler.skel.h"
> +#include "bpf_skel/bperf_u.h"
> +#include "bpf_skel/bperf_leader.skel.h"
> +#include "bpf_skel/bperf_follower.skel.h"
> +
> +/*
> + * bperf uses a hashmap, the attr_map, to track all the leader programs.
> + * The hashmap is pinned in bpffs. flock() on this file is used to ensure
> + * no concurrent access to the attr_map.  The key of attr_map is struct
> + * perf_event_attr, and the value is struct bperf_attr_map_entry.
> + *
> + * struct bperf_attr_map_entry contains two __u32 IDs, bpf_link of the
> + * leader prog, and the diff_map. Each perf-stat session holds a reference
> + * to the bpf_link to make sure the leader prog is attached to sched_switch
> + * tracepoint.
> + *
> + * Since the hashmap only contains IDs of the bpf_link and diff_map, it
> + * does not hold any references to the leader program. Once all perf-stat
> + * sessions of these events exit, the leader prog, its maps, and the
> + * perf_events will be freed.
> + */
> +struct bperf_attr_map_entry {
> +	__u32 link_id;
> +	__u32 diff_map_id;
> +};
> +
> +#define DEFAULT_ATTR_MAP_PATH "/sys/fs/bpf/bperf_attr_map"

Humm bpf is already in the parent directory, perhaps we can have it as
just /sys/fs/bpf/perf_attr_map? Here 'counter' isn't applicable, I
guess, eventually we may want to use this for other purposes? ;-)

> +#define ATTR_MAP_SIZE 16
>  
>  static inline void *u64_to_ptr(__u64 ptr)
>  {
> @@ -274,17 +304,529 @@ struct bpf_counter_ops bpf_program_profiler_ops = {
>  	.install_pe = bpf_program_profiler__install_pe,
>  };
>  
> +static __u32 bpf_link_get_id(int fd)
> +{
> +	struct bpf_link_info link_info;
> +	__u32 link_info_len;
> +
> +	link_info_len = sizeof(link_info);

Can you combine declaration with attribution to make the code more
compact? i.e.:

	__u32 link_info_len = sizeof(link_info);


> +	memset(&link_info, 0, link_info_len);
> +	bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
> +	return link_info.id;
> +}
> +
> +static __u32 bpf_link_get_prog_id(int fd)
> +{
> +	struct bpf_link_info link_info;
> +	__u32 link_info_len;
> +
> +	link_info_len = sizeof(link_info);

Ditto

> +	memset(&link_info, 0, link_info_len);
> +	bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
> +	return link_info.prog_id;
> +}
> +
> +static __u32 bpf_map_get_id(int fd)
> +{
> +	struct bpf_map_info map_info;
> +	__u32 map_info_len;

Ditto.

> +	map_info_len = sizeof(map_info);
> +	memset(&map_info, 0, map_info_len);
> +	bpf_obj_get_info_by_fd(fd, &map_info, &map_info_len);
> +	return map_info.id;
> +}
> +
> +static int bperf_lock_attr_map(struct target *target)
> +{
> +	const char *path = target->attr_map ? : DEFAULT_ATTR_MAP_PATH;
> +	int map_fd, err;
> +
> +	if (access(path, F_OK)) {
> +		map_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
> +					sizeof(struct perf_event_attr),
> +					sizeof(struct bperf_attr_map_entry),

Also naming it as perf_event_attr_map_entry should make the equivalence
more explicit if albeit a bit longer, i.e.:

+		map_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
+					sizeof(struct perf_event_attr),
+					sizeof(struct perf_event_attr_map_entry),

> +					ATTR_MAP_SIZE, 0);
> +		if (map_fd < 0)
> +			return -1;
> +
> +		err = bpf_obj_pin(map_fd, path);
> +		if (err) {
> +			/* someone pinned the map in parallel? */
> +			close(map_fd);
> +			map_fd = bpf_obj_get(path);
> +			if (map_fd < 0)
> +				return -1;
> +		}
> +	} else {
> +		map_fd = bpf_obj_get(path);
> +		if (map_fd < 0)
> +			return -1;
> +	}
> +
> +	err = flock(map_fd, LOCK_EX);
> +	if (err) {
> +		close(map_fd);
> +		return -1;
> +	}
> +	return map_fd;
> +}
> +
> +/* trigger the leader program on a cpu */
> +static int bperf_trigger_reading(int prog_fd, int cpu)
> +{
> +	DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
> +			    .ctx_in = NULL,
> +			    .ctx_size_in = 0,
> +			    .flags = BPF_F_TEST_RUN_ON_CPU,
> +			    .cpu = cpu,
> +			    .retval = 0,
> +		);
> +
> +	return bpf_prog_test_run_opts(prog_fd, &opts);
> +}
> +
> +static int bperf_check_target(struct evsel *evsel,
> +			      struct target *target,
> +			      enum bperf_filter_type *filter_type,
> +			      __u32 *filter_entry_cnt)
> +{
> +	if (evsel->leader->core.nr_members > 1) {
> +		pr_err("bpf managed perf events do not yet support groups.\n");
> +		return -1;
> +	}
> +
> +	/* determine filter type based on target */
> +	if (target->system_wide) {
> +		*filter_type = BPERF_FILTER_GLOBAL;
> +		*filter_entry_cnt = 1;
> +	} else if (target->cpu_list) {
> +		*filter_type = BPERF_FILTER_CPU;
> +		*filter_entry_cnt = perf_cpu_map__nr(evsel__cpus(evsel));
> +	} else if (target->tid) {
> +		*filter_type = BPERF_FILTER_PID;
> +		*filter_entry_cnt = perf_thread_map__nr(evsel->core.threads);
> +	} else if (target->pid) {
> +		*filter_type = BPERF_FILTER_TGID;
> +		*filter_entry_cnt = perf_thread_map__nr(evsel->core.threads);
> +	} else {
> +		pr_err("bpf managed perf events do not yet support these targets.\n");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd,
> +				       struct bperf_attr_map_entry *entry)
> +{
> +	struct bperf_leader_bpf *skel = NULL;

Init it to NULL to then right away set it to the return of
bperf_leader_bpf__open?

see below

> +	struct perf_cpu_map *all_cpus;
> +	int link_fd, diff_map_fd, err;
> +	struct bpf_link *link = NULL;
> +
> +	skel = bperf_leader_bpf__open();

	struct bperf_leader_bpf *skel = bperf_leader_bpf__open();

> +	if (!skel) {
> +		pr_err("Failed to open leader skeleton\n");
> +		return -1;
> +	}
> +
> +	bpf_map__resize(skel->maps.events, libbpf_num_possible_cpus());
> +	err = bperf_leader_bpf__load(skel);
> +	if (err) {
> +		pr_err("Failed to load leader skeleton\n");
> +		goto out;
> +	}
> +
> +	err = -1;
> +	link = bpf_program__attach(skel->progs.on_switch);
> +	if (!link) {
> +		pr_err("Failed to attach leader program\n");
> +		goto out;
> +	}
> +
> +	all_cpus = perf_cpu_map__new(NULL);
> +	if (!all_cpus) {
> +		pr_err("Failed to open cpu_map for all cpus\n");
> +		goto out;
> +	}

If you reorder things maybe you can determine the number of possible
cpus from all_cpus?

> +
> +	link_fd = bpf_link__fd(link);
> +	diff_map_fd = bpf_map__fd(skel->maps.diff_readings);
> +	entry->link_id = bpf_link_get_id(link_fd);
> +	entry->diff_map_id = bpf_map_get_id(diff_map_fd);
> +	err = bpf_map_update_elem(attr_map_fd, &evsel->core.attr, entry, BPF_ANY);
> +	assert(err == 0);
> +
> +	evsel->bperf_leader_link_fd = bpf_link_get_fd_by_id(entry->link_id);
> +	assert(evsel->bperf_leader_link_fd >= 0);
> +
> +	/* save leader_skel for install_pe */
> +	evsel->leader_skel = skel;
> +	evsel__open_per_cpu(evsel, all_cpus, -1);
> +	perf_cpu_map__put(all_cpus);
> +
> +out:
> +	bperf_leader_bpf__destroy(skel);
> +	bpf_link__destroy(link);
> +	return err;
> +}
> +
> +static int bperf__load(struct evsel *evsel, struct target *target)
> +{
> +	struct bperf_attr_map_entry entry = {0xffffffff, 0xffffffff};
> +	int attr_map_fd, diff_map_fd = -1, err;
> +	enum bperf_filter_type filter_type;
> +	__u32 filter_entry_cnt, i;
> +
> +	if (bperf_check_target(evsel, target, &filter_type, &filter_entry_cnt))
> +		return -1;
> +
> +	evsel->bperf_leader_prog_fd = -1;
> +	evsel->bperf_leader_link_fd = -1;
> +
> +	/*
> +	 * Step 1: hold a fd on the leader program and the bpf_link, if
> +	 * the program is not already gone, reload the program.
> +	 * Use flock() to ensure exclusive access to the perf_event_attr
> +	 * map.
> +	 */
> +	attr_map_fd = bperf_lock_attr_map(target);
> +	if (attr_map_fd < 0) {
> +		pr_err("Failed to lock perf_event_attr map\n");
> +		return -1;
> +	}
> +
> +	err = bpf_map_lookup_elem(attr_map_fd, &evsel->core.attr, &entry);
> +	if (err) {
> +		err = bpf_map_update_elem(attr_map_fd, &evsel->core.attr, &entry, BPF_ANY);
> +		if (err)
> +			goto out;
> +	}
> +
> +	evsel->bperf_leader_link_fd = bpf_link_get_fd_by_id(entry.link_id);
> +	if (evsel->bperf_leader_link_fd < 0 &&
> +	    bperf_reload_leader_program(evsel, attr_map_fd, &entry))
> +		goto out;
> +
> +	/*
> +	 * The bpf_link holds reference to the leader program, and the
> +	 * leader program holds reference to the maps. Therefore, if
> +	 * link_id is valid, diff_map_id should also be valid.
> +	 */
> +	evsel->bperf_leader_prog_fd = bpf_prog_get_fd_by_id(
> +		bpf_link_get_prog_id(evsel->bperf_leader_link_fd));
> +	assert(evsel->bperf_leader_prog_fd >= 0);
> +
> +	diff_map_fd = bpf_map_get_fd_by_id(entry.diff_map_id);
> +	assert(diff_map_fd >= 0);
> +
> +	/*
> +	 * bperf uses BPF_PROG_TEST_RUN to get accurate reading. Check
> +	 * whether the kernel support it
> +	 */
> +	err = bperf_trigger_reading(evsel->bperf_leader_prog_fd, 0);
> +	if (err) {
> +		pr_err("The kernel does not support test_run for raw_tp BPF programs.\n"
> +		       "Therefore, --use-bpf might show inaccurate readings\n");
> +		goto out;
> +	}
> +
> +	/* Step 2: load the follower skeleton */
> +	evsel->follower_skel = bperf_follower_bpf__open();
> +	if (!evsel->follower_skel) {
> +		pr_err("Failed to open follower skeleton\n");
> +		goto out;
> +	}
> +
> +	/* attach fexit program to the leader program */
> +	bpf_program__set_attach_target(evsel->follower_skel->progs.fexit_XXX,
> +				       evsel->bperf_leader_prog_fd, "on_switch");
> +
> +	/* connect to leader diff_reading map */
> +	bpf_map__reuse_fd(evsel->follower_skel->maps.diff_readings, diff_map_fd);
> +
> +	/* set up reading map */
> +	bpf_map__set_max_entries(evsel->follower_skel->maps.accum_readings,
> +				 filter_entry_cnt);
> +	/* set up follower filter based on target */
> +	bpf_map__set_max_entries(evsel->follower_skel->maps.filter,
> +				 filter_entry_cnt);
> +	err = bperf_follower_bpf__load(evsel->follower_skel);
> +	if (err) {
> +		pr_err("Failed to load follower skeleton\n");
> +		bperf_follower_bpf__destroy(evsel->follower_skel);
> +		evsel->follower_skel = NULL;
> +		goto out;
> +	}
> +
> +	for (i = 0; i < filter_entry_cnt; i++) {
> +		int filter_map_fd;
> +		__u32 key;
> +
> +		if (filter_type == BPERF_FILTER_PID ||
> +		    filter_type == BPERF_FILTER_TGID)
> +			key = evsel->core.threads->map[i].pid;
> +		else if (filter_type == BPERF_FILTER_CPU)
> +			key = evsel->core.cpus->map[i];
> +		else
> +			break;
> +
> +		filter_map_fd = bpf_map__fd(evsel->follower_skel->maps.filter);
> +		bpf_map_update_elem(filter_map_fd, &key, &i, BPF_ANY);
> +	}
> +
> +	evsel->follower_skel->bss->type = filter_type;
> +
> +out:
> +	if (err && evsel->bperf_leader_link_fd >= 0)
> +		close(evsel->bperf_leader_link_fd);
> +	if (err && evsel->bperf_leader_prog_fd >= 0)
> +		close(evsel->bperf_leader_prog_fd);
> +	if (diff_map_fd >= 0)
> +		close(diff_map_fd);
> +
> +	flock(attr_map_fd, LOCK_UN);
> +	close(attr_map_fd);
> +
> +	return err;
> +}
> +
> +static int bperf__install_pe(struct evsel *evsel, int cpu, int fd)
> +{
> +	struct bperf_leader_bpf *skel = evsel->leader_skel;
> +
> +	return bpf_map_update_elem(bpf_map__fd(skel->maps.events),
> +				   &cpu, &fd, BPF_ANY);
> +}
> +
> +/*
> + * trigger the leader prog on each cpu, so the accum_reading map could get
> + * the latest readings.
> + */
> +static int bperf_sync_counters(struct evsel *evsel)
> +{
> +	struct perf_cpu_map *all_cpus = perf_cpu_map__new(NULL);
> +	int num_cpu, i, cpu;
> +
> +	if (!all_cpus)
> +		return -1;
> +
> +	num_cpu = all_cpus->nr;
> +	for (i = 0; i < num_cpu; i++) {
> +		cpu = all_cpus->map[i];
> +		bperf_trigger_reading(evsel->bperf_leader_prog_fd, cpu);
> +	}
> +	return 0;
> +}
> +
> +static int bperf__enable(struct evsel *evsel)
> +{
> +	struct bperf_follower_bpf *skel = evsel->follower_skel;
> +	__u32 num_cpu_bpf = libbpf_num_possible_cpus();
> +	struct bpf_perf_event_value values[num_cpu_bpf];
> +	int err, i, entry_cnt;
> +
> +	err = bperf_follower_bpf__attach(evsel->follower_skel);
> +	if (err)
> +		return -1;
> +
> +	/*
> +	 * Attahcing the skeleton takes non-trivial time (0.2s+ on a kernel

Attaching

> +	 * with some debug options enabled). This shows as a longer first
> +	 * interval:
> +	 *
> +	 * # perf stat -e cycles -a -I 1000
> +	 * #           time             counts unit events
> +	 *      1.267634674     26,259,166,523      cycles
> +	 *      2.271637827     22,550,822,286      cycles
> +	 *      3.275406553     22,852,583,744      cycles
> +	 *
> +	 * Fix this by zeroing accum_readings after attaching the program.
> +	 */
> +	bperf_sync_counters(evsel);
> +	entry_cnt = bpf_map__max_entries(skel->maps.accum_readings);
> +	memset(values, 0, sizeof(struct bpf_perf_event_value) * num_cpu_bpf);
> +
> +	for (i = 0; i < entry_cnt; i++) {
> +		bpf_map_update_elem(bpf_map__fd(skel->maps.accum_readings),
> +				    &i, values, BPF_ANY);
> +	}
> +	return 0;
> +}
> +
> +static int bperf__read(struct evsel *evsel)
> +{
> +	struct bperf_follower_bpf *skel = evsel->follower_skel;
> +	__u32 num_cpu_bpf = libbpf_num_possible_cpus();
> +	struct bpf_perf_event_value values[num_cpu_bpf];
> +	static struct perf_cpu_map *all_cpus;
> +	int reading_map_fd, err = 0;
> +	__u32 i, j, num_cpu;
> +
> +	if (!all_cpus) {
> +		all_cpus = perf_cpu_map__new(NULL);
> +		if (!all_cpus)
> +			return -1;
> +	}
> +
> +	bperf_sync_counters(evsel);
> +	reading_map_fd = bpf_map__fd(skel->maps.accum_readings);
> +
> +	for (i = 0; i < bpf_map__max_entries(skel->maps.accum_readings); i++) {
> +		__u32 cpu;
> +
> +		err = bpf_map_lookup_elem(reading_map_fd, &i, values);
> +		if (err)
> +			goto out;
> +		switch (evsel->follower_skel->bss->type) {
> +		case BPERF_FILTER_GLOBAL:
> +			assert(i == 0);
> +
> +			num_cpu = all_cpus->nr;
> +			for (j = 0; j < num_cpu; j++) {
> +				cpu = all_cpus->map[j];
> +				perf_counts(evsel->counts, cpu, 0)->val = values[cpu].counter;
> +				perf_counts(evsel->counts, cpu, 0)->ena = values[cpu].enabled;
> +				perf_counts(evsel->counts, cpu, 0)->run = values[cpu].running;
> +			}
> +			break;
> +		case BPERF_FILTER_CPU:
> +			cpu = evsel->core.cpus->map[i];
> +			perf_counts(evsel->counts, i, 0)->val = values[cpu].counter;
> +			perf_counts(evsel->counts, i, 0)->ena = values[cpu].enabled;
> +			perf_counts(evsel->counts, i, 0)->run = values[cpu].running;
> +			break;
> +		case BPERF_FILTER_PID:
> +		case BPERF_FILTER_TGID:
> +			perf_counts(evsel->counts, 0, i)->val = 0;
> +			perf_counts(evsel->counts, 0, i)->ena = 0;
> +			perf_counts(evsel->counts, 0, i)->run = 0;
> +
> +			for (cpu = 0; cpu < num_cpu_bpf; cpu++) {
> +				perf_counts(evsel->counts, 0, i)->val += values[cpu].counter;
> +				perf_counts(evsel->counts, 0, i)->ena += values[cpu].enabled;
> +				perf_counts(evsel->counts, 0, i)->run += values[cpu].running;
> +			}
> +			break;
> +		default:
> +			break;
> +		}
> +	}
> +out:
> +	return err;
> +}
> +
> +static int bperf__destroy(struct evsel *evsel)
> +{
> +	bperf_follower_bpf__destroy(evsel->follower_skel);
> +	close(evsel->bperf_leader_prog_fd);
> +	close(evsel->bperf_leader_link_fd);
> +	return 0;
> +}
> +
> +/*
> + * bperf: share hardware PMCs with BPF
> + *
> + * perf uses performance monitoring counters (PMC) to monitor system
> + * performance. The PMCs are limited hardware resources. For example,
> + * Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> + *
> + * Modern data center systems use these PMCs in many different ways:
> + * system level monitoring, (maybe nested) container level monitoring, per
> + * process monitoring, profiling (in sample mode), etc. In some cases,
> + * there are more active perf_events than available hardware PMCs. To allow
> + * all perf_events to have a chance to run, it is necessary to do expensive
> + * time multiplexing of events.
> + *
> + * On the other hand, many monitoring tools count the common metrics
> + * (cycles, instructions). It is a waste to have multiple tools create
> + * multiple perf_events of "cycles" and occupy multiple PMCs.
> + *
> + * bperf tries to reduce such wastes by allowing multiple perf_events of
> + * "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> + * of having each perf-stat session to read its own perf_events, bperf uses
> + * BPF programs to read the perf_events and aggregate readings to BPF maps.
> + * Then, the perf-stat session(s) reads the values from these BPF maps.
> + *
> + *                                ||
> + *       shared progs and maps <- || -> per session progs and maps
> + *                                ||
> + *   ---------------              ||
> + *   | perf_events |              ||
> + *   ---------------       fexit  ||      -----------------
> + *          |             --------||----> | follower prog |
> + *       --------------- /        || ---  -----------------
> + * cs -> | leader prog |/         ||/        |         |
> + *   --> ---------------         /||  --------------  ------------------
> + *  /       |         |         / ||  | filter map |  | accum_readings |
> + * /  ------------  ------------  ||  --------------  ------------------
> + * |  | prev map |  | diff map |  ||                        |
> + * |  ------------  ------------  ||                        |
> + *  \                             ||                        |
> + * = \ ==================================================== | ============
> + *    \                                                    /   user space
> + *     \                                                  /
> + *      \                                                /
> + *    BPF_PROG_TEST_RUN                    BPF_MAP_LOOKUP_ELEM
> + *        \                                            /
> + *         \                                          /
> + *          \------  perf-stat ----------------------/
> + *
> + * The figure above shows the architecture of bperf. Note that the figure
> + * is divided into 3 regions: shared progs and maps (top left), per session
> + * progs and maps (top right), and user space (bottom).
> + *
> + * The leader prog is triggered on each context switch (cs). The leader
> + * prog reads perf_events and stores the difference (current_reading -
> + * previous_reading) to the diff map. For the same metric, e.g. "cycles",
> + * multiple perf-stat sessions share the same leader prog.
> + *
> + * Each perf-stat session creates a follower prog as fexit program to the
> + * leader prog. It is possible to attach up to BPF_MAX_TRAMP_PROGS (38)
> + * follower progs to the same leader prog. The follower prog checks current
> + * task and processor ID to decide whether to add the value from the diff
> + * map to its accumulated reading map (accum_readings).
> + *
> + * Finally, perf-stat user space reads the value from accum_reading map.
> + *
> + * Besides context switch, it is also necessary to trigger the leader prog
> + * before perf-stat reads the value. Otherwise, the accum_reading map may
> + * not have the latest reading from the perf_events. This is achieved by
> + * triggering the event via sys_bpf(BPF_PROG_TEST_RUN) to each CPU.
> + *
> + * Comment before the definition of struct bperf_attr_map_entry describes
> + * how different sessions of perf-stat share information about the leader
> + * prog.
> + */
> +
> +struct bpf_counter_ops bperf_ops = {
> +	.load       = bperf__load,
> +	.enable     = bperf__enable,
> +	.read       = bperf__read,
> +	.install_pe = bperf__install_pe,
> +	.destroy    = bperf__destroy,
> +};
> +
> +static inline bool bpf_counter_skip(struct evsel *evsel)
> +{
> +	return list_empty(&evsel->bpf_counter_list) &&
> +		evsel->follower_skel == NULL;
> +}
> +
>  int bpf_counter__install_pe(struct evsel *evsel, int cpu, int fd)
>  {
> -	if (list_empty(&evsel->bpf_counter_list))
> +	if (bpf_counter_skip(evsel))
>  		return 0;
>  	return evsel->bpf_counter_ops->install_pe(evsel, cpu, fd);
>  }
>  
>  int bpf_counter__load(struct evsel *evsel, struct target *target)
>  {
> -	if (target__has_bpf(target))
> +	if (target->bpf_str)
>  		evsel->bpf_counter_ops = &bpf_program_profiler_ops;
> +	else if (target->use_bpf)
> +		evsel->bpf_counter_ops = &bperf_ops;
>  
>  	if (evsel->bpf_counter_ops)
>  		return evsel->bpf_counter_ops->load(evsel, target);
> @@ -293,21 +835,21 @@ int bpf_counter__load(struct evsel *evsel, struct target *target)
>  
>  int bpf_counter__enable(struct evsel *evsel)
>  {
> -	if (list_empty(&evsel->bpf_counter_list))
> +	if (bpf_counter_skip(evsel))
>  		return 0;
>  	return evsel->bpf_counter_ops->enable(evsel);
>  }
>  
>  int bpf_counter__read(struct evsel *evsel)
>  {
> -	if (list_empty(&evsel->bpf_counter_list))
> +	if (bpf_counter_skip(evsel))
>  		return -EAGAIN;
>  	return evsel->bpf_counter_ops->read(evsel);
>  }
>  
>  void bpf_counter__destroy(struct evsel *evsel)
>  {
> -	if (list_empty(&evsel->bpf_counter_list))
> +	if (bpf_counter_skip(evsel))
>  		return;
>  	evsel->bpf_counter_ops->destroy(evsel);
>  	evsel->bpf_counter_ops = NULL;
> diff --git a/tools/perf/util/bpf_skel/bperf.h b/tools/perf/util/bpf_skel/bperf.h
> new file mode 100644
> index 0000000000000..186a5551ddb9d
> --- /dev/null
> +++ b/tools/perf/util/bpf_skel/bperf.h
> @@ -0,0 +1,14 @@
> +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +// Copyright (c) 2021 Facebook
> +
> +#ifndef __BPERF_STAT_H
> +#define __BPERF_STAT_H
> +
> +typedef struct {
> +	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> +	__uint(key_size, sizeof(__u32));
> +	__uint(value_size, sizeof(struct bpf_perf_event_value));
> +	__uint(max_entries, 1);
> +} reading_map;
> +
> +#endif /* __BPERF_STAT_H */
> diff --git a/tools/perf/util/bpf_skel/bperf_follower.bpf.c b/tools/perf/util/bpf_skel/bperf_follower.bpf.c
> new file mode 100644
> index 0000000000000..7535d9557ab9a
> --- /dev/null
> +++ b/tools/perf/util/bpf_skel/bperf_follower.bpf.c
> @@ -0,0 +1,65 @@
> +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +// Copyright (c) 2021 Facebook
> +#include <linux/bpf.h>
> +#include <linux/perf_event.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include "bperf.h"
> +#include "bperf_u.h"
> +
> +reading_map diff_readings SEC(".maps");
> +reading_map accum_readings SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(key_size, sizeof(__u32));
> +	__uint(value_size, sizeof(__u32));
> +} filter SEC(".maps");
> +
> +enum bperf_filter_type type = 0;
> +
> +SEC("fexit/XXX")
> +int BPF_PROG(fexit_XXX)
> +{
> +	struct bpf_perf_event_value *diff_val, *accum_val;
> +	__u32 filter_key, zero = 0;
> +	__u32 *accum_key;
> +
> +	switch (type) {
> +	case BPERF_FILTER_GLOBAL:
> +		accum_key = &zero;
> +		goto do_add;
> +	case BPERF_FILTER_CPU:
> +		filter_key = bpf_get_smp_processor_id();
> +		break;
> +	case BPERF_FILTER_PID:
> +		filter_key = bpf_get_current_pid_tgid() & 0xffffffff;
> +		break;
> +	case BPERF_FILTER_TGID:
> +		filter_key = bpf_get_current_pid_tgid() >> 32;
> +		break;
> +	default:
> +		return 0;
> +	}
> +
> +	accum_key = bpf_map_lookup_elem(&filter, &filter_key);
> +	if (!accum_key)
> +		return 0;
> +
> +do_add:
> +	diff_val = bpf_map_lookup_elem(&diff_readings, &zero);
> +	if (!diff_val)
> +		return 0;
> +
> +	accum_val = bpf_map_lookup_elem(&accum_readings, accum_key);
> +	if (!accum_val)
> +		return 0;
> +
> +	accum_val->counter += diff_val->counter;
> +	accum_val->enabled += diff_val->enabled;
> +	accum_val->running += diff_val->running;
> +
> +	return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> diff --git a/tools/perf/util/bpf_skel/bperf_leader.bpf.c b/tools/perf/util/bpf_skel/bperf_leader.bpf.c
> new file mode 100644
> index 0000000000000..4f70d1459e86c
> --- /dev/null
> +++ b/tools/perf/util/bpf_skel/bperf_leader.bpf.c
> @@ -0,0 +1,46 @@
> +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +// Copyright (c) 2021 Facebook
> +#include <linux/bpf.h>
> +#include <linux/perf_event.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include "bperf.h"
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
> +	__uint(key_size, sizeof(__u32));
> +	__uint(value_size, sizeof(int));
> +	__uint(map_flags, BPF_F_PRESERVE_ELEMS);
> +} events SEC(".maps");
> +
> +reading_map prev_readings SEC(".maps");
> +reading_map diff_readings SEC(".maps");
> +
> +SEC("raw_tp/sched_switch")
> +int BPF_PROG(on_switch)
> +{
> +	struct bpf_perf_event_value val, *prev_val, *diff_val;
> +	__u32 key = bpf_get_smp_processor_id();
> +	__u32 zero = 0;
> +	long err;
> +
> +	prev_val = bpf_map_lookup_elem(&prev_readings, &zero);
> +	if (!prev_val)
> +		return 0;
> +
> +	diff_val = bpf_map_lookup_elem(&diff_readings, &zero);
> +	if (!diff_val)
> +		return 0;
> +
> +	err = bpf_perf_event_read_value(&events, key, &val, sizeof(val));
> +	if (err)
> +		return 0;
> +
> +	diff_val->counter = val.counter - prev_val->counter;
> +	diff_val->enabled = val.enabled - prev_val->enabled;
> +	diff_val->running = val.running - prev_val->running;
> +	*prev_val = val;
> +	return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index 6026487353dd8..dd4f56f9cfdf5 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -20,6 +20,8 @@ union perf_event;
>  struct bpf_counter_ops;
>  struct target;
>  struct hashmap;
> +struct bperf_leader_bpf;
> +struct bperf_follower_bpf;
>  
>  typedef int (evsel__sb_cb_t)(union perf_event *event, void *data);
>  
> @@ -130,8 +132,24 @@ struct evsel {
>  	 * See also evsel__has_callchain().
>  	 */
>  	__u64			synth_sample_type;
> -	struct list_head	bpf_counter_list;
> +
> +	/*
> +	 * bpf_counter_ops serves two use cases:
> +	 *   1. perf-stat -b          counting events used byBPF programs
> +	 *   2. perf-stat --use-bpf   use BPF programs to aggregate counts
> +	 */
>  	struct bpf_counter_ops	*bpf_counter_ops;
> +
> +	/* for perf-stat -b */
> +	struct list_head	bpf_counter_list;
> +
> +	/* for perf-stat --use-bpf */
> +	int			bperf_leader_prog_fd;
> +	int			bperf_leader_link_fd;
> +	union {
> +		struct bperf_leader_bpf *leader_skel;
> +		struct bperf_follower_bpf *follower_skel;
> +	};
>  };
>  
>  struct perf_missing_features {
> diff --git a/tools/perf/util/target.h b/tools/perf/util/target.h
> index f132c6c2eef81..1bce3eb28ef25 100644
> --- a/tools/perf/util/target.h
> +++ b/tools/perf/util/target.h
> @@ -16,6 +16,8 @@ struct target {
>  	bool	     uses_mmap;
>  	bool	     default_per_cpu;
>  	bool	     per_thread;
> +	bool	     use_bpf;
> +	const char   *attr_map;
>  };
>  
>  enum target_errno {
> @@ -66,7 +68,7 @@ static inline bool target__has_cpu(struct target *target)
>  
>  static inline bool target__has_bpf(struct target *target)
>  {
> -	return target->bpf_str;
> +	return target->bpf_str || target->use_bpf;
>  }
>  
>  static inline bool target__none(struct target *target)
> -- 
> 2.24.1
> 

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-12  8:36 ` Namhyung Kim
@ 2021-03-12 15:38   ` Song Liu
  2021-03-13  2:47     ` Namhyung Kim
  0 siblings, 1 reply; 21+ messages in thread
From: Song Liu @ 2021-03-12 15:38 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: linux-kernel, Kernel Team, Arnaldo Carvalho de Melo,
	Arnaldo Carvalho de Melo, Jiri Olsa



> On Mar 12, 2021, at 12:36 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> Hi,
> 
> On Fri, Mar 12, 2021 at 11:03 AM Song Liu <songliubraving@fb.com> wrote:
>> 
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>> 
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>> 
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>> 
>> bperf tries to reduce such wastes by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>> of having each perf-stat session to read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>> Then, the perf-stat session(s) reads the values from these BPF maps.
>> 
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
> 
> Interesting!  Actually I thought about something similar before,
> but my BPF knowledge is outdated.  So I need to catch up but
> failed to have some time for it so far. ;-)
> 
>> 
>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>> bperf uses a BPF hashmap to share information about BPF programs and maps
>> used by bperf. This map is pinned to bpffs. The default address is
>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>> --attr-map.
>> 
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
>> 2. Do not support monitoring of BPF program (perf-stat -b);
>> 3. Do not support event groups.
> 
> In my case, per cgroup event counting is very important.
> And I'd like to do that with lots of cpus and cgroups.

We can easily extend this approach to support cgroups events. I didn't 
implement it to keep the first version simple. 

> So I'm working on an in-kernel solution (without BPF),
> I hope to share it soon.

This is interesting! I cannot wait to see how it looks like. I spent
quite some time try to enable in kernel sharing (not just cgroup
events), but finally decided to try BPF approach. 

> 
> And for event groups, it seems the current implementation
> cannot handle more than one event (not even in a group).
> That could be a serious limitation..

It supports multiple events. Multiple events are independent, i.e.,
"cycles" and "instructions" would use two independent leader programs.

> 
>> 
>> The following commands have been tested:
>> 
>>   perf stat --use-bpf -e cycles -a
>>   perf stat --use-bpf -e cycles -C 1,3,4
>>   perf stat --use-bpf -e cycles -p 123
>>   perf stat --use-bpf -e cycles -t 100,101
> 
> Hmm... so it loads both leader and follower programs if needed, right?
> Does it support multiple followers with different targets at the same time?

Yes, the whole idea is to have one leader program and multiple follower
programs. If we only run one of these commands at a time, it will load 
one leader and one follower. If we run multiple of them in parallel, 
they will share the same leader program and load multiple follower 
programs. 

I actually tested more than the commands above. The list actually means
we support -a, -C -p, and -t. 

Currently, this works for multiple events, and different parallel 
perf-stat. The two commands below will work well in parallel:
  
  perf stat --use-bpf -e ref-cycles,instructions -a
  perf stat --use-bpf -e ref-cycles,cycles -C 1,3,5

Note the use of ref-cycles, which can only use one counter on Intel CPUs.
With this approach, the above two commands will not do time multiplexing
on ref-cycles. 

Thanks,
Song

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-12 12:12 ` Jiri Olsa
@ 2021-03-12 15:45   ` Song Liu
  2021-03-12 16:09     ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Song Liu @ 2021-03-12 15:45 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: linux-kernel, Kernel Team, acme, acme, namhyung, jolsa



> On Mar 12, 2021, at 4:12 AM, Jiri Olsa <jolsa@redhat.com> wrote:
> 
> On Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu wrote:
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>> 
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>> 
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>> 
>> bperf tries to reduce such wastes by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>> of having each perf-stat session to read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>> Then, the perf-stat session(s) reads the values from these BPF maps.
>> 
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
>> 
>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>> bperf uses a BPF hashmap to share information about BPF programs and maps
>> used by bperf. This map is pinned to bpffs. The default address is
>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>> --attr-map.
> 
> nice, I recall the presentation about that and was wondering
> when this will come up ;-)

The progress is slower than I expected. But I finished some dependencies of 
this in the last year: 

  1. BPF_PROG_TEST_RUN for raw_tp event;
  2. perf-stat -b, which introduced skeleton and bpf_counter;
  3. BPF task local storage, I didn't use it in this version, but it could,
     help optimize bperf in the future. 

> 
>> 
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
>> 2. Do not support monitoring of BPF program (perf-stat -b);
>> 3. Do not support event groups.
>> 
>> The following commands have been tested:
>> 
>>   perf stat --use-bpf -e cycles -a
>>   perf stat --use-bpf -e cycles -C 1,3,4
>>   perf stat --use-bpf -e cycles -p 123
>>   perf stat --use-bpf -e cycles -t 100,101
> 
> I assume the output is same as standard perf?

Yes, the output is identical to that without --use-bpf option. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-12 14:24 ` Arnaldo Carvalho de Melo
@ 2021-03-12 16:07   ` Song Liu
  2021-03-12 17:37     ` Arnaldo Carvalho de Melo
  2021-03-12 18:52   ` Song Liu
  1 sibling, 1 reply; 21+ messages in thread
From: Song Liu @ 2021-03-12 16:07 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: linux-kernel, Kernel Team, acme, namhyung, jolsa, linux-perf-users



> On Mar 12, 2021, at 6:24 AM, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> 
> Em Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu escreveu:
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>> 
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>> 
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>> 
>> bperf tries to reduce such wastes by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>> of having each perf-stat session to read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>> Then, the perf-stat session(s) reads the values from these BPF maps.
>> 
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
>> 
>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>> bperf uses a BPF hashmap to share information about BPF programs and maps
>> used by bperf. This map is pinned to bpffs. The default address is
>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>> --attr-map.
>> 
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
>> 2. Do not support monitoring of BPF program (perf-stat -b);
>> 3. Do not support event groups.
> 
> Cool stuff, but I think you can break this up into more self contained
> patches, see below.
> 
> Apart from that, some suggestions/requests:
> 
> We need a shell 'perf test' that uses some synthetic workload so that we
> can count events with/without --use-bpf (--bpf-counters is my
> alternative name, see below), and then compare if the difference is
> under some acceptable range.

Yes, "perf test" makes sense. Would this be the extension of current 
perf-test command? Or a new set of tests?

> 
> As a followup patch we could have something like:
> 
> perf config stat.bpf-counters=yes
> 
> That would make 'perf stat' use BPF counters for what it can, using the
> default method for the non-supported targets, emitting some 'perf stat
> -v' visible warning (i.e. a debug message), i.e. make it opt-in that the
> user wants to use BPF counters for all that is possible at that point in
> time.o

Yes, the fallback mechanism will be very helpful. I also have ideas on
setting a list for "common events", and only use BPF for the common 
events. Not common events should just use the original method. 

> 
> Thanks for working on this,
> 
> - Arnaldo
> 
>> The following commands have been tested:
>> 
>>   perf stat --use-bpf -e cycles -a
>>   perf stat --use-bpf -e cycles -C 1,3,4
>>   perf stat --use-bpf -e cycles -p 123
>>   perf stat --use-bpf -e cycles -t 100,101
>> 
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
>> tools/perf/Makefile.perf                      |   1 +
>> tools/perf/builtin-stat.c                     |  20 +-
>> tools/perf/util/bpf_counter.c                 | 552 +++++++++++++++++-
>> tools/perf/util/bpf_skel/bperf.h              |  14 +
>> tools/perf/util/bpf_skel/bperf_follower.bpf.c |  65 +++
>> tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
>> tools/perf/util/evsel.h                       |  20 +-
>> tools/perf/util/target.h                      |   4 +-
>> 8 files changed, 712 insertions(+), 10 deletions(-)
>> create mode 100644 tools/perf/util/bpf_skel/bperf.h
>> create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
>> create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
>> 
>> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
>> index f6e609673de2b..ca9aa08e85a1f 100644
>> --- a/tools/perf/Makefile.perf
>> +++ b/tools/perf/Makefile.perf
>> @@ -1007,6 +1007,7 @@ python-clean:
>> SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
>> SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
>> SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
>> +SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
>> 
>> ifdef BUILD_BPF_SKEL
>> BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 2e2e4a8345ea2..34df713a8eea9 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -792,6 +792,12 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>> 	}
>> 
>> 	evlist__for_each_cpu (evsel_list, i, cpu) {
>> +		/*
>> +		 * bperf calls evsel__open_per_cpu() in bperf__load(), so
>> +		 * no need to call it again here.
>> +		 */
>> +		if (target.use_bpf)
>> +			break;
>> 		affinity__set(&affinity, cpu);
>> 
>> 		evlist__for_each_entry(evsel_list, counter) {
>> @@ -925,15 +931,15 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>> 	/*
>> 	 * Enable counters and exec the command:
>> 	 */
>> -	t0 = rdclock();
>> -	clock_gettime(CLOCK_MONOTONIC, &ref_time);
>> -
>> 	if (forks) {
>> 		evlist__start_workload(evsel_list);
>> 		err = enable_counters();
>> 		if (err)
>> 			return -1;
>> 
>> +		t0 = rdclock();
>> +		clock_gettime(CLOCK_MONOTONIC, &ref_time);
>> +
>> 		if (interval || timeout || evlist__ctlfd_initialized(evsel_list))
>> 			status = dispatch_events(forks, timeout, interval, &times);
>> 		if (child_pid != -1) {
>> @@ -954,6 +960,10 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>> 		err = enable_counters();
>> 		if (err)
>> 			return -1;
>> +
>> +		t0 = rdclock();
>> +		clock_gettime(CLOCK_MONOTONIC, &ref_time);
>> +
>> 		status = dispatch_events(forks, timeout, interval, &times);
>> 	}
>> 
> 
> The above two hunks seems out of place, i.e. can they go to a different
> patch and with an explanation about why this is needed?

Yeah, make sense. I will split them to a separate patch. 

> 
>> @@ -1146,6 +1156,10 @@ static struct option stat_options[] = {
>> #ifdef HAVE_BPF_SKEL
>> 	OPT_STRING('b', "bpf-prog", &target.bpf_str, "bpf-prog-id",
>> 		   "stat events on existing bpf program id"),
>> +	OPT_BOOLEAN(0, "use-bpf", &target.use_bpf,
>> +		    "use bpf program to count events"),
>> +	OPT_STRING(0, "attr-map", &target.attr_map, "attr-map-path",
>> +		   "path to perf_event_attr map"),
> 
> These need to be documented in tools/perf/Documentation/perf-stat.txt
> 
> Also please rename it to:
> 
>  --bpf-counters
>  --bpf-attr-map
> 
> Andrii suggested prefixing with --bpf BPF related options in pahole, I
> think this applies here as well.

Will change it in v2. 

> 
>> #endif
>> 	OPT_BOOLEAN('a', "all-cpus", &target.system_wide,
>> 		    "system-wide collection from all CPUs"),
>> diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
>> index 04f89120b3232..d232011ec8f26 100644
>> --- a/tools/perf/util/bpf_counter.c
>> +++ b/tools/perf/util/bpf_counter.c
>> @@ -5,6 +5,7 @@
>> #include <assert.h>
>> #include <limits.h>
>> #include <unistd.h>
>> +#include <sys/file.h>
>> #include <sys/time.h>
>> #include <sys/resource.h>
>> #include <linux/err.h>
>> @@ -18,8 +19,37 @@
>> #include "debug.h"
>> #include "evsel.h"
>> #include "target.h"
>> +#include "cpumap.h"
>> +#include "thread_map.h"
>> 
>> #include "bpf_skel/bpf_prog_profiler.skel.h"
>> +#include "bpf_skel/bperf_u.h"
>> +#include "bpf_skel/bperf_leader.skel.h"
>> +#include "bpf_skel/bperf_follower.skel.h"
>> +
>> +/*
>> + * bperf uses a hashmap, the attr_map, to track all the leader programs.
>> + * The hashmap is pinned in bpffs. flock() on this file is used to ensure
>> + * no concurrent access to the attr_map.  The key of attr_map is struct
>> + * perf_event_attr, and the value is struct bperf_attr_map_entry.
>> + *
>> + * struct bperf_attr_map_entry contains two __u32 IDs, bpf_link of the
>> + * leader prog, and the diff_map. Each perf-stat session holds a reference
>> + * to the bpf_link to make sure the leader prog is attached to sched_switch
>> + * tracepoint.
>> + *
>> + * Since the hashmap only contains IDs of the bpf_link and diff_map, it
>> + * does not hold any references to the leader program. Once all perf-stat
>> + * sessions of these events exit, the leader prog, its maps, and the
>> + * perf_events will be freed.
>> + */
>> +struct bperf_attr_map_entry {
>> +	__u32 link_id;
>> +	__u32 diff_map_id;
>> +};
>> +
>> +#define DEFAULT_ATTR_MAP_PATH "/sys/fs/bpf/bperf_attr_map"
> 
> Humm bpf is already in the parent directory, perhaps we can have it as
> just /sys/fs/bpf/perf_attr_map? Here 'counter' isn't applicable, I
> guess, eventually we may want to use this for other purposes? ;-)

Do you mean sharing PMCs with sampling events? Well, I do have it on my
list for 2022. ;-)

> 
>> +#define ATTR_MAP_SIZE 16
>> 
>> static inline void *u64_to_ptr(__u64 ptr)
>> {
>> @@ -274,17 +304,529 @@ struct bpf_counter_ops bpf_program_profiler_ops = {
>> 	.install_pe = bpf_program_profiler__install_pe,
>> };
>> 
>> +static __u32 bpf_link_get_id(int fd)
>> +{
>> +	struct bpf_link_info link_info;
>> +	__u32 link_info_len;
>> +
>> +	link_info_len = sizeof(link_info);
> 
> Can you combine declaration with attribution to make the code more
> compact? i.e.:
> 
> 	__u32 link_info_len = sizeof(link_info);
> 
> 
>> +	memset(&link_info, 0, link_info_len);
>> +	bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
>> +	return link_info.id;
>> +}
>> +
>> +static __u32 bpf_link_get_prog_id(int fd)
>> +{
>> +	struct bpf_link_info link_info;
>> +	__u32 link_info_len;
>> +
>> +	link_info_len = sizeof(link_info);
> 
> Ditto
> 
>> +	memset(&link_info, 0, link_info_len);
>> +	bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
>> +	return link_info.prog_id;
>> +}
>> +
>> +static __u32 bpf_map_get_id(int fd)
>> +{
>> +	struct bpf_map_info map_info;
>> +	__u32 map_info_len;
> 
> Ditto.
> 
>> +	map_info_len = sizeof(map_info);
>> +	memset(&map_info, 0, map_info_len);
>> +	bpf_obj_get_info_by_fd(fd, &map_info, &map_info_len);
>> +	return map_info.id;
>> +}
>> +
>> +static int bperf_lock_attr_map(struct target *target)
>> +{
>> +	const char *path = target->attr_map ? : DEFAULT_ATTR_MAP_PATH;
>> +	int map_fd, err;
>> +
>> +	if (access(path, F_OK)) {
>> +		map_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
>> +					sizeof(struct perf_event_attr),
>> +					sizeof(struct bperf_attr_map_entry),
> 
> Also naming it as perf_event_attr_map_entry should make the equivalence
> more explicit if albeit a bit longer, i.e.:
> 
> +		map_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
> +					sizeof(struct perf_event_attr),
> +					sizeof(struct perf_event_attr_map_entry),
> 
>> +					ATTR_MAP_SIZE, 0);
>> +		if (map_fd < 0)
>> +			return -1;
>> +
>> +		err = bpf_obj_pin(map_fd, path);
>> +		if (err) {
>> +			/* someone pinned the map in parallel? */
>> +			close(map_fd);
>> +			map_fd = bpf_obj_get(path);
>> +			if (map_fd < 0)
>> +				return -1;
>> +		}
>> +	} else {
>> +		map_fd = bpf_obj_get(path);
>> +		if (map_fd < 0)
>> +			return -1;
>> +	}
>> +
>> +	err = flock(map_fd, LOCK_EX);
>> +	if (err) {
>> +		close(map_fd);
>> +		return -1;
>> +	}
>> +	return map_fd;
>> +}
>> +
>> +/* trigger the leader program on a cpu */
>> +static int bperf_trigger_reading(int prog_fd, int cpu)
>> +{
>> +	DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
>> +			    .ctx_in = NULL,
>> +			    .ctx_size_in = 0,
>> +			    .flags = BPF_F_TEST_RUN_ON_CPU,
>> +			    .cpu = cpu,
>> +			    .retval = 0,
>> +		);
>> +
>> +	return bpf_prog_test_run_opts(prog_fd, &opts);
>> +}
>> +
>> +static int bperf_check_target(struct evsel *evsel,
>> +			      struct target *target,
>> +			      enum bperf_filter_type *filter_type,
>> +			      __u32 *filter_entry_cnt)
>> +{
>> +	if (evsel->leader->core.nr_members > 1) {
>> +		pr_err("bpf managed perf events do not yet support groups.\n");
>> +		return -1;
>> +	}
>> +
>> +	/* determine filter type based on target */
>> +	if (target->system_wide) {
>> +		*filter_type = BPERF_FILTER_GLOBAL;
>> +		*filter_entry_cnt = 1;
>> +	} else if (target->cpu_list) {
>> +		*filter_type = BPERF_FILTER_CPU;
>> +		*filter_entry_cnt = perf_cpu_map__nr(evsel__cpus(evsel));
>> +	} else if (target->tid) {
>> +		*filter_type = BPERF_FILTER_PID;
>> +		*filter_entry_cnt = perf_thread_map__nr(evsel->core.threads);
>> +	} else if (target->pid) {
>> +		*filter_type = BPERF_FILTER_TGID;
>> +		*filter_entry_cnt = perf_thread_map__nr(evsel->core.threads);
>> +	} else {
>> +		pr_err("bpf managed perf events do not yet support these targets.\n");
>> +		return -1;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd,
>> +				       struct bperf_attr_map_entry *entry)
>> +{
>> +	struct bperf_leader_bpf *skel = NULL;
> 
> Init it to NULL to then right away set it to the return of
> bperf_leader_bpf__open?
> 
> see below
> 
>> +	struct perf_cpu_map *all_cpus;
>> +	int link_fd, diff_map_fd, err;
>> +	struct bpf_link *link = NULL;
>> +
>> +	skel = bperf_leader_bpf__open();
> 
> 	struct bperf_leader_bpf *skel = bperf_leader_bpf__open();
> 
>> +	if (!skel) {
>> +		pr_err("Failed to open leader skeleton\n");
>> +		return -1;
>> +	}
>> +
>> +	bpf_map__resize(skel->maps.events, libbpf_num_possible_cpus());
>> +	err = bperf_leader_bpf__load(skel);
>> +	if (err) {
>> +		pr_err("Failed to load leader skeleton\n");
>> +		goto out;
>> +	}
>> +
>> +	err = -1;
>> +	link = bpf_program__attach(skel->progs.on_switch);
>> +	if (!link) {
>> +		pr_err("Failed to attach leader program\n");
>> +		goto out;
>> +	}
>> +
>> +	all_cpus = perf_cpu_map__new(NULL);
>> +	if (!all_cpus) {
>> +		pr_err("Failed to open cpu_map for all cpus\n");
>> +		goto out;
>> +	}
> 
> If you reorder things maybe you can determine the number of possible
> cpus from all_cpus?

I tried to optimize the use of "all_cpus". But it we still call
perf_cpu_map__new(NULL) on each interval. Maybe we should just save 
num_possible_cpu to evsel? 

I will address the other feedbacks in v2. 

Thanks,
Song



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-12 15:45   ` Song Liu
@ 2021-03-12 16:09     ` Song Liu
  2021-03-13 22:06       ` Jiri Olsa
  0 siblings, 1 reply; 21+ messages in thread
From: Song Liu @ 2021-03-12 16:09 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: linux-kernel, Kernel Team, acme, acme, namhyung, jolsa



> On Mar 12, 2021, at 7:45 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Mar 12, 2021, at 4:12 AM, Jiri Olsa <jolsa@redhat.com> wrote:
>> 
>> On Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu wrote:
>>> perf uses performance monitoring counters (PMCs) to monitor system
>>> performance. The PMCs are limited hardware resources. For example,
>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>> 
>>> Modern data center systems use these PMCs in many different ways:
>>> system level monitoring, (maybe nested) container level monitoring, per
>>> process monitoring, profiling (in sample mode), etc. In some cases,
>>> there are more active perf_events than available hardware PMCs. To allow
>>> all perf_events to have a chance to run, it is necessary to do expensive
>>> time multiplexing of events.
>>> 
>>> On the other hand, many monitoring tools count the common metrics (cycles,
>>> instructions). It is a waste to have multiple tools create multiple
>>> perf_events of "cycles" and occupy multiple PMCs.
>>> 
>>> bperf tries to reduce such wastes by allowing multiple perf_events of
>>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>>> of having each perf-stat session to read its own perf_events, bperf uses
>>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>>> Then, the perf-stat session(s) reads the values from these BPF maps.
>>> 
>>> Please refer to the comment before the definition of bperf_ops for the
>>> description of bperf architecture.
>>> 
>>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>>> bperf uses a BPF hashmap to share information about BPF programs and maps
>>> used by bperf. This map is pinned to bpffs. The default address is
>>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>>> --attr-map.
>> 
>> nice, I recall the presentation about that and was wondering
>> when this will come up ;-)
> 
> The progress is slower than I expected. But I finished some dependencies of 
> this in the last year: 
> 
>  1. BPF_PROG_TEST_RUN for raw_tp event;
>  2. perf-stat -b, which introduced skeleton and bpf_counter;
>  3. BPF task local storage, I didn't use it in this version, but it could,
>     help optimize bperf in the future. 
> 
>> 
>>> 
>>> ---
>>> Known limitations:
>>> 1. Do not support per cgroup events;
>>> 2. Do not support monitoring of BPF program (perf-stat -b);
>>> 3. Do not support event groups.
>>> 
>>> The following commands have been tested:
>>> 
>>>  perf stat --use-bpf -e cycles -a
>>>  perf stat --use-bpf -e cycles -C 1,3,4
>>>  perf stat --use-bpf -e cycles -p 123
>>>  perf stat --use-bpf -e cycles -t 100,101
>> 
>> I assume the output is same as standard perf?

Btw, please give it a try. :) 

It worked pretty well in my tests. If it doesn't work for some combination 
of options, please let me know. 

Thanks,
Song

> 
> Yes, the output is identical to that without --use-bpf option. 




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-12 16:07   ` Song Liu
@ 2021-03-12 17:37     ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-12 17:37 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, Kernel Team, acme, namhyung, jolsa, linux-perf-users

Em Fri, Mar 12, 2021 at 04:07:42PM +0000, Song Liu escreveu:
> 
> 
> > On Mar 12, 2021, at 6:24 AM, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > 
> > Em Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu escreveu:
> >> perf uses performance monitoring counters (PMCs) to monitor system
> >> performance. The PMCs are limited hardware resources. For example,
> >> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >> 
> >> Modern data center systems use these PMCs in many different ways:
> >> system level monitoring, (maybe nested) container level monitoring, per
> >> process monitoring, profiling (in sample mode), etc. In some cases,
> >> there are more active perf_events than available hardware PMCs. To allow
> >> all perf_events to have a chance to run, it is necessary to do expensive
> >> time multiplexing of events.
> >> 
> >> On the other hand, many monitoring tools count the common metrics (cycles,
> >> instructions). It is a waste to have multiple tools create multiple
> >> perf_events of "cycles" and occupy multiple PMCs.
> >> 
> >> bperf tries to reduce such wastes by allowing multiple perf_events of
> >> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> >> of having each perf-stat session to read its own perf_events, bperf uses
> >> BPF programs to read the perf_events and aggregate readings to BPF maps.
> >> Then, the perf-stat session(s) reads the values from these BPF maps.
> >> 
> >> Please refer to the comment before the definition of bperf_ops for the
> >> description of bperf architecture.
> >> 
> >> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
> >> bperf uses a BPF hashmap to share information about BPF programs and maps
> >> used by bperf. This map is pinned to bpffs. The default address is
> >> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
> >> --attr-map.
> >> 
> >> ---
> >> Known limitations:
> >> 1. Do not support per cgroup events;
> >> 2. Do not support monitoring of BPF program (perf-stat -b);
> >> 3. Do not support event groups.
> > 
> > Cool stuff, but I think you can break this up into more self contained
> > patches, see below.
> > 
> > Apart from that, some suggestions/requests:
> > 
> > We need a shell 'perf test' that uses some synthetic workload so that we
> > can count events with/without --use-bpf (--bpf-counters is my
> > alternative name, see below), and then compare if the difference is
> > under some acceptable range.
> 
> Yes, "perf test" makes sense. Would this be the extension of current 
> perf-test command? Or a new set of tests?

Extension, please look at:

tools/perf/tests/shell/

Those are shell script based tests, that will be run by 'perf test'
right after the other, C based ones.
 
> > As a followup patch we could have something like:
> > 
> > perf config stat.bpf-counters=yes
> > 
> > That would make 'perf stat' use BPF counters for what it can, using the
> > default method for the non-supported targets, emitting some 'perf stat
> > -v' visible warning (i.e. a debug message), i.e. make it opt-in that the
> > user wants to use BPF counters for all that is possible at that point in
> > time.o
 
> Yes, the fallback mechanism will be very helpful. I also have ideas on
> setting a list for "common events", and only use BPF for the common 
> events. Not common events should just use the original method. 

Yeah, transition period, as the need arises, more can be done, with the
pre-existing method being the fallback or better than any BPF based
mechanism already.
 
> > Thanks for working on this,
> > 
> >> The following commands have been tested:
> >> 
> >>   perf stat --use-bpf -e cycles -a
> >>   perf stat --use-bpf -e cycles -C 1,3,4
> >>   perf stat --use-bpf -e cycles -p 123
> >>   perf stat --use-bpf -e cycles -t 100,101
> >> 
> >> Signed-off-by: Song Liu <songliubraving@fb.com>
> >> ---
> >> tools/perf/Makefile.perf                      |   1 +
> >> tools/perf/builtin-stat.c                     |  20 +-
> >> tools/perf/util/bpf_counter.c                 | 552 +++++++++++++++++-
> >> tools/perf/util/bpf_skel/bperf.h              |  14 +
> >> tools/perf/util/bpf_skel/bperf_follower.bpf.c |  65 +++
> >> tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
> >> tools/perf/util/evsel.h                       |  20 +-
> >> tools/perf/util/target.h                      |   4 +-
> >> 8 files changed, 712 insertions(+), 10 deletions(-)
> >> create mode 100644 tools/perf/util/bpf_skel/bperf.h
> >> create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
> >> create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
> >> 
> >> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> >> index f6e609673de2b..ca9aa08e85a1f 100644
> >> --- a/tools/perf/Makefile.perf
> >> +++ b/tools/perf/Makefile.perf
> >> @@ -1007,6 +1007,7 @@ python-clean:
> >> SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
> >> SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
> >> SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
> >> +SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
> >> 
> >> ifdef BUILD_BPF_SKEL
> >> BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
> >> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> >> index 2e2e4a8345ea2..34df713a8eea9 100644
> >> --- a/tools/perf/builtin-stat.c
> >> +++ b/tools/perf/builtin-stat.c
> >> @@ -792,6 +792,12 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
> >> 	}
> >> 
> >> 	evlist__for_each_cpu (evsel_list, i, cpu) {
> >> +		/*
> >> +		 * bperf calls evsel__open_per_cpu() in bperf__load(), so
> >> +		 * no need to call it again here.
> >> +		 */
> >> +		if (target.use_bpf)
> >> +			break;
> >> 		affinity__set(&affinity, cpu);
> >> 
> >> 		evlist__for_each_entry(evsel_list, counter) {
> >> @@ -925,15 +931,15 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
> >> 	/*
> >> 	 * Enable counters and exec the command:
> >> 	 */
> >> -	t0 = rdclock();
> >> -	clock_gettime(CLOCK_MONOTONIC, &ref_time);
> >> -
> >> 	if (forks) {
> >> 		evlist__start_workload(evsel_list);
> >> 		err = enable_counters();
> >> 		if (err)
> >> 			return -1;
> >> 
> >> +		t0 = rdclock();
> >> +		clock_gettime(CLOCK_MONOTONIC, &ref_time);
> >> +
> >> 		if (interval || timeout || evlist__ctlfd_initialized(evsel_list))
> >> 			status = dispatch_events(forks, timeout, interval, &times);
> >> 		if (child_pid != -1) {
> >> @@ -954,6 +960,10 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
> >> 		err = enable_counters();
> >> 		if (err)
> >> 			return -1;
> >> +
> >> +		t0 = rdclock();
> >> +		clock_gettime(CLOCK_MONOTONIC, &ref_time);
> >> +
> >> 		status = dispatch_events(forks, timeout, interval, &times);
> >> 	}
> >> 
> > 
> > The above two hunks seems out of place, i.e. can they go to a different
> > patch and with an explanation about why this is needed?
 
> Yeah, make sense. I will split them to a separate patch. 

Ok
 
> >> @@ -1146,6 +1156,10 @@ static struct option stat_options[] = {
> >> #ifdef HAVE_BPF_SKEL
> >> 	OPT_STRING('b', "bpf-prog", &target.bpf_str, "bpf-prog-id",
> >> 		   "stat events on existing bpf program id"),
> >> +	OPT_BOOLEAN(0, "use-bpf", &target.use_bpf,
> >> +		    "use bpf program to count events"),
> >> +	OPT_STRING(0, "attr-map", &target.attr_map, "attr-map-path",
> >> +		   "path to perf_event_attr map"),
> > 
> > These need to be documented in tools/perf/Documentation/perf-stat.txt
> > 
> > Also please rename it to:
> > 
> >  --bpf-counters
> >  --bpf-attr-map
> > 
> > Andrii suggested prefixing with --bpf BPF related options in pahole, I
> > think this applies here as well.
> 
> Will change it in v2. 
 
Ok

> >> #endif
> >> 	OPT_BOOLEAN('a', "all-cpus", &target.system_wide,
> >> 		    "system-wide collection from all CPUs"),
> >> diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
> >> index 04f89120b3232..d232011ec8f26 100644
> >> --- a/tools/perf/util/bpf_counter.c
> >> +++ b/tools/perf/util/bpf_counter.c
> >> @@ -5,6 +5,7 @@
> >> #include <assert.h>
> >> #include <limits.h>
> >> #include <unistd.h>
> >> +#include <sys/file.h>
> >> #include <sys/time.h>
> >> #include <sys/resource.h>
> >> #include <linux/err.h>
> >> @@ -18,8 +19,37 @@
> >> #include "debug.h"
> >> #include "evsel.h"
> >> #include "target.h"
> >> +#include "cpumap.h"
> >> +#include "thread_map.h"
> >> 
> >> #include "bpf_skel/bpf_prog_profiler.skel.h"
> >> +#include "bpf_skel/bperf_u.h"
> >> +#include "bpf_skel/bperf_leader.skel.h"
> >> +#include "bpf_skel/bperf_follower.skel.h"
> >> +
> >> +/*
> >> + * bperf uses a hashmap, the attr_map, to track all the leader programs.
> >> + * The hashmap is pinned in bpffs. flock() on this file is used to ensure
> >> + * no concurrent access to the attr_map.  The key of attr_map is struct
> >> + * perf_event_attr, and the value is struct bperf_attr_map_entry.
> >> + *
> >> + * struct bperf_attr_map_entry contains two __u32 IDs, bpf_link of the
> >> + * leader prog, and the diff_map. Each perf-stat session holds a reference
> >> + * to the bpf_link to make sure the leader prog is attached to sched_switch
> >> + * tracepoint.
> >> + *
> >> + * Since the hashmap only contains IDs of the bpf_link and diff_map, it
> >> + * does not hold any references to the leader program. Once all perf-stat
> >> + * sessions of these events exit, the leader prog, its maps, and the
> >> + * perf_events will be freed.
> >> + */
> >> +struct bperf_attr_map_entry {
> >> +	__u32 link_id;
> >> +	__u32 diff_map_id;
> >> +};
> >> +
> >> +#define DEFAULT_ATTR_MAP_PATH "/sys/fs/bpf/bperf_attr_map"
> > 
> > Humm bpf is already in the parent directory, perhaps we can have it as
> > just /sys/fs/bpf/perf_attr_map? Here 'counter' isn't applicable, I
> > guess, eventually we may want to use this for other purposes? ;-)
> 
> Do you mean sharing PMCs with sampling events? Well, I do have it on my
> list for 2022. ;-)

Right, its just a little joke about how everything will end up being a
BPF program attached at the right place :-)
 
> > 
> >> +#define ATTR_MAP_SIZE 16
> >> 
> >> static inline void *u64_to_ptr(__u64 ptr)
> >> {
> >> @@ -274,17 +304,529 @@ struct bpf_counter_ops bpf_program_profiler_ops = {
> >> 	.install_pe = bpf_program_profiler__install_pe,
> >> };
> >> 
> >> +static __u32 bpf_link_get_id(int fd)
> >> +{
> >> +	struct bpf_link_info link_info;
> >> +	__u32 link_info_len;
> >> +
> >> +	link_info_len = sizeof(link_info);
> > 
> > Can you combine declaration with attribution to make the code more
> > compact? i.e.:
> > 
> > 	__u32 link_info_len = sizeof(link_info);
> > 
> > 
> >> +	memset(&link_info, 0, link_info_len);
> >> +	bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
> >> +	return link_info.id;
> >> +}
> >> +
> >> +static __u32 bpf_link_get_prog_id(int fd)
> >> +{
> >> +	struct bpf_link_info link_info;
> >> +	__u32 link_info_len;
> >> +
> >> +	link_info_len = sizeof(link_info);
> > 
> > Ditto
> > 
> >> +	memset(&link_info, 0, link_info_len);
> >> +	bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
> >> +	return link_info.prog_id;
> >> +}
> >> +
> >> +static __u32 bpf_map_get_id(int fd)
> >> +{
> >> +	struct bpf_map_info map_info;
> >> +	__u32 map_info_len;
> > 
> > Ditto.
> > 
> >> +	map_info_len = sizeof(map_info);
> >> +	memset(&map_info, 0, map_info_len);
> >> +	bpf_obj_get_info_by_fd(fd, &map_info, &map_info_len);
> >> +	return map_info.id;
> >> +}
> >> +
> >> +static int bperf_lock_attr_map(struct target *target)
> >> +{
> >> +	const char *path = target->attr_map ? : DEFAULT_ATTR_MAP_PATH;
> >> +	int map_fd, err;
> >> +
> >> +	if (access(path, F_OK)) {
> >> +		map_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
> >> +					sizeof(struct perf_event_attr),
> >> +					sizeof(struct bperf_attr_map_entry),
> > 
> > Also naming it as perf_event_attr_map_entry should make the equivalence
> > more explicit if albeit a bit longer, i.e.:
> > 
> > +		map_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
> > +					sizeof(struct perf_event_attr),
> > +					sizeof(struct perf_event_attr_map_entry),
> > 
> >> +					ATTR_MAP_SIZE, 0);
> >> +		if (map_fd < 0)
> >> +			return -1;
> >> +
> >> +		err = bpf_obj_pin(map_fd, path);
> >> +		if (err) {
> >> +			/* someone pinned the map in parallel? */
> >> +			close(map_fd);
> >> +			map_fd = bpf_obj_get(path);
> >> +			if (map_fd < 0)
> >> +				return -1;
> >> +		}
> >> +	} else {
> >> +		map_fd = bpf_obj_get(path);
> >> +		if (map_fd < 0)
> >> +			return -1;
> >> +	}
> >> +
> >> +	err = flock(map_fd, LOCK_EX);
> >> +	if (err) {
> >> +		close(map_fd);
> >> +		return -1;
> >> +	}
> >> +	return map_fd;
> >> +}
> >> +
> >> +/* trigger the leader program on a cpu */
> >> +static int bperf_trigger_reading(int prog_fd, int cpu)
> >> +{
> >> +	DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
> >> +			    .ctx_in = NULL,
> >> +			    .ctx_size_in = 0,
> >> +			    .flags = BPF_F_TEST_RUN_ON_CPU,
> >> +			    .cpu = cpu,
> >> +			    .retval = 0,
> >> +		);
> >> +
> >> +	return bpf_prog_test_run_opts(prog_fd, &opts);
> >> +}
> >> +
> >> +static int bperf_check_target(struct evsel *evsel,
> >> +			      struct target *target,
> >> +			      enum bperf_filter_type *filter_type,
> >> +			      __u32 *filter_entry_cnt)
> >> +{
> >> +	if (evsel->leader->core.nr_members > 1) {
> >> +		pr_err("bpf managed perf events do not yet support groups.\n");
> >> +		return -1;
> >> +	}
> >> +
> >> +	/* determine filter type based on target */
> >> +	if (target->system_wide) {
> >> +		*filter_type = BPERF_FILTER_GLOBAL;
> >> +		*filter_entry_cnt = 1;
> >> +	} else if (target->cpu_list) {
> >> +		*filter_type = BPERF_FILTER_CPU;
> >> +		*filter_entry_cnt = perf_cpu_map__nr(evsel__cpus(evsel));
> >> +	} else if (target->tid) {
> >> +		*filter_type = BPERF_FILTER_PID;
> >> +		*filter_entry_cnt = perf_thread_map__nr(evsel->core.threads);
> >> +	} else if (target->pid) {
> >> +		*filter_type = BPERF_FILTER_TGID;
> >> +		*filter_entry_cnt = perf_thread_map__nr(evsel->core.threads);
> >> +	} else {
> >> +		pr_err("bpf managed perf events do not yet support these targets.\n");
> >> +		return -1;
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd,
> >> +				       struct bperf_attr_map_entry *entry)
> >> +{
> >> +	struct bperf_leader_bpf *skel = NULL;
> > 
> > Init it to NULL to then right away set it to the return of
> > bperf_leader_bpf__open?
> > 
> > see below
> > 
> >> +	struct perf_cpu_map *all_cpus;
> >> +	int link_fd, diff_map_fd, err;
> >> +	struct bpf_link *link = NULL;
> >> +
> >> +	skel = bperf_leader_bpf__open();
> > 
> > 	struct bperf_leader_bpf *skel = bperf_leader_bpf__open();
> > 
> >> +	if (!skel) {
> >> +		pr_err("Failed to open leader skeleton\n");
> >> +		return -1;
> >> +	}
> >> +
> >> +	bpf_map__resize(skel->maps.events, libbpf_num_possible_cpus());
> >> +	err = bperf_leader_bpf__load(skel);
> >> +	if (err) {
> >> +		pr_err("Failed to load leader skeleton\n");
> >> +		goto out;
> >> +	}
> >> +
> >> +	err = -1;
> >> +	link = bpf_program__attach(skel->progs.on_switch);
> >> +	if (!link) {
> >> +		pr_err("Failed to attach leader program\n");
> >> +		goto out;
> >> +	}
> >> +
> >> +	all_cpus = perf_cpu_map__new(NULL);
> >> +	if (!all_cpus) {
> >> +		pr_err("Failed to open cpu_map for all cpus\n");
> >> +		goto out;
> >> +	}
> > 
> > If you reorder things maybe you can determine the number of possible
> > cpus from all_cpus?
> 
> I tried to optimize the use of "all_cpus". But it we still call
> perf_cpu_map__new(NULL) on each interval. Maybe we should just save 
> num_possible_cpu to evsel? 

Yeah, or do it at perf tool start, as this is an invariant and use it
everywhere? online cpus vary, but possible? I guess this is a hard
limit, right, one that the tool can get at system start.

> I will address the other feedbacks in v2. 

Thanks a lot!

- Arnaldo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-12 14:24 ` Arnaldo Carvalho de Melo
  2021-03-12 16:07   ` Song Liu
@ 2021-03-12 18:52   ` Song Liu
  2021-03-15 13:10     ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 21+ messages in thread
From: Song Liu @ 2021-03-12 18:52 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: linux-kernel, Kernel Team, acme, namhyung, jolsa, linux-perf-users



> On Mar 12, 2021, at 6:24 AM, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> 
> Em Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu escreveu:
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>> 
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>> 
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>> 
>> bperf tries to reduce such wastes by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>> of having each perf-stat session to read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>> Then, the perf-stat session(s) reads the values from these BPF maps.
>> 
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
>> 
>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>> bperf uses a BPF hashmap to share information about BPF programs and maps
>> used by bperf. This map is pinned to bpffs. The default address is
>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>> --attr-map.
>> 
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
>> 2. Do not support monitoring of BPF program (perf-stat -b);
>> 3. Do not support event groups.
> 
> Cool stuff, but I think you can break this up into more self contained
> patches, see below.
> 
> Apart from that, some suggestions/requests:
> 
> We need a shell 'perf test' that uses some synthetic workload so that we
> can count events with/without --use-bpf (--bpf-counters is my
> alternative name, see below), and then compare if the difference is
> under some acceptable range.
> 
> As a followup patch we could have something like:
> 
> perf config stat.bpf-counters=yes
> 
> That would make 'perf stat' use BPF counters for what it can, using the
> default method for the non-supported targets, emitting some 'perf stat
> -v' visible warning (i.e. a debug message), i.e. make it opt-in that the
> user wants to use BPF counters for all that is possible at that point in
> time.o
> 
> Thanks for working on this,
> 
> - Arnaldo
> 
>> The following commands have been tested:
>> 
>>   perf stat --use-bpf -e cycles -a
>>   perf stat --use-bpf -e cycles -C 1,3,4
>>   perf stat --use-bpf -e cycles -p 123
>>   perf stat --use-bpf -e cycles -t 100,101
>> 
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
>> tools/perf/Makefile.perf                      |   1 +
>> tools/perf/builtin-stat.c                     |  20 +-
>> tools/perf/util/bpf_counter.c                 | 552 +++++++++++++++++-
>> tools/perf/util/bpf_skel/bperf.h              |  14 +
>> tools/perf/util/bpf_skel/bperf_follower.bpf.c |  65 +++
>> tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
>> tools/perf/util/evsel.h                       |  20 +-
>> tools/perf/util/target.h                      |   4 +-
>> 8 files changed, 712 insertions(+), 10 deletions(-)
>> create mode 100644 tools/perf/util/bpf_skel/bperf.h
>> create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
>> create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
>> 
>> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
>> index f6e609673de2b..ca9aa08e85a1f 100644
>> --- a/tools/perf/Makefile.perf
>> +++ b/tools/perf/Makefile.perf
>> @@ -1007,6 +1007,7 @@ python-clean:
>> SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
>> SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
>> SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
>> +SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
>> 
>> ifdef BUILD_BPF_SKEL
>> BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 2e2e4a8345ea2..34df713a8eea9 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -792,6 +792,12 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>> 	}
>> 
>> 	evlist__for_each_cpu (evsel_list, i, cpu) {
>> +		/*
>> +		 * bperf calls evsel__open_per_cpu() in bperf__load(), so
>> +		 * no need to call it again here.
>> +		 */
>> +		if (target.use_bpf)
>> +			break;
>> 		affinity__set(&affinity, cpu);
>> 
>> 		evlist__for_each_entry(evsel_list, counter) {
>> @@ -925,15 +931,15 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>> 	/*
>> 	 * Enable counters and exec the command:
>> 	 */
>> -	t0 = rdclock();
>> -	clock_gettime(CLOCK_MONOTONIC, &ref_time);
>> -
>> 	if (forks) {
>> 		evlist__start_workload(evsel_list);
>> 		err = enable_counters();
>> 		if (err)
>> 			return -1;
>> 
>> +		t0 = rdclock();
>> +		clock_gettime(CLOCK_MONOTONIC, &ref_time);
>> +
>> 		if (interval || timeout || evlist__ctlfd_initialized(evsel_list))
>> 			status = dispatch_events(forks, timeout, interval, &times);
>> 		if (child_pid != -1) {
>> @@ -954,6 +960,10 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>> 		err = enable_counters();
>> 		if (err)
>> 			return -1;
>> +
>> +		t0 = rdclock();
>> +		clock_gettime(CLOCK_MONOTONIC, &ref_time);
>> +
>> 		status = dispatch_events(forks, timeout, interval, &times);
>> 	}
>> 
> 
> The above two hunks seems out of place, i.e. can they go to a different
> patch and with an explanation about why this is needed?

Actually, I am still debating whether we want the above change in a separate 
patch. It is related to the following change. 

[...]

>> +	/*
>> +	 * Attahcing the skeleton takes non-trivial time (0.2s+ on a kernel
>> +	 * with some debug options enabled). This shows as a longer first
>> +	 * interval:
>> +	 *
>> +	 * # perf stat -e cycles -a -I 1000
>> +	 * #           time             counts unit events
>> +	 *      1.267634674     26,259,166,523      cycles
>> +	 *      2.271637827     22,550,822,286      cycles
>> +	 *      3.275406553     22,852,583,744      cycles
>> +	 *
>> +	 * Fix this by zeroing accum_readings after attaching the program.
>> +	 */
>> +	bperf_sync_counters(evsel);
>> +	entry_cnt = bpf_map__max_entries(skel->maps.accum_readings);
>> +	memset(values, 0, sizeof(struct bpf_perf_event_value) * num_cpu_bpf);
>> +
>> +	for (i = 0; i < entry_cnt; i++) {
>> +		bpf_map_update_elem(bpf_map__fd(skel->maps.accum_readings),
>> +				    &i, values, BPF_ANY);
>> +	}
>> +	return 0;
>> +}

Attaching the skeleton takes non-trivial time, so that we get a bigger first 
interval (1.26s in the example above). To fix this, in __run_perf_stat(), we 
get t0 and ref_time after enable_counters(). 

Maybe a comment in __run_perf_stat() is better than a separate patch?

Thanks,
Song

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-12 15:38   ` Song Liu
@ 2021-03-13  2:47     ` Namhyung Kim
  2021-03-13 19:37       ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Namhyung Kim @ 2021-03-13  2:47 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, Kernel Team, Arnaldo Carvalho de Melo,
	Arnaldo Carvalho de Melo, Jiri Olsa

On Sat, Mar 13, 2021 at 12:38 AM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Mar 12, 2021, at 12:36 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > Hi,
> >
> > On Fri, Mar 12, 2021 at 11:03 AM Song Liu <songliubraving@fb.com> wrote:
> >>
> >> perf uses performance monitoring counters (PMCs) to monitor system
> >> performance. The PMCs are limited hardware resources. For example,
> >> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>
> >> Modern data center systems use these PMCs in many different ways:
> >> system level monitoring, (maybe nested) container level monitoring, per
> >> process monitoring, profiling (in sample mode), etc. In some cases,
> >> there are more active perf_events than available hardware PMCs. To allow
> >> all perf_events to have a chance to run, it is necessary to do expensive
> >> time multiplexing of events.
> >>
> >> On the other hand, many monitoring tools count the common metrics (cycles,
> >> instructions). It is a waste to have multiple tools create multiple
> >> perf_events of "cycles" and occupy multiple PMCs.
> >>
> >> bperf tries to reduce such wastes by allowing multiple perf_events of
> >> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> >> of having each perf-stat session to read its own perf_events, bperf uses
> >> BPF programs to read the perf_events and aggregate readings to BPF maps.
> >> Then, the perf-stat session(s) reads the values from these BPF maps.
> >>
> >> Please refer to the comment before the definition of bperf_ops for the
> >> description of bperf architecture.
> >
> > Interesting!  Actually I thought about something similar before,
> > but my BPF knowledge is outdated.  So I need to catch up but
> > failed to have some time for it so far. ;-)
> >
> >>
> >> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
> >> bperf uses a BPF hashmap to share information about BPF programs and maps
> >> used by bperf. This map is pinned to bpffs. The default address is
> >> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
> >> --attr-map.
> >>
> >> ---
> >> Known limitations:
> >> 1. Do not support per cgroup events;
> >> 2. Do not support monitoring of BPF program (perf-stat -b);
> >> 3. Do not support event groups.
> >
> > In my case, per cgroup event counting is very important.
> > And I'd like to do that with lots of cpus and cgroups.
>
> We can easily extend this approach to support cgroups events. I didn't
> implement it to keep the first version simple.

OK.

>
> > So I'm working on an in-kernel solution (without BPF),
> > I hope to share it soon.
>
> This is interesting! I cannot wait to see how it looks like. I spent
> quite some time try to enable in kernel sharing (not just cgroup
> events), but finally decided to try BPF approach.

Well I found it hard to support generic event sharing that works
for all use cases.  So I'm focusing on the per cgroup case only.

>
> >
> > And for event groups, it seems the current implementation
> > cannot handle more than one event (not even in a group).
> > That could be a serious limitation..
>
> It supports multiple events. Multiple events are independent, i.e.,
> "cycles" and "instructions" would use two independent leader programs.

OK, then do you need multiple bperf_attr_maps?  Does it work
for an arbitrary number of events?

>
> >
> >>
> >> The following commands have been tested:
> >>
> >>   perf stat --use-bpf -e cycles -a
> >>   perf stat --use-bpf -e cycles -C 1,3,4
> >>   perf stat --use-bpf -e cycles -p 123
> >>   perf stat --use-bpf -e cycles -t 100,101
> >
> > Hmm... so it loads both leader and follower programs if needed, right?
> > Does it support multiple followers with different targets at the same time?
>
> Yes, the whole idea is to have one leader program and multiple follower
> programs. If we only run one of these commands at a time, it will load
> one leader and one follower. If we run multiple of them in parallel,
> they will share the same leader program and load multiple follower
> programs.
>
> I actually tested more than the commands above. The list actually means
> we support -a, -C -p, and -t.
>
> Currently, this works for multiple events, and different parallel
> perf-stat. The two commands below will work well in parallel:
>
>   perf stat --use-bpf -e ref-cycles,instructions -a
>   perf stat --use-bpf -e ref-cycles,cycles -C 1,3,5
>
> Note the use of ref-cycles, which can only use one counter on Intel CPUs.
> With this approach, the above two commands will not do time multiplexing
> on ref-cycles.

Awesome!

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-13  2:47     ` Namhyung Kim
@ 2021-03-13 19:37       ` Song Liu
  0 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2021-03-13 19:37 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: linux-kernel, Kernel Team, Arnaldo Carvalho de Melo,
	Arnaldo Carvalho de Melo, Jiri Olsa



> On Mar 12, 2021, at 6:47 PM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> On Sat, Mar 13, 2021 at 12:38 AM Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Mar 12, 2021, at 12:36 AM, Namhyung Kim <namhyung@kernel.org> wrote:
>>> 
>>> Hi,
>>> 
>>> On Fri, Mar 12, 2021 at 11:03 AM Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> perf uses performance monitoring counters (PMCs) to monitor system
>>>> performance. The PMCs are limited hardware resources. For example,
>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>>> 
>>>> Modern data center systems use these PMCs in many different ways:
>>>> system level monitoring, (maybe nested) container level monitoring, per
>>>> process monitoring, profiling (in sample mode), etc. In some cases,
>>>> there are more active perf_events than available hardware PMCs. To allow
>>>> all perf_events to have a chance to run, it is necessary to do expensive
>>>> time multiplexing of events.
>>>> 
>>>> On the other hand, many monitoring tools count the common metrics (cycles,
>>>> instructions). It is a waste to have multiple tools create multiple
>>>> perf_events of "cycles" and occupy multiple PMCs.
>>>> 
>>>> bperf tries to reduce such wastes by allowing multiple perf_events of
>>>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>>>> of having each perf-stat session to read its own perf_events, bperf uses
>>>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>>>> Then, the perf-stat session(s) reads the values from these BPF maps.
>>>> 
>>>> Please refer to the comment before the definition of bperf_ops for the
>>>> description of bperf architecture.
>>> 
>>> Interesting!  Actually I thought about something similar before,
>>> but my BPF knowledge is outdated.  So I need to catch up but
>>> failed to have some time for it so far. ;-)
>>> 
>>>> 
>>>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>>>> bperf uses a BPF hashmap to share information about BPF programs and maps
>>>> used by bperf. This map is pinned to bpffs. The default address is
>>>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>>>> --attr-map.
>>>> 
>>>> ---
>>>> Known limitations:
>>>> 1. Do not support per cgroup events;
>>>> 2. Do not support monitoring of BPF program (perf-stat -b);
>>>> 3. Do not support event groups.
>>> 
>>> In my case, per cgroup event counting is very important.
>>> And I'd like to do that with lots of cpus and cgroups.
>> 
>> We can easily extend this approach to support cgroups events. I didn't
>> implement it to keep the first version simple.
> 
> OK.
> 
>> 
>>> So I'm working on an in-kernel solution (without BPF),
>>> I hope to share it soon.
>> 
>> This is interesting! I cannot wait to see how it looks like. I spent
>> quite some time try to enable in kernel sharing (not just cgroup
>> events), but finally decided to try BPF approach.
> 
> Well I found it hard to support generic event sharing that works
> for all use cases.  So I'm focusing on the per cgroup case only.
> 
>> 
>>> 
>>> And for event groups, it seems the current implementation
>>> cannot handle more than one event (not even in a group).
>>> That could be a serious limitation..
>> 
>> It supports multiple events. Multiple events are independent, i.e.,
>> "cycles" and "instructions" would use two independent leader programs.
> 
> OK, then do you need multiple bperf_attr_maps?  Does it work
> for an arbitrary number of events?

The bperf_attr_map (or perf_attr_map) is shared among different events. 
It is a hash map with perf_event_attr as the key. Currently, I hard coded
its size to 16. We can introduce more flexible management of this map. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-12 16:09     ` Song Liu
@ 2021-03-13 22:06       ` Jiri Olsa
  2021-03-13 23:03         ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Jiri Olsa @ 2021-03-13 22:06 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-kernel, Kernel Team, acme, acme, namhyung, jolsa

On Fri, Mar 12, 2021 at 04:09:53PM +0000, Song Liu wrote:
> 
> 
> > On Mar 12, 2021, at 7:45 AM, Song Liu <songliubraving@fb.com> wrote:
> > 
> > 
> > 
> >> On Mar 12, 2021, at 4:12 AM, Jiri Olsa <jolsa@redhat.com> wrote:
> >> 
> >> On Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu wrote:
> >>> perf uses performance monitoring counters (PMCs) to monitor system
> >>> performance. The PMCs are limited hardware resources. For example,
> >>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>> 
> >>> Modern data center systems use these PMCs in many different ways:
> >>> system level monitoring, (maybe nested) container level monitoring, per
> >>> process monitoring, profiling (in sample mode), etc. In some cases,
> >>> there are more active perf_events than available hardware PMCs. To allow
> >>> all perf_events to have a chance to run, it is necessary to do expensive
> >>> time multiplexing of events.
> >>> 
> >>> On the other hand, many monitoring tools count the common metrics (cycles,
> >>> instructions). It is a waste to have multiple tools create multiple
> >>> perf_events of "cycles" and occupy multiple PMCs.
> >>> 
> >>> bperf tries to reduce such wastes by allowing multiple perf_events of
> >>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> >>> of having each perf-stat session to read its own perf_events, bperf uses
> >>> BPF programs to read the perf_events and aggregate readings to BPF maps.
> >>> Then, the perf-stat session(s) reads the values from these BPF maps.
> >>> 
> >>> Please refer to the comment before the definition of bperf_ops for the
> >>> description of bperf architecture.
> >>> 
> >>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
> >>> bperf uses a BPF hashmap to share information about BPF programs and maps
> >>> used by bperf. This map is pinned to bpffs. The default address is
> >>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
> >>> --attr-map.
> >> 
> >> nice, I recall the presentation about that and was wondering
> >> when this will come up ;-)
> > 
> > The progress is slower than I expected. But I finished some dependencies of 
> > this in the last year: 
> > 
> >  1. BPF_PROG_TEST_RUN for raw_tp event;
> >  2. perf-stat -b, which introduced skeleton and bpf_counter;
> >  3. BPF task local storage, I didn't use it in this version, but it could,
> >     help optimize bperf in the future. 
> > 
> >> 
> >>> 
> >>> ---
> >>> Known limitations:
> >>> 1. Do not support per cgroup events;
> >>> 2. Do not support monitoring of BPF program (perf-stat -b);
> >>> 3. Do not support event groups.
> >>> 
> >>> The following commands have been tested:
> >>> 
> >>>  perf stat --use-bpf -e cycles -a
> >>>  perf stat --use-bpf -e cycles -C 1,3,4
> >>>  perf stat --use-bpf -e cycles -p 123
> >>>  perf stat --use-bpf -e cycles -t 100,101
> >> 
> >> I assume the output is same as standard perf?
> 
> Btw, please give it a try. :) 
> 
> It worked pretty well in my tests. If it doesn't work for some combination 
> of options, please let me know. 

heya, can't compile

  CLANG    /home/jolsa/linux-perf/tools/perf/util/bpf_skel/.tmp/bperf_follower.bpf.o
util/bpf_skel/bperf_follower.bpf.c:8:10: fatal error: 'bperf_u.h' file not found
#include "bperf_u.h"
         ^~~~~~~~~~~

jirka


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-13 22:06       ` Jiri Olsa
@ 2021-03-13 23:03         ` Song Liu
  0 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2021-03-13 23:03 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: linux-kernel, Kernel Team, acme, acme, namhyung, jolsa



> On Mar 13, 2021, at 2:06 PM, Jiri Olsa <jolsa@redhat.com> wrote:
> 
> On Fri, Mar 12, 2021 at 04:09:53PM +0000, Song Liu wrote:
>> 
>> 
>>> On Mar 12, 2021, at 7:45 AM, Song Liu <songliubraving@fb.com> wrote:
>>> 
>>> 
>>> 
>>>> On Mar 12, 2021, at 4:12 AM, Jiri Olsa <jolsa@redhat.com> wrote:
>>>> 
>>>> On Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu wrote:
>>>>> perf uses performance monitoring counters (PMCs) to monitor system
>>>>> performance. The PMCs are limited hardware resources. For example,
>>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>>>> 
>>>>> Modern data center systems use these PMCs in many different ways:
>>>>> system level monitoring, (maybe nested) container level monitoring, per
>>>>> process monitoring, profiling (in sample mode), etc. In some cases,
>>>>> there are more active perf_events than available hardware PMCs. To allow
>>>>> all perf_events to have a chance to run, it is necessary to do expensive
>>>>> time multiplexing of events.
>>>>> 
>>>>> On the other hand, many monitoring tools count the common metrics (cycles,
>>>>> instructions). It is a waste to have multiple tools create multiple
>>>>> perf_events of "cycles" and occupy multiple PMCs.
>>>>> 
>>>>> bperf tries to reduce such wastes by allowing multiple perf_events of
>>>>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>>>>> of having each perf-stat session to read its own perf_events, bperf uses
>>>>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>>>>> Then, the perf-stat session(s) reads the values from these BPF maps.
>>>>> 
>>>>> Please refer to the comment before the definition of bperf_ops for the
>>>>> description of bperf architecture.
>>>>> 
>>>>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>>>>> bperf uses a BPF hashmap to share information about BPF programs and maps
>>>>> used by bperf. This map is pinned to bpffs. The default address is
>>>>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>>>>> --attr-map.
>>>> 
>>>> nice, I recall the presentation about that and was wondering
>>>> when this will come up ;-)
>>> 
>>> The progress is slower than I expected. But I finished some dependencies of 
>>> this in the last year: 
>>> 
>>> 1. BPF_PROG_TEST_RUN for raw_tp event;
>>> 2. perf-stat -b, which introduced skeleton and bpf_counter;
>>> 3. BPF task local storage, I didn't use it in this version, but it could,
>>>    help optimize bperf in the future. 
>>> 
>>>> 
>>>>> 
>>>>> ---
>>>>> Known limitations:
>>>>> 1. Do not support per cgroup events;
>>>>> 2. Do not support monitoring of BPF program (perf-stat -b);
>>>>> 3. Do not support event groups.
>>>>> 
>>>>> The following commands have been tested:
>>>>> 
>>>>> perf stat --use-bpf -e cycles -a
>>>>> perf stat --use-bpf -e cycles -C 1,3,4
>>>>> perf stat --use-bpf -e cycles -p 123
>>>>> perf stat --use-bpf -e cycles -t 100,101
>>>> 
>>>> I assume the output is same as standard perf?
>> 
>> Btw, please give it a try. :) 
>> 
>> It worked pretty well in my tests. If it doesn't work for some combination 
>> of options, please let me know. 
> 
> heya, can't compile
> 
>  CLANG    /home/jolsa/linux-perf/tools/perf/util/bpf_skel/.tmp/bperf_follower.bpf.o
> util/bpf_skel/bperf_follower.bpf.c:8:10: fatal error: 'bperf_u.h' file not found
> #include "bperf_u.h"
>         ^~~~~~~~~~~

Oops, I forgot git-add. :( 

The file is very simple:

tools/perf/util/bpf_skel/bperf_u.h:


// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
// Copyright (c) 2021 Facebook

#ifndef __BPERF_STAT_U_H
#define __BPERF_STAT_U_H

enum bperf_filter_type {
        BPERF_FILTER_GLOBAL = 1,
        BPERF_FILTER_CPU,
        BPERF_FILTER_PID,
        BPERF_FILTER_TGID,
};

#endif /* __BPERF_STAT_U_H */

Thanks,
Song


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-12  2:02 [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF Song Liu
                   ` (2 preceding siblings ...)
  2021-03-12 14:24 ` Arnaldo Carvalho de Melo
@ 2021-03-14 22:48 ` Jiri Olsa
  2021-03-15  7:51   ` Song Liu
  3 siblings, 1 reply; 21+ messages in thread
From: Jiri Olsa @ 2021-03-14 22:48 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-kernel, kernel-team, acme, acme, namhyung, jolsa

On Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu wrote:
> perf uses performance monitoring counters (PMCs) to monitor system
> performance. The PMCs are limited hardware resources. For example,
> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> 
> Modern data center systems use these PMCs in many different ways:
> system level monitoring, (maybe nested) container level monitoring, per
> process monitoring, profiling (in sample mode), etc. In some cases,
> there are more active perf_events than available hardware PMCs. To allow
> all perf_events to have a chance to run, it is necessary to do expensive
> time multiplexing of events.
> 
> On the other hand, many monitoring tools count the common metrics (cycles,
> instructions). It is a waste to have multiple tools create multiple
> perf_events of "cycles" and occupy multiple PMCs.
> 
> bperf tries to reduce such wastes by allowing multiple perf_events of
> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> of having each perf-stat session to read its own perf_events, bperf uses
> BPF programs to read the perf_events and aggregate readings to BPF maps.
> Then, the perf-stat session(s) reads the values from these BPF maps.
> 
> Please refer to the comment before the definition of bperf_ops for the
> description of bperf architecture.
> 
> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
> bperf uses a BPF hashmap to share information about BPF programs and maps
> used by bperf. This map is pinned to bpffs. The default address is
> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
> --attr-map.
> 
> ---
> Known limitations:
> 1. Do not support per cgroup events;

isn't that another filter via bpf_get_current_cgroup_id ?

> 2. Do not support monitoring of BPF program (perf-stat -b);

we'd need to call the leadr on fentry/fexit of the monitored bpf
program, right?

> 3. Do not support event groups.

I guess for group support you'll need to load 'leaders' for each group member

> 
> The following commands have been tested:
> 
>    perf stat --use-bpf -e cycles -a
>    perf stat --use-bpf -e cycles -C 1,3,4
>    perf stat --use-bpf -e cycles -p 123
>    perf stat --use-bpf -e cycles -t 100,101

works good with that file you sent.. I'll check/test more,
so far some quick comments below

thanks,
jirka



SNIP

> @@ -1146,6 +1156,10 @@ static struct option stat_options[] = {
>  #ifdef HAVE_BPF_SKEL
>  	OPT_STRING('b', "bpf-prog", &target.bpf_str, "bpf-prog-id",
>  		   "stat events on existing bpf program id"),
> +	OPT_BOOLEAN(0, "use-bpf", &target.use_bpf,
> +		    "use bpf program to count events"),
> +	OPT_STRING(0, "attr-map", &target.attr_map, "attr-map-path",
> +		   "path to perf_event_attr map"),

what's the point of allowing another name? just debug purpose?

SNIP

> + * bperf uses a hashmap, the attr_map, to track all the leader programs.
> + * The hashmap is pinned in bpffs. flock() on this file is used to ensure
> + * no concurrent access to the attr_map.  The key of attr_map is struct
> + * perf_event_attr, and the value is struct bperf_attr_map_entry.
> + *
> + * struct bperf_attr_map_entry contains two __u32 IDs, bpf_link of the
> + * leader prog, and the diff_map. Each perf-stat session holds a reference
> + * to the bpf_link to make sure the leader prog is attached to sched_switch
> + * tracepoint.
> + *
> + * Since the hashmap only contains IDs of the bpf_link and diff_map, it
> + * does not hold any references to the leader program. Once all perf-stat
> + * sessions of these events exit, the leader prog, its maps, and the
> + * perf_events will be freed.
> + */
> +struct bperf_attr_map_entry {
> +	__u32 link_id;
> +	__u32 diff_map_id;
> +};
> +
> +#define DEFAULT_ATTR_MAP_PATH "/sys/fs/bpf/bperf_attr_map"

we should make that with sysfs__mountpoint

SNIP

> +static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd,
> +				       struct bperf_attr_map_entry *entry)
> +{
> +	struct bperf_leader_bpf *skel = NULL;
> +	struct perf_cpu_map *all_cpus;
> +	int link_fd, diff_map_fd, err;
> +	struct bpf_link *link = NULL;
> +
> +	skel = bperf_leader_bpf__open();
> +	if (!skel) {
> +		pr_err("Failed to open leader skeleton\n");
> +		return -1;
> +	}
> +
> +	bpf_map__resize(skel->maps.events, libbpf_num_possible_cpus());
> +	err = bperf_leader_bpf__load(skel);
> +	if (err) {
> +		pr_err("Failed to load leader skeleton\n");
> +		goto out;
> +	}
> +
> +	err = -1;
> +	link = bpf_program__attach(skel->progs.on_switch);
> +	if (!link) {
> +		pr_err("Failed to attach leader program\n");
> +		goto out;
> +	}
> +
> +	all_cpus = perf_cpu_map__new(NULL);
> +	if (!all_cpus) {
> +		pr_err("Failed to open cpu_map for all cpus\n");
> +		goto out;
> +	}
> +
> +	link_fd = bpf_link__fd(link);
> +	diff_map_fd = bpf_map__fd(skel->maps.diff_readings);
> +	entry->link_id = bpf_link_get_id(link_fd);
> +	entry->diff_map_id = bpf_map_get_id(diff_map_fd);
> +	err = bpf_map_update_elem(attr_map_fd, &evsel->core.attr, entry, BPF_ANY);
> +	assert(err == 0);
> +
> +	evsel->bperf_leader_link_fd = bpf_link_get_fd_by_id(entry->link_id);
> +	assert(evsel->bperf_leader_link_fd >= 0);
> +
> +	/* save leader_skel for install_pe */

please add to the comment: ', which is called within following evsel__open_per_cpu call'
or something like that.. 

SNIP

> +static int bperf_sync_counters(struct evsel *evsel)
> +{
> +	struct perf_cpu_map *all_cpus = perf_cpu_map__new(NULL);
> +	int num_cpu, i, cpu;
> +
> +	if (!all_cpus)
> +		return -1;
> +
> +	num_cpu = all_cpus->nr;
> +	for (i = 0; i < num_cpu; i++) {
> +		cpu = all_cpus->map[i];
> +		bperf_trigger_reading(evsel->bperf_leader_prog_fd, cpu);
> +	}
> +	return 0;
> +}
> +
> +static int bperf__enable(struct evsel *evsel)
> +{
> +	struct bperf_follower_bpf *skel = evsel->follower_skel;
> +	__u32 num_cpu_bpf = libbpf_num_possible_cpus();

we have cpu__max_cpu for that

> +	struct bpf_perf_event_value values[num_cpu_bpf];
> +	int err, i, entry_cnt;
> +
> +	err = bperf_follower_bpf__attach(evsel->follower_skel);
> +	if (err)
> +		return -1;

could we attach it in load callback and have the some map
value be checked in follower program and 'if value != 0'
it would let the program to continue.. I used/saw this
in few bcc programs

> +
> +	/*
> +	 * Attahcing the skeleton takes non-trivial time (0.2s+ on a kernel
> +	 * with some debug options enabled). This shows as a longer first
> +	 * interval:
> +	 *
> +	 * # perf stat -e cycles -a -I 1000
> +	 * #           time             counts unit events
> +	 *      1.267634674     26,259,166,523      cycles
> +	 *      2.271637827     22,550,822,286      cycles
> +	 *      3.275406553     22,852,583,744      cycles
> +	 *
> +	 * Fix this by zeroing accum_readings after attaching the program.
> +	 */
> +	bperf_sync_counters(evsel);
> +	entry_cnt = bpf_map__max_entries(skel->maps.accum_readings);
> +	memset(values, 0, sizeof(struct bpf_perf_event_value) * num_cpu_bpf);
> +
> +	for (i = 0; i < entry_cnt; i++) {
> +		bpf_map_update_elem(bpf_map__fd(skel->maps.accum_readings),
> +				    &i, values, BPF_ANY);
> +	}
> +	return 0;
> +}
> +
> +static int bperf__read(struct evsel *evsel)
> +{
> +	struct bperf_follower_bpf *skel = evsel->follower_skel;
> +	__u32 num_cpu_bpf = libbpf_num_possible_cpus();
> +	struct bpf_perf_event_value values[num_cpu_bpf];
> +	static struct perf_cpu_map *all_cpus;
> +	int reading_map_fd, err = 0;
> +	__u32 i, j, num_cpu;
> +
> +	if (!all_cpus) {
> +		all_cpus = perf_cpu_map__new(NULL);
> +		if (!all_cpus)
> +			return -1;
> +	}
> +
> +	bperf_sync_counters(evsel);
> +	reading_map_fd = bpf_map__fd(skel->maps.accum_readings);
> +
> +	for (i = 0; i < bpf_map__max_entries(skel->maps.accum_readings); i++) {
> +		__u32 cpu;
> +
> +		err = bpf_map_lookup_elem(reading_map_fd, &i, values);
> +		if (err)
> +			goto out;
> +		switch (evsel->follower_skel->bss->type) {
> +		case BPERF_FILTER_GLOBAL:
> +			assert(i == 0);
> +
> +			num_cpu = all_cpus->nr;
> +			for (j = 0; j < num_cpu; j++) {
> +				cpu = all_cpus->map[j];
> +				perf_counts(evsel->counts, cpu, 0)->val = values[cpu].counter;
> +				perf_counts(evsel->counts, cpu, 0)->ena = values[cpu].enabled;
> +				perf_counts(evsel->counts, cpu, 0)->run = values[cpu].running;
> +			}
> +			break;
> +		case BPERF_FILTER_CPU:
> +			cpu = evsel->core.cpus->map[i];
> +			perf_counts(evsel->counts, i, 0)->val = values[cpu].counter;
> +			perf_counts(evsel->counts, i, 0)->ena = values[cpu].enabled;
> +			perf_counts(evsel->counts, i, 0)->run = values[cpu].running;
> +			break;
> +		case BPERF_FILTER_PID:
> +		case BPERF_FILTER_TGID:
> +			perf_counts(evsel->counts, 0, i)->val = 0;
> +			perf_counts(evsel->counts, 0, i)->ena = 0;
> +			perf_counts(evsel->counts, 0, i)->run = 0;
> +
> +			for (cpu = 0; cpu < num_cpu_bpf; cpu++) {
> +				perf_counts(evsel->counts, 0, i)->val += values[cpu].counter;
> +				perf_counts(evsel->counts, 0, i)->ena += values[cpu].enabled;
> +				perf_counts(evsel->counts, 0, i)->run += values[cpu].running;
> +			}
> +			break;
> +		default:
> +			break;
> +		}
> +	}
> +out:
> +	return err;
> +}
> +
> +static int bperf__destroy(struct evsel *evsel)
> +{
> +	bperf_follower_bpf__destroy(evsel->follower_skel);
> +	close(evsel->bperf_leader_prog_fd);
> +	close(evsel->bperf_leader_link_fd);
> +	return 0;
> +}
> +
> +/*
> + * bperf: share hardware PMCs with BPF
> + *
> + * perf uses performance monitoring counters (PMC) to monitor system
> + * performance. The PMCs are limited hardware resources. For example,
> + * Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> + *
> + * Modern data center systems use these PMCs in many different ways:
> + * system level monitoring, (maybe nested) container level monitoring, per
> + * process monitoring, profiling (in sample mode), etc. In some cases,
> + * there are more active perf_events than available hardware PMCs. To allow
> + * all perf_events to have a chance to run, it is necessary to do expensive
> + * time multiplexing of events.
> + *
> + * On the other hand, many monitoring tools count the common metrics
> + * (cycles, instructions). It is a waste to have multiple tools create
> + * multiple perf_events of "cycles" and occupy multiple PMCs.
> + *
> + * bperf tries to reduce such wastes by allowing multiple perf_events of
> + * "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> + * of having each perf-stat session to read its own perf_events, bperf uses
> + * BPF programs to read the perf_events and aggregate readings to BPF maps.
> + * Then, the perf-stat session(s) reads the values from these BPF maps.
> + *
> + *                                ||
> + *       shared progs and maps <- || -> per session progs and maps
> + *                                ||
> + *   ---------------              ||
> + *   | perf_events |              ||
> + *   ---------------       fexit  ||      -----------------
> + *          |             --------||----> | follower prog |
> + *       --------------- /        || ---  -----------------
> + * cs -> | leader prog |/         ||/        |         |
> + *   --> ---------------         /||  --------------  ------------------
> + *  /       |         |         / ||  | filter map |  | accum_readings |
> + * /  ------------  ------------  ||  --------------  ------------------
> + * |  | prev map |  | diff map |  ||                        |
> + * |  ------------  ------------  ||                        |
> + *  \                             ||                        |
> + * = \ ==================================================== | ============
> + *    \                                                    /   user space
> + *     \                                                  /
> + *      \                                                /
> + *    BPF_PROG_TEST_RUN                    BPF_MAP_LOOKUP_ELEM
> + *        \                                            /
> + *         \                                          /
> + *          \------  perf-stat ----------------------/
> + *
> + * The figure above shows the architecture of bperf. Note that the figure
> + * is divided into 3 regions: shared progs and maps (top left), per session
> + * progs and maps (top right), and user space (bottom).
> + *
> + * The leader prog is triggered on each context switch (cs). The leader
> + * prog reads perf_events and stores the difference (current_reading -
> + * previous_reading) to the diff map. For the same metric, e.g. "cycles",
> + * multiple perf-stat sessions share the same leader prog.
> + *
> + * Each perf-stat session creates a follower prog as fexit program to the
> + * leader prog. It is possible to attach up to BPF_MAX_TRAMP_PROGS (38)
> + * follower progs to the same leader prog. The follower prog checks current
> + * task and processor ID to decide whether to add the value from the diff
> + * map to its accumulated reading map (accum_readings).
> + *
> + * Finally, perf-stat user space reads the value from accum_reading map.
> + *
> + * Besides context switch, it is also necessary to trigger the leader prog
> + * before perf-stat reads the value. Otherwise, the accum_reading map may
> + * not have the latest reading from the perf_events. This is achieved by
> + * triggering the event via sys_bpf(BPF_PROG_TEST_RUN) to each CPU.
> + *
> + * Comment before the definition of struct bperf_attr_map_entry describes
> + * how different sessions of perf-stat share information about the leader
> + * prog.
> + */
> +
> +struct bpf_counter_ops bperf_ops = {
> +	.load       = bperf__load,
> +	.enable     = bperf__enable,
> +	.read       = bperf__read,
> +	.install_pe = bperf__install_pe,
> +	.destroy    = bperf__destroy,
> +};
> +
> +static inline bool bpf_counter_skip(struct evsel *evsel)
> +{
> +	return list_empty(&evsel->bpf_counter_list) &&
> +		evsel->follower_skel == NULL;
> +}
> +
>  int bpf_counter__install_pe(struct evsel *evsel, int cpu, int fd)
>  {
> -	if (list_empty(&evsel->bpf_counter_list))
> +	if (bpf_counter_skip(evsel))
>  		return 0;
>  	return evsel->bpf_counter_ops->install_pe(evsel, cpu, fd);
>  }
>  
>  int bpf_counter__load(struct evsel *evsel, struct target *target)
>  {
> -	if (target__has_bpf(target))
> +	if (target->bpf_str)
>  		evsel->bpf_counter_ops = &bpf_program_profiler_ops;
> +	else if (target->use_bpf)
> +		evsel->bpf_counter_ops = &bperf_ops;
>  
>  	if (evsel->bpf_counter_ops)
>  		return evsel->bpf_counter_ops->load(evsel, target);
> @@ -293,21 +835,21 @@ int bpf_counter__load(struct evsel *evsel, struct target *target)
>  
>  int bpf_counter__enable(struct evsel *evsel)
>  {
> -	if (list_empty(&evsel->bpf_counter_list))
> +	if (bpf_counter_skip(evsel))
>  		return 0;
>  	return evsel->bpf_counter_ops->enable(evsel);
>  }
>  
>  int bpf_counter__read(struct evsel *evsel)
>  {
> -	if (list_empty(&evsel->bpf_counter_list))
> +	if (bpf_counter_skip(evsel))
>  		return -EAGAIN;
>  	return evsel->bpf_counter_ops->read(evsel);
>  }
>  
>  void bpf_counter__destroy(struct evsel *evsel)
>  {
> -	if (list_empty(&evsel->bpf_counter_list))
> +	if (bpf_counter_skip(evsel))
>  		return;
>  	evsel->bpf_counter_ops->destroy(evsel);
>  	evsel->bpf_counter_ops = NULL;
> diff --git a/tools/perf/util/bpf_skel/bperf.h b/tools/perf/util/bpf_skel/bperf.h
> new file mode 100644
> index 0000000000000..186a5551ddb9d
> --- /dev/null
> +++ b/tools/perf/util/bpf_skel/bperf.h
> @@ -0,0 +1,14 @@
> +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +// Copyright (c) 2021 Facebook
> +
> +#ifndef __BPERF_STAT_H
> +#define __BPERF_STAT_H
> +
> +typedef struct {
> +	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> +	__uint(key_size, sizeof(__u32));
> +	__uint(value_size, sizeof(struct bpf_perf_event_value));
> +	__uint(max_entries, 1);
> +} reading_map;
> +
> +#endif /* __BPERF_STAT_H */
> diff --git a/tools/perf/util/bpf_skel/bperf_follower.bpf.c b/tools/perf/util/bpf_skel/bperf_follower.bpf.c
> new file mode 100644
> index 0000000000000..7535d9557ab9a
> --- /dev/null
> +++ b/tools/perf/util/bpf_skel/bperf_follower.bpf.c
> @@ -0,0 +1,65 @@
> +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +// Copyright (c) 2021 Facebook
> +#include <linux/bpf.h>
> +#include <linux/perf_event.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include "bperf.h"
> +#include "bperf_u.h"
> +
> +reading_map diff_readings SEC(".maps");
> +reading_map accum_readings SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(key_size, sizeof(__u32));
> +	__uint(value_size, sizeof(__u32));
> +} filter SEC(".maps");
> +
> +enum bperf_filter_type type = 0;
> +
> +SEC("fexit/XXX")
> +int BPF_PROG(fexit_XXX)
> +{
> +	struct bpf_perf_event_value *diff_val, *accum_val;
> +	__u32 filter_key, zero = 0;
> +	__u32 *accum_key;
> +
> +	switch (type) {
> +	case BPERF_FILTER_GLOBAL:
> +		accum_key = &zero;
> +		goto do_add;
> +	case BPERF_FILTER_CPU:
> +		filter_key = bpf_get_smp_processor_id();
> +		break;
> +	case BPERF_FILTER_PID:
> +		filter_key = bpf_get_current_pid_tgid() & 0xffffffff;
> +		break;
> +	case BPERF_FILTER_TGID:
> +		filter_key = bpf_get_current_pid_tgid() >> 32;
> +		break;
> +	default:
> +		return 0;
> +	}
> +
> +	accum_key = bpf_map_lookup_elem(&filter, &filter_key);
> +	if (!accum_key)
> +		return 0;
> +
> +do_add:
> +	diff_val = bpf_map_lookup_elem(&diff_readings, &zero);
> +	if (!diff_val)
> +		return 0;
> +
> +	accum_val = bpf_map_lookup_elem(&accum_readings, accum_key);
> +	if (!accum_val)
> +		return 0;
> +
> +	accum_val->counter += diff_val->counter;
> +	accum_val->enabled += diff_val->enabled;
> +	accum_val->running += diff_val->running;
> +
> +	return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> diff --git a/tools/perf/util/bpf_skel/bperf_leader.bpf.c b/tools/perf/util/bpf_skel/bperf_leader.bpf.c
> new file mode 100644
> index 0000000000000..4f70d1459e86c
> --- /dev/null
> +++ b/tools/perf/util/bpf_skel/bperf_leader.bpf.c
> @@ -0,0 +1,46 @@
> +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +// Copyright (c) 2021 Facebook
> +#include <linux/bpf.h>
> +#include <linux/perf_event.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include "bperf.h"
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
> +	__uint(key_size, sizeof(__u32));
> +	__uint(value_size, sizeof(int));
> +	__uint(map_flags, BPF_F_PRESERVE_ELEMS);
> +} events SEC(".maps");
> +
> +reading_map prev_readings SEC(".maps");
> +reading_map diff_readings SEC(".maps");
> +
> +SEC("raw_tp/sched_switch")
> +int BPF_PROG(on_switch)
> +{
> +	struct bpf_perf_event_value val, *prev_val, *diff_val;
> +	__u32 key = bpf_get_smp_processor_id();
> +	__u32 zero = 0;
> +	long err;
> +
> +	prev_val = bpf_map_lookup_elem(&prev_readings, &zero);
> +	if (!prev_val)
> +		return 0;
> +
> +	diff_val = bpf_map_lookup_elem(&diff_readings, &zero);
> +	if (!diff_val)
> +		return 0;
> +
> +	err = bpf_perf_event_read_value(&events, key, &val, sizeof(val));
> +	if (err)
> +		return 0;
> +
> +	diff_val->counter = val.counter - prev_val->counter;
> +	diff_val->enabled = val.enabled - prev_val->enabled;
> +	diff_val->running = val.running - prev_val->running;
> +	*prev_val = val;
> +	return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index 6026487353dd8..dd4f56f9cfdf5 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -20,6 +20,8 @@ union perf_event;
>  struct bpf_counter_ops;
>  struct target;
>  struct hashmap;
> +struct bperf_leader_bpf;
> +struct bperf_follower_bpf;
>  
>  typedef int (evsel__sb_cb_t)(union perf_event *event, void *data);
>  
> @@ -130,8 +132,24 @@ struct evsel {
>  	 * See also evsel__has_callchain().
>  	 */
>  	__u64			synth_sample_type;
> -	struct list_head	bpf_counter_list;
> +
> +	/*
> +	 * bpf_counter_ops serves two use cases:
> +	 *   1. perf-stat -b          counting events used byBPF programs
> +	 *   2. perf-stat --use-bpf   use BPF programs to aggregate counts
> +	 */
>  	struct bpf_counter_ops	*bpf_counter_ops;
> +
> +	/* for perf-stat -b */
> +	struct list_head	bpf_counter_list;
> +
> +	/* for perf-stat --use-bpf */
> +	int			bperf_leader_prog_fd;
> +	int			bperf_leader_link_fd;
> +	union {
> +		struct bperf_leader_bpf *leader_skel;
> +		struct bperf_follower_bpf *follower_skel;
> +	};
>  };
>  
>  struct perf_missing_features {
> diff --git a/tools/perf/util/target.h b/tools/perf/util/target.h
> index f132c6c2eef81..1bce3eb28ef25 100644
> --- a/tools/perf/util/target.h
> +++ b/tools/perf/util/target.h
> @@ -16,6 +16,8 @@ struct target {
>  	bool	     uses_mmap;
>  	bool	     default_per_cpu;
>  	bool	     per_thread;
> +	bool	     use_bpf;
> +	const char   *attr_map;
>  };
>  
>  enum target_errno {
> @@ -66,7 +68,7 @@ static inline bool target__has_cpu(struct target *target)
>  
>  static inline bool target__has_bpf(struct target *target)
>  {
> -	return target->bpf_str;
> +	return target->bpf_str || target->use_bpf;
>  }
>  
>  static inline bool target__none(struct target *target)
> -- 
> 2.24.1
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-14 22:48 ` Jiri Olsa
@ 2021-03-15  7:51   ` Song Liu
  2021-03-15 14:09     ` Jiri Olsa
  0 siblings, 1 reply; 21+ messages in thread
From: Song Liu @ 2021-03-15  7:51 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: linux-kernel, Kernel Team, Arnaldo Carvalho de Melo, acme,
	namhyung, jolsa



> On Mar 14, 2021, at 3:48 PM, Jiri Olsa <jolsa@redhat.com> wrote:
> 
> On Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu wrote:
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>> 
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>> 
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>> 
>> bperf tries to reduce such wastes by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>> of having each perf-stat session to read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>> Then, the perf-stat session(s) reads the values from these BPF maps.
>> 
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
>> 
>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>> bperf uses a BPF hashmap to share information about BPF programs and maps
>> used by bperf. This map is pinned to bpffs. The default address is
>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>> --attr-map.
>> 
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
> 
> isn't that another filter via bpf_get_current_cgroup_id ?

This is tricky with nested cgroups. We also have bpf_get_current_ancestor_cgroup_id,
but that's not the exact same we need either. We may need a new helper. 

Also, we are limited by 38 follower programs for the leader program, which 
might be too few for Namhyung's use case. We can use some logic in the 
follower program to count events for many cgroups in one fexit program. 

> 
>> 2. Do not support monitoring of BPF program (perf-stat -b);
> 
> we'd need to call the leadr on fentry/fexit of the monitored bpf
> program, right?

My current plan is to let perf-stat -b share the same perf_event map with
bperf, but not the leader program. 

> 
>> 3. Do not support event groups.
> 
> I guess for group support you'll need to load 'leaders' for each group member

I am not sure how this would work. Say the user started the following in 
parallel:

    #1  perf stat --bpf-counters -e cycles -a &
    #2  perf stat --bpf-counters -e instructions -C 1,2,3 &
    #3  perf stat --bpf-counters -e {cycles,instructions} -p 123 

Event "cycles" is loaded by #1; while event "instruction" is loaded by #2.
If #3 reuses these events, it is tricky (or impossible?) to make sure the
event group works as expected, right?

> 
>> 
>> The following commands have been tested:
>> 
>>   perf stat --use-bpf -e cycles -a
>>   perf stat --use-bpf -e cycles -C 1,3,4
>>   perf stat --use-bpf -e cycles -p 123
>>   perf stat --use-bpf -e cycles -t 100,101
> 
> works good with that file you sent.. I'll check/test more,
> so far some quick comments below
> 
> thanks,
> jirka
> 
> 
> 
> SNIP
> 
>> @@ -1146,6 +1156,10 @@ static struct option stat_options[] = {
>> #ifdef HAVE_BPF_SKEL
>> 	OPT_STRING('b', "bpf-prog", &target.bpf_str, "bpf-prog-id",
>> 		   "stat events on existing bpf program id"),
>> +	OPT_BOOLEAN(0, "use-bpf", &target.use_bpf,
>> +		    "use bpf program to count events"),
>> +	OPT_STRING(0, "attr-map", &target.attr_map, "attr-map-path",
>> +		   "path to perf_event_attr map"),
> 
> what's the point of allowing another name? just debug purpose?

It is mostly to cover corner cases, like something else used the same 
name. 

> 
> SNIP
> 
>> + * bperf uses a hashmap, the attr_map, to track all the leader programs.
>> + * The hashmap is pinned in bpffs. flock() on this file is used to ensure
>> + * no concurrent access to the attr_map.  The key of attr_map is struct
>> + * perf_event_attr, and the value is struct bperf_attr_map_entry.
>> + *
>> + * struct bperf_attr_map_entry contains two __u32 IDs, bpf_link of the
>> + * leader prog, and the diff_map. Each perf-stat session holds a reference
>> + * to the bpf_link to make sure the leader prog is attached to sched_switch
>> + * tracepoint.
>> + *
>> + * Since the hashmap only contains IDs of the bpf_link and diff_map, it
>> + * does not hold any references to the leader program. Once all perf-stat
>> + * sessions of these events exit, the leader prog, its maps, and the
>> + * perf_events will be freed.
>> + */
>> +struct bperf_attr_map_entry {
>> +	__u32 link_id;
>> +	__u32 diff_map_id;
>> +};
>> +
>> +#define DEFAULT_ATTR_MAP_PATH "/sys/fs/bpf/bperf_attr_map"
> 
> we should make that with sysfs__mountpoint

Sounds great!

> 
> SNIP
> 
>> +static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd,
>> +				       struct bperf_attr_map_entry *entry)
>> +{
>> +	struct bperf_leader_bpf *skel = NULL;
>> +	struct perf_cpu_map *all_cpus;
>> +	int link_fd, diff_map_fd, err;
>> +	struct bpf_link *link = NULL;
>> +
>> +	skel = bperf_leader_bpf__open();
>> +	if (!skel) {
>> +		pr_err("Failed to open leader skeleton\n");
>> +		return -1;
>> +	}
>> +
>> +	bpf_map__resize(skel->maps.events, libbpf_num_possible_cpus());
>> +	err = bperf_leader_bpf__load(skel);
>> +	if (err) {
>> +		pr_err("Failed to load leader skeleton\n");
>> +		goto out;
>> +	}
>> +
>> +	err = -1;
>> +	link = bpf_program__attach(skel->progs.on_switch);
>> +	if (!link) {
>> +		pr_err("Failed to attach leader program\n");
>> +		goto out;
>> +	}
>> +
>> +	all_cpus = perf_cpu_map__new(NULL);
>> +	if (!all_cpus) {
>> +		pr_err("Failed to open cpu_map for all cpus\n");
>> +		goto out;
>> +	}
>> +
>> +	link_fd = bpf_link__fd(link);
>> +	diff_map_fd = bpf_map__fd(skel->maps.diff_readings);
>> +	entry->link_id = bpf_link_get_id(link_fd);
>> +	entry->diff_map_id = bpf_map_get_id(diff_map_fd);
>> +	err = bpf_map_update_elem(attr_map_fd, &evsel->core.attr, entry, BPF_ANY);
>> +	assert(err == 0);
>> +
>> +	evsel->bperf_leader_link_fd = bpf_link_get_fd_by_id(entry->link_id);
>> +	assert(evsel->bperf_leader_link_fd >= 0);
>> +
>> +	/* save leader_skel for install_pe */
> 
> please add to the comment: ', which is called within following evsel__open_per_cpu call'
> or something like that.. 

Will add it in v2. 

> 
> SNIP
> 
>> +static int bperf_sync_counters(struct evsel *evsel)
>> +{
>> +	struct perf_cpu_map *all_cpus = perf_cpu_map__new(NULL);
>> +	int num_cpu, i, cpu;
>> +
>> +	if (!all_cpus)
>> +		return -1;
>> +
>> +	num_cpu = all_cpus->nr;
>> +	for (i = 0; i < num_cpu; i++) {
>> +		cpu = all_cpus->map[i];
>> +		bperf_trigger_reading(evsel->bperf_leader_prog_fd, cpu);
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int bperf__enable(struct evsel *evsel)
>> +{
>> +	struct bperf_follower_bpf *skel = evsel->follower_skel;
>> +	__u32 num_cpu_bpf = libbpf_num_possible_cpus();
> 
> we have cpu__max_cpu for that

libbpf calls for percpu array use libbpf_num_possible_cpus. So I guess it 
is better to use the same here. The two are identical at the moment though.

> 
>> +	struct bpf_perf_event_value values[num_cpu_bpf];
>> +	int err, i, entry_cnt;
>> +
>> +	err = bperf_follower_bpf__attach(evsel->follower_skel);
>> +	if (err)
>> +		return -1;
> 
> could we attach it in load callback and have the some map
> value be checked in follower program and 'if value != 0'
> it would let the program to continue.. I used/saw this
> in few bcc programs

Yeah, that's another way to eliminate the __attach() latency. Let me try it. 

> 
>> +
>> +	/*
>> +	 * Attahcing the skeleton takes non-trivial time (0.2s+ on a kernel
>> +	 * with some debug options enabled). This shows as a longer first
>> +	 * interval:
>> +	 *
>> +	 * # perf stat -e cycles -a -I 1000
>> +	 * #           time             counts unit events
>> +	 *      1.267634674     26,259,166,523      cycles
>> +	 *      2.271637827     22,550,822,286      cycles
>> +	 *      3.275406553     22,852,583,744      cycles
>> +	 *
>> +	 * Fix this by zeroing accum_readings after attaching the program.
>> +	 */
>> +	bperf_sync_counters(evsel);
>> +	entry_cnt = bpf_map__max_entries(skel->maps.accum_readings);
>> +	memset(values, 0, sizeof(struct bpf_perf_event_value) * num_cpu_bpf);
>> +
>> +	for (i = 0; i < entry_cnt; i++) {
>> +		bpf_map_update_elem(bpf_map__fd(skel->maps.accum_readings),
>> +				    &i, values, BPF_ANY);
>> +	}
>> +	return 0;
>> +}

[...]


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-12 18:52   ` Song Liu
@ 2021-03-15 13:10     ` Arnaldo Carvalho de Melo
  2021-03-15 15:34       ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-15 13:10 UTC (permalink / raw)
  To: Song Liu, Andi Kleen, Jiri Olsa
  Cc: linux-kernel, Kernel Team, namhyung, linux-perf-users

Em Fri, Mar 12, 2021 at 06:52:39PM +0000, Song Liu escreveu:
> > On Mar 12, 2021, at 6:24 AM, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > Em Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu escreveu:
> >> perf uses performance monitoring counters (PMCs) to monitor system
> >> performance. The PMCs are limited hardware resources. For example,
> >> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >> 
> >> Modern data center systems use these PMCs in many different ways:
> >> system level monitoring, (maybe nested) container level monitoring, per
> >> process monitoring, profiling (in sample mode), etc. In some cases,
> >> there are more active perf_events than available hardware PMCs. To allow
> >> all perf_events to have a chance to run, it is necessary to do expensive
> >> time multiplexing of events.
> >> 
> >> On the other hand, many monitoring tools count the common metrics (cycles,
> >> instructions). It is a waste to have multiple tools create multiple
> >> perf_events of "cycles" and occupy multiple PMCs.
> >> 
> >> bperf tries to reduce such wastes by allowing multiple perf_events of
> >> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> >> of having each perf-stat session to read its own perf_events, bperf uses
> >> BPF programs to read the perf_events and aggregate readings to BPF maps.
> >> Then, the perf-stat session(s) reads the values from these BPF maps.
> >> 
> >> Please refer to the comment before the definition of bperf_ops for the
> >> description of bperf architecture.
> >> 
> >> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
> >> bperf uses a BPF hashmap to share information about BPF programs and maps
> >> used by bperf. This map is pinned to bpffs. The default address is
> >> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
> >> --attr-map.
> >> 
> >> ---
> >> Known limitations:
> >> 1. Do not support per cgroup events;
> >> 2. Do not support monitoring of BPF program (perf-stat -b);
> >> 3. Do not support event groups.
> >> The following commands have been tested:
> >> 
> >>   perf stat --use-bpf -e cycles -a
> >>   perf stat --use-bpf -e cycles -C 1,3,4
> >>   perf stat --use-bpf -e cycles -p 123
> >>   perf stat --use-bpf -e cycles -t 100,101

<SNIP>

> >> @@ -925,15 +931,15 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
> >> 	/*
> >> 	 * Enable counters and exec the command:
> >> 	 */
> >> -	t0 = rdclock();
> >> -	clock_gettime(CLOCK_MONOTONIC, &ref_time);
> >> -
> >> 	if (forks) {
> >> 		evlist__start_workload(evsel_list);
> >> 		err = enable_counters();
> >> 		if (err)
> >> 			return -1;
> >> 
> >> +		t0 = rdclock();
> >> +		clock_gettime(CLOCK_MONOTONIC, &ref_time);
> >> +
> >> 		if (interval || timeout || evlist__ctlfd_initialized(evsel_list))
> >> 			status = dispatch_events(forks, timeout, interval, &times);
> >> 		if (child_pid != -1) {
> >> @@ -954,6 +960,10 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
> >> 		err = enable_counters();
> >> 		if (err)
> >> 			return -1;
> >> +
> >> +		t0 = rdclock();
> >> +		clock_gettime(CLOCK_MONOTONIC, &ref_time);
> >> +
> >> 		status = dispatch_events(forks, timeout, interval, &times);
> >> 	}
> >> 
> > 
> > The above two hunks seems out of place, i.e. can they go to a different
> > patch and with an explanation about why this is needed?
> 
> Actually, I am still debating whether we want the above change in a separate 
> patch. It is related to the following change. 
> 
> [...]
> 
> >> +	/*
> >> +	 * Attahcing the skeleton takes non-trivial time (0.2s+ on a kernel
> >> +	 * with some debug options enabled). This shows as a longer first
> >> +	 * interval:
> >> +	 *
> >> +	 * # perf stat -e cycles -a -I 1000
> >> +	 * #           time             counts unit events
> >> +	 *      1.267634674     26,259,166,523      cycles
> >> +	 *      2.271637827     22,550,822,286      cycles
> >> +	 *      3.275406553     22,852,583,744      cycles
> >> +	 *
> >> +	 * Fix this by zeroing accum_readings after attaching the program.
> >> +	 */
> >> +	bperf_sync_counters(evsel);
> >> +	entry_cnt = bpf_map__max_entries(skel->maps.accum_readings);
> >> +	memset(values, 0, sizeof(struct bpf_perf_event_value) * num_cpu_bpf);
> >> +
> >> +	for (i = 0; i < entry_cnt; i++) {
> >> +		bpf_map_update_elem(bpf_map__fd(skel->maps.accum_readings),
> >> +				    &i, values, BPF_ANY);
> >> +	}
> >> +	return 0;
> >> +}
> 
> Attaching the skeleton takes non-trivial time, so that we get a bigger first 
> interval (1.26s in the example above). To fix this, in __run_perf_stat(), we 
> get t0 and ref_time after enable_counters(). 
> 
> Maybe a comment in __run_perf_stat() is better than a separate patch?

I still think that there is value in taking those measurements after we
enable the counters, as, for instance, for interval mode we want
measurements with the counters enabled, whatever happens before the
counters are enabled is just startup time, etc. Jiri, Andi?

- Arnaldo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-15  7:51   ` Song Liu
@ 2021-03-15 14:09     ` Jiri Olsa
  2021-03-15 18:24       ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Jiri Olsa @ 2021-03-15 14:09 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, Kernel Team, Arnaldo Carvalho de Melo, acme,
	namhyung, jolsa

On Mon, Mar 15, 2021 at 07:51:11AM +0000, Song Liu wrote:

SNIP

> >> Known limitations:
> >> 1. Do not support per cgroup events;
> > 
> > isn't that another filter via bpf_get_current_cgroup_id ?
> 
> This is tricky with nested cgroups. We also have bpf_get_current_ancestor_cgroup_id,
> but that's not the exact same we need either. We may need a new helper. 
> 
> Also, we are limited by 38 follower programs for the leader program, which 
> might be too few for Namhyung's use case. We can use some logic in the 
> follower program to count events for many cgroups in one fexit program. 

ok, I see

> 
> > 
> >> 2. Do not support monitoring of BPF program (perf-stat -b);
> > 
> > we'd need to call the leadr on fentry/fexit of the monitored bpf
> > program, right?
> 
> My current plan is to let perf-stat -b share the same perf_event map with
> bperf, but not the leader program. 

ok

> 
> > 
> >> 3. Do not support event groups.
> > 
> > I guess for group support you'll need to load 'leaders' for each group member
> 
> I am not sure how this would work. Say the user started the following in 
> parallel:
> 
>     #1  perf stat --bpf-counters -e cycles -a &
>     #2  perf stat --bpf-counters -e instructions -C 1,2,3 &
>     #3  perf stat --bpf-counters -e {cycles,instructions} -p 123 
> 
> Event "cycles" is loaded by #1; while event "instruction" is loaded by #2.
> If #3 reuses these events, it is tricky (or impossible?) to make sure the
> event group works as expected, right?

the reason for group is to force kernel to all events if possible
or none..  so with your change I don't see a problem, you just
provide values for those counters.. but let's see when you get
there ;-)

> 
> > 
> >> 
> >> The following commands have been tested:
> >> 
> >>   perf stat --use-bpf -e cycles -a
> >>   perf stat --use-bpf -e cycles -C 1,3,4
> >>   perf stat --use-bpf -e cycles -p 123
> >>   perf stat --use-bpf -e cycles -t 100,101
> > 
> > works good with that file you sent.. I'll check/test more,
> > so far some quick comments below
> > 
> > thanks,
> > jirka
> > 
> > 
> > 
> > SNIP
> > 
> >> @@ -1146,6 +1156,10 @@ static struct option stat_options[] = {
> >> #ifdef HAVE_BPF_SKEL
> >> 	OPT_STRING('b', "bpf-prog", &target.bpf_str, "bpf-prog-id",
> >> 		   "stat events on existing bpf program id"),
> >> +	OPT_BOOLEAN(0, "use-bpf", &target.use_bpf,
> >> +		    "use bpf program to count events"),
> >> +	OPT_STRING(0, "attr-map", &target.attr_map, "attr-map-path",
> >> +		   "path to perf_event_attr map"),
> > 
> > what's the point of allowing another name? just debug purpose?
> 
> It is mostly to cover corner cases, like something else used the same 
> name. 

about that.. we just take the object fd assuming it's map,
should we test it somehow?

  map_fd = bpf_obj_get(path);

if it's not the map we expect, I think we should generate
another name without forcing user to run again with --attr-map

but still warn, so other perf session can use the new name

SNIP

> >> +static int bperf_sync_counters(struct evsel *evsel)
> >> +{
> >> +	struct perf_cpu_map *all_cpus = perf_cpu_map__new(NULL);
> >> +	int num_cpu, i, cpu;
> >> +
> >> +	if (!all_cpus)
> >> +		return -1;
> >> +
> >> +	num_cpu = all_cpus->nr;
> >> +	for (i = 0; i < num_cpu; i++) {
> >> +		cpu = all_cpus->map[i];
> >> +		bperf_trigger_reading(evsel->bperf_leader_prog_fd, cpu);
> >> +	}
> >> +	return 0;
> >> +}
> >> +
> >> +static int bperf__enable(struct evsel *evsel)
> >> +{
> >> +	struct bperf_follower_bpf *skel = evsel->follower_skel;
> >> +	__u32 num_cpu_bpf = libbpf_num_possible_cpus();
> > 
> > we have cpu__max_cpu for that
> 
> libbpf calls for percpu array use libbpf_num_possible_cpus. So I guess it 
> is better to use the same here. The two are identical at the moment though.

then in the bperf__read you take that array and update
perf_counts, which is based on perf's cpus, so they mix
anyway

I'd like to keep perf code using perf's cpus api.. could
we just check at the begining that libbpf_num_possible_cpus
returns same number as cpu__max_cpu (if not, we have a
problem anyway) and use perf's cpu api

thanks,
jirka


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-15 13:10     ` Arnaldo Carvalho de Melo
@ 2021-03-15 15:34       ` Andi Kleen
  2021-03-15 18:29         ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2021-03-15 15:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Song Liu, Andi Kleen, Jiri Olsa, linux-kernel, Kernel Team,
	namhyung, linux-perf-users

> I still think that there is value in taking those measurements after we
> enable the counters, as, for instance, for interval mode we want
> measurements with the counters enabled, whatever happens before the
> counters are enabled is just startup time, etc. Jiri, Andi?

Yes I agree. Only the time with counters enabled matters.


-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-15 14:09     ` Jiri Olsa
@ 2021-03-15 18:24       ` Song Liu
  0 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2021-03-15 18:24 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: linux-kernel, Kernel Team, Arnaldo Carvalho de Melo, acme,
	namhyung, jolsa



> On Mar 15, 2021, at 7:09 AM, Jiri Olsa <jolsa@redhat.com> wrote:
> 
> On Mon, Mar 15, 2021 at 07:51:11AM +0000, Song Liu wrote:
> 
> SNIP

[...]

>> 
>> It is mostly to cover corner cases, like something else used the same 
>> name. 
> 
> about that.. we just take the object fd assuming it's map,
> should we test it somehow?
> 
>  map_fd = bpf_obj_get(path);
> 
> if it's not the map we expect, I think we should generate
> another name without forcing user to run again with --attr-map
> 
> but still warn, so other perf session can use the new name

The auto failover is an interesting idea. But I guess we still need 
--attr-map. Another use case is when the user mounted bpffs to a 
different path. Alternatively, maybe we can teach perf to search all 
bpffs mount points for perf_attr_map? 

> 
> SNIP
> 
>>>> +static int bperf_sync_counters(struct evsel *evsel)
>>>> +{
>>>> +	struct perf_cpu_map *all_cpus = perf_cpu_map__new(NULL);
>>>> +	int num_cpu, i, cpu;
>>>> +
>>>> +	if (!all_cpus)
>>>> +		return -1;
>>>> +
>>>> +	num_cpu = all_cpus->nr;
>>>> +	for (i = 0; i < num_cpu; i++) {
>>>> +		cpu = all_cpus->map[i];
>>>> +		bperf_trigger_reading(evsel->bperf_leader_prog_fd, cpu);
>>>> +	}
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static int bperf__enable(struct evsel *evsel)
>>>> +{
>>>> +	struct bperf_follower_bpf *skel = evsel->follower_skel;
>>>> +	__u32 num_cpu_bpf = libbpf_num_possible_cpus();
>>> 
>>> we have cpu__max_cpu for that
>> 
>> libbpf calls for percpu array use libbpf_num_possible_cpus. So I guess it 
>> is better to use the same here. The two are identical at the moment though.
> 
> then in the bperf__read you take that array and update
> perf_counts, which is based on perf's cpus, so they mix
> anyway
> 
> I'd like to keep perf code using perf's cpus api.. could
> we just check at the begining that libbpf_num_possible_cpus
> returns same number as cpu__max_cpu (if not, we have a
> problem anyway) and use perf's cpu api

Let me try cpu__max_cpu. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
  2021-03-15 15:34       ` Andi Kleen
@ 2021-03-15 18:29         ` Song Liu
  0 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2021-03-15 18:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, linux-kernel, Kernel Team,
	namhyung, linux-perf-users



> On Mar 15, 2021, at 8:34 AM, Andi Kleen <andi@firstfloor.org> wrote:
> 
>> I still think that there is value in taking those measurements after we
>> enable the counters, as, for instance, for interval mode we want
>> measurements with the counters enabled, whatever happens before the
>> counters are enabled is just startup time, etc. Jiri, Andi?
> 
> Yes I agree. Only the time with counters enabled matters.

I see. This change itself is a valid fix. I will create a separate patch.

Thanks,
Song

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-03-15 18:30 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-12  2:02 [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF Song Liu
2021-03-12  8:36 ` Namhyung Kim
2021-03-12 15:38   ` Song Liu
2021-03-13  2:47     ` Namhyung Kim
2021-03-13 19:37       ` Song Liu
2021-03-12 12:12 ` Jiri Olsa
2021-03-12 15:45   ` Song Liu
2021-03-12 16:09     ` Song Liu
2021-03-13 22:06       ` Jiri Olsa
2021-03-13 23:03         ` Song Liu
2021-03-12 14:24 ` Arnaldo Carvalho de Melo
2021-03-12 16:07   ` Song Liu
2021-03-12 17:37     ` Arnaldo Carvalho de Melo
2021-03-12 18:52   ` Song Liu
2021-03-15 13:10     ` Arnaldo Carvalho de Melo
2021-03-15 15:34       ` Andi Kleen
2021-03-15 18:29         ` Song Liu
2021-03-14 22:48 ` Jiri Olsa
2021-03-15  7:51   ` Song Liu
2021-03-15 14:09     ` Jiri Olsa
2021-03-15 18:24       ` Song Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).