LKML Archive on lore.kernel.org
* [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support
@ 2022-02-08 21:16 Stephane Eranian
  2022-02-08 21:16 ` [PATCH v6 01/12] perf/core: add perf_clear_branch_entry_bitfields() helper Stephane Eranian
                   ` (11 more replies)
  0 siblings, 12 replies; 29+ messages in thread
From: Stephane Eranian @ 2022-02-08 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

Add support for the AMD Fam19h 16-deep branch sampling feature as described
in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13. This is a model
specific extension. It is not an architected AMD feature.

The Branch Sampling Feature (BRS) provides the statistical taken branch
information necessary to enable autoFDO-style optimization by compilers,
i.e., basic block execution counts.

BRS operates with a 16-deep saturating buffer in MSR registers. There is no
hardware branch type filtering. All control flow changes are captured. BRS
relies on specific programming of the core PMU of Fam19h.  In particular,
the following requirements must be met:
 - the sampling period must be greater than 16 (the BRS depth)
 - the sampling period must use fixed mode, not frequency mode

BRS interacts with the NMI interrupt as well. Because enabling BRS is
expensive, it is only activated after P event occurrences, where P is the
desired sampling period. At P occurrences of the event, the counter
overflows, the CPU catches the NMI interrupt, activates BRS for 16 branches
until it saturates, and then delivers the NMI to the kernel. Between the
overflow and the time BRS activates, more branches may be executed, skewing
the period. All along, the sampling event keeps counting. The skid may be
attenuated by reducing the sampling period by 16.

BRS is integrated into perf_events seamlessly via the same
PERF_RECORD_BRANCH_STACK sample format. BRS generates branch
perf_branch_entry records in the sampling buffer. There is no prediction or
latency information supported. The branches are stored in reverse order of
execution.  The most recent branch is the first entry in each record.

Because BRS must be stopped when a CPU goes into low power mode, the series
includes patches to add callbacks on ACPI low power entry and exit which is
used on AMD processors.

Given that there is no privilege filtering with BRS, the kernel implements
filtering on privilege level.

This version adds a few simple modifications to perf record and report.
1. add the branch-brs event as a builtin so that it can be used directly:
   perf record -e branch-brs ...
2. improve error handling for AMD IBS (contributed by Kim Phillips).
3. reuse that improved error handling for BRS.
4. add two new sort dimensions to help display the branch sampling
   information. Because there is no latency information associated with the
   branch sampling feature, perf report would collapse all samples within a
   function into a single histogram entry. This is expected because the
   default sort mode for PERF_SAMPLE_BRANCH_STACK is symbol_from/symbol_to.
   This propagates to the annotation.

For a more detailed view of the branch samples, the new sort dimensions
addr_from,addr_to can be used instead, as follows:

$ perf report --sort=overhead,comm,dso,addr_from,addr_to 
# Overhead  Command    Shared Object     Source Address   Target Address
# ........  .......... ..............  ..............     ..............
#
     4.21%  test_prg   test_prg       [.] test_threa+0x3c [.] test_threa+0x4
     4.14%  test_prg   test_prg       [.] test_threa+0x3e [.] test_threa+0x2
     4.10%  test_prg   test_prg       [.] test_threa+0x4  [.] test_threa+0x3a
     4.07%  test_prg   test_prg       [.] test_threa+0x2  [.] test_threa+0x3c

Versus the default output:

$ perf report 
# Overhead  Command   Source Shared Object Source Symbol    Target Symbol      Basic Block Cycles
# ........  ......... .................... ................ .................  ..................
#
    99.52%  test_prg  test_prg             [.] test_thread  [.] test_thread    -                 

BRS can be used with any sampling event. However, it is recommended to use
the RETIRED_BRANCH event because it matches what the BRS captures. For
convenience, a pseudo event matching the branches captured by BRS is
exported by the kernel (branch-brs):

$ perf record -b -e cpu/branch-brs/ -c 1000037 test

$ perf report -D
56531696056126 0x193c000 [0x1a8]: PERF_RECORD_SAMPLE(IP, 0x2): 18122/18230: 0x401d24 period: 1000037 addr: 0
... branch stack: nr:16
.....  0: 0000000000401d24 -> 0000000000401d5a 0 cycles      0
.....  1: 0000000000401d5c -> 0000000000401d24 0 cycles      0
.....  2: 0000000000401d22 -> 0000000000401d5c 0 cycles      0
.....  3: 0000000000401d5e -> 0000000000401d22 0 cycles      0
.....  4: 0000000000401d20 -> 0000000000401d5e 0 cycles      0
.....  5: 0000000000401d3e -> 0000000000401d20 0 cycles      0
.....  6: 0000000000401d42 -> 0000000000401d3e 0 cycles      0
.....  7: 0000000000401d3c -> 0000000000401d42 0 cycles      0
.....  8: 0000000000401d44 -> 0000000000401d3c 0 cycles      0
.....  9: 0000000000401d3a -> 0000000000401d44 0 cycles      0
..... 10: 0000000000401d46 -> 0000000000401d3a 0 cycles      0
..... 11: 0000000000401d38 -> 0000000000401d46 0 cycles      0
..... 12: 0000000000401d48 -> 0000000000401d38 0 cycles      0
..... 13: 0000000000401d36 -> 0000000000401d48 0 cycles      0
..... 14: 0000000000401d4a -> 0000000000401d36 0 cycles      0
..... 15: 0000000000401d34 -> 0000000000401d4a 0 cycles      0
 ... thread: test:18230
 ...... dso: test

Special thanks to Kim Phillips @ AMD for the testing, reviews and
contributions.

V2 makes the following changes:
  - the low power callback code has been reworked completely. It no longer
    impacts the generic perf_events code. This is all handled via x86 code
    and only for the ACPI low power driver, which seems to be the default
    on AMD. The changes in acpi_pad.c and processor_idle.c have no impact
    on non-x86 architectures; on Intel x86, or on AMD without BRS, a jump
    label is used to avoid the code unless necessary
  - BRS is an opt-in compile time option for the kernel
  - branch_stack bit clearing helper is introduced
  - As for the fact that BRS holds the NMI and may conflict with other
    sampling events and introduce skid: this is not really a problem,
    because AMD PMI skid is already very large, prompting special handling
    in amd_pmu_wait_on_overflow(). Adding a few cycles while the CPU
    executes at most 16 taken branches is therefore not a problem.


V3 makes the following changes:
   - simplify the handling of BRS enable/disable to mimic the Intel LBR code
     path more closely. That removes some callbacks in generic x86 code
   - add a config option to compile BRS as an opt-in (off by default)
   - update the perf tool error reporting patches from Kim Phillips

V4 makes the following changes:
   - rebase to latest tip.git (commit 6f5ac142e5df)
   - integrate Kim Phillips' latest perf tool error handling patches


V5 makes the following changes:
   - rebased to 5.17-rc1
   - integrated feedback from PeterZ about AMD perf_events callbacks
   - fix cpufeature macros name X86_FEATURE_BRS
   - integrated all perf tool error handling from Kim Phillips

V6 makes the following changes:
   - rebase to 5.17-rc3
   - fix the typo in the Kconfig for BRS opt-in
   - reword all changelogs to be more aligned with the standard style


Kim Phillips (1):
  perf tools: Improve IBS error handling

Stephane Eranian (11):
  perf/core: add perf_clear_branch_entry_bitfields() helper
  x86/cpufeatures: add AMD Fam19h Branch Sampling feature
  perf/x86/amd: add AMD Fam19h Branch Sampling support
  perf/x86/amd: add branch-brs helper event for Fam19h BRS
  perf/x86/amd: enable branch sampling priv level filtering
  perf/x86/amd: add AMD branch sampling period adjustment
  perf/x86/amd: make Zen3 branch sampling opt-in
  ACPI: add perf low power callback
  perf/x86/amd: add idle hooks for branch sampling
  perf tools: Improve error handling of AMD Branch Sampling
  perf report: add addr_from/addr_to sort dimensions

 arch/x86/events/Kconfig            |   8 +
 arch/x86/events/amd/Makefile       |   1 +
 arch/x86/events/amd/brs.c          | 363 +++++++++++++++++++++++++++++
 arch/x86/events/amd/core.c         | 217 ++++++++++++++++-
 arch/x86/events/core.c             |  17 +-
 arch/x86/events/intel/lbr.c        |  36 ++-
 arch/x86/events/perf_event.h       | 143 ++++++++++--
 arch/x86/include/asm/cpufeatures.h |   1 +
 arch/x86/include/asm/msr-index.h   |   4 +
 arch/x86/include/asm/perf_event.h  |  21 ++
 drivers/acpi/acpi_pad.c            |   6 +
 drivers/acpi/processor_idle.c      |   5 +
 include/linux/perf_event.h         |  22 ++
 tools/perf/util/evsel.c            |  38 +++
 tools/perf/util/hist.c             |   2 +
 tools/perf/util/hist.h             |   2 +
 tools/perf/util/sort.c             | 128 ++++++++++
 tools/perf/util/sort.h             |   2 +
 18 files changed, 976 insertions(+), 40 deletions(-)
 create mode 100644 arch/x86/events/amd/brs.c

-- 
2.35.0.263.gb82422642f-goog



* [PATCH v6 01/12] perf/core: add perf_clear_branch_entry_bitfields() helper
  2022-02-08 21:16 [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
@ 2022-02-08 21:16 ` Stephane Eranian
  2022-02-08 21:16 ` [PATCH v6 02/12] x86/cpufeatures: add AMD Fam19h Branch Sampling feature Stephane Eranian
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: Stephane Eranian @ 2022-02-08 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

Make it simpler to reset all the info fields of a perf_branch_entry
by adding an inline helper function.

The goal is to centralize the initialization to avoid missing
a field in case more are added.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/intel/lbr.c | 36 +++++++++++++++++-------------------
 include/linux/perf_event.h  | 16 ++++++++++++++++
 2 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 669c2be14784..6a903113d3a6 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -769,6 +769,7 @@ void intel_pmu_lbr_disable_all(void)
 void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 {
 	unsigned long mask = x86_pmu.lbr_nr - 1;
+	struct perf_branch_entry *br = cpuc->lbr_entries;
 	u64 tos = intel_pmu_lbr_tos();
 	int i;
 
@@ -784,15 +785,11 @@ void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
 
-		cpuc->lbr_entries[i].from	= msr_lastbranch.from;
-		cpuc->lbr_entries[i].to		= msr_lastbranch.to;
-		cpuc->lbr_entries[i].mispred	= 0;
-		cpuc->lbr_entries[i].predicted	= 0;
-		cpuc->lbr_entries[i].in_tx	= 0;
-		cpuc->lbr_entries[i].abort	= 0;
-		cpuc->lbr_entries[i].cycles	= 0;
-		cpuc->lbr_entries[i].type	= 0;
-		cpuc->lbr_entries[i].reserved	= 0;
+		perf_clear_branch_entry_bitfields(br);
+
+		br->from	= msr_lastbranch.from;
+		br->to		= msr_lastbranch.to;
+		br++;
 	}
 	cpuc->lbr_stack.nr = i;
 	cpuc->lbr_stack.hw_idx = tos;
@@ -807,6 +804,7 @@ void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 {
 	bool need_info = false, call_stack = false;
 	unsigned long mask = x86_pmu.lbr_nr - 1;
+	struct perf_branch_entry *br = cpuc->lbr_entries;
 	u64 tos = intel_pmu_lbr_tos();
 	int i;
 	int out = 0;
@@ -878,15 +876,14 @@ void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 		if (abort && x86_pmu.lbr_double_abort && out > 0)
 			out--;
 
-		cpuc->lbr_entries[out].from	 = from;
-		cpuc->lbr_entries[out].to	 = to;
-		cpuc->lbr_entries[out].mispred	 = mis;
-		cpuc->lbr_entries[out].predicted = pred;
-		cpuc->lbr_entries[out].in_tx	 = in_tx;
-		cpuc->lbr_entries[out].abort	 = abort;
-		cpuc->lbr_entries[out].cycles	 = cycles;
-		cpuc->lbr_entries[out].type	 = 0;
-		cpuc->lbr_entries[out].reserved	 = 0;
+		perf_clear_branch_entry_bitfields(br+out);
+		br[out].from	 = from;
+		br[out].to	 = to;
+		br[out].mispred	 = mis;
+		br[out].predicted = pred;
+		br[out].in_tx	 = in_tx;
+		br[out].abort	 = abort;
+		br[out].cycles	 = cycles;
 		out++;
 	}
 	cpuc->lbr_stack.nr = out;
@@ -951,6 +948,8 @@ static void intel_pmu_store_lbr(struct cpu_hw_events *cpuc,
 		to = rdlbr_to(i, lbr);
 		info = rdlbr_info(i, lbr);
 
+		perf_clear_branch_entry_bitfields(e);
+
 		e->from		= from;
 		e->to		= to;
 		e->mispred	= get_lbr_mispred(info);
@@ -959,7 +958,6 @@ static void intel_pmu_store_lbr(struct cpu_hw_events *cpuc,
 		e->abort	= !!(info & LBR_INFO_ABORT);
 		e->cycles	= get_lbr_cycles(info);
 		e->type		= get_lbr_br_type(info);
-		e->reserved	= 0;
 	}
 
 	cpuc->lbr_stack.nr = i;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 733649184b27..496eb6aa6e54 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1063,6 +1063,22 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	data->txn = 0;
 }
 
+/*
+ * Clear all bitfields in the perf_branch_entry.
+ * The to and from fields are not cleared because they are
+ * systematically modified by the caller.
+ */
+static inline void perf_clear_branch_entry_bitfields(struct perf_branch_entry *br)
+{
+	br->mispred = 0;
+	br->predicted = 0;
+	br->in_tx = 0;
+	br->abort = 0;
+	br->cycles = 0;
+	br->type = 0;
+	br->reserved = 0;
+}
+
 extern void perf_output_sample(struct perf_output_handle *handle,
 			       struct perf_event_header *header,
 			       struct perf_sample_data *data,
-- 
2.35.0.263.gb82422642f-goog



* [PATCH v6 02/12] x86/cpufeatures: add AMD Fam19h Branch Sampling feature
  2022-02-08 21:16 [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
  2022-02-08 21:16 ` [PATCH v6 01/12] perf/core: add perf_clear_branch_entry_bitfields() helper Stephane Eranian
@ 2022-02-08 21:16 ` Stephane Eranian
  2022-02-08 21:16 ` [PATCH v6 03/12] perf/x86/amd: add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: Stephane Eranian @ 2022-02-08 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

Add a CPU feature bit for the AMD Fam19h Branch Sampling feature:
bit 31 of EBX on CPUID leaf 0x80000008.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 4cc0ef96152c..a3e895aded02 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -315,6 +315,7 @@
 #define X86_FEATURE_VIRT_SSBD		(13*32+25) /* Virtualized Speculative Store Bypass Disable */
 #define X86_FEATURE_AMD_SSB_NO		(13*32+26) /* "" Speculative Store Bypass is fixed in hardware. */
 #define X86_FEATURE_CPPC		(13*32+27) /* Collaborative Processor Performance Control */
+#define X86_FEATURE_BRS			(13*32+31) /* Branch Sampling available */
 
 /* Thermal and Power Management Leaf, CPUID level 0x00000006 (EAX), word 14 */
 #define X86_FEATURE_DTHERM		(14*32+ 0) /* Digital Thermal Sensor */
-- 
2.35.0.263.gb82422642f-goog



* [PATCH v6 03/12] perf/x86/amd: add AMD Fam19h Branch Sampling support
  2022-02-08 21:16 [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
  2022-02-08 21:16 ` [PATCH v6 01/12] perf/core: add perf_clear_branch_entry_bitfields() helper Stephane Eranian
  2022-02-08 21:16 ` [PATCH v6 02/12] x86/cpufeatures: add AMD Fam19h Branch Sampling feature Stephane Eranian
@ 2022-02-08 21:16 ` Stephane Eranian
  2022-02-08 21:16 ` [PATCH v6 04/12] perf/x86/amd: add branch-brs helper event for Fam19h BRS Stephane Eranian
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: Stephane Eranian @ 2022-02-08 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

Add support for the AMD Fam19h 16-deep branch sampling feature as
described in the AMD PPR Fam19h Model 01h Revision B1.  This is a model
specific extension. It is not an architected AMD feature.

The Branch Sampling (BRS) operates with a 16-deep saturating buffer in MSR
registers. There is no branch type filtering. All control flow changes are
captured. BRS relies on specific programming of the core PMU of Fam19h.  In
particular, the following requirements must be met:
 - the sampling period must be greater than 16 (the BRS depth)
 - the sampling period must use fixed mode, not frequency mode

BRS interacts with the NMI interrupt as well. Because enabling BRS is
expensive, it is only activated after P event occurrences, where P is the
desired sampling period.  At P occurrences of the event, the counter
overflows, the CPU catches the interrupt, activates BRS for 16 branches until
it saturates, and then delivers the NMI to the kernel.  Between the overflow
and the time BRS activates, more branches may be executed, skewing the period.
All along, the sampling event keeps counting. The skid may be attenuated by
reducing the sampling period by 16 (subsequent patch).

BRS is integrated into perf_events seamlessly via the same
PERF_RECORD_BRANCH_STACK sample format. BRS generates perf_branch_entry
records in the sampling buffer. No prediction information is supported. The
branches are stored in reverse order of execution.  The most recent branch is
the first entry in each record.

No modification to the perf tool is necessary.

BRS can be used with any sampling event. However, it is recommended to use
the RETIRED_BRANCH_INSTRUCTIONS event because it matches what the BRS
captures.

$ perf record -b -c 1000037 -e cpu/event=0xc2,name=ret_br_instructions/ test

$ perf report -D
56531696056126 0x193c000 [0x1a8]: PERF_RECORD_SAMPLE(IP, 0x2): 18122/18230: 0x401d24 period: 1000037 addr: 0
... branch stack: nr:16
.....  0: 0000000000401d24 -> 0000000000401d5a 0 cycles      0
.....  1: 0000000000401d5c -> 0000000000401d24 0 cycles      0
.....  2: 0000000000401d22 -> 0000000000401d5c 0 cycles      0
.....  3: 0000000000401d5e -> 0000000000401d22 0 cycles      0
.....  4: 0000000000401d20 -> 0000000000401d5e 0 cycles      0
.....  5: 0000000000401d3e -> 0000000000401d20 0 cycles      0
.....  6: 0000000000401d42 -> 0000000000401d3e 0 cycles      0
.....  7: 0000000000401d3c -> 0000000000401d42 0 cycles      0
.....  8: 0000000000401d44 -> 0000000000401d3c 0 cycles      0
.....  9: 0000000000401d3a -> 0000000000401d44 0 cycles      0
..... 10: 0000000000401d46 -> 0000000000401d3a 0 cycles      0
..... 11: 0000000000401d38 -> 0000000000401d46 0 cycles      0
..... 12: 0000000000401d48 -> 0000000000401d38 0 cycles      0
..... 13: 0000000000401d36 -> 0000000000401d48 0 cycles      0
..... 14: 0000000000401d4a -> 0000000000401d36 0 cycles      0
..... 15: 0000000000401d34 -> 0000000000401d4a 0 cycles      0
 ... thread: test:18230
 ...... dso: test

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/amd/Makefile     |   2 +-
 arch/x86/events/amd/brs.c        | 317 +++++++++++++++++++++++++++++++
 arch/x86/events/amd/core.c       | 197 ++++++++++++++++++-
 arch/x86/events/core.c           |  10 +-
 arch/x86/events/perf_event.h     | 101 ++++++++--
 arch/x86/include/asm/msr-index.h |   4 +
 6 files changed, 609 insertions(+), 22 deletions(-)
 create mode 100644 arch/x86/events/amd/brs.c

diff --git a/arch/x86/events/amd/Makefile b/arch/x86/events/amd/Makefile
index 6cbe38d5fd9d..cf323ffab5cd 100644
--- a/arch/x86/events/amd/Makefile
+++ b/arch/x86/events/amd/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-$(CONFIG_CPU_SUP_AMD)		+= core.o
+obj-$(CONFIG_CPU_SUP_AMD)		+= core.o brs.o
 obj-$(CONFIG_PERF_EVENTS_AMD_POWER)	+= power.o
 obj-$(CONFIG_X86_LOCAL_APIC)		+= ibs.o
 obj-$(CONFIG_PERF_EVENTS_AMD_UNCORE)	+= amd-uncore.o
diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
new file mode 100644
index 000000000000..3c13c484c637
--- /dev/null
+++ b/arch/x86/events/amd/brs.c
@@ -0,0 +1,317 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Implement support for AMD Fam19h Branch Sampling feature
+ * Based on specifications published in AMD PPR Fam19 Model 01
+ *
+ * Copyright 2021 Google LLC
+ * Contributed by Stephane Eranian <eranian@google.com>
+ */
+#include <linux/kernel.h>
+#include <asm/msr.h>
+#include <asm/cpufeature.h>
+
+#include "../perf_event.h"
+
+#define BRS_POISON	0xFFFFFFFFFFFFFFFEULL /* mark limit of valid entries */
+
+/* Debug Extension Configuration register layout */
+union amd_debug_extn_cfg {
+	__u64 val;
+	struct {
+		__u64	rsvd0:2,  /* reserved */
+			brsmen:1, /* branch sample enable */
+			rsvd4_3:2,/* reserved - must be 0x3 */
+			vb:1,     /* valid branches recorded */
+			rsvd2:10, /* reserved */
+			msroff:4, /* index of next entry to write */
+			rsvd3:4,  /* reserved */
+			pmc:3,    /* #PMC holding the sampling event */
+			rsvd4:37; /* reserved */
+	};
+};
+
+static inline unsigned int brs_from(int idx)
+{
+	return MSR_AMD_SAMP_BR_FROM + 2 * idx;
+}
+
+static inline unsigned int brs_to(int idx)
+{
+	return MSR_AMD_SAMP_BR_FROM + 2 * idx + 1;
+}
+
+static inline void set_debug_extn_cfg(u64 val)
+{
+	/* bits[4:3] must always be set to 11b */
+	wrmsrl(MSR_AMD_DBG_EXTN_CFG, val | 3ULL << 3);
+}
+
+static inline u64 get_debug_extn_cfg(void)
+{
+	u64 val;
+
+	rdmsrl(MSR_AMD_DBG_EXTN_CFG, val);
+	return val;
+}
+
+static bool __init amd_brs_detect(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_BRS))
+		return false;
+
+	switch (boot_cpu_data.x86) {
+	case 0x19: /* AMD Fam19h (Zen3) */
+		x86_pmu.lbr_nr = 16;
+
+		/* No hardware filtering supported */
+		x86_pmu.lbr_sel_map = NULL;
+		x86_pmu.lbr_sel_mask = 0;
+		break;
+	default:
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * The current BRS implementation does not support branch type or privilege
+ * level filtering, so this function simply enforces these limitations. No
+ * br_sel_map is needed. Software filtering is not supported because it
+ * would not correlate well with a sampling period.
+ */
+int amd_brs_setup_filter(struct perf_event *event)
+{
+	u64 type = event->attr.branch_sample_type;
+
+	/* No BRS support */
+	if (!x86_pmu.lbr_nr)
+		return -EOPNOTSUPP;
+
+	/* Can only capture all branches, i.e., no filtering */
+	if ((type & ~PERF_SAMPLE_BRANCH_PLM_ALL) != PERF_SAMPLE_BRANCH_ANY)
+		return -EINVAL;
+
+	/* can only capture at all priv levels due to the way BRS works */
+	if ((type & PERF_SAMPLE_BRANCH_PLM_ALL) != PERF_SAMPLE_BRANCH_PLM_ALL)
+		return -EINVAL;
+
+	return 0;
+}
+
+/* tos = top of stack, i.e., last valid entry written */
+static inline int amd_brs_get_tos(union amd_debug_extn_cfg *cfg)
+{
+	/*
+	 * msroff: index of next entry to write so top-of-stack is one off
+	 * if BRS is full then msroff is set back to 0.
+	 */
+	return (cfg->msroff ? cfg->msroff : x86_pmu.lbr_nr) - 1;
+}
+
+/*
+ * make sure we have a sane BRS offset to begin with
+ * especially with kexec
+ */
+void amd_brs_reset(void)
+{
+	/*
+	 * Reset config
+	 */
+	set_debug_extn_cfg(0);
+
+	/*
+	 * Mark first entry as poisoned
+	 */
+	wrmsrl(brs_to(0), BRS_POISON);
+}
+
+int __init amd_brs_init(void)
+{
+	if (!amd_brs_detect())
+		return -EOPNOTSUPP;
+
+	pr_cont("%d-deep BRS, ", x86_pmu.lbr_nr);
+
+	return 0;
+}
+
+void amd_brs_enable(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	union amd_debug_extn_cfg cfg;
+
+	/* Activate only on first user */
+	if (++cpuc->brs_active > 1)
+		return;
+
+	cfg.val    = 0; /* reset all fields */
+	cfg.brsmen = 1; /* enable branch sampling */
+
+	/* Set enable bit */
+	set_debug_extn_cfg(cfg.val);
+}
+
+void amd_brs_enable_all(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	if (cpuc->lbr_users)
+		amd_brs_enable();
+}
+
+void amd_brs_disable(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	union amd_debug_extn_cfg cfg;
+
+	/* Check if active (could be disabled via x86_pmu_disable_all()) */
+	if (!cpuc->brs_active)
+		return;
+
+	/* Only disable for last user */
+	if (--cpuc->brs_active)
+		return;
+
+	/*
+	 * Clear the brsmen bit but preserve the others as they contain
+	 * useful state such as vb and msroff
+	 */
+	cfg.val = get_debug_extn_cfg();
+
+	/*
+	 * When coming in on interrupt and BRS is full, then hw will have
+	 * already stopped BRS, no need to issue wrmsr again
+	 */
+	if (cfg.brsmen) {
+		cfg.brsmen = 0;
+		set_debug_extn_cfg(cfg.val);
+	}
+}
+
+void amd_brs_disable_all(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	if (cpuc->lbr_users)
+		amd_brs_disable();
+}
+
+/*
+ * Caller must ensure amd_brs_inuse() is true before calling.
+ * Drains the captured branches into cpuc->lbr_entries.
+ */
+void amd_brs_drain(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	struct perf_event *event = cpuc->events[0];
+	struct perf_branch_entry *br = cpuc->lbr_entries;
+	union amd_debug_extn_cfg cfg;
+	u32 i, nr = 0, num, tos, start;
+	u32 shift = 64 - boot_cpu_data.x86_virt_bits;
+
+	/*
+	 * BRS event forced on PMC0,
+	 * so check if there is an event.
+	 * It is possible to have lbr_users > 0 but the event
+	 * not yet scheduled due to long latency PMU irq
+	 */
+	if (!event)
+		goto empty;
+
+	cfg.val = get_debug_extn_cfg();
+
+	/* Sanity check [0-x86_pmu.lbr_nr] */
+	if (WARN_ON_ONCE(cfg.msroff >= x86_pmu.lbr_nr))
+		goto empty;
+
+	/* No valid branch */
+	if (cfg.vb == 0)
+		goto empty;
+
+	/*
+	 * msr.off points to next entry to be written
+	 * tos = most recent entry index = msr.off - 1
+	 * BRS register buffer saturates, so we know we have
+	 * start < tos and that we have to read from start to tos
+	 */
+	start = 0;
+	tos = amd_brs_get_tos(&cfg);
+
+	num = tos - start + 1;
+
+	/*
+	 * BRS is only one pass (saturation) from MSROFF to depth-1
+	 * MSROFF wraps to zero when buffer is full
+	 */
+	for (i = 0; i < num; i++) {
+		u32 brs_idx = tos - i;
+		u64 from, to;
+
+		rdmsrl(brs_to(brs_idx), to);
+
+		/* Entry does not belong to us (as marked by kernel) */
+		if (to == BRS_POISON)
+			break;
+
+		rdmsrl(brs_from(brs_idx), from);
+
+		/*
+		 * Sign-extend SAMP_BR_TO to 64 bits, bits 61-63 are reserved.
+		 * Necessary to generate proper virtual addresses suitable for
+		 * symbolization
+		 */
+		to = (u64)(((s64)to << shift) >> shift);
+
+		perf_clear_branch_entry_bitfields(br+nr);
+
+		br[nr].from = from;
+		br[nr].to   = to;
+
+		nr++;
+	}
+empty:
+	/* Record number of sampled branches */
+	cpuc->lbr_stack.nr = nr;
+}
+
+/*
+ * Poison most recent entry to prevent reuse by next task,
+ * required because BRS entries are not tagged by PID
+ */
+static void amd_brs_poison_buffer(void)
+{
+	union amd_debug_extn_cfg cfg;
+	unsigned int idx;
+
+	/* Get current state */
+	cfg.val = get_debug_extn_cfg();
+
+	/* idx is most recently written entry */
+	idx = amd_brs_get_tos(&cfg);
+
+	/* Poison target of entry */
+	wrmsrl(brs_to(idx), BRS_POISON);
+}
+
+/*
+ * On context switch in, we need to make sure no samples from previous user
+ * are left in the BRS.
+ *
+ * On ctxswin, sched_in = true, called after the PMU has started
+ * On ctxswout, sched_in = false, called before the PMU is stopped
+ */
+void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	/* no active users */
+	if (!cpuc->lbr_users)
+		return;
+
+	/*
+	 * On context switch in, we need to ensure we do not use entries
+	 * from previous BRS user on that CPU, so we poison the buffer as
+	 * a faster way compared to resetting all entries.
+	 */
+	if (sched_in)
+		amd_brs_poison_buffer();
+}
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 9687a8aef01c..44d8f618bb3e 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -327,6 +327,8 @@ static inline bool amd_is_pair_event_code(struct hw_perf_event *hwc)
 
 static int amd_core_hw_config(struct perf_event *event)
 {
+	int ret = 0;
+
 	if (event->attr.exclude_host && event->attr.exclude_guest)
 		/*
 		 * When HO == GO == 1 the hardware treats that as GO == HO == 0
@@ -343,7 +345,32 @@ static int amd_core_hw_config(struct perf_event *event)
 	if ((x86_pmu.flags & PMU_FL_PAIR) && amd_is_pair_event_code(&event->hw))
 		event->hw.flags |= PERF_X86_EVENT_PAIR;
 
-	return 0;
+	/*
+	 * if branch stack is requested
+	 */
+	if (has_branch_stack(event) && is_sampling_event(event)) {
+		/*
+		 * BRS implementation does not work with frequency mode
+		 * reprogramming of the period.
+		 */
+		if (event->attr.freq)
+			return -EINVAL;
+		/*
+		 * The kernel subtracts BRS depth from period, so it must be big enough
+		 */
+		if (event->attr.sample_period <= x86_pmu.lbr_nr)
+			return -EINVAL;
+
+		/*
+		 * Check if we can allow PERF_SAMPLE_BRANCH_STACK
+		 */
+		ret = amd_brs_setup_filter(event);
+
+		/* only set in case of success */
+		if (!ret)
+			event->hw.flags |= PERF_X86_EVENT_AMD_BRS;
+	}
+	return ret;
 }
 
 static inline int amd_is_nb_event(struct hw_perf_event *hwc)
@@ -366,7 +393,7 @@ static int amd_pmu_hw_config(struct perf_event *event)
 	if (event->attr.precise_ip && get_ibs_caps())
 		return -ENOENT;
 
-	if (has_branch_stack(event))
+	if (has_branch_stack(event) && !x86_pmu.lbr_nr)
 		return -EOPNOTSUPP;
 
 	ret = x86_pmu_hw_config(event);
@@ -555,6 +582,8 @@ static void amd_pmu_cpu_starting(int cpu)
 
 	cpuc->amd_nb->nb_id = nb_id;
 	cpuc->amd_nb->refcnt++;
+
+	amd_brs_reset();
 }
 
 static void amd_pmu_cpu_dead(int cpu)
@@ -610,6 +639,8 @@ static void amd_pmu_disable_all(void)
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	int idx;
 
+	amd_brs_disable_all();
+
 	x86_pmu_disable_all();
 
 	/*
@@ -634,6 +665,30 @@ static void amd_pmu_disable_all(void)
 	}
 }
 
+static void amd_pmu_enable_event(struct perf_event *event)
+{
+	x86_pmu_enable_event(event);
+}
+
+static void amd_pmu_enable_all(int added)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	struct hw_perf_event *hwc;
+	int idx;
+
+	amd_brs_enable_all();
+
+	for (idx = 0; idx < x86_pmu.num_counters; idx++) {
+		hwc = &cpuc->events[idx]->hw;
+
+		/* only activate events which are marked as active */
+		if (!test_bit(idx, cpuc->active_mask))
+			continue;
+
+		amd_pmu_enable_event(cpuc->events[idx]);
+	}
+}
+
 static void amd_pmu_disable_event(struct perf_event *event)
 {
 	x86_pmu_disable_event(event);
@@ -651,6 +706,18 @@ static void amd_pmu_disable_event(struct perf_event *event)
 	amd_pmu_wait_on_overflow(event->hw.idx);
 }
 
+static void amd_pmu_add_event(struct perf_event *event)
+{
+	if (needs_branch_stack(event))
+		amd_pmu_brs_add(event);
+}
+
+static void amd_pmu_del_event(struct perf_event *event)
+{
+	if (needs_branch_stack(event))
+		amd_pmu_brs_del(event);
+}
+
 /*
  * Because of NMI latency, if multiple PMC counters are active or other sources
  * of NMIs are received, the perf NMI handler can handle one or more overflowed
@@ -671,11 +738,31 @@ static void amd_pmu_disable_event(struct perf_event *event)
  */
 static int amd_pmu_handle_irq(struct pt_regs *regs)
 {
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	int handled;
+	int pmu_enabled;
+
+	/*
+	 * Save the PMU state.
+	 * It needs to be restored when leaving the handler.
+	 */
+	pmu_enabled = cpuc->enabled;
+	cpuc->enabled = 0;
+
+	/* stop everything (includes BRS) */
+	amd_pmu_disable_all();
+
+	/* Drain BRS if in use (could be inactive) */
+	if (cpuc->lbr_users)
+		amd_brs_drain();
 
 	/* Process any counter overflows */
 	handled = x86_pmu_handle_irq(regs);
 
+	cpuc->enabled = pmu_enabled;
+	if (pmu_enabled)
+		amd_pmu_enable_all(0);
+
 	/*
 	 * If a counter was handled, record a timestamp such that un-handled
 	 * NMIs will be claimed if arriving within that window.
@@ -897,6 +984,51 @@ static void amd_put_event_constraints_f17h(struct cpu_hw_events *cpuc,
 		--cpuc->n_pair;
 }
 
+/*
+ * Because BRS alternates between an inactive and an active phase, and is
+ * tied to a single counter, it is not possible to have two events using BRS
+ * scheduled at the same time. There would be an issue with enforcing the
+ * period of each one, and given that BRS saturates, it would not be possible
+ * to guarantee correlated content for all events. Therefore, in situations
+ * where multiple events want to use BRS, the kernel enforces mutual exclusion.
+ * Exclusion is enforced by choosing only one counter for events using BRS.
+ * The event scheduling logic will then automatically multiplex the
+ * events and ensure that at most one event is actively using BRS.
+ *
+ * The BRS counter could be any counter, but since there is no constraint
+ * on Fam19h, all counters are equal, so we pick the first one: PMC0
+ */
+static struct event_constraint amd_fam19h_brs_cntr0_constraint =
+	EVENT_CONSTRAINT(0, 0x1, AMD64_RAW_EVENT_MASK);
+
+static struct event_constraint amd_fam19h_brs_pair_cntr0_constraint =
+	__EVENT_CONSTRAINT(0, 0x1, AMD64_RAW_EVENT_MASK, 1, 0, PERF_X86_EVENT_PAIR);
+
+static struct event_constraint *
+amd_get_event_constraints_f19h(struct cpu_hw_events *cpuc, int idx,
+			  struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	bool has_brs = has_amd_brs(hwc);
+
+	/*
+	 * In case BRS is used with an event requiring a counter pair,
+	 * the kernel allows it, but only on counters 0 &amp; 1, to enforce
+	 * the multiplexing required to protect BRS in case of multiple
+	 * BRS users.
+	 */
+	if (amd_is_pair_event_code(hwc)) {
+		return has_brs ? &amd_fam19h_brs_pair_cntr0_constraint
+			       : &pair_constraint;
+	}
+
+	if (has_brs)
+		return &amd_fam19h_brs_cntr0_constraint;
+
+	return &unconstrained;
+}
+
 static ssize_t amd_event_sysfs_show(char *page, u64 config)
 {
 	u64 event = (config & ARCH_PERFMON_EVENTSEL_EVENT) |
@@ -905,12 +1037,19 @@ static ssize_t amd_event_sysfs_show(char *page, u64 config)
 	return x86_event_sysfs_show(page, config, event);
 }
 
+static void amd_pmu_sched_task(struct perf_event_context *ctx,
+				 bool sched_in)
+{
+	if (sched_in && x86_pmu.lbr_nr)
+		amd_pmu_brs_sched_task(ctx, sched_in);
+}
+
 static __initconst const struct x86_pmu amd_pmu = {
 	.name			= "AMD",
 	.handle_irq		= amd_pmu_handle_irq,
 	.disable_all		= amd_pmu_disable_all,
-	.enable_all		= x86_pmu_enable_all,
-	.enable			= x86_pmu_enable_event,
+	.enable_all		= amd_pmu_enable_all,
+	.enable			= amd_pmu_enable_event,
 	.disable		= amd_pmu_disable_event,
 	.hw_config		= amd_pmu_hw_config,
 	.schedule_events	= x86_schedule_events,
@@ -920,6 +1059,8 @@ static __initconst const struct x86_pmu amd_pmu = {
 	.event_map		= amd_pmu_event_map,
 	.max_events		= ARRAY_SIZE(amd_perfmon_event_map),
 	.num_counters		= AMD64_NUM_COUNTERS,
+	.add			= amd_pmu_add_event,
+	.del			= amd_pmu_del_event,
 	.cntval_bits		= 48,
 	.cntval_mask		= (1ULL << 48) - 1,
 	.apic			= 1,
@@ -938,6 +1079,37 @@ static __initconst const struct x86_pmu amd_pmu = {
 	.amd_nb_constraints	= 1,
 };
 
+static ssize_t branches_show(struct device *cdev,
+			      struct device_attribute *attr,
+			      char *buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%d\n", x86_pmu.lbr_nr);
+}
+
+static DEVICE_ATTR_RO(branches);
+
+static struct attribute *amd_pmu_brs_attrs[] = {
+	&dev_attr_branches.attr,
+	NULL,
+};
+
+static umode_t
+amd_brs_is_visible(struct kobject *kobj, struct attribute *attr, int i)
+{
+	return x86_pmu.lbr_nr ? attr->mode : 0;
+}
+
+static struct attribute_group group_caps_amd_brs = {
+	.name  = "caps",
+	.attrs = amd_pmu_brs_attrs,
+	.is_visible = amd_brs_is_visible,
+};
+
+static const struct attribute_group *amd_attr_update[] = {
+	&group_caps_amd_brs,
+	NULL,
+};
+
 static int __init amd_core_pmu_init(void)
 {
 	u64 even_ctr_mask = 0ULL;
@@ -989,6 +1161,23 @@ static int __init amd_core_pmu_init(void)
 		x86_pmu.flags |= PMU_FL_PAIR;
 	}
 
+	if (boot_cpu_data.x86 >= 0x19) {
+		/*
+		 * On AMD, invoking pmu_disable_all() is very expensive and the
+		 * function is invoked on context-switch in via sched_task(), so
+		 * enable it only when necessary
+		 */
+		if (!amd_brs_init()) {
+			x86_pmu.get_event_constraints = amd_get_event_constraints_f19h;
+			x86_pmu.sched_task = amd_pmu_sched_task;
+			/*
+			 * The put_event_constraints callback is shared with
+			 * Fam17h, set above
+			 */
+		}
+	}
+
+	x86_pmu.attr_update = amd_attr_update;
+
 	pr_cont("core perfctr, ");
 	return 0;
 }
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e686c5e0537b..c2a890caeb0a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1338,6 +1338,10 @@ static void x86_pmu_enable(struct pmu *pmu)
 			if (hwc->state & PERF_HES_ARCH)
 				continue;
 
+			/*
+			 * If cpuc->enabled == 0, then no wrmsr is issued,
+			 * as per x86_pmu_enable_event()
+			 */
 			x86_pmu_start(event, PERF_EF_RELOAD);
 		}
 		cpuc->n_added = 0;
@@ -1704,11 +1708,15 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
 		 * event overflow
 		 */
 		handled++;
-		perf_sample_data_init(&data, 0, event->hw.last_period);
 
 		if (!x86_perf_event_set_period(event))
 			continue;
 
+		perf_sample_data_init(&data, 0, event->hw.last_period);
+
+		if (has_branch_stack(event))
+			data.br_stack = &cpuc->lbr_stack;
+
 		if (perf_event_overflow(event, &data, regs))
 			x86_pmu_stop(event, 0);
 	}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 150261d929b9..3485a4cf0241 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -67,22 +67,23 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
 /*
  * struct hw_perf_event.flags flags
  */
-#define PERF_X86_EVENT_PEBS_LDLAT	0x0001 /* ld+ldlat data address sampling */
-#define PERF_X86_EVENT_PEBS_ST		0x0002 /* st data address sampling */
-#define PERF_X86_EVENT_PEBS_ST_HSW	0x0004 /* haswell style datala, store */
-#define PERF_X86_EVENT_PEBS_LD_HSW	0x0008 /* haswell style datala, load */
-#define PERF_X86_EVENT_PEBS_NA_HSW	0x0010 /* haswell style datala, unknown */
-#define PERF_X86_EVENT_EXCL		0x0020 /* HT exclusivity on counter */
-#define PERF_X86_EVENT_DYNAMIC		0x0040 /* dynamic alloc'd constraint */
-
-#define PERF_X86_EVENT_EXCL_ACCT	0x0100 /* accounted EXCL event */
-#define PERF_X86_EVENT_AUTO_RELOAD	0x0200 /* use PEBS auto-reload */
-#define PERF_X86_EVENT_LARGE_PEBS	0x0400 /* use large PEBS */
-#define PERF_X86_EVENT_PEBS_VIA_PT	0x0800 /* use PT buffer for PEBS */
-#define PERF_X86_EVENT_PAIR		0x1000 /* Large Increment per Cycle */
-#define PERF_X86_EVENT_LBR_SELECT	0x2000 /* Save/Restore MSR_LBR_SELECT */
-#define PERF_X86_EVENT_TOPDOWN		0x4000 /* Count Topdown slots/metrics events */
-#define PERF_X86_EVENT_PEBS_STLAT	0x8000 /* st+stlat data address sampling */
+#define PERF_X86_EVENT_PEBS_LDLAT	0x00001 /* ld+ldlat data address sampling */
+#define PERF_X86_EVENT_PEBS_ST		0x00002 /* st data address sampling */
+#define PERF_X86_EVENT_PEBS_ST_HSW	0x00004 /* haswell style datala, store */
+#define PERF_X86_EVENT_PEBS_LD_HSW	0x00008 /* haswell style datala, load */
+#define PERF_X86_EVENT_PEBS_NA_HSW	0x00010 /* haswell style datala, unknown */
+#define PERF_X86_EVENT_EXCL		0x00020 /* HT exclusivity on counter */
+#define PERF_X86_EVENT_DYNAMIC		0x00040 /* dynamic alloc'd constraint */
+
+#define PERF_X86_EVENT_EXCL_ACCT	0x00100 /* accounted EXCL event */
+#define PERF_X86_EVENT_AUTO_RELOAD	0x00200 /* use PEBS auto-reload */
+#define PERF_X86_EVENT_LARGE_PEBS	0x00400 /* use large PEBS */
+#define PERF_X86_EVENT_PEBS_VIA_PT	0x00800 /* use PT buffer for PEBS */
+#define PERF_X86_EVENT_PAIR		0x01000 /* Large Increment per Cycle */
+#define PERF_X86_EVENT_LBR_SELECT	0x02000 /* Save/Restore MSR_LBR_SELECT */
+#define PERF_X86_EVENT_TOPDOWN		0x04000 /* Count Topdown slots/metrics events */
+#define PERF_X86_EVENT_PEBS_STLAT	0x08000 /* st+stlat data address sampling */
+#define PERF_X86_EVENT_AMD_BRS		0x10000 /* AMD Branch Sampling */
 
 static inline bool is_topdown_count(struct perf_event *event)
 {
@@ -325,6 +326,8 @@ struct cpu_hw_events {
 	 * AMD specific bits
 	 */
 	struct amd_nb			*amd_nb;
+	int				brs_active; /* BRS is enabled */
+
 	/* Inverted mask of bits to clear in the perf_ctr ctrl registers */
 	u64				perf_ctr_virt_mask;
 	int				n_pair; /* Large increment events */
@@ -1105,6 +1108,11 @@ int x86_pmu_hw_config(struct perf_event *event);
 
 void x86_pmu_disable_all(void);
 
+static inline bool has_amd_brs(struct hw_perf_event *hwc)
+{
+	return hwc->flags & PERF_X86_EVENT_AMD_BRS;
+}
+
 static inline bool is_counter_pair(struct hw_perf_event *hwc)
 {
 	return hwc->flags & PERF_X86_EVENT_PAIR;
@@ -1210,6 +1218,50 @@ static inline bool fixed_counter_disabled(int i, struct pmu *pmu)
 #ifdef CONFIG_CPU_SUP_AMD
 
 int amd_pmu_init(void);
+int amd_brs_init(void);
+void amd_brs_disable(void);
+void amd_brs_enable(void);
+void amd_brs_enable_all(void);
+void amd_brs_disable_all(void);
+void amd_brs_drain(void);
+int amd_brs_setup_filter(struct perf_event *event);
+void amd_brs_reset(void);
+
+static inline void amd_pmu_brs_add(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	perf_sched_cb_inc(event->ctx->pmu);
+	cpuc->lbr_users++;
+	/*
+	 * No need to reset BRS because it is reset
+	 * on brs_enable() and it is saturating
+	 */
+}
+
+static inline void amd_pmu_brs_del(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	cpuc->lbr_users--;
+	WARN_ON_ONCE(cpuc->lbr_users < 0);
+
+	perf_sched_cb_dec(event->ctx->pmu);
+}
+
+void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in);
+
+/*
+ * Check if BRS is activated on the CPU.
+ * Active is defined as having non-zero users and DBG_EXT_CFG.BRSEN=1
+ */
+static inline bool amd_brs_active(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	return cpuc->brs_active;
+}
 
 #else /* CONFIG_CPU_SUP_AMD */
 
@@ -1218,6 +1270,23 @@ static inline int amd_pmu_init(void)
 	return 0;
 }
 
+static inline int amd_brs_init(void)
+{
+	return 0;
+}
+
+static inline void amd_brs_drain(void)
+{
+}
+
+static inline void amd_brs_enable_all(void)
+{
+}
+
+static inline void amd_brs_disable_all(void)
+{
+}
+
 #endif /* CONFIG_CPU_SUP_AMD */
 
 static inline int is_pebs_pt(struct perf_event *event)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3faf0f97edb1..d44bc769dd6f 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -667,6 +667,10 @@
 #define MSR_IA32_PERF_CTL		0x00000199
 #define INTEL_PERF_CTL_MASK		0xffff
 
+/* AMD Branch Sampling configuration */
+#define MSR_AMD_DBG_EXTN_CFG		0xc000010f
+#define MSR_AMD_SAMP_BR_FROM		0xc0010300
+
 #define MSR_IA32_MPERF			0x000000e7
 #define MSR_IA32_APERF			0x000000e8
 
-- 
2.35.0.263.gb82422642f-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v6 04/12] perf/x86/amd: add branch-brs helper event for Fam19h BRS
  2022-02-08 21:16 [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (2 preceding siblings ...)
  2022-02-08 21:16 ` [PATCH v6 03/12] perf/x86/amd: add AMD Fam19h Branch Sampling support Stephane Eranian
@ 2022-02-08 21:16 ` Stephane Eranian
  2022-02-08 21:16 ` [PATCH v6 05/12] perf/x86/amd: enable branch sampling priv level filtering Stephane Eranian
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: Stephane Eranian @ 2022-02-08 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

Add a pseudo event called branch-brs to help use the Fam19h
Branch Sampling feature (BRS). BRS samples taken branches, so it is best used
when sampling on a retired taken branch event (0xc4) which is what BRS
captures.  Instead of trying to remember the event code or actual event name,
users can simply do:

$ perf record -b -e cpu/branch-brs/ -c 1000037 .....

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/amd/core.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 44d8f618bb3e..597defee1e02 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -1105,8 +1105,24 @@ static struct attribute_group group_caps_amd_brs = {
 	.is_visible = amd_brs_is_visible,
 };
 
+#define AMD_FAM19H_BRS_EVENT 0xc4 /* Fam19h RETIRED_TAKEN_BRANCH_INSTRUCTIONS */
+EVENT_ATTR_STR(branch-brs, amd_branch_brs,
+	       "event=" __stringify(AMD_FAM19H_BRS_EVENT)"\n");
+
+static struct attribute *amd_brs_events_attrs[] = {
+	EVENT_PTR(amd_branch_brs),
+	NULL,
+};
+
+static struct attribute_group group_events_amd_brs = {
+	.name       = "events",
+	.attrs      = amd_brs_events_attrs,
+	.is_visible = amd_brs_is_visible,
+};
+
 static const struct attribute_group *amd_attr_update[] = {
 	&group_caps_amd_brs,
+	&group_events_amd_brs,
 	NULL,
 };
 
-- 
2.35.0.263.gb82422642f-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v6 05/12] perf/x86/amd: enable branch sampling priv level filtering
  2022-02-08 21:16 [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (3 preceding siblings ...)
  2022-02-08 21:16 ` [PATCH v6 04/12] perf/x86/amd: add branch-brs helper event for Fam19h BRS Stephane Eranian
@ 2022-02-08 21:16 ` Stephane Eranian
  2022-02-08 21:16 ` [PATCH v6 06/12] perf/x86/amd: add AMD branch sampling period adjustment Stephane Eranian
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: Stephane Eranian @ 2022-02-08 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

The AMD Branch Sampling feature does not provide hardware filtering by
privilege level. The associated PMU counter does, but not the branch sampling
itself. Given how BRS operates, there is a possibility that BRS captures
kernel-level branches even though the event is programmed to count only at
the user level.

Implement a workaround in software by removing the branches which belong to
the wrong privilege level. The privilege level is evaluated on the target of
the branch, not the source, so as to be compatible with other architectures.
As a consequence of this patch, the number of entries in the
PERF_RECORD_BRANCH_STACK buffer may be less than the maximum (16). It could
even be zero. Another consequence is that consecutive entries in the branch
stack may not reflect the actual code path and may have discontinuities when
kernel branches were suppressed. But this is no different from what happens
on other architectures.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/amd/brs.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
index 3c13c484c637..40461c3ce714 100644
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -92,10 +92,6 @@ int amd_brs_setup_filter(struct perf_event *event)
 	if ((type & ~PERF_SAMPLE_BRANCH_PLM_ALL) != PERF_SAMPLE_BRANCH_ANY)
 		return -EINVAL;
 
-	/* can only capture at all priv levels due to the way BRS works */
-	if ((type & PERF_SAMPLE_BRANCH_PLM_ALL) != PERF_SAMPLE_BRANCH_PLM_ALL)
-		return -EINVAL;
-
 	return 0;
 }
 
@@ -195,6 +191,21 @@ void amd_brs_disable_all(void)
 		amd_brs_disable();
 }
 
+static bool amd_brs_match_plm(struct perf_event *event, u64 to)
+{
+	int type = event->attr.branch_sample_type;
+	int plm_k = PERF_SAMPLE_BRANCH_KERNEL | PERF_SAMPLE_BRANCH_HV;
+	int plm_u = PERF_SAMPLE_BRANCH_USER;
+
+	if (!(type &amp; plm_k) &amp;&amp; kernel_ip(to))
+		return false;
+
+	if (!(type &amp; plm_u) &amp;&amp; !kernel_ip(to))
+		return false;
+
+	return true;
+}
+
 /*
  * Caller must ensure amd_brs_inuse() is true before calling
  * return:
@@ -252,8 +263,6 @@ void amd_brs_drain(void)
 		if (to == BRS_POISON)
 			break;
 
-		rdmsrl(brs_from(brs_idx), from);
-
 		/*
 		 * Sign-extend SAMP_BR_TO to 64 bits, bits 61-63 are reserved.
 		 * Necessary to generate proper virtual addresses suitable for
@@ -261,6 +270,11 @@ void amd_brs_drain(void)
 		 */
 		to = (u64)(((s64)to << shift) >> shift);
 
+		if (!amd_brs_match_plm(event, to))
+			continue;
+
+		rdmsrl(brs_from(brs_idx), from);
+
 		perf_clear_branch_entry_bitfields(br+nr);
 
 		br[nr].from = from;
-- 
2.35.0.263.gb82422642f-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v6 06/12] perf/x86/amd: add AMD branch sampling period adjustment
  2022-02-08 21:16 [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (4 preceding siblings ...)
  2022-02-08 21:16 ` [PATCH v6 05/12] perf/x86/amd: enable branch sampling priv level filtering Stephane Eranian
@ 2022-02-08 21:16 ` Stephane Eranian
  2022-02-09 15:32   ` Peter Zijlstra
  2022-02-08 21:16 ` [PATCH v6 07/12] perf/x86/amd: make Zen3 branch sampling opt-in Stephane Eranian
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 29+ messages in thread
From: Stephane Eranian @ 2022-02-08 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

Add code to adjust the sampling event period when used with the Branch
Sampling feature (BRS). Given the depth of the BRS (16), the period is
reduced by that depth such that in the best case scenario, BRS saturates at
the desired sampling period. In practice, though, the processor may execute
more branches. Given a desired period P and a depth D, the kernel programs
the actual period at P - D. After P occurrences of the sampling event, the
counter overflows. It then may take X branches (skid) before the NMI is
caught and held by the hardware and BRS activates. Then, after D branches,
BRS saturates and the NMI is delivered.  With no skid, the effective period
would be (P - D) + D = P. In practice, however, it will likely be (P - D) +
X + D. There is no way to eliminate X or predict X.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/core.c       |  7 +++++++
 arch/x86/events/perf_event.h | 12 ++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index c2a890caeb0a..ed285f640efe 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1374,6 +1374,13 @@ int x86_perf_event_set_period(struct perf_event *event)
 	    x86_pmu.set_topdown_event_period)
 		return x86_pmu.set_topdown_event_period(event);
 
+	/*
+	 * Decrease the period by the depth of the BRS feature to get
+	 * the last N taken branches and approximate the desired period
+	 */
+	if (has_branch_stack(event))
+		period = amd_brs_adjust_period(period);
+
 	/*
 	 * If we are way outside a reasonable range then just skip forward:
 	 */
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 3485a4cf0241..25b037b571e4 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1263,6 +1263,14 @@ static inline bool amd_brs_active(void)
 	return cpuc->brs_active;
 }
 
+static inline s64 amd_brs_adjust_period(s64 period)
+{
+	if (period > x86_pmu.lbr_nr)
+		return period - x86_pmu.lbr_nr;
+
+	return period;
+}
+
 #else /* CONFIG_CPU_SUP_AMD */
 
 static inline int amd_pmu_init(void)
@@ -1287,6 +1295,10 @@ static inline void amd_brs_disable_all(void)
 {
 }
 
+static inline s64 amd_brs_adjust_period(s64 period)
+{
+	return period;
+}
 #endif /* CONFIG_CPU_SUP_AMD */
 
 static inline int is_pebs_pt(struct perf_event *event)
-- 
2.35.0.263.gb82422642f-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v6 07/12] perf/x86/amd: make Zen3 branch sampling opt-in
  2022-02-08 21:16 [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (5 preceding siblings ...)
  2022-02-08 21:16 ` [PATCH v6 06/12] perf/x86/amd: add AMD branch sampling period adjustment Stephane Eranian
@ 2022-02-08 21:16 ` Stephane Eranian
  2022-02-08 21:16 ` [PATCH v6 08/12] ACPI: add perf low power callback Stephane Eranian
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: Stephane Eranian @ 2022-02-08 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

Add a kernel config option CONFIG_PERF_EVENTS_AMD_BRS
to make the support for AMD Zen3 Branch Sampling (BRS) an opt-in
compile time option.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/Kconfig      |  8 ++++++
 arch/x86/events/amd/Makefile |  3 ++-
 arch/x86/events/perf_event.h | 49 ++++++++++++++++++++++++++++--------
 3 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/Kconfig b/arch/x86/events/Kconfig
index d6cdfe631674..09c56965750a 100644
--- a/arch/x86/events/Kconfig
+++ b/arch/x86/events/Kconfig
@@ -44,4 +44,12 @@ config PERF_EVENTS_AMD_UNCORE
 
 	  To compile this driver as a module, choose M here: the
 	  module will be called 'amd-uncore'.
+
+config PERF_EVENTS_AMD_BRS
+	depends on PERF_EVENTS && CPU_SUP_AMD
+	bool "AMD Zen3 Branch Sampling support"
+	help
+	  Enable AMD Zen3 branch sampling support (BRS) which samples up to
+	  16 consecutive taken branches in registers.
+
 endmenu
diff --git a/arch/x86/events/amd/Makefile b/arch/x86/events/amd/Makefile
index cf323ffab5cd..b9f5d4610256 100644
--- a/arch/x86/events/amd/Makefile
+++ b/arch/x86/events/amd/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-$(CONFIG_CPU_SUP_AMD)		+= core.o brs.o
+obj-$(CONFIG_CPU_SUP_AMD)		+= core.o
+obj-$(CONFIG_PERF_EVENTS_AMD_BRS)	+= brs.o
 obj-$(CONFIG_PERF_EVENTS_AMD_POWER)	+= power.o
 obj-$(CONFIG_X86_LOCAL_APIC)		+= ibs.o
 obj-$(CONFIG_PERF_EVENTS_AMD_UNCORE)	+= amd-uncore.o
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 25b037b571e4..4d050579dcbd 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1218,6 +1218,8 @@ static inline bool fixed_counter_disabled(int i, struct pmu *pmu)
 #ifdef CONFIG_CPU_SUP_AMD
 
 int amd_pmu_init(void);
+
+#ifdef CONFIG_PERF_EVENTS_AMD_BRS
 int amd_brs_init(void);
 void amd_brs_disable(void);
 void amd_brs_enable(void);
@@ -1252,25 +1254,52 @@ static inline void amd_pmu_brs_del(struct perf_event *event)
 
 void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in);
 
-/*
- * check if BRS is activated on the CPU
- * active defined as it has non-zero users and DBG_EXT_CFG.BRSEN=1
- */
-static inline bool amd_brs_active(void)
+static inline s64 amd_brs_adjust_period(s64 period)
 {
-	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	if (period > x86_pmu.lbr_nr)
+		return period - x86_pmu.lbr_nr;
 
-	return cpuc->brs_active;
+	return period;
+}
+#else
+static inline int amd_brs_init(void)
+{
+	return 0;
 }
+static inline void amd_brs_disable(void) {}
+static inline void amd_brs_enable(void) {}
+static inline void amd_brs_drain(void) {}
+static inline void amd_brs_lopwr_init(void) {}
+static inline void amd_brs_disable_all(void) {}
+static inline int amd_brs_setup_filter(struct perf_event *event)
+{
+	return 0;
+}
+static inline void amd_brs_reset(void) {}
 
-static inline s64 amd_brs_adjust_period(s64 period)
+static inline void amd_pmu_brs_add(struct perf_event *event)
 {
-	if (period > x86_pmu.lbr_nr)
-		return period - x86_pmu.lbr_nr;
+}
+
+static inline void amd_pmu_brs_del(struct perf_event *event)
+{
+}
+
+static inline void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+}
 
+static inline s64 amd_brs_adjust_period(s64 period)
+{
 	return period;
 }
 
+static inline void amd_brs_enable_all(void)
+{
+}
+
+#endif
+
 #else /* CONFIG_CPU_SUP_AMD */
 
 static inline int amd_pmu_init(void)
-- 
2.35.0.263.gb82422642f-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v6 08/12] ACPI: add perf low power callback
  2022-02-08 21:16 [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (6 preceding siblings ...)
  2022-02-08 21:16 ` [PATCH v6 07/12] perf/x86/amd: make Zen3 branch sampling opt-in Stephane Eranian
@ 2022-02-08 21:16 ` Stephane Eranian
  2022-02-08 21:16 ` [PATCH v6 09/12] perf/x86/amd: add idle hooks for branch sampling Stephane Eranian
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: Stephane Eranian @ 2022-02-08 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

Add an optional callback needed by some PMU features, e.g., AMD
BRS, to give the perf_events code a chance to change its state before
a CPU goes to low power and after it comes back.

The callback is void when the PERF_NEEDS_LOPWR_CB flag is not set.
This flag must be set in arch specific perf_event.h header whenever needed.
When not set, there is no impact on the ACPI code.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 drivers/acpi/acpi_pad.c       | 6 ++++++
 drivers/acpi/processor_idle.c | 5 +++++
 include/linux/perf_event.h    | 6 ++++++
 3 files changed, 17 insertions(+)

diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index f45979aa2d64..a306a07a60b5 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -164,6 +164,9 @@ static int power_saving_thread(void *data)
 				tsc_marked_unstable = 1;
 			}
 			local_irq_disable();
+
+			perf_lopwr_cb(true);
+
 			tick_broadcast_enable();
 			tick_broadcast_enter();
 			stop_critical_timings();
@@ -172,6 +175,9 @@ static int power_saving_thread(void *data)
 
 			start_critical_timings();
 			tick_broadcast_exit();
+
+			perf_lopwr_cb(false);
+
 			local_irq_enable();
 
 			if (time_before(expire_time, jiffies)) {
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 86560a28751b..880c0a43a529 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -21,6 +21,7 @@
 #include <linux/cpuidle.h>
 #include <linux/cpu.h>
 #include <linux/minmax.h>
+#include <linux/perf_event.h>
 #include <acpi/processor.h>
 
 /*
@@ -544,6 +545,8 @@ static void wait_for_freeze(void)
  */
 static void __cpuidle acpi_idle_do_entry(struct acpi_processor_cx *cx)
 {
+	perf_lopwr_cb(true);
+
 	if (cx->entry_method == ACPI_CSTATE_FFH) {
 		/* Call into architectural FFH based C-state */
 		acpi_processor_ffh_cstate_enter(cx);
@@ -554,6 +557,8 @@ static void __cpuidle acpi_idle_do_entry(struct acpi_processor_cx *cx)
 		inb(cx->address);
 		wait_for_freeze();
 	}
+
+	perf_lopwr_cb(false);
 }
 
 /**
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 496eb6aa6e54..1b98e46588bc 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1676,4 +1676,10 @@ typedef int (perf_snapshot_branch_stack_t)(struct perf_branch_entry *entries,
 					   unsigned int cnt);
 DECLARE_STATIC_CALL(perf_snapshot_branch_stack, perf_snapshot_branch_stack_t);
 
+#ifndef PERF_NEEDS_LOPWR_CB
+static inline void perf_lopwr_cb(bool mode)
+{
+}
+#endif
+
 #endif /* _LINUX_PERF_EVENT_H */
-- 
2.35.0.263.gb82422642f-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v6 09/12] perf/x86/amd: add idle hooks for branch sampling
  2022-02-08 21:16 [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (7 preceding siblings ...)
  2022-02-08 21:16 ` [PATCH v6 08/12] ACPI: add perf low power callback Stephane Eranian
@ 2022-02-08 21:16 ` Stephane Eranian
  2022-02-08 21:16 ` [PATCH v6 10/12] perf tools: Improve IBS error handling Stephane Eranian
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 29+ messages in thread
From: Stephane Eranian @ 2022-02-08 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

On AMD Fam19h Zen3, the branch sampling (BRS) feature must be disabled before
entering low power and re-enabled (if it was active) when returning from low
power. Otherwise, the NMI interrupt may be held up for too long and cause
problems. Stopping BRS will cause the NMI to be delivered if it was held up.

Define a perf_amd_brs_lopwr_cb() callback to stop/restart BRS.  The callback
is protected by a jump label which is enabled only when AMD BRS is detected.
In all other cases, the callback is never called.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/amd/brs.c         | 32 +++++++++++++++++++++++++++++++
 arch/x86/events/amd/core.c        |  4 ++++
 arch/x86/events/perf_event.h      |  1 +
 arch/x86/include/asm/perf_event.h | 21 ++++++++++++++++++++
 4 files changed, 58 insertions(+)

diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
index 40461c3ce714..185a58cea917 100644
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -7,6 +7,7 @@
  * Contributed by Stephane Eranian <eranian@google.com>
  */
 #include <linux/kernel.h>
+#include <linux/jump_label.h>
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
 
@@ -329,3 +330,34 @@ void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in)
 	if (sched_in)
 		amd_brs_poison_buffer();
 }
+
+DEFINE_STATIC_KEY_FALSE(perf_lopwr_needed);
+
+/*
+ * called from ACPI processor_idle.c or acpi_pad.c
+ * with interrupts disabled
+ */
+void perf_amd_brs_lopwr_cb(bool lopwr_in)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	union amd_debug_extn_cfg cfg;
+
+	/*
+	 * On mwait in, we may end up in a non-C0 state.
+	 * We must disable branch sampling to avoid holding the NMI
+	 * for too long. We disable it in hardware, but we
+	 * keep the state in cpuc so we can re-enable it.
+	 *
+	 * The hardware will deliver the NMI if needed when BRSMEN is cleared
+	 */
+	if (cpuc->brs_active) {
+		cfg.val = get_debug_extn_cfg();
+		cfg.brsmen = !lopwr_in;
+		set_debug_extn_cfg(cfg.val);
+	}
+}
+
+void __init amd_brs_lopwr_init(void)
+{
+	static_branch_enable(&perf_lopwr_needed);
+}
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 597defee1e02..ea71ee52b758 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 #include <linux/perf_event.h>
+#include <linux/jump_label.h>
 #include <linux/export.h>
 #include <linux/types.h>
 #include <linux/init.h>
@@ -1189,6 +1190,9 @@ static int __init amd_core_pmu_init(void)
 			 * The put_event_constraints callback is shared with
 			 * Fam17h, set above
 			 */
+
+			/* branch sampling must be stopped when entering low power */
+			amd_brs_lopwr_init();
 		}
 	}
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 4d050579dcbd..2ed7bf5b51b1 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1226,6 +1226,7 @@ void amd_brs_enable(void);
 void amd_brs_enable_all(void);
 void amd_brs_disable_all(void);
 void amd_brs_drain(void);
+void amd_brs_lopwr_init(void);
 void amd_brs_disable_all(void);
 int amd_brs_setup_filter(struct perf_event *event);
 void amd_brs_reset(void);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 58d9e4b1fa0a..42753a9dc3ed 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -513,6 +513,27 @@ static inline void intel_pt_handle_vmx(int on)
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
  extern void amd_pmu_enable_virt(void);
  extern void amd_pmu_disable_virt(void);
+
+#if defined(CONFIG_PERF_EVENTS_AMD_BRS)
+
+#define PERF_NEEDS_LOPWR_CB 1
+
+/*
+ * architectural low power callback impacts
+ * drivers/acpi/processor_idle.c
+ * drivers/acpi/acpi_pad.c
+ */
+extern void perf_amd_brs_lopwr_cb(bool lopwr_in);
+DECLARE_STATIC_KEY_FALSE(perf_lopwr_needed);
+
+static inline void perf_lopwr_cb(bool mode)
+{
+	/* key enabled only when BRS is available */
+	if (static_branch_unlikely(&perf_lopwr_needed))
+		perf_amd_brs_lopwr_cb(mode);
+}
+#endif /* CONFIG_PERF_EVENTS_AMD_BRS */
+
 #else
  static inline void amd_pmu_enable_virt(void) { }
  static inline void amd_pmu_disable_virt(void) { }
-- 
2.35.0.263.gb82422642f-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v6 10/12] perf tools: Improve IBS error handling
  2022-02-08 21:16 [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (8 preceding siblings ...)
  2022-02-08 21:16 ` [PATCH v6 09/12] perf/x86/amd: add idle hooks for branch sampling Stephane Eranian
@ 2022-02-08 21:16 ` Stephane Eranian
  2022-02-09 15:47   ` Arnaldo Carvalho de Melo
                     ` (2 more replies)
  2022-02-08 21:16 ` [PATCH v6 11/12] perf tools: Improve error handling of AMD Branch Sampling Stephane Eranian
  2022-02-08 21:16 ` [PATCH v6 12/12] perf report: add addr_from/addr_to sort dimensions Stephane Eranian
  11 siblings, 3 replies; 29+ messages in thread
From: Stephane Eranian @ 2022-02-08 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

From: Kim Phillips <kim.phillips@amd.com>

Improve the error message returned on failed perf_event_open() on AMD when
using IBS.

Output of executing 'perf record -e ibs_op// true' BEFORE this patch:

The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//u).
/bin/dmesg | grep -i perf may provide additional information.

Output after:

AMD IBS cannot exclude kernel events.  Try running at a higher privilege level.

Output of executing 'sudo perf record -e ibs_op// true' BEFORE this patch:

Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//).
/bin/dmesg | grep -i perf may provide additional information.

Output after:

Error:
AMD IBS may only be available in system-wide/per-cpu mode.  Try using -a, or -C and workload affinity

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Petlan <mpetlan@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/evsel.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 22d3267ce294..d42f63a484df 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2847,9 +2847,22 @@ static bool find_process(const char *name)
 	return ret ? false : true;
 }
 
+static bool is_amd(const char *arch, const char *cpuid)
+{
+	return arch && !strcmp("x86", arch) && cpuid && strstarts(cpuid, "AuthenticAMD");
+}
+
+static bool is_amd_ibs(struct evsel *evsel)
+{
+	return evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3);
+}
+
 int evsel__open_strerror(struct evsel *evsel, struct target *target,
 			 int err, char *msg, size_t size)
 {
+	struct perf_env *env = evsel__env(evsel);
+	const char *arch = perf_env__arch(env);
+	const char *cpuid = perf_env__cpuid(env);
 	char sbuf[STRERR_BUFSIZE];
 	int printed = 0, enforced = 0;
 
@@ -2949,6 +2962,17 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
 			return scnprintf(msg, size,
 	"Invalid event (%s) in per-thread mode, enable system wide with '-a'.",
 					evsel__name(evsel));
+		if (is_amd(arch, cpuid)) {
+			if (is_amd_ibs(evsel)) {
+				if (evsel->core.attr.exclude_kernel)
+					return scnprintf(msg, size,
+	"AMD IBS can't exclude kernel events.  Try running at a higher privilege level.");
+				if (!evsel->core.system_wide)
+					return scnprintf(msg, size,
+	"AMD IBS may only be available in system-wide/per-cpu mode.  Try using -a, or -C and workload affinity");
+			}
+		}
+
 		break;
 	case ENODATA:
 		return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
-- 
2.35.0.263.gb82422642f-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v6 11/12] perf tools: Improve error handling of AMD Branch Sampling
  2022-02-08 21:16 [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (9 preceding siblings ...)
  2022-02-08 21:16 ` [PATCH v6 10/12] perf tools: Improve IBS error handling Stephane Eranian
@ 2022-02-08 21:16 ` Stephane Eranian
  2022-02-16 14:17   ` Arnaldo Carvalho de Melo
  2022-02-08 21:16 ` [PATCH v6 12/12] perf report: add addr_from/addr_to sort dimensions Stephane Eranian
  11 siblings, 1 reply; 29+ messages in thread
From: Stephane Eranian @ 2022-02-08 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

Improve the error message printed by perf when perf_event_open() fails on
AMD Zen3 when using the branch sampling feature. In the case of EINVAL, there
are two main causes: the event uses frequency mode, or the period is smaller
than the depth of the branch sampling buffer (16). The patch checks the
parameters of the call and tries to print a relevant message to explain the
error:

$ perf record -b -e cpu/branch-brs/ -c 10 ls
Error:
AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.

$ perf record -b -e cpu/branch-brs/ ls
Error:
AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.

Signed-off-by: Stephane Eranian <eranian@google.com>
[Rebased on commit 9fe8895a27a84 ("perf env: Add perf_env__cpuid, perf_env__{nr_}pmu_mappings")]
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
---
 tools/perf/util/evsel.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d42f63a484df..7311e7b4d34d 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2857,6 +2857,12 @@ static bool is_amd_ibs(struct evsel *evsel)
 	return evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3);
 }
 
+static bool is_amd_brs(struct evsel *evsel)
+{
+	return ((evsel->core.attr.config & 0xff) == 0xc4) &&
+	       (evsel->core.attr.sample_type & PERF_SAMPLE_BRANCH_STACK);
+}
+
 int evsel__open_strerror(struct evsel *evsel, struct target *target,
 			 int err, char *msg, size_t size)
 {
@@ -2971,6 +2977,14 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
 					return scnprintf(msg, size,
 	"AMD IBS may only be available in system-wide/per-cpu mode.  Try using -a, or -C and workload affinity");
 			}
+			if (is_amd_brs(evsel)) {
+				if (evsel->core.attr.freq)
+					return scnprintf(msg, size,
+	"AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.");
+				/* another reason is that the period is too small */
+				return scnprintf(msg, size,
+	"AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.");
+			}
 		}
 
 		break;
-- 
2.35.0.263.gb82422642f-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v6 12/12] perf report: add addr_from/addr_to sort dimensions
  2022-02-08 21:16 [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (10 preceding siblings ...)
  2022-02-08 21:16 ` [PATCH v6 11/12] perf tools: Improve error handling of AMD Branch Sampling Stephane Eranian
@ 2022-02-08 21:16 ` Stephane Eranian
  2022-02-16 14:21   ` Arnaldo Carvalho de Melo
  11 siblings, 1 reply; 29+ messages in thread
From: Stephane Eranian @ 2022-02-08 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

With the existing symbol_from/symbol_to, branches captured in the same
function would be collapsed into a single function if the latencies associated
with each branch (cycles) were all the same.  That is the case on Intel
Broadwell, for instance. Since Intel Skylake, the latency is captured by
hardware and therefore is used to disambiguate branches.

Add addr_from/addr_to sort dimensions to sort branches based on their
addresses and not the function they are in. The output is still the function
name but the offset within the function is provided to uniquely identify each
branch.  These new sort dimensions also help with annotate because they create
different entries in the histogram which, in turn, generates proper branch
annotations.

Here is an example using AMD's branch sampling:

$ perf record -a -b -c 1000037 -e cpu/branch-brs/ test_prg

$ perf report
Samples: 6M of event 'cpu/branch-brs/', Event count (approx.): 6901276
Overhead  Command          Source Shared Object  Source Symbol                                   Target Symbol                                   Basic Block Cycle
  99.65%  test_prg         test_prg              [.] test_thread                                 [.] test_thread                                 -
   0.02%  test_prg         [kernel.vmlinux]      [k] asm_sysvec_apic_timer_interrupt             [k] error_entry                                 -

$ perf report -F overhead,comm,dso,addr_from,addr_to
Samples: 6M of event 'cpu/branch-brs/', Event count (approx.): 6901276
Overhead  Command          Shared Object     Source Address          Target Address
   4.22%  test_prg         test_prg          [.] test_thread+0x3c    [.] test_thread+0x4
   4.13%  test_prg         test_prg          [.] test_thread+0x4     [.] test_thread+0x3a
   4.09%  test_prg         test_prg          [.] test_thread+0x3a    [.] test_thread+0x6
   4.08%  test_prg         test_prg          [.] test_thread+0x2     [.] test_thread+0x3c
   4.06%  test_prg         test_prg          [.] test_thread+0x3e    [.] test_thread+0x2
   3.87%  test_prg         test_prg          [.] test_thread+0x6     [.] test_thread+0x38
   3.84%  test_prg         test_prg          [.] test_thread         [.] test_thread+0x3e
   3.76%  test_prg         test_prg          [.] test_thread+0x1e    [.] test_thread
   3.76%  test_prg         test_prg          [.] test_thread+0x38    [.] test_thread+0x8
   3.56%  test_prg         test_prg          [.] test_thread+0x22    [.] test_thread+0x1e
   3.54%  test_prg         test_prg          [.] test_thread+0x8     [.] test_thread+0x36
   3.47%  test_prg         test_prg          [.] test_thread+0x1c    [.] test_thread+0x22
   3.45%  test_prg         test_prg          [.] test_thread+0x36    [.] test_thread+0xa
   3.28%  test_prg         test_prg          [.] test_thread+0x24    [.] test_thread+0x1c
   3.25%  test_prg         test_prg          [.] test_thread+0xa     [.] test_thread+0x34
   3.24%  test_prg         test_prg          [.] test_thread+0x1a    [.] test_thread+0x24
   3.20%  test_prg         test_prg          [.] test_thread+0x34    [.] test_thread+0xc
   3.04%  test_prg         test_prg          [.] test_thread+0x26    [.] test_thread+0x1a
   3.01%  test_prg         test_prg          [.] test_thread+0xc     [.] test_thread+0x32
   2.98%  test_prg         test_prg          [.] test_thread+0x18    [.] test_thread+0x26
   2.94%  test_prg         test_prg          [.] test_thread+0x32    [.] test_thread+0xe
   2.76%  test_prg         test_prg          [.] test_thread+0x28    [.] test_thread+0x18
   2.73%  test_prg         test_prg          [.] test_thread+0xe     [.] test_thread+0x30
   2.67%  test_prg         test_prg          [.] test_thread+0x30    [.] test_thread+0x10
   2.67%  test_prg         test_prg          [.] test_thread+0x16    [.] test_thread+0x28
   2.46%  test_prg         test_prg          [.] test_thread+0x10    [.] test_thread+0x2e
   2.44%  test_prg         test_prg          [.] test_thread+0x2a    [.] test_thread+0x16
   2.38%  test_prg         test_prg          [.] test_thread+0x14    [.] test_thread+0x2a
   2.32%  test_prg         test_prg          [.] test_thread+0x2e    [.] test_thread+0x12
   2.28%  test_prg         test_prg          [.] test_thread+0x12    [.] test_thread+0x2c
   2.16%  test_prg         test_prg          [.] test_thread+0x2c    [.] test_thread+0x14
   0.02%  test_prg         [kernel.vmlinux]  [k] asm_sysvec_apic_ti+0x5  [k] error_entry

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/hist.c |   2 +
 tools/perf/util/hist.h |   2 +
 tools/perf/util/sort.c | 128 +++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/sort.h |   2 +
 4 files changed, 134 insertions(+)

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 0a8033b09e28..1c085ab56534 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -124,6 +124,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 		} else {
 			symlen = unresolved_col_width + 4 + 2;
 			hists__new_col_len(hists, HISTC_SYMBOL_FROM, symlen);
+			hists__new_col_len(hists, HISTC_ADDR_FROM, symlen);
 			hists__set_unres_dso_col_len(hists, HISTC_DSO_FROM);
 		}
 
@@ -138,6 +139,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 		} else {
 			symlen = unresolved_col_width + 4 + 2;
 			hists__new_col_len(hists, HISTC_SYMBOL_TO, symlen);
+			hists__new_col_len(hists, HISTC_ADDR_TO, symlen);
 			hists__set_unres_dso_col_len(hists, HISTC_DSO_TO);
 		}
 
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 2a15e22fb89c..7ed4648d2fc2 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -77,6 +77,8 @@ enum hist_column {
 	HISTC_GLOBAL_INS_LAT,
 	HISTC_LOCAL_P_STAGE_CYC,
 	HISTC_GLOBAL_P_STAGE_CYC,
+	HISTC_ADDR_FROM,
+	HISTC_ADDR_TO,
 	HISTC_NR_COLS, /* Last entry */
 };
 
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 2da081ef532b..6d5588e80935 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -990,6 +990,128 @@ struct sort_entry sort_sym_to = {
 	.se_width_idx	= HISTC_SYMBOL_TO,
 };
 
+static int _hist_entry__addr_snprintf(struct map_symbol *ms,
+				     u64 ip, char level, char *bf, size_t size,
+				     unsigned int width)
+{
+	struct symbol *sym = ms->sym;
+	struct map *map = ms->map;
+	size_t ret = 0, offs;
+
+	ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", level);
+	if (sym && map) {
+		if (sym->type == STT_OBJECT) {
+			ret += repsep_snprintf(bf + ret, size - ret, "%s", sym->name);
+			ret += repsep_snprintf(bf + ret, size - ret, "+0x%llx",
+					ip - map->unmap_ip(map, sym->start));
+		} else {
+			ret += repsep_snprintf(bf + ret, size - ret, "%.*s",
+					       width - ret,
+					       sym->name);
+			offs = ip - sym->start;
+			if (offs)
+				ret += repsep_snprintf(bf + ret, size - ret, "+0x%llx", offs);
+		}
+	} else {
+		size_t len = BITS_PER_LONG / 4;
+		ret += repsep_snprintf(bf + ret, size - ret, "%-#.*llx",
+				       len, ip);
+	}
+
+	return ret;
+}
+
+static int hist_entry__addr_from_snprintf(struct hist_entry *he, char *bf,
+					 size_t size, unsigned int width)
+{
+	if (he->branch_info) {
+		struct addr_map_symbol *from = &he->branch_info->from;
+
+		return _hist_entry__addr_snprintf(&from->ms, from->al_addr,
+						 he->level, bf, size, width);
+	}
+
+	return repsep_snprintf(bf, size, "%-*.*s", width, width, "N/A");
+}
+
+static int hist_entry__addr_to_snprintf(struct hist_entry *he, char *bf,
+				       size_t size, unsigned int width)
+{
+	if (he->branch_info) {
+		struct addr_map_symbol *to = &he->branch_info->to;
+
+		return _hist_entry__addr_snprintf(&to->ms, to->al_addr,
+						 he->level, bf, size, width);
+	}
+
+	return repsep_snprintf(bf, size, "%-*.*s", width, width, "N/A");
+}
+
+static int64_t
+sort__addr_from_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	struct addr_map_symbol *from_l;
+	struct addr_map_symbol *from_r;
+	int64_t ret;
+
+	if (!left->branch_info || !right->branch_info)
+		return cmp_null(left->branch_info, right->branch_info);
+
+	from_l = &left->branch_info->from;
+	from_r = &right->branch_info->from;
+
+	/*
+	 * comparing symbol address alone is not enough since it's a
+	 * relative address within a dso.
+	 */
+	ret = _sort__dso_cmp(from_l->ms.map, from_r->ms.map);
+	if (ret != 0)
+		return ret;
+
+	return _sort__addr_cmp(from_l->addr, from_r->addr);
+}
+
+static int64_t
+sort__addr_to_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	struct addr_map_symbol *to_l;
+	struct addr_map_symbol *to_r;
+	int64_t ret;
+
+	if (!left->branch_info || !right->branch_info)
+		return cmp_null(left->branch_info, right->branch_info);
+
+	to_l = &left->branch_info->to;
+	to_r = &right->branch_info->to;
+
+	/*
+	 * comparing symbol address alone is not enough since it's a
+	 * relative address within a dso.
+	 */
+	ret = _sort__dso_cmp(to_l->ms.map, to_r->ms.map);
+	if (ret != 0)
+		return ret;
+
+	return _sort__addr_cmp(to_l->addr, to_r->addr);
+}
+
+struct sort_entry sort_addr_from = {
+	.se_header	= "Source Address",
+	.se_cmp		= sort__addr_from_cmp,
+	.se_snprintf	= hist_entry__addr_from_snprintf,
+	.se_filter	= hist_entry__sym_from_filter, /* shared with sym_from */
+	.se_width_idx	= HISTC_ADDR_FROM,
+};
+
+struct sort_entry sort_addr_to = {
+	.se_header	= "Target Address",
+	.se_cmp		= sort__addr_to_cmp,
+	.se_snprintf	= hist_entry__addr_to_snprintf,
+	.se_filter	= hist_entry__sym_to_filter, /* shared with sym_to */
+	.se_width_idx	= HISTC_ADDR_TO,
+};
+
+
 static int64_t
 sort__mispredict_cmp(struct hist_entry *left, struct hist_entry *right)
 {
@@ -1893,6 +2015,8 @@ static struct sort_dimension bstack_sort_dimensions[] = {
 	DIM(SORT_SRCLINE_FROM, "srcline_from", sort_srcline_from),
 	DIM(SORT_SRCLINE_TO, "srcline_to", sort_srcline_to),
 	DIM(SORT_SYM_IPC, "ipc_lbr", sort_sym_ipc),
+	DIM(SORT_ADDR_FROM, "addr_from", sort_addr_from),
+	DIM(SORT_ADDR_TO, "addr_to", sort_addr_to),
 };
 
 #undef DIM
@@ -3126,6 +3250,10 @@ static bool get_elide(int idx, FILE *output)
 		return __get_elide(symbol_conf.dso_from_list, "dso_from", output);
 	case HISTC_DSO_TO:
 		return __get_elide(symbol_conf.dso_to_list, "dso_to", output);
+	case HISTC_ADDR_FROM:
+		return __get_elide(symbol_conf.sym_from_list, "addr_from", output);
+	case HISTC_ADDR_TO:
+		return __get_elide(symbol_conf.sym_to_list, "addr_to", output);
 	default:
 		break;
 	}
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index f994261888e1..2ddc00d1c464 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -251,6 +251,8 @@ enum sort_type {
 	SORT_SRCLINE_FROM,
 	SORT_SRCLINE_TO,
 	SORT_SYM_IPC,
+	SORT_ADDR_FROM,
+	SORT_ADDR_TO,
 
 	/* memory mode specific sort keys */
 	__SORT_MEMORY_MODE,
-- 
2.35.0.263.gb82422642f-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 06/12] perf/x86/amd: add AMD branch sampling period adjustment
  2022-02-08 21:16 ` [PATCH v6 06/12] perf/x86/amd: add AMD branch sampling period adjustment Stephane Eranian
@ 2022-02-09 15:32   ` Peter Zijlstra
  2022-03-04 15:45     ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2022-02-09 15:32 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, kim.phillips, acme, jolsa, songliubraving

On Tue, Feb 08, 2022 at 01:16:31PM -0800, Stephane Eranian wrote:
> Add code to adjust the sampling event period when used with the Branch
> Sampling feature (BRS). Given the depth of the BRS (16), the period is
> reduced by that depth such that in the best case scenario, BRS saturates at
> the desired sampling period. In practice, though, the processor may execute
> more branches. Given a desired period P and a depth D, the kernel programs
> the actual period at P - D. After P occurrences of the sampling event, the
> counter overflows. It then may take X branches (skid) before the NMI is
> caught and held by the hardware and BRS activates. Then, after D branches,
> BRS saturates and the NMI is delivered.  With no skid, the effective period
> would be (P - D) + D = P. In practice, however, it will likely be (P - D) +
> X + D. There is no way to eliminate X or predict X.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
> ---
>  arch/x86/events/core.c       |  7 +++++++
>  arch/x86/events/perf_event.h | 12 ++++++++++++
>  2 files changed, 19 insertions(+)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index c2a890caeb0a..ed285f640efe 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1374,6 +1374,13 @@ int x86_perf_event_set_period(struct perf_event *event)
>  	    x86_pmu.set_topdown_event_period)
>  		return x86_pmu.set_topdown_event_period(event);
>  
> +	/*
> +	 * decrease period by the depth of the BRS feature to get
> +	 * the last N taken branches and approximate the desired period
> +	 */
> +	if (has_branch_stack(event))
> +		period = amd_brs_adjust_period(period);
> +
>  	/*
>  	 * If we are way outside a reasonable range then just skip forward:
>  	 */
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 3485a4cf0241..25b037b571e4 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -1263,6 +1263,14 @@ static inline bool amd_brs_active(void)
>  	return cpuc->brs_active;
>  }
>  
> +static inline s64 amd_brs_adjust_period(s64 period)
> +{
> +	if (period > x86_pmu.lbr_nr)
> +		return period - x86_pmu.lbr_nr;
> +
> +	return period;
> +}

This makes no sense to me without also enforcing that the event is in
fact that branch retired thing.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 10/12] perf tools: Improve IBS error handling
  2022-02-08 21:16 ` [PATCH v6 10/12] perf tools: Improve IBS error handling Stephane Eranian
@ 2022-02-09 15:47   ` Arnaldo Carvalho de Melo
  2022-03-15  6:49     ` Ravi Bangoria
  2022-03-15  2:01   ` Stephane Eranian
  2022-03-15  7:45   ` Ravi Bangoria
  2 siblings, 1 reply; 29+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-02-09 15:47 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, kim.phillips, acme, jolsa, songliubraving

Em Tue, Feb 08, 2022 at 01:16:35PM -0800, Stephane Eranian escreveu:
> From: Kim Phillips <kim.phillips@amd.com>
> 
> Improve the error message returned on failed perf_event_open() on AMD when
> using IBS.
> 
> Output of executing 'perf record -e ibs_op// true' BEFORE this patch:
> 
> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//u).
> /bin/dmesg | grep -i perf may provide additional information.

Humm, here on a

  $ grep -m1 'model name' /proc/cpuinfo
  model name	: AMD Ryzen 9 5950X 16-Core Processor
  $ ls -la /sys/devices/ibs_op
  total 0
  drwxr-xr-x.  4 root root    0 Feb  9 07:12 .
  drwxr-xr-x. 21 root root    0 Feb  9 07:12 ..
  drwxr-xr-x.  2 root root    0 Feb  9 12:17 format
  -rw-r--r--.  1 root root 4096 Feb  9 12:21 perf_event_mux_interval_ms
  drwxr-xr-x.  2 root root    0 Feb  9 12:21 power
  lrwxrwxrwx.  1 root root    0 Feb  9 07:12 subsystem -> ../../bus/event_source
  -r--r--r--.  1 root root 4096 Feb  9 12:17 type
  -rw-r--r--.  1 root root 4096 Feb  9 07:12 uevent
  $ cat /sys/devices/ibs_op/type
  9
  $

Running without this patch:

  $ uname -a
  Linux five 5.15.14-100.fc34.x86_64 #1 SMP Tue Jan 11 16:53:51 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  $

  $ cat /etc/redhat-release
  Fedora release 34 (Thirty Four)
  $

  $ perf record -e ibs_op// true
  Error:
  Invalid event (ibs_op//u) in per-thread mode, enable system wide with '-a'.
  $

Trying with system wide:

  $ perf record -a -e ibs_op// true
  Error:
  The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//u).
  /bin/dmesg | grep -i perf may provide additional information.
  
  $

So you're missing -a in all examples? Am I missing something?

> Output after:
> 
> AMD IBS cannot exclude kernel events.  Try running at a higher privilege level.
> 
> Output of executing 'sudo perf record -e ibs_op// true' BEFORE this patch:
> 
> Error:
> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//).
> /bin/dmesg | grep -i perf may provide additional information.

Here, as root:

[root@five ~]# perf record -e ibs_op// true
Error:
Invalid event (ibs_op//) in per-thread mode, enable system wide with '-a'.
[root@five ~]# perf record -a -e ibs_op// true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.482 MB perf.data (175 samples) ]
[root@five ~]#

- Arnaldo
 
> Output after:
> 
> Error:
> AMD IBS may only be available in system-wide/per-cpu mode.  Try using -a, or -C and workload affinity
> 
> Signed-off-by: Kim Phillips <kim.phillips@amd.com>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Ian Rogers <irogers@google.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Joao Martins <joao.m.martins@oracle.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Michael Petlan <mpetlan@redhat.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Robert Richter <robert.richter@amd.com>
> Cc: Stephane Eranian <eranian@google.com>
> ---
>  tools/perf/util/evsel.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 22d3267ce294..d42f63a484df 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2847,9 +2847,22 @@ static bool find_process(const char *name)
>  	return ret ? false : true;
>  }
>  
> +static bool is_amd(const char *arch, const char *cpuid)
> +{
> +	return arch && !strcmp("x86", arch) && cpuid && strstarts(cpuid, "AuthenticAMD");
> +}
> +
> +static bool is_amd_ibs(struct evsel *evsel)
> +{
> +	return evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3);
> +}
> +
>  int evsel__open_strerror(struct evsel *evsel, struct target *target,
>  			 int err, char *msg, size_t size)
>  {
> +	struct perf_env *env = evsel__env(evsel);
> +	const char *arch = perf_env__arch(env);
> +	const char *cpuid = perf_env__cpuid(env);
>  	char sbuf[STRERR_BUFSIZE];
>  	int printed = 0, enforced = 0;
>  
> @@ -2949,6 +2962,17 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
>  			return scnprintf(msg, size,
>  	"Invalid event (%s) in per-thread mode, enable system wide with '-a'.",
>  					evsel__name(evsel));
> +		if (is_amd(arch, cpuid)) {
> +			if (is_amd_ibs(evsel)) {
> +				if (evsel->core.attr.exclude_kernel)
> +					return scnprintf(msg, size,
> +	"AMD IBS can't exclude kernel events.  Try running at a higher privilege level.");
> +				if (!evsel->core.system_wide)
> +					return scnprintf(msg, size,
> +	"AMD IBS may only be available in system-wide/per-cpu mode.  Try using -a, or -C and workload affinity");
> +			}
> +		}
> +
>  		break;
>  	case ENODATA:
>  		return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
> -- 
> 2.35.0.263.gb82422642f-goog

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 11/12] perf tools: Improve error handling of AMD Branch Sampling
  2022-02-08 21:16 ` [PATCH v6 11/12] perf tools: Improve error handling of AMD Branch Sampling Stephane Eranian
@ 2022-02-16 14:17   ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 29+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-02-16 14:17 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, kim.phillips, acme, jolsa, songliubraving

Em Tue, Feb 08, 2022 at 01:16:36PM -0800, Stephane Eranian escreveu:
> Improve the error message printed by perf when perf_event_open() fails on
> AMD Zen3 when using the branch sampling feature. In the case of EINVAL, there
> are two main causes: the event uses frequency mode, or the period is smaller
> than the depth of the branch sampling buffer (16). The patch checks the
> parameters of the call and tries to print a relevant message to explain the
> error:
> 
> $ perf record -b -e cpu/branch-brs/ -c 10 ls
> Error:
> AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.
> 
> $ perf record -b -e cpu/branch-brs/ ls
> Error:
> AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
> [Rebased on commit 9fe8895a27a84 ("perf env: Add perf_env__cpuid, perf_env__{nr_}pmu_mappings")]
> Signed-off-by: Kim Phillips <kim.phillips@amd.com>
> ---
>  tools/perf/util/evsel.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index d42f63a484df..7311e7b4d34d 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2857,6 +2857,12 @@ static bool is_amd_ibs(struct evsel *evsel)
>  	return evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3);
>  }
>  
> +static bool is_amd_brs(struct evsel *evsel)
> +{
> +	return ((evsel->core.attr.config & 0xff) == 0xc4) &&
> +	       (evsel->core.attr.sample_type & PERF_SAMPLE_BRANCH_STACK);
> +}
> +

Well, this assumes we're on x86_64, right? Shouldn't we have some extra
condition using perf_env routines to check we're on x86_64?

Did a quick check and powerpc also supports PERF_SAMPLE_BRANCH_STACK

⬢[acme@toolbox perf]$ find arch/ -name "*.c" | xargs grep PERF_SAMPLE_BRANCH_STACK
arch/powerpc/perf/core-book3s.c:		if (event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK) {
arch/x86/events/intel/ds.c:	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
arch/x86/events/intel/ds.c:					PERF_SAMPLE_BRANCH_STACK |
arch/x86/events/intel/lbr.c: * in PERF_SAMPLE_BRANCH_STACK sample may vary.
arch/x86/kvm/vmx/pmu_intel.c:	 * - set 'sample_type = PERF_SAMPLE_BRANCH_STACK' and
arch/x86/kvm/vmx/pmu_intel.c:		.sample_type = PERF_SAMPLE_BRANCH_STACK,
⬢[acme@toolbox perf]$

arch/powerpc/perf/core-book3s.c:

                if (event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK) {
                        struct cpu_hw_events *cpuhw;
                        cpuhw = this_cpu_ptr(&cpu_hw_events);
                        power_pmu_bhrb_read(event, cpuhw);
                        data.br_stack = &cpuhw->bhrb_stack;
                }

>  int evsel__open_strerror(struct evsel *evsel, struct target *target,
>  			 int err, char *msg, size_t size)
>  {
> @@ -2971,6 +2977,14 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
>  					return scnprintf(msg, size,
>  	"AMD IBS may only be available in system-wide/per-cpu mode.  Try using -a, or -C and workload affinity");
>  			}
> +			if (is_amd_brs(evsel)) {
> +				if (evsel->core.attr.freq)
> +					return scnprintf(msg, size,
> +	"AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.");
> +				/* another reason is that the period is too small */
> +				return scnprintf(msg, size,
> +	"AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.");
> +			}
>  		}
>  
>  		break;
> -- 
> 2.35.0.263.gb82422642f-goog

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 12/12] perf report: add addr_from/addr_to sort dimensions
  2022-02-08 21:16 ` [PATCH v6 12/12] perf report: add addr_from/addr_to sort dimensions Stephane Eranian
@ 2022-02-16 14:21   ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 29+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-02-16 14:21 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, kim.phillips, acme, jolsa, songliubraving

Em Tue, Feb 08, 2022 at 01:16:37PM -0800, Stephane Eranian escreveu:
> With the existing symbol_from/symbol_to, branches captured in the same
> function would be collapsed into a single function if the latencies associated
> with each branch (cycles) were all the same.  That is the case on Intel
> Broadwell, for instance. Since Intel Skylake, the latency is captured by
> hardware and therefore is used to disambiguate branches.
> 
> Add addr_from/addr_to sort dimensions to sort branches based on their
> addresses and not the function they are in. The output is still the function
> name but the offset within the function is provided to uniquely identify each
> branch.  These new sort dimensions also help with annotate because they create
> different entries in the histogram which, in turn, generates proper branch
> annotations.
> 
> Here is an example using AMD's branch sampling:

This can be cherry-picked from this patchset, I'll try to do it now, and
also add the above explanation for the new sort dimensions to:

tools/perf/Documentation/perf-report.txt

Please update the documentation in the future when adding new sort
dimensions.

Thanks,

- Arnaldo
 
> $ perf record -a -b -c 1000037 -e cpu/branch-brs/ test_prg
> 
> $ perf report
> Samples: 6M of event 'cpu/branch-brs/', Event count (approx.): 6901276
> Overhead  Command          Source Shared Object  Source Symbol                                   Target Symbol                                   Basic Block Cycle
>   99.65%  test_prg	   test_prg              [.] test_thread                                 [.] test_thread                                 -
>    0.02%  test_prg         [kernel.vmlinux]      [k] asm_sysvec_apic_timer_interrupt             [k] error_entry                                 -
> 
> $ perf report -F overhead,comm,dso,addr_from,addr_to
> Samples: 6M of event 'cpu/branch-brs/', Event count (approx.): 6901276
> Overhead  Command          Shared Object     Source Address          Target Address
>    4.22%  test_prg         test_prg          [.] test_thread+0x3c    [.] test_thread+0x4
>    4.13%  test_prg         test_prg          [.] test_thread+0x4     [.] test_thread+0x3a
>    4.09%  test_prg         test_prg          [.] test_thread+0x3a    [.] test_thread+0x6
>    4.08%  test_prg         test_prg          [.] test_thread+0x2     [.] test_thread+0x3c
>    4.06%  test_prg         test_prg          [.] test_thread+0x3e    [.] test_thread+0x2
>    3.87%  test_prg         test_prg          [.] test_thread+0x6     [.] test_thread+0x38
>    3.84%  test_prg         test_prg          [.] test_thread         [.] test_thread+0x3e
>    3.76%  test_prg         test_prg          [.] test_thread+0x1e    [.] test_thread
>    3.76%  test_prg         test_prg          [.] test_thread+0x38    [.] test_thread+0x8
>    3.56%  test_prg         test_prg          [.] test_thread+0x22    [.] test_thread+0x1e
>    3.54%  test_prg         test_prg          [.] test_thread+0x8     [.] test_thread+0x36
>    3.47%  test_prg         test_prg          [.] test_thread+0x1c    [.] test_thread+0x22
>    3.45%  test_prg         test_prg          [.] test_thread+0x36    [.] test_thread+0xa
>    3.28%  test_prg         test_prg          [.] test_thread+0x24    [.] test_thread+0x1c
>    3.25%  test_prg         test_prg          [.] test_thread+0xa     [.] test_thread+0x34
>    3.24%  test_prg         test_prg          [.] test_thread+0x1a    [.] test_thread+0x24
>    3.20%  test_prg         test_prg          [.] test_thread+0x34    [.] test_thread+0xc
>    3.04%  test_prg         test_prg          [.] test_thread+0x26    [.] test_thread+0x1a
>    3.01%  test_prg         test_prg          [.] test_thread+0xc     [.] test_thread+0x32
>    2.98%  test_prg         test_prg          [.] test_thread+0x18    [.] test_thread+0x26
>    2.94%  test_prg         test_prg          [.] test_thread+0x32    [.] test_thread+0xe
>    2.76%  test_prg         test_prg          [.] test_thread+0x28    [.] test_thread+0x18
>    2.73%  test_prg         test_prg          [.] test_thread+0xe     [.] test_thread+0x30
>    2.67%  test_prg         test_prg          [.] test_thread+0x30    [.] test_thread+0x10
>    2.67%  test_prg         test_prg          [.] test_thread+0x16    [.] test_thread+0x28
>    2.46%  test_prg         test_prg          [.] test_thread+0x10    [.] test_thread+0x2e
>    2.44%  test_prg         test_prg          [.] test_thread+0x2a    [.] test_thread+0x16
>    2.38%  test_prg         test_prg          [.] test_thread+0x14    [.] test_thread+0x2a
>    2.32%  test_prg         test_prg          [.] test_thread+0x2e    [.] test_thread+0x12
>    2.28%  test_prg         test_prg          [.] test_thread+0x12    [.] test_thread+0x2c
>    2.16%  test_prg         test_prg          [.] test_thread+0x2c    [.] test_thread+0x14
>    0.02%  test_prg         [kernel.vmlinux]  [k] asm_sysvec_apic_ti+0x5  [k] error_entry
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
> ---
>  tools/perf/util/hist.c |   2 +
>  tools/perf/util/hist.h |   2 +
>  tools/perf/util/sort.c | 128 +++++++++++++++++++++++++++++++++++++++++
>  tools/perf/util/sort.h |   2 +
>  4 files changed, 134 insertions(+)
> 
> diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
> index 0a8033b09e28..1c085ab56534 100644
> --- a/tools/perf/util/hist.c
> +++ b/tools/perf/util/hist.c
> @@ -124,6 +124,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
>  		} else {
>  			symlen = unresolved_col_width + 4 + 2;
>  			hists__new_col_len(hists, HISTC_SYMBOL_FROM, symlen);
> +			hists__new_col_len(hists, HISTC_ADDR_FROM, symlen);
>  			hists__set_unres_dso_col_len(hists, HISTC_DSO_FROM);
>  		}
>  
> @@ -138,6 +139,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
>  		} else {
>  			symlen = unresolved_col_width + 4 + 2;
>  			hists__new_col_len(hists, HISTC_SYMBOL_TO, symlen);
> +			hists__new_col_len(hists, HISTC_ADDR_TO, symlen);
>  			hists__set_unres_dso_col_len(hists, HISTC_DSO_TO);
>  		}
>  
> diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> index 2a15e22fb89c..7ed4648d2fc2 100644
> --- a/tools/perf/util/hist.h
> +++ b/tools/perf/util/hist.h
> @@ -77,6 +77,8 @@ enum hist_column {
>  	HISTC_GLOBAL_INS_LAT,
>  	HISTC_LOCAL_P_STAGE_CYC,
>  	HISTC_GLOBAL_P_STAGE_CYC,
> +	HISTC_ADDR_FROM,
> +	HISTC_ADDR_TO,
>  	HISTC_NR_COLS, /* Last entry */
>  };
>  
> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> index 2da081ef532b..6d5588e80935 100644
> --- a/tools/perf/util/sort.c
> +++ b/tools/perf/util/sort.c
> @@ -990,6 +990,128 @@ struct sort_entry sort_sym_to = {
>  	.se_width_idx	= HISTC_SYMBOL_TO,
>  };
>  
> +static int _hist_entry__addr_snprintf(struct map_symbol *ms,
> +				     u64 ip, char level, char *bf, size_t size,
> +				     unsigned int width)
> +{
> +	struct symbol *sym = ms->sym;
> +	struct map *map = ms->map;
> +	size_t ret = 0, offs;
> +
> +	ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", level);
> +	if (sym && map) {
> +		if (sym->type == STT_OBJECT) {
> +			ret += repsep_snprintf(bf + ret, size - ret, "%s", sym->name);
> +			ret += repsep_snprintf(bf + ret, size - ret, "+0x%llx",
> +					ip - map->unmap_ip(map, sym->start));
> +		} else {
> +			ret += repsep_snprintf(bf + ret, size - ret, "%.*s",
> +					       width - ret,
> +					       sym->name);
> +			offs = ip - sym->start;
> +			if (offs)
> +				ret += repsep_snprintf(bf + ret, size - ret, "+0x%llx", offs);
> +		}
> +	} else {
> +		size_t len = BITS_PER_LONG / 4;
> +		ret += repsep_snprintf(bf + ret, size - ret, "%-#.*llx",
> +				       len, ip);
> +	}
> +
> +	return ret;
> +}
> +
> +static int hist_entry__addr_from_snprintf(struct hist_entry *he, char *bf,
> +					 size_t size, unsigned int width)
> +{
> +	if (he->branch_info) {
> +		struct addr_map_symbol *from = &he->branch_info->from;
> +
> +		return _hist_entry__addr_snprintf(&from->ms, from->al_addr,
> +						 he->level, bf, size, width);
> +	}
> +
> +	return repsep_snprintf(bf, size, "%-*.*s", width, width, "N/A");
> +}
> +
> +static int hist_entry__addr_to_snprintf(struct hist_entry *he, char *bf,
> +				       size_t size, unsigned int width)
> +{
> +	if (he->branch_info) {
> +		struct addr_map_symbol *to = &he->branch_info->to;
> +
> +		return _hist_entry__addr_snprintf(&to->ms, to->al_addr,
> +						 he->level, bf, size, width);
> +	}
> +
> +	return repsep_snprintf(bf, size, "%-*.*s", width, width, "N/A");
> +}
> +
> +static int64_t
> +sort__addr_from_cmp(struct hist_entry *left, struct hist_entry *right)
> +{
> +	struct addr_map_symbol *from_l;
> +	struct addr_map_symbol *from_r;
> +	int64_t ret;
> +
> +	if (!left->branch_info || !right->branch_info)
> +		return cmp_null(left->branch_info, right->branch_info);
> +
> +	from_l = &left->branch_info->from;
> +	from_r = &right->branch_info->from;
> +
> +	/*
> +	 * comparing symbol address alone is not enough since it's a
> +	 * relative address within a dso.
> +	 */
> +	ret = _sort__dso_cmp(from_l->ms.map, from_r->ms.map);
> +	if (ret != 0)
> +		return ret;
> +
> +	return _sort__addr_cmp(from_l->addr, from_r->addr);
> +}
> +
> +static int64_t
> +sort__addr_to_cmp(struct hist_entry *left, struct hist_entry *right)
> +{
> +	struct addr_map_symbol *to_l;
> +	struct addr_map_symbol *to_r;
> +	int64_t ret;
> +
> +	if (!left->branch_info || !right->branch_info)
> +		return cmp_null(left->branch_info, right->branch_info);
> +
> +	to_l = &left->branch_info->to;
> +	to_r = &right->branch_info->to;
> +
> +	/*
> +	 * comparing symbol address alone is not enough since it's a
> +	 * relative address within a dso.
> +	 */
> +	ret = _sort__dso_cmp(to_l->ms.map, to_r->ms.map);
> +	if (ret != 0)
> +		return ret;
> +
> +	return _sort__addr_cmp(to_l->addr, to_r->addr);
> +}
> +
> +struct sort_entry sort_addr_from = {
> +	.se_header	= "Source Address",
> +	.se_cmp		= sort__addr_from_cmp,
> +	.se_snprintf	= hist_entry__addr_from_snprintf,
> +	.se_filter	= hist_entry__sym_from_filter, /* shared with sym_from */
> +	.se_width_idx	= HISTC_ADDR_FROM,
> +};
> +
> +struct sort_entry sort_addr_to = {
> +	.se_header	= "Target Address",
> +	.se_cmp		= sort__addr_to_cmp,
> +	.se_snprintf	= hist_entry__addr_to_snprintf,
> +	.se_filter	= hist_entry__sym_to_filter, /* shared with sym_to */
> +	.se_width_idx	= HISTC_ADDR_TO,
> +};
> +
> +
>  static int64_t
>  sort__mispredict_cmp(struct hist_entry *left, struct hist_entry *right)
>  {
> @@ -1893,6 +2015,8 @@ static struct sort_dimension bstack_sort_dimensions[] = {
>  	DIM(SORT_SRCLINE_FROM, "srcline_from", sort_srcline_from),
>  	DIM(SORT_SRCLINE_TO, "srcline_to", sort_srcline_to),
>  	DIM(SORT_SYM_IPC, "ipc_lbr", sort_sym_ipc),
> +	DIM(SORT_ADDR_FROM, "addr_from", sort_addr_from),
> +	DIM(SORT_ADDR_TO, "addr_to", sort_addr_to),
>  };
>  
>  #undef DIM
> @@ -3126,6 +3250,10 @@ static bool get_elide(int idx, FILE *output)
>  		return __get_elide(symbol_conf.dso_from_list, "dso_from", output);
>  	case HISTC_DSO_TO:
>  		return __get_elide(symbol_conf.dso_to_list, "dso_to", output);
> +	case HISTC_ADDR_FROM:
> +		return __get_elide(symbol_conf.sym_from_list, "addr_from", output);
> +	case HISTC_ADDR_TO:
> +		return __get_elide(symbol_conf.sym_to_list, "addr_to", output);
>  	default:
>  		break;
>  	}
> diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> index f994261888e1..2ddc00d1c464 100644
> --- a/tools/perf/util/sort.h
> +++ b/tools/perf/util/sort.h
> @@ -251,6 +251,8 @@ enum sort_type {
>  	SORT_SRCLINE_FROM,
>  	SORT_SRCLINE_TO,
>  	SORT_SYM_IPC,
> +	SORT_ADDR_FROM,
> +	SORT_ADDR_TO,
>  
>  	/* memory mode specific sort keys */
>  	__SORT_MEMORY_MODE,
> -- 
> 2.35.0.263.gb82422642f-goog

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 06/12] perf/x86/amd: add AMD branch sampling period adjustment
  2022-02-09 15:32   ` Peter Zijlstra
@ 2022-03-04 15:45     ` Peter Zijlstra
  2022-03-09 23:03       ` Stephane Eranian
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2022-03-04 15:45 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, kim.phillips, acme, jolsa, songliubraving

On Wed, Feb 09, 2022 at 04:32:04PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 08, 2022 at 01:16:31PM -0800, Stephane Eranian wrote:
> > Add code to adjust the sampling event period when used with the Branch
> > Sampling feature (BRS). Given the depth of the BRS (16), the period is
> > reduced by that depth such that in the best case scenario, BRS saturates at
> > the desired sampling period. In practice, though, the processor may execute
> > more branches. Given a desired period P and a depth D, the kernel programs
> > the actual period at P - D. After P occurrences of the sampling event, the
> > counter overflows. It then may take X branches (skid) before the NMI is
> > caught and held by the hardware and BRS activates. Then, after D branches,
> > BRS saturates and the NMI is delivered.  With no skid, the effective period
> > would be (P - D) + D = P. In practice, however, it will likely be (P - D) +
> > X + D. There is no way to eliminate X or predict X.
> > 
> > Signed-off-by: Stephane Eranian <eranian@google.com>
> > ---
> >  arch/x86/events/core.c       |  7 +++++++
> >  arch/x86/events/perf_event.h | 12 ++++++++++++
> >  2 files changed, 19 insertions(+)
> > 
> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > index c2a890caeb0a..ed285f640efe 100644
> > --- a/arch/x86/events/core.c
> > +++ b/arch/x86/events/core.c
> > @@ -1374,6 +1374,13 @@ int x86_perf_event_set_period(struct perf_event *event)
> >  	    x86_pmu.set_topdown_event_period)
> >  		return x86_pmu.set_topdown_event_period(event);
> >  
> > +	/*
> > +	 * decrease period by the depth of the BRS feature to get
> > +	 * the last N taken branches and approximate the desired period
> > +	 */
> > +	if (has_branch_stack(event))
> > +		period = amd_brs_adjust_period(period);
> > +
> >  	/*
> >  	 * If we are way outside a reasonable range then just skip forward:
> >  	 */
> > diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> > index 3485a4cf0241..25b037b571e4 100644
> > --- a/arch/x86/events/perf_event.h
> > +++ b/arch/x86/events/perf_event.h
> > @@ -1263,6 +1263,14 @@ static inline bool amd_brs_active(void)
> >  	return cpuc->brs_active;
> >  }
> >  
> > +static inline s64 amd_brs_adjust_period(s64 period)
> > +{
> > +	if (period > x86_pmu.lbr_nr)
> > +		return period - x86_pmu.lbr_nr;
> > +
> > +	return period;
> > +}
> 
> This makes no sense to me without also enforcing that the event is in
> fact that branch retired thing.

So what are we going to do with all these patches? Note that I did pick
them up for testing and I've fixed at least 2 build problems with them.

But I still don't think they're actually completely sane. So there's the
above issue, subtracting lbr_nr from a random event just makes no sense.
But there's also the whole exclusion thing, IIRC you're making it
exclusive against other LBR users, but AFAICT having one LBR user active
will completely screw over any other sampling event due to introducing
these massive skids.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 06/12] perf/x86/amd: add AMD branch sampling period adjustment
  2022-03-04 15:45     ` Peter Zijlstra
@ 2022-03-09 23:03       ` Stephane Eranian
  2022-03-15 12:08         ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Stephane Eranian @ 2022-03-09 23:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, kim.phillips, acme, jolsa, songliubraving

On Fri, Mar 4, 2022 at 7:45 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 09, 2022 at 04:32:04PM +0100, Peter Zijlstra wrote:
> > On Tue, Feb 08, 2022 at 01:16:31PM -0800, Stephane Eranian wrote:
> > > Add code to adjust the sampling event period when used with the Branch
> > > Sampling feature (BRS). Given the depth of the BRS (16), the period is
> > > reduced by that depth such that in the best case scenario, BRS saturates at
> > > the desired sampling period. In practice, though, the processor may execute
> > > more branches. Given a desired period P and a depth D, the kernel programs
> > > the actual period at P - D. After P occurrences of the sampling event, the
> > > counter overflows. It then may take X branches (skid) before the NMI is
> > > caught and held by the hardware and BRS activates. Then, after D branches,
> > > BRS saturates and the NMI is delivered.  With no skid, the effective period
> > > would be (P - D) + D = P. In practice, however, it will likely be (P - D) +
> > > X + D. There is no way to eliminate X or predict X.
> > >
> > > Signed-off-by: Stephane Eranian <eranian@google.com>
> > > ---
> > >  arch/x86/events/core.c       |  7 +++++++
> > >  arch/x86/events/perf_event.h | 12 ++++++++++++
> > >  2 files changed, 19 insertions(+)
> > >
> > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > > index c2a890caeb0a..ed285f640efe 100644
> > > --- a/arch/x86/events/core.c
> > > +++ b/arch/x86/events/core.c
> > > @@ -1374,6 +1374,13 @@ int x86_perf_event_set_period(struct perf_event *event)
> > >         x86_pmu.set_topdown_event_period)
> > >             return x86_pmu.set_topdown_event_period(event);
> > >
> > > +   /*
> > > +    * decrease period by the depth of the BRS feature to get
> > > +    * the last N taken branches and approximate the desired period
> > > +    */
> > > +   if (has_branch_stack(event))
> > > +           period = amd_brs_adjust_period(period);
> > > +
> > >     /*
> > >      * If we are way outside a reasonable range then just skip forward:
> > >      */
> > > diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> > > index 3485a4cf0241..25b037b571e4 100644
> > > --- a/arch/x86/events/perf_event.h
> > > +++ b/arch/x86/events/perf_event.h
> > > @@ -1263,6 +1263,14 @@ static inline bool amd_brs_active(void)
> > >     return cpuc->brs_active;
> > >  }
> > >
> > > +static inline s64 amd_brs_adjust_period(s64 period)
> > > +{
> > > +   if (period > x86_pmu.lbr_nr)
> > > +           return period - x86_pmu.lbr_nr;
> > > +
> > > +   return period;
> > > +}
> >
> > This makes no sense to me without also enforcing that the event is in
> > fact that branch retired thing.
>
> So what are we going to do with all these patches? Note that I did pick
> them up for testing and I've fixed at least 2 build problems with them.
>
> But I still don't think they're actually completely sane. So there's the
> above issue, subtracting lbr_nr from a random event just makes no sense.


You are right. Initially, I had it such that retired_branch_taken was
the only event possible. In that case, subtracting lbr_nr made sense.
Since then, I have relaxed the event restriction, but that exposes this
problem. Given how BRS works, I am okay with restricting to
retired_br_taken, because no matter what, the hw is going to activate at
P (period) and wait for 16 taken branches before delivering the NMI. So
if I am sampling on cycles with P=1000000, then the NMI is delivered
at P + X + Z, where X = the number of cycles elapsed for the 16 taken
branches (unpredictable) and Z = the interrupt skid for the NMI (which is
extremely big on AMD). With retired_branch_taken, that formula
becomes: P + 16 + Z, where Z is the number of taken branches executed
during the skid. But given that BRS saturates when full, you do lose
content because of the Z skid. My opinion is that we keep the
lbr_nr subtraction and force the event to be only retired_branch_taken.

> But there's also the whole exclusion thing, IIRC you're making it
> exclusive against other LBR users, but AFAICT having one LBR user active
> will completely screw over any other sampling event due to introducing
> these massive skids.


The skid is not massive compared to the actual skid of regular interrupt-based
sampling. You are looking at the time it takes to execute 16 taken branches
vs. 2000+ cycles for the NMI skid.  And this would happen only if the other
events overflow during that 16 taken branch window.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 10/12] perf tools: Improve IBS error handling
  2022-02-08 21:16 ` [PATCH v6 10/12] perf tools: Improve IBS error handling Stephane Eranian
  2022-02-09 15:47   ` Arnaldo Carvalho de Melo
@ 2022-03-15  2:01   ` Stephane Eranian
  2022-03-15  6:23     ` Ravi Bangoria
  2022-03-15  7:45   ` Ravi Bangoria
  2 siblings, 1 reply; 29+ messages in thread
From: Stephane Eranian @ 2022-03-15  2:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, kim.phillips, acme, jolsa, songliubraving

Hi Kim,

On Tue, Feb 8, 2022 at 1:17 PM Stephane Eranian <eranian@google.com> wrote:
>
> From: Kim Phillips <kim.phillips@amd.com>
>
> Improve the error message returned on failed perf_event_open() on AMD when
> using IBS.
>
> Output of executing 'perf record -e ibs_op// true' BEFORE this patch:
>
> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//u).
> /bin/dmesg | grep -i perf may provide additional information.
>
> Output after:
>
> AMD IBS cannot exclude kernel events.  Try running at a higher privilege level.
>
> Output of executing 'sudo perf record -e ibs_op// true' BEFORE this patch:
>
> Error:
> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//).
> /bin/dmesg | grep -i perf may provide additional information.
>
> Output after:
>
> Error:
> AMD IBS may only be available in system-wide/per-cpu mode.  Try using -a, or -C and workload affinity
>
> Signed-off-by: Kim Phillips <kim.phillips@amd.com>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Ian Rogers <irogers@google.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Joao Martins <joao.m.martins@oracle.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Michael Petlan <mpetlan@redhat.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Robert Richter <robert.richter@amd.com>
> Cc: Stephane Eranian <eranian@google.com>
> ---
>  tools/perf/util/evsel.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 22d3267ce294..d42f63a484df 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2847,9 +2847,22 @@ static bool find_process(const char *name)
>         return ret ? false : true;
>  }
>
> +static bool is_amd(const char *arch, const char *cpuid)
> +{
> +       return arch && !strcmp("x86", arch) && cpuid && strstarts(cpuid, "AuthenticAMD");
> +}
> +
> +static bool is_amd_ibs(struct evsel *evsel)
> +{
> +       return evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3);
> +}
> +
>  int evsel__open_strerror(struct evsel *evsel, struct target *target,
>                          int err, char *msg, size_t size)
>  {
> +       struct perf_env *env = evsel__env(evsel);
> +       const char *arch = perf_env__arch(env);
> +       const char *cpuid = perf_env__cpuid(env);


This code dies for me on the latest tip.git because env = NULL and
perf_env__cpuid() is broken for a NULL argument.
I don't quite know where this env global variable is set, but I hope
there is a better way of doing this, maybe using
the evsel__env() function in the same util/evsel.c file.

Similarly, is_amd_ibs() suffers from a NULL pointer dereference
because evsel->pmu_name may be NULL:

$ perf record -e rc2 .....

causes a NULL pmu_name.

Could you please send me an updated version to integrate with the
branch sampling code?

Thanks.


>
>         char sbuf[STRERR_BUFSIZE];
>         int printed = 0, enforced = 0;
>
> @@ -2949,6 +2962,17 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
>                         return scnprintf(msg, size,
>         "Invalid event (%s) in per-thread mode, enable system wide with '-a'.",
>                                         evsel__name(evsel));
> +               if (is_amd(arch, cpuid)) {
> +                       if (is_amd_ibs(evsel)) {
> +                               if (evsel->core.attr.exclude_kernel)
> +                                       return scnprintf(msg, size,
> +       "AMD IBS can't exclude kernel events.  Try running at a higher privilege level.");
> +                               if (!evsel->core.system_wide)
> +                                       return scnprintf(msg, size,
> +       "AMD IBS may only be available in system-wide/per-cpu mode.  Try using -a, or -C and workload affinity");
> +                       }
> +               }
> +
>                 break;
>         case ENODATA:
>                 return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
> --
> 2.35.0.263.gb82422642f-goog
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 10/12] perf tools: Improve IBS error handling
  2022-03-15  2:01   ` Stephane Eranian
@ 2022-03-15  6:23     ` Ravi Bangoria
  2022-03-15  7:12       ` Stephane Eranian
  0 siblings, 1 reply; 29+ messages in thread
From: Ravi Bangoria @ 2022-03-15  6:23 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: peterz, kim.phillips, acme, jolsa, songliubraving, Ravi Bangoria,
	linux-kernel

>> +static bool is_amd(const char *arch, const char *cpuid)
>> +{
>> +       return arch && !strcmp("x86", arch) && cpuid && strstarts(cpuid, "AuthenticAMD");
>> +}
>> +
>> +static bool is_amd_ibs(struct evsel *evsel)
>> +{
>> +       return evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3);
>> +}
>> +
>>  int evsel__open_strerror(struct evsel *evsel, struct target *target,
>>                          int err, char *msg, size_t size)
>>  {
>> +       struct perf_env *env = evsel__env(evsel);
>> +       const char *arch = perf_env__arch(env);
>> +       const char *cpuid = perf_env__cpuid(env);
> 
> 
> This code dies for me on the latest tip.git because env = NULL and
> perf_env__cpuid() is broken for a NULL argument.
> I don't quite know where this env global variable is set but I hope
> there is a better way of doing this, maybe using
> the evsel__env() function in the same util/evsel.c file.
> 
> Similarly, is_amd_ibs() suffers from a NULL pointer dereference
> because evsel->pmu_name may be NULL:
> 
> $ perf record -e rc2 .....
> 
> causes a NULL pmu_name.

This should fix the issue:

@@ -2965,7 +2989,7 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,

 struct perf_env *evsel__env(struct evsel *evsel)
 {
-	if (evsel && evsel->evlist)
+	if (evsel && evsel->evlist && evsel->evlist->env)
 		return evsel->evlist->env;
 	return &perf_env;
 }

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 10/12] perf tools: Improve IBS error handling
  2022-02-09 15:47   ` Arnaldo Carvalho de Melo
@ 2022-03-15  6:49     ` Ravi Bangoria
  0 siblings, 0 replies; 29+ messages in thread
From: Ravi Bangoria @ 2022-03-15  6:49 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: linux-kernel, peterz, kim.phillips, acme, jolsa, songliubraving,
	Stephane Eranian, Ravi Bangoria

Arnaldo,

On 09-Feb-22 9:17 PM, Arnaldo Carvalho de Melo wrote:
> Em Tue, Feb 08, 2022 at 01:16:35PM -0800, Stephane Eranian escreveu:
>> From: Kim Phillips <kim.phillips@amd.com>
>>
>> Improve the error message returned on failed perf_event_open() on AMD when
>> using IBS.
>>
>> Output of executing 'perf record -e ibs_op// true' BEFORE this patch:
>>
>> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//u).
>> /bin/dmesg | grep -i perf may provide additional information.
> 
> Humm, here on a
> 
>   $ grep -m1 'model name' /proc/cpuinfo
>   model name	: AMD Ryzen 9 5950X 16-Core Processor
>   $ ls -la /sys/devices/ibs_op
>   total 0
>   drwxr-xr-x.  4 root root    0 Feb  9 07:12 .
>   drwxr-xr-x. 21 root root    0 Feb  9 07:12 ..
>   drwxr-xr-x.  2 root root    0 Feb  9 12:17 format
>   -rw-r--r--.  1 root root 4096 Feb  9 12:21 perf_event_mux_interval_ms
>   drwxr-xr-x.  2 root root    0 Feb  9 12:21 power
>   lrwxrwxrwx.  1 root root    0 Feb  9 07:12 subsystem -> ../../bus/event_source
>   -r--r--r--.  1 root root 4096 Feb  9 12:17 type
>   -rw-r--r--.  1 root root 4096 Feb  9 07:12 uevent
>   $ cat /sys/devices/ibs_op/type
>   9
>   $
> 
> Running without this patch:
> 
>   $ uname -a
>   Linux five 5.15.14-100.fc34.x86_64 #1 SMP Tue Jan 11 16:53:51 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
>   $
> 
>   $ cat /etc/redhat-release
>   Fedora release 34 (Thirty Four)
>   $
> 
>   $ perf record -e ibs_op// true
>   Error:
>   Invalid event (ibs_op//u) in per-thread mode, enable system wide with '-a'.
>   $
> 
> Trying with system wide:
> 
>   $ perf record -a -e ibs_op// true
>   Error:
>   The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//u).
>   /bin/dmesg | grep -i perf may provide additional information.
>   
>   $
> 
> So you're missing -a in all examples? Am I missing something?

AMD IBS does not support per-thread monitoring because it's configured
as an uncore PMU (perf_invalid_context).

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 10/12] perf tools: Improve IBS error handling
  2022-03-15  6:23     ` Ravi Bangoria
@ 2022-03-15  7:12       ` Stephane Eranian
  0 siblings, 0 replies; 29+ messages in thread
From: Stephane Eranian @ 2022-03-15  7:12 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: peterz, kim.phillips, acme, jolsa, songliubraving, linux-kernel

On Mon, Mar 14, 2022 at 11:23 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>
> >> +static bool is_amd(const char *arch, const char *cpuid)
> >> +{
> >> +       return arch && !strcmp("x86", arch) && cpuid && strstarts(cpuid, "AuthenticAMD");
> >> +}
> >> +
> >> +static bool is_amd_ibs(struct evsel *evsel)
> >> +{
> >> +       return evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3);
> >> +}
> >> +
> >>  int evsel__open_strerror(struct evsel *evsel, struct target *target,
> >>                          int err, char *msg, size_t size)
> >>  {
> >> +       struct perf_env *env = evsel__env(evsel);
> >> +       const char *arch = perf_env__arch(env);
> >> +       const char *cpuid = perf_env__cpuid(env);
> >
> >
> > This code dies for me on the latest tip.git because env = NULL and
> > perf_env__cpuid() is broken for a NULL argument.
> > I don't quite know where this env global variable is set but I hope
> > there is a better way of doing this, maybe using
> > the evsel__env() function in the same util/evsel.c file.
> >
> > Similarly, is_amd_ibs() suffers from a NULL pointer dereference
> > because evsel->pmu_name may be NULL:
> >
> > $ perf record -e rc2 .....
> >
> > causes a NULL pmu_name.
>
> This should fix the issue:
>
> @@ -2965,7 +2989,7 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
>
>  struct perf_env *evsel__env(struct evsel *evsel)
>  {
> -       if (evsel && evsel->evlist)
> +       if (evsel && evsel->evlist && evsel->evlist->env)
>                 return evsel->evlist->env;
>         return &perf_env;
>  }
>
I will check with that change. Similarly, the IBS helper needs to have
the following change:
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2854,13 +2854,18 @@ static bool is_amd(const char *arch, const char *cpuid)

 static bool is_amd_ibs(struct evsel *evsel)
 {
-       return evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3);
+       return evsel->core.attr.precise_ip
+           || (evsel->pmu_name && !strncmp(evsel->pmu_name, "ibs", 3));
 }

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 10/12] perf tools: Improve IBS error handling
  2022-02-08 21:16 ` [PATCH v6 10/12] perf tools: Improve IBS error handling Stephane Eranian
  2022-02-09 15:47   ` Arnaldo Carvalho de Melo
  2022-03-15  2:01   ` Stephane Eranian
@ 2022-03-15  7:45   ` Ravi Bangoria
  2022-03-16  0:03     ` Stephane Eranian
  2 siblings, 1 reply; 29+ messages in thread
From: Ravi Bangoria @ 2022-03-15  7:45 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: peterz, kim.phillips, acme, jolsa, songliubraving, linux-kernel,
	Ravi Bangoria

Stephane,

On 09-Feb-22 2:46 AM, Stephane Eranian wrote:
> From: Kim Phillips <kim.phillips@amd.com>
> 
> improve the error message returned on failed perf_event_open() on AMD when
> using IBS.
> 
> Output of executing 'perf record -e ibs_op// true' BEFORE this patch:
> 
> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//u).
> /bin/dmesg | grep -i perf may provide additional information.
> 
> Output after:
> 
> AMD IBS cannot exclude kernel events.  Try running at a higher privilege level.
> 
> Output of executing 'sudo perf record -e ibs_op// true' BEFORE this patch:
> 
> Error:
> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//).
> /bin/dmesg | grep -i perf may provide additional information.
> 
> Output after:
> 
> Error:
> AMD IBS may only be available in system-wide/per-cpu mode.  Try using -a, or -C and workload affinity

This patch seems to be causing a regression in the perf python test.

Without this patch:

  $ sudo ./perf test -vvv 19
   19: 'import perf' in python                                         :
  --- start ---
  test child forked, pid 145391
  python usage test: "echo "import sys ; sys.path.append('python'); import perf" | '/usr/bin/python2' "
  test child finished with 0
  ---- end ----
  'import perf' in python: Ok

With this patch:

  $ sudo ./perf test -vvv 19
   19: 'import perf' in python                                         :
  --- start ---
  test child forked, pid 144415
  python usage test: "echo "import sys ; sys.path.append('python'); import perf" | '/usr/bin/python2' "
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  ImportError: python/perf.so: undefined symbol: perf_env__cpuid
  test child finished with -1
  ---- end ----
  'import perf' in python: FAILED!

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 06/12] perf/x86/amd: add AMD branch sampling period adjustment
  2022-03-09 23:03       ` Stephane Eranian
@ 2022-03-15 12:08         ` Peter Zijlstra
  2022-03-17 17:11           ` Stephane Eranian
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2022-03-15 12:08 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, kim.phillips, acme, jolsa, songliubraving

On Wed, Mar 09, 2022 at 03:03:39PM -0800, Stephane Eranian wrote:
> On Fri, Mar 4, 2022 at 7:45 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Feb 09, 2022 at 04:32:04PM +0100, Peter Zijlstra wrote:
> > > On Tue, Feb 08, 2022 at 01:16:31PM -0800, Stephane Eranian wrote:
> > > > Add code to adjust the sampling event period when used with the Branch
> > > > Sampling feature (BRS). Given the depth of the BRS (16), the period is
> > > > reduced by that depth such that in the best case scenario, BRS saturates at
> > > > the desired sampling period. In practice, though, the processor may execute
> > > > more branches. Given a desired period P and a depth D, the kernel programs
> > > > the actual period at P - D. After P occurrences of the sampling event, the
> > > > counter overflows. It then may take X branches (skid) before the NMI is
> > > > caught and held by the hardware and BRS activates. Then, after D branches,
> > > > BRS saturates and the NMI is delivered.  With no skid, the effective period
> > > > would be (P - D) + D = P. In practice, however, it will likely be (P - D) +
> > > > X + D. There is no way to eliminate X or predict X.
> > > >
> > > > Signed-off-by: Stephane Eranian <eranian@google.com>
> > > > ---
> > > >  arch/x86/events/core.c       |  7 +++++++
> > > >  arch/x86/events/perf_event.h | 12 ++++++++++++
> > > >  2 files changed, 19 insertions(+)
> > > >
> > > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > > > index c2a890caeb0a..ed285f640efe 100644
> > > > --- a/arch/x86/events/core.c
> > > > +++ b/arch/x86/events/core.c
> > > > @@ -1374,6 +1374,13 @@ int x86_perf_event_set_period(struct perf_event *event)
> > > >         x86_pmu.set_topdown_event_period)
> > > >             return x86_pmu.set_topdown_event_period(event);
> > > >
> > > > +   /*
> > > > +    * decrease period by the depth of the BRS feature to get
> > > > +    * the last N taken branches and approximate the desired period
> > > > +    */
> > > > +   if (has_branch_stack(event))
> > > > +           period = amd_brs_adjust_period(period);
> > > > +
> > > >     /*
> > > >      * If we are way outside a reasonable range then just skip forward:
> > > >      */
> > > > diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> > > > index 3485a4cf0241..25b037b571e4 100644
> > > > --- a/arch/x86/events/perf_event.h
> > > > +++ b/arch/x86/events/perf_event.h
> > > > @@ -1263,6 +1263,14 @@ static inline bool amd_brs_active(void)
> > > >     return cpuc->brs_active;
> > > >  }
> > > >
> > > > +static inline s64 amd_brs_adjust_period(s64 period)
> > > > +{
> > > > +   if (period > x86_pmu.lbr_nr)
> > > > +           return period - x86_pmu.lbr_nr;
> > > > +
> > > > +   return period;
> > > > +}
> > >
> > > This makes no sense to me without also enforcing that the event is in
> > > fact that branch retired thing.
> >
> > So what are we going to do with all these patches? Note that I did pick
> > them up for testing and I've fixed at least 2 build problems with them.
> >
> > But I still don't think they're actually completely sane. So there's the
> > above issue, subtracting lbr_nr from a random event just makes no sense.
> 
> 
> You are right. Initially, retired_branch_taken was the only event
> possible, and in that case subtracting lbr_nr made sense. Since then,
> I have relaxed the event restriction, which exposes this problem.
> Given how BRS works, I am okay with restricting to retired_br_taken,
> because no matter what, the hw is going to activate at P (period)
> and wait for 16 taken branches before delivering the NMI. So if I
> am sampling on cycles with P=1000000, then the NMI is delivered
> at P + X + Z, where X = number of cycles elapsed for the 16 taken
> branches (unpredictable) and Z = the interrupt skid for the NMI
> (which is extremely big on AMD). With retired_branch_taken, that
> formula becomes P + 16 + Z, where Z is the number of taken branches
> during the skid. But given BRS saturates when full, you do lose
> content because of the Z skid. My opinion is we keep the lbr_nr
> subtraction and force the event to be retired_branch_taken only.

OK, can you do me a delta patch and tell me which commit to merge it in?

> > But there's also the whole exclusion thing, IIRC you're making it
> > exclusive against other LBR users, but AFAICT having one LBR user active
> > will completely screw over any other sampling event due to introducing
> > these massive skids.
> 
> 
> The skid is not massive compared to the actual skid of regular interrupt-based
> sampling. You are looking at the time it takes to execute 16 taken branches
> vs. 2000+ cycles for the NMI skid.  And this would happen only if the other
> events overflow during that 16 taken branch window.

Wait, you're telling me that regs->ip is 2000 cycles/CPI further along
than the instruction that caused the PMI on AMD? That seems beyond
useless.

That's also not what I seem to remember from the last time I used perf
on AMD (admittedly a while ago). Normally the reported IP is a few
instructions beyond the eventing IP. Yielding the normal perf-annotate
output that's shifted but mostly trivially readable.

However, if you delay that NMI for however many instructions it takes to
do 16 branches, the reported IP (regs->ip) will be completely unrelated
to the eventing IP (the one that actually triggered PMI).

In that case the perf-annotate output becomes really hard to interpret.
Esp. if you don't know which IPs were basically garbage.

One possible work-around might be to discard the sample for any
!retired_branch_taken overflow and reprogram those counters with a very
small (1?) value to 'insta' take a new sample without interference. But
that's yuck too.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 10/12] perf tools: Improve IBS error handling
  2022-03-15  7:45   ` Ravi Bangoria
@ 2022-03-16  0:03     ` Stephane Eranian
  2022-03-16 11:07       ` Ravi Bangoria
  0 siblings, 1 reply; 29+ messages in thread
From: Stephane Eranian @ 2022-03-16  0:03 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: peterz, kim.phillips, acme, jolsa, songliubraving, linux-kernel

On Tue, Mar 15, 2022 at 12:46 AM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>
> Stephane,
>
> On 09-Feb-22 2:46 AM, Stephane Eranian wrote:
> > From: Kim Phillips <kim.phillips@amd.com>
> >
> > improve the error message returned on failed perf_event_open() on AMD when
> > using IBS.
> >
> > Output of executing 'perf record -e ibs_op// true' BEFORE this patch:
> >
> > The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//u).
> > /bin/dmesg | grep -i perf may provide additional information.
> >
> > Output after:
> >
> > AMD IBS cannot exclude kernel events.  Try running at a higher privilege level.
> >
> > Output of executing 'sudo perf record -e ibs_op// true' BEFORE this patch:
> >
> > Error:
> > The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//).
> > /bin/dmesg | grep -i perf may provide additional information.
> >
> > Output after:
> >
> > Error:
> > AMD IBS may only be available in system-wide/per-cpu mode.  Try using -a, or -C and workload affinity
>
> This patch seems to be causing a regression in the perf python test.
>
> Without this patch:
>
>   $ sudo ./perf test -vvv 19
>    19: 'import perf' in python                                         :
>   --- start ---
>   test child forked, pid 145391
>   python usage test: "echo "import sys ; sys.path.append('python'); import perf" | '/usr/bin/python2' "
>   test child finished with 0
>   ---- end ----
>   'import perf' in python: Ok
>
> With this patch:
>
>   $ sudo ./perf test -vvv 19
>    19: 'import perf' in python                                         :
>   --- start ---
>   test child forked, pid 144415
>   python usage test: "echo "import sys ; sys.path.append('python'); import perf" | '/usr/bin/python2' "
>   Traceback (most recent call last):
>     File "<stdin>", line 1, in <module>
>   ImportError: python/perf.so: undefined symbol: perf_env__cpuid
>   test child finished with -1
>   ---- end ----
>   'import perf' in python: FAILED!
>

The fix I sent you is just to prevent a potential SEGFAULT in
is_amd_ibs(). I bet the test fails some perf_event_open() call and
ends up in the strerror function, and from there I don't see how the
patch could impact the test, given you'd segfault otherwise.

I tried on my side, and the test fails with or without this patch. I
think this is an unrelated issue.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 10/12] perf tools: Improve IBS error handling
  2022-03-16  0:03     ` Stephane Eranian
@ 2022-03-16 11:07       ` Ravi Bangoria
  2022-03-16 11:16         ` Ravi Bangoria
  0 siblings, 1 reply; 29+ messages in thread
From: Ravi Bangoria @ 2022-03-16 11:07 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: peterz, kim.phillips, acme, jolsa, songliubraving, linux-kernel,
	Ravi Bangoria

Stephane,

On 16-Mar-22 5:33 AM, Stephane Eranian wrote:
> On Tue, Mar 15, 2022 at 12:46 AM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>>
>> Stephane,
>>
>> On 09-Feb-22 2:46 AM, Stephane Eranian wrote:
>>> From: Kim Phillips <kim.phillips@amd.com>
>>>
>>> improve the error message returned on failed perf_event_open() on AMD when
>>> using IBS.
>>>
>>> Output of executing 'perf record -e ibs_op// true' BEFORE this patch:
>>>
>>> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//u).
>>> /bin/dmesg | grep -i perf may provide additional information.
>>>
>>> Output after:
>>>
>>> AMD IBS cannot exclude kernel events.  Try running at a higher privilege level.
>>>
>>> Output of executing 'sudo perf record -e ibs_op// true' BEFORE this patch:
>>>
>>> Error:
>>> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//).
>>> /bin/dmesg | grep -i perf may provide additional information.
>>>
>>> Output after:
>>>
>>> Error:
>>> AMD IBS may only be available in system-wide/per-cpu mode.  Try using -a, or -C and workload affinity
>>
>> This patch seems to be causing a regression in the perf python test.
>>
>> Without this patch:
>>
>>   $ sudo ./perf test -vvv 19
>>    19: 'import perf' in python                                         :
>>   --- start ---
>>   test child forked, pid 145391
>>   python usage test: "echo "import sys ; sys.path.append('python'); import perf" | '/usr/bin/python2' "
>>   test child finished with 0
>>   ---- end ----
>>   'import perf' in python: Ok
>>
>> With this patch:
>>
>>   $ sudo ./perf test -vvv 19
>>    19: 'import perf' in python                                         :
>>   --- start ---
>>   test child forked, pid 144415
>>   python usage test: "echo "import sys ; sys.path.append('python'); import perf" | '/usr/bin/python2' "
>>   Traceback (most recent call last):
>>     File "<stdin>", line 1, in <module>
>>   ImportError: python/perf.so: undefined symbol: perf_env__cpuid
>>   test child finished with -1
>>   ---- end ----
>>   'import perf' in python: FAILED!
>>
> 
> The fix I sent you is just to prevent a potential SEGFAULT in
> is_amd_ibs(). I bet the test fails some perf_event_open() call and
> ends up in the strerror function, and from there I don't see how the
> patch could impact the test, given you'd segfault otherwise.
> 
> I tried on my side, and the test fails with or without this patch. I
> think this is an unrelated issue.

That's strange. IIUC, python/perf.so needs to know about util/evsel.c
which can be done by adding an entry in util/python-ext-sources.
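
A hedged sketch of the build fix being suggested (the real file is tools/perf/util/python-ext-sources in the perf tree; the snippet below only simulates the edit on a scratch file so nothing in a real tree is touched):

```shell
# Simulate adding util/env.c to python-ext-sources so python/perf.so
# would link in perf_env__cpuid(). The file name below is a scratch
# stand-in, not a real perf tree path.
f=python-ext-sources.scratch
printf 'util/evsel.c\n' > "$f"                       # pretend existing entries
grep -qx 'util/env.c' "$f" || printf 'util/env.c\n' >> "$f"
cat "$f"
rm -f "$f"
```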

In any case, do we really need this patch now? For both of the examples given
in the description, I see no difference with or without the patch:

Without patch:

  $ sudo ./perf record -e ibs_op// true
  Error:
  Invalid event (ibs_op//) in per-thread mode, enable system wide with '-a'.

  $ ./perf record -e ibs_op// true
  Error:
  Invalid event (ibs_op//u) in per-thread mode, enable system wide with '-a'.

Same output with the patch as well.

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 10/12] perf tools: Improve IBS error handling
  2022-03-16 11:07       ` Ravi Bangoria
@ 2022-03-16 11:16         ` Ravi Bangoria
  0 siblings, 0 replies; 29+ messages in thread
From: Ravi Bangoria @ 2022-03-16 11:16 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: peterz, kim.phillips, acme, jolsa, songliubraving, linux-kernel,
	Ravi Bangoria



On 16-Mar-22 4:37 PM, Ravi Bangoria wrote:
> Stephane,
> 
> On 16-Mar-22 5:33 AM, Stephane Eranian wrote:
>> On Tue, Mar 15, 2022 at 12:46 AM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>>>
>>> Stephane,
>>>
>>> On 09-Feb-22 2:46 AM, Stephane Eranian wrote:
>>>> From: Kim Phillips <kim.phillips@amd.com>
>>>>
>>>> improve the error message returned on failed perf_event_open() on AMD when
>>>> using IBS.
>>>>
>>>> Output of executing 'perf record -e ibs_op// true' BEFORE this patch:
>>>>
>>>> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//u).
>>>> /bin/dmesg | grep -i perf may provide additional information.
>>>>
>>>> Output after:
>>>>
>>>> AMD IBS cannot exclude kernel events.  Try running at a higher privilege level.
>>>>
>>>> Output of executing 'sudo perf record -e ibs_op// true' BEFORE this patch:
>>>>
>>>> Error:
>>>> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (ibs_op//).
>>>> /bin/dmesg | grep -i perf may provide additional information.
>>>>
>>>> Output after:
>>>>
>>>> Error:
>>>> AMD IBS may only be available in system-wide/per-cpu mode.  Try using -a, or -C and workload affinity
>>>
>>> This patch seems to be causing a regression in the perf python test.
>>>
>>> Without this patch:
>>>
>>>   $ sudo ./perf test -vvv 19
>>>    19: 'import perf' in python                                         :
>>>   --- start ---
>>>   test child forked, pid 145391
>>>   python usage test: "echo "import sys ; sys.path.append('python'); import perf" | '/usr/bin/python2' "
>>>   test child finished with 0
>>>   ---- end ----
>>>   'import perf' in python: Ok
>>>
>>> With this patch:
>>>
>>>   $ sudo ./perf test -vvv 19
>>>    19: 'import perf' in python                                         :
>>>   --- start ---
>>>   test child forked, pid 144415
>>>   python usage test: "echo "import sys ; sys.path.append('python'); import perf" | '/usr/bin/python2' "
>>>   Traceback (most recent call last):
>>>     File "<stdin>", line 1, in <module>
>>>   ImportError: python/perf.so: undefined symbol: perf_env__cpuid
>>>   test child finished with -1
>>>   ---- end ----
>>>   'import perf' in python: FAILED!
>>>
>>
>> The fix I sent you is just to prevent a potential SEGFAULT in
>> is_amd_ibs(). I bet the test fails some perf_event_open() call and
>> ends up in the strerror function, and from there I don't see how the
>> patch could impact the test, given you'd segfault otherwise.
>>
>> I tried on my side, and the test fails with or without this patch. I
>> think this is an unrelated issue.
> 
> That's strange. IIUC, python/perf.so needs to know about util/evsel.c

I meant util/env.c not util/evsel.c

> which can be done by adding an entry in util/python-ext-sources.
> 
> In any case, do we really need this patch now? For both of the examples given
> in the description, I see no difference with or without the patch:
> 
> Without patch:
> 
>   $ sudo ./perf record -e ibs_op// true
>   Error:
>   Invalid event (ibs_op//) in per-thread mode, enable system wide with '-a'.
> 
>   $ ./perf record -e ibs_op// true
>   Error:
>   Invalid event (ibs_op//u) in per-thread mode, enable system wide with '-a'.
> 
> Same output with the patch as well.
> 
> Thanks,
> Ravi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 06/12] perf/x86/amd: add AMD branch sampling period adjustment
  2022-03-15 12:08         ` Peter Zijlstra
@ 2022-03-17 17:11           ` Stephane Eranian
  0 siblings, 0 replies; 29+ messages in thread
From: Stephane Eranian @ 2022-03-17 17:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, kim.phillips, acme, jolsa, songliubraving

On Tue, Mar 15, 2022 at 5:09 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 09, 2022 at 03:03:39PM -0800, Stephane Eranian wrote:
> > On Fri, Mar 4, 2022 at 7:45 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Feb 09, 2022 at 04:32:04PM +0100, Peter Zijlstra wrote:
> > > > On Tue, Feb 08, 2022 at 01:16:31PM -0800, Stephane Eranian wrote:
> > > > > Add code to adjust the sampling event period when used with the Branch
> > > > > Sampling feature (BRS). Given the depth of the BRS (16), the period is
> > > > > reduced by that depth such that in the best case scenario, BRS saturates at
> > > > > the desired sampling period. In practice, though, the processor may execute
> > > > > more branches. Given a desired period P and a depth D, the kernel programs
> > > > > the actual period at P - D. After P occurrences of the sampling event, the
> > > > > counter overflows. It then may take X branches (skid) before the NMI is
> > > > > caught and held by the hardware and BRS activates. Then, after D branches,
> > > > > BRS saturates and the NMI is delivered.  With no skid, the effective period
> > > > > would be (P - D) + D = P. In practice, however, it will likely be (P - D) +
> > > > > X + D. There is no way to eliminate X or predict X.
> > > > >
> > > > > Signed-off-by: Stephane Eranian <eranian@google.com>
> > > > > ---
> > > > >  arch/x86/events/core.c       |  7 +++++++
> > > > >  arch/x86/events/perf_event.h | 12 ++++++++++++
> > > > >  2 files changed, 19 insertions(+)
> > > > >
> > > > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > > > > index c2a890caeb0a..ed285f640efe 100644
> > > > > --- a/arch/x86/events/core.c
> > > > > +++ b/arch/x86/events/core.c
> > > > > @@ -1374,6 +1374,13 @@ int x86_perf_event_set_period(struct perf_event *event)
> > > > >         x86_pmu.set_topdown_event_period)
> > > > >             return x86_pmu.set_topdown_event_period(event);
> > > > >
> > > > > +   /*
> > > > > +    * decrease period by the depth of the BRS feature to get
> > > > > +    * the last N taken branches and approximate the desired period
> > > > > +    */
> > > > > +   if (has_branch_stack(event))
> > > > > +           period = amd_brs_adjust_period(period);
> > > > > +
> > > > >     /*
> > > > >      * If we are way outside a reasonable range then just skip forward:
> > > > >      */
> > > > > diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> > > > > index 3485a4cf0241..25b037b571e4 100644
> > > > > --- a/arch/x86/events/perf_event.h
> > > > > +++ b/arch/x86/events/perf_event.h
> > > > > @@ -1263,6 +1263,14 @@ static inline bool amd_brs_active(void)
> > > > >     return cpuc->brs_active;
> > > > >  }
> > > > >
> > > > > +static inline s64 amd_brs_adjust_period(s64 period)
> > > > > +{
> > > > > +   if (period > x86_pmu.lbr_nr)
> > > > > +           return period - x86_pmu.lbr_nr;
> > > > > +
> > > > > +   return period;
> > > > > +}
> > > >
> > > > This makes no sense to me without also enforcing that the event is in
> > > > fact that branch retired thing.
> > >
> > > So what are we going to do with all these patches? Note that I did pick
> > > them up for testing and I've fixed at least 2 build problems with them.
> > >
> > > But I still don't think they're actually completely sane. So there's the
> > > above issue, subtracting lbr_nr from a random event just makes no sense.
> >
> >
> > You are right. Initially, retired_branch_taken was the only event
> > possible, and in that case subtracting lbr_nr made sense. Since then,
> > I have relaxed the event restriction, which exposes this problem.
> > Given how BRS works, I am okay with restricting to retired_br_taken,
> > because no matter what, the hw is going to activate at P (period)
> > and wait for 16 taken branches before delivering the NMI. So if I
> > am sampling on cycles with P=1000000, then the NMI is delivered
> > at P + X + Z, where X = number of cycles elapsed for the 16 taken
> > branches (unpredictable) and Z = the interrupt skid for the NMI
> > (which is extremely big on AMD). With retired_branch_taken, that
> > formula becomes P + 16 + Z, where Z is the number of taken branches
> > during the skid. But given BRS saturates when full, you do lose
> > content because of the Z skid. My opinion is we keep the lbr_nr
> > subtraction and force the event to be retired_branch_taken only.
>
> OK, can you do me a delta patch and tell me which commit to merge it in?
>
You want a delta patch covering the restriction allowing only
retired_taken_branch as the sampling event. Is that right? I can also
post a v7 that has all the changes, including the fixes in the perf
tool strerror code. Whichever you prefer.

> > > But there's also the whole exclusion thing, IIRC you're making it
> > > exclusive against other LBR users, but AFAICT having one LBR user active
> > > will completely screw over any other sampling event due to introducing
> > > these massive skids.
> >
> >
> > The skid is not massive compared to the actual skid of regular interrupt-based
> > sampling. You are looking at the time it takes to execute 16 taken branches
> > vs. 2000+ cycles for the NMI skid.  And this would happen only if the other
> > events overflow during that 16 taken branch window.
>
> Wait, you're telling me that regs->ip is 2000 cycles/CPI further along
> than the instruction that caused the PMI on AMD? That seems beyond
> useless.
>
Yes, depending on the conditions, the skid can be large. Remember that AMD added
a loop to poll in amd_pmu_disable_all(). See commit 914123fa3904
("x86/perf/amd: Resolve race condition when disabling PMC").

> That's also not what I seem to remember from the last time I used perf
> on AMD (admittedly a while ago). Normally the reported IP is a few
> instructions beyond the eventing IP. Yielding the normal perf-annotate
> output that's shifted but mostly trivially readable.
>
It all depends on what event you are sampling on. If you are sampling on
cycles and you are stalled, then you cannot make forward progress and
therefore the skid appears small and the profile still points you in the right
direction.

> However, if you delay that NMI for however many instructions it takes to
> do 16 branches, the reported IP (regs->ip) will be completely unrelated
> to the eventing IP (the one that actually triggered PMI).
>
No, because once BRS is full and saturates, the CPU delivers the NMI,
still subject to the skid. The regs->ip will be as close as possible
to the last taken branch recorded, just like with regular (non-BRS)
sampling.

In the code we usually have, there are a lot of taken branches. In the case
of autoFDO, the regs->ip is not used, only the content of the branch buffer
is used.

> In that case the perf-annotate output becomes really hard to interpret.
> Esp. if you don't know which IPs were basically garbage.
>
That is true. But usually LBR data is not used for visualization by perf report.
It is post-processed for pure LBR data analysis, such as FDO.

> One possible work-around might be to discard the sample for any
> !retired_branch_taken overflow and reprogram those counters with a very
> small (1?) value to 'insta' take a new sample without interference. But
> that's yuck too.
>
Yes, that is ugly.

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-03-17 17:11 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-08 21:16 [PATCH v6 00/12] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
2022-02-08 21:16 ` [PATCH v6 01/12] perf/core: add perf_clear_branch_entry_bitfields() helper Stephane Eranian
2022-02-08 21:16 ` [PATCH v6 02/12] x86/cpufeatures: add AMD Fam19h Branch Sampling feature Stephane Eranian
2022-02-08 21:16 ` [PATCH v6 03/12] perf/x86/amd: add AMD Fam19h Branch Sampling support Stephane Eranian
2022-02-08 21:16 ` [PATCH v6 04/12] perf/x86/amd: add branch-brs helper event for Fam19h BRS Stephane Eranian
2022-02-08 21:16 ` [PATCH v6 05/12] perf/x86/amd: enable branch sampling priv level filtering Stephane Eranian
2022-02-08 21:16 ` [PATCH v6 06/12] perf/x86/amd: add AMD branch sampling period adjustment Stephane Eranian
2022-02-09 15:32   ` Peter Zijlstra
2022-03-04 15:45     ` Peter Zijlstra
2022-03-09 23:03       ` Stephane Eranian
2022-03-15 12:08         ` Peter Zijlstra
2022-03-17 17:11           ` Stephane Eranian
2022-02-08 21:16 ` [PATCH v6 07/12] perf/x86/amd: make Zen3 branch sampling opt-in Stephane Eranian
2022-02-08 21:16 ` [PATCH v6 08/12] ACPI: add perf low power callback Stephane Eranian
2022-02-08 21:16 ` [PATCH v6 09/12] perf/x86/amd: add idle hooks for branch sampling Stephane Eranian
2022-02-08 21:16 ` [PATCH v6 10/12] perf tools: Improve IBS error handling Stephane Eranian
2022-02-09 15:47   ` Arnaldo Carvalho de Melo
2022-03-15  6:49     ` Ravi Bangoria
2022-03-15  2:01   ` Stephane Eranian
2022-03-15  6:23     ` Ravi Bangoria
2022-03-15  7:12       ` Stephane Eranian
2022-03-15  7:45   ` Ravi Bangoria
2022-03-16  0:03     ` Stephane Eranian
2022-03-16 11:07       ` Ravi Bangoria
2022-03-16 11:16         ` Ravi Bangoria
2022-02-08 21:16 ` [PATCH v6 11/12] perf tools: Improve error handling of AMD Branch Sampling Stephane Eranian
2022-02-16 14:17   ` Arnaldo Carvalho de Melo
2022-02-08 21:16 ` [PATCH v6 12/12] perf report: add addr_from/addr_to sort dimensions Stephane Eranian
2022-02-16 14:21   ` Arnaldo Carvalho de Melo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).