LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v3 0/2] kstats: kernel metric collector
@ 2020-02-26 13:50 Luigi Rizzo
  2020-02-26 13:50 ` [PATCH v3 1/2] " Luigi Rizzo
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Luigi Rizzo @ 2020-02-26 13:50 UTC (permalink / raw)
  To: linux-kernel, mhiramat, akpm, gregkh, naveen.n.rao, ardb, rizzo,
	pabeni, giuseppe.lettieri, toke, hawk, mingo, acme, rostedt,
	peterz
  Cc: Luigi Rizzo


This patchset introduces a small library to collect per-cpu samples and
accumulate distributions to be exported through debugfs.

This v3 series addresses some initial comments (mostly style fixes in the
code) and revises commit logs.

Note that use case and performance are orthogonal to those of the
perf/tracing architecture, which is why this is proposed as a standalone
component.

I have used this to trace the execution times of small sections of code
(network stack functions, drivers and xdp processing, ...), whole functions,
or notification latencies (e.g. IPI dispatch, etc.).  The fine granularity
of the aggregation helps identifying outliers, multimodal distributions,
caching effects and general performance-related issues.

Samples are 64-bit values that are aggregated into per-cpu logarithmic buckets
with configurable density (typical sample collectors use one bucket for each
power of 2, but that is very coarse and corresponds to 1 significant bit per
value; in kstats one can specify the number of significant bits, up to 5,
which is useful for finer grain measurements).

There are two ways to trace a block of code. Manual annotation has the best
runtime performance and is done as follows:

	// create metric and entry in debugfs
	struct kstats *key = kstats_new("foo", frac_bits);

	...
	// instrument the code
	u64 t0 = ktime_get_ns();        // about 20ns
	<section of code to measure>
	t0 = ktime_get_ns() - t0;       // about 20ns
	kstats_record(key, t0);         // 5ns hot cache, 300ns cold

This method has an accuracy of about 20-30ns (inherited from the clock)
and an overhead with hot/cold cache as shown above.

Values can be read from debugfs in an easy-to-parse format:

	# cat /sys/kernel/debug/kstats/foo
	...
	slot 55  CPU  0    count      589 avg      480 p 0.027613
	slot 55  CPU  1    count       18 avg      480 p 0.002572
	slot 55  CPU  2    count       25 avg      480 p 0.003325
	...
	slot 55  CPUS 28   count      814 avg      480 p 0.002474
	...
	slot 97  CPU  13   count     1150 avg    20130 p 0.447442
	slot 97  CPUS 28   count   152585 avg    19809 p 0.651747
	...

Writing STOP, START or RESET to the file executes the corresponding action:

	echo RESET > /sys/kernel/debug/kstats/foo

The second instrumentation mechanism uses kretprobes or tracepoints and
lets tracing be enabled or removed at runtime from the command line for
any globally visible function:

	echo trace some_function bits 3 > /sys/kernel/debug/kstats/_control

Data are exported or controlled in the same way as above. Accuracy is
worse due to the presence of kretprobe trampolines: 90ns with hot cache,
500ns with cold cache. The overhead on the traced function is 250ns hot,
1500ns cold.

Hope you find this useful.


Luigi Rizzo (2):
  kstats: kernel metric collector
  kstats: kretprobe and tracepoint support

 include/linux/kstats.h |  62 ++++
 lib/Kconfig.debug      |   7 +
 lib/Makefile           |   1 +
 lib/kstats.c           | 654 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 724 insertions(+)
 create mode 100644 include/linux/kstats.h
 create mode 100644 lib/kstats.c

-- 
2.25.0.265.gbab2e86ba0-goog


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3 1/2] kstats: kernel metric collector
  2020-02-26 13:50 [PATCH v3 0/2] kstats: kernel metric collector Luigi Rizzo
@ 2020-02-26 13:50 ` Luigi Rizzo
  2020-02-26 14:48   ` Paolo Abeni
                     ` (4 more replies)
  2020-02-26 13:50 ` [PATCH v3 2/2] kstats: kretprobe and tracepoint support Luigi Rizzo
  2020-02-26 15:00 ` [PATCH v3 0/2] kstats: kernel metric collector Toke Høiland-Jørgensen
  2 siblings, 5 replies; 16+ messages in thread
From: Luigi Rizzo @ 2020-02-26 13:50 UTC (permalink / raw)
  To: linux-kernel, mhiramat, akpm, gregkh, naveen.n.rao, ardb, rizzo,
	pabeni, giuseppe.lettieri, toke, hawk, mingo, acme, rostedt,
	peterz
  Cc: Luigi Rizzo

kstats is a helper to accumulate in-kernel metrics (timestamps, sizes,
etc.) and show distributions through debugfs.
Set CONFIG_KSTATS=m or y to enable it.

Creating a metric takes one line of code (and one to destroy it):

  struct kstats *key = kstats_new("foo", 3 /* frac_bits */);
  ...
  kstats_delete(key);

The following line records a u64 sample:

  kstats_record(key, value);

kstats_record() is cheap (5ns hot cache, 250ns cold cache). Samples are
accumulated in a per-cpu array with 2^frac_bits slots for each power
of 2. Using frac_bits = 3 gives 8 slots per octave, about 27 per decade.

Each metric has an entry in /sys/kernel/debug/kstats which can be used
to read and control data collection:

- writing start/stop/reset starts and stops sample collection,
  or resets the counters, e.g.

  echo reset > /sys/kernel/debug/kstats/foo

- reading from the node produces a detailed output that can be
  processed with external tools for improved presentation:

  cat /sys/kernel/debug/kstats/foo
  ...
  slot 55  CPU  0    count      589 avg      480 p 0.027613
  slot 55  CPU  1    count       18 avg      480 p 0.002572
  slot 55  CPU  2    count       25 avg      480 p 0.003325
  ...
  slot 55  CPUS 16   count      814 avg      480 p 0.002474
  ...
  slot 97  CPU  8    count     1150 avg    20130 p 0.447442
  slot 97  CPU  12   count       26 avg    20294 p 0.275555
  slot 97  CPUS 16   count   152585 avg    19809 p 0.651747
  slot 98  CPU  0    count       38 avg    21360 p 0.954691
  slot 98  CPU  1    count      456 avg    21412 p 0.872619
  ...
  slot 144 CPUS 16   count       12 avg  1061896 p 0.999999
  slot 146 CPU  4    count        1 avg  1313664 p 1.000000
  slot 146 CPUS 16   count        1 avg  1313664 p 1.000000

Postprocessing examples:
look at statistics for a single CPU (note the double space in the pattern)
        cd /sys/kernel/debug/kstats/
        echo reset > foo; watch grep "'CPU  12 '" foo

Use gnuplot to plot distribution and cdf in a terminal from a remote host
  gnuplot
  set terminal dumb size 200 80
  set logscale x
  plot "<ssh root@myhost grep CPUS /sys/kernel/debug/kstats/foo" u 8:6 w l title "distribution"
  plot "<ssh root@myhost grep CPUS /sys/kernel/debug/kstats/foo" u 8:10 w l title "cdf"

Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 include/linux/kstats.h |  62 +++++++++
 lib/Kconfig.debug      |   7 +
 lib/Makefile           |   1 +
 lib/kstats.c           | 290 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 360 insertions(+)
 create mode 100644 include/linux/kstats.h
 create mode 100644 lib/kstats.c

diff --git a/include/linux/kstats.h b/include/linux/kstats.h
new file mode 100644
index 0000000000000..636cc007fa138
--- /dev/null
+++ b/include/linux/kstats.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_KSTATS_H
+#define _LINUX_KSTATS_H
+
+#include <linux/types.h>
+
+/*
+ * Helper to collect and report kernel metrics. Use as follows:
+ *
+ * - creates a new debugfs entry in /sys/kernel/debug/kstats/foo
+ *   to collect the metric, accumulating samples in 2^frac_bits slots
+ *   per power of 2
+ *
+ *	struct kstats *key = kstats_new("foo", frac_bits);
+ *
+ * - add instrumentation around code:
+ *
+ *	u64 dt = ktime_get_ns();	// about 20ns
+ *	<section of code to measure>
+ *	dt = ktime_get_ns() - dt;	// about 20ns
+ *	kstats_record(key, dt);		// 5ns hot cache, 300ns cold
+ *
+ * - read values from debugfs
+ *	cat /sys/kernel/debug/kstats/foo
+ *	...
+ *	slot 55  CPU  0    count      589 avg      480 p 0.027613
+ *	slot 55  CPU  1    count       18 avg      480 p 0.002572
+ *	slot 55  CPU  2    count       25 avg      480 p 0.003325
+ *	...
+ *	slot 55  CPUS 28   count      814 avg      480 p 0.002474
+ *	...
+ *	slot 97  CPU  13   count     1150 avg    20130 p 0.447442
+ *	slot 97  CPUS 28   count   152585 avg    19809 p 0.651747
+ *	...
+ *
+ * - writing STOP, START or RESET to the file executes the corresponding action
+ *
+ *	echo RESET > /sys/kernel/debug/kstats/foo
+ */
+
+struct kstats;
+
+#if defined(CONFIG_KSTATS) || defined(CONFIG_KSTATS_MODULE)
+/* Add an entry to debugfs. */
+struct kstats *kstats_new(const char *name, u8 frac_bits);
+
+/* Record a sample */
+void kstats_record(struct kstats *key, u64 value);
+
+/* Remove an entry and free memory */
+void kstats_delete(struct kstats *key);
+#else
+static inline struct kstats *kstats_new(const char *name, u8 frac_bits)
+{
+	return NULL;
+}
+
+static inline void kstats_record(struct kstats *key, u64 value) {}
+static inline void kstats_delete(struct kstats *key) {}
+#endif
+
+#endif /* _LINUX_KSTATS_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 69def4a9df009..7d7fae57bfde5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1452,6 +1452,13 @@ config LATENCYTOP
 	  Enable this option if you want to use the LatencyTOP tool
 	  to find out which userspace is blocking on what kernel operations.
 
+config KSTATS
+	tristate "collect percpu metrics and export distributions through debugfs"
+	depends on DEBUG_FS
+	help
+	  Helper library to collect percpu kernel metrics, exporting
+	  distributions through debugfs. See lib/kstats.c
+
 source "kernel/trace/Kconfig"
 
 config PROVIDE_OHCI1394_DMA_INIT
diff --git a/lib/Makefile b/lib/Makefile
index 611872c069269..3f8578007db4e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -236,6 +236,7 @@ obj-$(CONFIG_RBTREE_TEST) += rbtree_test.o
 obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 
 obj-$(CONFIG_PERCPU_TEST) += percpu_test.o
+obj-$(CONFIG_KSTATS) += kstats.o
 
 obj-$(CONFIG_ASN1) += asn1_decoder.o
 
diff --git a/lib/kstats.c b/lib/kstats.c
new file mode 100644
index 0000000000000..885ab448708de
--- /dev/null
+++ b/lib/kstats.c
@@ -0,0 +1,290 @@
+/*
+ * kstats, collect samples and export distributions through debugfs
+ *
+ * CREATE OBJECT:
+ *	struct kstats *key = kstats_new("some_name", 3);
+ *
+ * ADD A SAMPLE:
+ *	u64 dt = ktime_get_ns();	// about 20ns
+ *	<code to instrument>
+ *	dt = ktime_get_ns() - dt;	// about 20ns
+ *	kstats_record(key, dt);		// 5ns hot cache, 250ns cold
+ *
+ * SHOW DATA:
+ *	cat /sys/kernel/debug/kstats/some_name
+ *
+ *	...
+ *	slot 12  CPU  0    count      764 avg       12 p 0.011339
+ *	slot 12  CPU  1    count      849 avg       12 p 0.011496
+ *	slot 12  CPU  2    count      712 avg       12 p 0.009705
+ *	...
+ *	slot 12  CPU  243  count        1 avg       12 p 0.000097
+ *	slot 12  CPUS 256  count    19977 avg       12 p 0.006153
+ *	...
+ */
+
+#include <linux/kstats.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/debugfs.h>
+
+/*
+ * Values are 64 bit unsigned and are accumulated per cpu, in one bucket for
+ * each power of 2. Each bucket is further subdivided in 2^frac_bits slots.
+ * The range for each slot is 2^-frac_bits of the base value for the bucket.
+ */
+#define BUCKETS	64	/* Total powers of 2 */
+
+/* For large values, sum is scaled to reduce the chance of overflow */
+#define SUM_SCALE 20
+
+/* Internal names start with ks_, external ones with kstats_ */
+
+struct ks_slot {
+	u64 samples;
+	u64 sum;
+};
+
+struct kstats {
+	u16 n_slots;		/* typically BUCKETS * 2^frac_bits + 2 */
+	u8 frac_bits;
+	u8 frac_mask;		/* 2^frac_bits - 1 */
+	bool active;		/* recording status */
+	struct ks_slot __percpu *slots;
+	struct dentry *entry;	/* debugfs entry */
+};
+
+static void ks_print(struct seq_file *p, int slot, int cpu, u64 sum,
+		     u64 tot, unsigned long samples, unsigned long avg)
+{
+	unsigned long frac = (tot == 0) ? 0 : ((sum % tot) * 1000000) / tot;
+
+	seq_printf(p,
+		   "slot %-3d CPU%c %-4d count %8lu avg %8lu p %c.%06lu\n",
+		   slot, cpu == nr_cpu_ids ? 'S' : ' ', cpu,
+		   samples, avg, sum == tot ? '1' : '0', frac);
+}
+
+/* Read handler */
+static int ks_show_entry(struct seq_file *p, void *v)
+{
+	u32 slot, cpu;
+	struct ks_slot *slots, *cur;
+	struct kstats *ks = p->private;
+	u64 *partials, *totals, grand_total = 0;
+	const size_t rowsize = ks ? ks->n_slots * sizeof(struct ks_slot) : 0;
+
+	if (!ks)
+		return -ENOENT;
+	if (!rowsize)
+		return 0;
+	/*
+	 * Counters are updated while we read them, so make a copy first.
+	 * kvzalloc'ed memory contains three areas:
+	 *
+	 *   slots:	[ nr_cpu_ids ][ ks->n_slots ](struct ks_slot)
+	 *   partials:	[ nr_cpu_ids ](u64)
+	 *   totals:	[ nr_cpu_ids ](u64)
+	 */
+	slots = kvzalloc(nr_cpu_ids * (rowsize + 2 * sizeof(u64)), GFP_KERNEL);
+	if (!slots)
+		return -ENOMEM;
+	partials = (u64 *)(slots + ks->n_slots * nr_cpu_ids);
+	totals = partials + nr_cpu_ids;
+	/* Copy data and compute counts totals (per-cpu and grand_total).
+	 * These values are needed to compute percentiles.
+	 */
+	for_each_possible_cpu(cpu) {
+		cur = slots + ks->n_slots * cpu;
+		memcpy(cur, per_cpu_ptr(ks->slots, cpu), rowsize);
+		for (slot = 0; slot < ks->n_slots; slot++)
+			totals[cpu] += cur[slot].samples;
+		grand_total += totals[cpu];
+	}
+
+	/* Second pass, produce individual lines */
+	for (slot = 0; slot < ks->n_slots; slot++) {
+		u64 n, samples = 0, sum = 0, samples_cumulative = 0;
+		u32 bucket = slot >> ks->frac_bits;
+		u32 sum_shift = bucket < SUM_SCALE ? 0 : bucket - SUM_SCALE;
+
+		for_each_possible_cpu(cpu) {
+			cur = slots + ks->n_slots * cpu;
+			sum += cur[slot].sum;
+			n = cur[slot].samples;
+			samples += n;
+			partials[cpu] += n;
+			samples_cumulative += partials[cpu];
+			if (n == 0)
+				continue;
+			ks_print(p, slot, cpu, partials[cpu], totals[cpu], n,
+				 (cur[slot].sum / n) << sum_shift);
+		}
+		if (samples == 0)
+			continue;
+		ks_print(p, slot, nr_cpu_ids, samples_cumulative, grand_total,
+			 samples, (sum / samples) << sum_shift);
+	}
+	kvfree(slots);
+	return 0;
+}
+
+static ssize_t ks_write(struct file *fp, const char __user *user_buffer,
+			size_t count, loff_t *position)
+{
+	struct inode *ino = fp->f_inode;
+	struct kstats *ks = ino ? ino->i_private : NULL;
+	char buf[256] = {};
+	ssize_t ret;
+	u32 cpu;
+
+	if (count >= sizeof(buf) - 1)
+		return -EINVAL;
+	ret = simple_write_to_buffer(buf, sizeof(buf),
+				     position, user_buffer, count);
+	if (ret < 0)
+		return ret;
+	/* Trim final newline if any */
+	if (count > 0 && buf[count - 1] == '\n')
+		buf[count - 1] = '\0';
+
+	if (ks == NULL)
+		return -EINVAL;
+
+	if (!strcasecmp(buf, "START")) {
+		ks->active = 1;
+	} else if (!strcasecmp(buf, "STOP")) {
+		ks->active = 0;
+	} else if (!strcasecmp(buf, "RESET")) {
+		for_each_possible_cpu(cpu) {
+			memset(per_cpu_ptr(ks->slots, cpu), 0,
+			       ks->n_slots * sizeof(struct ks_slot));
+		}
+	} else {
+		ret = -EINVAL;
+	}
+	/* TODO: add another command to turn off and deallocate memory. */
+	return ret;
+}
+
+static int ks_open(struct inode *inode, struct file *f)
+{
+	return single_open(f, ks_show_entry, inode->i_private);
+}
+
+static const struct file_operations ks_file_ops = {
+	.owner = THIS_MODULE,
+	.open = ks_open,
+	.release = single_release,
+	.read = seq_read,
+	.write = ks_write,
+	.llseek = seq_lseek,
+};
+
+static struct dentry *ks_root;	/* kstats root in debugfs */
+
+static int __init ks_init(void)
+{
+	ks_root = debugfs_create_dir("kstats", NULL);
+	return 0;
+}
+
+static void __exit ks_exit(void)
+{
+	debugfs_remove_recursive(ks_root);
+}
+
+/* Run as soon as possible, but after debugfs, which is in core_initcall */
+postcore_initcall(ks_init);
+module_exit(ks_exit);
+MODULE_LICENSE("GPL");
+
+/* User API: kstats_new(), kstats_delete(), kstats_record() */
+
+struct kstats *kstats_new(const char *name, u8 frac_bits)
+{
+	struct kstats *ks = NULL;
+	const char *errmsg = "";
+
+	if (IS_ERR_OR_NULL(ks_root)) {
+		errmsg = "ks_root not set yet";
+		goto error;
+	}
+
+	if (frac_bits > 5) {
+		pr_info("fractional bits %d too large, using 3\n", frac_bits);
+		frac_bits = 3;
+	}
+	ks = kzalloc(sizeof(*ks), GFP_KERNEL);
+	if (!ks)
+		return NULL;
+	ks->active = 1;
+	ks->frac_bits = frac_bits;
+	ks->frac_mask = (1 << frac_bits) - 1;
+	ks->n_slots = ((BUCKETS - frac_bits + 1) << frac_bits) + 1;
+
+	/* Add one extra bucket for user timestamps */
+	ks->slots = __alloc_percpu((1 + ks->n_slots) * sizeof(struct ks_slot),
+				   sizeof(u64));
+	if (!ks->slots) {
+		errmsg = "failed to allocate pcpu";
+		goto error;
+	}
+
+	/* 'ks' is saved in the inode (entry->d_inode->i_private). */
+	ks->entry = debugfs_create_file(name, 0444, ks_root, ks, &ks_file_ops);
+	__module_get(THIS_MODULE);
+	return ks;
+
+error:
+	pr_info("kstats: '%s' error %s\n", name, errmsg);
+	kstats_delete(ks);
+	return NULL;
+}
+EXPORT_SYMBOL(kstats_new);
+
+void kstats_delete(struct kstats *ks)
+{
+	if (!ks)
+		return;
+	debugfs_remove(ks->entry);
+	if (ks->slots)
+		free_percpu(ks->slots);
+	*ks = (struct kstats){};
+	kfree(ks);
+	module_put(THIS_MODULE);
+}
+EXPORT_SYMBOL(kstats_delete);
+
+void kstats_record(struct kstats *ks, u64 val)
+{
+	u32 bucket, slot;
+
+	if (!ks || !ks->active)
+		return;
+	/* The leftmost 1 selects the bucket, subsequent frac_bits select
+	 * the slot within the bucket. fls returns 0 when the argument is 0.
+	 */
+	bucket = fls64(val >> ks->frac_bits);
+	slot = bucket == 0 ? val :
+		((bucket << ks->frac_bits) |
+		 ((val >> (bucket - 1)) & ks->frac_mask));
+
+	/* Use the last slot on overflow if BUCKETS < 64 */
+	if (slot > ks->n_slots - 2)
+		slot = ks->n_slots - 1;
+
+	/* preempt_disable makes sure samples and sum modify the same slot.
+	 * this_cpu_add() uses a non-interruptible add to protect against
+	 * hardware interrupts which may call kstats_record.
+	 */
+	preempt_disable();
+	this_cpu_add(ks->slots[slot].samples, 1);
+	this_cpu_add(ks->slots[slot].sum,
+		     bucket < SUM_SCALE ? val : (val >> (bucket - SUM_SCALE)));
+	preempt_enable();
+}
+EXPORT_SYMBOL(kstats_record);
-- 
2.25.0.265.gbab2e86ba0-goog


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3 2/2] kstats: kretprobe and tracepoint support
  2020-02-26 13:50 [PATCH v3 0/2] kstats: kernel metric collector Luigi Rizzo
  2020-02-26 13:50 ` [PATCH v3 1/2] " Luigi Rizzo
@ 2020-02-26 13:50 ` Luigi Rizzo
  2020-02-26 15:00 ` [PATCH v3 0/2] kstats: kernel metric collector Toke Høiland-Jørgensen
  2 siblings, 0 replies; 16+ messages in thread
From: Luigi Rizzo @ 2020-02-26 13:50 UTC (permalink / raw)
  To: linux-kernel, mhiramat, akpm, gregkh, naveen.n.rao, ardb, rizzo,
	pabeni, giuseppe.lettieri, toke, hawk, mingo, acme, rostedt,
	peterz
  Cc: Luigi Rizzo

The following code supports runtime monitoring of the execution time of
a block of code (a function, a section between two function entry points
or tracepoints) _and_ the interval between calls to the block.

Use as follows:

  echo "trace S bits B start X end Y" > /sys/kernel/debug/kstats/_control

	creates node /sys/kernel/debug/kstats/S, tracing the time between X
	and Y with 2^B slots per power of 2. Arguments after S are optional:
	X defaults to S, bits defaults to 3, end defaults to empty. X and Y
	can be function names or __tracepoint_T where T is a tracepoint name.

	It also creates a second node /sys/kernel/debug/kstats/GAP-S that
	traces the time between the end and start of subsequent calls to
	the function on the same CPU.

  echo "remove S" > /sys/kernel/debug/kstats/_control

	removes the two /sys/kernel/debug/kstats nodes, S and GAP-S

The code uses 3 different methods to track start and end of the block:

1. if end is not specified, uses kretprobe to collect timestamps around
   calls to function X.

2. if end != start, uses kprobes to collect timestamps in the two places.
   Only meaningful when the two functions uniquely identify a code path.

3. if names have the form __tracepoint_XXX, collect timestamps at the
   two tracepoints.

Metric collection through k[ret]probes or tracepoints is very convenient
but much more intrusive and less accurate than manual annotation: those
hooks involve additional out-of-line code and memory accesses, which are
particularly expensive when not in cache.
On a 2020 server-class x86 CPU, tracing a function with kretprobe adds
~250ns with hot cache, 1500+ns with cold cache; an empty function reports
a minimum time of ~90ns with hot cache, 500ns with cold cache.

(Hot- and cold-cache behavior can easily be tested by calling the traced
function at high rates (<20us between calls) and low rates (>1ms).)

Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 lib/kstats.c | 368 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 366 insertions(+), 2 deletions(-)

diff --git a/lib/kstats.c b/lib/kstats.c
index 885ab448708de..4507f026b2fbc 100644
--- a/lib/kstats.c
+++ b/lib/kstats.c
@@ -21,6 +21,13 @@
  *	slot 12  CPU  243  count        1 avg       12 p 0.000097
  *	slot 12  CPUS 256  count    19977 avg       12 p 0.006153
  *	...
+ *
+ * Besides manual code annotations, the following two commands add and remove
+ * tracing of the execution time of a function or a section of code, see more
+ * details later in this file:
+ *
+ *	echo "trace some_function" > /sys/kernel/debug/kstats/_control
+ *	echo "remove some_function" > /sys/kernel/debug/kstats/_control
  */
 
 #include <linux/kstats.h>
@@ -68,6 +75,11 @@ static void ks_print(struct seq_file *p, int slot, int cpu, u64 sum,
 		   samples, avg, sum == tot ? '1' : '0', frac);
 }
 
+/* Helpers for user-created nodes via _control */
+static int ks_list_nodes(struct seq_file *p);
+static int ks_control_write(char *buf, int ret);
+static bool ks_delete_nodes(const char *name);
+
 /* Read handler */
 static int ks_show_entry(struct seq_file *p, void *v)
 {
@@ -78,7 +90,7 @@ static int ks_show_entry(struct seq_file *p, void *v)
 	const size_t rowsize = ks ? ks->n_slots * sizeof(struct ks_slot) : 0;
 
 	if (!ks)
-		return -ENOENT;
+		return ks_list_nodes(p);
 	if (!rowsize)
 		return 0;
 	/*
@@ -152,7 +164,7 @@ static ssize_t ks_write(struct file *fp, const char __user *user_buffer,
 		buf[count - 1] = '\0';
 
 	if (ks == NULL)
-		return -EINVAL;
+		return ks_control_write(buf, ret);
 
 	if (!strcasecmp(buf, "START")) {
 		ks->active = 1;
@@ -189,11 +201,13 @@ static struct dentry *ks_root;	/* kstats root in debugfs */
 static int __init ks_init(void)
 {
 	ks_root = debugfs_create_dir("kstats", NULL);
+	debugfs_create_file("_control", 0644, ks_root, NULL, &ks_file_ops);
 	return 0;
 }
 
 static void __exit ks_exit(void)
 {
+	ks_delete_nodes(NULL);
 	debugfs_remove_recursive(ks_root);
 }
 
@@ -288,3 +302,353 @@ void kstats_record(struct kstats *ks, u64 val)
 	preempt_enable();
 }
 EXPORT_SYMBOL(kstats_record);
+
+/*
+ * The following code supports runtime monitoring of the execution time of
+ * a block of code (a function, a section between two function entry points
+ * or tracepoints) with the following command:
+ *
+ * echo "trace S bits B start X end Y" > /sys/kernel/debug/kstats/_control
+ *
+ *    creates node /sys/kernel/debug/kstats/S, tracing the time between X
+ *    and Y with 2^B slots per power of 2. Arguments after S are optional:
+ *    X defaults to S, bits defaults to 3, end defaults to empty. X and Y
+ *    can be function names or __tracepoint_T where T is a tracepoint name.
+ *
+ *    It also creates a second node /sys/kernel/debug/kstats/GAP-S that traces
+ *    the time between end and start of subsequent calls to the function on
+ *    the same CPU.
+ *
+ * echo "remove S" > /sys/kernel/debug/kstats/_control
+ *
+ *    removes the two /sys/kernel/debug/kstats nodes, S and GAP-S
+ *
+ * The code uses 3 different methods to track start and end of the block:
+ *
+ * 1. if end is not specified, uses kretprobe to collect timestamps around
+ *    calls to function X.
+ *
+ * 2. if end != start, uses kprobes to collect timestamps in the two places.
+ *    Only meaningful when the two functions uniquely identify a code path.
+ *
+ * 3. if names have the form __tracepoint_XXX, collect timestamps at the
+ *    two tracepoints.
+ *
+ * Metric collection through k[ret]probes or tracepoints is very convenient
+ * but much more intrusive and less accurate than manual annotation: those
+ * hooks involve additional out-of-line code and memory accesses, which are
+ * particularly expensive when not in cache.
+ * On a 2020 server-class x86 CPU, tracing a function with kretprobe adds
+ * ~250ns with hot cache, 1500+ns with cold cache; an empty function reports
+ * a minimum time of ~90ns with hot cache, 500ns with cold cache.
+ */
+
+#include <linux/kprobes.h>
+#include <linux/tracepoint.h>
+
+/* Manually added entries are in a list protected by ks_mutex */
+static LIST_HEAD(ks_user_nodes);
+static DEFINE_MUTEX(ks_mutex);
+
+/* Manually added nodes */
+enum node_type { KSN_NONE = 0, KSN_KPROBE, KSN_RETPROBE, KSN_TRACEPOINT };
+struct ks_node {
+	struct kstats *ks;	/* record time for a call */
+	struct kstats *ks_gap;	/* record gap between calls */
+	struct list_head link;	/* Set for nodes added to the main list */
+	enum node_type type;
+	/* These could go in a union */
+	struct kprobe kp1;
+	struct kprobe kp2;
+	struct kretprobe kret;
+	struct tracepoint *tp1;
+	struct tracepoint *tp2;
+	char name[0];
+};
+
+/* Address of the temporary storage for starting timestamp */
+static u64 *ts_addr(struct kstats *ks)
+{
+	return &((this_cpu_ptr(ks->slots) + ks->n_slots)->sum);
+}
+
+/* Store value in slot if not set already */
+static void ks_ts_start(struct kstats *ks, u64 value)
+{
+	u64 *addr = ts_addr(ks);
+
+	if (!*addr)
+		*addr = value;
+}
+
+/* Record value if previous was non zero */
+static void ks_ts_record(struct kstats *ks, u64 value)
+{
+	u64 *addr = ts_addr(ks);
+
+	if (*addr) {
+		kstats_record(ks, value - *addr);
+		*addr = 0;
+	}
+}
+
+/*
+ * Method 3, tracepoints. The first argument of the tracepoint callback
+ * comes from tracepoint_probe_register, others as defined in the proto.
+ */
+static int ks_tp_start(struct ks_node *cur)
+{
+	u64 now;
+
+	preempt_disable();
+	now = ktime_get_ns();
+	ks_ts_start(cur->ks, now);
+	if (cur->ks_gap)
+		ks_ts_record(cur->ks_gap, now);
+	preempt_enable();
+	return 0;
+}
+
+static int ks_tp_end(struct ks_node *cur)
+{
+	u64 now;
+
+	preempt_disable();
+	now = ktime_get_ns();
+	ks_ts_record(cur->ks, now);
+	if (cur->ks_gap)
+		ks_ts_start(cur->ks_gap, now);
+	preempt_enable();
+	return 0;
+}
+
+/* Method 1: use kretprobe */
+static int ks_kretp_start(struct kretprobe_instance *r, struct pt_regs *regs)
+{
+	return ks_tp_start(container_of(r->rp, struct ks_node, kret));
+}
+
+static int ks_kretp_end(struct kretprobe_instance *r, struct pt_regs *regs)
+{
+	return ks_tp_end(container_of(r->rp, struct ks_node, kret));
+}
+
+/* Method 2, kprobes */
+static int ks_kprobe_start(struct kprobe *f, struct pt_regs *regs)
+{
+	return ks_tp_start(container_of(f, struct ks_node, kp1));
+}
+
+static int ks_kprobe_end(struct kprobe *f, struct pt_regs *regs)
+{
+	return ks_tp_end(container_of(f, struct ks_node, kp2));
+}
+
+/* Destroy a user-defined node */
+static void ks_node_delete(struct ks_node *cur)
+{
+	if (!cur)
+		return;
+#ifdef CONFIG_TRACEPOINTS
+	if (cur->tp2)
+		tracepoint_probe_unregister(cur->tp2, ks_tp_end, cur);
+	if (cur->tp1)
+		tracepoint_probe_unregister(cur->tp1, ks_tp_start, cur);
+	tracepoint_synchronize_unregister();
+#endif
+	unregister_kretprobe(&cur->kret);
+	unregister_kprobe(&cur->kp1);
+	unregister_kprobe(&cur->kp2);
+	kstats_delete(cur->ks);
+	kstats_delete(cur->ks_gap);
+	kfree(cur);
+}
+
+/* Some names cannot be attached to */
+static bool blacklisted(const char *name)
+{
+	static const char * const blacklist[] = {
+		"kstats_record",
+		NULL
+	};
+	int i;
+
+	for (i = 0; name && blacklist[i]; i++) {
+		if (!strcmp(name, blacklist[i])) {
+			pr_info("%s is blacklisted\n", name);
+			return true;
+		}
+	}
+	return false;
+}
+
+static const char gap[] = "GAP-";
+static char *ksn_name(struct ks_node *cur)
+{
+	return cur->name + sizeof(gap) - 1;
+}
+
+/* Create a new user-defined node */
+static struct ks_node *ks_node_new(int n, char *argv[])
+{
+	static const char *tp_prefix = "__tracepoint_";
+	char *name = argv[1], *start = name, *end = NULL;
+	struct ks_node *cur;
+	u64 bits = 3;
+	int i, ret;
+
+	/* 'arg value' pairs may override defaults */
+	for (i = 2; i < n - 1; i += 2) {
+		if (!strcasecmp(argv[i], "bits")) {
+			if (kstrtou64(argv[i + 1], 0, &bits) || bits > 4) {
+				pr_info("invalid bits %d\n", (int)bits);
+				break;
+			}
+		} else if (!strcasecmp(argv[i], "start")) {
+			start = argv[i + 1];
+		} else if (!strcasecmp(argv[i], "end")) {
+			end = argv[i + 1];
+		} else {
+			break;
+		}
+	}
+	if (i != n)
+		return ERR_PTR(-EINVAL);
+
+	if (blacklisted(start) || blacklisted(end))
+		return ERR_PTR(-EINVAL);
+
+	cur = kzalloc(sizeof(*cur) + strlen(name) + sizeof(gap), GFP_KERNEL);
+	if (!cur) {
+		pr_info("%s: no memory to add %s\n", __func__, name);
+		return ERR_PTR(-ENOMEM);
+	}
+	strcpy(ksn_name(cur), name);
+
+	cur->ks = kstats_new(name, bits);
+	if (!cur->ks)
+		goto fail;
+
+	if (!end || !*end) {
+		/* try to create an entry with the gap between calls */
+		memcpy(cur->name, gap, sizeof(gap) - 1);
+		cur->ks_gap = kstats_new(cur->name, bits);
+
+		cur->kret.entry_handler = ks_kretp_start;
+		cur->kret.handler = ks_kretp_end;
+		cur->kret.data_size = 0;
+		cur->kret.kp.symbol_name = start;
+		ret = register_kretprobe(&cur->kret);
+		if (ret)
+			goto fail;
+	} else if (strncmp(start, tp_prefix, strlen(tp_prefix)) != 0) {
+		pr_info("XXX use kprobe on '%s', '%s'\n", start, end);
+		cur->kp2.symbol_name = end;
+		cur->kp2.pre_handler = ks_kprobe_end;
+		if (register_kprobe(&cur->kp2))
+			goto fail;
+		cur->kp1.symbol_name = start;
+		cur->kp1.pre_handler = ks_kprobe_start;
+		if (register_kprobe(&cur->kp1))
+			goto fail;
+	} else {
+		cur->tp1 = (void *)kallsyms_lookup_name(start);
+		cur->tp2 = (void *)kallsyms_lookup_name(end);
+		if (!cur->tp1 || !cur->tp2)
+			goto fail;
+#ifndef CONFIG_TRACEPOINTS
+		goto fail;
+#else
+		ret = tracepoint_probe_register(cur->tp1, ks_tp_start, cur);
+		if (ret)
+			goto fail;
+		ret = tracepoint_probe_register(cur->tp2, ks_tp_end, cur);
+		if (ret)
+			goto fail;
+#endif
+	}
+	return cur;
+
+fail:
+	ks_node_delete(cur);
+	return ERR_PTR(-EINVAL);
+
+}
+
+static int ks_list_nodes(struct seq_file *p)
+{
+	struct ks_node *cur;
+	const char *sep = "";
+
+	mutex_lock(&ks_mutex);
+	list_for_each_entry(cur, &ks_user_nodes, link) {
+		seq_printf(p, "%s%s", sep, ksn_name(cur));
+		sep = " ";
+	}
+	mutex_unlock(&ks_mutex);
+	seq_printf(p, "\n");
+	return 0;
+}
+
+/* Split a string into words, returns number of words */
+static int ks_split_command(char *s, char *words[], int count)
+{
+	int i = 0, n;
+
+	for (n = 0; n < count; n++) {
+		/* Skip and clear leading whitespace */
+		while (s[i] && strchr(" \t\r\n", s[i]))
+			s[i++] = '\0';
+		words[n] = s + i; /* Tentative offset */
+		/* Find end of word */
+		while (s[i] && s[i] > ' ' && s[i] < 127)
+			i++;
+		if (s + i == words[n])
+			break;
+	}
+	return n;
+}
+
+/* Delete one/all nodes (name == NULL). Returns true if some are deleted */
+static bool ks_delete_nodes(const char *name)
+{
+	struct ks_node *tmp, *cur;
+	bool found = false;
+
+	mutex_lock(&ks_mutex);
+	list_for_each_entry_safe(cur, tmp, &ks_user_nodes, link) {
+		if (name != NULL && strcmp(ksn_name(cur), name))
+			continue; /* no match */
+		found = true;
+		list_del(&cur->link);
+		ks_node_delete(cur);
+	}
+	mutex_unlock(&ks_mutex);
+	return found;
+}
+
+static int ks_control_write(char *buf, int ret)
+{
+	char *args[10];	/* we don't need more than 8 */
+	struct ks_node *cur;
+	int n;
+
+	n = ks_split_command(buf, args, ARRAY_SIZE(args));
+	if (n < 2 || n == ARRAY_SIZE(args))
+		return -EINVAL;
+	if (!strcasecmp(args[0], "remove")) {
+		if (n != 2)
+			return -EINVAL;
+		if (!ks_delete_nodes(args[1]))
+			return -ENOENT;
+	} else if (!strcasecmp(args[0], "trace")) {
+		cur = ks_node_new(n, args);
+		if (IS_ERR_OR_NULL(cur))
+			return PTR_ERR(cur);
+		mutex_lock(&ks_mutex);
+		list_add(&cur->link, &ks_user_nodes);
+		mutex_unlock(&ks_mutex);
+	} else {
+		ret = -EINVAL;
+	}
+	return ret;
+}
-- 
2.25.0.265.gbab2e86ba0-goog


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] kstats: kernel metric collector
  2020-02-26 13:50 ` [PATCH v3 1/2] " Luigi Rizzo
@ 2020-02-26 14:48   ` Paolo Abeni
  2020-03-10 13:58   ` kbuild test robot
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Paolo Abeni @ 2020-02-26 14:48 UTC (permalink / raw)
  To: Luigi Rizzo, linux-kernel, mhiramat, akpm, gregkh, naveen.n.rao,
	ardb, rizzo, giuseppe.lettieri, toke, hawk, mingo, acme, rostedt,
	peterz

Hi,

I'm sorry for the lack of feedback on the earlier iterations; I've simply
been too slow.

On Wed, 2020-02-26 at 05:50 -0800, Luigi Rizzo wrote:
[...]
> +static void ks_print(struct seq_file *p, int slot, int cpu, u64 sum,
> +		     u64 tot, unsigned long samples, unsigned long avg)
> +{
> +	unsigned long frac = (tot == 0) ? 0 : ((sum % tot) * 1000000) / tot;

I think/fear kbuildbot will still trigger some build issues on 32bit
arches on the above line.

I think div64_u64_rem() should be used in place of '%' and
div64_u64() in place of '/' when operating on 64-bit integers.

There are a few more occurrences below.

[...]
> +	/* preempt_disable makes sure samples and sum modify the same slot.
> +	 * this_cpu_add() uses a non-interruptible add to protect against
> +	 * hardware interrupts which may call kstats_record.
> +	 */
> +	preempt_disable();
> +	this_cpu_add(ks->slots[slot].samples, 1);

I think 'this_cpu_inc()' could be used here.

> +	this_cpu_add(ks->slots[slot].sum,
> +		     bucket < SUM_SCALE ? val : (val >> (bucket - SUM_SCALE)));
> +	preempt_enable();
> +}
> +EXPORT_SYMBOL(kstats_record);

Thanks for sharing, this infra will be likely quite useful to me :)

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/2] kstats: kernel metric collector
  2020-02-26 13:50 [PATCH v3 0/2] kstats: kernel metric collector Luigi Rizzo
  2020-02-26 13:50 ` [PATCH v3 1/2] " Luigi Rizzo
  2020-02-26 13:50 ` [PATCH v3 2/2] kstats: kretprobe and tracepoint support Luigi Rizzo
@ 2020-02-26 15:00 ` Toke Høiland-Jørgensen
  2020-02-26 16:31   ` Alexei Starovoitov
  2020-02-26 17:26   ` Luigi Rizzo
  2 siblings, 2 replies; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-26 15:00 UTC (permalink / raw)
  To: Luigi Rizzo, linux-kernel, mhiramat, akpm, gregkh, naveen.n.rao,
	ardb, rizzo, pabeni, giuseppe.lettieri, hawk, mingo, acme,
	rostedt, peterz
  Cc: Luigi Rizzo

Luigi Rizzo <lrizzo@google.com> writes:

> This patchset introduces a small library to collect per-cpu samples and
> accumulate distributions to be exported through debugfs.
>
> This v3 series addresses some initial comments (mostly style fixes in the
> code) and revises commit logs.

Could you please add a proper changelog spanning all versions of the
patch as you iterate?

As for the idea itself; picking up this argument you made on v1:

> The tracepoint/kprobe/kretprobe solution is much more expensive --
> from my measurements, the hooks that invoke the various handlers take
> ~250ns with hot cache, 1500+ns with cold cache, and tracing an empty
> function this way reports 90ns with hot cache, 500ns with cold cache.

I think it would be good if you could include an equivalent BPF-based
implementation of your instrumentation example so people can (a) see the
difference for themselves and get a better idea of how the approaches
differ in a concrete case and (b) quantify the difference in performance
between the two implementations.

-Toke


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/2] kstats: kernel metric collector
  2020-02-26 15:00 ` [PATCH v3 0/2] kstats: kernel metric collector Toke Høiland-Jørgensen
@ 2020-02-26 16:31   ` Alexei Starovoitov
  2020-02-26 17:26   ` Luigi Rizzo
  1 sibling, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2020-02-26 16:31 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Luigi Rizzo, LKML, Masami Hiramatsu, Andrew Morton,
	Greg Kroah-Hartman, Naveen N . Rao, ardb, rizzo, Paolo Abeni,
	giuseppe.lettieri, Jesper Dangaard Brouer, Ingo Molnar,
	Arnaldo Carvalho de Melo, Steven Rostedt, Peter Zijlstra

On Wed, Feb 26, 2020 at 7:03 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> > The tracepoint/kprobe/kretprobe solution is much more expensive --
> > from my measurements, the hooks that invoke the various handlers take
> > ~250ns with hot cache, 1500+ns with cold cache, and tracing an empty
> > function this way reports 90ns with hot cache, 500ns with cold cache.
>
> I think it would be good if you could include an equivalent BPF-based
> implementation of your instrumentation example so people can (a) see the
> difference for themselves and get a better idea of how the approaches
> differ in a concrete case and (b) quantify the difference in performance
> between the two implementations.

+1

kprobe/kretprobe are expensive.
That was the reason we switched to bpf fentry/fexit based on bpf trampoline.
The overhead is close to zero. Currently it's used to collect stats for
bpf programs themselves, but the framework is there to collect these
stats for any kernel function. Please see:
https://lore.kernel.org/bpf/20200213210115.1455809-1-songliubraving@fb.com/T/#mae90f23e545f03bde837239e159909f4e4a1acaa
One of the ideas that came up during discussion is to
teach 'perf stat' to do the same.
So the kernel has all the facilities to instrument itself.
Only user space work left.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/2] kstats: kernel metric collector
  2020-02-26 15:00 ` [PATCH v3 0/2] kstats: kernel metric collector Toke Høiland-Jørgensen
  2020-02-26 16:31   ` Alexei Starovoitov
@ 2020-02-26 17:26   ` Luigi Rizzo
  2020-02-26 19:14     ` Peter Zijlstra
                       ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Luigi Rizzo @ 2020-02-26 17:26 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Greg KH,
	naveen.n.rao, ardb, Luigi Rizzo, Paolo Abeni, giuseppe.lettieri,
	Jesper Dangaard Brouer, mingo, acme, Steven Rostedt, peterz

[this reply also addresses comments from Alexei and Peter]

On Wed, Feb 26, 2020 at 7:00 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Luigi Rizzo <lrizzo@google.com> writes:
>
> > This patchset introduces a small library to collect per-cpu samples and
> > accumulate distributions to be exported through debugfs.
> >
> > This v3 series addresses some initial comments (mostly style fixes in the
> > code) and revises commit logs.
>
> Could you please add a proper changelog spanning all versions of the
> patch as you iterate?

Will do (v2->v3 was just a removal of stray Change-Id from the log messages)

> As for the idea itself; picking up this argument you made on v1:
>
> > The tracepoint/kprobe/kretprobe solution is much more expensive --
> > from my measurements, the hooks that invoke the various handlers take
> > ~250ns with hot cache, 1500+ns with cold cache, and tracing an empty
> > function this way reports 90ns with hot cache, 500ns with cold cache.
>
> I think it would be good if you could include an equivalent BPF-based
> implementation of your instrumentation example so people can (a) see the
> difference for themselves and get a better idea of how the approaches
> differ in a concrete case and (b) quantify the difference in performance
> between the two implementations.

At the moment, a bpf version is probably beyond my skills and goals, but
I hope the following comments can clarify the difference in
approach/performance:

- my primary goal, implemented in patch 1/2, is to have this code embedded
  in the kernel, _always available_, even to users without the skills to
  hack up the necessary bpf code, or load a bpf program (which may not be
  allowed in certain environments), and eventually replace and possibly
  improve custom variants of metric collection which we already have (or
  wished to have, but don't, because there wasn't a convenient library to
  use for them).

- I agree that this code can be recompiled in bpf (using a
  BPF_MAP_TYPE_PERCPU_ARRAY for storage, and kstats_record() and
  ks_show_entry() should be easy to convert).

- the runtime cost and complexity of hooking bpf code is still a bit
  unclear to me. kretprobes or tracepoints are expensive; I suppose some
  lean hook to replace register_kretprobe() may exist, and the difference
  from inline annotations would be marginal (we'd still need to put in the
  hooks around the code we want to time, though, so it wouldn't be a pure
  bpf solution). Any pointers to this are welcome; Alexei mentioned
  fentry/fexit and bpf trampolines, but I haven't found an example that
  lets me do something equivalent to kretprobe (take a timestamp before
  and one after a function without explicit instrumentation)

- I still see some huge differences in usability, and this is in my
  opinion one very big difference between the two approaches. The systems
  where data collection may be of interest are not necessarily accessible
  to developers with the skills to write custom bpf code, or load bpf
  modules (security policies may prevent that). One thing is to tell a
  sysadmin to run
  "echo trace foo > /sys/kernel/debug/kstats/_config"
  or "watch grep CPUS /sys/kernel/debug/kstats/bar",
  another one is to tell them to load a bpf program (or write their own).
thanks for the feedback
luigi

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/2] kstats: kernel metric collector
  2020-02-26 17:26   ` Luigi Rizzo
@ 2020-02-26 19:14     ` Peter Zijlstra
  2020-02-26 20:49     ` Alexei Starovoitov
  2020-02-26 23:11     ` Toke Høiland-Jørgensen
  2 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2020-02-26 19:14 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: Toke Høiland-Jørgensen, linux-kernel, Masami Hiramatsu,
	Andrew Morton, Greg KH, naveen.n.rao, ardb, Luigi Rizzo,
	Paolo Abeni, giuseppe.lettieri, Jesper Dangaard Brouer, mingo,
	acme, Steven Rostedt

On Wed, Feb 26, 2020 at 09:26:22AM -0800, Luigi Rizzo wrote:
> [this reply also addresses comments from Alexei and Peter]

I don't see where you address my comment; all you talk about is bpf.

I've never used bpf. Still, I also don't see any actual value in your few
lines of code.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/2] kstats: kernel metric collector
  2020-02-26 17:26   ` Luigi Rizzo
  2020-02-26 19:14     ` Peter Zijlstra
@ 2020-02-26 20:49     ` Alexei Starovoitov
  2020-02-26 23:11     ` Toke Høiland-Jørgensen
  2 siblings, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2020-02-26 20:49 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: Toke Høiland-Jørgensen, linux-kernel, Masami Hiramatsu,
	Andrew Morton, Greg KH, naveen.n.rao, ardb, Luigi Rizzo,
	Paolo Abeni, giuseppe.lettieri, Jesper Dangaard Brouer, mingo,
	acme, Steven Rostedt, peterz

On Wed, Feb 26, 2020 at 09:26:22AM -0800, Luigi Rizzo wrote:
>   Any pointers to this are welcome; Alexei mentioned fentry/fexit and
>   bpf trampolines, but I haven't found an example that lets me do something
>   equivalent to kretprobe (take a timestamp before and one after a function
>   without explicit instrumentation)

please see the link in my previous email.

> - I still see some huge differences in usability, and this is in my opinion
>   one very big difference between the two approaches. The systems where data
>   collection may be of interest are not necessarily accessible to developers
>   with the skills to write custom bpf code,

bpf code and tooling are already written. There are bcc and bpftrace scripts
that already do what you're proposing without adding any new code to the
kernel. These tools are trivial to use and require zero knowledge of bpf.
So please stop the fud.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/2] kstats: kernel metric collector
  2020-02-26 17:26   ` Luigi Rizzo
  2020-02-26 19:14     ` Peter Zijlstra
  2020-02-26 20:49     ` Alexei Starovoitov
@ 2020-02-26 23:11     ` Toke Høiland-Jørgensen
  2020-02-27 10:31       ` Luigi Rizzo
  2 siblings, 1 reply; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-26 23:11 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Greg KH,
	naveen.n.rao, ardb, Luigi Rizzo, Paolo Abeni, giuseppe.lettieri,
	Jesper Dangaard Brouer, mingo, acme, Steven Rostedt, peterz

Luigi Rizzo <lrizzo@google.com> writes:

> - the runtime cost and complexity of hooking bpf code is still a bit
> unclear to me. kretprobe or tracepoints are expensive, I suppose that
> some lean hook replace register_kretprobe() may exist and the
> difference from inline annotations would be marginal (we'd still need
> to put in the hooks around the code we want to time, though, so it
> wouldn't be a pure bpf solution). Any pointers to this are welcome;
> Alexei mentioned fentry/fexit and bpf trampolines, but I haven't found
> an example that lets me do something equivalent to kretprobe (take a
> timestamp before and one after a function without explicit
> instrumentation)

As Alexei said, with fentry/fexit the overhead should be on par with
your example. This functionality is pretty new, though, so I can
understand why it's not obvious how to do things with it yet :)

I think the best place to look is currently in selftests/bpf in the
kernel sources. Grep for 'fexit' and 'fentry' in the progs/ subdir.
test_overhead.c and kfree_skb.c seem to have some examples you may be
able to work from.

> - I still see some huge differences in usability, and this is in my
> opinion one very big difference between the two approaches. The
> systems where data collection may be of interest are not necessarily
> accessible to developers with the skills to write custom bpf code, or
> load bpf modules (security policies may prevent that). One thing is to
> tell a sysadmin to run "echo trace foo >
> /sys/kernel/debug/kstats/_config" or "watch grep CPUS
> /sys/kernel/debug/kstats/bar", another one is to tell them to load a
> bpf program (or write their own one).

With BPF the solution for this is to distribute a tool that does all the
setup for the user. Basically the userspace equivalent of what you're
proposing to include into the kernel here. You can make this arbitrarily
user-friendly, up to and including having a GUI list all the functions
available in the running kernel and letting the user just click on the
one to measure :)

-Toke


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/2] kstats: kernel metric collector
  2020-02-26 23:11     ` Toke Høiland-Jørgensen
@ 2020-02-27 10:31       ` Luigi Rizzo
  2020-02-27 12:13         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 16+ messages in thread
From: Luigi Rizzo @ 2020-02-27 10:31 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Greg KH,
	naveen.n.rao, ardb, Luigi Rizzo, Paolo Abeni, giuseppe.lettieri,
	Jesper Dangaard Brouer, mingo, acme, Steven Rostedt,
	Peter Zijlstra

On Wed, Feb 26, 2020 at 3:11 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Luigi Rizzo <lrizzo@google.com> writes:
>
> > - the runtime cost and complexity of hooking bpf code is still a bit
> > unclear to me. kretprobe or tracepoints are expensive, I suppose that
> > some lean hook replace register_kretprobe() may exist and the
> > difference from inline annotations would be marginal (we'd still need
> > to put in the hooks around the code we want to time, though, so it
> > wouldn't be a pure bpf solution). Any pointers to this are welcome;
> > Alexei mentioned fentry/fexit and bpf trampolines, but I haven't found
> > an example that lets me do something equivalent to kretprobe (take a
> > timestamp before and one after a function without explicit
> > instrumentation)
>
> As Alexei said, with fentry/fexit the overhead should be on par with
> your example. This functionality is pretty new, though, so I can
> understand why it's not obvious how to do things with it yet :)
>
> I think the best place to look is currently in selftests/bpf in the
> kernel sources. Grep for 'fexit' and 'fentry' in the progs/ subdir.
> test_overhead.c and kfree_skb.c seem to have some examples you may be
> able to work from.

Thank you for the precise reference, Toke.
I tweaked test_overhead.c to measure (using kstats) the cost of the
various hooks and I can confirm that fentry and fexit are pretty fast.
The following table shows the p90 runtime of __set_task_comm() at low
(100/s) and high (1M/s) call rates:

          90th percentile of __set_task_comm() runtime (accuracy: 30ns)
call rate    base    kprobe    kretprobe    tracepoint    fentry    fexit
100/sec       270       870         1220           500       400      450
>1M/s          60       120          210            90        70       80

For high rate operation, the overhead of fentry and fexit is quite good,
even better than tracepoints, and well below the clock's accuracy
(more detailed measurements indicate ~5ns for fentry, ~10ns for fexit).
At very low call rates there is an extra 150-200ns, but that is expected
due to the out-of-line code.

cheers
luigi

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/2] kstats: kernel metric collector
  2020-02-27 10:31       ` Luigi Rizzo
@ 2020-02-27 12:13         ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-27 12:13 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: linux-kernel, Masami Hiramatsu, Andrew Morton, Greg KH,
	naveen.n.rao, ardb, Luigi Rizzo, Paolo Abeni, giuseppe.lettieri,
	Jesper Dangaard Brouer, mingo, acme, Steven Rostedt,
	Peter Zijlstra

Luigi Rizzo <lrizzo@google.com> writes:

> On Wed, Feb 26, 2020 at 3:11 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Luigi Rizzo <lrizzo@google.com> writes:
>>
>> > - the runtime cost and complexity of hooking bpf code is still a bit
>> > unclear to me. kretprobe or tracepoints are expensive, I suppose that
>> > some lean hook replace register_kretprobe() may exist and the
>> > difference from inline annotations would be marginal (we'd still need
>> > to put in the hooks around the code we want to time, though, so it
>> > wouldn't be a pure bpf solution). Any pointers to this are welcome;
>> > Alexei mentioned fentry/fexit and bpf trampolines, but I haven't found
>> > an example that lets me do something equivalent to kretprobe (take a
>> > timestamp before and one after a function without explicit
>> > instrumentation)
>>
>> As Alexei said, with fentry/fexit the overhead should be on par with
>> your example. This functionality is pretty new, though, so I can
>> understand why it's not obvious how to do things with it yet :)
>>
>> I think the best place to look is currently in selftests/bpf in the
>> kernel sources. Grep for 'fexit' and 'fentry' in the progs/ subdir.
>> test_overhead.c and kfree_skb.c seem to have some examples you may be
>> able to work from.
>
> Thank you for the precise reference, Toke.
> I tweaked test_overhead.c to measure (using kstats) the cost of the
> various hooks and I can confirm that fentry and fexit are pretty fast.
> The following table shows the p90 runtime of __set_task_comm() at low
> (100/s) and high (1M/s) call rates:
>
>           90th percentile of __set_task_comm() runtime (accuracy: 30ns)
> call rate    base    kprobe    kretprobe    tracepoint    fentry    fexit
> 100/sec       270       870         1220           500       400      450
> >1M/s          60       120          210            90        70       80
>
> For high rate operation, the overhead of fentry and fexit is quite good,
> even better than tracepoints, and well below the clock's accuracy
> (more detailed measurements indicate ~5ns for fentry, ~10ns for fexit).
> At very low call rates there is an extra 150-200ns
> but that is expected due to the out of line code.

Great, thank you for the performance numbers! This is indeed quite good :)

-Toke


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] kstats: kernel metric collector
  2020-02-26 13:50 ` [PATCH v3 1/2] " Luigi Rizzo
  2020-02-26 14:48   ` Paolo Abeni
@ 2020-03-10 13:58   ` kbuild test robot
  2020-03-10 16:44   ` kbuild test robot
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: kbuild test robot @ 2020-03-10 13:58 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: kbuild-all, linux-kernel, mhiramat, akpm, gregkh, naveen.n.rao,
	ardb, rizzo, pabeni, giuseppe.lettieri, toke, hawk, mingo, acme,
	rostedt, peterz

[-- Attachment #1: Type: text/plain, Size: 1498 bytes --]

Hi Luigi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linux/master]
[also build test ERROR on linus/master v5.6-rc5 next-20200310]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Luigi-Rizzo/kstats-kernel-metric-collector/20200227-022930
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git f5ae2ea6347a308cfe91f53b53682ce635497d0d
config: i386-allyesconfig (attached as .config)
compiler: gcc-7 (Debian 7.5.0-5) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: lib/kstats.o: in function `ks_show_entry':
   kstats.c:(.text+0x65f): undefined reference to `__udivdi3'
>> ld: kstats.c:(.text+0x697): undefined reference to `__umoddi3'
>> ld: kstats.c:(.text+0x6af): undefined reference to `__udivdi3'
   ld: kstats.c:(.text+0x739): undefined reference to `__udivdi3'
   ld: kstats.c:(.text+0x768): undefined reference to `__umoddi3'
   ld: kstats.c:(.text+0x784): undefined reference to `__udivdi3'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 70194 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] kstats: kernel metric collector
  2020-02-26 13:50 ` [PATCH v3 1/2] " Luigi Rizzo
  2020-02-26 14:48   ` Paolo Abeni
  2020-03-10 13:58   ` kbuild test robot
@ 2020-03-10 16:44   ` kbuild test robot
  2020-03-11  0:08   ` kbuild test robot
  2020-03-11  3:30   ` kbuild test robot
  4 siblings, 0 replies; 16+ messages in thread
From: kbuild test robot @ 2020-03-10 16:44 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: kbuild-all, linux-kernel, mhiramat, akpm, gregkh, naveen.n.rao,
	ardb, rizzo, pabeni, giuseppe.lettieri, toke, hawk, mingo, acme,
	rostedt, peterz

[-- Attachment #1: Type: text/plain, Size: 1495 bytes --]

Hi Luigi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linux/master]
[also build test ERROR on linus/master v5.6-rc5 next-20200310]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Luigi-Rizzo/kstats-kernel-metric-collector/20200227-022930
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git f5ae2ea6347a308cfe91f53b53682ce635497d0d
config: m68k-allyesconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 9.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=9.2.0 make.cross ARCH=m68k 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   m68k-linux-ld: lib/kstats.o: in function `ks_show_entry':
   kstats.c:(.text+0x482): undefined reference to `__udivdi3'
>> m68k-linux-ld: kstats.c:(.text+0x522): undefined reference to `__umoddi3'
   m68k-linux-ld: kstats.c:(.text+0x5ac): undefined reference to `__umoddi3'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 52147 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] kstats: kernel metric collector
  2020-02-26 13:50 ` [PATCH v3 1/2] " Luigi Rizzo
                     ` (2 preceding siblings ...)
  2020-03-10 16:44   ` kbuild test robot
@ 2020-03-11  0:08   ` kbuild test robot
  2020-03-11  3:30   ` kbuild test robot
  4 siblings, 0 replies; 16+ messages in thread
From: kbuild test robot @ 2020-03-11  0:08 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: kbuild-all, linux-kernel, mhiramat, akpm, gregkh, naveen.n.rao,
	ardb, rizzo, pabeni, giuseppe.lettieri, toke, hawk, mingo, acme,
	rostedt, peterz

[-- Attachment #1: Type: text/plain, Size: 2241 bytes --]

Hi Luigi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linux/master]
[also build test ERROR on linus/master v5.6-rc5 next-20200310]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Luigi-Rizzo/kstats-kernel-metric-collector/20200227-022930
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git f5ae2ea6347a308cfe91f53b53682ce635497d0d
config: nds32-randconfig-a001-20200311 (attached as .config)
compiler: nds32le-linux-gcc (GCC) 9.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=9.2.0 make.cross ARCH=nds32 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   nds32le-linux-ld: lib/kstats.o: in function `ks_show_entry':
   kstats.c:(.text+0x4b0): undefined reference to `__udivdi3'
>> nds32le-linux-ld: kstats.c:(.text+0x4b4): undefined reference to `__udivdi3'
>> nds32le-linux-ld: kstats.c:(.text+0x4e8): undefined reference to `__umoddi3'
   nds32le-linux-ld: kstats.c:(.text+0x4ec): undefined reference to `__umoddi3'
   nds32le-linux-ld: kstats.c:(.text+0x508): undefined reference to `__udivdi3'
   nds32le-linux-ld: kstats.c:(.text+0x50c): undefined reference to `__udivdi3'
   nds32le-linux-ld: kstats.c:(.text+0x560): undefined reference to `__umoddi3'
   nds32le-linux-ld: kstats.c:(.text+0x564): undefined reference to `__umoddi3'
   nds32le-linux-ld: kstats.c:(.text+0x578): undefined reference to `__udivdi3'
   nds32le-linux-ld: kstats.c:(.text+0x57c): undefined reference to `__udivdi3'
   nds32le-linux-ld: kstats.c:(.text+0x5b0): undefined reference to `__udivdi3'
   nds32le-linux-ld: kstats.c:(.text+0x5b4): undefined reference to `__udivdi3'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27312 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/2] kstats: kernel metric collector
  2020-02-26 13:50 ` [PATCH v3 1/2] " Luigi Rizzo
                     ` (3 preceding siblings ...)
  2020-03-11  0:08   ` kbuild test robot
@ 2020-03-11  3:30   ` kbuild test robot
  4 siblings, 0 replies; 16+ messages in thread
From: kbuild test robot @ 2020-03-11  3:30 UTC (permalink / raw)
  To: Luigi Rizzo
  Cc: kbuild-all, linux-kernel, mhiramat, akpm, gregkh, naveen.n.rao,
	ardb, rizzo, pabeni, giuseppe.lettieri, toke, hawk, mingo, acme,
	rostedt, peterz

[-- Attachment #1: Type: text/plain, Size: 1557 bytes --]

Hi Luigi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linux/master]
[also build test ERROR on linus/master v5.6-rc5 next-20200310]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Luigi-Rizzo/kstats-kernel-metric-collector/20200227-022930
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git f5ae2ea6347a308cfe91f53b53682ce635497d0d
config: i386-randconfig-a003-20200310 (attached as .config)
compiler: gcc-7 (Debian 7.5.0-5) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: lib/kstats.o: in function `ks_show_entry':
   lib/kstats.c:124: undefined reference to `__udivdi3'
   ld: lib/kstats.o: in function `ks_print':
   lib/kstats.c:63: undefined reference to `__umoddi3'
>> ld: lib/kstats.c:63: undefined reference to `__udivdi3'
>> ld: lib/kstats.c:63: undefined reference to `__umoddi3'
>> ld: lib/kstats.c:63: undefined reference to `__udivdi3'
   ld: lib/kstats.o: in function `ks_show_entry':
   lib/kstats.c:129: undefined reference to `__udivdi3'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30756 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-03-11  3:30 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-26 13:50 [PATCH v3 0/2] kstats: kernel metric collector Luigi Rizzo
2020-02-26 13:50 ` [PATCH v3 1/2] " Luigi Rizzo
2020-02-26 14:48   ` Paolo Abeni
2020-03-10 13:58   ` kbuild test robot
2020-03-10 16:44   ` kbuild test robot
2020-03-11  0:08   ` kbuild test robot
2020-03-11  3:30   ` kbuild test robot
2020-02-26 13:50 ` [PATCH v3 2/2] kstats: kretprobe and tracepoint support Luigi Rizzo
2020-02-26 15:00 ` [PATCH v3 0/2] kstats: kernel metric collector Toke Høiland-Jørgensen
2020-02-26 16:31   ` Alexei Starovoitov
2020-02-26 17:26   ` Luigi Rizzo
2020-02-26 19:14     ` Peter Zijlstra
2020-02-26 20:49     ` Alexei Starovoitov
2020-02-26 23:11     ` Toke Høiland-Jørgensen
2020-02-27 10:31       ` Luigi Rizzo
2020-02-27 12:13         ` Toke Høiland-Jørgensen
