Linux-Fsdevel Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC PATCH 0/7] metricfs metric file system and examples
@ 2020-08-07 21:29 Jonathan Adams
  2020-08-07 21:29 ` [RFC PATCH 1/7] core/metricfs: Create metricfs, standardized files under debugfs Jonathan Adams
                   ` (8 more replies)
  0 siblings, 9 replies; 21+ messages in thread
From: Jonathan Adams @ 2020-08-07 21:29 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: netdev, kvm, Paolo Bonzini, Greg KH, Jim Mattson, David Rientjes,
	Jonathan Adams

[resending to widen the CC lists per rdunlap@infradead.org's suggestion
original posting to lkml here: https://lkml.org/lkml/2020/8/5/1009]

To try to restart the discussion of kernel statistics started by the
statsfs patchsets (https://lkml.org/lkml/2020/5/26/332), I wanted
to share the following set of patches which are Google's 'metricfs'
implementation and some example uses.  Google has been using metricfs
internally since 2012 as a way to export various statistics to our
telemetry systems (similar to OpenTelemetry), and we have over 200
statistics exported on a typical machine.

These patches have been cleaned up and modernized v.s. the versions
in production; I've included notes under the fold in the patches.
They're based on v5.8-rc6.

The statistics live under debugfs, in a tree rooted at:

	/sys/kernel/debug/metricfs

Each metric is a directory, with four files in it.  For example, the '
core/metricfs: Create metricfs, standardized files under debugfs.' patch
includes a simple 'metricfs_presence' metric, whose files look like:
/sys/kernel/debug/metricfs:
 metricfs_presence/annotations
  DESCRIPTION A\ basic\ presence\ metric.
 metricfs_presence/fields
  value
  int
 metricfs_presence/values
  1
 metricfs_presence/version
  1

(The "version" field always says '1', and is kind of vestigial)

An example of a more complicated stat is the networking stats.
For example, the tx_bytes stat looks like:

net/dev/stats/tx_bytes/annotations
  DESCRIPTION net\ device\ transmited\ bytes\ count
  CUMULATIVE
net/dev/stats/tx_bytes/fields
  interface value
  str int
net/dev/stats/tx_bytes/values
  lo 4394430608
  eth0 33353183843
  eth1 16228847091
net/dev/stats/tx_bytes/version
  1

The per-cpu statistics show up in the schedulat stat info and x86
IRQ counts.  For example:

stat/user/annotations
  DESCRIPTION time\ in\ user\ mode\ (nsec)
  CUMULATIVE
stat/user/fields
  cpu value
  int int
stat/user/values
  0 1183486517734
  1 1038284237228
  ...
stat/user/version
  1

The full set of example metrics I've included are:

core/metricfs: Create metricfs, standardized files under debugfs.
  metricfs_presence
core/metricfs: metric for kernel warnings
  warnings/values
core/metricfs: expose scheduler stat information through metricfs
  stat/*
net-metricfs: Export /proc/net/dev via metricfs.
  net/dev/stats/[tr]x_*
core/metricfs: expose x86-specific irq information through metricfs
  irq_x86/*

The general approach is called out in kernel/metricfs.c:

The kernel provides:
  - A description of the metric
  - The subsystem for the metric (NULL is ok)
  - Type information about the metric, and
  - A callback function which supplies metric values.

Limitations:
  - "values" files are at MOST 64K. We truncate the file at that point.
  - The list of fields and types is at most 1K.
  - Metrics may have at most 2 fields.

Best Practices:
  - Emit the most important data first! Once the 64K per-metric buffer
    is full, the emit* functions won't do anything.
  - In userspace, open(), read(), and close() the file quickly! The kernel
    allocation for the metric is alive as long as the file is open. This
    permits users to seek around the contents of the file, while
    permitting an atomic view of the data.

Note that since the callbacks are called and the data is generated at
file open() time, the relative consistency is only between members of
a given metric; the rx_bytes stat for every network interface will
be read at almost the same time, but if you want to get rx_bytes
and rx_packets, there could be a bunch of slew between the two file
opens.  (So this doesn't entirely address Andrew Lunn's comments in
https://lkml.org/lkml/2020/5/26/490)

This also doesn't address one of the basic parts of the statsfs work:
moving the statistics out of debugfs to avoid lockdown interactions.

Google has found a lot of value in having a generic interface for adding
these kinds of statistics with reasonably low overhead (reading them
is O(number of statistics), not number of objects in each statistic).
There are definitely warts in the interface, but does the basic approach
make sense to folks?

Thanks,
- Jonathan

Jonathan Adams (5):
  core/metricfs: add support for percpu metricfs files
  core/metricfs: metric for kernel warnings
  core/metricfs: expose softirq information through metricfs
  core/metricfs: expose scheduler stat information through metricfs
  core/metricfs: expose x86-specific irq information through metricfs

Justin TerAvest (1):
  core/metricfs: Create metricfs, standardized files under debugfs.

Laurent Chavey (1):
  net-metricfs: Export /proc/net/dev via metricfs.

 arch/x86/kernel/irq.c      |  80 ++++
 fs/proc/stat.c             |  57 +++
 include/linux/metricfs.h   | 131 +++++++
 kernel/Makefile            |   2 +
 kernel/metricfs.c          | 775 +++++++++++++++++++++++++++++++++++++
 kernel/metricfs_examples.c | 151 ++++++++
 kernel/panic.c             | 131 +++++++
 kernel/softirq.c           |  45 +++
 lib/Kconfig.debug          |  18 +
 net/core/Makefile          |   1 +
 net/core/net_metricfs.c    | 194 ++++++++++
 11 files changed, 1585 insertions(+)
 create mode 100644 include/linux/metricfs.h
 create mode 100644 kernel/metricfs.c
 create mode 100644 kernel/metricfs_examples.c
 create mode 100644 net/core/net_metricfs.c

-- 
2.28.0.236.gb10cc79966-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 1/7] core/metricfs: Create metricfs, standardized files under debugfs.
  2020-08-07 21:29 [RFC PATCH 0/7] metricfs metric file system and examples Jonathan Adams
@ 2020-08-07 21:29 ` Jonathan Adams
  2020-08-08  5:41   ` Greg KH
  2020-08-07 21:29 ` [RFC PATCH 2/7] core/metricfs: add support for percpu metricfs files Jonathan Adams
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Jonathan Adams @ 2020-08-07 21:29 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: netdev, kvm, Paolo Bonzini, Greg KH, Jim Mattson, David Rientjes,
	Jonathan Adams, Justin TerAvest

From: Justin TerAvest <teravest@google.com>

Metricfs is a standardized set of files and directories under debugfs,
with a kernel API designed to be simpler than exporting new files under
sysfs. Type and field information is reported so that a userspace daemon
can easily process the information.

The statistics live under debugfs, in a tree rooted at:

	/sys/kernel/debug/metricfs

Each metric is a directory, with four files in it.  This patch includes
a single "metricfs_presence" metric, whose files look like:
/sys/kernel/debug/metricfs:
 metricfs_presence/annotations
  DESCRIPTION A\ basic\ presence\ metric.
 metricfs_presence/fields
  value
  int
 metricfs_presence/values
  1
 metricfs_presence/version
  1

Statistics can have zero, one, or two 'fields', which are keys for the
table of metric values.  With no fields, you have a simple statistic as
above, with one field you have a 1-dimensional table of string -> value,
and with two fields you have a 2-dimensional table of
{string, string} -> value.

When a statistic's 'values' file is opened, we pre-allocate a 64k buffer
and call the statistic's callback to fill it with data, truncating if
the buffer overflows.

Statistic creators can create a hierarchy for their statistics using
metricfs_create_subsys().

Signed-off-by: Justin TerAvest <teravest@google.com>
[jwadams@google.com: Forward ported to v5.8, cleaned up and modernized
	code significantly]
Signed-off-by: Jonathan Adams <jwadams@google.com>

---

notes:
* To go upstream, this will need documentation and a MAINTAINERS update.
* It's not clear what the "version" file is for; it's vestigial and
should probably be removed.

jwadams@google.com: Forward ported to v5.8, removed some google-isms and
    cleaned up some anachronisms (atomic->refcount, moving to
    kvmalloc(), using POISON_POINTER_DELTA, made more functions static,
    made 'emitter_fn' into an explicit union instead of a void *),
    renamed 'struct emitter -> metric_emitter' and renamed
    some funcs for consistency.
---
 include/linux/metricfs.h   | 103 ++++++
 kernel/Makefile            |   2 +
 kernel/metricfs.c          | 727 +++++++++++++++++++++++++++++++++++++
 kernel/metricfs_examples.c | 151 ++++++++
 lib/Kconfig.debug          |  18 +
 5 files changed, 1001 insertions(+)
 create mode 100644 include/linux/metricfs.h
 create mode 100644 kernel/metricfs.c
 create mode 100644 kernel/metricfs_examples.c

diff --git a/include/linux/metricfs.h b/include/linux/metricfs.h
new file mode 100644
index 000000000000..65a1baa8e8c1
--- /dev/null
+++ b/include/linux/metricfs.h
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _METRICFS_H_
+#define _METRICFS_H_
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/stringify.h>
+
+struct metric;
+struct metricfs_subsys;
+
+#define METRIC_EXPORT_GENERIC(name, desc, fname0, fname1, fn, is_str, cumulative) \
+static struct metric *metric_##name; \
+void metric_init_##name(struct metricfs_subsys *parent) \
+{ \
+	metric_##name = metric_register(__stringify(name), (parent), (desc), \
+					(fname0), (fname1), (fn), (is_str), \
+					(cumulative), THIS_MODULE); \
+} \
+void metric_exit_##name(void) \
+{ \
+	metric_unregister(metric_##name); \
+}
+
+/*
+ * Metricfs only deals with two types: int64_t and const char*.
+ *
+ * If a metric has fewer than two fields, pass NULL for the field name
+ * arguments.
+ *
+ * The metric does not take ownership of any of the strings passed in.
+ *
+ * See kernel/metricfs_examples.c for a set of example metrics, with
+ * corresponding output.
+ *
+ * METRIC_EXPORT_INT - An integer-valued metric.
+ * METRIC_EXPORT_COUNTER - An integer-valued cumulative metric.
+ * METRIC_EXPORT_STR - A string-valued metric.
+ */
+#define METRIC_EXPORT_INT(name, desc, fname0, fname1, fn) \
+	METRIC_EXPORT_GENERIC(name, (desc), (fname0), (fname1), (fn), \
+				false, false)
+#define METRIC_EXPORT_COUNTER(name, desc, fname0, fname1, fn) \
+	METRIC_EXPORT_GENERIC(name, (desc), (fname0), (fname1), (fn), \
+				false, true)
+#define METRIC_EXPORT_STR(name, desc, fname0, fname1, fn) \
+	METRIC_EXPORT_GENERIC(name, (desc), (fname0), (fname1), (fn), \
+				true, false)
+
+/* Subsystem support. */
+/* Pass NULL as 'parent' to create a new top-level subsystem. */
+struct metricfs_subsys *metricfs_create_subsys(const char *name,
+						struct metricfs_subsys *parent);
+void metricfs_destroy_subsys(struct metricfs_subsys *d);
+
+/*
+ * An opaque struct that metric emit functions use to keep our internal
+ * state.
+ */
+struct metric_emitter;
+
+/* The number of non-NULL arguments passed to EMIT macros must match the number
+ * of arguments passed to the EXPORT macro for a given metric.
+ *
+ * Failure to do so will cause data to be mangled (or dropped) by userspace or
+ * Monarch.
+ */
+#define METRIC_EMIT_INT(e, v, f0, f1) \
+	metric_emit_int_value((e), (v), (f0), (f1))
+#define METRIC_EMIT_STR(e, v, f0, f1) \
+	metric_emit_str_value((e), (v), (f0), (f1))
+
+/* Users don't have to call any functions below;
+ * use the macro definitions above instead.
+ */
+void metric_emit_int_value(struct metric_emitter *e,
+			   int64_t v, const char *f0, const char *f1);
+void metric_emit_str_value(struct metric_emitter *e,
+			   const char *v, const char *f0, const char *f1);
+
+struct metric *metric_register(const char *name,
+			       struct metricfs_subsys *parent,
+			       const char *description,
+			       const char *fname0, const char *fname1,
+			       void (*fn)(struct metric_emitter *e),
+			       bool is_string,
+			       bool is_cumulative,
+			       struct module *owner);
+
+struct metric *metric_register_parm(const char *name,
+				    struct metricfs_subsys *parent,
+			  const char *description,
+				    const char *fname0, const char *fname1,
+				    void (*fn)(struct metric_emitter *e,
+					       void *parm),
+				    void *parm,
+				    bool is_string,
+				    bool is_cumulative,
+				    struct module *owner);
+
+void metric_unregister(struct metric *m);
+
+#endif /* _METRICFS_H_ */
diff --git a/kernel/Makefile b/kernel/Makefile
index f3218bc5ec69..0edf790935b0 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -109,6 +109,8 @@ obj-$(CONFIG_CPU_PM) += cpu_pm.o
 obj-$(CONFIG_BPF) += bpf/
 obj-$(CONFIG_KCSAN) += kcsan/
 obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
+obj-$(CONFIG_METRICFS) += metricfs.o
+obj-$(CONFIG_METRICFS_EXAMPLES) += metricfs_examples.o
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
diff --git a/kernel/metricfs.c b/kernel/metricfs.c
new file mode 100644
index 000000000000..676b7b04aa2b
--- /dev/null
+++ b/kernel/metricfs.c
@@ -0,0 +1,727 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/refcount.h>
+#include <linux/dcache.h>
+#include <linux/debugfs.h>
+#include <linux/init.h>
+#include <linux/kref.h>
+#include <linux/metricfs.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+
+/*
+ * Metricfs: A mechanism for exporting metrics from the kernel.
+ *
+ * Kernel code must provide:
+ *   - A description of the metric
+ *   - The subsystem for the metric (NULL is ok)
+ *   - Type information about the metric, and
+ *   - A callback function which supplies metric values.
+ *
+ * In return, metricfs provides files in debugfs at:
+ *   /sys/kernel/debug/metricfs/<subsys>/<metric_name>/
+ * The files are:
+ *   - annotations, which provides streamz "annotations"-- the description, and
+ *                  other metadata (e.g. if it's constant, deprecated, etc.)
+ *   - fields, which provides type information about the metric and its fields.
+ *   - values, which contains the actual metric value data.
+ *   - version, which is kept around for future-proofing.
+ *
+ * Metrics only support a limited subset of types-- for fields, they only
+ * support strings, integers, and boolean types. For simplicity, we only support
+ * strings and integers and strictly control how the data is formatted when
+ * displayed from debugfs.
+ *
+ * See kernel/metricfs_examples.c for example code.
+ *
+ * Limitations:
+ *   - "values" files are at MOST 64K. We truncate the file at that point.
+ *   - The list of fields and types is at most 1K.
+ *   - Metrics may have at most 2 fields.
+ *
+ * Best Practices:
+ *   - Emit the most important data first! Once the 64K per-metric buffer
+ *     is full, the emit* functions won't do anything.
+ *   - In userspace, open(), read(), and close() the file quickly! The kernel
+ *     allocation for the metric is alive as long as the file is open. This
+ *     permits users to seek around the contents of the file, while permitting
+ *     an atomic view of the data.
+ *
+ * FAQ:
+ *   - Why is memory allocated for file data at open()?
+ *     Snapshots of data provided by the kernel should be as "atomic" as
+ *     possible. If userspace code performs read()s smaller than the total
+ *     amount of data, we'd like for that tool to still work, while providing a
+ *     consistent view of the file.
+ *
+ * Questions:
+ *   - Would it be simpler if we escaped spaces instead of wrapping strings in
+ *     quotes?
+ */
+struct metric {
+	const char *name;
+	const char *description;
+
+	/* Metric field names (optional, NULL if unused) */
+	const char *fname0;
+	const char *fname1;
+
+	union {
+		void (*emit_noparm)(struct metric_emitter *e); /* !has_parm */
+		void (*emit_parm)(struct metric_emitter *e,
+				  void *parm); /* has_parm */
+	} emit_fn;
+	void *eparm;
+	bool is_string;
+	bool is_cumulative;
+	bool has_parm;
+
+	/* dentry for the directory that contains the metric */
+	struct dentry *dentry;
+
+	struct module *owner;
+
+	refcount_t refcnt;
+
+	/* Inodes that have references to our metric, protected under
+	 * big_mutex.
+	 */
+	struct inode *inodes[4];
+};
+
+/* Returns true if the refcount was successfully incremented for the metric */
+static int metric_module_get(struct metric *m)
+{
+	if (!try_module_get(m->owner))
+		return 0;
+
+	if (!refcount_inc_not_zero(&m->refcnt)) {
+		module_put(m->owner);
+		return 0;
+	}
+
+	return 1;
+}
+
+/* Returns true if the last reference was put. */
+static bool metric_put(struct metric *m)
+{
+	bool rc = refcount_dec_and_test(&m->refcnt);
+
+	if (rc)
+		kfree(m);
+	return rc;
+}
+
+static void metric_module_put(struct metric *m)
+{
+	struct module *owner = m->owner;
+
+	metric_put(m);
+	module_put(owner);
+}
+
+struct metric_emitter {
+	char *buf;
+	char *orig_buf;  /* To calculate total written. */
+	int size;  /* Size of underlying buffer. */
+	struct metric *metric;  /* For type checking. */
+};
+
+#define METRICFS_ANNOTATIONS_BUF_SIZE (1 * 1024)
+#define METRICFS_FIELDS_BUF_SIZE (1 * 1024)
+#define METRICFS_VALUES_BUF_SIZE (64 * 1024)
+#define METRICFS_VERSION_BUF_SIZE (8)
+
+/* Maximum length for fields. They're truncated at this point. */
+#define METRICFS_MAX_FIELD_LEN (100)
+
+static int emit_bytes_left(const struct metric_emitter *e)
+{
+	WARN_ON(e->orig_buf > e->buf);
+	return e->size - (e->buf - e->orig_buf);
+}
+
+struct char_tracker {
+	char *dest;
+	int size;
+	int pos;
+};
+
+static void add_char(struct char_tracker *t, char c)
+{
+	if (t->pos < t->size)
+		t->dest[t->pos] = c;
+	/* Increment pos even if we don't print, so we know how many
+	 * characters we'd print if we had room.
+	 */
+	t->pos++;
+}
+
+/* Escape backslashes, spaces, and newlines in string "s",
+ * copying to "dest", to a maximum of "size" characters.
+ *
+ * examples:
+ *  [Hi\ , "there"] -> [Hi\\\ ,\ "there"]
+ *  [foo
+ *   bar] - > [foo\nbar]
+ *
+ * Returns the number of characters that would be copied, if enough space
+ * was available. Doesn't emit a trailing zero.
+ */
+static int escape_string(char *dest, const char *s, int size)
+{
+	struct char_tracker tracker = {
+		.dest = dest,
+		.size = size,
+		.pos = 0,
+	};
+
+	/* We have to process the entire source string to ensure that
+	 * we return a useful value for the total possible emitted length.
+	 */
+	while (*s != 0) {
+		/* escape newlines */
+		if (*s == '\n') {
+			add_char(&tracker, '\\');
+			add_char(&tracker, 'n');
+			s++;
+			continue;
+		}
+
+		/* escape spaces and backslashes. */
+		if (*s == ' ' || *s == '\\')
+			add_char(&tracker, '\\');
+		add_char(&tracker, *s);
+		s++;
+	}
+
+	return tracker.pos;
+}
+
+/* Emits a string into the emitter buffer, no escaping */
+static bool emit_string(struct metric_emitter *e, const char *s)
+{
+	int bytes_left = emit_bytes_left(e);
+	int rc = snprintf(e->buf, bytes_left, "%s", s);
+
+	e->buf += min(rc, bytes_left);
+	return rc < bytes_left;
+}
+
+/* Emits a string into the emitter buffer, escaping quotes and newlines. */
+static bool emit_quoted_string(struct metric_emitter *e, const char *s)
+{
+	int bytes_left = emit_bytes_left(e);
+	int rc = escape_string(e->buf, s, bytes_left);
+
+	e->buf += min(rc, bytes_left);
+	return rc < bytes_left;
+}
+
+/* Emits an int into the emitter buffer */
+static bool emit_int(struct metric_emitter *e, int64_t i)
+{
+	int bytes_left = emit_bytes_left(e);
+	int rc = snprintf(e->buf, bytes_left, "%lld", i);
+
+	e->buf += min(rc, bytes_left);
+	return rc < bytes_left;
+}
+
+static void check_field_mismatch(struct metric *m, const char *f0,
+				 const char *f1)
+{
+	WARN_ON(m->fname0 && !f0);
+	WARN_ON(!m->fname0 && f0);
+	WARN_ON(m->fname1 && !f1);
+	WARN_ON(!m->fname1 && f1);
+}
+
+void metric_emit_int_value(struct metric_emitter *e, int64_t v,
+			   const char *f0, const char *f1)
+{
+	char *ckpt = e->buf;
+	bool ok = true;
+
+	WARN_ON_ONCE(e->metric->is_string);
+	check_field_mismatch(e->metric, f0, f1);
+	if (f0) {
+		ok &= emit_quoted_string(e, f0);
+		ok &= emit_string(e, " ");
+		if (f1) {
+			ok &= emit_quoted_string(e, f1);
+			ok &= emit_string(e, " ");
+		}
+	}
+	ok &= emit_int(e, v);
+	ok &= emit_string(e, "\n");
+	if (!ok)
+		e->buf = ckpt;
+}
+EXPORT_SYMBOL(metric_emit_int_value);
+
+void metric_emit_str_value(struct metric_emitter *e, const char *v,
+			   const char *f0, const char *f1)
+{
+	char *ckpt = e->buf;
+	bool ok = true;
+
+	WARN_ON_ONCE(!e->metric->is_string);
+	check_field_mismatch(e->metric, f0, f1);
+	if (f0) {
+		ok &= emit_quoted_string(e, f0);
+		ok &= emit_string(e, " ");
+		if (f1) {
+			ok &= emit_quoted_string(e, f1);
+			ok &= emit_string(e, " ");
+		}
+	}
+	ok &= emit_quoted_string(e, v);
+	ok &= emit_string(e, "\n");
+	if (!ok)
+		e->buf = ckpt;
+}
+EXPORT_SYMBOL(metric_emit_str_value);
+
+/* Contains file data generated at open() */
+struct metricfs_file_private {
+	size_t bytes_written;
+	char buf[0];
+};
+
+/* A mutex to prevent races involving the pointer to the inode stored in
+ * inode->i_private. We'll remove this if we can get a callback at inode
+ * deletion in debugfs.
+ */
+static DEFINE_MUTEX(big_mutex);
+
+/* Returns 1 on success, <0 otherwise. */
+static int metric_open_helper(struct inode *inode, struct file *filp,
+			      int buf_size,
+			      struct metric **m,
+			      struct metricfs_file_private **p)
+{
+	int size;
+
+	mutex_lock(&big_mutex);
+	/* Debugfs stores the "data" parameter from debugfs_create_file in
+	 * inode->i_private.
+	 */
+	*m = (struct metric *)inode->i_private;
+	if (!(*m) || !metric_module_get(*m)) {
+		mutex_unlock(&big_mutex);
+		return -ENXIO;
+	}
+	mutex_unlock(&big_mutex);
+
+	size = sizeof(struct metricfs_file_private) + buf_size;
+	*p = kvmalloc(size, GFP_KERNEL);
+	if (!*p) {
+		metric_module_put(*m);
+		return -ENOMEM;
+	}
+	filp->private_data = *p;
+	return 1;
+}
+
+static int metricfs_generic_release(struct inode *inode, struct file *filp)
+{
+	struct metricfs_file_private *p =
+			(struct metricfs_file_private *)filp->private_data;
+	kvfree(p);
+
+	filp->private_data = (void *)(0xDEADBEEFul + POISON_POINTER_DELTA);
+	/* FIXME here too? */
+	metric_module_put((struct metric *)inode->i_private);
+	return 0;
+}
+
+static int metricfs_annotations_open(struct inode *inode, struct file *filp)
+{
+	struct metric_emitter e;
+	struct metric *m;
+	struct metricfs_file_private *p;
+	bool ok = true;
+
+	int rc = metric_open_helper(inode, filp, METRICFS_ANNOTATIONS_BUF_SIZE,
+				    &m, &p);
+	if (rc < 0)
+		return rc;
+
+	e.buf = p->buf;
+	e.orig_buf = p->buf;
+	e.size = METRICFS_ANNOTATIONS_BUF_SIZE;
+	ok &= emit_string(&e, "DESCRIPTION ");
+	ok &= emit_quoted_string(&e, m->description);
+	ok &= emit_string(&e, "\n");
+	if (m->is_cumulative)
+		ok &= emit_string(&e, "CUMULATIVE\n");
+
+	/* Emit all or nothing. */
+	if (ok) {
+		p->bytes_written = e.buf - e.orig_buf;
+	} else {
+		metricfs_generic_release(inode, filp);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int metricfs_fields_open(struct inode *inode, struct file *filp)
+{
+	struct metric_emitter e;
+	struct metric *m;
+	struct metricfs_file_private *p;
+	bool ok = true;
+
+	int rc = metric_open_helper(inode, filp, METRICFS_FIELDS_BUF_SIZE,
+				    &m, &p);
+	if (rc < 0)
+		return rc;
+
+	e.buf = p->buf;
+	e.orig_buf = p->buf;
+	e.size = METRICFS_FIELDS_BUF_SIZE;
+	e.metric = m;
+
+	/* We don't have to do string escaping on fields, as quotes aren't
+	 * permitted in field names.
+	 */
+	if (m->fname0) {
+		ok &= emit_string(&e, m->fname0);
+		ok &= emit_string(&e, " ");
+	}
+	if (m->fname1) {
+		ok &= emit_string(&e, m->fname1);
+		ok &= emit_string(&e, " ");
+	}
+	ok &= emit_string(&e, "value\n");
+
+	if (m->fname0)
+		ok &= emit_string(&e, "str ");
+	if (m->fname1)
+		ok &= emit_string(&e, "str ");
+	ok &= emit_string(&e, (m->is_string) ? "str\n" : "int\n");
+
+	/* Emit all or nothing. */
+	if (ok) {
+		p->bytes_written = e.buf - e.orig_buf;
+	} else {
+		metricfs_generic_release(inode, filp);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int metricfs_version_open(struct inode *inode, struct file *filp)
+{
+	struct metric *m;
+	struct metricfs_file_private *p;
+	int rc = metric_open_helper(inode, filp, METRICFS_VERSION_BUF_SIZE,
+				    &m, &p);
+	if (rc < 0)
+		return rc;
+
+	p->bytes_written = snprintf(p->buf, METRICFS_VERSION_BUF_SIZE,
+				    "1\n");
+
+	if (p->bytes_written >= METRICFS_VERSION_BUF_SIZE) {
+		metricfs_generic_release(inode, filp);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int metricfs_values_open(struct inode *inode, struct file *filp)
+{
+	struct metric_emitter e;
+
+	struct metric *m;
+	struct metricfs_file_private *p;
+	int rc = metric_open_helper(inode, filp, METRICFS_VALUES_BUF_SIZE,
+				    &m, &p);
+	if (rc < 0)
+		return rc;
+
+	e.buf = p->buf;
+	e.orig_buf = p->buf;
+	e.size = METRICFS_VALUES_BUF_SIZE;
+	e.metric = m;
+
+	if (m->has_parm) {
+		if (m->emit_fn.emit_parm)
+			(m->emit_fn.emit_parm)(&e, m->eparm);
+	} else {
+		if (m->emit_fn.emit_noparm)
+			(m->emit_fn.emit_noparm)(&e);
+	}
+	p->bytes_written = e.buf - e.orig_buf;
+	return 0;
+}
+
+static ssize_t metricfs_generic_read(struct file *filp, char __user *ubuf,
+				     size_t cnt, loff_t *ppos)
+{
+	struct metricfs_file_private *p =
+			(struct metricfs_file_private *)filp->private_data;
+	return simple_read_from_buffer(ubuf, cnt, ppos, p->buf,
+					p->bytes_written);
+}
+
+static const struct file_operations metricfs_annotations_ops = {
+	.open = metricfs_annotations_open,
+	.read = metricfs_generic_read,
+	.release = metricfs_generic_release,
+};
+
+static const struct file_operations metricfs_fields_ops = {
+	.open = metricfs_fields_open,
+	.read = metricfs_generic_read,
+	.release = metricfs_generic_release,
+};
+
+static const struct file_operations metricfs_values_ops = {
+	.open = metricfs_values_open,
+	.read = metricfs_generic_read,
+	.release = metricfs_generic_release,
+};
+
+static const struct file_operations metricfs_version_ops = {
+	.open = metricfs_version_open,
+	.read = metricfs_generic_read,
+	.release = metricfs_generic_release,
+};
+
+static struct dentry *d_metricfs;
+
+static struct dentry *metricfs_init_dentry(void)
+{
+	static int once;
+
+	if (d_metricfs)
+		return d_metricfs;
+
+	if (!debugfs_initialized())
+		return NULL;
+
+	d_metricfs = debugfs_create_dir("metricfs", NULL);
+
+	if (!d_metricfs && !once) {
+		once = 1;
+		pr_warn("Could not create debugfs directory 'metricfs'\n");
+		return NULL;
+	}
+
+	return d_metricfs;
+}
+
+/* We always cast in and out to struct dentry. */
+struct metricfs_subsys {
+	struct dentry dentry;
+};
+
+static struct dentry *metricfs_create_file(const char *name,
+					   mode_t mode,
+					   struct dentry *parent,
+					   void *data,
+					   const struct file_operations *fops)
+{
+	struct dentry *ret;
+
+	ret = debugfs_create_file(name, mode, parent, data, fops);
+	if (!ret)
+		pr_warn("Could not create debugfs '%s' entry\n", name);
+
+	return ret;
+}
+
+static struct dentry *metricfs_create_dir(const char *name,
+					  struct metricfs_subsys *s)
+{
+	struct dentry *d;
+
+	if (!s)
+		d = d_metricfs;
+	else
+		d = &s->dentry;
+
+	if (!d) {
+		pr_warn("Couldn't create %s, subsys doesn't exist.", name);
+		return NULL;
+	}
+	return debugfs_create_dir(name, d);
+}
+
+static int metricfs_initialized;
+
+struct metric *metric_register(const char *name,
+				struct metricfs_subsys *parent,
+				const char *description,
+				const char *fname0,
+				const char *fname1,
+				void (*fn)(struct metric_emitter *e),
+				bool is_string,
+				bool is_cumulative,
+				struct module *owner)
+{
+	struct metric *m;
+	struct dentry *d, *t;
+
+	if (!metricfs_initialized) {
+		pr_warn("Could not create metric before initing metricfs\n");
+		return NULL;
+	}
+
+	m = kzalloc(sizeof(*m), GFP_KERNEL);
+	if (!m)
+		return NULL;
+
+	d = metricfs_create_dir(name, parent);
+	if (!d) {
+		pr_warn("Could not create dir '%s' in metricfs.\n", name);
+		kfree(m);
+		return NULL;
+	}
+
+	m->description = description;
+	m->fname0 = fname0;
+	m->fname1 = fname1;
+	m->has_parm = false;
+	m->emit_fn.emit_noparm = fn;
+	m->eparm = NULL;
+	m->is_string = is_string;
+	m->is_cumulative = is_cumulative;
+	refcount_set(&m->refcnt, 1);
+	m->owner = owner;
+	m->dentry = d;
+
+
+	mutex_lock(&big_mutex);
+	t = metricfs_create_file("annotations", 0444, d, m,
+					&metricfs_annotations_ops);
+	if (!t)
+		goto done;
+	m->inodes[0] = t->d_inode;
+
+	t = metricfs_create_file("fields", 0444, d, m,
+					&metricfs_fields_ops);
+	if (!t)
+		goto done;
+	m->inodes[1] = t->d_inode;
+
+	t = metricfs_create_file("values", 0444, d, m,
+					&metricfs_values_ops);
+	if (!t)
+		goto done;
+	m->inodes[2] = t->d_inode;
+
+	t = metricfs_create_file("version", 0444, d, m,
+					&metricfs_version_ops);
+	if (!t)
+		goto done;
+	m->inodes[3] = t->d_inode;
+
+done:
+	/* Unregister the metric before anyone calls open() if we had any
+	 * errors on file creation.
+	 */
+	if (!t) {
+		metric_unregister(m);
+		m = NULL;
+	}
+	mutex_unlock(&big_mutex);
+
+	return m;
+}
+EXPORT_SYMBOL(metric_register);
+
+struct metric *metric_register_parm(const char *name,
+				    struct metricfs_subsys *parent,
+				    const char *description,
+				    const char *fname0,
+				    const char *fname1,
+				    void (*fn)(struct metric_emitter *e,
+					       void *parm),
+				    void *eparm,
+				    bool is_string,
+				    bool is_cumulative,
+				    struct module *owner)
+{
+	struct metric *metric =
+		metric_register(name, parent, description,
+				fname0, fname1,
+				(void (*)(struct metric_emitter *))NULL,
+				is_string,
+				is_cumulative, owner);
+	if (metric) {
+		metric->has_parm = true;
+		metric->emit_fn.emit_parm = fn;
+		metric->eparm = eparm;
+	}
+	return metric;
+}
+EXPORT_SYMBOL(metric_register_parm);
+
+void metric_unregister(struct metric *m)
+{
+	/* We have to NULL out the i_private pointers here so that no other
+	 * callers come into open, getting a pointer to the metric that we
+	 * freed.
+	 */
+	mutex_lock(&big_mutex);
+	m->inodes[0]->i_private = NULL;
+	m->inodes[1]->i_private = NULL;
+	m->inodes[2]->i_private = NULL;
+	m->inodes[3]->i_private = NULL;
+	mutex_unlock(&big_mutex);
+
+	debugfs_remove_recursive(m->dentry);
+	metric_put(m);
+}
+EXPORT_SYMBOL(metric_unregister);
+
+struct metricfs_subsys *metricfs_create_subsys(const char *name,
+					       struct metricfs_subsys *parent)
+{
+	struct dentry *d = metricfs_create_dir(name, parent);
+
+	return container_of(d, struct metricfs_subsys, dentry);
+}
+EXPORT_SYMBOL(metricfs_create_subsys);
+
+void metricfs_destroy_subsys(struct metricfs_subsys *s)
+{
+	if (s)
+		debugfs_remove(&s->dentry);
+}
+EXPORT_SYMBOL(metricfs_destroy_subsys);
+
+static void metricfs_presence_fn(struct metric_emitter *e)
+{
+	METRIC_EMIT_INT(e, 1, NULL, NULL);
+}
+METRIC_EXPORT_INT(metricfs_presence, "A basic presence metric.",
+			NULL, NULL, metricfs_presence_fn);
+
+static int __init metricfs_init(void)
+{
+	if (!metricfs_init_dentry())
+		return -ENOMEM;
+	metricfs_initialized = 1;
+
+	/* Create a basic "presence" metric. */
+	metric_init_metricfs_presence(NULL);
+
+	mutex_init(&big_mutex);
+	return 0;
+}
+
+/*
+ * Debugfs should be fine by the time we're at fs_initcall.
+ */
+fs_initcall(metricfs_init);
diff --git a/kernel/metricfs_examples.c b/kernel/metricfs_examples.c
new file mode 100644
index 000000000000..50d891176728
--- /dev/null
+++ b/kernel/metricfs_examples.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/init.h>
+#include <linux/metricfs.h>
+#include <linux/module.h>
+
+/* A metric to force truncation of the values file. "values" files in
+ * metricfs can be at most 64K in size. It truncates to the last record
+ * that fits entirely in the output file.
+ *
+ * Creates a metric with a values file that looks like:
+ * val"0" 0
+ * val"1" 1
+ * val"2" 2
+ * ...
+ * "val"3565" 3565
+ */
+static void more_than_64k_fn(struct metric_emitter *e)
+{
+	char buf[80];
+	int i;
+
+	for (i = 0; i < 10000; i++) {
+		sprintf(buf, "val\"%d\"", i);
+		/* Argument order is (emitter, value, field0, field1...) */
+		METRIC_EMIT_INT(e, i, buf, NULL);
+	}
+}
+METRIC_EXPORT_INT(more_than_64k, "Stress test metric.",
+			"v", NULL, more_than_64k_fn);
+
+
+/* A metric with two string fields and int64 values.
+ *
+ * # cat /sys/kernel/debug/metricfs/two_string_fields/annotations
+ * DESCRIPTION "Two fields example."
+ * # cat /sys/kernel/debug/metricfs/two_string_fields/fields
+ * disk cgroup value
+ * str str int
+ * # cat /sys/kernel/debug/metricfs/two_string_fields/values
+ * sda /map_reduce1 0
+ * sda /sys 50
+ * sdb /map_reduce2 12
+ */
+static void two_string_fields_fn(struct metric_emitter *e)
+{
+#define NR_ENTRIES 3
+	const char *disk[NR_ENTRIES] = {"sda", "sda", "sdb"};
+	const char *cgroups[NR_ENTRIES] = {
+				"/map_reduce1", "/sys", "/map_reduce2"};
+	const int64_t counters[NR_ENTRIES] = {0, 50, 12};
+	int i;
+
+	for (i = 0; i < NR_ENTRIES; i++) {
+		METRIC_EMIT_INT(e,
+				counters[i], disk[i], cgroups[i]);
+	}
+}
+#undef NR_ENTRIES
+METRIC_EXPORT_INT(two_string_fields, "Two fields example.",
+			"disk", "cgroup", two_string_fields_fn);
+
+
+/* A metric with zero fields and a string value.
+ *
+ * # cat /sys/kernel/debug/metricfs/string_valued_metric/annotations
+ * DESCRIPTION "String metric."
+ * # cat /sys/kernel/debug/metricfs/string_valued_metric/fields
+ * value
+ * str
+ * # cat /sys/kernel/debug/metricfs/string_valued_metric/values
+ * Test\ninfo.
+ */
+static void string_valued_metric_fn(struct metric_emitter *e)
+{
+	METRIC_EMIT_STR(e, "Test\ninfo.", NULL, NULL);
+}
+METRIC_EXPORT_STR(string_valued_metric, "String metric.",
+			NULL, NULL, string_valued_metric_fn);
+
+/* Test metric to ensure we behave properly with a large annotation string. */
+static void huge_annotation_fn(struct metric_emitter *e)
+{
+	METRIC_EMIT_STR(e, "test\n", NULL, NULL);
+}
+static const char *huge_annotation_s =
+	"1231231231231231231231231231231241241212895781930750981347503485"
+	"7029348750923847502384750923847590234857902348759023475028934751"
+	"1111111111111112312312312312312312312312312312412412128957819307"
+	"5098134750348570293487509238475023847509238475902348579023487590"
+	"2347502893475 23123123123123123123123123123124124121289578193075"
+	"0981347503485702934875092384750238475092384759023485790234875902"
+	"347502893475 231231231231231231231231231231241241212895781930750"
+	"9813475034857029348750923847502384750923847590234857902348759023"
+	"47502893475 2312312312312312312312312312312412412128957819307509"
+	"8134750348570293487509238475023847509238475902348579023487590234"
+	"7502893475 23123123123123123123123123123124124121289578193075098"
+	"1347503485702934875092384750238475092384759023485790234875902347"
+	"502893475 231231231231231231231231231231241241212895781930750981"
+	"3475034857029348750923847502384750923847590234857902348759023475"
+	"02893475 2312312312312312312312312312312412412128957819307509813"
+	"4750348570293487509238475023847509238475902348579023487590234750"
+	"2893475 23123123123123123123123123123124124121289578193075098134"
+	"7503485702934875092384750238475092384759023485790234875902347502"
+	"893475 231231231231231231231231231231241241212895781930750981347"
+	"5034857029348750923847502384750923847590234857902348759023475028"
+	"93475 2312312312312312312312312312312412412128957819307509813475"
+	"0348570293487509238475023847509238475902348579023487590234750289"
+	"3475 23123123123123123123123123123124124121289578193075098134750"
+	"3485702934875092384750238475092384759023485790234875902347502893"
+	"475 231231231231231231231231231231241241212895781930750981347503"
+	"4857029348750923847502384750923847590234857902348759023475028934"
+	"75 2312312312312312312312312312312412412128957819307509813475034"
+	"8570293487509238475023847509238475902348579023487590234750289347"
+	"5 23123123123123123123123123123124124121289578193075098134750348"
+	"5702934875092384750238475092384759023485790234875902347502893475"
+	" 231231231231231231231231231231241241212895781930750981347503485"
+	"702934875092384750238475092384759023485790234875902347502893475 "
+	"2312312312312312312312312312312412412128957819307509813475034857"
+	"02934875092384750238475092384759023485790234875902347502893475";
+
+METRIC_EXPORT_STR(huge_annotation, huge_annotation_s, NULL, NULL,
+			huge_annotation_fn);
+
+
+struct metricfs_subsys *examples_subsys;
+
+static int __init metricfs_examples_init(void)
+{
+	examples_subsys = metricfs_create_subsys("examples", NULL);
+	metric_init_more_than_64k(examples_subsys);
+	metric_init_two_string_fields(examples_subsys);
+	metric_init_string_valued_metric(examples_subsys);
+	metric_init_huge_annotation(examples_subsys);
+
+	return 0;
+}
+
+static void __exit metricfs_examples_exit(void)
+{
+	metric_exit_more_than_64k();
+	metric_exit_two_string_fields();
+	metric_exit_string_valued_metric();
+	metric_exit_huge_annotation();
+
+	metricfs_destroy_subsys(examples_subsys);
+}
+
+module_init(metricfs_examples_init);
+module_exit(metricfs_examples_exit);
+
+MODULE_LICENSE("GPL");
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 9ad9210d70a1..8de0244e7804 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -325,6 +325,24 @@ config READABLE_ASM
 	  to keep kernel developers who have to stare a lot at assembler listings
 	  sane.
 
+config METRICFS
+	bool "Metricfs for sysmon"
+	depends on DEBUG_FS
+	help
+	  metricfs is a library for creating rigidly-formatted files in debugfs
+	  which can be automatically monitored by user-space telemetry.  The
+	  hierarchy is rooted at /sys/kernel/debug/metricfs, and each metric
+	  contains metadata about the metric and types involved, as well as a
+	  tabular values file with the metrics themselves.
+
+config METRICFS_EXAMPLES
+	tristate "Metricfs examples"
+	depends on METRICFS
+	help
+	  example tests and metrics for metricfs.  With this, a set of metrics
+	  appear under "examples", covering various corner cases of the metricfs
+	  interface.  These can be used to test the metricfs functionality.
+
 config HEADERS_INSTALL
 	bool "Install uapi headers to usr/include"
 	depends on !UML
-- 
2.28.0.236.gb10cc79966-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 2/7] core/metricfs: add support for percpu metricfs files
  2020-08-07 21:29 [RFC PATCH 0/7] metricfs metric file system and examples Jonathan Adams
  2020-08-07 21:29 ` [RFC PATCH 1/7] core/metricfs: Create metricfs, standardized files under debugfs Jonathan Adams
@ 2020-08-07 21:29 ` Jonathan Adams
  2020-08-08  5:43   ` Greg KH
  2020-08-07 21:29 ` [RFC PATCH 3/7] core/metricfs: metric for kernel warnings Jonathan Adams
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Jonathan Adams @ 2020-08-07 21:29 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: netdev, kvm, Paolo Bonzini, Greg KH, Jim Mattson, David Rientjes,
	Jonathan Adams

Add a simple mechanism for exporting percpu data through metricfs.
The API follows the existing metricfs pattern.  A percpu file is
defined with:

    METRIC_EXPORT_PERCPU_INT(name, desc, fn)
    METRIC_EXPORT_PERCPU_COUNTER(name, desc, fn)

The first defines a file for exposing a percpu int.  The second is
similar, but is for a counter that accumulates since boot.  The
'name' is used as the metricfs file.  The 'desc' is a description
of the metric.  The 'fn' is a callback function to emit a single
percpu value:

    void (*fn)(struct metric_emitter *e, int cpu);

The callback must call METRIC_EMIT_PERCPU_INT with the value for
the specified CPU.

Signed-off-by: Jonathan Adams <jwadams@google.com>

---

jwadams@google.com: rebased to 5.6-pre6, renamed funcs to start with
	metric_.  This is work originally done by another engineer at
	google, who would rather not have their name associated with this
	patchset. They're okay with me sending it under my name.
---
 include/linux/metricfs.h | 28 +++++++++++++++++++
 kernel/metricfs.c        | 58 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 81 insertions(+), 5 deletions(-)

diff --git a/include/linux/metricfs.h b/include/linux/metricfs.h
index 65a1baa8e8c1..f103dc8c44ec 100644
--- a/include/linux/metricfs.h
+++ b/include/linux/metricfs.h
@@ -22,6 +22,19 @@ void metric_exit_##name(void) \
 	metric_unregister(metric_##name); \
 }
 
+#define METRIC_EXPORT_PERCPU(name, desc, fn, cumulative) \
+static struct metric *metric_##name; \
+void metric_init_##name(struct metricfs_subsys *parent) \
+{ \
+	metric_##name = metric_register_percpu(__stringify(name), (parent), \
+					(desc), (fn), \
+					(cumulative), THIS_MODULE); \
+} \
+void metric_exit_##name(void) \
+{ \
+	metric_unregister(metric_##name); \
+}
+
 /*
  * Metricfs only deals with two types: int64_t and const char*.
  *
@@ -47,6 +60,11 @@ void metric_exit_##name(void) \
 	METRIC_EXPORT_GENERIC(name, (desc), (fname0), (fname1), (fn), \
 				true, false)
 
+#define METRIC_EXPORT_PERCPU_INT(name, desc, fn) \
+	METRIC_EXPORT_PERCPU(name, (desc), (fn), false)
+#define METRIC_EXPORT_PERCPU_COUNTER(name, desc, fn) \
+	METRIC_EXPORT_PERCPU(name, (desc), (fn), true)
+
 /* Subsystem support. */
 /* Pass NULL as 'parent' to create a new top-level subsystem. */
 struct metricfs_subsys *metricfs_create_subsys(const char *name,
@@ -69,6 +87,8 @@ struct metric_emitter;
 	metric_emit_int_value((e), (v), (f0), (f1))
 #define METRIC_EMIT_STR(e, v, f0, f1) \
 	metric_emit_str_value((e), (v), (f0), (f1))
+#define METRIC_EMIT_PERCPU_INT(e, cpu, v) \
+	metric_emit_percpu_int_value((e), (cpu), (v))
 
 /* Users don't have to call any functions below;
  * use the macro definitions above instead.
@@ -77,6 +97,7 @@ void metric_emit_int_value(struct metric_emitter *e,
 			   int64_t v, const char *f0, const char *f1);
 void metric_emit_str_value(struct metric_emitter *e,
 			   const char *v, const char *f0, const char *f1);
+void metric_emit_percpu_int_value(struct metric_emitter *e, int cpu, int64_t v);
 
 struct metric *metric_register(const char *name,
 			       struct metricfs_subsys *parent,
@@ -98,6 +119,13 @@ struct metric *metric_register_parm(const char *name,
 				    bool is_cumulative,
 				    struct module *owner);
 
+struct metric *metric_register_percpu(const char *name,
+			       struct metricfs_subsys *parent,
+			       const char *description,
+			       void (*fn)(struct metric_emitter *e, int cpu),
+			       bool is_cumulative,
+			       struct module *owner);
+
 void metric_unregister(struct metric *m);
 
 #endif /* _METRICFS_H_ */
diff --git a/kernel/metricfs.c b/kernel/metricfs.c
index 676b7b04aa2b..992fdd9a4d0a 100644
--- a/kernel/metricfs.c
+++ b/kernel/metricfs.c
@@ -76,6 +76,8 @@ struct metric {
 	bool is_string;
 	bool is_cumulative;
 	bool has_parm;
+	bool is_percpu;
+	void (*percpu_fn)(struct metric_emitter *e, int cpu);
 
 	/* dentry for the directory that contains the metric */
 	struct dentry *dentry;
@@ -285,6 +287,19 @@ void metric_emit_str_value(struct metric_emitter *e, const char *v,
 }
 EXPORT_SYMBOL(metric_emit_str_value);
 
+void metric_emit_percpu_int_value(struct metric_emitter *e, int cpu, int64_t v)
+{
+	char *ckpt = e->buf;
+	bool ok = true;
+
+	ok &= emit_int(e, cpu);
+	ok &= emit_string(e, " ");
+	ok &= emit_int(e, v);
+	ok &= emit_string(e, "\n");
+	if (!ok)
+		e->buf = ckpt;
+}
+
 /* Contains file data generated at open() */
 struct metricfs_file_private {
 	size_t bytes_written;
@@ -400,11 +415,15 @@ static int metricfs_fields_open(struct inode *inode, struct file *filp)
 	}
 	ok &= emit_string(&e, "value\n");
 
-	if (m->fname0)
-		ok &= emit_string(&e, "str ");
-	if (m->fname1)
-		ok &= emit_string(&e, "str ");
-	ok &= emit_string(&e, (m->is_string) ? "str\n" : "int\n");
+	if (m->is_percpu) {
+		ok &= emit_string(&e, "int int\n");
+	} else {
+		if (m->fname0)
+			ok &= emit_string(&e, "str ");
+		if (m->fname1)
+			ok &= emit_string(&e, "str ");
+		ok &= emit_string(&e, (m->is_string) ? "str\n" : "int\n");
+	}
 
 	/* Emit all or nothing. */
 	if (ok) {
@@ -640,6 +659,35 @@ struct metric *metric_register(const char *name,
 }
 EXPORT_SYMBOL(metric_register);
 
+static void metric_emit_percpu(struct metric_emitter *e)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		e->metric->percpu_fn(e, cpu);
+}
+
+struct metric *metric_register_percpu(const char *name,
+				struct metricfs_subsys *parent,
+				const char *description,
+				void (*fn)(struct metric_emitter *e, int cpu),
+				bool is_cumulative,
+				struct module *owner)
+{
+	struct metric *metric =
+		metric_register(name, parent, description,
+				"cpu", NULL,
+				metric_emit_percpu,
+				false,
+				is_cumulative, owner);
+	if (metric) {
+		metric->is_percpu = true;
+		metric->percpu_fn = fn;
+	}
+	return metric;
+}
+EXPORT_SYMBOL(metric_register_percpu);
+
 struct metric *metric_register_parm(const char *name,
 				    struct metricfs_subsys *parent,
 				    const char *description,
-- 
2.28.0.236.gb10cc79966-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 3/7] core/metricfs: metric for kernel warnings
  2020-08-07 21:29 [RFC PATCH 0/7] metricfs metric file system and examples Jonathan Adams
  2020-08-07 21:29 ` [RFC PATCH 1/7] core/metricfs: Create metricfs, standardized files under debugfs Jonathan Adams
  2020-08-07 21:29 ` [RFC PATCH 2/7] core/metricfs: add support for percpu metricfs files Jonathan Adams
@ 2020-08-07 21:29 ` Jonathan Adams
  2020-08-08  5:45   ` Greg KH
  2020-08-07 21:29 ` [RFC PATCH 4/7] core/metricfs: expose softirq information through metricfs Jonathan Adams
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Jonathan Adams @ 2020-08-07 21:29 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: netdev, kvm, Paolo Bonzini, Greg KH, Jim Mattson, David Rientjes,
	Jonathan Adams

Count kernel warnings by function name of the caller.

Each time WARN() is called, which includes WARN_ON(), increment a counter
in a 256-entry hash table. The table key is the entry point of the calling
function, which is found using kallsyms.

We store the name of the function in the table (because it may be a
module address); reporting the metric just walks the table and prints
the values.

The "warnings" metric is cumulative.

Signed-off-by: Jonathan Adams <jwadams@google.com>

---

jwadams@google.com: rebased to 5.8-rc6, removed google-isms,
	added lockdep_assert_held(), NMI handling, ..._unknown*_counts
	and locking in warn_tbl_fn(); renamed warn_metric... to
	warn_tbl...

	The original work was done in 2012 by an engineer no longer
	at Google.
---
 kernel/panic.c | 131 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 131 insertions(+)

diff --git a/kernel/panic.c b/kernel/panic.c
index e2157ca387c8..c019b41ab387 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -31,6 +31,9 @@
 #include <linux/bug.h>
 #include <linux/ratelimit.h>
 #include <linux/debugfs.h>
+#include <linux/utsname.h>
+#include <linux/hash.h>
+#include <linux/metricfs.h>
 #include <asm/sections.h>
 
 #define PANIC_TIMER_STEP 100
@@ -568,6 +571,133 @@ void oops_exit(void)
 	kmsg_dump(KMSG_DUMP_OOPS);
 }
 
+#ifdef CONFIG_METRICFS
+
+/*
+ * Hash table from function address to count of WARNs called within that
+ * function.
+ * So far this is an add-only hash table (ie, entries never removed), so some
+ * simplifying assumptions are made.
+ */
+#define WARN_TBL_BITS (8)
+#define WARN_TBL_SIZE (1<<WARN_TBL_BITS)
+static struct {
+	void *function;
+	int count;
+	char function_name[KSYM_NAME_LEN];
+} warn_tbl[WARN_TBL_SIZE];
+
+static DEFINE_SPINLOCK(warn_tbl_lock);
+static atomic_t warn_tbl_unknown_lookup_count = ATOMIC_INIT(0);
+static atomic_t warn_tbl_unknown_nmi_count = ATOMIC_INIT(0);
+static int warn_tbl_unknown_count;
+
+/*
+ * Find the entry corresponding to the given function address.
+ * Insert a new entry if one doesn't exist yet.
+ * Returns -1 if the hash table is full.
+ */
+static int tbl_find(void *caller_function)
+{
+	int entry, start_entry;
+
+	lockdep_assert_held(&warn_tbl_lock);
+
+	start_entry = hash_ptr(caller_function, WARN_TBL_BITS);
+	entry = start_entry;
+	do {
+		if (warn_tbl[entry].function == caller_function)
+			return entry;
+		if (warn_tbl[entry].function == NULL) {
+			if (!kallsyms_lookup((unsigned long)caller_function,
+					NULL, NULL, NULL,
+					warn_tbl[entry].function_name))
+				return -1;
+			warn_tbl[entry].function = caller_function;
+			return entry;
+		}
+		entry = (entry + 1) % (WARN_TBL_SIZE);
+	} while (entry != start_entry);
+
+	return -1;
+}
+
+static void tbl_increment(void *caller)
+{
+	void *caller_function;
+	unsigned long caller_offset;
+	unsigned long flags;
+	int entry;
+
+	if (!kallsyms_lookup_size_offset(
+			(unsigned long)caller, NULL, &caller_offset)) {
+		atomic_inc(&warn_tbl_unknown_lookup_count);
+		return;
+	}
+	/* use function entrypoint */
+	caller_function = caller - caller_offset;
+
+	if (in_nmi()) {
+		if (!spin_trylock_irqsave(&warn_tbl_lock, flags)) {
+			atomic_inc(&warn_tbl_unknown_nmi_count);
+			return;
+		}
+	} else {
+		spin_lock_irqsave(&warn_tbl_lock, flags);
+	}
+	entry = tbl_find(caller_function);
+	if (entry >= 0)
+		warn_tbl[entry].count++;
+	else
+		warn_tbl_unknown_count++;
+
+	spin_unlock_irqrestore(&warn_tbl_lock, flags);
+}
+
+/*
+ * Export the hash table to metricfs.
+ */
+static void warn_tbl_fn(struct metric_emitter *e)
+{
+	int i;
+	unsigned long flags;
+	int unknown_count = READ_ONCE(warn_tbl_unknown_count) +
+		atomic_read(&warn_tbl_unknown_nmi_count) +
+		atomic_read(&warn_tbl_unknown_lookup_count);
+
+	if (unknown_count != 0)
+		METRIC_EMIT_INT(e, unknown_count, "(unknown)", NULL);
+
+	spin_lock_irqsave(&warn_tbl_lock, flags);
+	for (i = 0; i < WARN_TBL_SIZE; i++) {
+		unsigned long fn = (unsigned long)warn_tbl[i].function;
+		const char *function_name = warn_tbl[i].function_name;
+		int count = warn_tbl[i].count;
+
+		if (!fn)
+			continue;
+
+		// function_name[] is constant once function is non-NULL
+		spin_unlock_irqrestore(&warn_tbl_lock, flags);
+		METRIC_EMIT_INT(e, count, function_name, NULL);
+		spin_lock_irqsave(&warn_tbl_lock, flags);
+	}
+	spin_unlock_irqrestore(&warn_tbl_lock, flags);
+}
+METRIC_EXPORT_COUNTER(warnings, "Count of calls to WARN().",
+		      "function", NULL, warn_tbl_fn);
+
+static int __init metricfs_panic_init(void)
+{
+	metric_init_warnings(NULL);
+	return 0;
+}
+late_initcall(metricfs_panic_init);
+
+#else  /* CONFIG_METRICFS */
+inline void tbl_increment(void *caller) {}
+#endif
+
 struct warn_args {
 	const char *fmt;
 	va_list args;
@@ -576,6 +706,7 @@ struct warn_args {
 void __warn(const char *file, int line, void *caller, unsigned taint,
 	    struct pt_regs *regs, struct warn_args *args)
 {
+	tbl_increment(caller);
 	disable_trace_on_warning();
 
 	if (file)
-- 
2.28.0.236.gb10cc79966-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 4/7] core/metricfs: expose softirq information through metricfs
  2020-08-07 21:29 [RFC PATCH 0/7] metricfs metric file system and examples Jonathan Adams
                   ` (2 preceding siblings ...)
  2020-08-07 21:29 ` [RFC PATCH 3/7] core/metricfs: metric for kernel warnings Jonathan Adams
@ 2020-08-07 21:29 ` Jonathan Adams
  2020-08-08  5:46   ` Greg KH
  2020-08-07 21:29 ` [RFC PATCH 5/7] core/metricfs: expose scheduler stat " Jonathan Adams
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Jonathan Adams @ 2020-08-07 21:29 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: netdev, kvm, Paolo Bonzini, Greg KH, Jim Mattson, David Rientjes,
	Jonathan Adams

Add metricfs support for displaying percpu softirq counters.  The
top directory is /sys/kernel/debug/metricfs/softirq.  Then there
is a subdirectory for each softirq type.  For example:

    cat /sys/kernel/debug/metricfs/softirq/NET_RX/values

Signed-off-by: Jonathan Adams <jwadams@google.com>

---

jwadams@google.com: rebased to 5.8-pre6
	This is work originally done by another engineer at
	google, who would rather not have their name associated with this
	patchset. They're okay with me sending it under my name.
---
 kernel/softirq.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index c4201b7f42b1..1ae3a540b789 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -25,6 +25,8 @@
 #include <linux/smpboot.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/jump_label.h>
+#include <linux/metricfs.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -738,3 +740,46 @@ unsigned int __weak arch_dynirq_lower_bound(unsigned int from)
 {
 	return from;
 }
+
+#ifdef CONFIG_METRICFS
+
+#define METRICFS_ITEM(name) \
+static void \
+metricfs_##name(struct metric_emitter *e, int cpu) \
+{ \
+	int64_t v = kstat_softirqs_cpu(name##_SOFTIRQ, cpu); \
+	METRIC_EMIT_PERCPU_INT(e, cpu, v); \
+} \
+METRIC_EXPORT_PERCPU_COUNTER(name, #name " softirq", metricfs_##name)
+
+METRICFS_ITEM(HI);
+METRICFS_ITEM(TIMER);
+METRICFS_ITEM(NET_TX);
+METRICFS_ITEM(NET_RX);
+METRICFS_ITEM(BLOCK);
+METRICFS_ITEM(IRQ_POLL);
+METRICFS_ITEM(TASKLET);
+METRICFS_ITEM(SCHED);
+METRICFS_ITEM(HRTIMER);
+METRICFS_ITEM(RCU);
+
+static int __init init_softirq_metricfs(void)
+{
+	struct metricfs_subsys *subsys;
+
+	subsys = metricfs_create_subsys("softirq", NULL);
+	metric_init_HI(subsys);
+	metric_init_TIMER(subsys);
+	metric_init_NET_TX(subsys);
+	metric_init_NET_RX(subsys);
+	metric_init_BLOCK(subsys);
+	metric_init_IRQ_POLL(subsys);
+	metric_init_TASKLET(subsys);
+	metric_init_SCHED(subsys);
+	metric_init_RCU(subsys);
+
+	return 0;
+}
+module_init(init_softirq_metricfs);
+
+#endif
-- 
2.28.0.236.gb10cc79966-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 5/7] core/metricfs: expose scheduler stat information through metricfs
  2020-08-07 21:29 [RFC PATCH 0/7] metricfs metric file system and examples Jonathan Adams
                   ` (3 preceding siblings ...)
  2020-08-07 21:29 ` [RFC PATCH 4/7] core/metricfs: expose softirq information through metricfs Jonathan Adams
@ 2020-08-07 21:29 ` Jonathan Adams
  2020-08-07 21:29 ` [RFC PATCH 6/7] core/metricfs: expose x86-specific irq " Jonathan Adams
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Jonathan Adams @ 2020-08-07 21:29 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: netdev, kvm, Paolo Bonzini, Greg KH, Jim Mattson, David Rientjes,
	Jonathan Adams

Add metricfs support for displaying percpu scheduler counters.
The top directory is /sys/kernel/debug/metricfs/stat (analogous
to /proc/stat).  Then there is a subdirectory for each scheduler
stat.  For example:

    cat /sys/kernel/debug/metricfs/stat/user/values

Signed-off-by: Jonathan Adams <jwadams@google.com>

---

jwadams@google.com: rebased to 5.8-pre6
	This is work originally done by another engineer at
	google, who would rather not have their name associated with this
	patchset. They're okay with me sending it under my name.
---
 fs/proc/stat.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 46b3293015fe..deb378507b0b 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -13,6 +13,7 @@
 #include <linux/irqnr.h>
 #include <linux/sched/cputime.h>
 #include <linux/tick.h>
+#include <linux/metricfs.h>
 
 #ifndef arch_irq_stat_cpu
 #define arch_irq_stat_cpu(cpu) 0
@@ -237,3 +238,59 @@ static int __init proc_stat_init(void)
 	return 0;
 }
 fs_initcall(proc_stat_init);
+
+#ifdef CONFIG_METRICFS
+#define METRICFS_ITEM(name, field, desc) \
+static void \
+metricfs_##name(struct metric_emitter *e, int cpu) \
+{ \
+	int64_t v = kcpustat_field(&kcpustat_cpu(cpu), field, cpu); \
+	METRIC_EMIT_PERCPU_INT(e, cpu, v); \
+} \
+METRIC_EXPORT_PERCPU_COUNTER(name, desc, metricfs_##name)
+
+#define METRICFS_FUNC_ITEM(name, func, desc) \
+static void \
+metricfs_##name(struct metric_emitter *e, int cpu) \
+{ \
+	struct kernel_cpustat cpustat; \
+	int64_t v; \
+	kcpustat_cpu_fetch(&cpustat, cpu); \
+	v = func(&cpustat, cpu); \
+	METRIC_EMIT_PERCPU_INT(e, cpu, v); \
+} \
+METRIC_EXPORT_PERCPU_COUNTER(name, desc, metricfs_##name)
+
+METRICFS_ITEM(user, CPUTIME_USER, "time in user mode (nsec)");
+METRICFS_ITEM(nice, CPUTIME_NICE, "time in user mode niced (nsec)");
+METRICFS_ITEM(system, CPUTIME_SYSTEM, "time in system calls (nsec)");
+METRICFS_ITEM(irq, CPUTIME_IRQ, "time in interrupts (nsec)");
+METRICFS_ITEM(softirq, CPUTIME_SOFTIRQ, "time in softirqs (nsec)");
+METRICFS_ITEM(steal, CPUTIME_STEAL, "time in involuntary wait (nsec)");
+METRICFS_ITEM(guest, CPUTIME_GUEST, "time in guest mode (nsec)");
+METRICFS_ITEM(guest_nice, CPUTIME_GUEST_NICE,
+	"time in guest mode niced (nsec)");
+METRICFS_FUNC_ITEM(idle, get_idle_time, "time in idle (nsec)");
+METRICFS_FUNC_ITEM(iowait, get_iowait_time, "time in iowait (nsec)");
+
+static int __init init_stat_metricfs(void)
+{
+	struct metricfs_subsys *subsys;
+
+	subsys = metricfs_create_subsys("stat", NULL);
+	metric_init_user(subsys);
+	metric_init_nice(subsys);
+	metric_init_system(subsys);
+	metric_init_irq(subsys);
+	metric_init_softirq(subsys);
+	metric_init_steal(subsys);
+	metric_init_guest(subsys);
+	metric_init_guest_nice(subsys);
+	metric_init_idle(subsys);
+	metric_init_iowait(subsys);
+
+	return 0;
+}
+module_init(init_stat_metricfs);
+
+#endif
-- 
2.28.0.236.gb10cc79966-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 6/7] core/metricfs: expose x86-specific irq information through metricfs
  2020-08-07 21:29 [RFC PATCH 0/7] metricfs metric file system and examples Jonathan Adams
                   ` (4 preceding siblings ...)
  2020-08-07 21:29 ` [RFC PATCH 5/7] core/metricfs: expose scheduler stat " Jonathan Adams
@ 2020-08-07 21:29 ` Jonathan Adams
  2020-08-13 10:11   ` Thomas Gleixner
  2020-08-07 21:29 ` [RFC PATCH 7/7] net-metricfs: Export /proc/net/dev via metricfs Jonathan Adams
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Jonathan Adams @ 2020-08-07 21:29 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: netdev, kvm, Paolo Bonzini, Greg KH, Jim Mattson, David Rientjes,
	Jonathan Adams

Add metricfs support for displaying percpu irq counters for x86.
The top directory is /sys/kernel/debug/metricfs/irq_x86.
Then there is a subdirectory for each x86-specific irq counter.
For example:

    cat /sys/kernel/debug/metricfs/irq_x86/TLB/values

Signed-off-by: Jonathan Adams <jwadams@google.com>

---

jwadams@google.com: rebased to 5.8-pre6
	This is work originally done by another engineer at
	google, who would rather not have their name associated with
	this patchset. They're okay with me sending it under my name.
---
 arch/x86/kernel/irq.c | 80 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 181060247e3c..ffacbbc4066c 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -12,6 +12,7 @@
 #include <linux/delay.h>
 #include <linux/export.h>
 #include <linux/irq.h>
+#include <linux/metricfs.h>
 
 #include <asm/irq_stack.h>
 #include <asm/apic.h>
@@ -374,3 +375,82 @@ void fixup_irqs(void)
 	}
 }
 #endif
+
+#ifdef CONFIG_METRICFS
+#define METRICFS_ITEM(name, field, desc) \
+static void \
+metricfs_##name(struct metric_emitter *e, int cpu) \
+{ \
+	int64_t v = irq_stats(cpu)->field; \
+	METRIC_EMIT_PERCPU_INT(e, cpu, v); \
+} \
+METRIC_EXPORT_PERCPU_COUNTER(name, desc, metricfs_##name)
+
+METRICFS_ITEM(NMI, __nmi_count, "Non-maskable interrupts");
+#ifdef CONFIG_X86_LOCAL_APIC
+METRICFS_ITEM(LOC, apic_timer_irqs, "Local timer interrupts");
+METRICFS_ITEM(SPU, irq_spurious_count, "Spurious interrupts");
+METRICFS_ITEM(PMI, apic_perf_irqs, "Performance monitoring interrupts");
+METRICFS_ITEM(IWI, apic_irq_work_irqs, "IRQ work interrupts");
+METRICFS_ITEM(RTR, icr_read_retry_count, "APIC ICR read retries");
+#endif
+METRICFS_ITEM(PLT, x86_platform_ipis, "Platform interrupts");
+#ifdef CONFIG_SMP
+METRICFS_ITEM(RES, irq_resched_count, "Rescheduling interrupts");
+METRICFS_ITEM(CAL, irq_call_count, "Function call interrupts");
+METRICFS_ITEM(TLB, irq_tlb_count, "TLB shootdowns");
+#endif
+#ifdef CONFIG_X86_THERMAL_VECTOR
+METRICFS_ITEM(TRM, irq_thermal_count, "Thermal event interrupts");
+#endif
+#ifdef CONFIG_X86_MCE_THRESHOLD
+METRICFS_ITEM(THR, irq_threshold_count, "Threshold APIC interrupts");
+#endif
+#ifdef CONFIG_X86_MCE_AMD
+METRICFS_ITEM(DFR, irq_deferred_error_count, "Deferred Error APIC interrupts");
+#endif
+#ifdef CONFIG_HAVE_KVM
+METRICFS_ITEM(PIN, kvm_posted_intr_ipis, "Posted-interrupt notification event");
+METRICFS_ITEM(PIW, kvm_posted_intr_wakeup_ipis,
+	"Posted-interrupt wakeup event");
+#endif
+
+static int __init init_irq_metricfs(void)
+{
+	struct metricfs_subsys *subsys;
+
+	subsys = metricfs_create_subsys("irq_x86", NULL);
+
+	metric_init_NMI(subsys);
+#ifdef CONFIG_X86_LOCAL_APIC
+	metric_init_LOC(subsys);
+	metric_init_SPU(subsys);
+	metric_init_PMI(subsys);
+	metric_init_IWI(subsys);
+	metric_init_RTR(subsys);
+#endif
+	metric_init_PLT(subsys);
+#ifdef CONFIG_SMP
+	metric_init_RES(subsys);
+	metric_init_CAL(subsys);
+	metric_init_TLB(subsys);
+#endif
+#ifdef CONFIG_X86_THERMAL_VECTOR
+	metric_init_TRM(subsys);
+#endif
+#ifdef CONFIG_X86_MCE_THRESHOLD
+	metric_init_THR(subsys);
+#endif
+#ifdef CONFIG_X86_MCE_AMD
+	metric_init_DFR(subsys);
+#endif
+#ifdef CONFIG_HAVE_KVM
+	metric_init_PIN(subsys);
+	metric_init_PIW(subsys);
+#endif
+
+	return 0;
+}
+module_init(init_irq_metricfs);
+
+#endif
-- 
2.28.0.236.gb10cc79966-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 7/7] net-metricfs: Export /proc/net/dev via metricfs.
  2020-08-07 21:29 [RFC PATCH 0/7] metricfs metric file system and examples Jonathan Adams
                   ` (5 preceding siblings ...)
  2020-08-07 21:29 ` [RFC PATCH 6/7] core/metricfs: expose x86-specific irq " Jonathan Adams
@ 2020-08-07 21:29 ` Jonathan Adams
  2020-08-08  2:06 ` [RFC PATCH 0/7] metricfs metric file system and examples Andrew Lunn
  2020-08-10  9:23 ` Pavel Machek
  8 siblings, 0 replies; 21+ messages in thread
From: Jonathan Adams @ 2020-08-07 21:29 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: netdev, kvm, Paolo Bonzini, Greg KH, Jim Mattson, David Rientjes,
	Jonathan Adams, Laurent Chavey

From: Laurent Chavey <chavey@google.com>

Export /proc/net/dev statistics via metricfs.

The implementation reports all the devices that are in the same
network namespace as the process reading metricfs.

The implementation does not report devices across network namespaces

Signed-off-by: Laurent Chavey <chavey@google.com>
[jwadams@google.com: ported code to 5.8-pre6, cleaned up googleisms ]
Signed-off-by: Jonathan Adams <jwadams@google.com>
---
 net/core/Makefile       |   1 +
 net/core/net_metricfs.c | 194 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 195 insertions(+)
 create mode 100644 net/core/net_metricfs.c

diff --git a/net/core/Makefile b/net/core/Makefile
index 3e2c378e5f31..7647380b9679 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_NET_PTP_CLASSIFY) += ptp_classifier.o
 obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
 obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
 obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
+obj-$(CONFIG_METRICFS) += net_metricfs.o
 obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o
 obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o
 obj-$(CONFIG_DST_CACHE) += dst_cache.o
diff --git a/net/core/net_metricfs.c b/net/core/net_metricfs.c
new file mode 100644
index 000000000000..82f0f797b0b0
--- /dev/null
+++ b/net/core/net_metricfs.c
@@ -0,0 +1,194 @@
+// SPDX-License-Identifier: GPL-2.0
+/* net_metricfs: Exports network counters using metricfs.
+ */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/metricfs.h>
+#include <linux/netdevice.h>
+#include <linux/nsproxy.h>
+#include <linux/rcupdate.h>
+#include <linux/stddef.h>
+#include <linux/types.h>
+#include <net/net_namespace.h>
+
+struct metric_def {
+	struct metric *metric;
+	size_t off;
+	char *name;
+	char *desc;
+};
+
+/* If needed, we could export this via a function for other /net users */
+static struct metricfs_subsys *net_root_subsys;
+static struct metricfs_subsys *dev_subsys;
+static struct metricfs_subsys *dev_stats_subsys;
+
+static struct metric_def metric_def[] = {
+	{NULL, offsetof(struct rtnl_link_stats64, rx_bytes),
+	 "rx_bytes", "net device received bytes count"},
+	{NULL, offsetof(struct rtnl_link_stats64, rx_packets),
+	 "rx_packets", "net device received packets count"},
+	{NULL, offsetof(struct rtnl_link_stats64, rx_errors),
+	 "rx_errors", "net device received errors count"},
+	{NULL, offsetof(struct rtnl_link_stats64, rx_dropped),
+	 "rx_dropped", "net device dropped packets count"},
+	{NULL, offsetof(struct rtnl_link_stats64, rx_missed_errors),
+	 "rx_missed_errors",  "net device missed errors count"},
+	{NULL, offsetof(struct rtnl_link_stats64, rx_fifo_errors),
+	 "rx_fifo_errors", "net device fifo errors count"},
+	{NULL, offsetof(struct rtnl_link_stats64, rx_length_errors),
+	 "rx_length_errors", "net device length errors count"},
+	{NULL, offsetof(struct rtnl_link_stats64, rx_over_errors),
+	 "rx_over_errors", "net device received overflow errors count"},
+	{NULL, offsetof(struct rtnl_link_stats64, rx_crc_errors),
+	 "rx_crc_errors", "net device received crc errors count"},
+	{NULL, offsetof(struct rtnl_link_stats64, rx_frame_errors),
+	 "rx_frame_errors", "net device received frame errors count"},
+	{NULL, offsetof(struct rtnl_link_stats64, rx_compressed),
+	 "rx_compressed", "net device received compressed packet count"},
+	{NULL, offsetof(struct rtnl_link_stats64, multicast),
+	 "rx_multicast", "net device received multicast packet count"},
+	{NULL, offsetof(struct rtnl_link_stats64, tx_bytes),
+	 "tx_bytes", "net device transmited bytes count"},
+	{NULL, offsetof(struct rtnl_link_stats64, tx_packets),
+	 "tx_packets", "net device transmited packets count"},
+	{NULL, offsetof(struct rtnl_link_stats64, tx_errors),
+	 "tx_errors", "net device transmited errors count"},
+	{NULL, offsetof(struct rtnl_link_stats64, tx_dropped),
+	 "tx_dropped", "net device transmited packet drop count"},
+	{NULL, offsetof(struct rtnl_link_stats64, tx_fifo_errors),
+	 "tx_fifo_errors", "net device transmit fifo errors count"},
+	{NULL, offsetof(struct rtnl_link_stats64, collisions),
+	 "tx_collision", "net device transmit collisions count"},
+	{NULL, offsetof(struct rtnl_link_stats64, tx_carrier_errors),
+	 "tx_carrier_errors", "net device transmit carrier errors count"},
+	{NULL, offsetof(struct rtnl_link_stats64, tx_aborted_errors),
+	 "tx_aborted_errors", "net device transmit aborted errors count"},
+	{NULL, offsetof(struct rtnl_link_stats64, tx_window_errors),
+	 "tx_window_errors", "net device transmit window errors count"},
+	{NULL, offsetof(struct rtnl_link_stats64, tx_heartbeat_errors),
+	 "tx_heartbeat_errors", "net device transmit heartbeat errors count"},
+	{NULL, offsetof(struct rtnl_link_stats64, tx_compressed),
+	 "tx_compressed_errors", "net device transmit compressed count"},
+};
+
+static __init int init_net_subsys(void)
+{
+	net_root_subsys = metricfs_create_subsys("net", NULL);
+	if (!net_root_subsys) {
+		WARN_ONCE(1, "Net metricfs root not created.");
+		return -1;
+	}
+	return 0;
+}
+
+late_initcall(init_net_subsys);
+
+static void dev_stats_emit(struct metric_emitter *e,
+			   struct net_device *dev,
+			   struct metric_def *metricd)
+{
+	struct rtnl_link_stats64 temp;
+	const struct rtnl_link_stats64 *stats = dev_get_stats(dev, &temp);
+
+	if (stats) {
+		__u8 *ptr = (((__u8 *)stats) + metricd->off);
+
+		METRIC_EMIT_INT(e, *(__u64 *)ptr, dev->name, NULL);
+	}
+}
+
+/* metricfs export function */
+static void dev_stats_fn(struct metric_emitter *e, void *parm)
+{
+	struct net_device *dev;
+	struct net *net;
+	struct nsproxy *nsproxy = current->nsproxy;
+
+	rcu_read_lock();
+	for_each_net_rcu(net) {
+		/* skip namespaces not associated with the caller */
+		if (nsproxy->net_ns != net)
+			continue;
+		for_each_netdev_rcu(net, dev) {
+			dev_stats_emit(e, dev, (struct metric_def *)parm);
+		}
+	}
+	rcu_read_unlock();
+}
+
+static void clean_dev_stats_subsys(void)
+{
+	int x;
+	int metric_count = sizeof(metric_def) / sizeof(struct metric_def);
+
+	for (x = 0; x < metric_count; x++) {
+		if (metric_def[x].metric) {
+			metric_unregister(metric_def[x].metric);
+			metric_def[x].metric = NULL;
+		}
+	}
+	if (dev_stats_subsys)
+		metricfs_destroy_subsys(dev_stats_subsys);
+	if (dev_subsys)
+		metricfs_destroy_subsys(dev_subsys);
+	dev_stats_subsys = NULL;
+	dev_subsys = NULL;
+}
+
+static int __init init_dev_stats_subsys(void)
+{
+	int x;
+	int metric_count = sizeof(metric_def) / sizeof(struct metric_def);
+
+	dev_subsys = NULL;
+	dev_stats_subsys = NULL;
+	if (!net_root_subsys) {
+		WARN_ONCE(1, "Net metricfs root not initialized.");
+		goto error;
+	}
+	dev_subsys =
+		metricfs_create_subsys("dev", net_root_subsys);
+	if (!dev_subsys) {
+		WARN_ONCE(1, "Net metricfs dev not created.");
+		goto error;
+	}
+	dev_stats_subsys =
+		metricfs_create_subsys("stats", dev_subsys);
+	if (!dev_stats_subsys) {
+		WARN_ONCE(1, "Dev metricfs stats not created.");
+		goto error;
+	}
+
+	/* initialize each of the metrics */
+	for (x = 0; x < metric_count; x++) {
+		metric_def[x].metric =
+			metric_register_parm(metric_def[x].name,
+					     dev_stats_subsys,
+					     metric_def[x].desc,
+					     "interface",
+					     NULL,
+					     dev_stats_fn,
+					     (void *)&metric_def[x],
+					     false,
+					     true,  /* this is a counter */
+					     THIS_MODULE);
+		if (!metric_def[x].metric) {
+			WARN_ONCE(1, "Dev metricfs stats %s not registered.",
+				  metric_def[x].name);
+			goto error;
+		}
+	}
+	return 0;
+error:
+	clean_dev_stats_subsys();
+	return -1;
+}
+
+/* need to wait for metricfs and net metricfs root to be initialized */
+late_initcall_sync(init_dev_stats_subsys);
+
+static void __exit dev_stats_exit(void)
+{
+	clean_dev_stats_subsys();
+}
-- 
2.28.0.236.gb10cc79966-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/7] metricfs metric file system and examples
  2020-08-07 21:29 [RFC PATCH 0/7] metricfs metric file system and examples Jonathan Adams
                   ` (6 preceding siblings ...)
  2020-08-07 21:29 ` [RFC PATCH 7/7] net-metricfs: Export /proc/net/dev via metricfs Jonathan Adams
@ 2020-08-08  2:06 ` Andrew Lunn
  2020-08-08 15:59   ` David Ahern
  2020-08-10  9:23 ` Pavel Machek
  8 siblings, 1 reply; 21+ messages in thread
From: Andrew Lunn @ 2020-08-08  2:06 UTC (permalink / raw)
  To: Jonathan Adams
  Cc: linux-kernel, linux-fsdevel, netdev, kvm, Paolo Bonzini, Greg KH,
	Jim Mattson, David Rientjes

> net/dev/stats/tx_bytes/annotations
>   DESCRIPTION net\ device\ transmited\ bytes\ count
>   CUMULATIVE
> net/dev/stats/tx_bytes/fields
>   interface value
>   str int
> net/dev/stats/tx_bytes/values
>   lo 4394430608
>   eth0 33353183843
>   eth1 16228847091

This is a rather small system. Have you tested it at scale? An
Ethernet switch with 64 physical interfaces, and say 32 VLAN
interfaces stack on top. So this one file will contain 2048 entries?

And generally, you are not interested in one statistic, but many
statistics. So you will need to cat each file, not just one file. And
the way this is implemented:

+static void dev_stats_emit(struct metric_emitter *e,
+                          struct net_device *dev,
+                          struct metric_def *metricd)
+{
+       struct rtnl_link_stats64 temp;
+       const struct rtnl_link_stats64 *stats = dev_get_stats(dev, &temp);
+
+       if (stats) {
+               __u8 *ptr = (((__u8 *)stats) + metricd->off);
+
+               METRIC_EMIT_INT(e, *(__u64 *)ptr, dev->name, NULL);
+       }
+}

means you are going to be calling dev_get_stats() for each file, and
there are 23 files if i counted correctly. So dev_get_stats() will be
called 47104 times, in this made up example. And this is not always
cheap, these counts can be atomic.

So i personally don't think netdev statistics is a good idea, i doubt
it scales.

I also think you are looking at the wrong set of netdev counters. I
would be more interested in ethtool -S counters. But it appears you
make the assumption that each object you are collecting metrics for
has the same set of counters. This is untrue for network interfaces,
where each driver can export whatever counters it wants, and they can
be dynamic.

	Andrew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/7] core/metricfs: Create metricfs, standardized files under debugfs.
  2020-08-07 21:29 ` [RFC PATCH 1/7] core/metricfs: Create metricfs, standardized files under debugfs Jonathan Adams
@ 2020-08-08  5:41   ` Greg KH
  0 siblings, 0 replies; 21+ messages in thread
From: Greg KH @ 2020-08-08  5:41 UTC (permalink / raw)
  To: Jonathan Adams
  Cc: linux-kernel, linux-fsdevel, netdev, kvm, Paolo Bonzini,
	Jim Mattson, David Rientjes, Justin TerAvest

debugfs interaction nits:

On Fri, Aug 07, 2020 at 02:29:10PM -0700, Jonathan Adams wrote:
> +static struct dentry *metricfs_init_dentry(void)
> +{
> +	static int once;
> +
> +	if (d_metricfs)
> +		return d_metricfs;
> +
> +	if (!debugfs_initialized())
> +		return NULL;
> +
> +	d_metricfs = debugfs_create_dir("metricfs", NULL);
> +
> +	if (!d_metricfs && !once) {

As it is impossible for d_metricfs to ever be NULL, why are you checking
it?

> +		once = 1;
> +		pr_warn("Could not create debugfs directory 'metricfs'\n");

There is a pr_warn_once I think, but again, how can this ever trigger?

> +		return NULL;
> +	}
> +
> +	return d_metricfs;
> +}
> +
> +/* We always cast in and out to struct dentry. */
> +struct metricfs_subsys {
> +	struct dentry dentry;
> +};
> +
> +static struct dentry *metricfs_create_file(const char *name,
> +					   mode_t mode,
> +					   struct dentry *parent,
> +					   void *data,
> +					   const struct file_operations *fops)
> +{
> +	struct dentry *ret;
> +
> +	ret = debugfs_create_file(name, mode, parent, data, fops);
> +	if (!ret)
> +		pr_warn("Could not create debugfs '%s' entry\n", name);

As ret can never be NULL, why check?

There is no need to ever check debugfs return values, just keep on
going, that's the design of it.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/7] core/metricfs: add support for percpu metricfs files
  2020-08-07 21:29 ` [RFC PATCH 2/7] core/metricfs: add support for percpu metricfs files Jonathan Adams
@ 2020-08-08  5:43   ` Greg KH
  0 siblings, 0 replies; 21+ messages in thread
From: Greg KH @ 2020-08-08  5:43 UTC (permalink / raw)
  To: Jonathan Adams
  Cc: linux-kernel, linux-fsdevel, netdev, kvm, Paolo Bonzini,
	Jim Mattson, David Rientjes

On Fri, Aug 07, 2020 at 02:29:11PM -0700, Jonathan Adams wrote:
> Add a simple mechanism for exporting percpu data through metricfs.
> The API follows the existing metricfs pattern.  A percpu file is
> defined with:
> 
>     METRIC_EXPORT_PERCPU_INT(name, desc, fn)
>     METRIC_EXPORT_PERCPU_COUNTER(name, desc, fn)
> 
> The first defines a file for exposing a percpu int.  The second is
> similar, but is for a counter that accumulates since boot.  The
> 'name' is used as the metricfs file.  The 'desc' is a description
> of the metric.  The 'fn' is a callback function to emit a single
> percpu value:
> 
>     void (*fn)(struct metric_emitter *e, int cpu);
> 
> The callback must call METRIC_EMIT_PERCPU_INT with the value for
> the specified CPU.
> 
> Signed-off-by: Jonathan Adams <jwadams@google.com>
> 
> ---
> 
> jwadams@google.com: rebased to 5.6-pre6, renamed funcs to start with
> 	metric_.  This is work originally done by another engineer at
> 	google, who would rather not have their name associated with this
> 	patchset. They're okay with me sending it under my name.
> ---
>  include/linux/metricfs.h | 28 +++++++++++++++++++
>  kernel/metricfs.c        | 58 ++++++++++++++++++++++++++++++++++++----

fs/metricfs/ ?  This isn't a kernel "core" feature.

Or just put it in fs/debugfs/ and tack it along with one of the debugfs
helper functions to make it easier for everyone to use these (if they
actually are valuable.  It's hard to see how this differs from any other
debugfs interface today.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/7] core/metricfs: metric for kernel warnings
  2020-08-07 21:29 ` [RFC PATCH 3/7] core/metricfs: metric for kernel warnings Jonathan Adams
@ 2020-08-08  5:45   ` Greg KH
  0 siblings, 0 replies; 21+ messages in thread
From: Greg KH @ 2020-08-08  5:45 UTC (permalink / raw)
  To: Jonathan Adams
  Cc: linux-kernel, linux-fsdevel, netdev, kvm, Paolo Bonzini,
	Jim Mattson, David Rientjes

On Fri, Aug 07, 2020 at 02:29:12PM -0700, Jonathan Adams wrote:
> Count kernel warnings by function name of the caller.
> 
> Each time WARN() is called, which includes WARN_ON(), increment a counter
> in a 256-entry hash table. The table key is the entry point of the calling
> function, which is found using kallsyms.

Why is this needed?

As systems seem to like to reboot when WARN() is called, will this only
ever show 1?  :)

> 
> We store the name of the function in the table (because it may be a
> module address); reporting the metric just walks the table and prints
> the values.
> 
> The "warnings" metric is cumulative.

If you are creating specific files in a specific location that people
can rely on, shouldn't they show up in Documentation/ABI/ as well?

But again, is this feature something that anyone really needs/wants?
What can the number of warnings show you?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/7] core/metricfs: expose softirq information through metricfs
  2020-08-07 21:29 ` [RFC PATCH 4/7] core/metricfs: expose softirq information through metricfs Jonathan Adams
@ 2020-08-08  5:46   ` Greg KH
  0 siblings, 0 replies; 21+ messages in thread
From: Greg KH @ 2020-08-08  5:46 UTC (permalink / raw)
  To: Jonathan Adams
  Cc: linux-kernel, linux-fsdevel, netdev, kvm, Paolo Bonzini,
	Jim Mattson, David Rientjes

On Fri, Aug 07, 2020 at 02:29:13PM -0700, Jonathan Adams wrote:
> Add metricfs support for displaying percpu softirq counters.  The
> top directory is /sys/kernel/debug/metricfs/softirq.  Then there
> is a subdirectory for each softirq type.  For example:
> 
>     cat /sys/kernel/debug/metricfs/softirq/NET_RX/values
> 
> Signed-off-by: Jonathan Adams <jwadams@google.com>
> 
> ---
> 
> jwadams@google.com: rebased to 5.8-pre6
> 	This is work originally done by another engineer at
> 	google, who would rather not have their name associated with this
> 	patchset. They're okay with me sending it under my name.
> ---
>  kernel/softirq.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index c4201b7f42b1..1ae3a540b789 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -25,6 +25,8 @@
>  #include <linux/smpboot.h>
>  #include <linux/tick.h>
>  #include <linux/irq.h>
> +#include <linux/jump_label.h>
> +#include <linux/metricfs.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/irq.h>
> @@ -738,3 +740,46 @@ unsigned int __weak arch_dynirq_lower_bound(unsigned int from)
>  {
>  	return from;
>  }
> +
> +#ifdef CONFIG_METRICFS
> +
> +#define METRICFS_ITEM(name) \
> +static void \
> +metricfs_##name(struct metric_emitter *e, int cpu) \
> +{ \
> +	int64_t v = kstat_softirqs_cpu(name##_SOFTIRQ, cpu); \
> +	METRIC_EMIT_PERCPU_INT(e, cpu, v); \
> +} \
> +METRIC_EXPORT_PERCPU_COUNTER(name, #name " softirq", metricfs_##name)
> +
> +METRICFS_ITEM(HI);
> +METRICFS_ITEM(TIMER);
> +METRICFS_ITEM(NET_TX);
> +METRICFS_ITEM(NET_RX);
> +METRICFS_ITEM(BLOCK);
> +METRICFS_ITEM(IRQ_POLL);
> +METRICFS_ITEM(TASKLET);
> +METRICFS_ITEM(SCHED);
> +METRICFS_ITEM(HRTIMER);
> +METRICFS_ITEM(RCU);
> +
> +static int __init init_softirq_metricfs(void)
> +{
> +	struct metricfs_subsys *subsys;
> +
> +	subsys = metricfs_create_subsys("softirq", NULL);
> +	metric_init_HI(subsys);
> +	metric_init_TIMER(subsys);
> +	metric_init_NET_TX(subsys);
> +	metric_init_NET_RX(subsys);
> +	metric_init_BLOCK(subsys);
> +	metric_init_IRQ_POLL(subsys);
> +	metric_init_TASKLET(subsys);
> +	metric_init_SCHED(subsys);
> +	metric_init_RCU(subsys);
> +
> +	return 0;
> +}
> +module_init(init_softirq_metricfs);

I like the "simple" ways these look, and think you will be better off
just adding this type of api to debugfs.  That way people can use them
anywhere they currently use debugfs.

But note, we already have simple ways of exporting single variable data
in debugfs, so why do we need yet-another-macro for them?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/7] metricfs metric file system and examples
  2020-08-08  2:06 ` [RFC PATCH 0/7] metricfs metric file system and examples Andrew Lunn
@ 2020-08-08 15:59   ` David Ahern
  2020-08-10 18:20     ` Jakub Kicinski
  0 siblings, 1 reply; 21+ messages in thread
From: David Ahern @ 2020-08-08 15:59 UTC (permalink / raw)
  To: Andrew Lunn, Jonathan Adams
  Cc: linux-kernel, linux-fsdevel, netdev, kvm, Paolo Bonzini, Greg KH,
	Jim Mattson, David Rientjes

On 8/7/20 8:06 PM, Andrew Lunn wrote:
> So i personally don't think netdev statistics is a good idea, i doubt
> it scales.

+1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/7] metricfs metric file system and examples
  2020-08-07 21:29 [RFC PATCH 0/7] metricfs metric file system and examples Jonathan Adams
                   ` (7 preceding siblings ...)
  2020-08-08  2:06 ` [RFC PATCH 0/7] metricfs metric file system and examples Andrew Lunn
@ 2020-08-10  9:23 ` Pavel Machek
  8 siblings, 0 replies; 21+ messages in thread
From: Pavel Machek @ 2020-08-10  9:23 UTC (permalink / raw)
  To: Jonathan Adams
  Cc: linux-kernel, linux-fsdevel, netdev, kvm, Paolo Bonzini, Greg KH,
	Jim Mattson, David Rientjes

On Fri 2020-08-07 14:29:09, Jonathan Adams wrote:
> [resending to widen the CC lists per rdunlap@infradead.org's suggestion
> original posting to lkml here: https://lkml.org/lkml/2020/8/5/1009]
> 
> To try to restart the discussion of kernel statistics started by the
> statsfs patchsets (https://lkml.org/lkml/2020/5/26/332), I wanted
> to share the following set of patches which are Google's 'metricfs'
> implementation and some example uses.  Google has been using metricfs
> internally since 2012 as a way to export various statistics to our
> telemetry systems (similar to OpenTelemetry), and we have over 200
> statistics exported on a typical machine.
> 
> These patches have been cleaned up and modernized v.s. the versions
> in production; I've included notes under the fold in the patches.
> They're based on v5.8-rc6.
> 
> The statistics live under debugfs, in a tree rooted at:
> 
> 	/sys/kernel/debug/metricfs

Is debugfs right place for this? It looks like something where people
would expect compatibility guarantees...

								Pavel

-- 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/7] metricfs metric file system and examples
  2020-08-08 15:59   ` David Ahern
@ 2020-08-10 18:20     ` Jakub Kicinski
  0 siblings, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2020-08-10 18:20 UTC (permalink / raw)
  To: David Ahern
  Cc: Andrew Lunn, Jonathan Adams, linux-kernel, linux-fsdevel, netdev,
	kvm, Paolo Bonzini, Greg KH, Jim Mattson, David Rientjes

On Sat, 8 Aug 2020 09:59:34 -0600 David Ahern wrote:
> On 8/7/20 8:06 PM, Andrew Lunn wrote:
> > So i personally don't think netdev statistics is a good idea, i doubt
> > it scales.  
> 
> +1

+1

Please stop using networking as the example for this.

We don't want file interfaces for stats, and we already made that very
clear last time.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 6/7] core/metricfs: expose x86-specific irq information through metricfs
  2020-08-07 21:29 ` [RFC PATCH 6/7] core/metricfs: expose x86-specific irq " Jonathan Adams
@ 2020-08-13 10:11   ` Thomas Gleixner
  2020-08-13 11:47     ` Paolo Bonzini
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2020-08-13 10:11 UTC (permalink / raw)
  To: Jonathan Adams, linux-kernel, linux-fsdevel
  Cc: netdev, kvm, Paolo Bonzini, Greg KH, Jim Mattson, David Rientjes,
	Jonathan Adams

Jonathan Adams <jwadams@google.com> writes:

How is that related to core? The x86 subsys prefix is 'x86' and for this
particular thing it's 'x86/irq:'. That applies to the rest of the series
as well. 

> Add metricfs support for displaying percpu irq counters for x86.
> The top directory is /sys/kernel/debug/metricfs/irq_x86.
> Then there is a subdirectory for each x86-specific irq counter.
> For example:
>
>    cat /sys/kernel/debug/metricfs/irq_x86/TLB/values

What is 'TLB'? I'm not aware of any vector which is named TLB.

The changelog is pretty useless in providing any form of rationale for
this. It tells the WHAT, but not the WHY.

Also what is does this file contain? Aggregates, one line per CPU or the
value of the random CPU of the day? I'm not going to dive into the macro
zoo to figure that out.

> jwadams@google.com: rebased to 5.8-pre6
> 	This is work originally done by another engineer at
> 	google, who would rather not have their name associated with
> 	this patchset. They're okay with me sending it under my name.

I can understand why they wont have their name associated with this.

> +#ifdef CONFIG_METRICFS
> +METRICFS_ITEM(NMI, __nmi_count, "Non-maskable interrupts");
> +#ifdef CONFIG_X86_LOCAL_APIC
> +METRICFS_ITEM(LOC, apic_timer_irqs, "Local timer interrupts");
> +METRICFS_ITEM(SPU, irq_spurious_count, "Spurious interrupts");
> +METRICFS_ITEM(PMI, apic_perf_irqs, "Performance monitoring interrupts");
> +METRICFS_ITEM(IWI, apic_irq_work_irqs, "IRQ work interrupts");
> +METRICFS_ITEM(RTR, icr_read_retry_count, "APIC ICR read retries");
> +#endif
....

So you are adding NR_CPUS * NR_DIRECT_VECTORS debugfs files which show
exactly the same information as /proc/interrupts, right? 

Aside of that _all_ of this information is available via tracepoints as
well.

That's NR_CPUS * 15 and incomplete because x86 has 23 of those directly
handled vectors which do not go through the irq core. So with just 15
and 256 CPUs that's 3840 files.

Impressive number especially without any information why this is useful
and provides value over the existing mechanisms to retrieve exactly the
same information.

The cover letter talks a lot about who Google finds this useful, but
that's not really a convincing argument for this metric failsystem
addon.

Thanks,

        tglx




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 6/7] core/metricfs: expose x86-specific irq information through metricfs
  2020-08-13 10:11   ` Thomas Gleixner
@ 2020-08-13 11:47     ` Paolo Bonzini
  2020-08-13 12:13       ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Paolo Bonzini @ 2020-08-13 11:47 UTC (permalink / raw)
  To: Thomas Gleixner, Jonathan Adams, linux-kernel, linux-fsdevel
  Cc: netdev, kvm, Greg KH, Jim Mattson, David Rientjes

On 13/08/20 12:11, Thomas Gleixner wrote:
>> Add metricfs support for displaying percpu irq counters for x86.
>> The top directory is /sys/kernel/debug/metricfs/irq_x86.
>> Then there is a subdirectory for each x86-specific irq counter.
>> For example:
>>
>>    cat /sys/kernel/debug/metricfs/irq_x86/TLB/values
> What is 'TLB'? I'm not aware of any vector which is named TLB.

There's a "TLB" entry in /proc/interrupts.

Paolo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 6/7] core/metricfs: expose x86-specific irq information through metricfs
  2020-08-13 11:47     ` Paolo Bonzini
@ 2020-08-13 12:13       ` Thomas Gleixner
  2020-08-13 14:10         ` Paolo Bonzini
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2020-08-13 12:13 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Adams, linux-kernel, linux-fsdevel
  Cc: netdev, kvm, Greg KH, Jim Mattson, David Rientjes

Paolo Bonzini <pbonzini@redhat.com> writes:

> On 13/08/20 12:11, Thomas Gleixner wrote:
>>> Add metricfs support for displaying percpu irq counters for x86.
>>> The top directory is /sys/kernel/debug/metricfs/irq_x86.
>>> Then there is a subdirectory for each x86-specific irq counter.
>>> For example:
>>>
>>>    cat /sys/kernel/debug/metricfs/irq_x86/TLB/values
>> What is 'TLB'? I'm not aware of any vector which is named TLB.
>
> There's a "TLB" entry in /proc/interrupts.

It's TLB shootdowns and not TLB.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 6/7] core/metricfs: expose x86-specific irq information through metricfs
  2020-08-13 12:13       ` Thomas Gleixner
@ 2020-08-13 14:10         ` Paolo Bonzini
  2020-08-13 14:21           ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Paolo Bonzini @ 2020-08-13 14:10 UTC (permalink / raw)
  To: Thomas Gleixner, Jonathan Adams, linux-kernel, linux-fsdevel
  Cc: netdev, kvm, Greg KH, Jim Mattson, David Rientjes

On 13/08/20 14:13, Thomas Gleixner wrote:
>>>> Add metricfs support for displaying percpu irq counters for x86.
>>>> The top directory is /sys/kernel/debug/metricfs/irq_x86.
>>>> Then there is a subdirectory for each x86-specific irq counter.
>>>> For example:
>>>>
>>>>    cat /sys/kernel/debug/metricfs/irq_x86/TLB/values
>>> What is 'TLB'? I'm not aware of any vector which is named TLB.
>> There's a "TLB" entry in /proc/interrupts.
> It's TLB shootdowns and not TLB.

Yes but it's using the shortcut name on the left of the table.

> +METRICFS_ITEM(LOC, apic_timer_irqs, "Local timer interrupts");
> +METRICFS_ITEM(SPU, irq_spurious_count, "Spurious interrupts");
> +METRICFS_ITEM(PMI, apic_perf_irqs, "Performance monitoring interrupts");
> +METRICFS_ITEM(IWI, apic_irq_work_irqs, "IRQ work interrupts");
> +METRICFS_ITEM(RTR, icr_read_retry_count, "APIC ICR read retries");
> +#endif
> +METRICFS_ITEM(PLT, x86_platform_ipis, "Platform interrupts");
> +#ifdef CONFIG_SMP
> +METRICFS_ITEM(RES, irq_resched_count, "Rescheduling interrupts");
> +METRICFS_ITEM(CAL, irq_call_count, "Function call interrupts");
> +METRICFS_ITEM(TLB, irq_tlb_count, "TLB shootdowns");

Paolo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 6/7] core/metricfs: expose x86-specific irq information through metricfs
  2020-08-13 14:10         ` Paolo Bonzini
@ 2020-08-13 14:21           ` Thomas Gleixner
  0 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2020-08-13 14:21 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Adams, linux-kernel, linux-fsdevel
  Cc: netdev, kvm, Greg KH, Jim Mattson, David Rientjes

Paolo Bonzini <pbonzini@redhat.com> writes:
> On 13/08/20 14:13, Thomas Gleixner wrote:
>>>>>    cat /sys/kernel/debug/metricfs/irq_x86/TLB/values
>>>> What is 'TLB'? I'm not aware of any vector which is named TLB.
>>> There's a "TLB" entry in /proc/interrupts.
>> It's TLB shootdowns and not TLB.
>
> Yes but it's using the shortcut name on the left of the table.

Fair enough, that's the first column in /proc/interrupts. I totally
missed the explanation in the elaborate changelog.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-08-13 14:21 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-07 21:29 [RFC PATCH 0/7] metricfs metric file system and examples Jonathan Adams
2020-08-07 21:29 ` [RFC PATCH 1/7] core/metricfs: Create metricfs, standardized files under debugfs Jonathan Adams
2020-08-08  5:41   ` Greg KH
2020-08-07 21:29 ` [RFC PATCH 2/7] core/metricfs: add support for percpu metricfs files Jonathan Adams
2020-08-08  5:43   ` Greg KH
2020-08-07 21:29 ` [RFC PATCH 3/7] core/metricfs: metric for kernel warnings Jonathan Adams
2020-08-08  5:45   ` Greg KH
2020-08-07 21:29 ` [RFC PATCH 4/7] core/metricfs: expose softirq information through metricfs Jonathan Adams
2020-08-08  5:46   ` Greg KH
2020-08-07 21:29 ` [RFC PATCH 5/7] core/metricfs: expose scheduler stat " Jonathan Adams
2020-08-07 21:29 ` [RFC PATCH 6/7] core/metricfs: expose x86-specific irq " Jonathan Adams
2020-08-13 10:11   ` Thomas Gleixner
2020-08-13 11:47     ` Paolo Bonzini
2020-08-13 12:13       ` Thomas Gleixner
2020-08-13 14:10         ` Paolo Bonzini
2020-08-13 14:21           ` Thomas Gleixner
2020-08-07 21:29 ` [RFC PATCH 7/7] net-metricfs: Export /proc/net/dev via metricfs Jonathan Adams
2020-08-08  2:06 ` [RFC PATCH 0/7] metricfs metric file system and examples Andrew Lunn
2020-08-08 15:59   ` David Ahern
2020-08-10 18:20     ` Jakub Kicinski
2020-08-10  9:23 ` Pavel Machek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).