LKML Archive on lore.kernel.org
* [PATCH RFC bpf-next v2 0/5] Extend cgroup interface with bpf
@ 2022-02-01 20:55 Hao Luo
  2022-02-01 20:55 ` [PATCH RFC bpf-next v2 1/5] bpf: Bpffs directory tag Hao Luo
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Hao Luo @ 2022-02-01 20:55 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, linux-kernel,
	Hao Luo

This patchset introduces a new program type to extend the cgroup interfaces.
It extends the bpf filesystem (bpffs) to allow creating a directory
hierarchy that tracks any kernfs hierarchy, in particular cgroupfs. Each
subdirectory in this hierarchy keeps a reference to the corresponding
kernfs node when created. This is done by associating a new data
structure (called a "tag" in this patchset) with bpffs directory inodes.
A file inode in bpffs holds a reference to a bpf object, while a directory
inode may point to a tag, which holds a kernfs node.

A bpf object can be pinned in these directories, and objects can opt
into inheritance in tagged directories. This patchset introduces a new
program type, "cgroup_view", which supports inheritance. More
specifically, when a link to a cgroup_view prog is pinned in a bpffs
directory, it tags the directory and connects it to the root cgroup.
Subdirectories created underneath have to match a subcgroup and, when
created, inherit the pinned cgroup_view link from the parent directory.

The pinned cgroup_view objects can be read as files. When such an object
is read, it tries to take a reference on the cgroup its parent directory
is matched to; if that reference cannot be acquired, the cgroup_view
prog is not run. Users implement a cgroup_view program to define what is
printed to the file, given the cgroup object.

See patch 5/5 for an example of how this works.
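As an illustration, the intended flow from userspace looks roughly like
the sketch below. This is based on the description above and requires a
kernel with this patchset applied; the path names and the pinning step
(done by a loader program via bpf_obj_pin()) are illustrative.

```
# Sketch only: requires a kernel with this RFC applied.
# 1) A loader pins a cgroup_view link at the top of a bpffs directory,
#    e.g. bpf_obj_pin(link_fd, "/sys/fs/bpf/A/obj").
#    This tags A and connects it to the root cgroup.

# 2) Mirror a cgroup manually; the bpffs name must match a subcgroup.
mkdir /sys/fs/cgroup/B      # the real cgroup
mkdir /sys/fs/bpf/A/B       # its mirror; inherits the pinned link

# 3) The inherited object shows up as a file; reading it runs the
#    cgroup_view prog against cgroup B.
ls /sys/fs/bpf/A/B          # -> obj
cat /sys/fs/bpf/A/B/obj
```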

Userspace has to manually create/remove directories in bpffs to mirror
the cgroup hierarchy. It was suggested to use overlayfs to create a
hierarchy that combines cgroupfs and bpffs. But after some
experiments, I found overlayfs is not intended to be used this way:
overlayfs assumes the underlying filesystem will not change [1], but our
underlying fs (i.e. cgroupfs) does change, which causes weird behavior.
So I didn't pursue that direction.

This patchset v2 only demonstrates the high-level design; many places
in the implementation can be improved. Cgroup_view is implemented as a
type of bpf_iter because bpf_iter already has good support for seq_file
printing, even though cgroup_view does not actually iterate kernel
objects.

Changes v1->v2:
 - Complete redesign. v1 implemented pinning bpf objects in cgroupfs [2].
   v2 implements object inheritance in bpffs. Due to its simplicity,
   bpffs is a better fit than cgroupfs for implementing inheritance.
 - Extend the selftest to a more realistic use case. The selftests in
   v2 develop a cgroup-level sched performance metric and export it
   through the new prog type.

[1] https://www.kernel.org/doc/html/latest/filesystems/overlayfs.html#changes-to-underlying-filesystems
[2] https://lore.kernel.org/bpf/Ydd1IIUG7%2F3kQRcR@google.com/

Hao Luo (5):
  bpf: Bpffs directory tag
  bpf: Introduce inherit list for dir tag.
  bpf: cgroup_view iter
  bpf: Pin cgroup_view
  selftests/bpf: test for pinning for cgroup_view link

 include/linux/bpf.h                           |   2 +
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/bpf_iter.c                         |  11 +
 kernel/bpf/cgroup_view_iter.c                 | 114 ++++++++
 kernel/bpf/inode.c                            | 272 +++++++++++++++++-
 kernel/bpf/inode.h                            |  55 ++++
 .../selftests/bpf/prog_tests/pinning_cgroup.c | 143 +++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../bpf/progs/bpf_iter_cgroup_view.c          | 232 +++++++++++++++
 9 files changed, 829 insertions(+), 9 deletions(-)
 create mode 100644 kernel/bpf/cgroup_view_iter.c
 create mode 100644 kernel/bpf/inode.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c

-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH RFC bpf-next v2 1/5] bpf: Bpffs directory tag
  2022-02-01 20:55 [PATCH RFC bpf-next v2 0/5] Extend cgroup interface with bpf Hao Luo
@ 2022-02-01 20:55 ` Hao Luo
  2022-02-01 20:55 ` [PATCH RFC bpf-next v2 2/5] bpf: Introduce inherit list for dir tag Hao Luo
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Hao Luo @ 2022-02-01 20:55 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, linux-kernel,
	Hao Luo

Introduce a tag structure for directories in bpffs. A tag carries
special information about a directory. For example, a BPF_DIR_KERNFS_REP
tag denotes that a directory is a replica of a kernfs hierarchy.

At mkdir, if the parent directory has a tag, the child directory also
gets a tag. For KERNFS_REP directories, the tag references a kernfs
node, and the KERNFS_REP hierarchy mirrors the hierarchy in kernfs.
Userspace is responsible for keeping the two hierarchies in sync.

The initial tag is created by pinning a certain type of bpf object.
The following patches will introduce such objects, and the tagged
directory will mirror the cgroup hierarchy.

Tags are destroyed at rmdir.

Signed-off-by: Hao Luo <haoluo@google.com>
---
 kernel/bpf/inode.c | 80 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/bpf/inode.h | 22 +++++++++++++
 2 files changed, 101 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/inode.h

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 5a8d9f7467bf..ecc357009df5 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -16,11 +16,13 @@
 #include <linux/fs.h>
 #include <linux/fs_context.h>
 #include <linux/fs_parser.h>
+#include <linux/kernfs.h>
 #include <linux/kdev_t.h>
 #include <linux/filter.h>
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
 #include "preload/bpf_preload.h"
+#include "inode.h"
 
 enum bpf_type {
 	BPF_TYPE_UNSPEC	= 0,
@@ -142,6 +144,52 @@ static int bpf_inode_type(const struct inode *inode, enum bpf_type *type)
 	return 0;
 }
 
+static struct bpf_dir_tag *inode_tag(const struct inode *inode)
+{
+	if (unlikely(!S_ISDIR(inode->i_mode)))
+		return NULL;
+
+	return inode->i_private;
+}
+
+/* tag_dir_inode - tag a newly created directory.
+ * @tag: tag of parent directory
+ * @dentry: dentry of the new directory
+ * @inode: inode of the new directory
+ *
+ * Called from bpf_mkdir.
+ */
+static int tag_dir_inode(const struct bpf_dir_tag *tag,
+			 const struct dentry *dentry, struct inode *inode)
+{
+	struct bpf_dir_tag *t;
+	struct kernfs_node *kn;
+
+	WARN_ON(tag->type != BPF_DIR_KERNFS_REP);
+
+	/* kn is put at tag deallocation. */
+	kn = kernfs_find_and_get_ns(tag->private, dentry->d_name.name, NULL);
+	if (unlikely(!kn))
+		return -ENOENT;
+
+	if (unlikely(kernfs_type(kn) != KERNFS_DIR)) {
+		kernfs_put(kn);
+		return -EPERM;
+	}
+
+	t = kzalloc(sizeof(struct bpf_dir_tag), GFP_KERNEL | __GFP_NOWARN);
+	if (unlikely(!t)) {
+		kernfs_put(kn);
+		return -ENOMEM;
+	}
+
+	t->type = tag->type;
+	t->private = kn;
+
+	inode->i_private = t;
+	return 0;
+}
+
 static void bpf_dentry_finalize(struct dentry *dentry, struct inode *inode,
 				struct inode *dir)
 {
@@ -156,6 +204,8 @@ static int bpf_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 		     struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
+	struct bpf_dir_tag *tag;
+	int err;
 
 	inode = bpf_get_inode(dir->i_sb, dir, mode | S_IFDIR);
 	if (IS_ERR(inode))
@@ -164,6 +214,15 @@ static int bpf_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 	inode->i_op = &bpf_dir_iops;
 	inode->i_fop = &simple_dir_operations;
 
+	tag = inode_tag(dir);
+	if (tag) {
+		err = tag_dir_inode(tag, dentry, inode);
+		if (err) {
+			iput(inode);
+			return err;
+		}
+	}
+
 	inc_nlink(inode);
 	inc_nlink(dir);
 
@@ -404,11 +463,30 @@ static int bpf_symlink(struct user_namespace *mnt_userns, struct inode *dir,
 	return 0;
 }
 
+static void untag_dir_inode(struct inode *dir)
+{
+	struct bpf_dir_tag *tag = inode_tag(dir);
+
+	WARN_ON(tag->type != BPF_DIR_KERNFS_REP);
+
+	dir->i_private = NULL;
+	kernfs_put(tag->private);
+	kfree(tag);
+}
+
+static int bpf_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	if (inode_tag(dir))
+		untag_dir_inode(dir);
+
+	return simple_rmdir(dir, dentry);
+}
+
 static const struct inode_operations bpf_dir_iops = {
 	.lookup		= bpf_lookup,
 	.mkdir		= bpf_mkdir,
 	.symlink	= bpf_symlink,
-	.rmdir		= simple_rmdir,
+	.rmdir		= bpf_rmdir,
 	.rename		= simple_rename,
 	.link		= simple_link,
 	.unlink		= simple_unlink,
diff --git a/kernel/bpf/inode.h b/kernel/bpf/inode.h
new file mode 100644
index 000000000000..2cfeef39e861
--- /dev/null
+++ b/kernel/bpf/inode.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2022 Google
+ */
+#ifndef __BPF_INODE_H_
+#define __BPF_INODE_H_
+
+enum tag_type {
+	/* The directory is a replicate of a kernfs directory hierarchy. */
+	BPF_DIR_KERNFS_REP = 0,
+};
+
+/* A tag for bpffs directories. It carries special information about a
+ * directory. For example, BPF_DIR_KERNFS_REP denotes that the directory is
+ * a replicate of a kernfs hierarchy. Pinning a certain type of objects tags
+ * a directory and the tag will be removed at rmdir.
+ */
+struct bpf_dir_tag {
+	enum tag_type type;
+	void *private;  /* tag private data */
+};
+
+#endif
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH RFC bpf-next v2 2/5] bpf: Introduce inherit list for dir tag.
  2022-02-01 20:55 [PATCH RFC bpf-next v2 0/5] Extend cgroup interface with bpf Hao Luo
  2022-02-01 20:55 ` [PATCH RFC bpf-next v2 1/5] bpf: Bpffs directory tag Hao Luo
@ 2022-02-01 20:55 ` Hao Luo
  2022-02-01 20:55 ` [PATCH RFC bpf-next v2 3/5] bpf: cgroup_view iter Hao Luo
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Hao Luo @ 2022-02-01 20:55 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, linux-kernel,
	Hao Luo

Embed a list of bpf objects in a directory's tag. This list is
shared by all the directories in the tagged hierarchy.

When a new tagged directory is created, it is prepopulated with the
objects in the inherit list. When the directory is removed, the
inherited objects are removed automatically.

Because the whole tagged hierarchy shares the same list, every
directory in the hierarchy is prepopulated with the same set of
objects.

Signed-off-by: Hao Luo <haoluo@google.com>
---
 kernel/bpf/inode.c | 110 +++++++++++++++++++++++++++++++++++++++++----
 kernel/bpf/inode.h |  33 ++++++++++++++
 2 files changed, 135 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index ecc357009df5..9ae17a2bf779 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -24,13 +24,6 @@
 #include "preload/bpf_preload.h"
 #include "inode.h"
 
-enum bpf_type {
-	BPF_TYPE_UNSPEC	= 0,
-	BPF_TYPE_PROG,
-	BPF_TYPE_MAP,
-	BPF_TYPE_LINK,
-};
-
 static void *bpf_any_get(void *raw, enum bpf_type type)
 {
 	switch (type) {
@@ -69,6 +62,20 @@ static void bpf_any_put(void *raw, enum bpf_type type)
 	}
 }
 
+static void free_obj_list(struct kref *kref)
+{
+	struct obj_list *list;
+	struct bpf_inherit_entry *e;
+
+	list = container_of(kref, struct obj_list, refcnt);
+	list_for_each_entry(e, &list->list, list) {
+		list_del_rcu(&e->list);
+		bpf_any_put(e->obj, e->type);
+		kfree(e);
+	}
+	kfree(list);
+}
+
 static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
 {
 	void *raw;
@@ -100,6 +107,10 @@ static const struct inode_operations bpf_prog_iops = { };
 static const struct inode_operations bpf_map_iops  = { };
 static const struct inode_operations bpf_link_iops  = { };
 
+static int bpf_mkprog(struct dentry *dentry, umode_t mode, void *arg);
+static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg);
+static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg);
+
 static struct inode *bpf_get_inode(struct super_block *sb,
 				   const struct inode *dir,
 				   umode_t mode)
@@ -184,12 +195,62 @@ static int tag_dir_inode(const struct bpf_dir_tag *tag,
 	}
 
 	t->type = tag->type;
+	t->inherit_objects = tag->inherit_objects;
+	kref_get(&t->inherit_objects->refcnt);
 	t->private = kn;
 
 	inode->i_private = t;
 	return 0;
 }
 
+/* populate_dir - populate directory with bpf objects in a tag's
+ * inherit_objects.
+ * @dir: dentry of the directory.
+ * @inode: inode of the directory.
+ *
+ * Called from mkdir. Must be called after dentry has been finalized.
+ */
+static int populate_dir(struct dentry *dir, struct inode *inode)
+{
+	struct bpf_dir_tag *tag = inode_tag(inode);
+	struct bpf_inherit_entry *e;
+	struct dentry *child;
+	int ret = 0;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(e, &tag->inherit_objects->list, list) {
+		child = lookup_one_len_unlocked(e->name.name, dir,
+						strlen(e->name.name));
+		if (unlikely(IS_ERR(child))) {
+			ret = PTR_ERR(child);
+			break;
+		}
+
+		switch (e->type) {
+		case BPF_TYPE_PROG:
+			ret = bpf_mkprog(child, e->mode, e->obj);
+			break;
+		case BPF_TYPE_MAP:
+			ret = bpf_mkmap(child, e->mode, e->obj);
+			break;
+		case BPF_TYPE_LINK:
+			ret = bpf_mklink(child, e->mode, e->obj);
+			break;
+		default:
+			ret = -EPERM;
+			break;
+		}
+		dput(child);
+		if (ret)
+			break;
+
+		/* To match bpf_any_put in bpf_free_inode. */
+		bpf_any_get(e->obj, e->type);
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
 static void bpf_dentry_finalize(struct dentry *dentry, struct inode *inode,
 				struct inode *dir)
 {
@@ -227,6 +288,12 @@ static int bpf_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 	inc_nlink(dir);
 
 	bpf_dentry_finalize(dentry, inode, dir);
+
+	if (tag) {
+		err = populate_dir(dentry, inode);
+		if (err)
+			return err;
+	}
 	return 0;
 }
 
@@ -463,6 +530,30 @@ static int bpf_symlink(struct user_namespace *mnt_userns, struct inode *dir,
 	return 0;
 }
 
+/* unpopulate_dir - remove pre-populated entries from directory.
+ * @dentry: dentry of directory
+ * @inode: inode of directory
+ *
+ * Called from rmdir.
+ */
+static void unpopulate_dir(struct dentry *dentry, struct inode *inode)
+{
+	struct bpf_dir_tag *tag = inode_tag(inode);
+	struct bpf_inherit_entry *e;
+	struct dentry *child;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(e, &tag->inherit_objects->list, list) {
+		child = d_hash_and_lookup(dentry, &e->name);
+		if (unlikely(IS_ERR(child)))
+			continue;
+
+		simple_unlink(inode, child);
+		dput(child);
+	}
+	rcu_read_unlock();
+}
+
 static void untag_dir_inode(struct inode *dir)
 {
 	struct bpf_dir_tag *tag = inode_tag(dir);
@@ -471,13 +562,16 @@ static void untag_dir_inode(struct inode *dir)
 
 	dir->i_private = NULL;
 	kernfs_put(tag->private);
+	kref_put(&tag->inherit_objects->refcnt, free_obj_list);
 	kfree(tag);
 }
 
 static int bpf_rmdir(struct inode *dir, struct dentry *dentry)
 {
-	if (inode_tag(dir))
+	if (inode_tag(dir)) {
+		unpopulate_dir(dentry, dir);
 		untag_dir_inode(dir);
+	}
 
 	return simple_rmdir(dir, dentry);
 }
diff --git a/kernel/bpf/inode.h b/kernel/bpf/inode.h
index 2cfeef39e861..a8207122643d 100644
--- a/kernel/bpf/inode.h
+++ b/kernel/bpf/inode.h
@@ -4,11 +4,42 @@
 #ifndef __BPF_INODE_H_
 #define __BPF_INODE_H_
 
+#include <linux/bpf.h>
+#include <linux/fs.h>
+
+enum bpf_type {
+	BPF_TYPE_UNSPEC	= 0,
+	BPF_TYPE_PROG,
+	BPF_TYPE_MAP,
+	BPF_TYPE_LINK,
+};
+
 enum tag_type {
 	/* The directory is a replicate of a kernfs directory hierarchy. */
 	BPF_DIR_KERNFS_REP = 0,
 };
 
+/* Entry for bpf_dir_tag->inherit_objects.
+ *
+ * When a new directory is created from a tagged directory, the new directory
+ * will be populated with bpf objects in the tag's inherit_objects list. Each
+ * entry holds a reference of a bpf object and the information needed to
+ * recreate the object's entry in the new directory.
+ */
+struct bpf_inherit_entry {
+	struct list_head list;
+	void *obj; /* bpf object to inherit. */
+	enum bpf_type type; /* type of the object (prog, map or link). */
+	struct qstr name;  /* name of the entry. */
+	umode_t mode;  /* access mode of the entry. */
+};
+
+struct obj_list {
+	struct list_head list;
+	struct kref refcnt;
+	struct inode *root;
+};
+
 /* A tag for bpffs directories. It carries special information about a
  * directory. For example, BPF_DIR_KERNFS_REP denotes that the directory is
  * a replicate of a kernfs hierarchy. Pinning a certain type of objects tags
@@ -16,6 +47,8 @@ enum tag_type {
  */
 struct bpf_dir_tag {
 	enum tag_type type;
+	/* list of bpf objects that a directory inherits from its parent. */
+	struct obj_list *inherit_objects;
 	void *private;  /* tag private data */
 };
 
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH RFC bpf-next v2 3/5] bpf: cgroup_view iter
  2022-02-01 20:55 [PATCH RFC bpf-next v2 0/5] Extend cgroup interface with bpf Hao Luo
  2022-02-01 20:55 ` [PATCH RFC bpf-next v2 1/5] bpf: Bpffs directory tag Hao Luo
  2022-02-01 20:55 ` [PATCH RFC bpf-next v2 2/5] bpf: Introduce inherit list for dir tag Hao Luo
@ 2022-02-01 20:55 ` Hao Luo
  2022-02-01 20:55 ` [PATCH RFC bpf-next v2 4/5] bpf: Pin cgroup_view Hao Luo
  2022-02-01 20:55 ` [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link Hao Luo
  4 siblings, 0 replies; 19+ messages in thread
From: Hao Luo @ 2022-02-01 20:55 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, linux-kernel,
	Hao Luo

Introduce a new type of iter prog: 'cgroup_view'. It prints out a
cgroup's state.

Cgroup_view is meant to be used together with directory tagging. When a
cgroup_view is pinned in a directory, it tags that directory as
KERNFS_REP, i.e. a replica of the cgroup hierarchy. Whenever a
subdirectory is created, if a child cgroup of the same name exists,
the subdirectory inherits the pinned cgroup_view object from its
parent and holds a reference to the corresponding kernfs node.

The cgroup_view prog takes a pointer to the cgroup and can use the
family of seq printing helpers to print out cgroup state. A typical use
case of cgroup_view is extending the cgroupfs interface.
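For illustration, a minimal cgroup_view program might look like the
sketch below. It follows the context struct defined in this patch and
the usual bpf_iter conventions (the selftest-style bpf_iter.h header and
BPF_SEQ_PRINTF); it is a sketch, not compilable outside a selftest-style
harness, and the program name is made up.

```c
/* Sketch of a cgroup_view program; assumes a selftest-style
 * bpf_iter.h that defines struct bpf_iter__cgroup_view. */
#include "bpf_iter.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

/* Runs when a pinned cgroup_view file is read; ctx->cgroup is the
 * cgroup matched to the file's parent directory. */
SEC("iter/cgroup_view")
int dump_cgroup_id(struct bpf_iter__cgroup_view *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct cgroup *cgroup = ctx->cgroup;

	if (!cgroup)
		return 0;

	BPF_SEQ_PRINTF(seq, "cgroup_id: %llu\n", cgroup->kn->id);
	return 0;
}
```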

Signed-off-by: Hao Luo <haoluo@google.com>
---
 include/linux/bpf.h           |   2 +
 kernel/bpf/Makefile           |   2 +-
 kernel/bpf/bpf_iter.c         |  11 ++++
 kernel/bpf/cgroup_view_iter.c | 114 ++++++++++++++++++++++++++++++++++
 4 files changed, 128 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/cgroup_view_iter.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6eb0b180d33b..494927b2b3c2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1610,6 +1610,7 @@ typedef const struct bpf_func_proto *
 
 enum bpf_iter_feature {
 	BPF_ITER_RESCHED	= BIT(0),
+	BPF_ITER_INHERIT	= BIT(1),
 };
 
 #define BPF_ITER_CTX_ARG_MAX 2
@@ -1647,6 +1648,7 @@ bpf_iter_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
 int bpf_iter_link_attach(const union bpf_attr *attr, bpfptr_t uattr, struct bpf_prog *prog);
 int bpf_iter_new_fd(struct bpf_link *link);
 bool bpf_link_is_iter(struct bpf_link *link);
+bool bpf_link_support_inherit(struct bpf_link *link);
 struct bpf_prog *bpf_iter_get_info(struct bpf_iter_meta *meta, bool in_stop);
 int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
 void bpf_iter_map_show_fdinfo(const struct bpf_iter_aux_info *aux,
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index c1a9be6a4b9f..d9d2b8541ba7 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
-obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_view_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 110029ede71e..ff5577a5f73a 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -496,6 +496,17 @@ bool bpf_link_is_iter(struct bpf_link *link)
 	return link->ops == &bpf_iter_link_lops;
 }
 
+bool bpf_link_support_inherit(struct bpf_link *link)
+{
+	struct bpf_iter_link *iter_link;
+
+	if (!bpf_link_is_iter(link))
+		return false;
+
+	iter_link = container_of(link, struct bpf_iter_link, link);
+	return iter_link->tinfo->reg_info->feature & BPF_ITER_INHERIT;
+}
+
 int bpf_iter_link_attach(const union bpf_attr *attr, bpfptr_t uattr,
 			 struct bpf_prog *prog)
 {
diff --git a/kernel/bpf/cgroup_view_iter.c b/kernel/bpf/cgroup_view_iter.c
new file mode 100644
index 000000000000..a44d115235c4
--- /dev/null
+++ b/kernel/bpf/cgroup_view_iter.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Google */
+#include <linux/bpf.h>
+#include <linux/fs.h>
+#include <linux/filter.h>
+#include <linux/kernel.h>
+#include <linux/btf_ids.h>
+#include <linux/cgroup.h>
+#include <linux/kernfs.h>
+#include "inode.h"
+
+static void *cgroup_view_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_dir_tag *tag;
+	struct kernfs_node *kn;
+	struct cgroup *cgroup;
+	struct inode *dir;
+
+	/* Only one session is supported. */
+	if (*pos > 0)
+		return NULL;
+
+	dir = d_inode(seq->file->f_path.dentry->d_parent);
+	tag = dir->i_private;
+	if (!tag)
+		return NULL;
+
+	kn = tag->private;
+
+	rcu_read_lock();
+	cgroup = rcu_dereference(*(void __rcu __force **)&kn->priv);
+	if (!cgroup || !cgroup_tryget(cgroup))
+		cgroup = NULL;
+	rcu_read_unlock();
+
+	if (!cgroup)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+	return cgroup;
+}
+
+static void *cgroup_view_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return NULL;
+}
+
+struct bpf_iter__cgroup_view {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct cgroup *, cgroup);
+};
+
+DEFINE_BPF_ITER_FUNC(cgroup_view, struct bpf_iter_meta *meta, struct cgroup *cgroup)
+
+static int cgroup_view_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_iter__cgroup_view ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	ctx.meta = &meta;
+	ctx.cgroup = v;
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, false);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+
+	return ret;
+}
+
+static void cgroup_view_seq_stop(struct seq_file *seq, void *v)
+{
+	if (v)
+		cgroup_put(v);
+}
+
+static const struct seq_operations cgroup_view_seq_ops = {
+	.start	= cgroup_view_seq_start,
+	.next	= cgroup_view_seq_next,
+	.stop	= cgroup_view_seq_stop,
+	.show	= cgroup_view_seq_show,
+};
+
+BTF_ID_LIST(btf_cgroup_id)
+BTF_ID(struct, cgroup)
+
+static const struct bpf_iter_seq_info cgroup_view_seq_info = {
+	.seq_ops		= &cgroup_view_seq_ops,
+	.init_seq_private	= NULL,
+	.fini_seq_private	= NULL,
+	.seq_priv_size		= 0,
+};
+
+static struct bpf_iter_reg cgroup_view_reg_info = {
+	.target			= "cgroup_view",
+	.feature		= BPF_ITER_INHERIT,
+	.ctx_arg_info_size	= 1,
+	.ctx_arg_info		= {
+		{ offsetof(struct bpf_iter__cgroup_view, cgroup),
+		  PTR_TO_BTF_ID },
+	},
+	.seq_info		= &cgroup_view_seq_info,
+};
+
+static int __init cgroup_view_init(void)
+{
+	cgroup_view_reg_info.ctx_arg_info[0].btf_id = *btf_cgroup_id;
+	return bpf_iter_reg_target(&cgroup_view_reg_info);
+}
+
+late_initcall(cgroup_view_init);
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH RFC bpf-next v2 4/5] bpf: Pin cgroup_view
  2022-02-01 20:55 [PATCH RFC bpf-next v2 0/5] Extend cgroup interface with bpf Hao Luo
                   ` (2 preceding siblings ...)
  2022-02-01 20:55 ` [PATCH RFC bpf-next v2 3/5] bpf: cgroup_view iter Hao Luo
@ 2022-02-01 20:55 ` Hao Luo
  2022-02-01 20:55 ` [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link Hao Luo
  4 siblings, 0 replies; 19+ messages in thread
From: Hao Luo @ 2022-02-01 20:55 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, linux-kernel,
	Hao Luo

When a cgroup_view iter is pinned into bpffs, the pinning adds a tag to
the parent directory, indicating that the directory hierarchy underneath
is a replica of the cgroup hierarchy. Each of these directories connects
to a cgroup directory. Whenever a subdirectory is created, if a subcgroup
of the same name exists, the subdirectory is populated with entries
holding the bpf objects registered in the tag's inherit list. The inherit
list is formed from the objects pinned in the top-level tagged directory.
For example,

 bpf_obj_pin(cgroup_view_link, "/sys/fs/bpf/A/obj");

pins a link in A. A becomes tagged and the link object is registered
in the inherit list of A's tag.

 mkdir("/sys/fs/bpf/A/B");

When A/B is created, B inherits the pinned objects in A, so B is
populated with those objects.

 > ls /sys/fs/bpf/A/B
 obj

Currently, only pinning a cgroup_view link can tag a directory, and once
tagged, only rmdir can remove the tag.

Signed-off-by: Hao Luo <haoluo@google.com>
---
 kernel/bpf/inode.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 9ae17a2bf779..b71840bf979d 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -71,6 +71,7 @@ static void free_obj_list(struct kref *kref)
 	list_for_each_entry(e, &list->list, list) {
 		list_del_rcu(&e->list);
 		bpf_any_put(e->obj, e->type);
+		kfree(e->name.name);
 		kfree(e);
 	}
 	kfree(list);
@@ -486,9 +487,20 @@ static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg)
 			     &bpffs_map_fops : &bpffs_obj_fops);
 }
 
+static int
+bpf_inherit_object(struct dentry *dentry, umode_t mode, void *obj,
+		   enum bpf_type type);
+
 static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg)
 {
 	struct bpf_link *link = arg;
+	int err;
+
+	if (bpf_link_support_inherit(link)) {
+		err = bpf_inherit_object(dentry, mode, link, BPF_TYPE_LINK);
+		if (err)
+			return err;
+	}
 
 	return bpf_mkobj_ops(dentry, mode, arg, &bpf_link_iops,
 			     bpf_link_is_iter(link) ?
@@ -586,6 +598,78 @@ static const struct inode_operations bpf_dir_iops = {
 	.unlink		= simple_unlink,
 };
 
+/* bpf_inherit_object - register an object in a directory tag's inherit list
+ * @dentry: dentry of the location to pin
+ * @mode: mode of created file entry
+ * @obj: bpf object
+ * @type: type of bpf object
+ *
+ * Could be called from bpf_obj_do_pin() or from mkdir().
+ */
+static int bpf_inherit_object(struct dentry *dentry, umode_t mode,
+			      void *obj, enum bpf_type type)
+{
+	struct inode *dir = d_inode(dentry->d_parent);
+	struct obj_list *inherits;
+	struct bpf_inherit_entry *e;
+	struct bpf_dir_tag *tag;
+	const char *name;
+	bool queued = false, new_tag = false;
+
+	/* allocate bpf_dir_tag */
+	tag = inode_tag(dir);
+	if (!tag) {
+		new_tag = true;
+		tag = kzalloc(sizeof(struct bpf_dir_tag), GFP_KERNEL);
+		if (unlikely(!tag))
+			return -ENOMEM;
+
+		tag->type = BPF_DIR_KERNFS_REP;
+		inherits = kzalloc(sizeof(struct obj_list), GFP_KERNEL);
+		if (unlikely(!inherits)) {
+			kfree(tag);
+			return -ENOMEM;
+		}
+
+		kref_init(&inherits->refcnt);
+		INIT_LIST_HEAD(&inherits->list);
+		tag->inherit_objects = inherits;
+		/* initial tag points to the default root cgroup. */
+		tag->private = cgrp_dfl_root.kf_root->kn;
+		dir->i_private = tag;
+	} else {
+		inherits = tag->inherit_objects;
+	}
+
+	list_for_each_entry_rcu(e, &inherits->list, list) {
+		if (!strcmp(dentry->d_name.name, e->name.name)) {
+			queued = true;
+			break;
+		}
+	}
+
+	/* queue in tag's inherit_list. */
+	if (!queued) {
+		e = kzalloc(sizeof(struct bpf_inherit_entry), GFP_KERNEL);
+		if (!e) {
+			if (new_tag) {
+				kfree(tag);
+				kfree(inherits);
+			}
+			return -ENOMEM;
+		}
+
+		INIT_LIST_HEAD(&e->list);
+		e->mode = mode;
+		e->obj = obj;
+		e->type = type;
+		name = kstrdup(dentry->d_name.name, GFP_USER | __GFP_NOWARN);
+		e->name = (struct qstr)QSTR_INIT(name, strlen(name));
+		list_add_rcu(&e->list, &inherits->list);
+	}
+	return 0;
+}
+
 /* pin iterator link into bpffs */
 static int bpf_iter_link_pin_kernel(struct dentry *parent,
 				    const char *name, struct bpf_link *link)
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-01 20:55 [PATCH RFC bpf-next v2 0/5] Extend cgroup interface with bpf Hao Luo
                   ` (3 preceding siblings ...)
  2022-02-01 20:55 ` [PATCH RFC bpf-next v2 4/5] bpf: Pin cgroup_view Hao Luo
@ 2022-02-01 20:55 ` Hao Luo
  2022-02-03 18:04   ` Alexei Starovoitov
  4 siblings, 1 reply; 19+ messages in thread
From: Hao Luo @ 2022-02-01 20:55 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, linux-kernel,
	Hao Luo

A selftest for demoing/testing cgroup_view pinning, directory tagging
and reading pinned cgroup_view objects. The added cgroup_view prog is a
naive use case, which merely reads the cgroup's id.

This selftest introduces two metrics for measuring cgroup-level
scheduling queuing delays:

 - queue_self: queueing delays caused by waiting behind self cgroup.
 - queue_other: queueing delays caused by waiting behind other cgroups.

The selftest collects task-level queuing delays and breaks them into
queue_self delays and queue_other delays. This is done by hooking
handlers at context switch and wakeup. Only the max queue_self and
queue_other delays are recorded. This breakdown is helpful in analyzing
the cause of long scheduling delays: a large queue_self value indicates
contention that comes from within the cgroup, while a large queue_other
value indicates contention between cgroups.

A new cgroup_view iter prog, "dump_cgroup_lat", is implemented; it reads
the recorded delays and dumps the stats through the bpffs interface.
Specifically, the cgroup_view is initially pinned in an empty directory
in bpffs, which effectively turns that directory into a mirror of the
cgroup hierarchy. Whenever a new cgroup is created, we manually create a
corresponding directory in bpffs. The new directory will contain a file
prepopulated with the pinned cgroup_view object. Reading that file
yields the queue stats of the corresponding cgroup.

Signed-off-by: Hao Luo <haoluo@google.com>
---
 .../selftests/bpf/prog_tests/pinning_cgroup.c | 143 +++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../bpf/progs/bpf_iter_cgroup_view.c          | 232 ++++++++++++++++++
 3 files changed, 382 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c

diff --git a/tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c b/tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c
new file mode 100644
index 000000000000..ebef154e63c9
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <test_progs.h>
+#include <time.h>
+#include <unistd.h>
+#include "bpf_iter_cgroup_view.skel.h"
+
+static void spin_on_cpu(int seconds)
+{
+	time_t start, now;
+
+	start = time(NULL);
+	do {
+		now = time(NULL);
+	} while (now - start < seconds);
+}
+
+static void do_work(const char *cgroup)
+{
+	int i, cpu = 0, pid;
+	char cmd[128];
+
+	/* make cgroup threaded */
+	snprintf(cmd, 128, "echo threaded > %s/cgroup.type", cgroup);
+	system(cmd);
+
+	/* try to enable the cpu controller. this may fail if the cpu
+	 * controller is not available in cgroup.controllers or if a cgroup
+	 * v1 hierarchy is already mounted in the system.
+	 */
+	snprintf(cmd, 128, "echo \"+cpu\" > %s/cgroup.subtree_control", cgroup);
+	system(cmd);
+
+	/* launch two children, both running in child cgroup */
+	for (i = 0; i < 2; ++i) {
+		pid = fork();
+		if (pid == 0) {
+			/* attach to cgroup */
+			snprintf(cmd, 128, "echo %d > %s/cgroup.procs", getpid(), cgroup);
+			system(cmd);
+
+			/* pin process to target cpu */
+			snprintf(cmd, 128, "taskset -pc %d %d", cpu, getpid());
+			system(cmd);
+
+			spin_on_cpu(3); /* spin on cpu for 3 seconds */
+			exit(0);
+		}
+	}
+
+	/* pin process to target cpu */
+	snprintf(cmd, 128, "taskset -pc %d %d", cpu, getpid());
+	system(cmd);
+
+	spin_on_cpu(3); /* spin on cpu for 3 seconds */
+	wait(NULL);
+}
+
+static void check_pinning(const char *rootpath)
+{
+	const char *child_cgroup = "/sys/fs/cgroup/child";
+	struct bpf_iter_cgroup_view *skel;
+	struct bpf_link *link;
+	struct stat statbuf = {};
+	FILE *file;
+	unsigned long queue_self, queue_other;
+	int cgroup_id, link_fd;
+	char path[64];
+	char buf[64];
+
+	skel = bpf_iter_cgroup_view__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "bpf_iter_cgroup_view__open_and_load"))
+		return;
+
+	/* pin path at parent dir. */
+	link = bpf_program__attach_iter(skel->progs.dump_cgroup_lat, NULL);
+	link_fd = bpf_link__fd(link);
+
+	/* test initial pinning */
+	snprintf(path, 64, "%s/obj", rootpath);
+	ASSERT_OK(bpf_obj_pin(link_fd, path), "bpf_obj_pin");
+	ASSERT_OK(stat(path, &statbuf), "pinned_object_exists");
+
+	/* test mkdir */
+	mkdir(child_cgroup, 0755);
+	snprintf(path, 64, "%s/child", rootpath);
+	ASSERT_OK(mkdir(path, 0755), "mkdir");
+
+	/* test that new dir has been pre-populated with pinned objects */
+	snprintf(path, 64, "%s/child/obj", rootpath);
+	ASSERT_OK(stat(path, &statbuf), "populate");
+
+	bpf_iter_cgroup_view__attach(skel);
+	do_work(child_cgroup);
+	bpf_iter_cgroup_view__detach(skel);
+
+	/* test cat inherited objects */
+	file = fopen(path, "r");
+	if (ASSERT_OK_PTR(file, "open")) {
+		ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+		ASSERT_EQ(sscanf(buf, "cgroup_id: %8d", &cgroup_id), 1, "output");
+
+		ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+		ASSERT_EQ(sscanf(buf, "queue_self: %8lu", &queue_self), 1, "output");
+
+		ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+		ASSERT_EQ(sscanf(buf, "queue_other: %8lu", &queue_other), 1, "output");
+
+		fclose(file);
+	}
+
+	/* test rmdir */
+	snprintf(path, 64, "%s/child", rootpath);
+	ASSERT_OK(rmdir(path), "rmdir");
+
+	/* unpin object */
+	snprintf(path, 64, "%s/obj", rootpath);
+	ASSERT_OK(unlink(path), "unlink");
+
+	bpf_link__destroy(link);
+	bpf_iter_cgroup_view__destroy(skel);
+}
+
+void test_pinning_cgroup(void)
+{
+	char tmpl[] = "/sys/fs/bpf/pinning_test_XXXXXX";
+	char *rootpath;
+
+	system("mount -t cgroup2 none /sys/fs/cgroup");
+	system("mount -t bpf bpffs /sys/fs/bpf");
+
+	rootpath = mkdtemp(tmpl);
+	chmod(rootpath, 0755);
+
+	/* check pinning map, prog and link in kernfs */
+	if (test__start_subtest("pinning"))
+		check_pinning(rootpath);
+
+	rmdir(rootpath);
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/selftests/bpf/progs/bpf_iter.h
index 8cfaeba1ddbf..506bb3efd9b4 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter.h
+++ b/tools/testing/selftests/bpf/progs/bpf_iter.h
@@ -16,6 +16,7 @@
 #define bpf_iter__bpf_map_elem bpf_iter__bpf_map_elem___not_used
 #define bpf_iter__bpf_sk_storage_map bpf_iter__bpf_sk_storage_map___not_used
 #define bpf_iter__sockmap bpf_iter__sockmap___not_used
+#define bpf_iter__cgroup_view bpf_iter__cgroup_view___not_used
 #define btf_ptr btf_ptr___not_used
 #define BTF_F_COMPACT BTF_F_COMPACT___not_used
 #define BTF_F_NONAME BTF_F_NONAME___not_used
@@ -37,6 +38,7 @@
 #undef bpf_iter__bpf_map_elem
 #undef bpf_iter__bpf_sk_storage_map
 #undef bpf_iter__sockmap
+#undef bpf_iter__cgroup_view
 #undef btf_ptr
 #undef BTF_F_COMPACT
 #undef BTF_F_NONAME
@@ -132,6 +134,11 @@ struct bpf_iter__sockmap {
 	struct sock *sk;
 };
 
+struct bpf_iter__cgroup_view {
+	struct bpf_iter_meta *meta;
+	struct cgroup *cgroup;
+} __attribute__((preserve_access_index));
+
 struct btf_ptr {
 	void *ptr;
 	__u32 type_id;
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c b/tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c
new file mode 100644
index 000000000000..43404c21aee3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c
@@ -0,0 +1,232 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define TASK_RUNNING 0
+#define BPF_F_CURRENT_CPU 0xffffffffULL
+
+extern void fair_sched_class __ksym;
+extern bool CONFIG_FAIR_GROUP_SCHED __kconfig;
+extern bool CONFIG_CGROUP_SCHED __kconfig;
+
+struct wait_lat {
+	/* Queue_self stands for the latency a task experiences while waiting
+	 * behind the tasks that are from the same cgroup.
+	 *
+	 * Queue_other stands for the latency a task experiences while waiting
+	 * behind the tasks that are from other cgroups.
+	 *
+	 * For example, if there are three tasks: A, B and C. Suppose A and B
+	 * are in the same cgroup and C is in another cgroup and we see A has
+	 * a queueing latency X milliseconds. Let's say during the X milliseconds,
+	 * B has run for Y milliseconds. We can break X down into two parts:
+	 * the time when B is on cpu, that is Y; and the time when C is on cpu, X - Y.
+	 *
+	 * Queue_self is the former (Y) while queue_other is the latter (X - Y).
+	 *
+	 * A large value in queue_self is an indication of contention within a
+	 * cgroup, while a large value in queue_other is an indication of
+	 * contention from multiple cgroups.
+	 */
+	u64 queue_self;
+	u64 queue_other;
+};
+
+struct timestamp {
+	/* timestamp when last queued */
+	u64 tsp;
+
+	/* cgroup exec_clock when last queued */
+	u64 exec_clock;
+};
+
+/* Map to store per-cgroup wait latency */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, u64);
+	__type(value, struct wait_lat);
+	__uint(max_entries, 65532);
+} cgroup_lat SEC(".maps");
+
+/* Map to store per-task queue timestamp */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct timestamp);
+} start SEC(".maps");
+
+/* adapted from task_cfs_rq in kernel/sched/sched.h */
+__always_inline
+struct cfs_rq *task_cfs_rq(struct task_struct *t)
+{
+	if (!CONFIG_FAIR_GROUP_SCHED)
+		return NULL;
+
+	return BPF_CORE_READ(&t->se, cfs_rq);
+}
+
+/* record enqueue timestamp */
+__always_inline
+static int trace_enqueue(struct task_struct *t)
+{
+	u32 pid = t->pid;
+	struct timestamp *ptr;
+	struct cfs_rq *cfs_rq;
+
+	if (!pid)
+		return 0;
+
+	/* only measure for CFS tasks */
+	if (t->sched_class != &fair_sched_class)
+		return 0;
+
+	ptr = bpf_task_storage_get(&start, t, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+
+	/* CONFIG_FAIR_GROUP_SCHED may not be enabled */
+	cfs_rq = task_cfs_rq(t);
+	if (!cfs_rq)
+		return 0;
+
+	ptr->tsp = bpf_ktime_get_ns();
+	ptr->exec_clock = BPF_CORE_READ(cfs_rq, exec_clock);
+	return 0;
+}
+
+SEC("tp_btf/sched_wakeup")
+int handle__sched_wakeup(u64 *ctx)
+{
+	/* TP_PROTO(struct task_struct *p) */
+	struct task_struct *p = (void *)ctx[0];
+
+	return trace_enqueue(p);
+}
+
+SEC("tp_btf/sched_wakeup_new")
+int handle__sched_wakeup_new(u64 *ctx)
+{
+	/* TP_PROTO(struct task_struct *p) */
+	struct task_struct *p = (void *)ctx[0];
+
+	return trace_enqueue(p);
+}
+
+/* task_group() from kernel/sched/sched.h */
+__always_inline
+struct task_group *task_group(struct task_struct *p)
+{
+	if (!CONFIG_CGROUP_SCHED)
+		return NULL;
+
+	return BPF_CORE_READ(p, sched_task_group);
+}
+
+__always_inline
+struct cgroup *task_cgroup(struct task_struct *p)
+{
+	struct task_group *tg;
+
+	tg = task_group(p);
+	if (!tg)
+		return NULL;
+
+	return BPF_CORE_READ(tg, css).cgroup;
+}
+
+__always_inline
+u64 max(u64 x, u64 y)
+{
+	return x > y ? x : y;
+}
+
+SEC("tp_btf/sched_switch")
+int handle__sched_switch(u64 *ctx)
+{
+	/* TP_PROTO(bool preempt, struct task_struct *prev,
+	 *	    struct task_struct *next)
+	 */
+	struct task_struct *prev = (struct task_struct *)ctx[1];
+	struct task_struct *next = (struct task_struct *)ctx[2];
+	u64 delta, delta_self, delta_other, id;
+	struct cfs_rq *cfs_rq;
+	struct timestamp *tsp;
+	struct wait_lat *lat;
+	struct cgroup *cgroup;
+
+	/* ivcsw: treat like an enqueue event and store timestamp */
+	if (prev->__state == TASK_RUNNING)
+		trace_enqueue(prev);
+
+	/* only measure for CFS tasks */
+	if (next->sched_class != &fair_sched_class)
+		return 0;
+
+	/* fetch timestamp and calculate delta */
+	tsp = bpf_task_storage_get(&start, next, 0, 0);
+	if (!tsp)
+		return 0;   /* missed enqueue */
+
+	/* CONFIG_FAIR_GROUP_SCHED may not be enabled */
+	cfs_rq = task_cfs_rq(next);
+	if (!cfs_rq)
+		return 0;
+
+	/* cpu controller may not be enabled */
+	cgroup = task_cgroup(next);
+	if (!cgroup)
+		return 0;
+
+	/* calculate self delay and other delay */
+	delta = bpf_ktime_get_ns() - tsp->tsp;
+	delta_self = BPF_CORE_READ(cfs_rq, exec_clock) - tsp->exec_clock;
+	if (delta_self > delta)
+		delta_self = delta;
+	delta_other = delta - delta_self;
+
+	/* insert into cgroup_lat map */
+	id = BPF_CORE_READ(cgroup, kn, id);
+	lat = bpf_map_lookup_elem(&cgroup_lat, &id);
+	if (!lat) {
+		struct wait_lat w = {
+			.queue_self = delta_self,
+			.queue_other = delta_other,
+		};
+
+		bpf_map_update_elem(&cgroup_lat, &id, &w, BPF_ANY);
+	} else {
+		lat->queue_self = max(delta_self, lat->queue_self);
+		lat->queue_other = max(delta_other, lat->queue_other);
+	}
+
+	bpf_task_storage_delete(&start, next);
+	return 0;
+}
+
+SEC("iter/cgroup_view")
+int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct cgroup *cgroup = ctx->cgroup;
+	struct wait_lat *lat;
+	u64 id;
+
+	BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", cgroup->kn->id);
+	lat = bpf_map_lookup_elem(&cgroup_lat, &id);
+	if (lat) {
+		BPF_SEQ_PRINTF(seq, "queue_self: %8lu\n", lat->queue_self);
+		BPF_SEQ_PRINTF(seq, "queue_other: %8lu\n", lat->queue_other);
+	} else {
+		/* print zeros anyway, to keep the userspace parsing logic uniform. */
+		BPF_SEQ_PRINTF(seq, "queue_self: %8d\n", 0);
+		BPF_SEQ_PRINTF(seq, "queue_other: %8d\n", 0);
+	}
+	return 0;
+}
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-01 20:55 ` [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link Hao Luo
@ 2022-02-03 18:04   ` Alexei Starovoitov
  2022-02-03 22:46     ` Hao Luo
  0 siblings, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2022-02-03 18:04 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, linux-kernel

On Tue, Feb 01, 2022 at 12:55:34PM -0800, Hao Luo wrote:
> +
> +SEC("iter/cgroup_view")
> +int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx)
> +{
> +	struct seq_file *seq = ctx->meta->seq;
> +	struct cgroup *cgroup = ctx->cgroup;
> +	struct wait_lat *lat;
> +	u64 id;
> +
> +	BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", cgroup->kn->id);
> +	lat = bpf_map_lookup_elem(&cgroup_lat, &id);

Looks like "id = cgroup->kn->id" assignment is missing here?

Thanks a lot for this test. It explains the motivation well.

It seems that the patches 1-4 are there to automatically
supply cgroup pointer into bpf_iter__cgroup_view.

Since user space needs to track a good part of cgroup dir operations
can we task it with the job of patches 1-4 as well?
It can register notifier for cgroupfs operations and
do mkdir in bpffs similarly _and_ parametrize 'view' bpf program
with corresponding cgroup_id.
Ideally there is no new 'view' program and it's a subset of 'iter'
bpf program. They're already parametrizable.
When 'iter' is pinned the user space can tell it which object it should
iterate on. The 'view' will be an iterator of one element and
argument to it can be cgroup_id.
When user space pins the same 'view' program in a newly created bpffs
directory it will parametrize it with a different cgroup_id.
At the end the same 'view' program will be pinned in multiple directories
with different cgroup_id arguments.
This patch 5 will look very much the same, but patches 1-4 will not be
necessary.
Of course there are races between cgroup create/destroy and bpffs
mkdir, prog pin operations, but they will be there regardless.
The patch 1-4 approach is not race free either.
Will that work?

* Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-03 18:04   ` Alexei Starovoitov
@ 2022-02-03 22:46     ` Hao Luo
  2022-02-04  3:33       ` Alexei Starovoitov
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Luo @ 2022-02-03 22:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, linux-kernel

On Thu, Feb 3, 2022 at 10:04 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 01, 2022 at 12:55:34PM -0800, Hao Luo wrote:
> > +
> > +SEC("iter/cgroup_view")
> > +int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx)
> > +{
> > +     struct seq_file *seq = ctx->meta->seq;
> > +     struct cgroup *cgroup = ctx->cgroup;
> > +     struct wait_lat *lat;
> > +     u64 id;
> > +
> > +     BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", cgroup->kn->id);
> > +     lat = bpf_map_lookup_elem(&cgroup_lat, &id);
>
> Looks like "id = cgroup->kn->id" assignment is missing here?
>

Ah, yes. I'll fix it.

> Thanks a lot for this test. It explains the motivation well.
>
> It seems that the patches 1-4 are there to automatically
> supply cgroup pointer into bpf_iter__cgroup_view.
>
> Since user space needs to track a good part of cgroup dir operations
> can we task it with the job of patches 1-4 as well?
> It can register notifier for cgroupfs operations and
> do mkdir in bpffs similarly _and_ parametrize 'view' bpf program
> with corresponding cgroup_id.
> Ideally there is no new 'view' program and it's a subset of 'iter'
> bpf program. They're already parametrizable.
> When 'iter' is pinned the user space can tell it which object it should
> iterate on. The 'view' will be an iterator of one element and
> argument to it can be cgroup_id.
> When user space pins the same 'view' program in a newly created bpffs
> directory it will parametrize it with a different cgroup_id.
> At the end the same 'view' program will be pinned in multiple directories
> with different cgroup_id arguments.
> This patch 5 will look very much the same, but patches 1-4 will not be
> necessary.
> Of course there are races between cgroup create/destroy and bpffs
> mkdir, prog pin operations, but they will be there regardless.
> The patch 1-4 approach is not race free either.

Right. I tried to minimize the races between cgroupfs and bpffs in
this patchset. The cgroup and kernfs APIs called in this patchset
guarantee that the cgroup and kernfs objects stay alive once a
reference is taken. At least some states in the objects, such as 'id',
should remain valid.

> Will that work?

Thanks Alexei for the idea.

The parameterization part sounds good. By 'parametrize', do you mean a
variable in iter prog (like the 'pid' variable in bpf_iter_task_vma.c
[1])? or some metadata of the program? I assume it's program's
metadata. Either parameterizing with cgroup_id or passing cgroup
object to the prog should work. The problem is at pinning.

In our use case, we can't ask the users who create cgroups to do the
pinning. Pinning requires root privilege. In our use case, we have
non-root users who can create cgroup directories and still want to
read bpf stats. They can't do pinning by themselves. This is why
inheritance is a requirement for us. With inheritance, they only need
to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
operation is required. Patch 1-4 are needed to implement inheritance.

It's also not a good idea in our use case to add a userspace
privileged process to monitor cgroupfs operations and perform the
pinning. It's more complex and has a higher maintenance cost and
runtime overhead, compared to the solution of asking whoever makes
cgroups to mkdir in bpffs. The other problem is: if there are nodes in
the data center that don't have the userspace process deployed, the
stats will be unavailable, which is a no-no for some of our users.

[1] tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c

* Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-03 22:46     ` Hao Luo
@ 2022-02-04  3:33       ` Alexei Starovoitov
  2022-02-04 18:26         ` Hao Luo
  0 siblings, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2022-02-04  3:33 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, LKML

On Thu, Feb 3, 2022 at 2:46 PM Hao Luo <haoluo@google.com> wrote:
>
> On Thu, Feb 3, 2022 at 10:04 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Feb 01, 2022 at 12:55:34PM -0800, Hao Luo wrote:
> > > +
> > > +SEC("iter/cgroup_view")
> > > +int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx)
> > > +{
> > > +     struct seq_file *seq = ctx->meta->seq;
> > > +     struct cgroup *cgroup = ctx->cgroup;
> > > +     struct wait_lat *lat;
> > > +     u64 id;
> > > +
> > > +     BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", cgroup->kn->id);
> > > +     lat = bpf_map_lookup_elem(&cgroup_lat, &id);
> >
> > Looks like "id = cgroup->kn->id" assignment is missing here?
> >
>
> Ah, yes. I'll fix it.
>
> > Thanks a lot for this test. It explains the motivation well.
> >
> > It seems that the patches 1-4 are there to automatically
> > supply cgroup pointer into bpf_iter__cgroup_view.
> >
> > Since user space needs to track a good part of cgroup dir operations
> > can we task it with the job of patches 1-4 as well?
> > It can register notifier for cgroupfs operations and
> > do mkdir in bpffs similarly _and_ parametrize 'view' bpf program
> > with corresponding cgroup_id.
> > Ideally there is no new 'view' program and it's a subset of 'iter'
> > bpf program. They're already parametrizable.
> > When 'iter' is pinned the user space can tell it which object it should
> > iterate on. The 'view' will be an iterator of one element and
> > argument to it can be cgroup_id.
> > When user space pins the same 'view' program in a newly created bpffs
> > directory it will parametrize it with a different cgroup_id.
> > At the end the same 'view' program will be pinned in multiple directories
> > with different cgroup_id arguments.
> > This patch 5 will look very much the same, but patches 1-4 will not be
> > necessary.
> > Of course there are races between cgroup create/destroy and bpffs
> > mkdir, prog pin operations, but they will be there regardless.
> > The patch 1-4 approach is not race free either.
>
> Right. I tried to minimize the races between cgroupfs and bpffs in
> this patchset. The cgroup and kernfs APIs called in this patchset
> guarantee that the cgroup and kernfs objects are alive once get. Some
> states in the objects such as 'id' should be valid at least.
>
> > Will that work?
>
> Thanks Alexei for the idea.
>
> The parameterization part sounds good. By 'parametrize', do you mean a
> variable in iter prog (like the 'pid' variable in bpf_iter_task_vma.c
> [1])? or some metadata of the program? I assume it's program's
> metadata. Either parameterizing with cgroup_id or passing cgroup
> object to the prog should work. The problem is at pinning.

The bpf_iter_link_info is used to parametrize the iterator.
The map iterator will iterate the given map_fd.
iirc pinning is not parameterizable yet,
but that's not difficult to add.


> In our use case, we can't ask the users who create cgroups to do the
> pinning. Pinning requires root privilege. In our use case, we have
> non-root users who can create cgroup directories and still want to
> read bpf stats. They can't do pinning by themselves. This is why
> inheritance is a requirement for us. With inheritance, they only need
> to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> operation is required. Patch 1-4 are needed to implement inheritance.
>
> It's also not a good idea in our use case to add a userspace
> privileged process to monitor cgroupfs operations and perform the
> pinning. It's more complex and has a higher maintenance cost and
> runtime overhead, compared to the solution of asking whoever makes
> cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> the data center that don't have the userspace process deployed, the
> stats will be unavailable, which is a no-no for some of our users.

The commit log says that there will be a daemon that does that
monitoring of cgroupfs. And that daemon needs to mkdir
directories in bpffs when a new cgroup is created, no?
The kernel is only doing inheritance of bpf progs into
new dirs. I think that daemon can pin as well.

The cgroup creation is typically managed by an agent like systemd.
Sounds like you have your own agent that creates cgroups?
If so it has to be privileged and it can mkdir in bpffs and pin too ?

* Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-04  3:33       ` Alexei Starovoitov
@ 2022-02-04 18:26         ` Hao Luo
  2022-02-06  4:29           ` Alexei Starovoitov
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Luo @ 2022-02-04 18:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, LKML

On Thu, Feb 3, 2022 at 7:33 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Feb 3, 2022 at 2:46 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Thu, Feb 3, 2022 at 10:04 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Feb 01, 2022 at 12:55:34PM -0800, Hao Luo wrote:
> > > > +
> > > > +SEC("iter/cgroup_view")
> > > > +int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx)
> > > > +{
> > > > +     struct seq_file *seq = ctx->meta->seq;
> > > > +     struct cgroup *cgroup = ctx->cgroup;
> > > > +     struct wait_lat *lat;
> > > > +     u64 id;
> > > > +
> > > > +     BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", cgroup->kn->id);
> > > > +     lat = bpf_map_lookup_elem(&cgroup_lat, &id);
> > >
> > > Looks like "id = cgroup->kn->id" assignment is missing here?
> > >
> >
> > Ah, yes. I'll fix it.
> >
> > > Thanks a lot for this test. It explains the motivation well.
> > >
> > > It seems that the patches 1-4 are there to automatically
> > > supply cgroup pointer into bpf_iter__cgroup_view.
> > >
> > > Since user space needs to track a good part of cgroup dir operations
> > > can we task it with the job of patches 1-4 as well?
> > > It can register notifier for cgroupfs operations and
> > > do mkdir in bpffs similarly _and_ parametrize 'view' bpf program
> > > with corresponding cgroup_id.
> > > Ideally there is no new 'view' program and it's a subset of 'iter'
> > > bpf program. They're already parametrizable.
> > > When 'iter' is pinned the user space can tell it which object it should
> > > iterate on. The 'view' will be an iterator of one element and
> > > argument to it can be cgroup_id.
> > > When user space pins the same 'view' program in a newly created bpffs
> > > directory it will parametrize it with a different cgroup_id.
> > > At the end the same 'view' program will be pinned in multiple directories
> > > with different cgroup_id arguments.
> > > This patch 5 will look very much the same, but patches 1-4 will not be
> > > necessary.
> > > Of course there are races between cgroup create/destroy and bpffs
> > > mkdir, prog pin operations, but they will be there regardless.
> > > The patch 1-4 approach is not race free either.
> >
> > Right. I tried to minimize the races between cgroupfs and bpffs in
> > this patchset. The cgroup and kernfs APIs called in this patchset
> > guarantee that the cgroup and kernfs objects are alive once get. Some
> > states in the objects such as 'id' should be valid at least.
> >
> > > Will that work?
> >
> > Thanks Alexei for the idea.
> >
> > The parameterization part sounds good. By 'parametrize', do you mean a
> > variable in iter prog (like the 'pid' variable in bpf_iter_task_vma.c
> > [1])? or some metadata of the program? I assume it's program's
> > metadata. Either parameterizing with cgroup_id or passing cgroup
> > object to the prog should work. The problem is at pinning.
>
> The bpf_iter_link_info is used to parametrize the iterator.
> The map iterator will iterate the given map_fd.
> iirc pinning is not parameterizable yet,
> but that's not difficult to add.
>

I can take a look at that. This will be useful in our use case.

>
> > In our use case, we can't ask the users who create cgroups to do the
> > pinning. Pinning requires root privilege. In our use case, we have
> > non-root users who can create cgroup directories and still want to
> > read bpf stats. They can't do pinning by themselves. This is why
> > inheritance is a requirement for us. With inheritance, they only need
> > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > operation is required. Patch 1-4 are needed to implement inheritance.
> >
> > It's also not a good idea in our use case to add a userspace
> > privileged process to monitor cgroupfs operations and perform the
> > pinning. It's more complex and has a higher maintenance cost and
> > runtime overhead, compared to the solution of asking whoever makes
> > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > the data center that don't have the userspace process deployed, the
> > stats will be unavailable, which is a no-no for some of our users.
>
> The commit log says that there will be a daemon that does that
> monitoring of cgroupfs. And that daemon needs to mkdir
> directories in bpffs when a new cgroup is created, no?
> The kernel is only doing inheritance of bpf progs into
> new dirs. I think that daemon can pin as well.
>
> The cgroup creation is typically managed by an agent like systemd.
> Sounds like you have your own agent that creates cgroups?
> If so it has to be privileged and it can mkdir in bpffs and pin too ?

Ah, yes, we have our own daemon to manage cgroups. That daemon creates
the top-level cgroup for each job to run inside. However, the job can
create its own cgroups inside the top-level cgroup, for fine grained
resource control. This doesn't go through the daemon. The job-created
cgroups don't have the pinned objects and this is a no-no for our
users.

* Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-04 18:26         ` Hao Luo
@ 2022-02-06  4:29           ` Alexei Starovoitov
  2022-02-08 20:07             ` Hao Luo
  0 siblings, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2022-02-06  4:29 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, LKML

On Fri, Feb 4, 2022 at 10:27 AM Hao Luo <haoluo@google.com> wrote:
> >
> > > In our use case, we can't ask the users who create cgroups to do the
> > > pinning. Pinning requires root privilege. In our use case, we have
> > > non-root users who can create cgroup directories and still want to
> > > read bpf stats. They can't do pinning by themselves. This is why
> > > inheritance is a requirement for us. With inheritance, they only need
> > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > operation is required. Patch 1-4 are needed to implement inheritance.
> > >
> > > It's also not a good idea in our use case to add a userspace
> > > privileged process to monitor cgroupfs operations and perform the
> > > pinning. It's more complex and has a higher maintenance cost and
> > > runtime overhead, compared to the solution of asking whoever makes
> > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > the data center that don't have the userspace process deployed, the
> > > stats will be unavailable, which is a no-no for some of our users.
> >
> > The commit log says that there will be a daemon that does that
> > monitoring of cgroupfs. And that daemon needs to mkdir
> > directories in bpffs when a new cgroup is created, no?
> > The kernel is only doing inheritance of bpf progs into
> > new dirs. I think that daemon can pin as well.
> >
> > The cgroup creation is typically managed by an agent like systemd.
> > Sounds like you have your own agent that creates cgroups?
> > If so it has to be privileged and it can mkdir in bpffs and pin too ?
>
> Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> the top-level cgroup for each job to run inside. However, the job can
> create its own cgroups inside the top-level cgroup, for fine grained
> resource control. This doesn't go through the daemon. The job-created
> cgroups don't have the pinned objects and this is a no-no for our
> users.

We can whitelist certain tracepoints to be sleepable and extend
tp_btf prog type to include everything from prog_type_syscall.
Such prog would attach to cgroup_mkdir and cgroup_release
and would call bpf_sys_bpf() helper to pin progs in new bpffs dirs.
We can allow prog_type_syscall to do mkdir in bpffs as well.

This feature could be useful for similar monitoring/introspection tasks.
We can write a program that would monitor bpf prog load/unload
and would pin an iterator prog that would show debug info about a prog.
Like cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
With this feature we can implement:
ls /sys/fs/bpf/all_progs.debug/
and each loaded prog would have a corresponding file.
The file name would be a program name, for example.
cat /sys/fs/bpf/all_progs.debug/my_prog
would pretty print info about 'my_prog' bpf program.

This way the kernfs/cgroupfs specific logic from patches 1-4
will not be necessary.

wdyt?

* Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-06  4:29           ` Alexei Starovoitov
@ 2022-02-08 20:07             ` Hao Luo
  2022-02-08 21:20               ` Alexei Starovoitov
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Luo @ 2022-02-08 20:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, LKML

On Sat, Feb 5, 2022 at 8:29 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Feb 4, 2022 at 10:27 AM Hao Luo <haoluo@google.com> wrote:
> > >
> > > > In our use case, we can't ask the users who create cgroups to do the
> > > > pinning. Pinning requires root privilege. In our use case, we have
> > > > non-root users who can create cgroup directories and still want to
> > > > read bpf stats. They can't do pinning by themselves. This is why
> > > > inheritance is a requirement for us. With inheritance, they only need
> > > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > > operation is required. Patch 1-4 are needed to implement inheritance.
> > > >
> > > > It's also not a good idea in our use case to add a userspace
> > > > privileged process to monitor cgroupfs operations and perform the
> > > > pinning. It's more complex and has a higher maintenance cost and
> > > > runtime overhead, compared to the solution of asking whoever makes
> > > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > > the data center that don't have the userspace process deployed, the
> > > > stats will be unavailable, which is a no-no for some of our users.
> > >
> > > The commit log says that there will be a daemon that does that
> > > monitoring of cgroupfs. And that daemon needs to mkdir
> > > directories in bpffs when a new cgroup is created, no?
> > > The kernel is only doing inheritance of bpf progs into
> > > new dirs. I think that daemon can pin as well.
> > >
> > > The cgroup creation is typically managed by an agent like systemd.
> > > Sounds like you have your own agent that creates cgroups?
> > > If so it has to be privileged and it can mkdir in bpffs and pin too ?
> >
> > Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> > the top-level cgroup for each job to run inside. However, the job can
> > create its own cgroups inside the top-level cgroup, for fine grained
> > resource control. This doesn't go through the daemon. The job-created
> > cgroups don't have the pinned objects and this is a no-no for our
> > users.
>
> We can whitelist certain tracepoints to be sleepable and extend
> tp_btf prog type to include everything from prog_type_syscall.
> Such prog would attach to cgroup_mkdir and cgroup_release
> and would call bpf_sys_bpf() helper to pin progs in new bpffs dirs.
> We can allow prog_type_syscall to do mkdir in bpffs as well.
>
> This feature could be useful for similar monitoring/introspection tasks.
> We can write a program that would monitor bpf prog load/unload
> and would pin an iterator prog that would show debug info about a prog.
> Like cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
> With this feature we can implement:
> ls /sys/fs/bpf/all_progs.debug/
> and each loaded prog would have a corresponding file.
> The file name would be a program name, for example.
> cat /sys/fs/bpf/all_progs.debug/my_prog
> would pretty print info about 'my_prog' bpf program.
>
> This way the kernfs/cgroupfs specific logic from patches 1-4
> will not be necessary.
>
> wdyt?

Thanks Alexei. I gave it more thought in the last couple of days.
Actually, I think it's a good idea, and more flexible. It gets rid of
the need for a user space daemon to monitor cgroup creation and
destruction. We could monitor task creation and exit as well, so
that we can export per-task information (e.g. task_vma_iter) more
efficiently.

A couple of thoughts when thinking about the details:

- Regarding parameterized pinning, I don't think we can have one
single bpf_iter_link object with different parameters, because the
parameters are part of the bpf_iter_link (bpf_iter_aux_info). So every
time we pin, we first have to attach the iter in order to get a new
link object. That means we need to add attach and detach to bpf_sys_bpf().
- We also need to add syscalls for cleanup: (1) unlink, for
removing a pinned obj, and (2) rmdir, for removing the directory, in
prog_type_syscall.
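For the attach step, the userspace-side sequence today is roughly the
following; the same commands would need to become reachable from
bpf_sys_bpf(). Field names are from the existing LINK_CREATE uapi;
error handling and the pin path are omitted.

```c
union bpf_attr attr = {};
struct bpf_iter_link_info linfo = {};
int link_fd;

linfo.map.map_fd = map_fd;	/* the iter's parameter */
attr.link_create.prog_fd = prog_fd;
attr.link_create.attach_type = BPF_TRACE_ITER;
attr.link_create.iter_info = (__u64)(unsigned long)&linfo;
attr.link_create.iter_info_len = sizeof(linfo);
/* each pin needs a fresh link, since the parameters live in the
 * link's bpf_iter_aux_info */
link_fd = syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
/* then BPF_OBJ_PIN the link_fd; detach is just close(link_fd) */
```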

With these extensions, we can shift some of the bpf operations
currently performed in system daemons into the kernel. IMHO it's a
great thing, making system monitoring more flexible.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-08 20:07             ` Hao Luo
@ 2022-02-08 21:20               ` Alexei Starovoitov
  2022-02-08 21:34                 ` Hao Luo
  2022-02-14 18:29                 ` Hao Luo
  0 siblings, 2 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2022-02-08 21:20 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, LKML

On Tue, Feb 8, 2022 at 12:07 PM Hao Luo <haoluo@google.com> wrote:
>
> On Sat, Feb 5, 2022 at 8:29 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Feb 4, 2022 at 10:27 AM Hao Luo <haoluo@google.com> wrote:
> > > >
> > > > > In our use case, we can't ask the users who create cgroups to do the
> > > > > pinning. Pinning requires root privilege. In our use case, we have
> > > > > non-root users who can create cgroup directories and still want to
> > > > > read bpf stats. They can't do pinning by themselves. This is why
> > > > > inheritance is a requirement for us. With inheritance, they only need
> > > > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > > > operation is required. Patch 1-4 are needed to implement inheritance.
> > > > >
> > > > > It's also not a good idea in our use case to add a userspace
> > > > > privileged process to monitor cgroupfs operations and perform the
> > > > > pinning. It's more complex and has a higher maintenance cost and
> > > > > runtime overhead, compared to the solution of asking whoever makes
> > > > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > > > the data center that don't have the userspace process deployed, the
> > > > > stats will be unavailable, which is a no-no for some of our users.
> > > >
> > > > The commit log says that there will be a daemon that does that
> > > > monitoring of cgroupfs. And that daemon needs to mkdir
> > > > directories in bpffs when a new cgroup is created, no?
> > > > The kernel is only doing inheritance of bpf progs into
> > > > new dirs. I think that daemon can pin as well.
> > > >
> > > > The cgroup creation is typically managed by an agent like systemd.
> > > > Sounds like you have your own agent that creates cgroups?
> > > > If so it has to be privileged and it can mkdir in bpffs and pin too ?
> > >
> > > Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> > > the top-level cgroup for each job to run inside. However, the job can
> > > create its own cgroups inside the top-level cgroup, for fine grained
> > > resource control. This doesn't go through the daemon. The job-created
> > > cgroups don't have the pinned objects and this is a no-no for our
> > > users.
> >
> > We can whitelist certain tracepoints to be sleepable and extend
> > tp_btf prog type to include everything from prog_type_syscall.
> > Such prog would attach to cgroup_mkdir and cgroup_release
> > and would call bpf_sys_bpf() helper to pin progs in new bpffs dirs.
> > We can allow prog_type_syscall to do mkdir in bpffs as well.
> >
> > This feature could be useful for similar monitoring/introspection tasks.
> > We can write a program that would monitor bpf prog load/unload
> > and would pin an iterator prog that would show debug info about a prog.
> > Like cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
> > With this feature we can implement:
> > ls /sys/fs/bpf/all_progs.debug/
> > and each loaded prog would have a corresponding file.
> > The file name would be a program name, for example.
> > cat /sys/fs/bpf/all_progs.debug/my_prog
> > would pretty print info about 'my_prog' bpf program.
> >
> > This way the kernfs/cgroupfs specific logic from patches 1-4
> > will not be necessary.
> >
> > wdyt?
>
> Thanks Alexei. I gave it more thought in the last couple of days.
> Actually I think it's a good idea, more flexible. It gets rid of the
> need of a user space daemon for monitoring cgroup creation and
> destruction. We could monitor task creations and exits as well, so
> that we can export per-task information (e.g. task_vma_iter) more
> efficiently.

Yep. Monitoring task creation and exposing it via bpf_iter sounds
useful too.

> A couple of thoughts when thinking about the details:
>
> - Regarding parameterized pinning, I don't think we can have one
> single bpf_iter_link object, but with different parameters. Because
> parameters are part of the bpf_iter_link (bpf_iter_aux_info). So every
> time we pin, we have to attach iter in order to get a new link object
> first. So we need to add attach and detach in bpf_sys_bpf().

Makes sense.
I'm adding bpf_link_create to bpf_sys_bpf as part of
the "lskel for kernel" patch set.
The detach is sys_close. It's already available.

> - We also need to add those syscalls for cleanup: (1) unlink for
> removing pinned obj and (2) rmdir for removing the directory in
> prog_type_syscall.

Yes. These two would be needed.
And obj_pin too.

> With these extensions, we can shift some of the bpf operations
> currently performed in system daemons into the kernel. IMHO it's a
> great thing, making system monitoring more flexible.

Awesome. Sounds like we're converging :)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-08 21:20               ` Alexei Starovoitov
@ 2022-02-08 21:34                 ` Hao Luo
  2022-02-14 18:29                 ` Hao Luo
  1 sibling, 0 replies; 19+ messages in thread
From: Hao Luo @ 2022-02-08 21:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, LKML

On Tue, Feb 8, 2022 at 1:20 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 8, 2022 at 12:07 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Sat, Feb 5, 2022 at 8:29 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Feb 4, 2022 at 10:27 AM Hao Luo <haoluo@google.com> wrote:
> > > > >
> > > > > > In our use case, we can't ask the users who create cgroups to do the
> > > > > > pinning. Pinning requires root privilege. In our use case, we have
> > > > > > non-root users who can create cgroup directories and still want to
> > > > > > read bpf stats. They can't do pinning by themselves. This is why
> > > > > > inheritance is a requirement for us. With inheritance, they only need
> > > > > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > > > > operation is required. Patch 1-4 are needed to implement inheritance.
> > > > > >
> > > > > > It's also not a good idea in our use case to add a userspace
> > > > > > privileged process to monitor cgroupfs operations and perform the
> > > > > > pinning. It's more complex and has a higher maintenance cost and
> > > > > > runtime overhead, compared to the solution of asking whoever makes
> > > > > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > > > > the data center that don't have the userspace process deployed, the
> > > > > > stats will be unavailable, which is a no-no for some of our users.
> > > > >
> > > > > The commit log says that there will be a daemon that does that
> > > > > monitoring of cgroupfs. And that daemon needs to mkdir
> > > > > directories in bpffs when a new cgroup is created, no?
> > > > > The kernel is only doing inheritance of bpf progs into
> > > > > new dirs. I think that daemon can pin as well.
> > > > >
> > > > > The cgroup creation is typically managed by an agent like systemd.
> > > > > Sounds like you have your own agent that creates cgroups?
> > > > > If so it has to be privileged and it can mkdir in bpffs and pin too ?
> > > >
> > > > Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> > > > the top-level cgroup for each job to run inside. However, the job can
> > > > create its own cgroups inside the top-level cgroup, for fine grained
> > > > resource control. This doesn't go through the daemon. The job-created
> > > > cgroups don't have the pinned objects and this is a no-no for our
> > > > users.
> > >
> > > We can whitelist certain tracepoints to be sleepable and extend
> > > tp_btf prog type to include everything from prog_type_syscall.
> > > Such prog would attach to cgroup_mkdir and cgroup_release
> > > and would call bpf_sys_bpf() helper to pin progs in new bpffs dirs.
> > > We can allow prog_type_syscall to do mkdir in bpffs as well.
> > >
> > > This feature could be useful for similar monitoring/introspection tasks.
> > > We can write a program that would monitor bpf prog load/unload
> > > and would pin an iterator prog that would show debug info about a prog.
> > > Like cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
> > > With this feature we can implement:
> > > ls /sys/fs/bpf/all_progs.debug/
> > > and each loaded prog would have a corresponding file.
> > > The file name would be a program name, for example.
> > > cat /sys/fs/bpf/all_progs.debug/my_prog
> > > would pretty print info about 'my_prog' bpf program.
> > >
> > > This way the kernfs/cgroupfs specific logic from patches 1-4
> > > will not be necessary.
> > >
> > > wdyt?
> >
> > Thanks Alexei. I gave it more thought in the last couple of days.
> > Actually I think it's a good idea, more flexible. It gets rid of the
> > need of a user space daemon for monitoring cgroup creation and
> > destruction. We could monitor task creations and exits as well, so
> > that we can export per-task information (e.g. task_vma_iter) more
> > efficiently.
>
> Yep. Monitoring task creation and exposing via bpf_iter sounds
> useful too.
>
> > A couple of thoughts when thinking about the details:
> >
> > - Regarding parameterized pinning, I don't think we can have one
> > single bpf_iter_link object, but with different parameters. Because
> > parameters are part of the bpf_iter_link (bpf_iter_aux_info). So every
> > time we pin, we have to attach iter in order to get a new link object
> > first. So we need to add attach and detach in bpf_sys_bpf().
>
> Makes sense.
> I'm adding bpf_link_create to bpf_sys_bpf as part of
> the "lskel for kernel" patch set.
> The detach is sys_close. It's already available.
>
> > - We also need to add those syscalls for cleanup: (1) unlink for
> > removing pinned obj and (2) rmdir for removing the directory in
> > prog_type_syscall.
>
> Yes. These two would be needed.
> And obj_pin too.
>
> > With these extensions, we can shift some of the bpf operations
> > currently performed in system daemons into the kernel. IMHO it's a
> > great thing, making system monitoring more flexible.
>
> Awesome. Sounds like we're converging :)

Right. :)

Thank you for proposing this solution. This really helps! I'm going to
work on a prototype.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-08 21:20               ` Alexei Starovoitov
  2022-02-08 21:34                 ` Hao Luo
@ 2022-02-14 18:29                 ` Hao Luo
  2022-02-14 19:24                   ` Alexei Starovoitov
  1 sibling, 1 reply; 19+ messages in thread
From: Hao Luo @ 2022-02-14 18:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, LKML

On Tue, Feb 8, 2022 at 1:20 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 8, 2022 at 12:07 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Sat, Feb 5, 2022 at 8:29 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Feb 4, 2022 at 10:27 AM Hao Luo <haoluo@google.com> wrote:
> > > > >
> > > > > > In our use case, we can't ask the users who create cgroups to do the
> > > > > > pinning. Pinning requires root privilege. In our use case, we have
> > > > > > non-root users who can create cgroup directories and still want to
> > > > > > read bpf stats. They can't do pinning by themselves. This is why
> > > > > > inheritance is a requirement for us. With inheritance, they only need
> > > > > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > > > > operation is required. Patch 1-4 are needed to implement inheritance.
> > > > > >
> > > > > > It's also not a good idea in our use case to add a userspace
> > > > > > privileged process to monitor cgroupfs operations and perform the
> > > > > > pinning. It's more complex and has a higher maintenance cost and
> > > > > > runtime overhead, compared to the solution of asking whoever makes
> > > > > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > > > > the data center that don't have the userspace process deployed, the
> > > > > > stats will be unavailable, which is a no-no for some of our users.
> > > > >
> > > > > The commit log says that there will be a daemon that does that
> > > > > monitoring of cgroupfs. And that daemon needs to mkdir
> > > > > directories in bpffs when a new cgroup is created, no?
> > > > > The kernel is only doing inheritance of bpf progs into
> > > > > new dirs. I think that daemon can pin as well.
> > > > >
> > > > > The cgroup creation is typically managed by an agent like systemd.
> > > > > Sounds like you have your own agent that creates cgroups?
> > > > > If so it has to be privileged and it can mkdir in bpffs and pin too ?
> > > >
> > > > Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> > > > the top-level cgroup for each job to run inside. However, the job can
> > > > create its own cgroups inside the top-level cgroup, for fine grained
> > > > resource control. This doesn't go through the daemon. The job-created
> > > > cgroups don't have the pinned objects and this is a no-no for our
> > > > users.
> > >
> > > We can whitelist certain tracepoints to be sleepable and extend
> > > tp_btf prog type to include everything from prog_type_syscall.
> > > Such prog would attach to cgroup_mkdir and cgroup_release
> > > and would call bpf_sys_bpf() helper to pin progs in new bpffs dirs.
> > > We can allow prog_type_syscall to do mkdir in bpffs as well.
> > >
> > > This feature could be useful for similar monitoring/introspection tasks.
> > > We can write a program that would monitor bpf prog load/unload
> > > and would pin an iterator prog that would show debug info about a prog.
> > > Like cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
> > > With this feature we can implement:
> > > ls /sys/fs/bpf/all_progs.debug/
> > > and each loaded prog would have a corresponding file.
> > > The file name would be a program name, for example.
> > > cat /sys/fs/bpf/all_progs.debug/my_prog
> > > would pretty print info about 'my_prog' bpf program.
> > >
> > > This way the kernfs/cgroupfs specific logic from patches 1-4
> > > will not be necessary.
> > >
> > > wdyt?

Hi Alexei,

Actually, I found this almost worked, except that the tracepoints
cgroup_mkdir and cgroup_rmdir are not sleepable. They are inside a
spinlock's critical section with irqs off. I guess one solution is to
offload the sleepable part of the bpf prog into a thread context. We
may create a dedicated kernel thread or use a workqueue for this. Do
you have any advice?

> >
> > Thanks Alexei. I gave it more thought in the last couple of days.
> > Actually I think it's a good idea, more flexible. It gets rid of the
> > need of a user space daemon for monitoring cgroup creation and
> > destruction. We could monitor task creations and exits as well, so
> > that we can export per-task information (e.g. task_vma_iter) more
> > efficiently.
>
> Yep. Monitoring task creation and exposing via bpf_iter sounds
> useful too.
>
> > A couple of thoughts when thinking about the details:
> >
> > - Regarding parameterized pinning, I don't think we can have one
> > single bpf_iter_link object, but with different parameters. Because
> > parameters are part of the bpf_iter_link (bpf_iter_aux_info). So every
> > time we pin, we have to attach iter in order to get a new link object
> > first. So we need to add attach and detach in bpf_sys_bpf().
>
> Makes sense.
> I'm adding bpf_link_create to bpf_sys_bpf as part of
> the "lskel for kernel" patch set.
> The detach is sys_close. It's already available.
>
> > - We also need to add those syscalls for cleanup: (1) unlink for
> > removing pinned obj and (2) rmdir for removing the directory in
> > prog_type_syscall.
>
> Yes. These two would be needed.
> And obj_pin too.
>
> > With these extensions, we can shift some of the bpf operations
> > currently performed in system daemons into the kernel. IMHO it's a
> > great thing, making system monitoring more flexible.
>
> Awesome. Sounds like we're converging :)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-14 18:29                 ` Hao Luo
@ 2022-02-14 19:24                   ` Alexei Starovoitov
  2022-02-14 20:23                     ` Hao Luo
  0 siblings, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2022-02-14 19:24 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, LKML

On Mon, Feb 14, 2022 at 10:29 AM Hao Luo <haoluo@google.com> wrote:
> Hi Alexei,
>
> Actually, I found this almost worked, except that the tracepoints
> cgroup_mkdir and cgroup_rmdir are not sleepable. They are inside a
> spinlock's critical section with irq off. I guess one solution is to
> offload the sleepable part of the bpf prog into a thread context. We
> may create a dedicated kernel thread or use workqueue for this. Do you
> have any advice?

Are you referring to spin_lock in TRACE_CGROUP_PATH
that protects global trace_cgroup_path[] buffer?
That is fixable.
Do you actually need the string path returned by cgroup_path() in bpf prog?
Maybe prog can call cgroup_path() by itself when necessary.
Parsing strings isn't great anyway. The bpf prog probably needs the last
part of the dir only. So cgrp->kn->name would do it?
The TRACE_CGROUP_PATH wasn't designed to be turned on 24/7.
That global spin_lock is not great for production use.
No need to delegate sleepable bpf to thread context.
Let's refactor that tracepoint a bit.
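Once the tracepoint is refactored, the common case could pull the name
straight off the kernfs node, along these lines (sketch only; assumes
cgroup_mkdir has been made sleepable-attachable):

```c
SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(on_cgroup_mkdir, struct cgroup *cgrp, const char *path)
{
	char name[128];

	/* last component of the cgroup dir, no cgroup_path() needed */
	bpf_probe_read_kernel_str(name, sizeof(name),
				  BPF_CORE_READ(cgrp, kn, name));
	/* ... use name ... */
	return 0;
}
```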

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-14 19:24                   ` Alexei Starovoitov
@ 2022-02-14 20:23                     ` Hao Luo
  2022-02-14 20:27                       ` Alexei Starovoitov
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Luo @ 2022-02-14 20:23 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, LKML

On Mon, Feb 14, 2022 at 11:25 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Feb 14, 2022 at 10:29 AM Hao Luo <haoluo@google.com> wrote:
> > Hi Alexei,
> >
> > Actually, I found this almost worked, except that the tracepoints
> > cgroup_mkdir and cgroup_rmdir are not sleepable. They are inside a
> > spinlock's critical section with irq off. I guess one solution is to
> > offload the sleepable part of the bpf prog into a thread context. We
> > may create a dedicated kernel thread or use workqueue for this. Do you
> > have any advice?
>
> Are you referring to spin_lock in TRACE_CGROUP_PATH
> that protects global trace_cgroup_path[] buffer?

Yes, that's the spin_lock I am talking about.

> That is fixable.
> Do you actually need the string path returned by cgroup_path() in bpf prog?
> Maybe prog can call cgroup_path() by itself when necessary.
> Parsing strings isn't great anyway. The bpf prog probably needs the last
> part of the dir only. So cgrp->kn->name would do it?
> The TRACE_CGROUP_PATH wasn't designed to be turned on 24/7.
> That global spin_lock is not great for production use.
> No need to delegate sleepable bpf to thread context.
> Let's refactor that tracepoint a bit.

No, we don't need cgroup_path(). We are going to name the directories
by cgrp->kn->id. I can add fast versions of the cgroup_xxx tracepoints,
which don't require the full path and can be turned on 24/7.
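Something along these lines (sketch; the event name and fields here are
placeholders, not a concrete proposal):

```c
/* no global path buffer, no spin_lock, no cgroup_path() call,
 * so it can stay enabled 24/7 */
TRACE_EVENT(cgroup_mkdir_fast,
	TP_PROTO(struct cgroup *cgrp),
	TP_ARGS(cgrp),
	TP_STRUCT__entry(
		__field(u64, id)
		__field(int, level)
	),
	TP_fast_assign(
		__entry->id = cgroup_id(cgrp);	/* cgrp->kn->id */
		__entry->level = cgrp->level;
	),
	TP_printk("cgroup_id=%llu level=%d", __entry->id, __entry->level)
);
```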

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-14 20:23                     ` Hao Luo
@ 2022-02-14 20:27                       ` Alexei Starovoitov
  2022-02-14 20:39                         ` Hao Luo
  0 siblings, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2022-02-14 20:27 UTC (permalink / raw)
  To: Hao Luo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, LKML

On Mon, Feb 14, 2022 at 12:23 PM Hao Luo <haoluo@google.com> wrote:
>
> On Mon, Feb 14, 2022 at 11:25 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Feb 14, 2022 at 10:29 AM Hao Luo <haoluo@google.com> wrote:
> > > Hi Alexei,
> > >
> > > Actually, I found this almost worked, except that the tracepoints
> > > cgroup_mkdir and cgroup_rmdir are not sleepable. They are inside a
> > > spinlock's critical section with irq off. I guess one solution is to
> > > offload the sleepable part of the bpf prog into a thread context. We
> > > may create a dedicated kernel thread or use workqueue for this. Do you
> > > have any advice?
> >
> > Are you referring to spin_lock in TRACE_CGROUP_PATH
> > that protects global trace_cgroup_path[] buffer?
>
> Yes, that's the spin_lock I am talking about.
>
> > That is fixable.
> > Do you actually need the string path returned by cgroup_path() in bpf prog?
> > Maybe prog can call cgroup_path() by itself when necessary.
> > Parsing strings isn't great anyway. The bpf prog probably needs the last
> > part of the dir only. So cgrp->kn->name would do it?
> > The TRACE_CGROUP_PATH wasn't designed to be turned on 24/7.
> > That global spin_lock is not great for production use.
> > No need to delegate sleepable bpf to thread context.
> > Let's refactor that tracepoint a bit.
>
> No, we don't need cgroup_path(). We are going to name the directories
> by cgrp->kn->id. I can add a fast version for cgroup_xxx tracepoints,
> which don't require the full path and can be turned on 24/7.

Sounds good. We need a flag for tracepoints anyway to indicate
which ones are sleepable.
Probably similar to what we did for DEFINE_EVENT_WRITABLE.
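Something shaped like the writable variant, i.e. (the sleepable macro
and its naming are entirely hypothetical; the existing macro is
abridged from memory):

```c
/* existing pattern in include/trace/bpf_probe.h (abridged): */
#define DEFINE_EVENT_WRITABLE(template, call, proto, args, size)	\
	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))		\
	__DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), size)

/* hypothetical sleepable counterpart: marks the attach point so the
 * verifier permits sleepable tp_btf progs on it */
#define DEFINE_EVENT_SLEEPABLE(template, call, proto, args)		\
	__BPF_DECLARE_TRACE_SLEEPABLE(call, PARAMS(proto), PARAMS(args)) \
	__DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), 0)
```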

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
  2022-02-14 20:27                       ` Alexei Starovoitov
@ 2022-02-14 20:39                         ` Hao Luo
  0 siblings, 0 replies; 19+ messages in thread
From: Hao Luo @ 2022-02-14 20:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, LKML

On Mon, Feb 14, 2022 at 12:28 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Feb 14, 2022 at 12:23 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Mon, Feb 14, 2022 at 11:25 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Feb 14, 2022 at 10:29 AM Hao Luo <haoluo@google.com> wrote:
> > > > Hi Alexei,
> > > >
> > > > Actually, I found this almost worked, except that the tracepoints
> > > > cgroup_mkdir and cgroup_rmdir are not sleepable. They are inside a
> > > > spinlock's critical section with irq off. I guess one solution is to
> > > > offload the sleepable part of the bpf prog into a thread context. We
> > > > may create a dedicated kernel thread or use workqueue for this. Do you
> > > > have any advice?
> > >
> > > Are you referring to spin_lock in TRACE_CGROUP_PATH
> > > that protects global trace_cgroup_path[] buffer?
> >
> > Yes, that's the spin_lock I am talking about.
> >
> > > That is fixable.
> > > Do you actually need the string path returned by cgroup_path() in bpf prog?
> > > Maybe prog can call cgroup_path() by itself when necessary.
> > > Parsing strings isn't great anyway. The bpf prog probably needs the last
> > > part of the dir only. So cgrp->kn->name would do it?
> > > The TRACE_CGROUP_PATH wasn't designed to be turned on 24/7.
> > > That global spin_lock is not great for production use.
> > > No need to delegate sleepable bpf to thread context.
> > > Let's refactor that tracepoint a bit.
> >
> > No, we don't need cgroup_path(). We are going to name the directories
> > by cgrp->kn->id. I can add a fast version for cgroup_xxx tracepoints,
> > which don't require the full path and can be turned on 24/7.
>
> Sounds good. We need a flag for tracepoints anyway to indicate
> which ones are sleepable.
> Probably similar to what we did for DEFINE_EVENT_WRITABLE.

No problem. I'll take a look at it. Thanks!

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-02-14 20:59 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-01 20:55 [PATCH RFC bpf-next v2 0/5] Extend cgroup interface with bpf Hao Luo
2022-02-01 20:55 ` [PATCH RFC bpf-next v2 1/5] bpf: Bpffs directory tag Hao Luo
2022-02-01 20:55 ` [PATCH RFC bpf-next v2 2/5] bpf: Introduce inherit list for dir tag Hao Luo
2022-02-01 20:55 ` [PATCH RFC bpf-next v2 3/5] bpf: cgroup_view iter Hao Luo
2022-02-01 20:55 ` [PATCH RFC bpf-next v2 4/5] bpf: Pin cgroup_view Hao Luo
2022-02-01 20:55 ` [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link Hao Luo
2022-02-03 18:04   ` Alexei Starovoitov
2022-02-03 22:46     ` Hao Luo
2022-02-04  3:33       ` Alexei Starovoitov
2022-02-04 18:26         ` Hao Luo
2022-02-06  4:29           ` Alexei Starovoitov
2022-02-08 20:07             ` Hao Luo
2022-02-08 21:20               ` Alexei Starovoitov
2022-02-08 21:34                 ` Hao Luo
2022-02-14 18:29                 ` Hao Luo
2022-02-14 19:24                   ` Alexei Starovoitov
2022-02-14 20:23                     ` Hao Luo
2022-02-14 20:27                       ` Alexei Starovoitov
2022-02-14 20:39                         ` Hao Luo
