Netdev Archive on lore.kernel.org
* [PATCH bpf v2 1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode
@ 2021-09-09 20:43 Daniel Borkmann
  2021-09-09 20:43 ` [PATCH bpf v2 2/3] bpf, selftests: Add cgroup v1 net_cls classid helpers Daniel Borkmann
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Daniel Borkmann @ 2021-09-09 20:43 UTC (permalink / raw)
  To: bpf
  Cc: netdev, tj, davem, m, alexei.starovoitov, andrii, sdf, Daniel Borkmann

Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
Back in the day, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
embedded per-socket cgroup information into sock->sk_cgrp_data and, in order
to save 8 bytes in struct sock, made both mutually exclusive, that is, when
cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).

The assumption made was "there is no reason to mix the two and this is in line
with how legacy and v2 compatibility is handled" as stated in bd1060a1d671.
However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
this assumption no longer holds, and the possibility of the v1/v2 mixed mode
with the v2 root fallback being hit becomes a real security issue.

Many of the cgroup v2 BPF programs are also used for policy enforcement, just
to pick _one_ example, that is, to programmatically deny socket related system
calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
a policy bypass for the affected Pods.

In production environments, we have recently seen this case due to various
circumstances: i) a third-party agent and/or ii) a container runtime such
as [0] in the user's environment configuring legacy cgroup v1 net_cls tags,
which triggered the implicit root fallback mentioned above. Another case is
Kubernetes projects like kind [1] which create Kubernetes nodes in a container
and also add cgroup namespaces to the mix, meaning programs which are attached
to the cgroup v2 root of the cgroup namespace get attached to a non-root
cgroup v2 path from the init namespace's point of view. And the latter's root
is out of reach for agents on a kind Kubernetes node to configure. Meaning,
any entity on the node setting a cgroup v1 net_cls tag will trigger the bypass
despite cgroup v2 BPF programs attached to the namespace root.

Generally, this mutual exclusiveness does not hold anymore in today's user
environments and makes cgroup v2 usage from BPF side fragile and unreliable.
This fix adds a proper struct cgroup pointer for the cgroup v2 case to struct
sock_cgroup_data in order to address these issues; this implicitly also fixes
the trade-offs made back then with regard to races and refcount leaks as
stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF
programs always operate as expected.

  [0] https://github.com/nestybox/sysbox/
  [1] https://kind.sigs.k8s.io/

Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Martynas Pumputis <m@lambda.lt>
Cc: Stanislav Fomichev <sdf@google.com>
---
 v1 -> v2:
   - Remove unneeded READ_ONCE()/WRITE_ONCE() pair around skcd->cgroup,
     thanks Stanislav!

 include/linux/cgroup-defs.h  | 107 +++++++++--------------------------
 include/linux/cgroup.h       |  22 +------
 kernel/cgroup/cgroup.c       |  50 ++++------------
 net/core/netclassid_cgroup.c |   7 +--
 net/core/netprio_cgroup.c    |  10 +---
 5 files changed, 41 insertions(+), 155 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index e1c705fdfa7c..44446025741f 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -752,107 +752,54 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}
  * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
  * per-socket cgroup information except for memcg association.
  *
- * On legacy hierarchies, net_prio and net_cls controllers directly set
- * attributes on each sock which can then be tested by the network layer.
- * On the default hierarchy, each sock is associated with the cgroup it was
- * created in and the networking layer can match the cgroup directly.
- *
- * To avoid carrying all three cgroup related fields separately in sock,
- * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
- * On boot, sock_cgroup_data records the cgroup that the sock was created
- * in so that cgroup2 matches can be made; however, once either net_prio or
- * net_cls starts being used, the area is overridden to carry prioidx and/or
- * classid.  The two modes are distinguished by whether the lowest bit is
- * set.  Clear bit indicates cgroup pointer while set bit prioidx and
- * classid.
- *
- * While userland may start using net_prio or net_cls at any time, once
- * either is used, cgroup2 matching no longer works.  There is no reason to
- * mix the two and this is in line with how legacy and v2 compatibility is
- * handled.  On mode switch, cgroup references which are already being
- * pointed to by socks may be leaked.  While this can be remedied by adding
- * synchronization around sock_cgroup_data, given that the number of leaked
- * cgroups is bound and highly unlikely to be high, this seems to be the
- * better trade-off.
+ * On legacy hierarchies, net_prio and net_cls controllers directly
+ * set attributes on each sock which can then be tested by the network
+ * layer. On the default hierarchy, each sock is associated with the
+ * cgroup it was created in and the networking layer can match the
+ * cgroup directly.
  */
 struct sock_cgroup_data {
-	union {
-#ifdef __LITTLE_ENDIAN
-		struct {
-			u8	is_data : 1;
-			u8	no_refcnt : 1;
-			u8	unused : 6;
-			u8	padding;
-			u16	prioidx;
-			u32	classid;
-		} __packed;
-#else
-		struct {
-			u32	classid;
-			u16	prioidx;
-			u8	padding;
-			u8	unused : 6;
-			u8	no_refcnt : 1;
-			u8	is_data : 1;
-		} __packed;
+	struct cgroup	*cgroup; /* v2 */
+#if defined(CONFIG_CGROUP_NET_CLASSID)
+	u32		classid; /* v1 */
+#endif
+#if defined(CONFIG_CGROUP_NET_PRIO)
+	u16		prioidx; /* v1 */
 #endif
-		u64		val;
-	};
 };
 
-/*
- * There's a theoretical window where the following accessors race with
- * updaters and return part of the previous pointer as the prioidx or
- * classid.  Such races are short-lived and the result isn't critical.
- */
 static inline u16 sock_cgroup_prioidx(const struct sock_cgroup_data *skcd)
 {
-	/* fallback to 1 which is always the ID of the root cgroup */
-	return (skcd->is_data & 1) ? skcd->prioidx : 1;
+#if defined(CONFIG_CGROUP_NET_PRIO)
+	return READ_ONCE(skcd->prioidx);
+#else
+	return 1;
+#endif
 }
 
 static inline u32 sock_cgroup_classid(const struct sock_cgroup_data *skcd)
 {
-	/* fallback to 0 which is the unconfigured default classid */
-	return (skcd->is_data & 1) ? skcd->classid : 0;
+#if defined(CONFIG_CGROUP_NET_CLASSID)
+	return READ_ONCE(skcd->classid);
+#else
+	return 0;
+#endif
 }
 
-/*
- * If invoked concurrently, the updaters may clobber each other.  The
- * caller is responsible for synchronization.
- */
 static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
 					   u16 prioidx)
 {
-	struct sock_cgroup_data skcd_buf = {{ .val = READ_ONCE(skcd->val) }};
-
-	if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
-		return;
-
-	if (!(skcd_buf.is_data & 1)) {
-		skcd_buf.val = 0;
-		skcd_buf.is_data = 1;
-	}
-
-	skcd_buf.prioidx = prioidx;
-	WRITE_ONCE(skcd->val, skcd_buf.val);	/* see sock_cgroup_ptr() */
+#if defined(CONFIG_CGROUP_NET_PRIO)
+	WRITE_ONCE(skcd->prioidx, prioidx);
+#endif
 }
 
 static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
 					   u32 classid)
 {
-	struct sock_cgroup_data skcd_buf = {{ .val = READ_ONCE(skcd->val) }};
-
-	if (sock_cgroup_classid(&skcd_buf) == classid)
-		return;
-
-	if (!(skcd_buf.is_data & 1)) {
-		skcd_buf.val = 0;
-		skcd_buf.is_data = 1;
-	}
-
-	skcd_buf.classid = classid;
-	WRITE_ONCE(skcd->val, skcd_buf.val);	/* see sock_cgroup_ptr() */
+#if defined(CONFIG_CGROUP_NET_CLASSID)
+	WRITE_ONCE(skcd->classid, classid);
+#endif
 }
 
 #else	/* CONFIG_SOCK_CGROUP_DATA */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7bf60454a313..75c151413fda 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -829,33 +829,13 @@ static inline void cgroup_account_cputime_field(struct task_struct *task,
  */
 #ifdef CONFIG_SOCK_CGROUP_DATA
 
-#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
-extern spinlock_t cgroup_sk_update_lock;
-#endif
-
-void cgroup_sk_alloc_disable(void);
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
 void cgroup_sk_clone(struct sock_cgroup_data *skcd);
 void cgroup_sk_free(struct sock_cgroup_data *skcd);
 
 static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
 {
-#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
-	unsigned long v;
-
-	/*
-	 * @skcd->val is 64bit but the following is safe on 32bit too as we
-	 * just need the lower ulong to be written and read atomically.
-	 */
-	v = READ_ONCE(skcd->val);
-
-	if (v & 3)
-		return &cgrp_dfl_root.cgrp;
-
-	return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
-#else
-	return (struct cgroup *)(unsigned long)skcd->val;
-#endif
+	return skcd->cgroup;
 }
 
 #else	/* CONFIG_CGROUP_DATA */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 881ce1470beb..8afa8690d288 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6572,74 +6572,44 @@ int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v)
  */
 #ifdef CONFIG_SOCK_CGROUP_DATA
 
-#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
-
-DEFINE_SPINLOCK(cgroup_sk_update_lock);
-static bool cgroup_sk_alloc_disabled __read_mostly;
-
-void cgroup_sk_alloc_disable(void)
-{
-	if (cgroup_sk_alloc_disabled)
-		return;
-	pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation\n");
-	cgroup_sk_alloc_disabled = true;
-}
-
-#else
-
-#define cgroup_sk_alloc_disabled	false
-
-#endif
-
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 {
-	if (cgroup_sk_alloc_disabled) {
-		skcd->no_refcnt = 1;
-		return;
-	}
-
 	/* Don't associate the sock with unrelated interrupted task's cgroup. */
 	if (in_interrupt())
 		return;
 
 	rcu_read_lock();
-
 	while (true) {
 		struct css_set *cset;
 
 		cset = task_css_set(current);
 		if (likely(cgroup_tryget(cset->dfl_cgrp))) {
-			skcd->val = (unsigned long)cset->dfl_cgrp;
+			skcd->cgroup = cset->dfl_cgrp;
 			cgroup_bpf_get(cset->dfl_cgrp);
 			break;
 		}
 		cpu_relax();
 	}
-
 	rcu_read_unlock();
 }
 
 void cgroup_sk_clone(struct sock_cgroup_data *skcd)
 {
-	if (skcd->val) {
-		if (skcd->no_refcnt)
-			return;
-		/*
-		 * We might be cloning a socket which is left in an empty
-		 * cgroup and the cgroup might have already been rmdir'd.
-		 * Don't use cgroup_get_live().
-		 */
-		cgroup_get(sock_cgroup_ptr(skcd));
-		cgroup_bpf_get(sock_cgroup_ptr(skcd));
-	}
+	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
+
+	/*
+	 * We might be cloning a socket which is left in an empty
+	 * cgroup and the cgroup might have already been rmdir'd.
+	 * Don't use cgroup_get_live().
+	 */
+	cgroup_get(cgrp);
+	cgroup_bpf_get(cgrp);
 }
 
 void cgroup_sk_free(struct sock_cgroup_data *skcd)
 {
 	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
 
-	if (skcd->no_refcnt)
-		return;
 	cgroup_bpf_put(cgrp);
 	cgroup_put(cgrp);
 }
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index b49c57d35a88..1a6a86693b74 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -71,11 +71,8 @@ static int update_classid_sock(const void *v, struct file *file, unsigned n)
 	struct update_classid_context *ctx = (void *)v;
 	struct socket *sock = sock_from_file(file);
 
-	if (sock) {
-		spin_lock(&cgroup_sk_update_lock);
+	if (sock)
 		sock_cgroup_set_classid(&sock->sk->sk_cgrp_data, ctx->classid);
-		spin_unlock(&cgroup_sk_update_lock);
-	}
 	if (--ctx->batch == 0) {
 		ctx->batch = UPDATE_CLASSID_BATCH;
 		return n + 1;
@@ -121,8 +118,6 @@ static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
 	struct css_task_iter it;
 	struct task_struct *p;
 
-	cgroup_sk_alloc_disable();
-
 	cs->classid = (u32)value;
 
 	css_task_iter_start(css, 0, &it);
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 99a431c56f23..8456dfbe2eb4 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -207,8 +207,6 @@ static ssize_t write_priomap(struct kernfs_open_file *of,
 	if (!dev)
 		return -ENODEV;
 
-	cgroup_sk_alloc_disable();
-
 	rtnl_lock();
 
 	ret = netprio_set_prio(of_css(of), dev, prio);
@@ -221,12 +219,10 @@ static ssize_t write_priomap(struct kernfs_open_file *of,
 static int update_netprio(const void *v, struct file *file, unsigned n)
 {
 	struct socket *sock = sock_from_file(file);
-	if (sock) {
-		spin_lock(&cgroup_sk_update_lock);
+
+	if (sock)
 		sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data,
 					(unsigned long)v);
-		spin_unlock(&cgroup_sk_update_lock);
-	}
 	return 0;
 }
 
@@ -235,8 +231,6 @@ static void net_prio_attach(struct cgroup_taskset *tset)
 	struct task_struct *p;
 	struct cgroup_subsys_state *css;
 
-	cgroup_sk_alloc_disable();
-
 	cgroup_taskset_for_each(p, css, tset) {
 		void *v = (void *)(unsigned long)css->id;
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH bpf v2 2/3] bpf, selftests: Add cgroup v1 net_cls classid helpers
  2021-09-09 20:43 [PATCH bpf v2 1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode Daniel Borkmann
@ 2021-09-09 20:43 ` Daniel Borkmann
  2021-09-09 20:43 ` [PATCH bpf v2 3/3] bpf, selftests: Add test case for mixed cgroup v1/v2 Daniel Borkmann
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Daniel Borkmann @ 2021-09-09 20:43 UTC (permalink / raw)
  To: bpf
  Cc: netdev, tj, davem, m, alexei.starovoitov, andrii, sdf, Daniel Borkmann

Add a minimal set of helpers for cgroup v1 net_cls classid management, in
order to set an id, join from a process, and initiate setup and teardown.
The cgroup v2 helpers are left as-is, but reused where possible.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/testing/selftests/bpf/cgroup_helpers.c | 118 +++++++++++++++++--
 tools/testing/selftests/bpf/cgroup_helpers.h |  16 ++-
 2 files changed, 122 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index 033051717ba5..1fa92dbe9460 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -12,27 +12,35 @@
 #include <unistd.h>
 #include <ftw.h>
 
-
 #include "cgroup_helpers.h"
 
 /*
  * To avoid relying on the system setup, when setup_cgroup_env is called
- * we create a new mount namespace, and cgroup namespace. The cgroup2
- * root is mounted at CGROUP_MOUNT_PATH
- *
- * Unfortunately, most people don't have cgroupv2 enabled at this point in time.
- * It's easier to create our own mount namespace and manage it ourselves.
+ * we create a new mount namespace, and cgroup namespace. The cgroupv2
+ * root is mounted at CGROUP_MOUNT_PATH. Unfortunately, most people don't
+ * have cgroupv2 enabled at this point in time. It's easier to create our
+ * own mount namespace and manage it ourselves. We assume /mnt exists.
  *
- * We assume /mnt exists.
+ * Related cgroupv1 helpers are named *classid*(), since we only use the
+ * net_cls controller for tagging net_cls.classid. We assume the default
+ * mount under /sys/fs/cgroup/net_cls exists which should be the case for
+ * the vast majority of users.
  */
 
 #define WALK_FD_LIMIT			16
+
 #define CGROUP_MOUNT_PATH		"/mnt"
+#define NETCLS_MOUNT_PATH		"/sys/fs/cgroup/net_cls"
 #define CGROUP_WORK_DIR			"/cgroup-test-work-dir"
+
 #define format_cgroup_path(buf, path) \
 	snprintf(buf, sizeof(buf), "%s%s%s", CGROUP_MOUNT_PATH, \
 		 CGROUP_WORK_DIR, path)
 
+#define format_classid_path(buf)				\
+	snprintf(buf, sizeof(buf), "%s%s", NETCLS_MOUNT_PATH,	\
+		 CGROUP_WORK_DIR)
+
 /**
  * enable_all_controllers() - Enable all available cgroup v2 controllers
  *
@@ -139,8 +147,7 @@ static int nftwfunc(const char *filename, const struct stat *statptr,
 	return 0;
 }
 
-
-static int join_cgroup_from_top(char *cgroup_path)
+static int join_cgroup_from_top(const char *cgroup_path)
 {
 	char cgroup_procs_path[PATH_MAX + 1];
 	pid_t pid = getpid();
@@ -313,3 +320,96 @@ int cgroup_setup_and_join(const char *path) {
 	}
 	return cg_fd;
 }
+
+/**
+ * setup_classid_environment() - Setup the cgroupv1 net_cls environment
+ *
+ * After calling this function, cleanup_classid_environment should be called
+ * once testing is complete.
+ *
+ * This function will print an error to stderr and return 1 if it is unable
+ * to setup the cgroup environment. If setup is successful, 0 is returned.
+ */
+int setup_classid_environment(void)
+{
+	char cgroup_workdir[PATH_MAX + 1];
+
+	format_classid_path(cgroup_workdir);
+	cleanup_classid_environment();
+
+	if (mkdir(cgroup_workdir, 0777) && errno != EEXIST) {
+		log_err("mkdir cgroup work dir");
+		return 1;
+	}
+
+	return 0;
+}
+
+/**
+ * set_classid() - Set a cgroupv1 net_cls classid
+ * @id: the numeric classid
+ *
+ * Writes the passed classid into the cgroup work dir's net_cls.classid
+ * file in order to later on trigger socket tagging.
+ *
+ * On success, it returns 0, otherwise on failure it returns 1. If there
+ * is a failure, it prints the error to stderr.
+ */
+int set_classid(unsigned int id)
+{
+	char cgroup_workdir[PATH_MAX - 42];
+	char cgroup_classid_path[PATH_MAX + 1];
+	int fd, rc = 0;
+
+	format_classid_path(cgroup_workdir);
+	snprintf(cgroup_classid_path, sizeof(cgroup_classid_path),
+		 "%s/net_cls.classid", cgroup_workdir);
+
+	fd = open(cgroup_classid_path, O_WRONLY);
+	if (fd < 0) {
+		log_err("Opening cgroup classid: %s", cgroup_classid_path);
+		return 1;
+	}
+
+	if (dprintf(fd, "%u\n", id) < 0) {
+		log_err("Setting cgroup classid");
+		rc = 1;
+	}
+
+	close(fd);
+	return rc;
+}
+
+/**
+ * join_classid() - Join a cgroupv1 net_cls classid
+ *
+ * This function expects the cgroup work dir to be already created, as we
+ * join it here. This causes the process sockets to be tagged with the given
+ * net_cls classid.
+ *
+ * On success, it returns 0, otherwise on failure it returns 1.
+ */
+int join_classid(void)
+{
+	char cgroup_workdir[PATH_MAX + 1];
+
+	format_classid_path(cgroup_workdir);
+	return join_cgroup_from_top(cgroup_workdir);
+}
+
+/**
+ * cleanup_classid_environment() - Cleanup the cgroupv1 net_cls environment
+ *
+ * At call time, it moves the calling process to the root cgroup, and then
+ * runs the deletion process.
+ *
+ * On failure, it will print an error to stderr, and try to continue.
+ */
+void cleanup_classid_environment(void)
+{
+	char cgroup_workdir[PATH_MAX + 1];
+
+	format_classid_path(cgroup_workdir);
+	join_cgroup_from_top(NETCLS_MOUNT_PATH);
+	nftw(cgroup_workdir, nftwfunc, WALK_FD_LIMIT, FTW_DEPTH | FTW_MOUNT);
+}
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index 5fe3d88e4f0d..629da3854b3e 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef __CGROUP_HELPERS_H
 #define __CGROUP_HELPERS_H
+
 #include <errno.h>
 #include <string.h>
 
@@ -8,12 +9,21 @@
 #define log_err(MSG, ...) fprintf(stderr, "(%s:%d: errno: %s) " MSG "\n", \
 	__FILE__, __LINE__, clean_errno(), ##__VA_ARGS__)
 
-
+/* cgroupv2 related */
 int cgroup_setup_and_join(const char *path);
 int create_and_get_cgroup(const char *path);
+unsigned long long get_cgroup_id(const char *path);
+
 int join_cgroup(const char *path);
+
 int setup_cgroup_environment(void);
 void cleanup_cgroup_environment(void);
-unsigned long long get_cgroup_id(const char *path);
 
-#endif
+/* cgroupv1 related */
+int set_classid(unsigned int id);
+int join_classid(void);
+
+int setup_classid_environment(void);
+void cleanup_classid_environment(void);
+
+#endif /* __CGROUP_HELPERS_H */
-- 
2.21.0



* [PATCH bpf v2 3/3] bpf, selftests: Add test case for mixed cgroup v1/v2
  2021-09-09 20:43 [PATCH bpf v2 1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode Daniel Borkmann
  2021-09-09 20:43 ` [PATCH bpf v2 2/3] bpf, selftests: Add cgroup v1 net_cls classid helpers Daniel Borkmann
@ 2021-09-09 20:43 ` Daniel Borkmann
  2021-09-09 21:40 ` [PATCH bpf v2 1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode sdf
  2021-09-09 22:29 ` Tejun Heo
  3 siblings, 0 replies; 5+ messages in thread
From: Daniel Borkmann @ 2021-09-09 20:43 UTC (permalink / raw)
  To: bpf
  Cc: netdev, tj, davem, m, alexei.starovoitov, andrii, sdf, Daniel Borkmann

Add a minimal selftest which attaches a small BPF policy program to the
connect(2) hook and rejects TCP connection requests to port 60123 with
EPERM. The program is attached to a non-root cgroup v2 path. The test
asserts that this works both under cgroup v2 only and under a mixed cgroup
v1/v2 environment, where a net_cls classid is additionally set in the
latter case.

Before fix:

  # ./test_progs -t cgroup_v1v2
  test_cgroup_v1v2:PASS:server_fd 0 nsec
  test_cgroup_v1v2:PASS:client_fd 0 nsec
  test_cgroup_v1v2:PASS:cgroup_fd 0 nsec
  test_cgroup_v1v2:PASS:server_fd 0 nsec
  run_test:PASS:skel_open 0 nsec
  run_test:PASS:prog_attach 0 nsec
  test_cgroup_v1v2:PASS:cgroup-v2-only 0 nsec
  run_test:PASS:skel_open 0 nsec
  run_test:PASS:prog_attach 0 nsec
  run_test:PASS:join_classid 0 nsec
  (network_helpers.c:219: errno: None) Unexpected success to connect to server
  test_cgroup_v1v2:FAIL:cgroup-v1v2 unexpected error: -1 (errno 0)
  #27 cgroup_v1v2:FAIL
  Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED

After fix:

  # ./test_progs -t cgroup_v1v2
  #27 cgroup_v1v2:OK
  Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/testing/selftests/bpf/network_helpers.c | 27 +++++--
 tools/testing/selftests/bpf/network_helpers.h |  1 +
 .../selftests/bpf/prog_tests/cgroup_v1v2.c    | 79 +++++++++++++++++++
 .../selftests/bpf/progs/connect4_dropper.c    | 26 ++++++
 4 files changed, 127 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_v1v2.c
 create mode 100644 tools/testing/selftests/bpf/progs/connect4_dropper.c

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 7e9f6375757a..6db1af8fdee7 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -208,11 +208,26 @@ int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
 
 static int connect_fd_to_addr(int fd,
 			      const struct sockaddr_storage *addr,
-			      socklen_t addrlen)
+			      socklen_t addrlen, const bool must_fail)
 {
-	if (connect(fd, (const struct sockaddr *)addr, addrlen)) {
-		log_err("Failed to connect to server");
-		return -1;
+	int ret;
+
+	errno = 0;
+	ret = connect(fd, (const struct sockaddr *)addr, addrlen);
+	if (must_fail) {
+		if (!ret) {
+			log_err("Unexpected success to connect to server");
+			return -1;
+		}
+		if (errno != EPERM) {
+			log_err("Unexpected error from connect to server");
+			return -1;
+		}
+	} else {
+		if (ret) {
+			log_err("Failed to connect to server");
+			return -1;
+		}
 	}
 
 	return 0;
@@ -257,7 +272,7 @@ int connect_to_fd_opts(int server_fd, const struct network_helper_opts *opts)
 		       strlen(opts->cc) + 1))
 		goto error_close;
 
-	if (connect_fd_to_addr(fd, &addr, addrlen))
+	if (connect_fd_to_addr(fd, &addr, addrlen, opts->must_fail))
 		goto error_close;
 
 	return fd;
@@ -289,7 +304,7 @@ int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms)
 		return -1;
 	}
 
-	if (connect_fd_to_addr(client_fd, &addr, len))
+	if (connect_fd_to_addr(client_fd, &addr, len, false))
 		return -1;
 
 	return 0;
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index da7e132657d5..d198181a5648 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -20,6 +20,7 @@ typedef __u16 __sum16;
 struct network_helper_opts {
 	const char *cc;
 	int timeout_ms;
+	bool must_fail;
 };
 
 /* ipv4 test vector */
diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_v1v2.c b/tools/testing/selftests/bpf/prog_tests/cgroup_v1v2.c
new file mode 100644
index 000000000000..ab3b9bc5e6d1
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_v1v2.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+
+#include "connect4_dropper.skel.h"
+
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+
+static int run_test(int cgroup_fd, int server_fd, bool classid)
+{
+	struct network_helper_opts opts = {
+		.must_fail = true,
+	};
+	struct connect4_dropper *skel;
+	int fd, err = 0;
+
+	skel = connect4_dropper__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		return -1;
+
+	skel->links.connect_v4_dropper =
+		bpf_program__attach_cgroup(skel->progs.connect_v4_dropper,
+					   cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.connect_v4_dropper, "prog_attach")) {
+		err = -1;
+		goto out;
+	}
+
+	if (classid && !ASSERT_OK(join_classid(), "join_classid")) {
+		err = -1;
+		goto out;
+	}
+
+	fd = connect_to_fd_opts(server_fd, &opts);
+	if (fd < 0)
+		err = -1;
+	else
+		close(fd);
+out:
+	connect4_dropper__destroy(skel);
+	return err;
+}
+
+void test_cgroup_v1v2(void)
+{
+	struct network_helper_opts opts = {};
+	int server_fd, client_fd, cgroup_fd;
+	static const int port = 60123;
+
+	/* Step 1: Check base connectivity works without any BPF. */
+	server_fd = start_server(AF_INET, SOCK_STREAM, NULL, port, 0);
+	if (!ASSERT_GE(server_fd, 0, "server_fd"))
+		return;
+	client_fd = connect_to_fd_opts(server_fd, &opts);
+	if (!ASSERT_GE(client_fd, 0, "client_fd")) {
+		close(server_fd);
+		return;
+	}
+	close(client_fd);
+	close(server_fd);
+
+	/* Step 2: Check BPF policy prog attached to cgroups drops connectivity. */
+	cgroup_fd = test__join_cgroup("/connect_dropper");
+	if (!ASSERT_GE(cgroup_fd, 0, "cgroup_fd"))
+		return;
+	server_fd = start_server(AF_INET, SOCK_STREAM, NULL, port, 0);
+	if (!ASSERT_GE(server_fd, 0, "server_fd")) {
+		close(cgroup_fd);
+		return;
+	}
+	ASSERT_OK(run_test(cgroup_fd, server_fd, false), "cgroup-v2-only");
+	setup_classid_environment();
+	set_classid(42);
+	ASSERT_OK(run_test(cgroup_fd, server_fd, true), "cgroup-v1v2");
+	cleanup_classid_environment();
+	close(server_fd);
+	close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/connect4_dropper.c b/tools/testing/selftests/bpf/progs/connect4_dropper.c
new file mode 100644
index 000000000000..b565d997810a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/connect4_dropper.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <string.h>
+
+#include <linux/stddef.h>
+#include <linux/bpf.h>
+
+#include <sys/socket.h>
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#define VERDICT_REJECT	0
+#define VERDICT_PROCEED	1
+
+SEC("cgroup/connect4")
+int connect_v4_dropper(struct bpf_sock_addr *ctx)
+{
+	if (ctx->type != SOCK_STREAM)
+		return VERDICT_PROCEED;
+	if (ctx->user_port == bpf_htons(60123))
+		return VERDICT_REJECT;
+	return VERDICT_PROCEED;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.21.0



* Re: [PATCH bpf v2 1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode
  2021-09-09 20:43 [PATCH bpf v2 1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode Daniel Borkmann
  2021-09-09 20:43 ` [PATCH bpf v2 2/3] bpf, selftests: Add cgroup v1 net_cls classid helpers Daniel Borkmann
  2021-09-09 20:43 ` [PATCH bpf v2 3/3] bpf, selftests: Add test case for mixed cgroup v1/v2 Daniel Borkmann
@ 2021-09-09 21:40 ` sdf
  2021-09-09 22:29 ` Tejun Heo
  3 siblings, 0 replies; 5+ messages in thread
From: sdf @ 2021-09-09 21:40 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: bpf, netdev, tj, davem, m, alexei.starovoitov, andrii

On 09/09, Daniel Borkmann wrote:
> Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
> Back in the days, commit bd1060a1d671 ("sock, cgroup: add  
> sock->sk_cgroup")
> embedded per-socket cgroup information into sock->sk_cgrp_data and in  
> order
> to save 8 bytes in struct sock made both mutually exclusive, that is, when
> cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
> falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).

> The assumption made was "there is no reason to mix the two and this is in  
> line
> with how legacy and v2 compatibility is handled" as stated in  
> bd1060a1d671.
> However, with Kubernetes more widely supporting cgroups v2 as well  
> nowadays,
> this assumption no longer holds, and the possibility of the v1/v2 mixed  
> mode
> with the v2 root fallback being hit becomes a real security issue.

> Many of the cgroup v2 BPF programs are also used for policy enforcement,  
> just
> to pick _one_ example, that is, to programmatically deny socket related  
> system
> calls like connect(2) or bind(2). A v2 root fallback would implicitly  
> cause
> a policy bypass for the affected Pods.

> In production environments, we have recently seen this case due to various
> circumstances: i) a different 3rd party agent and/or ii) a container  
> runtime
> such as [0] in the user's environment configuring legacy cgroup v1 net_cls
> tags, which triggered implicitly mentioned root fallback. Another case is
> Kubernetes projects like kind [1] which create Kubernetes nodes in a  
> container
> and also add cgroup namespaces to the mix, meaning programs which are  
> attached
> to the cgroup v2 root of the cgroup namespace get attached to a non-root
> cgroup v2 path from init namespace point of view. And the latter's root is
> out of reach for agents on a kind Kubernetes node to configure. Meaning,  
> any
> entity on the node setting cgroup v1 net_cls tag will trigger the bypass
> despite cgroup v2 BPF programs attached to the namespace root.

> Generally, this mutual exclusiveness does not hold anymore in today's user
> environments and makes cgroup v2 usage from BPF side fragile and  
> unreliable.
> This fix adds proper struct cgroup pointer for the cgroup v2 case to  
> struct
> sock_cgroup_data in order to address these issues; this implicitly also  
> fixes
> the tradeoffs being made back then with regards to races and refcount  
> leaks
> as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF
> programs always operate as expected.

>    [0] https://github.com/nestybox/sysbox/
>    [1] https://kind.sigs.k8s.io/

> Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Martynas Pumputis <m@lambda.lt>
> Cc: Stanislav Fomichev <sdf@google.com>
> ---
>   v1 -> v2:
>     - Remove unneeded READ_ONCE()/WRITE_ONCE() pair around skcd->cgroup,
>       thanks Stanislav!

LGTM! Thank you.


>   include/linux/cgroup-defs.h  | 107 +++++++++--------------------------
>   include/linux/cgroup.h       |  22 +------
>   kernel/cgroup/cgroup.c       |  50 ++++------------
>   net/core/netclassid_cgroup.c |   7 +--
>   net/core/netprio_cgroup.c    |  10 +---
>   5 files changed, 41 insertions(+), 155 deletions(-)

> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index e1c705fdfa7c..44446025741f 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -752,107 +752,54 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}
>    * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
>    * per-socket cgroup information except for memcg association.
>    *
> - * On legacy hierarchies, net_prio and net_cls controllers directly set
> - * attributes on each sock which can then be tested by the network layer.
> - * On the default hierarchy, each sock is associated with the cgroup it was
> - * created in and the networking layer can match the cgroup directly.
> - *
> - * To avoid carrying all three cgroup related fields separately in sock,
> - * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
> - * On boot, sock_cgroup_data records the cgroup that the sock was created
> - * in so that cgroup2 matches can be made; however, once either net_prio or
> - * net_cls starts being used, the area is overridden to carry prioidx and/or
> - * classid.  The two modes are distinguished by whether the lowest bit is
> - * set.  Clear bit indicates cgroup pointer while set bit prioidx and
> - * classid.
> - *
> - * While userland may start using net_prio or net_cls at any time, once
> - * either is used, cgroup2 matching no longer works.  There is no reason to
> - * mix the two and this is in line with how legacy and v2 compatibility is
> - * handled.  On mode switch, cgroup references which are already being
> - * pointed to by socks may be leaked.  While this can be remedied by adding
> - * synchronization around sock_cgroup_data, given that the number of leaked
> - * cgroups is bound and highly unlikely to be high, this seems to be the
> - * better trade-off.
> + * On legacy hierarchies, net_prio and net_cls controllers directly
> + * set attributes on each sock which can then be tested by the network
> + * layer. On the default hierarchy, each sock is associated with the
> + * cgroup it was created in and the networking layer can match the
> + * cgroup directly.
>    */
>   struct sock_cgroup_data {
> -	union {
> -#ifdef __LITTLE_ENDIAN
> -		struct {
> -			u8	is_data : 1;
> -			u8	no_refcnt : 1;
> -			u8	unused : 6;
> -			u8	padding;
> -			u16	prioidx;
> -			u32	classid;
> -		} __packed;
> -#else
> -		struct {
> -			u32	classid;
> -			u16	prioidx;
> -			u8	padding;
> -			u8	unused : 6;
> -			u8	no_refcnt : 1;
> -			u8	is_data : 1;
> -		} __packed;
> +	struct cgroup	*cgroup; /* v2 */
> +#if defined(CONFIG_CGROUP_NET_CLASSID)
> +	u32		classid; /* v1 */
> +#endif
> +#if defined(CONFIG_CGROUP_NET_PRIO)
> +	u16		prioidx; /* v1 */
>   #endif
> -		u64		val;
> -	};
>   };

> -/*
> - * There's a theoretical window where the following accessors race with
> - * updaters and return part of the previous pointer as the prioidx or
> - * classid.  Such races are short-lived and the result isn't critical.
> - */
>   static inline u16 sock_cgroup_prioidx(const struct sock_cgroup_data *skcd)
>   {
> -	/* fallback to 1 which is always the ID of the root cgroup */
> -	return (skcd->is_data & 1) ? skcd->prioidx : 1;
> +#if defined(CONFIG_CGROUP_NET_PRIO)
> +	return READ_ONCE(skcd->prioidx);
> +#else
> +	return 1;
> +#endif
>   }

>   static inline u32 sock_cgroup_classid(const struct sock_cgroup_data *skcd)
>   {
> -	/* fallback to 0 which is the unconfigured default classid */
> -	return (skcd->is_data & 1) ? skcd->classid : 0;
> +#if defined(CONFIG_CGROUP_NET_CLASSID)
> +	return READ_ONCE(skcd->classid);
> +#else
> +	return 0;
> +#endif
>   }

> -/*
> - * If invoked concurrently, the updaters may clobber each other.  The
> - * caller is responsible for synchronization.
> - */
>   static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
>   					   u16 prioidx)
>   {
> -	struct sock_cgroup_data skcd_buf = {{ .val = READ_ONCE(skcd->val) }};
> -
> -	if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
> -		return;
> -
> -	if (!(skcd_buf.is_data & 1)) {
> -		skcd_buf.val = 0;
> -		skcd_buf.is_data = 1;
> -	}
> -
> -	skcd_buf.prioidx = prioidx;
> -	WRITE_ONCE(skcd->val, skcd_buf.val);	/* see sock_cgroup_ptr() */
> +#if defined(CONFIG_CGROUP_NET_PRIO)
> +	WRITE_ONCE(skcd->prioidx, prioidx);
> +#endif
>   }

>   static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
>   					   u32 classid)
>   {
> -	struct sock_cgroup_data skcd_buf = {{ .val = READ_ONCE(skcd->val) }};
> -
> -	if (sock_cgroup_classid(&skcd_buf) == classid)
> -		return;
> -
> -	if (!(skcd_buf.is_data & 1)) {
> -		skcd_buf.val = 0;
> -		skcd_buf.is_data = 1;
> -	}
> -
> -	skcd_buf.classid = classid;
> -	WRITE_ONCE(skcd->val, skcd_buf.val);	/* see sock_cgroup_ptr() */
> +#if defined(CONFIG_CGROUP_NET_CLASSID)
> +	WRITE_ONCE(skcd->classid, classid);
> +#endif
>   }

>   #else	/* CONFIG_SOCK_CGROUP_DATA */
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 7bf60454a313..75c151413fda 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -829,33 +829,13 @@ static inline void cgroup_account_cputime_field(struct task_struct *task,
>    */
>   #ifdef CONFIG_SOCK_CGROUP_DATA

> -#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
> -extern spinlock_t cgroup_sk_update_lock;
> -#endif
> -
> -void cgroup_sk_alloc_disable(void);
>   void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
>   void cgroup_sk_clone(struct sock_cgroup_data *skcd);
>   void cgroup_sk_free(struct sock_cgroup_data *skcd);

>   static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
>   {
> -#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
> -	unsigned long v;
> -
> -	/*
> -	 * @skcd->val is 64bit but the following is safe on 32bit too as we
> -	 * just need the lower ulong to be written and read atomically.
> -	 */
> -	v = READ_ONCE(skcd->val);
> -
> -	if (v & 3)
> -		return &cgrp_dfl_root.cgrp;
> -
> -	return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
> -#else
> -	return (struct cgroup *)(unsigned long)skcd->val;
> -#endif
> +	return skcd->cgroup;
>   }

>   #else	/* CONFIG_CGROUP_DATA */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 881ce1470beb..8afa8690d288 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6572,74 +6572,44 @@ int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v)
>    */
>   #ifdef CONFIG_SOCK_CGROUP_DATA

> -#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
> -
> -DEFINE_SPINLOCK(cgroup_sk_update_lock);
> -static bool cgroup_sk_alloc_disabled __read_mostly;
> -
> -void cgroup_sk_alloc_disable(void)
> -{
> -	if (cgroup_sk_alloc_disabled)
> -		return;
> -	pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation\n");
> -	cgroup_sk_alloc_disabled = true;
> -}
> -
> -#else
> -
> -#define cgroup_sk_alloc_disabled	false
> -
> -#endif
> -
>   void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
>   {
> -	if (cgroup_sk_alloc_disabled) {
> -		skcd->no_refcnt = 1;
> -		return;
> -	}
> -
>   	/* Don't associate the sock with unrelated interrupted task's cgroup. */
>   	if (in_interrupt())
>   		return;

>   	rcu_read_lock();
> -
>   	while (true) {
>   		struct css_set *cset;

>   		cset = task_css_set(current);
>   		if (likely(cgroup_tryget(cset->dfl_cgrp))) {
> -			skcd->val = (unsigned long)cset->dfl_cgrp;
> +			skcd->cgroup = cset->dfl_cgrp;
>   			cgroup_bpf_get(cset->dfl_cgrp);
>   			break;
>   		}
>   		cpu_relax();
>   	}
> -
>   	rcu_read_unlock();
>   }

>   void cgroup_sk_clone(struct sock_cgroup_data *skcd)
>   {
> -	if (skcd->val) {
> -		if (skcd->no_refcnt)
> -			return;
> -		/*
> -		 * We might be cloning a socket which is left in an empty
> -		 * cgroup and the cgroup might have already been rmdir'd.
> -		 * Don't use cgroup_get_live().
> -		 */
> -		cgroup_get(sock_cgroup_ptr(skcd));
> -		cgroup_bpf_get(sock_cgroup_ptr(skcd));
> -	}
> +	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
> +
> +	/*
> +	 * We might be cloning a socket which is left in an empty
> +	 * cgroup and the cgroup might have already been rmdir'd.
> +	 * Don't use cgroup_get_live().
> +	 */
> +	cgroup_get(cgrp);
> +	cgroup_bpf_get(cgrp);
>   }

>   void cgroup_sk_free(struct sock_cgroup_data *skcd)
>   {
>   	struct cgroup *cgrp = sock_cgroup_ptr(skcd);

> -	if (skcd->no_refcnt)
> -		return;
>   	cgroup_bpf_put(cgrp);
>   	cgroup_put(cgrp);
>   }
> diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
> index b49c57d35a88..1a6a86693b74 100644
> --- a/net/core/netclassid_cgroup.c
> +++ b/net/core/netclassid_cgroup.c
> @@ -71,11 +71,8 @@ static int update_classid_sock(const void *v, struct file *file, unsigned n)
>   	struct update_classid_context *ctx = (void *)v;
>   	struct socket *sock = sock_from_file(file);

> -	if (sock) {
> -		spin_lock(&cgroup_sk_update_lock);
> +	if (sock)
>   		sock_cgroup_set_classid(&sock->sk->sk_cgrp_data, ctx->classid);
> -		spin_unlock(&cgroup_sk_update_lock);
> -	}
>   	if (--ctx->batch == 0) {
>   		ctx->batch = UPDATE_CLASSID_BATCH;
>   		return n + 1;
> @@ -121,8 +118,6 @@ static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
>   	struct css_task_iter it;
>   	struct task_struct *p;

> -	cgroup_sk_alloc_disable();
> -
>   	cs->classid = (u32)value;

>   	css_task_iter_start(css, 0, &it);
> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
> index 99a431c56f23..8456dfbe2eb4 100644
> --- a/net/core/netprio_cgroup.c
> +++ b/net/core/netprio_cgroup.c
> @@ -207,8 +207,6 @@ static ssize_t write_priomap(struct kernfs_open_file *of,
>   	if (!dev)
>   		return -ENODEV;

> -	cgroup_sk_alloc_disable();
> -
>   	rtnl_lock();

>   	ret = netprio_set_prio(of_css(of), dev, prio);
> @@ -221,12 +219,10 @@ static ssize_t write_priomap(struct kernfs_open_file *of,
>   static int update_netprio(const void *v, struct file *file, unsigned n)
>   {
>   	struct socket *sock = sock_from_file(file);
> -	if (sock) {
> -		spin_lock(&cgroup_sk_update_lock);
> +
> +	if (sock)
>   		sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data,
>   					(unsigned long)v);
> -		spin_unlock(&cgroup_sk_update_lock);
> -	}
>   	return 0;
>   }

> @@ -235,8 +231,6 @@ static void net_prio_attach(struct cgroup_taskset *tset)
>   	struct task_struct *p;
>   	struct cgroup_subsys_state *css;

> -	cgroup_sk_alloc_disable();
> -
>   	cgroup_taskset_for_each(p, css, tset) {
>   		void *v = (void *)(unsigned long)css->id;

> --
> 2.21.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH bpf v2 1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode
  2021-09-09 20:43 [PATCH bpf v2 1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode Daniel Borkmann
                   ` (2 preceding siblings ...)
  2021-09-09 21:40 ` [PATCH bpf v2 1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode sdf
@ 2021-09-09 22:29 ` Tejun Heo
  3 siblings, 0 replies; 5+ messages in thread
From: Tejun Heo @ 2021-09-09 22:29 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: bpf, netdev, davem, m, alexei.starovoitov, andrii, sdf

Hello,

On Thu, Sep 09, 2021 at 10:43:40PM +0200, Daniel Borkmann wrote:
...
> Generally, this mutual exclusiveness does not hold anymore in today's user
> environments and makes cgroup v2 usage from the BPF side fragile and unreliable.
> This fix adds a proper struct cgroup pointer for the cgroup v2 case to struct
> sock_cgroup_data in order to address these issues; this implicitly also fixes
> the tradeoffs made back then with regard to races and refcount leaks as
> stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF
> programs always operate as expected.
> 
>   [0] https://github.com/nestybox/sysbox/
>   [1] https://kind.sigs.k8s.io/
> 
> Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Martynas Pumputis <m@lambda.lt>
> Cc: Stanislav Fomichev <sdf@google.com>

While this does increase cgroup's footprint inside sock, I think it's worth
considering the following points:

1. It's clear now that we won't need more cgroup related socket fields for
   network integration. cgroup2 membership tagging has proven flexible
   enough especially in combination with bpf.

2. Users have been transitioning from cgroup1 to cgroup2, some gradually,
   which is why this multiplexing is becoming an issue. In time, as
   transitions progress further, we should be able to disable cgroup1 network
   controllers for many use cases.

For the series,

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


end of thread, other threads:[~2021-09-09 22:30 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-09 20:43 [PATCH bpf v2 1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode Daniel Borkmann
2021-09-09 20:43 ` [PATCH bpf v2 2/3] bpf, selftests: Add cgroup v1 net_cls classid helpers Daniel Borkmann
2021-09-09 20:43 ` [PATCH bpf v2 3/3] bpf, selftests: Add test case for mixed cgroup v1/v2 Daniel Borkmann
2021-09-09 21:40 ` [PATCH bpf v2 1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode sdf
2021-09-09 22:29 ` Tejun Heo
