LKML Archive on lore.kernel.org help / color / mirror / Atom feed
* [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino @ 2018-05-13 17:33 Alban Crequy 2018-05-14 19:38 ` Y Song 2018-05-21 16:26 ` Alexei Starovoitov 0 siblings, 2 replies; 11+ messages in thread From: Alban Crequy @ 2018-05-13 17:33 UTC (permalink / raw) To: netdev, linux-kernel, containers, cgroups; +Cc: Alban Crequy From: Alban Crequy <alban@kinvolk.io> bpf_get_current_cgroup_ino() allows BPF trace programs to get the inode of the cgroup where the current process resides. My use case is to get statistics about syscalls done by a specific Kubernetes container. I have a tracepoint on raw_syscalls/sys_enter and a BPF map containing the cgroup inode that I want to trace. I use bpf_get_current_cgroup_ino() and I quickly return from the tracepoint if the inode is not in the BPF hash map. Without this BPF helper, I would need to keep track of all pids in the container. The Netlink proc connector can be used to follow process creation and destruction but it is racy. This patch only looks at the memory cgroup, which was enough for me since each Kubernetes container is placed in a different mem cgroup. For a generic implementation, I'm not sure how to proceed: it seems I would need to use 'for_each_root(root)' (see example in proc_cgroup_show() from kernel/cgroup/cgroup.c) but I don't know if taking the cgroup mutex is possible in the BPF helper function. It might be ok in the tracepoint raw_syscalls/sys_enter but could the mutex already be taken in some other tracepoints? Signed-off-by: Alban Crequy <alban@kinvolk.io> --- include/uapi/linux/bpf.h | 11 ++++++++++- kernel/trace/bpf_trace.c | 25 +++++++++++++++++++++++++ 2 files changed, 35 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c5ec89732a8d..38ac3959cdf3 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -755,6 +755,14 @@ union bpf_attr { * @addr: pointer to struct sockaddr to bind socket to * @addr_len: length of sockaddr structure * Return: 0 on success or negative error code + * + * u64 bpf_get_current_cgroup_ino(hierarchy, flags) + * Get the cgroup{1,2} inode of current task under the specified hierarchy. + * @hierarchy: cgroup hierarchy + * @flags: reserved for future use + * Return: + * == 0 error + * > 0 inode of the cgroup */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -821,7 +829,8 @@ union bpf_attr { FN(msg_apply_bytes), \ FN(msg_cork_bytes), \ FN(msg_pull_data), \ - FN(bind), + FN(bind), \ + FN(get_current_cgroup_ino), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 56ba0f2a01db..9bf92a786639 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -524,6 +524,29 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = { .arg3_type = ARG_ANYTHING, }; +BPF_CALL_2(bpf_get_current_cgroup_ino, u32, hierarchy, u64, flags) +{ + // TODO: pick the correct hierarchy instead of the mem controller + struct cgroup *cgrp = task_cgroup(current, memory_cgrp_id); + + if (unlikely(!cgrp)) + return -EINVAL; + if (unlikely(hierarchy)) + return -EINVAL; + if (unlikely(flags)) + return -EINVAL; + + return cgrp->kn->id.ino; +} + +static const struct bpf_func_proto bpf_get_current_cgroup_ino_proto = { + .func = bpf_get_current_cgroup_ino, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_DONTCARE, + .arg2_type = ARG_DONTCARE, +}; + static const struct bpf_func_proto * tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { @@ -564,6 +587,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_prandom_u32_proto; case BPF_FUNC_probe_read_str: return &bpf_probe_read_str_proto; + case BPF_FUNC_get_current_cgroup_ino: + return &bpf_get_current_cgroup_ino_proto; default: return NULL; } -- 2.14.3 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino 2018-05-13 17:33 [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino Alban Crequy @ 2018-05-14 19:38 ` Y Song 2018-05-21 13:52 ` Alban Crequy 2018-05-21 16:26 ` Alexei Starovoitov 1 sibling, 1 reply; 11+ messages in thread From: Y Song @ 2018-05-14 19:38 UTC (permalink / raw) To: Alban Crequy; +Cc: netdev, linux-kernel, containers, cgroups, Alban Crequy On Sun, May 13, 2018 at 10:33 AM, Alban Crequy <alban.crequy@gmail.com> wrote: > From: Alban Crequy <alban@kinvolk.io> > > bpf_get_current_cgroup_ino() allows BPF trace programs to get the inode > of the cgroup where the current process resides. > > My use case is to get statistics about syscalls done by a specific > Kubernetes container. I have a tracepoint on raw_syscalls/sys_enter and > a BPF map containing the cgroup inode that I want to trace. I use > bpf_get_current_cgroup_ino() and I quickly return from the tracepoint if > the inode is not in the BPF hash map. Alternatively, the kernel already has bpf_current_task_under_cgroup helper which uses a cgroup array to store cgroup fd's. If the current task is in the hierarchy of a particular cgroup, the helper will return true. One difference between your helper and bpf_current_task_under_cgroup() is that your helper tests against a particular cgroup, not including its children, but bpf_current_task_under_cgroup() will return true even the task is in a nested cgroup. Maybe this will work for you? > > Without this BPF helper, I would need to keep track of all pids in the > container. The Netlink proc connector can be used to follow process > creation and destruction but it is racy. > > This patch only looks at the memory cgroup, which was enough for me > since each Kubernetes container is placed in a different mem cgroup. > For a generic implementation, I'm not sure how to proceed: it seems I > would need to use 'for_each_root(root)' (see example in > proc_cgroup_show() from kernel/cgroup/cgroup.c) but I don't know if > taking the cgroup mutex is possible in the BPF helper function. It might > be ok in the tracepoint raw_syscalls/sys_enter but could the mutex > already be taken in some other tracepoints? mutex is not allowed in a helper since it can block. > > Signed-off-by: Alban Crequy <alban@kinvolk.io> > --- > include/uapi/linux/bpf.h | 11 ++++++++++- > kernel/trace/bpf_trace.c | 25 +++++++++++++++++++++++++ > 2 files changed, 35 insertions(+), 1 deletion(-) > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index c5ec89732a8d..38ac3959cdf3 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -755,6 +755,14 @@ union bpf_attr { > * @addr: pointer to struct sockaddr to bind socket to > * @addr_len: length of sockaddr structure > * Return: 0 on success or negative error code > + * > + * u64 bpf_get_current_cgroup_ino(hierarchy, flags) > + * Get the cgroup{1,2} inode of current task under the specified hierarchy. > + * @hierarchy: cgroup hierarchy Not sure what is the value to specify hierarchy here. A cgroup directory fd? > + * @flags: reserved for future use > + * Return: > + * == 0 error looks like < 0 means error. > + * > 0 inode of the cgroup >= 0 means good? > */ > #define __BPF_FUNC_MAPPER(FN) \ > FN(unspec), \ > @@ -821,7 +829,8 @@ union bpf_attr { > FN(msg_apply_bytes), \ > FN(msg_cork_bytes), \ > FN(msg_pull_data), \ > - FN(bind), > + FN(bind), \ > + FN(get_current_cgroup_ino), > > /* integer value in 'imm' field of BPF_CALL instruction selects which helper > * function eBPF program intends to call > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index 56ba0f2a01db..9bf92a786639 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -524,6 +524,29 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = { > .arg3_type = ARG_ANYTHING, > }; > > +BPF_CALL_2(bpf_get_current_cgroup_ino, u32, hierarchy, u64, flags) > +{ > + // TODO: pick the correct hierarchy instead of the mem controller > + struct cgroup *cgrp = task_cgroup(current, memory_cgrp_id); > + > + if (unlikely(!cgrp)) > + return -EINVAL; > + if (unlikely(hierarchy)) > + return -EINVAL; > + if (unlikely(flags)) > + return -EINVAL; > + > + return cgrp->kn->id.ino; > +} > + > +static const struct bpf_func_proto bpf_get_current_cgroup_ino_proto = { > + .func = bpf_get_current_cgroup_ino, > + .gpl_only = false, > + .ret_type = RET_INTEGER, > + .arg1_type = ARG_DONTCARE, > + .arg2_type = ARG_DONTCARE, > +}; > + > static const struct bpf_func_proto * > tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) > { > @@ -564,6 +587,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) > return &bpf_get_prandom_u32_proto; > case BPF_FUNC_probe_read_str: > return &bpf_probe_read_str_proto; > + case BPF_FUNC_get_current_cgroup_ino: > + return &bpf_get_current_cgroup_ino_proto; > default: > return NULL; > } > -- > 2.14.3 > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino 2018-05-14 19:38 ` Y Song @ 2018-05-21 13:52 ` Alban Crequy 0 siblings, 0 replies; 11+ messages in thread From: Alban Crequy @ 2018-05-21 13:52 UTC (permalink / raw) To: Y Song; +Cc: Alban Crequy, netdev, LKML, Linux Containers, cgroups On Mon, May 14, 2018 at 9:38 PM, Y Song <ys114321@gmail.com> wrote: > > On Sun, May 13, 2018 at 10:33 AM, Alban Crequy <alban.crequy@gmail.com> wrote: > > From: Alban Crequy <alban@kinvolk.io> > > > > bpf_get_current_cgroup_ino() allows BPF trace programs to get the inode > > of the cgroup where the current process resides. > > > > My use case is to get statistics about syscalls done by a specific > > Kubernetes container. I have a tracepoint on raw_syscalls/sys_enter and > > a BPF map containing the cgroup inode that I want to trace. I use > > bpf_get_current_cgroup_ino() and I quickly return from the tracepoint if > > the inode is not in the BPF hash map. > > Alternatively, the kernel already has bpf_current_task_under_cgroup helper > which uses a cgroup array to store cgroup fd's. If the current task is > in the hierarchy of a particular cgroup, the helper will return true. > > One difference between your helper and bpf_current_task_under_cgroup() is > that your helper tests against a particular cgroup, not including its > children, but > bpf_current_task_under_cgroup() will return true even the task is in a > nested cgroup. > > Maybe this will work for you? I like the behaviour that it checks for children cgroups. But with the cgroup array, I can test only one cgroup at a time. I would like to be able to enable my tracer for a few Kubernetes containers or all by adding the inodes of a few cgroups in a hash map. So I could keep separate stats for each. With bpf_current_task_under_cgroup(), I would need to iterate over the list of cgroups, which is difficult with BPF. Also, Kubernetes is cgroup-v1 only and bpf_current_task_under_cgroup() is cgroup-v2 only. In Kubernetes, the processes remain in the root of the v2 hierarchy. I'd like to be able to select the cgroup hierarchy in my helper so it'd work for both v1 and v2. > > Without this BPF helper, I would need to keep track of all pids in the > > container. The Netlink proc connector can be used to follow process > > creation and destruction but it is racy. > > > > This patch only looks at the memory cgroup, which was enough for me > > since each Kubernetes container is placed in a different mem cgroup. > > For a generic implementation, I'm not sure how to proceed: it seems I > > would need to use 'for_each_root(root)' (see example in > > proc_cgroup_show() from kernel/cgroup/cgroup.c) but I don't know if > > taking the cgroup mutex is possible in the BPF helper function. It might > > be ok in the tracepoint raw_syscalls/sys_enter but could the mutex > > already be taken in some other tracepoints? > > mutex is not allowed in a helper since it can block. Ok. I don't know how to implement my helper properly then. Maybe I could just use the 1st cgroup-v1 hierarchy (the name=systemd one) so I don't have to iterate over the hierarchies. But would that be acceptable? Cheers, Alban > > Signed-off-by: Alban Crequy <alban@kinvolk.io> > > --- > > include/uapi/linux/bpf.h | 11 ++++++++++- > > kernel/trace/bpf_trace.c | 25 +++++++++++++++++++++++++ > > 2 files changed, 35 insertions(+), 1 deletion(-) > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index c5ec89732a8d..38ac3959cdf3 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -755,6 +755,14 @@ union bpf_attr { > > * @addr: pointer to struct sockaddr to bind socket to > > * @addr_len: length of sockaddr structure > > * Return: 0 on success or negative error code > > + * > > + * u64 bpf_get_current_cgroup_ino(hierarchy, flags) > > + * Get the cgroup{1,2} inode of current task under the specified hierarchy. > > + * @hierarchy: cgroup hierarchy > > Not sure what is the value to specify hierarchy here. > A cgroup directory fd? > > > + * @flags: reserved for future use > > + * Return: > > + * == 0 error > > looks like < 0 means error. > > > + * > 0 inode of the cgroup > >= 0 means good? > > */ > > #define __BPF_FUNC_MAPPER(FN) \ > > FN(unspec), \ > > @@ -821,7 +829,8 @@ union bpf_attr { > > FN(msg_apply_bytes), \ > > FN(msg_cork_bytes), \ > > FN(msg_pull_data), \ > > - FN(bind), > > + FN(bind), \ > > + FN(get_current_cgroup_ino), > > > > /* integer value in 'imm' field of BPF_CALL instruction selects which helper > > * function eBPF program intends to call > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > > index 56ba0f2a01db..9bf92a786639 100644 > > --- a/kernel/trace/bpf_trace.c > > +++ b/kernel/trace/bpf_trace.c > > @@ -524,6 +524,29 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = { > > .arg3_type = ARG_ANYTHING, > > }; > > > > +BPF_CALL_2(bpf_get_current_cgroup_ino, u32, hierarchy, u64, flags) > > +{ > > + // TODO: pick the correct hierarchy instead of the mem controller > > + struct cgroup *cgrp = task_cgroup(current, memory_cgrp_id); > > + > > + if (unlikely(!cgrp)) > > + return -EINVAL; > > + if (unlikely(hierarchy)) > > + return -EINVAL; > > + if (unlikely(flags)) > > + return -EINVAL; > > + > > + return cgrp->kn->id.ino; > > +} > > + > > +static const struct bpf_func_proto bpf_get_current_cgroup_ino_proto = { > > + .func = bpf_get_current_cgroup_ino, > > + .gpl_only = false, > > + .ret_type = RET_INTEGER, > > + .arg1_type = ARG_DONTCARE, > > + .arg2_type = ARG_DONTCARE, > > +}; > > + > > static const struct bpf_func_proto * > > tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) > > { > > @@ -564,6 +587,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) > > return &bpf_get_prandom_u32_proto; > > case BPF_FUNC_probe_read_str: > > return &bpf_probe_read_str_proto; > > + case BPF_FUNC_get_current_cgroup_ino: > > + return &bpf_get_current_cgroup_ino_proto; > > default: > > return NULL; > > } > > -- > > 2.14.3 > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino 2018-05-13 17:33 [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino Alban Crequy 2018-05-14 19:38 ` Y Song @ 2018-05-21 16:26 ` Alexei Starovoitov 2018-05-22 0:24 ` Y Song 1 sibling, 1 reply; 11+ messages in thread From: Alexei Starovoitov @ 2018-05-21 16:26 UTC (permalink / raw) To: Alban Crequy; +Cc: netdev, linux-kernel, containers, cgroups, Alban Crequy, tj On Sun, May 13, 2018 at 07:33:18PM +0200, Alban Crequy wrote: > > +BPF_CALL_2(bpf_get_current_cgroup_ino, u32, hierarchy, u64, flags) > +{ > + // TODO: pick the correct hierarchy instead of the mem controller > + struct cgroup *cgrp = task_cgroup(current, memory_cgrp_id); > + > + if (unlikely(!cgrp)) > + return -EINVAL; > + if (unlikely(hierarchy)) > + return -EINVAL; > + if (unlikely(flags)) > + return -EINVAL; > + > + return cgrp->kn->id.ino; ino only is not enough to identify cgroup. It needs generation number too. I don't quite see how hierarchy and flags can be used in the future. Also why limit it to memcg? How about something like this instead: BPF_CALL_2(bpf_get_current_cgroup_id) { struct cgroup *cgrp = task_dfl_cgroup(current); return cgrp->kn->id.id; } The user space can use fhandle api to get the same 64-bit id. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino 2018-05-21 16:26 ` Alexei Starovoitov @ 2018-05-22 0:24 ` Y Song 2018-05-23 3:33 ` Y Song 0 siblings, 1 reply; 11+ messages in thread From: Y Song @ 2018-05-22 0:24 UTC (permalink / raw) To: Alexei Starovoitov Cc: Alban Crequy, netdev, linux-kernel, containers, cgroups, Alban Crequy, tj On Mon, May 21, 2018 at 9:26 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Sun, May 13, 2018 at 07:33:18PM +0200, Alban Crequy wrote: >> >> +BPF_CALL_2(bpf_get_current_cgroup_ino, u32, hierarchy, u64, flags) >> +{ >> + // TODO: pick the correct hierarchy instead of the mem controller >> + struct cgroup *cgrp = task_cgroup(current, memory_cgrp_id); >> + >> + if (unlikely(!cgrp)) >> + return -EINVAL; >> + if (unlikely(hierarchy)) >> + return -EINVAL; >> + if (unlikely(flags)) >> + return -EINVAL; >> + >> + return cgrp->kn->id.ino; > > ino only is not enough to identify cgroup. It needs generation number too. > I don't quite see how hierarchy and flags can be used in the future. > Also why limit it to memcg? > > How about something like this instead: > > BPF_CALL_2(bpf_get_current_cgroup_id) > { > struct cgroup *cgrp = task_dfl_cgroup(current); > > return cgrp->kn->id.id; > } > The user space can use fhandle api to get the same 64-bit id. I think this should work. This will also be useful to bcc as user space can encode desired id in the bpf program and compared that id to the current cgroup id, so we can have cgroup level tracing (esp. stat collection) support. To cope with cgroup hierarchy, user can use cgroup-array based approach or explicitly compare against multiple cgroup id's. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino 2018-05-22 0:24 ` Y Song @ 2018-05-23 3:33 ` Y Song 2018-05-23 3:35 ` Alexei Starovoitov 2018-05-25 15:21 ` Alban Crequy 0 siblings, 2 replies; 11+ messages in thread From: Y Song @ 2018-05-23 3:33 UTC (permalink / raw) To: Alexei Starovoitov Cc: Alban Crequy, netdev, linux-kernel, containers, cgroups, Alban Crequy, tj I did a quick prototyping and the above interface seems working fine. The kernel change: =============== [yhs@localhost bpf-next]$ git diff diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 97446bbe2ca5..669b7383fddb 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1976,7 +1976,8 @@ union bpf_attr { FN(fib_lookup), \ FN(sock_hash_update), \ FN(msg_redirect_hash), \ - FN(sk_redirect_hash), + FN(sk_redirect_hash), \ + FN(get_current_cgroup_id), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index ce2cbbff27e4..e11e3298f911 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -493,6 +493,21 @@ static const struct bpf_func_proto bpf_current_task_under_cgroup_proto = { .arg2_type = ARG_ANYTHING, }; +BPF_CALL_0(bpf_get_current_cgroup_id) +{ + struct cgroup *cgrp = task_dfl_cgroup(current); + if (!cgrp) + return -EINVAL; + + return cgrp->kn->id.id; +} + +static const struct bpf_func_proto bpf_get_current_cgroup_id_proto = { + .func = bpf_get_current_cgroup_id, + .gpl_only = false, + .ret_type = RET_INTEGER, +}; + BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size, const void *, unsafe_ptr) { @@ -563,6 +578,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_prandom_u32_proto; case BPF_FUNC_probe_read_str: return &bpf_probe_read_str_proto; + case BPF_FUNC_get_current_cgroup_id: + return &bpf_get_current_cgroup_id_proto; default: return NULL; } The following program can be used to print out a cgroup id given a cgroup path. [yhs@localhost cg]$ cat get_cgroup_id.c #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> int main(int argc, char **argv) { int dirfd, err, flags, mount_id, fhsize; struct file_handle *fhp; char *pathname; if (argc != 2) { printf("usage: %s <cgroup_path>\n", argv[0]); return 1; } pathname = argv[1]; dirfd = AT_FDCWD; flags = 0; fhsize = sizeof(*fhp); fhp = malloc(fhsize); if (!fhp) return 1; err = name_to_handle_at(dirfd, pathname, fhp, &mount_id, flags); if (err >= 0) { printf("error\n"); return 1; } fhsize = sizeof(struct file_handle) + fhp->handle_bytes; fhp = realloc(fhp, fhsize); if (!fhp) return 1; err = name_to_handle_at(dirfd, pathname, fhp, &mount_id, flags); if (err < 0) perror("name_to_handle_at"); else { int i; printf("dir = %s, mount_id = %d\n", pathname, mount_id); printf("handle_bytes = %d, handle_type = %d\n", fhp->handle_bytes, fhp->handle_type); if (fhp->handle_bytes != 8) return 1; printf("cgroup_id = 0x%llx\n", *(unsigned long long *)fhp->f_handle); } return 0; } [yhs@localhost cg]$ Given a cgroup path, the user can get cgroup_id and use it in their bpf program for filtering purpose. I run a simple program t.c int main() { while(1) sleep(1); return 0; } in the cgroup v2 directory /home/yhs/tmp/yhs none on /home/yhs/tmp type cgroup2 (rw,relatime,seclabel) $ ./get_cgroup_id /home/yhs/tmp/yhs dir = /home/yhs/tmp/yhs, mount_id = 124 handle_bytes = 8, handle_type = 1 cgroup_id = 0x1000006b2 // the below command to get cgroup_id from the kernel for the // process compiled with t.c and ran under /home/yhs/tmp/yhs: $ sudo ./trace.py -p 4067 '__x64_sys_nanosleep "cgid = %llx", $cgid' PID TID COMM FUNC - 4067 4067 a.out __x64_sys_nanosleep cgid = 1000006b2 4067 4067 a.out __x64_sys_nanosleep cgid = 1000006b2 4067 4067 a.out __x64_sys_nanosleep cgid = 1000006b2 ^C[yhs@localhost tools]$ The kernel and user space cgid matches. Will provide a formal patch later. On Mon, May 21, 2018 at 5:24 PM, Y Song <ys114321@gmail.com> wrote: > On Mon, May 21, 2018 at 9:26 AM, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: >> On Sun, May 13, 2018 at 07:33:18PM +0200, Alban Crequy wrote: >>> >>> +BPF_CALL_2(bpf_get_current_cgroup_ino, u32, hierarchy, u64, flags) >>> +{ >>> + // TODO: pick the correct hierarchy instead of the mem controller >>> + struct cgroup *cgrp = task_cgroup(current, memory_cgrp_id); >>> + >>> + if (unlikely(!cgrp)) >>> + return -EINVAL; >>> + if (unlikely(hierarchy)) >>> + return -EINVAL; >>> + if (unlikely(flags)) >>> + return -EINVAL; >>> + >>> + return cgrp->kn->id.ino; >> >> ino only is not enough to identify cgroup. It needs generation number too. >> I don't quite see how hierarchy and flags can be used in the future. >> Also why limit it to memcg? >> >> How about something like this instead: >> >> BPF_CALL_2(bpf_get_current_cgroup_id) >> { >> struct cgroup *cgrp = task_dfl_cgroup(current); >> >> return cgrp->kn->id.id; >> } >> The user space can use fhandle api to get the same 64-bit id. > > I think this should work. This will also be useful to bcc as user > space can encode desired id > in the bpf program and compared that id to the current cgroup id, so we can have > cgroup level tracing (esp. stat collection) support. To cope with > cgroup hierarchy, user can use > cgroup-array based approach or explicitly compare against multiple cgroup id's. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino 2018-05-23 3:33 ` Y Song @ 2018-05-23 3:35 ` Alexei Starovoitov 2018-05-23 4:31 ` Y Song 2018-05-25 15:21 ` Alban Crequy 1 sibling, 1 reply; 11+ messages in thread From: Alexei Starovoitov @ 2018-05-23 3:35 UTC (permalink / raw) To: Y Song Cc: Alban Crequy, netdev, linux-kernel, containers, cgroups, Alban Crequy, tj On Tue, May 22, 2018 at 08:33:24PM -0700, Y Song wrote: > + struct cgroup *cgrp = task_dfl_cgroup(current); > + if (!cgrp) > + return -EINVAL; why this check is needed? ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino 2018-05-23 3:35 ` Alexei Starovoitov @ 2018-05-23 4:31 ` Y Song 2018-05-23 8:57 ` Daniel Borkmann 0 siblings, 1 reply; 11+ messages in thread From: Y Song @ 2018-05-23 4:31 UTC (permalink / raw) To: Alexei Starovoitov Cc: Alban Crequy, netdev, linux-kernel, containers, cgroups, Alban Crequy, tj On Tue, May 22, 2018 at 8:35 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Tue, May 22, 2018 at 08:33:24PM -0700, Y Song wrote: >> + struct cgroup *cgrp = task_dfl_cgroup(current); >> + if (!cgrp) >> + return -EINVAL; > > why this check is needed? No reason :-) Originally I am concerned whether it is possible cgrp could be NULL. By looking at the code, it SEEMS to me that it could not be NULL, but I am not 100% sure (as I am not a cgroup expert). Since you are asking, probably means it cannot be NULL, so will remove it in formal upstream patch. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino 2018-05-23 4:31 ` Y Song @ 2018-05-23 8:57 ` Daniel Borkmann 0 siblings, 0 replies; 11+ messages in thread From: Daniel Borkmann @ 2018-05-23 8:57 UTC (permalink / raw) To: Y Song, Alexei Starovoitov Cc: Alban Crequy, netdev, linux-kernel, containers, cgroups, Alban Crequy, tj On 05/23/2018 06:31 AM, Y Song wrote: > On Tue, May 22, 2018 at 8:35 PM, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: >> On Tue, May 22, 2018 at 08:33:24PM -0700, Y Song wrote: >>> + struct cgroup *cgrp = task_dfl_cgroup(current); >>> + if (!cgrp) >>> + return -EINVAL; >> >> why this check is needed? > > No reason :-) Originally I am concerned whether it is possible cgrp > could be NULL. > By looking at the code, it SEEMS to me that it could not be NULL, but I am not > 100% sure (as I am not a cgroup expert). Since you are asking, > probably means it cannot be NULL, so will remove it in formal upstream patch. Aside from this the cgrp->kn->id.id is also u64, so overlapping this with error codes we'll get into a similar issue as with bpf_perf_event_read(). ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino 2018-05-23 3:33 ` Y Song 2018-05-23 3:35 ` Alexei Starovoitov @ 2018-05-25 15:21 ` Alban Crequy 2018-05-25 16:28 ` Y Song 1 sibling, 1 reply; 11+ messages in thread From: Alban Crequy @ 2018-05-25 15:21 UTC (permalink / raw) To: Y Song Cc: Alexei Starovoitov, netdev, LKML, Linux Containers, cgroups, Tejun Heo, Iago López Galeiras On Wed, May 23, 2018 at 4:34 AM Y Song <ys114321@gmail.com> wrote: > I did a quick prototyping and the above interface seems working fine. Thanks! I gave your kernel patch & userspace program a try and it works for me on cgroup-v2. Also, I found out how to get my containers to use both cgroup-v1 and cgroup-v2 (by enabling systemd's hybrid cgroup mode and docker's '--exec-opt native.cgroupdriver=systemd' option). So I should be able to use the BPF helper function without having to add support for all the cgroup-v1 hierarchies. > The kernel change: > =============== > [yhs@localhost bpf-next]$ git diff > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 97446bbe2ca5..669b7383fddb 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -1976,7 +1976,8 @@ union bpf_attr { > FN(fib_lookup), \ > FN(sock_hash_update), \ > FN(msg_redirect_hash), \ > - FN(sk_redirect_hash), > + FN(sk_redirect_hash), \ > + FN(get_current_cgroup_id), > /* integer value in 'imm' field of BPF_CALL instruction selects which helper > * function eBPF program intends to call > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index ce2cbbff27e4..e11e3298f911 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -493,6 +493,21 @@ static const struct bpf_func_proto > bpf_current_task_under_cgroup_proto = { > .arg2_type = ARG_ANYTHING, > }; > +BPF_CALL_0(bpf_get_current_cgroup_id) > +{ > + struct cgroup *cgrp = task_dfl_cgroup(current); > + if (!cgrp) > + return -EINVAL; > + > + return cgrp->kn->id.id; > +} > + > +static const struct bpf_func_proto bpf_get_current_cgroup_id_proto = { > + .func = bpf_get_current_cgroup_id, > + .gpl_only = false, > + .ret_type = RET_INTEGER, > +}; > + > BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size, > const void *, unsafe_ptr) > { > @@ -563,6 +578,8 @@ tracing_func_proto(enum bpf_func_id func_id, const > struct bpf_prog *prog) > return &bpf_get_prandom_u32_proto; > case BPF_FUNC_probe_read_str: > return &bpf_probe_read_str_proto; > + case BPF_FUNC_get_current_cgroup_id: > + return &bpf_get_current_cgroup_id_proto; > default: > return NULL; > } > The following program can be used to print out a cgroup id given a cgroup path. > [yhs@localhost cg]$ cat get_cgroup_id.c > #define _GNU_SOURCE > #include <stdio.h> > #include <stdlib.h> > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > int main(int argc, char **argv) > { > int dirfd, err, flags, mount_id, fhsize; > struct file_handle *fhp; > char *pathname; > if (argc != 2) { > printf("usage: %s <cgroup_path>\n", argv[0]); > return 1; > } > pathname = argv[1]; > dirfd = AT_FDCWD; > flags = 0; > fhsize = sizeof(*fhp); > fhp = malloc(fhsize); > if (!fhp) > return 1; > err = name_to_handle_at(dirfd, pathname, fhp, &mount_id, flags); > if (err >= 0) { > printf("error\n"); > return 1; > } > fhsize = sizeof(struct file_handle) + fhp->handle_bytes; > fhp = realloc(fhp, fhsize); > if (!fhp) > return 1; > err = name_to_handle_at(dirfd, pathname, fhp, &mount_id, flags); > if (err < 0) > perror("name_to_handle_at"); > else { > int i; > printf("dir = %s, mount_id = %d\n", pathname, mount_id); > printf("handle_bytes = %d, handle_type = %d\n", fhp->handle_bytes, > fhp->handle_type); > if (fhp->handle_bytes != 8) > return 1; > printf("cgroup_id = 0x%llx\n", *(unsigned long long *)fhp->f_handle); > } > return 0; > } > [yhs@localhost cg]$ > Given a cgroup path, the user can get cgroup_id and use it in their bpf > program for filtering purpose. > I run a simple program t.c > int main() { while(1) sleep(1); return 0; } > in the cgroup v2 directory /home/yhs/tmp/yhs > none on /home/yhs/tmp type cgroup2 (rw,relatime,seclabel) > $ ./get_cgroup_id /home/yhs/tmp/yhs > dir = /home/yhs/tmp/yhs, mount_id = 124 > handle_bytes = 8, handle_type = 1 > cgroup_id = 0x1000006b2 > // the below command to get cgroup_id from the kernel for the > // process compiled with t.c and ran under /home/yhs/tmp/yhs: > $ sudo ./trace.py -p 4067 '__x64_sys_nanosleep "cgid = %llx", $cgid' > PID TID COMM FUNC - > 4067 4067 a.out __x64_sys_nanosleep cgid = 1000006b2 > 4067 4067 a.out __x64_sys_nanosleep cgid = 1000006b2 > 4067 4067 a.out __x64_sys_nanosleep cgid = 1000006b2 > ^C[yhs@localhost tools]$ > The kernel and user space cgid matches. Will provide a > formal patch later. > On Mon, May 21, 2018 at 5:24 PM, Y Song <ys114321@gmail.com> wrote: > > On Mon, May 21, 2018 at 9:26 AM, Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > >> On Sun, May 13, 2018 at 07:33:18PM +0200, Alban Crequy wrote: > >>> > >>> +BPF_CALL_2(bpf_get_current_cgroup_ino, u32, hierarchy, u64, flags) > >>> +{ > >>> + // TODO: pick the correct hierarchy instead of the mem controller > >>> + struct cgroup *cgrp = task_cgroup(current, memory_cgrp_id); > >>> + > >>> + if (unlikely(!cgrp)) > >>> + return -EINVAL; > >>> + if (unlikely(hierarchy)) > >>> + return -EINVAL; > >>> + if (unlikely(flags)) > >>> + return -EINVAL; > >>> + > >>> + return cgrp->kn->id.ino; > >> > >> ino only is not enough to identify cgroup. It needs generation number too. > >> I don't quite see how hierarchy and flags can be used in the future. > >> Also why limit it to memcg? > >> > >> How about something like this instead: > >> > >> BPF_CALL_2(bpf_get_current_cgroup_id) > >> { > >> struct cgroup *cgrp = task_dfl_cgroup(current); > >> > >> return cgrp->kn->id.id; > >> } > >> The user space can use fhandle api to get the same 64-bit id. > > > > I think this should work. This will also be useful to bcc as user > > space can encode desired id > > in the bpf program and compared that id to the current cgroup id, so we can have > > cgroup level tracing (esp. stat collection) support. To cope with > > cgroup hierarchy, user can use > > cgroup-array based approach or explicitly compare against multiple cgroup id's. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino 2018-05-25 15:21 ` Alban Crequy @ 2018-05-25 16:28 ` Y Song 0 siblings, 0 replies; 11+ messages in thread From: Y Song @ 2018-05-25 16:28 UTC (permalink / raw) To: Alban Crequy Cc: Alexei Starovoitov, netdev, LKML, Linux Containers, cgroups, Tejun Heo, Iago López Galeiras On Fri, May 25, 2018 at 8:21 AM, Alban Crequy <alban@kinvolk.io> wrote: > On Wed, May 23, 2018 at 4:34 AM Y Song <ys114321@gmail.com> wrote: > >> I did a quick prototyping and the above interface seems working fine. > > Thanks! I gave your kernel patch & userspace program a try and it works for > me on cgroup-v2. > > Also, I found out how to get my containers to use both cgroup-v1 and > cgroup-v2 (by enabling systemd's hybrid cgroup mode and docker's > '--exec-opt native.cgroupdriver=systemd' option). So I should be able to > use the BPF helper function without having to add support for all the > cgroup-v1 hierarchies. Great. Will submit a formal patch soon. > >> The kernel change: >> =============== > >> [yhs@localhost bpf-next]$ git diff >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 97446bbe2ca5..669b7383fddb 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -1976,7 +1976,8 @@ union bpf_attr { >> FN(fib_lookup), \ >> FN(sock_hash_update), \ >> FN(msg_redirect_hash), \ >> - FN(sk_redirect_hash), >> + FN(sk_redirect_hash), \ >> + FN(get_current_cgroup_id), > >> /* integer value in 'imm' field of BPF_CALL instruction selects which > helper >> * function eBPF program intends to call >> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c >> index ce2cbbff27e4..e11e3298f911 100644 >> --- a/kernel/trace/bpf_trace.c >> +++ b/kernel/trace/bpf_trace.c >> @@ -493,6 +493,21 @@ static const struct bpf_func_proto >> bpf_current_task_under_cgroup_proto = { >> .arg2_type = ARG_ANYTHING, >> }; > >> +BPF_CALL_0(bpf_get_current_cgroup_id) >> +{ >> + struct cgroup *cgrp = task_dfl_cgroup(current); >> + if (!cgrp) >> + return -EINVAL; >> + >> + return cgrp->kn->id.id; >> +} >> + >> +static const struct bpf_func_proto bpf_get_current_cgroup_id_proto = { >> + .func = bpf_get_current_cgroup_id, >> + .gpl_only = false, >> + .ret_type = RET_INTEGER, >> +}; >> + >> BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size, >> const void *, unsafe_ptr) >> { >> @@ -563,6 +578,8 @@ tracing_func_proto(enum bpf_func_id func_id, const >> struct bpf_prog *prog) >> return &bpf_get_prandom_u32_proto; >> case BPF_FUNC_probe_read_str: >> return &bpf_probe_read_str_proto; >> + case BPF_FUNC_get_current_cgroup_id: >> + return &bpf_get_current_cgroup_id_proto; >> default: >> return NULL; >> } > >> The following program can be used to print out a cgroup id given a cgroup > path. >> [yhs@localhost cg]$ cat get_cgroup_id.c >> #define _GNU_SOURCE >> #include <stdio.h> >> #include <stdlib.h> >> #include <sys/types.h> >> #include <sys/stat.h> >> #include <fcntl.h> > >> int main(int argc, char **argv) >> { >> int dirfd, err, flags, mount_id, fhsize; >> struct file_handle *fhp; >> char *pathname; > >> if (argc != 2) { >> printf("usage: %s <cgroup_path>\n", argv[0]); >> return 1; >> } > >> pathname = argv[1]; >> dirfd = AT_FDCWD; >> flags = 0; > >> fhsize = sizeof(*fhp); >> fhp = malloc(fhsize); >> if (!fhp) >> return 1; > >> err = name_to_handle_at(dirfd, pathname, fhp, &mount_id, flags); >> if (err >= 0) { >> printf("error\n"); >> return 1; >> } > >> fhsize = sizeof(struct file_handle) + fhp->handle_bytes; >> fhp = realloc(fhp, fhsize); >> if (!fhp) >> return 1; > >> err = name_to_handle_at(dirfd, pathname, fhp, &mount_id, flags); >> if (err < 0) >> perror("name_to_handle_at"); >> else { >> int i; > >> printf("dir = %s, mount_id = %d\n", pathname, mount_id); >> printf("handle_bytes = %d, handle_type = %d\n", fhp->handle_bytes, >> fhp->handle_type); >> if (fhp->handle_bytes != 8) >> return 1; > >> printf("cgroup_id = 0x%llx\n", *(unsigned long long > *)fhp->f_handle); >> } > >> return 0; >> } >> [yhs@localhost cg]$ > >> Given a cgroup path, the user can get cgroup_id and use it in their bpf >> program for filtering purpose. > >> I run a simple program t.c >> int main() { while(1) sleep(1); return 0; } >> in the cgroup v2 directory /home/yhs/tmp/yhs >> none on /home/yhs/tmp type cgroup2 (rw,relatime,seclabel) > >> $ ./get_cgroup_id /home/yhs/tmp/yhs >> dir = /home/yhs/tmp/yhs, mount_id = 124 >> handle_bytes = 8, handle_type = 1 >> cgroup_id = 0x1000006b2 > >> // the below command to get cgroup_id from the kernel for the >> // process compiled with t.c and ran under /home/yhs/tmp/yhs: >> $ sudo ./trace.py -p 4067 '__x64_sys_nanosleep "cgid = %llx", $cgid' >> PID TID COMM FUNC - >> 4067 4067 a.out __x64_sys_nanosleep cgid = 1000006b2 >> 4067 4067 a.out __x64_sys_nanosleep cgid = 1000006b2 >> 4067 4067 a.out __x64_sys_nanosleep cgid = 1000006b2 >> ^C[yhs@localhost tools]$ > >> The kernel and user space cgid matches. Will provide a >> formal patch later. > > > > >> On Mon, May 21, 2018 at 5:24 PM, Y Song <ys114321@gmail.com> wrote: >> > On Mon, May 21, 2018 at 9:26 AM, Alexei Starovoitov >> > <alexei.starovoitov@gmail.com> wrote: >> >> On Sun, May 13, 2018 at 07:33:18PM +0200, Alban Crequy wrote: >> >>> >> >>> +BPF_CALL_2(bpf_get_current_cgroup_ino, u32, hierarchy, u64, flags) >> >>> +{ >> >>> + // TODO: pick the correct hierarchy instead of the mem > controller >> >>> + struct cgroup *cgrp = task_cgroup(current, memory_cgrp_id); >> >>> + >> >>> + if (unlikely(!cgrp)) >> >>> + return -EINVAL; >> >>> + if (unlikely(hierarchy)) >> >>> + return -EINVAL; >> >>> + if (unlikely(flags)) >> >>> + return -EINVAL; >> >>> + >> >>> + return cgrp->kn->id.ino; >> >> >> >> ino only is not enough to identify cgroup. It needs generation number > too. >> >> I don't quite see how hierarchy and flags can be used in the future. >> >> Also why limit it to memcg? >> >> >> >> How about something like this instead: >> >> >> >> BPF_CALL_2(bpf_get_current_cgroup_id) >> >> { >> >> struct cgroup *cgrp = task_dfl_cgroup(current); >> >> >> >> return cgrp->kn->id.id; >> >> } >> >> The user space can use fhandle api to get the same 64-bit id. >> > >> > I think this should work. This will also be useful to bcc as user >> > space can encode desired id >> > in the bpf program and compared that id to the current cgroup id, so we > can have >> > cgroup level tracing (esp. stat collection) support. To cope with >> > cgroup hierarchy, user can use >> > cgroup-array based approach or explicitly compare against multiple > cgroup id's. ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2018-05-25 16:29 UTC | newest] Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-05-13 17:33 [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino Alban Crequy 2018-05-14 19:38 ` Y Song 2018-05-21 13:52 ` Alban Crequy 2018-05-21 16:26 ` Alexei Starovoitov 2018-05-22 0:24 ` Y Song 2018-05-23 3:33 ` Y Song 2018-05-23 3:35 ` Alexei Starovoitov 2018-05-23 4:31 ` Y Song 2018-05-23 8:57 ` Daniel Borkmann 2018-05-25 15:21 ` Alban Crequy 2018-05-25 16:28 ` Y Song
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).