Netdev Archive on lore.kernel.org
* [PATCH bpf 0/2] bpf: two fixes for bpf task/task_file iterators

From: Yonghong Song @ 2020-08-18 16:24 UTC (permalink / raw)
To: bpf, netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Patch #1 fixes an rcu stall warning by adding a proper cond_resched()
when traversing tasks or files. Patch #2 calculates the tid properly in
a namespace in order to avoid visiting the same task multiple times.

Yonghong Song (2):
  bpf: fix a rcu_sched stall issue with bpf task/task_file iterator
  bpf: avoid visit same object multiple times

 kernel/bpf/task_iter.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

-- 
2.24.1
* [PATCH bpf 1/2] bpf: fix a rcu_sched stall issue with bpf task/task_file iterator

From: Yonghong Song @ 2020-08-18 16:24 UTC (permalink / raw)
To: bpf, netdev
Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Paul E. McKenney

In our production system, we observed rcu stalls when
`bpftool prog` is running.

  rcu: INFO: rcu_sched self-detected stall on CPU
  rcu:     7-....: (20999 ticks this GP) idle=302/1/0x4000000000000000 softirq=1508852/1508852 fqs=4913
           (t=21031 jiffies g=2534773 q=179750)
  NMI backtrace for cpu 7
  CPU: 7 PID: 184195 Comm: bpftool Kdump: loaded Tainted: G W 5.8.0-00004-g68bfc7f8c1b4 #6
  Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
  Call Trace:
   <IRQ>
   dump_stack+0x57/0x70
   nmi_cpu_backtrace.cold+0x14/0x53
   ? lapic_can_unplug_cpu.cold+0x39/0x39
   nmi_trigger_cpumask_backtrace+0xb7/0xc7
   rcu_dump_cpu_stacks+0xa2/0xd0
   rcu_sched_clock_irq.cold+0x1ff/0x3d9
   ? tick_nohz_handler+0x100/0x100
   update_process_times+0x5b/0x90
   tick_sched_timer+0x5e/0xf0
   __hrtimer_run_queues+0x12a/0x2a0
   hrtimer_interrupt+0x10e/0x280
   __sysvec_apic_timer_interrupt+0x51/0xe0
   asm_call_on_stack+0xf/0x20
   </IRQ>
   sysvec_apic_timer_interrupt+0x6f/0x80
   asm_sysvec_apic_timer_interrupt+0x12/0x20
  RIP: 0010:task_file_seq_get_next+0x71/0x220
  Code: 00 00 8b 53 1c 49 8b 7d 00 89 d6 48 8b 47 20 44 8b 18 41 39 d3 76 75 48 8b 4f 20 8b 01 39 d0 76 61 41 89 d1 49 39 c1 48 19 c0 <48> 8b 49 08 21 d0 48 8d 04 c1 4c 8b 08 4d 85 c9 74 46 49 8b 41 38
  RSP: 0018:ffffc90006223e10 EFLAGS: 00000297
  RAX: ffffffffffffffff RBX: ffff888f0d172388 RCX: ffff888c8c07c1c0
  RDX: 00000000000f017b RSI: 00000000000f017b RDI: ffff888c254702c0
  RBP: ffffc90006223e68 R08: ffff888be2a1c140 R09: 00000000000f017b
  R10: 0000000000000002 R11: 0000000000100000 R12: ffff888f23c24118
  R13: ffffc90006223e60 R14: ffffffff828509a0 R15: 00000000ffffffff
   task_file_seq_next+0x52/0xa0
   bpf_seq_read+0xb9/0x320
   vfs_read+0x9d/0x180
   ksys_read+0x5f/0xe0
   do_syscall_64+0x38/0x60
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f8815f4f76e
  Code: c0 e9 f6 fe ff ff 55 48 8d 3d 76 70 0a 00 48 89 e5 e8 36 06 02 00 66 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 52 c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5
  RSP: 002b:00007fff8f9df578 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  RAX: ffffffffffffffda RBX: 000000000170b9c0 RCX: 00007f8815f4f76e
  RDX: 0000000000001000 RSI: 00007fff8f9df5b0 RDI: 0000000000000007
  RBP: 00007fff8f9e05f0 R08: 0000000000000049 R09: 0000000000000010
  R10: 00007f881601fa40 R11: 0000000000000246 R12: 00007fff8f9e05a8
  R13: 00007fff8f9e05a8 R14: 0000000001917f90 R15: 000000000000e22e

Note that `bpftool prog` actually calls a task_file bpf iterator
program to establish an association between prog/map/link/btf anon
files and processes.

In the case where the above rcu stall occurred, we had a process
having 1587 tasks and each task having roughly 81305 files.
This implied 129 million bpf prog invocations. Unfortunately, none of
these files are prog/map/link/btf files, so the bpf iterator prog needs
to traverse all these files and is not able to return to user space
since there is no seq_file buffer overflow.

The fix is to add cond_resched() during traversing tasks and files.
Voluntarily releasing the cpu gives other tasks, e.g., the rcu resched
kthread, a chance to run.

Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/task_iter.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index f21b5e1e4540..885b14cab2c0 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -27,6 +27,8 @@ static struct task_struct *task_seq_get_next(struct pid_namespace *ns,
 	struct task_struct *task = NULL;
 	struct pid *pid;
 
+	cond_resched();
+
 	rcu_read_lock();
 retry:
 	pid = idr_get_next(&ns->idr, tid);
@@ -137,6 +139,8 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
 	struct task_struct *curr_task;
 	int curr_fd = info->fd;
 
+	cond_resched();
+
 	/* If this function returns a non-NULL file object,
 	 * it held a reference to the task/files_struct/file.
 	 * Otherwise, it does not hold any reference.
-- 
2.24.1
* Re: [PATCH bpf 1/2] bpf: fix a rcu_sched stall issue with bpf task/task_file iterator

From: Andrii Nakryiko @ 2020-08-18 16:48 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team, Paul E. McKenney

On Tue, Aug 18, 2020 at 9:26 AM Yonghong Song <yhs@fb.com> wrote:
>
> In our production system, we observed rcu stalls when
> `bpftool prog` is running.

[...]

>
> Note that `bpftool prog` actually calls a task_file bpf iterator
> program to establish an association between prog/map/link/btf anon
> files and processes.
>
> In the case where the above rcu stall occurred, we had a process
> having 1587 tasks and each task having roughly 81305 files.
> This implied 129 million bpf prog invocations. Unfortunately, none of
> these files are prog/map/link/btf files, so the bpf iterator prog needs
> to traverse all these files and is not able to return to user space
> since there is no seq_file buffer overflow.
>
> The fix is to add cond_resched() during traversing tasks and files.
> Voluntarily releasing the cpu gives other tasks, e.g., the rcu resched
> kthread, a chance to run.

What are the performance implications of doing this for every task
and/or file? Have you benchmarked `bpftool prog` before/after? What
was the difference?

I wonder if it's possible to amortize those cond_resched() and call
them only ever so often, based on CPU time or number of files/tasks
processed, if cond_resched() does turn out to slow bpf_iter down.

> Cc: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  kernel/bpf/task_iter.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index f21b5e1e4540..885b14cab2c0 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -27,6 +27,8 @@ static struct task_struct *task_seq_get_next(struct pid_namespace *ns,
>  	struct task_struct *task = NULL;
>  	struct pid *pid;
>
> +	cond_resched();
> +
>  	rcu_read_lock();
>  retry:
>  	pid = idr_get_next(&ns->idr, tid);
> @@ -137,6 +139,8 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
>  	struct task_struct *curr_task;
>  	int curr_fd = info->fd;
>
> +	cond_resched();
> +
>  	/* If this function returns a non-NULL file object,
>  	 * it held a reference to the task/files_struct/file.
>  	 * Otherwise, it does not hold any reference.
> --
> 2.24.1
* Re: [PATCH bpf 1/2] bpf: fix a rcu_sched stall issue with bpf task/task_file iterator

From: Yonghong Song @ 2020-08-18 17:07 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann, Kernel Team, Paul E. McKenney, Rik van Riel

On 8/18/20 9:48 AM, Andrii Nakryiko wrote:
> On Tue, Aug 18, 2020 at 9:26 AM Yonghong Song <yhs@fb.com> wrote:
>>
>> In our production system, we observed rcu stalls when
>> `bpftool prog` is running.
>
> [...]
>
>>
>> Note that `bpftool prog` actually calls a task_file bpf iterator
>> program to establish an association between prog/map/link/btf anon
>> files and processes.
>>
>> In the case where the above rcu stall occurred, we had a process
>> having 1587 tasks and each task having roughly 81305 files.
>> This implied 129 million bpf prog invocations. Unfortunately, none of
>> these files are prog/map/link/btf files, so the bpf iterator prog needs
>> to traverse all these files and is not able to return to user space
>> since there is no seq_file buffer overflow.
>>
>> The fix is to add cond_resched() during traversing tasks and files.
>> Voluntarily releasing the cpu gives other tasks, e.g., the rcu resched
>> kthread, a chance to run.
>
> What are the performance implications of doing this for every task
> and/or file? Have you benchmarked `bpftool prog` before/after? What
> was the difference?

The cond_resched() internally has a condition should_resched() to check
whether rescheduling should be done or not. Most kernel invocations (if
not all) just call cond_resched() without additional custom logic to
guess when to call cond_resched(). I suppose should_resched() should be
cheap enough already. Maybe Rik can comment here.

Regarding the measurement, I did measure with `strace -T ./bpftool prog`
how long each read() syscall takes to complete with and without my
patch, e.g.,

  read(7, "#\0\0\0\322\23\0\0tcpeventd\0\0\0\0\0\0\0)\0\0\0\322\23\0\0"..., 4096) = 4080 <27.094797>

or

  read(7, "#\0\0\0\322\23\0\0tcpeventd\0\0\0\0\0\0\0)\0\0\0\322\23\0\0"..., 4096) = 4080 <34.281563>

The time varies a lot across different runs. But based on my
observations, with and without cond_resched(), the range of read()
elapsed time is roughly the same.

> I wonder if it's possible to amortize those cond_resched() and call
> them only ever so often, based on CPU time or number of files/tasks
> processed, if cond_resched() does turn out to slow bpf_iter down.
>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>  kernel/bpf/task_iter.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
>> index f21b5e1e4540..885b14cab2c0 100644
>> --- a/kernel/bpf/task_iter.c
>> +++ b/kernel/bpf/task_iter.c
>> @@ -27,6 +27,8 @@ static struct task_struct *task_seq_get_next(struct pid_namespace *ns,
>>  	struct task_struct *task = NULL;
>>  	struct pid *pid;
>>
>> +	cond_resched();
>> +
>>  	rcu_read_lock();
>>  retry:
>>  	pid = idr_get_next(&ns->idr, tid);
>> @@ -137,6 +139,8 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
>>  	struct task_struct *curr_task;
>>  	int curr_fd = info->fd;
>>
>> +	cond_resched();
>> +
>>  	/* If this function returns a non-NULL file object,
>>  	 * it held a reference to the task/files_struct/file.
>>  	 * Otherwise, it does not hold any reference.
>> --
>> 2.24.1
* Re: [PATCH bpf 1/2] bpf: fix a rcu_sched stall issue with bpf task/task_file iterator

From: Alexei Starovoitov @ 2020-08-18 17:06 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Network Development, Alexei Starovoitov, Daniel Borkmann, Kernel Team, Paul E. McKenney

On Tue, Aug 18, 2020 at 9:24 AM Yonghong Song <yhs@fb.com> wrote:
> index f21b5e1e4540..885b14cab2c0 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -27,6 +27,8 @@ static struct task_struct *task_seq_get_next(struct pid_namespace *ns,
>  	struct task_struct *task = NULL;
>  	struct pid *pid;
>
> +	cond_resched();
> +
>  	rcu_read_lock();
>  retry:
>  	pid = idr_get_next(&ns->idr, tid);
> @@ -137,6 +139,8 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
>  	struct task_struct *curr_task;
>  	int curr_fd = info->fd;
>
> +	cond_resched();
> +

Instead of adding it to every *seq_get_next() it probably should be in
bpf_seq_read().
If cond_resched() is needed in task_file_seq_get_next() it should
probably be after 'again:'.
* Re: [PATCH bpf 1/2] bpf: fix a rcu_sched stall issue with bpf task/task_file iterator

From: Yonghong Song @ 2020-08-18 17:13 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Network Development, Alexei Starovoitov, Daniel Borkmann, Kernel Team, Paul E. McKenney

On 8/18/20 10:06 AM, Alexei Starovoitov wrote:
> On Tue, Aug 18, 2020 at 9:24 AM Yonghong Song <yhs@fb.com> wrote:
>> index f21b5e1e4540..885b14cab2c0 100644
>> --- a/kernel/bpf/task_iter.c
>> +++ b/kernel/bpf/task_iter.c
>> @@ -27,6 +27,8 @@ static struct task_struct *task_seq_get_next(struct pid_namespace *ns,
>>  	struct task_struct *task = NULL;
>>  	struct pid *pid;
>>
>> +	cond_resched();
>> +
>>  	rcu_read_lock();
>>  retry:
>>  	pid = idr_get_next(&ns->idr, tid);
>> @@ -137,6 +139,8 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
>>  	struct task_struct *curr_task;
>>  	int curr_fd = info->fd;
>>
>> +	cond_resched();
>> +
>
> Instead of adding it to every *seq_get_next() it probably should be in
> bpf_seq_read().

Yes, we can add cond_resched() to bpf_seq_read(). This should cover
both cases. Will make the change.

> If cond_resched() is needed in task_file_seq_get_next() it should
> probably be after 'again:'.

We probably do not need it there unless all tasks have zero files, or
each file was just closed with f->f_count == 0 but the file pointer is
still there.
* [PATCH bpf 2/2] bpf: avoid visit same object multiple times

From: Yonghong Song @ 2020-08-18 16:24 UTC (permalink / raw)
To: bpf, netdev
Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Rik van Riel

Currently when traversing all tasks, the next tid is always increased
by one. This may result in visiting the same task multiple times in a
pid namespace. This patch fixes the issue by setting the next tid as
pid_nr_ns(pid, ns) + 1, similar to function next_tgid().

Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/task_iter.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 885b14cab2c0..d6e1a141ea5c 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -31,8 +31,9 @@ static struct task_struct *task_seq_get_next(struct pid_namespace *ns,
 
 	rcu_read_lock();
 retry:
-	pid = idr_get_next(&ns->idr, tid);
+	pid = find_ge_pid(*tid, ns);
 	if (pid) {
+		*tid = pid_nr_ns(pid, ns);
 		task = get_pid_task(pid, PIDTYPE_PID);
 		if (!task) {
 			++*tid;
-- 
2.24.1