LKML Archive on lore.kernel.org
* [RFC Patch] perf_event: fix a cgroup switch warning
@ 2019-05-14  0:27 Cong Wang
  2019-05-14 12:32 ` Peter Zijlstra
  0 siblings, 1 reply; 3+ messages in thread
From: Cong Wang @ 2019-05-14  0:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Cong Wang, Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim

We have been consistently triggering the warning
WARN_ON_ONCE(cpuctx->cgrp) in perf_cgroup_switch() for a rather
long time, although we still have no clue on how to reproduce it.

Looking into the code, it seems the only possibility here is that
the process calling perf_event_open() with a cgroup target exits
before the process in the target cgroup exits but after it gains
CPU to run. This is because we use the atomic counter
perf_cgroup_events as an indication of whether cgroup perf events
are enabled, which is inaccurate, as illustrated below:

CPU 0					CPU 1
// open perf events with a cgroup
// target for all CPU's
perf_event_open():
  account_event_cpu()
  // perf_cgroup_events == 1
				// Schedule in a process in the target cgroup
				perf_cgroup_switch()
perf_event_release_kernel():
  unaccount_event_cpu()
  // perf_cgroup_events == 0
				// schedule out
				// but perf_cgroup_sched_out() is skipped
				// cpuctx->cgrp left as non-NULL

				// schedule in another process
				perf_cgroup_switch() // WARN triggered
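
For reference, a condensed view of the switch path involved (abridged
from kernel/events/core.c; locking/irq details and most comments elided,
so this is not the exact upstream code):

	static void perf_cgroup_switch(struct task_struct *task, int mode)
	{
		struct perf_cpu_context *cpuctx;

		/* one entry per CPU context that currently has cgroup events */
		list_for_each_entry(cpuctx, this_cpu_ptr(&cgrp_cpuctx_list),
				    cgrp_cpuctx_entry) {
			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
			perf_pmu_disable(cpuctx->ctx.pmu);

			if (mode & PERF_CGROUP_SWOUT) {
				cpu_ctx_sched_out(cpuctx, EVENT_ALL);
				/* must only be cleared after sched out */
				cpuctx->cgrp = NULL;
			}
			if (mode & PERF_CGROUP_SWIN) {
				WARN_ON_ONCE(cpuctx->cgrp);	/* <- the warning above */
				cpuctx->cgrp = perf_cgroup_from_task(task,
								     &cpuctx->ctx);
				cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
			}

			perf_pmu_enable(cpuctx->ctx.pmu);
			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
		}
	}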

The proposed fix is kinda ugly, as it adds a flag in each process to
indicate whether this process has to go through perf_cgroup_sched_out()
when perf_cgroup_events gives a false negative. The other possible fix
is to force a reschedule on each target CPU before decreasing the
counter perf_cgroup_events, but that is expensive.
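
For completeness: the sched-in hook is gated on the same per-CPU counter
before it does any cgroup switching, so the pairing of sched out/in
depends entirely on that counter being accurate (trimmed sketch, not the
exact code):

	void __perf_event_task_sched_in(struct task_struct *prev,
					struct task_struct *task)
	{
		...
		/* same counter checked in __perf_event_task_sched_out() */
		if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
			perf_cgroup_sched_in(prev, task);
		...
	}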

Suggestions? Thoughts?

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/linux/sched.h | 3 +++
 kernel/events/core.c  | 5 ++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a2cd15855bad..835bdf15f92c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -733,6 +733,9 @@ struct task_struct {
 	/* to be used once the psi infrastructure lands upstream. */
 	unsigned			use_memdelay:1;
 #endif
+#ifdef CONFIG_PERF_EVENTS
+	unsigned			perf_cgroup_sched_in:1;
+#endif
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index abbd4b3b96c2..9b86b043018e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -817,6 +817,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 			 * to event_filter_match() in event_sched_out()
 			 */
 			cpuctx->cgrp = NULL;
+			task->perf_cgroup_sched_in = 0;
 		}
 
 		if (mode & PERF_CGROUP_SWIN) {
@@ -831,6 +832,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 			cpuctx->cgrp = perf_cgroup_from_task(task,
 							     &cpuctx->ctx);
 			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+			task->perf_cgroup_sched_in = 1;
 		}
 		perf_pmu_enable(cpuctx->ctx.pmu);
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3233,7 +3235,8 @@ void __perf_event_task_sched_out(struct task_struct *task,
 	 * to check if we have to switch out PMU state.
 	 * cgroup event are system-wide mode only
 	 */
-	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
+	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)) ||
+	    task->perf_cgroup_sched_in)
 		perf_cgroup_sched_out(task, next);
 }
 
-- 
2.21.0



* Re: [RFC Patch] perf_event: fix a cgroup switch warning
  2019-05-14  0:27 [RFC Patch] perf_event: fix a cgroup switch warning Cong Wang
@ 2019-05-14 12:32 ` Peter Zijlstra
  2019-05-14 18:06   ` Cong Wang
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Zijlstra @ 2019-05-14 12:32 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim

On Mon, May 13, 2019 at 05:27:47PM -0700, Cong Wang wrote:
> We have been consistently triggering the warning
> WARN_ON_ONCE(cpuctx->cgrp) in perf_cgroup_switch() for a rather
> long time, although we still have no clue on how to reproduce it.
> 
> Looking into the code, it seems the only possibility here is that
> the process calling perf_event_open() with a cgroup target exits
> before the process in the target cgroup exits but after it gains
> CPU to run. This is because we use the atomic counter
> perf_cgroup_events as an indication of whether cgroup perf events
> are enabled, which is inaccurate, as illustrated below:
> 
> CPU 0					CPU 1
> // open perf events with a cgroup
> // target for all CPU's
> perf_event_open():
>   account_event_cpu()
>   // perf_cgroup_events == 1
> 				// Schedule in a process in the target cgroup
> 				perf_cgroup_switch()
> perf_event_release_kernel():
>   unaccount_event_cpu()
>   // perf_cgroup_events == 0
> 				// schedule out
> 				// but perf_cgroup_sched_out() is skipped
> 				// cpuctx->cgrp left as non-NULL

				which implies we observed:
				'perf_cgroup_events == 0'

> 				// schedule in another process
> 				perf_cgroup_switch() // WARN triggered

				which implies we observed:
				'perf_cgroup_events == 1'


Which is impossible. It _might_ have been possible if the out and in
happened on different CPUs. But then I'm not sure that is enough to
trigger the problem.

> The proposed fix is kinda ugly,

Yes :-)

> Suggestions? Thoughts?

At perf_event_release time, when it is the last cgroup event, there
should not be any cgroup events running anymore, so ideally
perf_cgroup_switch() would not set state.

Furthermore; list_update_cgroup_event() will actually clear cpuctx->cgrp
on removal of the last cgroup event.
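
For reference, the shape of that path (heavily abridged, not the exact
upstream code):

	static void list_update_cgroup_event(struct perf_event *event,
					     struct perf_event_context *ctx,
					     bool add)
	{
		struct perf_cpu_context *cpuctx;

		if (!is_cgroup_event(event))
			return;

		/* only the first add / the last remove falls through */
		if (add && ctx->nr_cgroups++)
			return;
		else if (!add && --ctx->nr_cgroups)
			return;

		cpuctx = container_of(ctx, struct perf_cpu_context, ctx);
		if (add) {
			list_add(&cpuctx->cgrp_cpuctx_entry,
				 this_cpu_ptr(&cgrp_cpuctx_list));
			/* cpuctx->cgrp is set separately if current matches */
		} else {
			list_del(&cpuctx->cgrp_cpuctx_entry);
			cpuctx->cgrp = NULL;
		}
	}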

Also; perf_cgroup_switch() will WARN when there are not in fact any
cgroup events at all. I would expect that WARN to trigger too in your
scenario. But you're not seeing that?

I do however note that that check seems racy; we do that without holding
the ctx_lock.


* Re: [RFC Patch] perf_event: fix a cgroup switch warning
  2019-05-14 12:32 ` Peter Zijlstra
@ 2019-05-14 18:06   ` Cong Wang
  0 siblings, 0 replies; 3+ messages in thread
From: Cong Wang @ 2019-05-14 18:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim

On Tue, May 14, 2019 at 5:32 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, May 13, 2019 at 05:27:47PM -0700, Cong Wang wrote:
> > We have been consistently triggering the warning
> > WARN_ON_ONCE(cpuctx->cgrp) in perf_cgroup_switch() for a rather
> > long time, although we still have no clue on how to reproduce it.
> >
> > Looking into the code, it seems the only possibility here is that
> > the process calling perf_event_open() with a cgroup target exits
> > before the process in the target cgroup exits but after it gains
> > CPU to run. This is because we use the atomic counter
> > perf_cgroup_events as an indication of whether cgroup perf events
> > are enabled, which is inaccurate, as illustrated below:
> >
> > CPU 0                                 CPU 1
> > // open perf events with a cgroup
> > // target for all CPU's
> > perf_event_open():
> >   account_event_cpu()
> >   // perf_cgroup_events == 1
> >                               // Schedule in a process in the target cgroup
> >                               perf_cgroup_switch()
> > perf_event_release_kernel():
> >   unaccount_event_cpu()
> >   // perf_cgroup_events == 0
> >                               // schedule out
> >                               // but perf_cgroup_sched_out() is skipped
> >                               // cpuctx->cgrp left as non-NULL
>
>                                 which implies we observed:
>                                 'perf_cgroup_events == 0'
>
> >                               // schedule in another process
> >                               perf_cgroup_switch() // WARN triggered
>
>                                 which implies we observed:
>                                 'perf_cgroup_events == 1'
>
>
> Which is impossible. It _might_ have been possible if the out and in
> happened on different CPUs. But then I'm not sure that is enough to
> trigger the problem.

Good catch, but this just needs one more perf_event_open(),
right? :)
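
Roughly, something like this (illustrative only, extending the sequence
from the changelog):

CPU 0					CPU 1
perf_event_open()
// perf_cgroup_events == 1
					// sched in a process in the
					// target cgroup,
					// perf_cgroup_switch() sets
					// cpuctx->cgrp
perf_event_release_kernel()
// perf_cgroup_events == 0
					// sched out, reads 0, skips
					// perf_cgroup_sched_out(),
					// cpuctx->cgrp left non-NULL
perf_event_open()
// perf_cgroup_events == 1
					// sched in another process,
					// reads 1, perf_cgroup_switch()
					// hits WARN_ON_ONCE(cpuctx->cgrp)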


>
> > The proposed fix is kinda ugly,
>
> Yes :-)
>
> > Suggestions? Thoughts?
>
> At perf_event_release time, when it is the last cgroup event, there
> should not be any cgroup events running anymore, so ideally
> perf_cgroup_switch() would not set state.
>
> Furthermore; list_update_cgroup_event() will actually clear cpuctx->cgrp
> on removal of the last cgroup event.

Ah, yes, this probably explains why it is harder to trigger than I expected.


>
> Also; perf_cgroup_switch() will WARN when there are not in fact any
> cgroup events at all. I would expect that WARN to trigger too in your
> scenario. But you're not seeing that?

Not sure if I follow you, but if there is no cgroup event, cgrp_cpuctx_list
should be empty, right?

From the stack traces I can't tell; what I can tell is that we use cgroup
events in most cases.


>
> I do however note that that check seems racy; we do that without holding
> the ctx_lock.

Hmm? perf_ctx_lock() is taken in perf_cgroup_switch(), so I think locking
is fine.

Thanks.

