LKML Archive on lore.kernel.org
* [PATCH 0/2] sched,kthread: Fix TASK_PARKED and special sleep states
@ 2018-04-30 14:17 Peter Zijlstra
2018-04-30 14:17 ` [PATCH 1/2] kthread: Fix kthread_parkme() wait-loop Peter Zijlstra
2018-04-30 14:17 ` [PATCH 2/2] sched: Introduce set_special_state() Peter Zijlstra
0 siblings, 2 replies; 9+ messages in thread
From: Peter Zijlstra @ 2018-04-30 14:17 UTC (permalink / raw)
To: mingo, tglx
Cc: linux-kernel, oleg, will.deacon, mpe, bigeasy, gkohli, neeraju, peterz
Hi,
The first patch should fix the reported TASK_PARKED problem and the second
should fix any potential TASK_STOPPED, TASK_TRACED issues.
* [PATCH 1/2] kthread: Fix kthread_parkme() wait-loop
2018-04-30 14:17 [PATCH 0/2] sched,kthread: Fix TASK_PARKED and special sleep states Peter Zijlstra
@ 2018-04-30 14:17 ` Peter Zijlstra
2018-04-30 14:17 ` [PATCH 2/2] sched: Introduce set_special_state() Peter Zijlstra
1 sibling, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2018-04-30 14:17 UTC (permalink / raw)
To: mingo, tglx
Cc: linux-kernel, oleg, will.deacon, mpe, bigeasy, gkohli, neeraju, peterz
[-- Attachment #1: peterz-sched-smpboot-1.patch --]
[-- Type: text/plain, Size: 1859 bytes --]
Gaurav reported a problem with __kthread_parkme() where a concurrent
try_to_wake_up() could result in competing stores to ->state; when the
TASK_PARKED store got lost, bad things would happen.
The comment near set_current_state() actually mentions this competing
store, but only for the case against TASK_RUNNING. The same store, with
different timing, can also happen against a subsequent !RUNNING store.
This is normally not a problem because, as per that same comment, the
!RUNNING state store is inside a condition-based wait-loop:
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!need_sleep)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);
If we lose the (first) TASK_UNINTERRUPTIBLE store to a previous
(concurrent) wakeup, the schedule() will NO-OP and we'll go around the
loop once more.
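That recovery property can be checked with a small userspace model (plain C, not kernel code; the function and the single injected "stray" store are illustrative assumptions): losing the first !RUNNING store to a delayed wakeup costs exactly one extra trip around the loop, and the condition test still terminates it correctly.

```c
#include <assert.h>

enum model_state { ST_RUNNING, ST_UNINTERRUPTIBLE };

/*
 * Deterministic replay of the scenario in the comment: the first
 * TASK_UNINTERRUPTIBLE store is overwritten by the TASK_RUNNING store
 * of a previous, delayed wakeup.  schedule() only "sleeps" (and lets
 * the condition be cleared) when the task is not RUNNING.
 * Returns the number of loop iterations taken.
 */
static int wait_loop_iterations(int stray_wakeup)
{
	enum model_state state;
	int need_sleep = 1;
	int iters = 0;

	for (;;) {
		state = ST_UNINTERRUPTIBLE;	/* set_current_state(TASK_UNINTERRUPTIBLE) */
		if (stray_wakeup) {
			state = ST_RUNNING;	/* delayed ttwu() store lands here */
			stray_wakeup = 0;	/* it happens at most once */
		}
		if (!need_sleep)
			break;
		iters++;
		if (state == ST_UNINTERRUPTIBLE)
			need_sleep = 0;		/* schedule(): really slept, then woken */
		/* else: schedule() NO-OPs and we go around once more */
	}
	state = ST_RUNNING;			/* __set_current_state(TASK_RUNNING) */
	(void)state;
	return iters;
}
```

With no interference the loop sleeps once and exits; with the stray store it takes exactly one extra iteration, matching the "go around the loop once more" claim.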
The problem here is that the TASK_PARKED store is not inside the
KTHREAD_SHOULD_PARK condition wait-loop.
There is a genuine issue with sleeps that do not have a condition;
this is addressed in a subsequent patch.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/kthread.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -177,12 +177,13 @@ void *kthread_probe_data(struct task_str
 static void __kthread_parkme(struct kthread *self)
 {
-	__set_current_state(TASK_PARKED);
-	while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
+	for (;;) {
+		set_current_state(TASK_PARKED);
+		if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
+			break;
 		if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
 			complete(&self->parked);
 		schedule();
-		__set_current_state(TASK_PARKED);
 	}
 	clear_bit(KTHREAD_IS_PARKED, &self->flags);
 	__set_current_state(TASK_RUNNING);
* [PATCH 2/2] sched: Introduce set_special_state()
2018-04-30 14:17 [PATCH 0/2] sched,kthread: Fix TASK_PARKED and special sleep states Peter Zijlstra
2018-04-30 14:17 ` [PATCH 1/2] kthread: Fix kthread_parkme() wait-loop Peter Zijlstra
@ 2018-04-30 14:17 ` Peter Zijlstra
2018-04-30 16:45 ` Oleg Nesterov
1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-04-30 14:17 UTC (permalink / raw)
To: mingo, tglx
Cc: linux-kernel, oleg, will.deacon, mpe, bigeasy, gkohli, neeraju, peterz
[-- Attachment #1: peterz-sched-smpboot-2.patch --]
[-- Type: text/plain, Size: 6347 bytes --]
Gaurav reported a perceived problem with TASK_PARKED, which turned out
to be a broken wait-loop pattern in __kthread_parkme(); however, the
reported issue can (and does) in fact happen for states that do not do
condition-based sleeps.
When the 'current->state = TASK_RUNNING' store of a previous
(concurrent) try_to_wake_up() collides with the setting of a 'special'
sleep state, we can lose the sleep state.
Normal condition-based wait-loops are immune to this problem, but sleep
states that are not condition based are subject to it.
There already is a fix for TASK_DEAD. Abstract that and also apply it to
TASK_STOPPED and TASK_TRACED, neither of which uses a condition-based
wait-loop.
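The pi_lock trick the patch builds on can be illustrated with a userspace model (a sketch, not kernel code: a pthread mutex stands in for current->pi_lock, a plain int for ->state, and all names are illustrative). Because the in-flight wakeup performs its TASK_RUNNING store while holding the lock, a special-state store that takes the same lock is strictly ordered after it and can no longer be lost.

```c
#include <pthread.h>

enum { MODEL_RUNNING = 0, MODEL_TASK_DEAD = 2 };

static pthread_mutex_t model_pi_lock = PTHREAD_MUTEX_INITIALIZER;
static int model_state = MODEL_RUNNING;

/* set_special_state(): the store happens entirely inside the lock */
static void *special_setter(void *arg)
{
	pthread_mutex_lock(&model_pi_lock);
	model_state = MODEL_TASK_DEAD;
	pthread_mutex_unlock(&model_pi_lock);
	return NULL;
}

/*
 * Replay the race: an in-flight try_to_wake_up() already holds pi_lock
 * when the task tries to set a special state.  The setter must wait for
 * the wakeup's TASK_RUNNING store to complete, so the special state
 * survives instead of being overwritten.
 */
static int set_special_state_demo(void)
{
	pthread_t setter;

	pthread_mutex_lock(&model_pi_lock);	/* ttwu() in flight, holds pi_lock */
	pthread_create(&setter, NULL, special_setter, NULL);
	model_state = MODEL_RUNNING;		/* ttwu(): p->state = TASK_RUNNING */
	pthread_mutex_unlock(&model_pi_lock);	/* ttwu() done */

	pthread_join(setter, NULL);
	return model_state;			/* TASK_DEAD: the store was not lost */
}
```

The outcome is deterministic: the setter cannot run until the "wakeup" releases the lock, so the special state always wins.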
Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/sched.h | 52 ++++++++++++++++++++++++++++++++++++++-----
include/linux/sched/signal.h | 2 -
kernel/sched/core.c | 17 --------------
kernel/signal.c | 4 +--
4 files changed, 51 insertions(+), 24 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -112,17 +112,38 @@ struct task_group;
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+/*
+ * Special states are those that do not use the normal wait-loop pattern. See
+ * the comment with set_special_state().
+ */
+#define is_special_task_state(state)					\
+	((state) == TASK_DEAD ||					\
+	 (state) == TASK_STOPPED ||					\
+	 (state) == TASK_TRACED)
+
 #define __set_current_state(state_value)				\
 	do {								\
+		WARN_ON_ONCE(is_special_task_state(state_value));	\
 		current->task_state_change = _THIS_IP_;			\
 		current->state = (state_value);				\
 	} while (0)
+
 #define set_current_state(state_value)					\
 	do {								\
+		WARN_ON_ONCE(is_special_task_state(state_value));	\
 		current->task_state_change = _THIS_IP_;			\
 		smp_store_mb(current->state, (state_value));		\
 	} while (0)
+
+#define set_special_state(state_value)					\
+	do {								\
+		unsigned long flags; /* may shadow */			\
+		WARN_ON_ONCE(!is_special_task_state(state_value));	\
+		raw_spin_lock_irqsave(&current->pi_lock, flags);	\
+		current->task_state_change = _THIS_IP_;			\
+		current->state = (state_value);				\
+		raw_spin_unlock_irqrestore(&current->pi_lock, flags);	\
+	} while (0)
#else
/*
* set_current_state() includes a barrier so that the write of current->state
@@ -144,8 +165,8 @@ struct task_group;
*
* The above is typically ordered against the wakeup, which does:
*
- * need_sleep = false;
- * wake_up_state(p, TASK_UNINTERRUPTIBLE);
+ * need_sleep = false;
+ * wake_up_state(p, TASK_UNINTERRUPTIBLE);
*
* Where wake_up_state() (and all other wakeup primitives) imply enough
* barriers to order the store of the variable against wakeup.
@@ -154,12 +175,33 @@ struct task_group;
* once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
* TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
*
- * This is obviously fine, since they both store the exact same value.
+ * However, with slightly different timing the wakeup TASK_RUNNING store can
+ * also collide with the TASK_UNINTERRUPTIBLE store. Losing that store is not
+ * a problem either because that will result in one extra go around the loop
+ * and our @cond test will save the day.
*
* Also see the comments of try_to_wake_up().
*/
-#define __set_current_state(state_value) do { current->state = (state_value); } while (0)
-#define set_current_state(state_value) smp_store_mb(current->state, (state_value))
+#define __set_current_state(state_value)				\
+	current->state = (state_value)
+
+#define set_current_state(state_value)					\
+	smp_store_mb(current->state, (state_value))
+
+/*
+ * set_special_state() should be used for those states when the blocking task
+ * cannot use the regular condition-based wait-loop. In that case we must
+ * serialize against wakeups such that any possible in-flight TASK_RUNNING
+ * stores will not collide with our state change.
+ */
+#define set_special_state(state_value)					\
+	do {								\
+		unsigned long flags; /* may shadow */			\
+		raw_spin_lock_irqsave(&current->pi_lock, flags);	\
+		current->state = (state_value);				\
+		raw_spin_unlock_irqrestore(&current->pi_lock, flags);	\
+	} while (0)
+
+
#endif
/* Task command name length: */
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -280,7 +280,7 @@ static inline void kernel_signal_stop(vo
 {
 	spin_lock_irq(&current->sighand->siglock);
 	if (current->jobctl & JOBCTL_STOP_DEQUEUED)
-		__set_current_state(TASK_STOPPED);
+		set_special_state(TASK_STOPPED);
 	spin_unlock_irq(&current->sighand->siglock);
 	schedule();
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3498,23 +3498,8 @@ static void __sched notrace __schedule(b
void __noreturn do_task_dead(void)
{
-	/*
-	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
-	 * when the following two conditions become true.
-	 *   - There is race condition of mmap_sem (It is acquired by
-	 *     exit_mm()), and
-	 *   - SMI occurs before setting TASK_RUNINNG.
-	 *     (or hypervisor of virtual machine switches to other guest)
-	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
-	 *
-	 * To avoid it, we have to wait for releasing tsk->pi_lock which
-	 * is held by try_to_wake_up()
-	 */
-	raw_spin_lock_irq(&current->pi_lock);
-	raw_spin_unlock_irq(&current->pi_lock);
-
 	/* Causes final put_task_struct in finish_task_switch(): */
-	__set_current_state(TASK_DEAD);
+	set_special_state(TASK_DEAD);
/* Tell freezer to ignore us: */
current->flags |= PF_NOFREEZE;
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1968,7 +1968,7 @@ static void ptrace_stop(int exit_code, i
* atomic with respect to siglock and should be done after the arch
* hook as siglock is released and regrabbed across it.
*/
- set_current_state(TASK_TRACED);
+ set_special_state(TASK_TRACED);
current->last_siginfo = info;
current->exit_code = exit_code;
@@ -2176,7 +2176,7 @@ static bool do_signal_stop(int signr)
 	if (task_participate_group_stop(current))
 		notify = CLD_STOPPED;
-	__set_current_state(TASK_STOPPED);
+	set_special_state(TASK_STOPPED);
 	spin_unlock_irq(&current->sighand->siglock);
/*
* Re: [PATCH 2/2] sched: Introduce set_special_state()
2018-04-30 14:17 ` [PATCH 2/2] sched: Introduce set_special_state() Peter Zijlstra
@ 2018-04-30 16:45 ` Oleg Nesterov
2018-04-30 19:40 ` Peter Zijlstra
0 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2018-04-30 16:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju
On 04/30, Peter Zijlstra wrote:
>
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1968,7 +1968,7 @@ static void ptrace_stop(int exit_code, i
> * atomic with respect to siglock and should be done after the arch
> * hook as siglock is released and regrabbed across it.
> */
> - set_current_state(TASK_TRACED);
> + set_special_state(TASK_TRACED);
Yes, but please note the comment above, we need a barrier after state = TASK_TRACED,
that is why ptrace_stop() does set_current_state(), not __set_current_state().
Otherwise both patches look good to me, feel free to add
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
* Re: [PATCH 2/2] sched: Introduce set_special_state()
2018-04-30 16:45 ` Oleg Nesterov
@ 2018-04-30 19:40 ` Peter Zijlstra
2018-05-01 9:50 ` Peter Zijlstra
0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-04-30 19:40 UTC (permalink / raw)
To: Oleg Nesterov
Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju
On Mon, Apr 30, 2018 at 06:45:46PM +0200, Oleg Nesterov wrote:
> On 04/30, Peter Zijlstra wrote:
> >
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -1968,7 +1968,7 @@ static void ptrace_stop(int exit_code, i
> > * atomic with respect to siglock and should be done after the arch
> > * hook as siglock is released and regrabbed across it.
> > */
> > - set_current_state(TASK_TRACED);
> > + set_special_state(TASK_TRACED);
>
> Yes, but please note the comment above, we need a barrier after state = TASK_TRACED,
> that is why ptrace_stop() does set_current_state(), not __set_current_state().
OK, so I got properly lost in that stuff.
The comment says we need to set TASK_TRACED before clearing
JOBCTL_TRAPPING because of do_wait(), but I cannot seem to locate code
in do_wait() and below that cares about JOBCTL_TRAPPING.
Could you elucidate my tired brain?
* Re: [PATCH 2/2] sched: Introduce set_special_state()
2018-04-30 19:40 ` Peter Zijlstra
@ 2018-05-01 9:50 ` Peter Zijlstra
2018-05-01 13:59 ` Oleg Nesterov
0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-05-01 9:50 UTC (permalink / raw)
To: Oleg Nesterov
Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju
On Mon, Apr 30, 2018 at 09:40:24PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 30, 2018 at 06:45:46PM +0200, Oleg Nesterov wrote:
> > On 04/30, Peter Zijlstra wrote:
> > >
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -1968,7 +1968,7 @@ static void ptrace_stop(int exit_code, i
> > > * atomic with respect to siglock and should be done after the arch
> > > * hook as siglock is released and regrabbed across it.
> > > */
> > > - set_current_state(TASK_TRACED);
> > > + set_special_state(TASK_TRACED);
> >
> > Yes, but please note the comment above, we need a barrier after state = TASK_TRACED,
> > that is why ptrace_stop() does set_current_state(), not __set_current_state().
>
> OK, so I got properly lost in that stuff.
>
> The comment says we need to set TASK_TRACED before clearing
> JOBCTL_TRAPPING because of do_wait(), but I cannot seem to locate code
> in do_wait() and below that cares about JOBCTL_TRAPPING.
>
> Could you elucidate my tired brain?
The only code I found that seems to care is ptrace_attach(), where we
wait for JOBCTL_TRAPPING to get cleared. That same function has a
comment about hiding the STOPPED -> RUNNING -> TRACED transition. So I'm
assuming it needs to observe TRACED if it observes !TRAPPING.
But I don't think there's enough barriers on that end to guarantee this.
Any ->state load after wait_on_bit() is, afaict, free to have happened
before the ->jobctl load.
* Re: [PATCH 2/2] sched: Introduce set_special_state()
2018-05-01 9:50 ` Peter Zijlstra
@ 2018-05-01 13:59 ` Oleg Nesterov
2018-05-01 14:22 ` Peter Zijlstra
0 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2018-05-01 13:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju
On 05/01, Peter Zijlstra wrote:
>
> The only code I found that seems to care is ptrace_attach(), where we
> wait for JOBCTL_TRAPPING to get cleared. That same function has a
> comment about hiding the STOPPED -> RUNNING -> TRACED transition. So I'm
> assuming it needs to observe TRACED if it observes !TRAPPING.
Yes, exactly.
> But I don't think there's enough barriers on that end to guarantee this.
> Any ->state load after wait_on_bit() is, afaict, free to have happened
> before the ->jobctl load.
do_wait() does set_current_state() before it checks ->state or anything else.
Oleg.
* Re: [PATCH 2/2] sched: Introduce set_special_state()
2018-05-01 13:59 ` Oleg Nesterov
@ 2018-05-01 14:22 ` Peter Zijlstra
2018-05-01 15:10 ` Oleg Nesterov
0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-05-01 14:22 UTC (permalink / raw)
To: Oleg Nesterov
Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju
On Tue, May 01, 2018 at 03:59:24PM +0200, Oleg Nesterov wrote:
> On 05/01, Peter Zijlstra wrote:
> >
> > The only code I found that seems to care is ptrace_attach(), where we
> > wait for JOBCTL_TRAPPING to get cleared. That same function has a
> > comment about hiding the STOPPED -> RUNNING -> TRACED transition. So I'm
> > assuming it needs to observe TRACED if it observes !TRAPPING.
>
> Yes, exactly.
>
> > But I don't think there's enough barriers on that end to guarantee this.
> > Any ->state load after wait_on_bit() is, afaict, free to have happened
> > before the ->jobctl load.
>
> do_wait() does set_current_state() before it checks ->state or anything else.
But how are ptrace_attach() and do_wait() related? I guess I'm missing
something fairly fundamental here. Is it userspace doing these two
system calls back-to-back or something?
Is that not a little bit fragile, to rely on a barrier in do_wait()
for this?
Anyway, does the below look ok?
---
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1961,14 +1961,26 @@ static void ptrace_stop(int exit_code, i
return;
}
+ set_special_state(TASK_TRACED);
+
/*
* We're committing to trapping. TRACED should be visible before
* TRAPPING is cleared; otherwise, the tracer might fail do_wait().
* Also, transition to TRACED and updates to ->jobctl should be
* atomic with respect to siglock and should be done after the arch
* hook as siglock is released and regrabbed across it.
+ *
+ * TRACER TRACEE
+ * ptrace_attach()
+ * [L] wait_on_bit(JOBCTL_TRAPPING) [S] set_special_state(TRACED)
+ * do_wait()
+ * set_current_state() smp_wmb();
+ * ptrace_do_wait()
+ * wait_task_stopped()
+ * task_stopped_code()
+ * [L] task_is_traced() [S] task_clear_jobctl_trapping();
*/
- set_special_state(TASK_TRACED);
+ smp_wmb();
current->last_siginfo = info;
current->exit_code = exit_code;
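The ordering the diagram relies on can be sketched with C11 atomics in userspace (a model only: release/acquire below stand in for the smp_wmb() on the tracee side and the full barrier implied by set_current_state() in do_wait() on the tracer side; the variable names are illustrative, not kernel identifiers):

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int traced;	/* models ->state == TASK_TRACED */
static atomic_int trapping;	/* models JOBCTL_TRAPPING */

/* TRACEE: set_special_state(TASK_TRACED); smp_wmb(); clear TRAPPING */
static void *tracee(void *arg)
{
	atomic_store_explicit(&traced, 1, memory_order_relaxed);
	atomic_store_explicit(&trapping, 0, memory_order_release);
	return NULL;
}

/*
 * TRACER: wait_on_bit(JOBCTL_TRAPPING), then task_is_traced().  The
 * acquire load pairs with the release store above, so once TRAPPING is
 * observed clear, the TASK_TRACED store is guaranteed to be visible.
 */
static int tracer_observes_traced(void)
{
	pthread_t t;
	int s;

	atomic_store(&trapping, 1);
	atomic_store(&traced, 0);

	pthread_create(&t, NULL, tracee, NULL);
	while (atomic_load_explicit(&trapping, memory_order_acquire))
		;				/* wait_on_bit() */
	s = atomic_load_explicit(&traced, memory_order_relaxed);
	pthread_join(t, NULL);
	return s;				/* always 1 under C11 rules */
}
```

By the C11 memory model the acquire load that reads 0 synchronizes with the release store, so the tracer can never see !TRAPPING without also seeing TRACED.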
* Re: [PATCH 2/2] sched: Introduce set_special_state()
2018-05-01 14:22 ` Peter Zijlstra
@ 2018-05-01 15:10 ` Oleg Nesterov
0 siblings, 0 replies; 9+ messages in thread
From: Oleg Nesterov @ 2018-05-01 15:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju
On 05/01, Peter Zijlstra wrote:
>
> On Tue, May 01, 2018 at 03:59:24PM +0200, Oleg Nesterov wrote:
> > On 05/01, Peter Zijlstra wrote:
> > >
> > > The only code I found that seems to care is ptrace_attach(), where we
> > > wait for JOBCTL_TRAPPING to get cleared. That same function has a
> > > comment about hiding the STOPPED -> RUNNING -> TRACED transition. So I'm
> > > assuming it needs to observe TRACED if it observes !TRAPPING.
> >
> > Yes, exactly.
> >
> > > But I don't think there's enough barriers on that end to guarantee this.
> > > Any ->state load after wait_on_bit() is, afact, free to have happened
> > > before the ->jobctl load.
> >
> > do_wait() does set_current_state() before it checks ->state or anything else.
>
> But how are ptrace_attach() and do_wait() related?
Yes.
> I guess I'm missing
> something fairly fundamental here.
You are missing the fact that ptrace API is very old and ugly ;)
Just one example. If the debugger knows that the task is STOPPED then it has
all rights to do, say,
ptrace(PTRACE_ATTACH, pid);
BUG_ON(pid != waitpid(pid, WNOHANG));
Or even do another ptrace() request right after ptrace(PTRACE_ATTACH) returns,
without do_wait().
And unless my memory fools me, gdb even has some test-cases for this... Not sure,
but it certainly looks at tracee->state in /proc before it does PTRACE_ATTACH,
because if it was already STOPPED then gdb won't have any notification from the
tracee.
> Anyway, does the below look ok?
Yes, thanks.
Oleg.