LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
@ 2019-05-27 14:39 Paul E. McKenney
  2019-05-28 20:07 ` Thomas Gleixner
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-05-27 14:39 UTC (permalink / raw)
  To: fweisbec, tglx, mingo; +Cc: linux-kernel

The TASKS03 and TREE04 rcutorture scenarios produce the following
lockdep complaint:

================================
WARNING: inconsistent lock state
5.2.0-rc1+ #513 Not tainted
--------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
(____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
{IN-HARDIRQ-W} state was registered at:
  lock_acquire+0xb0/0x1c0
  _raw_spin_lock_irqsave+0x3c/0x50
  tick_broadcast_switch_to_oneshot+0xd/0x40
  tick_switch_to_oneshot+0x4f/0xd0
  hrtimer_run_queues+0xf3/0x130
  run_local_timers+0x1c/0x50
  update_process_times+0x1c/0x50
  tick_periodic+0x26/0xc0
  tick_handle_periodic+0x1a/0x60
  smp_apic_timer_interrupt+0x80/0x2a0
  apic_timer_interrupt+0xf/0x20
  _raw_spin_unlock_irqrestore+0x4e/0x60
  rcu_nocb_gp_kthread+0x15d/0x590
  kthread+0xf3/0x130
  ret_from_fork+0x3a/0x50
irq event stamp: 171
hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
softirqs last disabled at (0): [<0000000000000000>] 0x0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(tick_broadcast_lock);
  <Interrupt>
    lock(tick_broadcast_lock);

 *** DEADLOCK ***

1 lock held by migration/1/14:
 #0: (____ptrval____) (clockevents_lock){+.+.}, at: tick_offline_cpu+0xf/0x30

stack backtrace:
CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.2.0-rc1+ #513
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
Call Trace:
 dump_stack+0x5e/0x8b
 print_usage_bug+0x1fc/0x216
 ? print_shortest_lock_dependencies+0x1b0/0x1b0
 mark_lock+0x1f2/0x280
 __lock_acquire+0x1e0/0x18f0
 ? __lock_acquire+0x21b/0x18f0
 ? _raw_spin_unlock_irqrestore+0x4e/0x60
 lock_acquire+0xb0/0x1c0
 ? tick_broadcast_offline+0xf/0x70
 _raw_spin_lock+0x33/0x40
 ? tick_broadcast_offline+0xf/0x70
 tick_broadcast_offline+0xf/0x70
 tick_offline_cpu+0x16/0x30
 take_cpu_down+0x7d/0xa0
 multi_cpu_stop+0xa2/0xe0
 ? cpu_stop_queue_work+0xc0/0xc0
 cpu_stopper_thread+0x6d/0x100
 smpboot_thread_fn+0x169/0x240
 kthread+0xf3/0x130
 ? sort_range+0x20/0x20
 ? kthread_cancel_delayed_work_sync+0x10/0x10
 ret_from_fork+0x3a/0x50

It turns out that tick_broadcast_offline() can be invoked with interrupts
enabled, so this commit fixes this issue by replacing the raw_spin_lock()
with raw_spin_lock_irqsave().

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index e51778c312f1..1daf77020230 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -454,12 +454,14 @@ static void tick_shutdown_broadcast(void)
  */
 void tick_broadcast_offline(unsigned int cpu)
 {
-	raw_spin_lock(&tick_broadcast_lock);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
 	cpumask_clear_cpu(cpu, tick_broadcast_mask);
 	cpumask_clear_cpu(cpu, tick_broadcast_on);
 	tick_broadcast_oneshot_offline(cpu);
 	tick_shutdown_broadcast();
-	raw_spin_unlock(&tick_broadcast_lock);
+	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }
 
 #endif


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-05-27 14:39 [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint Paul E. McKenney
@ 2019-05-28 20:07 ` Thomas Gleixner
  2019-05-29 18:19   ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2019-05-28 20:07 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: fweisbec, mingo, linux-kernel

On Mon, 27 May 2019, Paul E. McKenney wrote:

> The TASKS03 and TREE04 rcutorture scenarios produce the following
> lockdep complaint:
> 
> ================================
> WARNING: inconsistent lock state
> 5.2.0-rc1+ #513 Not tainted
> --------------------------------
> inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
> migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
> (____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
> {IN-HARDIRQ-W} state was registered at:
>   lock_acquire+0xb0/0x1c0
>   _raw_spin_lock_irqsave+0x3c/0x50
>   tick_broadcast_switch_to_oneshot+0xd/0x40
>   tick_switch_to_oneshot+0x4f/0xd0
>   hrtimer_run_queues+0xf3/0x130
>   run_local_timers+0x1c/0x50
>   update_process_times+0x1c/0x50
>   tick_periodic+0x26/0xc0
>   tick_handle_periodic+0x1a/0x60
>   smp_apic_timer_interrupt+0x80/0x2a0
>   apic_timer_interrupt+0xf/0x20
>   _raw_spin_unlock_irqrestore+0x4e/0x60
>   rcu_nocb_gp_kthread+0x15d/0x590
>   kthread+0xf3/0x130
>   ret_from_fork+0x3a/0x50
> irq event stamp: 171
> hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
> hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
> softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
> softirqs last disabled at (0): [<0000000000000000>] 0x0
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>        CPU0
>        ----
>   lock(tick_broadcast_lock);
>   <Interrupt>
>     lock(tick_broadcast_lock);
> 
>  *** DEADLOCK ***
> 
> 1 lock held by migration/1/14:
>  #0: (____ptrval____) (clockevents_lock){+.+.}, at: tick_offline_cpu+0xf/0x30
> 
> stack backtrace:
> CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.2.0-rc1+ #513
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
> Call Trace:
>  dump_stack+0x5e/0x8b
>  print_usage_bug+0x1fc/0x216
>  ? print_shortest_lock_dependencies+0x1b0/0x1b0
>  mark_lock+0x1f2/0x280
>  __lock_acquire+0x1e0/0x18f0
>  ? __lock_acquire+0x21b/0x18f0
>  ? _raw_spin_unlock_irqrestore+0x4e/0x60
>  lock_acquire+0xb0/0x1c0
>  ? tick_broadcast_offline+0xf/0x70
>  _raw_spin_lock+0x33/0x40
>  ? tick_broadcast_offline+0xf/0x70
>  tick_broadcast_offline+0xf/0x70
>  tick_offline_cpu+0x16/0x30
>  take_cpu_down+0x7d/0xa0
>  multi_cpu_stop+0xa2/0xe0
>  ? cpu_stop_queue_work+0xc0/0xc0
>  cpu_stopper_thread+0x6d/0x100
>  smpboot_thread_fn+0x169/0x240
>  kthread+0xf3/0x130
>  ? sort_range+0x20/0x20
>  ? kthread_cancel_delayed_work_sync+0x10/0x10
>  ret_from_fork+0x3a/0x50
> 
> It turns out that tick_broadcast_offline() can be invoked with interrupts
> enabled, so this commit fixes this issue by replacing the raw_spin_lock()
> with raw_spin_lock_irqsave().

What?

take_cpu_down() is called from multi_cpu_stop() with interrupts disabled.

So this is just papering over the fact that something called from
take_cpu_down() enabled interrupts. That needs to be found and fixed.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-05-28 20:07 ` Thomas Gleixner
@ 2019-05-29 18:19   ` Paul E. McKenney
  2019-05-30 12:58     ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-05-29 18:19 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: fweisbec, mingo, linux-kernel

On Tue, May 28, 2019 at 01:07:29PM -0700, Thomas Gleixner wrote:
> On Mon, 27 May 2019, Paul E. McKenney wrote:
> 
> > The TASKS03 and TREE04 rcutorture scenarios produce the following
> > lockdep complaint:
> > 
> > ================================
> > WARNING: inconsistent lock state
> > 5.2.0-rc1+ #513 Not tainted
> > --------------------------------
> > inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
> > migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > (____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
> > {IN-HARDIRQ-W} state was registered at:
> >   lock_acquire+0xb0/0x1c0
> >   _raw_spin_lock_irqsave+0x3c/0x50
> >   tick_broadcast_switch_to_oneshot+0xd/0x40
> >   tick_switch_to_oneshot+0x4f/0xd0
> >   hrtimer_run_queues+0xf3/0x130
> >   run_local_timers+0x1c/0x50
> >   update_process_times+0x1c/0x50
> >   tick_periodic+0x26/0xc0
> >   tick_handle_periodic+0x1a/0x60
> >   smp_apic_timer_interrupt+0x80/0x2a0
> >   apic_timer_interrupt+0xf/0x20
> >   _raw_spin_unlock_irqrestore+0x4e/0x60
> >   rcu_nocb_gp_kthread+0x15d/0x590
> >   kthread+0xf3/0x130
> >   ret_from_fork+0x3a/0x50
> > irq event stamp: 171
> > hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
> > hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
> > softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
> > softirqs last disabled at (0): [<0000000000000000>] 0x0
> > 
> > other info that might help us debug this:
> >  Possible unsafe locking scenario:
> > 
> >        CPU0
> >        ----
> >   lock(tick_broadcast_lock);
> >   <Interrupt>
> >     lock(tick_broadcast_lock);
> > 
> >  *** DEADLOCK ***
> > 
> > 1 lock held by migration/1/14:
> >  #0: (____ptrval____) (clockevents_lock){+.+.}, at: tick_offline_cpu+0xf/0x30
> > 
> > stack backtrace:
> > CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.2.0-rc1+ #513
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
> > Call Trace:
> >  dump_stack+0x5e/0x8b
> >  print_usage_bug+0x1fc/0x216
> >  ? print_shortest_lock_dependencies+0x1b0/0x1b0
> >  mark_lock+0x1f2/0x280
> >  __lock_acquire+0x1e0/0x18f0
> >  ? __lock_acquire+0x21b/0x18f0
> >  ? _raw_spin_unlock_irqrestore+0x4e/0x60
> >  lock_acquire+0xb0/0x1c0
> >  ? tick_broadcast_offline+0xf/0x70
> >  _raw_spin_lock+0x33/0x40
> >  ? tick_broadcast_offline+0xf/0x70
> >  tick_broadcast_offline+0xf/0x70
> >  tick_offline_cpu+0x16/0x30
> >  take_cpu_down+0x7d/0xa0
> >  multi_cpu_stop+0xa2/0xe0
> >  ? cpu_stop_queue_work+0xc0/0xc0
> >  cpu_stopper_thread+0x6d/0x100
> >  smpboot_thread_fn+0x169/0x240
> >  kthread+0xf3/0x130
> >  ? sort_range+0x20/0x20
> >  ? kthread_cancel_delayed_work_sync+0x10/0x10
> >  ret_from_fork+0x3a/0x50
> > 
> > It turns out that tick_broadcast_offline() can be invoked with interrupts
> > enabled, so this commit fixes this issue by replacing the raw_spin_lock()
> > with raw_spin_lock_irqsave().
> 
> What?
> 
> take_cpu_down() is called from multi_cpu_stop() with interrupts disabled.
> 
> So this is just papering over the fact that something called from
> take_cpu_down() enabled interrupts. That needs to be found and fixed.

Just posting the information covered via IRC for posterity.

A bisection located commit a0e928ed7c60
("Merge branch 'timers-core-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip").
Yes, this is a merge commit, but both commits feeding into it are
fine, but the merge fails.  There were no merge conflicts.

It turns out that tick_broadcast_offline() was in innocent bystander.
After all, interrupts are supposed to be disabled throughout
take_cpu_down(), and therefore should have been disabled upon entry to
tick_offline_cpu() and thus to tick_broadcast_offline().

The function returning with irqs enabled was sched_cpu_dying().  It had
irqs enabled after return from sched_tick_stop().  And it had irqs enabled
after return from cancel_delayed_work_sync().  Which is a wrapper around
__cancel_work_timer().  Which can sleep in the case where something else
is concurrently trying to cancel the same delayed work, and sleeping is
a decidedly bad idea when you are invoked from take_cpu_down().

None of these functions have been changed (at all!) in the past year,
so my guess is that some other code was introduced that can race on
__cancel_work_timer().  Except that I am not seeing any other call
to cancel_delayed_work_sync().

Thoughts?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-05-29 18:19   ` Paul E. McKenney
@ 2019-05-30 12:58     ` Paul E. McKenney
  2019-05-31  1:36       ` Frederic Weisbecker
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-05-30 12:58 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: fweisbec, mingo, linux-kernel

On Wed, May 29, 2019 at 11:19:41AM -0700, Paul E. McKenney wrote:
> On Tue, May 28, 2019 at 01:07:29PM -0700, Thomas Gleixner wrote:

[ . . . ]

> > What?
> > 
> > take_cpu_down() is called from multi_cpu_stop() with interrupts disabled.
> > 
> > So this is just papering over the fact that something called from
> > take_cpu_down() enabled interrupts. That needs to be found and fixed.
> 
> Just posting the information covered via IRC for posterity.
> 
> A bisection located commit a0e928ed7c60
> ("Merge branch 'timers-core-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip").
> Yes, this is a merge commit, but both commits feeding into it are
> fine, but the merge fails.  There were no merge conflicts.
> 
> It turns out that tick_broadcast_offline() was in innocent bystander.
> After all, interrupts are supposed to be disabled throughout
> take_cpu_down(), and therefore should have been disabled upon entry to
> tick_offline_cpu() and thus to tick_broadcast_offline().
> 
> The function returning with irqs enabled was sched_cpu_dying().  It had
> irqs enabled after return from sched_tick_stop().  And it had irqs enabled
> after return from cancel_delayed_work_sync().  Which is a wrapper around
> __cancel_work_timer().  Which can sleep in the case where something else
> is concurrently trying to cancel the same delayed work, and sleeping is
> a decidedly bad idea when you are invoked from take_cpu_down().
> 
> None of these functions have been changed (at all!) in the past year,
> so my guess is that some other code was introduced that can race on
> __cancel_work_timer().  Except that I am not seeing any other call
> to cancel_delayed_work_sync().

And please see below for what might even be a proper patch.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit 82801a716abf999d33846600a5d0faf7638a9b98
Author: Paul E. McKenney <paulmck@linux.ibm.com>
Date:   Thu May 30 05:39:25 2019 -0700

    time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
    
    The TASKS03 and TREE04 rcutorture scenarios produce the following
    lockdep complaint:
    
    ------------------------------------------------------------------------
    
    ================================
    WARNING: inconsistent lock state
    5.2.0-rc1+ #513 Not tainted
    --------------------------------
    inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
    migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
    {IN-HARDIRQ-W} state was registered at:
      lock_acquire+0xb0/0x1c0
      _raw_spin_lock_irqsave+0x3c/0x50
      tick_broadcast_switch_to_oneshot+0xd/0x40
      tick_switch_to_oneshot+0x4f/0xd0
      hrtimer_run_queues+0xf3/0x130
      run_local_timers+0x1c/0x50
      update_process_times+0x1c/0x50
      tick_periodic+0x26/0xc0
      tick_handle_periodic+0x1a/0x60
      smp_apic_timer_interrupt+0x80/0x2a0
      apic_timer_interrupt+0xf/0x20
      _raw_spin_unlock_irqrestore+0x4e/0x60
      rcu_nocb_gp_kthread+0x15d/0x590
      kthread+0xf3/0x130
      ret_from_fork+0x3a/0x50
    irq event stamp: 171
    hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
    hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
    softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
    softirqs last disabled at (0): [<0000000000000000>] 0x0
    
    other info that might help us debug this:
     Possible unsafe locking scenario:
    
           CPU0
           ----
      lock(tick_broadcast_lock);
      <Interrupt>
        lock(tick_broadcast_lock);
    
     *** DEADLOCK ***
    
    1 lock held by migration/1/14:
     #0: (____ptrval____) (clockevents_lock){+.+.}, at: tick_offline_cpu+0xf/0x30
    
    stack backtrace:
    CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.2.0-rc1+ #513
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
    Call Trace:
     dump_stack+0x5e/0x8b
     print_usage_bug+0x1fc/0x216
     ? print_shortest_lock_dependencies+0x1b0/0x1b0
     mark_lock+0x1f2/0x280
     __lock_acquire+0x1e0/0x18f0
     ? __lock_acquire+0x21b/0x18f0
     ? _raw_spin_unlock_irqrestore+0x4e/0x60
     lock_acquire+0xb0/0x1c0
     ? tick_broadcast_offline+0xf/0x70
     _raw_spin_lock+0x33/0x40
     ? tick_broadcast_offline+0xf/0x70
     tick_broadcast_offline+0xf/0x70
     tick_offline_cpu+0x16/0x30
     take_cpu_down+0x7d/0xa0
     multi_cpu_stop+0xa2/0xe0
     ? cpu_stop_queue_work+0xc0/0xc0
     cpu_stopper_thread+0x6d/0x100
     smpboot_thread_fn+0x169/0x240
     kthread+0xf3/0x130
     ? sort_range+0x20/0x20
     ? kthread_cancel_delayed_work_sync+0x10/0x10
     ret_from_fork+0x3a/0x50
    
    ------------------------------------------------------------------------
    
    To reproduce, run the following rcutorture test:
    
            tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04"
    
    It turns out that tick_broadcast_offline() was an innocent bystander.
    After all, interrupts are supposed to be disabled throughout
    take_cpu_down(), and therefore should have been disabled upon entry to
    tick_offline_cpu() and thus to tick_broadcast_offline().  This suggests
    that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
    and leaving them enabled on return.
    
    Some debugging code showed that the culprit was sched_cpu_dying().
    It had irqs enabled after return from sched_tick_stop().  Which in turn
    had irqs enabled after return from cancel_delayed_work_sync().  Which is a
    wrapper around __cancel_work_timer().  Which can sleep in the case where
    something else is concurrently trying to cancel the same delayed work,
    and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
    idea when you are invoked from take_cpu_down(), regardless of the state
    you leave interrupts in upon return.
    
    Code inspection located no reason why the delayed work absolutely
    needed to be canceled from sched_tick_stop():  The work is not
    bound to the outgoing CPU by design, given that the whole point is
    to collect statistics without disturbing the outgoing CPU.
    
    This commit therefore simply drops the cancel_delayed_work_sync() from
    sched_tick_stop().  Instead, a new ->state field is added to the tick_work
    structure so that the delayed-work handler function sched_tick_remote()
    can avoid reposting itself.  A cpu_is_offline() check is also added to
    sched_tick_remote() to avoid mucking with the state of an offlined CPU
    (though it does appear safe to do so).  The sched_tick_start() and
    sched_tick_stop() functions also update ->state, and sched_tick_start()
    also schedules the delayed work if ->state indicates that it is not
    already in flight.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
    Cc: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 102dfcf0a29a..9a10ee9afcbf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3050,14 +3050,44 @@ void scheduler_tick(void)
 
 struct tick_work {
 	int			cpu;
+	int			state;
 	struct delayed_work	work;
 };
+// Values for ->state, see diagram below.
+#define TICK_SCHED_REMOTE_IDLE		0
+#define TICK_SCHED_REMOTE_RUNNING	1
+#define TICK_SCHED_REMOTE_OFFLINING	2
+
+// State diagram for ->state:
+//
+//
+//      +----->IDLE-----------------------------+
+//      |                                       |
+//      |                                       |
+//      |                                       | sched_tick_start()
+//      | sched_tick_remote()                   |
+//      |                                       |
+//      |                                       V
+//      |                        +---------->RUNNING
+//      |                        |              |
+//      |                        |              |
+//      |                        |              |
+//      |     sched_tick_start() |              | sched_tick_stop()
+//      |                        |              |
+//      |                        |              |
+//      |                        |              |
+//      +--------------------OFFLINING<---------+
+//
+//
+// Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
+// and sched_tick_start() are happy to leave the state in RUNNING.
 
 static struct tick_work __percpu *tick_work_cpu;
 
 static void sched_tick_remote(struct work_struct *work)
 {
 	struct delayed_work *dwork = to_delayed_work(work);
+	int os;
 	struct tick_work *twork = container_of(dwork, struct tick_work, work);
 	int cpu = twork->cpu;
 	struct rq *rq = cpu_rq(cpu);
@@ -3077,7 +3107,7 @@ static void sched_tick_remote(struct work_struct *work)
 
 	rq_lock_irq(rq, &rf);
 	curr = rq->curr;
-	if (is_idle_task(curr))
+	if (is_idle_task(curr) || cpu_is_offline(cpu))
 		goto out_unlock;
 
 	update_rq_clock(rq);
@@ -3097,13 +3127,22 @@ static void sched_tick_remote(struct work_struct *work)
 	/*
 	 * Run the remote tick once per second (1Hz). This arbitrary
 	 * frequency is large enough to avoid overload but short enough
-	 * to keep scheduler internal stats reasonably up to date.
+	 * to keep scheduler internal stats reasonably up to date.  But
+	 * first update state to reflect hotplug activity if required.
 	 */
-	queue_delayed_work(system_unbound_wq, dwork, HZ);
+	do {
+		os = READ_ONCE(twork->state);
+		WARN_ON_ONCE(os == TICK_SCHED_REMOTE_IDLE);
+		if (os == TICK_SCHED_REMOTE_RUNNING)
+			break;
+	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_IDLE) != os);
+	if (os == TICK_SCHED_REMOTE_RUNNING)
+		queue_delayed_work(system_unbound_wq, dwork, HZ);
 }
 
 static void sched_tick_start(int cpu)
 {
+	int os;
 	struct tick_work *twork;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
@@ -3112,14 +3151,23 @@ static void sched_tick_start(int cpu)
 	WARN_ON_ONCE(!tick_work_cpu);
 
 	twork = per_cpu_ptr(tick_work_cpu, cpu);
-	twork->cpu = cpu;
-	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
-	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+	do {
+		os = READ_ONCE(twork->state);
+		if (os == TICK_SCHED_REMOTE_RUNNING)
+			break;
+		// Either idle or offline for a short period
+	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_RUNNING) != os);
+	if (os == TICK_SCHED_REMOTE_IDLE) {
+		twork->cpu = cpu;
+		INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
+		queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+	}
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
 static void sched_tick_stop(int cpu)
 {
+	int os;
 	struct tick_work *twork;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
@@ -3128,7 +3176,13 @@ static void sched_tick_stop(int cpu)
 	WARN_ON_ONCE(!tick_work_cpu);
 
 	twork = per_cpu_ptr(tick_work_cpu, cpu);
-	cancel_delayed_work_sync(&twork->work);
+	// There cannot be competing actions, but don't rely on stop_machine.
+	do {
+		os = READ_ONCE(twork->state);
+		WARN_ON_ONCE(os != TICK_SCHED_REMOTE_RUNNING);
+		// Either idle or offline for a short period
+	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_OFFLINING) != os);
+	// Don't cancel, as this would mess up the state machine.
 }
 #endif /* CONFIG_HOTPLUG_CPU */
 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-05-30 12:58     ` Paul E. McKenney
@ 2019-05-31  1:36       ` Frederic Weisbecker
  2019-05-31 13:43         ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Frederic Weisbecker @ 2019-05-31  1:36 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Thomas Gleixner, fweisbec, mingo, linux-kernel

On Thu, May 30, 2019 at 05:58:09AM -0700, Paul E. McKenney wrote:
>     It turns out that tick_broadcast_offline() was an innocent bystander.
>     After all, interrupts are supposed to be disabled throughout
>     take_cpu_down(), and therefore should have been disabled upon entry to
>     tick_offline_cpu() and thus to tick_broadcast_offline().  This suggests
>     that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
>     and leaving them enabled on return.
>     
>     Some debugging code showed that the culprit was sched_cpu_dying().
>     It had irqs enabled after return from sched_tick_stop().  Which in turn
>     had irqs enabled after return from cancel_delayed_work_sync().  Which is a
>     wrapper around __cancel_work_timer().  Which can sleep in the case where
>     something else is concurrently trying to cancel the same delayed work,
>     and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
>     idea when you are invoked from take_cpu_down(), regardless of the state
>     you leave interrupts in upon return.

Nice catch! Sorry for leaving that puzzle behind.

>     
>     Code inspection located no reason why the delayed work absolutely
>     needed to be canceled from sched_tick_stop():  The work is not
>     bound to the outgoing CPU by design, given that the whole point is
>     to collect statistics without disturbing the outgoing CPU.
>     
>     This commit therefore simply drops the cancel_delayed_work_sync() from
>     sched_tick_stop().  Instead, a new ->state field is added to the tick_work
>     structure so that the delayed-work handler function sched_tick_remote()
>     can avoid reposting itself.  A cpu_is_offline() check is also added to
>     sched_tick_remote() to avoid mucking with the state of an offlined CPU
>     (though it does appear safe to do so).

I can't guarantee that it is safe myself to call the tick of an offline
CPU. Better have that check indeed.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 102dfcf0a29a..9a10ee9afcbf 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3050,14 +3050,44 @@ void scheduler_tick(void)
>  
>  struct tick_work {
>  	int			cpu;
> +	int			state;
>  	struct delayed_work	work;
>  };
> +// Values for ->state, see diagram below.
> +#define TICK_SCHED_REMOTE_IDLE		0

So it took me some time to understand that the IDLE state is the
tick work that ackowledged OFFLINING and finally completes the
offline process. Therefore perhaps we can rename it to
TICK_SCHED_REMOTE_OFFLINE so that we instantly get the state
machine scenario.

> +#define TICK_SCHED_REMOTE_RUNNING	1
> +#define TICK_SCHED_REMOTE_OFFLINING	2
> +
> +// State diagram for ->state:
> +//
> +//
> +//      +----->IDLE-----------------------------+
> +//      |                                       |
> +//      |                                       |
> +//      |                                       | sched_tick_start()
> +//      | sched_tick_remote()                   |
> +//      |                                       |
> +//      |                                       V
> +//      |                        +---------->RUNNING
> +//      |                        |              |
> +//      |                        |              |
> +//      |                        |              |
> +//      |     sched_tick_start() |              | sched_tick_stop()
> +//      |                        |              |
> +//      |                        |              |
> +//      |                        |              |
> +//      +--------------------OFFLINING<---------+
> +//
> +//
> +// Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
> +// and sched_tick_start() are happy to leave the state in RUNNING.
>  
>  static struct tick_work __percpu *tick_work_cpu;
>  
>  static void sched_tick_remote(struct work_struct *work)
>  {
>  	struct delayed_work *dwork = to_delayed_work(work);
> +	int os;
>  	struct tick_work *twork = container_of(dwork, struct tick_work, work);
>  	int cpu = twork->cpu;
>  	struct rq *rq = cpu_rq(cpu);
> @@ -3077,7 +3107,7 @@ static void sched_tick_remote(struct work_struct *work)
>  
>  	rq_lock_irq(rq, &rf);
>  	curr = rq->curr;
> -	if (is_idle_task(curr))
> +	if (is_idle_task(curr) || cpu_is_offline(cpu))

Or we could simply check rq->online, while we have rq locked.

>  		goto out_unlock;
>  
>  	update_rq_clock(rq);
> @@ -3097,13 +3127,22 @@ static void sched_tick_remote(struct work_struct *work)
>  	/*
>  	 * Run the remote tick once per second (1Hz). This arbitrary
>  	 * frequency is large enough to avoid overload but short enough
> -	 * to keep scheduler internal stats reasonably up to date.
> +	 * to keep scheduler internal stats reasonably up to date.  But
> +	 * first update state to reflect hotplug activity if required.
>  	 */
> -	queue_delayed_work(system_unbound_wq, dwork, HZ);
> +	do {
> +		os = READ_ONCE(twork->state);
> +		WARN_ON_ONCE(os == TICK_SCHED_REMOTE_IDLE);
> +		if (os == TICK_SCHED_REMOTE_RUNNING)
> +			break;
> +	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_IDLE) != os);
> +	if (os == TICK_SCHED_REMOTE_RUNNING)
> +		queue_delayed_work(system_unbound_wq, dwork, HZ);
>  }
>  
>  static void sched_tick_start(int cpu)
>  {
> +	int os;
>  	struct tick_work *twork;
>  
>  	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> @@ -3112,14 +3151,23 @@ static void sched_tick_start(int cpu)
>  	WARN_ON_ONCE(!tick_work_cpu);
>  
>  	twork = per_cpu_ptr(tick_work_cpu, cpu);
> -	twork->cpu = cpu;
> -	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> -	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +	do {
> +		os = READ_ONCE(twork->state);
> +		if (os == TICK_SCHED_REMOTE_RUNNING)

Is it possible to have RUNNING at this stage? sched_tick_start()
shouldn't be called twice without a sched_tick_stop() in the middle.

In which case we should even warn if we meet that value here.

> +			break;
> +		// Either idle or offline for a short period
> +	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_RUNNING) != os);
> +	if (os == TICK_SCHED_REMOTE_IDLE) {
> +		twork->cpu = cpu;
> +		INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> +		queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +	}
>  }
>  
>  #ifdef CONFIG_HOTPLUG_CPU
>  static void sched_tick_stop(int cpu)
>  {
> +	int os;
>  	struct tick_work *twork;
>  
>  	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> @@ -3128,7 +3176,13 @@ static void sched_tick_stop(int cpu)
>  	WARN_ON_ONCE(!tick_work_cpu);
>  
>  	twork = per_cpu_ptr(tick_work_cpu, cpu);
> -	cancel_delayed_work_sync(&twork->work);
> +	// There cannot be competing actions, but don't rely on stop_machine.
> +	do {
> +		os = READ_ONCE(twork->state);
> +		WARN_ON_ONCE(os != TICK_SCHED_REMOTE_RUNNING);
> +		// Either idle or offline for a short period
> +	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_OFFLINING) != os);
> +	// Don't cancel, as this would mess up the state machine.
>  }
>  #endif /* CONFIG_HOTPLUG_CPU */

I ran the state machine for a few hours inside my head through FWEISBEC_TORTURE
and I see no error message. Which is of course not a guarantee of anything but:

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Also, I believe that /* */ is the preferred comment style, FWIW ;-)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-05-31  1:36       ` Frederic Weisbecker
@ 2019-05-31 13:43         ` Paul E. McKenney
  0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2019-05-31 13:43 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Thomas Gleixner, fweisbec, mingo, linux-kernel

On Fri, May 31, 2019 at 03:36:40AM +0200, Frederic Weisbecker wrote:
> On Thu, May 30, 2019 at 05:58:09AM -0700, Paul E. McKenney wrote:
> >     It turns out that tick_broadcast_offline() was an innocent bystander.
> >     After all, interrupts are supposed to be disabled throughout
> >     take_cpu_down(), and therefore should have been disabled upon entry to
> >     tick_offline_cpu() and thus to tick_broadcast_offline().  This suggests
> >     that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
> >     and leaving them enabled on return.
> >     
> >     Some debugging code showed that the culprit was sched_cpu_dying().
> >     It had irqs enabled after return from sched_tick_stop().  Which in turn
> >     had irqs enabled after return from cancel_delayed_work_sync().  Which is a
> >     wrapper around __cancel_work_timer().  Which can sleep in the case where
> >     something else is concurrently trying to cancel the same delayed work,
> >     and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
> >     idea when you are invoked from take_cpu_down(), regardless of the state
> >     you leave interrupts in upon return.
> 
> Nice catch! Sorry for leaving that puzzle behind.

Heh!  I needed the break from trying to make RCU do unnatural acts.  ;-)

> >     Code inspection located no reason why the delayed work absolutely
> >     needed to be canceled from sched_tick_stop():  The work is not
> >     bound to the outgoing CPU by design, given that the whole point is
> >     to collect statistics without disturbing the outgoing CPU.
> >     
> >     This commit therefore simply drops the cancel_delayed_work_sync() from
> >     sched_tick_stop().  Instead, a new ->state field is added to the tick_work
> >     structure so that the delayed-work handler function sched_tick_remote()
> >     can avoid reposting itself.  A cpu_is_offline() check is also added to
> >     sched_tick_remote() to avoid mucking with the state of an offlined CPU
> >     (though it does appear safe to do so).
> 
> I can't guarantee that it is safe myself to call the tick of an offline
> CPU. Better have that check indeed.

No argument, especially given that this preserved the current behavior
exactly.

But it looked to me that all of the code was running on some other online
CPU and was accessing data that would be left around in good state on
the offline CPU.  And the current behavior won't account for activity
just prior to the CPU going offline until it comes back online (right?),
so there might be some incentive to let the last call to sched_tick_remote()
do the accounting.  Might not be a big deal, though.

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 102dfcf0a29a..9a10ee9afcbf 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3050,14 +3050,44 @@ void scheduler_tick(void)
> >  
> >  struct tick_work {
> >  	int			cpu;
> > +	int			state;
> >  	struct delayed_work	work;
> >  };
> > +// Values for ->state, see diagram below.
> > +#define TICK_SCHED_REMOTE_IDLE		0
> 
> So it took me some time to understand that the IDLE state is the
> tick work that ackowledged OFFLINING and finally completes the
> offline process. Therefore perhaps we can rename it to
> TICK_SCHED_REMOTE_OFFLINE so that we instantly get the state
> machine scenario.

Good point, your name is much better.  Updated.

> > +#define TICK_SCHED_REMOTE_RUNNING	1
> > +#define TICK_SCHED_REMOTE_OFFLINING	2
> > +
> > +// State diagram for ->state:
> > +//
> > +//
> > +//      +----->IDLE-----------------------------+
> > +//      |                                       |
> > +//      |                                       |
> > +//      |                                       | sched_tick_start()
> > +//      | sched_tick_remote()                   |
> > +//      |                                       |
> > +//      |                                       V
> > +//      |                        +---------->RUNNING
> > +//      |                        |              |
> > +//      |                        |              |
> > +//      |                        |              |
> > +//      |     sched_tick_start() |              | sched_tick_stop()
> > +//      |                        |              |
> > +//      |                        |              |
> > +//      |                        |              |
> > +//      +--------------------OFFLINING<---------+
> > +//
> > +//
> > +// Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
> > +// and sched_tick_start() are happy to leave the state in RUNNING.
> >  
> >  static struct tick_work __percpu *tick_work_cpu;
> >  
> >  static void sched_tick_remote(struct work_struct *work)
> >  {
> >  	struct delayed_work *dwork = to_delayed_work(work);
> > +	int os;
> >  	struct tick_work *twork = container_of(dwork, struct tick_work, work);
> >  	int cpu = twork->cpu;
> >  	struct rq *rq = cpu_rq(cpu);
> > @@ -3077,7 +3107,7 @@ static void sched_tick_remote(struct work_struct *work)
> >  
> >  	rq_lock_irq(rq, &rf);
> >  	curr = rq->curr;
> > -	if (is_idle_task(curr))
> > +	if (is_idle_task(curr) || cpu_is_offline(cpu))
> 
> Or we could simply check rq->online, while we have rq locked.

Which would have better cache locality as well.

Let's see...  This is cleared from the same sched_cpu_dying() that invokes
sched_tick_stop(), so that works.  Where is it set?  In set_rq_online(),
of course, which is invoked from sched_cpu_activate() which is assigned
to .startup.single, which sadly not the CPU-online counterpart to
sched_cpu_dying().  This would result in the workqueue being started
quite some time before its handler would allow itself to sample the state,
and would also thus be a deviation from earlier behavior.

In the immortal words of MS-DOS, "Are you sure?"

> >  		goto out_unlock;
> >  
> >  	update_rq_clock(rq);
> > @@ -3097,13 +3127,22 @@ static void sched_tick_remote(struct work_struct *work)
> >  	/*
> >  	 * Run the remote tick once per second (1Hz). This arbitrary
> >  	 * frequency is large enough to avoid overload but short enough
> > -	 * to keep scheduler internal stats reasonably up to date.
> > +	 * to keep scheduler internal stats reasonably up to date.  But
> > +	 * first update state to reflect hotplug activity if required.
> >  	 */
> > -	queue_delayed_work(system_unbound_wq, dwork, HZ);
> > +	do {
> > +		os = READ_ONCE(twork->state);
> > +		WARN_ON_ONCE(os == TICK_SCHED_REMOTE_IDLE);
> > +		if (os == TICK_SCHED_REMOTE_RUNNING)
> > +			break;
> > +	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_IDLE) != os);
> > +	if (os == TICK_SCHED_REMOTE_RUNNING)
> > +		queue_delayed_work(system_unbound_wq, dwork, HZ);
> >  }
> >  
> >  static void sched_tick_start(int cpu)
> >  {
> > +	int os;
> >  	struct tick_work *twork;
> >  
> >  	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > @@ -3112,14 +3151,23 @@ static void sched_tick_start(int cpu)
> >  	WARN_ON_ONCE(!tick_work_cpu);
> >  
> >  	twork = per_cpu_ptr(tick_work_cpu, cpu);
> > -	twork->cpu = cpu;
> > -	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > -	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > +	do {
> > +		os = READ_ONCE(twork->state);
> > +		if (os == TICK_SCHED_REMOTE_RUNNING)
> 
> Is it possible to have RUNNING at this stage? sched_tick_start()
> shouldn't be called twice without a sched_tick_stop() in the middle.
> 
> In which case we should even warn if we meet that value here.

Good point, it now looks like this:

	if (os == WARN_ON_ONCE(TICK_SCHED_REMOTE_RUNNING))

Which, after an embarrassingly long time, turned into this:

	if (WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING))

> > +			break;
> > +		// Either idle or offline for a short period
> > +	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_RUNNING) != os);
> > +	if (os == TICK_SCHED_REMOTE_IDLE) {
> > +		twork->cpu = cpu;
> > +		INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > +		queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > +	}
> >  }
> >  
> >  #ifdef CONFIG_HOTPLUG_CPU
> >  static void sched_tick_stop(int cpu)
> >  {
> > +	int os;
> >  	struct tick_work *twork;
> >  
> >  	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > @@ -3128,7 +3176,13 @@ static void sched_tick_stop(int cpu)
> >  	WARN_ON_ONCE(!tick_work_cpu);
> >  
> >  	twork = per_cpu_ptr(tick_work_cpu, cpu);
> > -	cancel_delayed_work_sync(&twork->work);
> > +	// There cannot be competing actions, but don't rely on stop_machine.
> > +	do {
> > +		os = READ_ONCE(twork->state);
> > +		WARN_ON_ONCE(os != TICK_SCHED_REMOTE_RUNNING);
> > +		// Either idle or offline for a short period
> > +	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_OFFLINING) != os);
> > +	// Don't cancel, as this would mess up the state machine.
> >  }
> >  #endif /* CONFIG_HOTPLUG_CPU */
> 
> I ran the state machine for a few hours inside my head through FWEISBEC_TORTURE
> and I see no error message. Which is of course not a guarantee of anything but:

I know that feeling!

> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Thank you, applied!

> Also, I believe that /* */ is the preferred comment style, FWIW ;-)

True enough, but "//" is permitted and allows me three more characters of
comment before wanting to do a line break.  ;-)

But it is your code, so if you want "/* */", just let me know and I will
switch to that style.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-28  7:37                                           ` Peter Zijlstra
@ 2019-06-28 12:17                                             ` Paul E. McKenney
  0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-28 12:17 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Frederic Weisbecker, linux-kernel, mingo, tglx

On Fri, Jun 28, 2019 at 09:37:12AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 25, 2019 at 09:52:38AM -0700, Paul E. McKenney wrote:
> > Very well, the commit is as shown below.  This is on current -rcu,
> > but feel free to take it if you would like, Peter.  Just let me know
> > and I will mark it so that I don't push it myself.  (I need to keep
> > it in -rcu until I rebase onto a version of mainline that contains
> > it so as to avoid spurious rcutorture failures.)
> 
> Looks good. I'll pick it up and then we need to take care when all lands
> in tip.

Thank you!

My instance of the patch is in a part of the -rcu tree that won't hit
-tip until after v5.3-rc5, so there is a good chance that the -tip
sequencing will Just Work.  Stranger things have happened.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-25 16:52                                         ` Paul E. McKenney
@ 2019-06-28  7:37                                           ` Peter Zijlstra
  2019-06-28 12:17                                             ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2019-06-28  7:37 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel, mingo, tglx

On Tue, Jun 25, 2019 at 09:52:38AM -0700, Paul E. McKenney wrote:
> Very well, the commit is as shown below.  This is on current -rcu,
> but feel free to take it if you would like, Peter.  Just let me know
> and I will mark it so that I don't push it myself.  (I need to keep
> it in -rcu until I rebase onto a version of mainline that contains
> it so as to avoid spurious rcutorture failures.)

Looks good. I'll pick it up and then we need to take care when all lands
in tip.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-25 16:20                                       ` Frederic Weisbecker
@ 2019-06-25 16:52                                         ` Paul E. McKenney
  2019-06-28  7:37                                           ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-25 16:52 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Peter Zijlstra, linux-kernel, mingo, tglx

On Tue, Jun 25, 2019 at 06:20:59PM +0200, Frederic Weisbecker wrote:
> On Tue, Jun 25, 2019 at 07:16:24AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 25, 2019 at 04:05:38PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 25, 2019 at 06:54:30AM -0700, Paul E. McKenney wrote:
> > > > And it allows dispensing with the initialization.  How about like
> > > > the following?
> > > 
> > > Looks good to me!
> > 
> > Limited rcutorture testing looking good thus far.  Here is hoping!
> > 
> > Frederic, you OK with this approach?
> 
> Yep, all good!

Very well, the commit is as shown below.  This is on current -rcu,
but feel free to take it if you would like, Peter.  Just let me know
and I will mark it so that I don't push it myself.  (I need to keep
it in -rcu until I rebase onto a version of mainline that contains
it so as to avoid spurious rcutorture failures.)

							Thanx, Paul

------------------------------------------------------------------------

commit b02b73cc95b6e3d912f36de20116b520cc3072c7
Author: Paul E. McKenney <paulmck@linux.ibm.com>
Date:   Thu May 30 05:39:25 2019 -0700

    time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
    
    The TASKS03 and TREE04 rcutorture scenarios produce the following
    lockdep complaint:
    
    ------------------------------------------------------------------------
    
    ================================
    WARNING: inconsistent lock state
    5.2.0-rc1+ #513 Not tainted
    --------------------------------
    inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
    migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
    {IN-HARDIRQ-W} state was registered at:
      lock_acquire+0xb0/0x1c0
      _raw_spin_lock_irqsave+0x3c/0x50
      tick_broadcast_switch_to_oneshot+0xd/0x40
      tick_switch_to_oneshot+0x4f/0xd0
      hrtimer_run_queues+0xf3/0x130
      run_local_timers+0x1c/0x50
      update_process_times+0x1c/0x50
      tick_periodic+0x26/0xc0
      tick_handle_periodic+0x1a/0x60
      smp_apic_timer_interrupt+0x80/0x2a0
      apic_timer_interrupt+0xf/0x20
      _raw_spin_unlock_irqrestore+0x4e/0x60
      rcu_nocb_gp_kthread+0x15d/0x590
      kthread+0xf3/0x130
      ret_from_fork+0x3a/0x50
    irq event stamp: 171
    hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
    hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
    softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
    softirqs last disabled at (0): [<0000000000000000>] 0x0
    
    other info that might help us debug this:
     Possible unsafe locking scenario:
    
           CPU0
           ----
      lock(tick_broadcast_lock);
      <Interrupt>
        lock(tick_broadcast_lock);
    
     *** DEADLOCK ***
    
    1 lock held by migration/1/14:
     #0: (____ptrval____) (clockevents_lock){+.+.}, at: tick_offline_cpu+0xf/0x30
    
    stack backtrace:
    CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.2.0-rc1+ #513
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
    Call Trace:
     dump_stack+0x5e/0x8b
     print_usage_bug+0x1fc/0x216
     ? print_shortest_lock_dependencies+0x1b0/0x1b0
     mark_lock+0x1f2/0x280
     __lock_acquire+0x1e0/0x18f0
     ? __lock_acquire+0x21b/0x18f0
     ? _raw_spin_unlock_irqrestore+0x4e/0x60
     lock_acquire+0xb0/0x1c0
     ? tick_broadcast_offline+0xf/0x70
     _raw_spin_lock+0x33/0x40
     ? tick_broadcast_offline+0xf/0x70
     tick_broadcast_offline+0xf/0x70
     tick_offline_cpu+0x16/0x30
     take_cpu_down+0x7d/0xa0
     multi_cpu_stop+0xa2/0xe0
     ? cpu_stop_queue_work+0xc0/0xc0
     cpu_stopper_thread+0x6d/0x100
     smpboot_thread_fn+0x169/0x240
     kthread+0xf3/0x130
     ? sort_range+0x20/0x20
     ? kthread_cancel_delayed_work_sync+0x10/0x10
     ret_from_fork+0x3a/0x50
    
    ------------------------------------------------------------------------
    
    To reproduce, run the following rcutorture test:
    
            tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04"
    
    It turns out that tick_broadcast_offline() was an innocent bystander.
    After all, interrupts are supposed to be disabled throughout
    take_cpu_down(), and therefore should have been disabled upon entry to
    tick_offline_cpu() and thus to tick_broadcast_offline().  This suggests
    that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
    and leaving them enabled on return.
    
    Some debugging code showed that the culprit was sched_cpu_dying().
    It had irqs enabled after return from sched_tick_stop().  Which in turn
    had irqs enabled after return from cancel_delayed_work_sync().  Which is a
    wrapper around __cancel_work_timer().  Which can sleep in the case where
    something else is concurrently trying to cancel the same delayed work,
    and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
    idea when you are invoked from take_cpu_down(), regardless of the state
    you leave interrupts in upon return.
    
    Code inspection located no reason why the delayed work absolutely
    needed to be canceled from sched_tick_stop():  The work is not
    bound to the outgoing CPU by design, given that the whole point is
    to collect statistics without disturbing the outgoing CPU.
    
    This commit therefore simply drops the cancel_delayed_work_sync() from
    sched_tick_stop().  Instead, a new ->state field is added to the tick_work
    structure so that the delayed-work handler function sched_tick_remote()
    can avoid reposting itself.  A cpu_is_offline() check is also added to
    sched_tick_remote() to avoid mucking with the state of an offlined CPU
    (though it does appear safe to do so).  The sched_tick_start() and
    sched_tick_stop() functions also update ->state, and sched_tick_start()
    also schedules the delayed work if ->state indicates that it is not
    already in flight.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    [ paulmck: Apply Peter Zijlstra and Frederic Weisbecker atomics feedback. ]
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 102dfcf0a29a..d7be6d4b6a0a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3050,8 +3050,36 @@ void scheduler_tick(void)
 
 struct tick_work {
 	int			cpu;
+	atomic_t		state;
 	struct delayed_work	work;
 };
+/* Values for ->state, see diagram below. */
+#define TICK_SCHED_REMOTE_OFFLINE	0
+#define TICK_SCHED_REMOTE_OFFLINING	1
+#define TICK_SCHED_REMOTE_RUNNING	2
+
+/*
+ * State diagram for ->state:
+ *
+ *
+ *          TICK_SCHED_REMOTE_OFFLINE
+ *                    |   ^
+ *                    |   |
+ *                    |   | sched_tick_remote()
+ *                    |   |
+ *                    |   |
+ *                    +--TICK_SCHED_REMOTE_OFFLINING
+ *                    |   ^
+ *                    |   |
+ * sched_tick_start() |   | sched_tick_stop()
+ *                    |   |
+ *                    V   |
+ *          TICK_SCHED_REMOTE_RUNNING
+ *
+ *
+ * Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
+ * and sched_tick_start() are happy to leave the state in RUNNING.
+ */
 
 static struct tick_work __percpu *tick_work_cpu;
 
@@ -3064,6 +3092,7 @@ static void sched_tick_remote(struct work_struct *work)
 	struct task_struct *curr;
 	struct rq_flags rf;
 	u64 delta;
+	int os;
 
 	/*
 	 * Handle the tick only if it appears the remote CPU is running in full
@@ -3077,7 +3106,7 @@ static void sched_tick_remote(struct work_struct *work)
 
 	rq_lock_irq(rq, &rf);
 	curr = rq->curr;
-	if (is_idle_task(curr))
+	if (is_idle_task(curr) || cpu_is_offline(cpu))
 		goto out_unlock;
 
 	update_rq_clock(rq);
@@ -3097,13 +3126,18 @@ static void sched_tick_remote(struct work_struct *work)
 	/*
 	 * Run the remote tick once per second (1Hz). This arbitrary
 	 * frequency is large enough to avoid overload but short enough
-	 * to keep scheduler internal stats reasonably up to date.
+	 * to keep scheduler internal stats reasonably up to date.  But
+	 * first update state to reflect hotplug activity if required.
 	 */
-	queue_delayed_work(system_unbound_wq, dwork, HZ);
+	os = atomic_fetch_add_unless(&twork->state, -1, TICK_SCHED_REMOTE_RUNNING);
+	WARN_ON_ONCE(os == TICK_SCHED_REMOTE_OFFLINE);
+	if (os == TICK_SCHED_REMOTE_RUNNING)
+		queue_delayed_work(system_unbound_wq, dwork, HZ);
 }
 
 static void sched_tick_start(int cpu)
 {
+	int os;
 	struct tick_work *twork;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
@@ -3112,15 +3146,20 @@ static void sched_tick_start(int cpu)
 	WARN_ON_ONCE(!tick_work_cpu);
 
 	twork = per_cpu_ptr(tick_work_cpu, cpu);
-	twork->cpu = cpu;
-	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
-	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+	os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);
+	WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);
+	if (os == TICK_SCHED_REMOTE_OFFLINE) {
+		twork->cpu = cpu;
+		INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
+		queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+	}
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
 static void sched_tick_stop(int cpu)
 {
 	struct tick_work *twork;
+	int os;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
 		return;
@@ -3128,7 +3167,10 @@ static void sched_tick_stop(int cpu)
 	WARN_ON_ONCE(!tick_work_cpu);
 
 	twork = per_cpu_ptr(tick_work_cpu, cpu);
-	cancel_delayed_work_sync(&twork->work);
+	/* There cannot be competing actions, but don't rely on stop-machine. */
+	os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_OFFLINING);
+	WARN_ON_ONCE(os != TICK_SCHED_REMOTE_RUNNING);
+	/* Don't cancel, as this would mess up the state machine. */
 }
 #endif /* CONFIG_HOTPLUG_CPU */
 
@@ -3136,7 +3178,6 @@ int __init sched_tick_offload_init(void)
 {
 	tick_work_cpu = alloc_percpu(struct tick_work);
 	BUG_ON(!tick_work_cpu);
-
 	return 0;
 }
 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-25 14:16                                     ` Paul E. McKenney
@ 2019-06-25 16:20                                       ` Frederic Weisbecker
  2019-06-25 16:52                                         ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Frederic Weisbecker @ 2019-06-25 16:20 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Peter Zijlstra, linux-kernel, mingo, tglx

On Tue, Jun 25, 2019 at 07:16:24AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 25, 2019 at 04:05:38PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 25, 2019 at 06:54:30AM -0700, Paul E. McKenney wrote:
> > > And it allows dispensing with the initialization.  How about like
> > > the following?
> > 
> > Looks good to me!
> 
> Limited rcutorture testing looking good thus far.  Here is hoping!
> 
> Frederic, you OK with this approach?

Yep, all good!

Thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-25 14:05                                   ` Peter Zijlstra
@ 2019-06-25 14:16                                     ` Paul E. McKenney
  2019-06-25 16:20                                       ` Frederic Weisbecker
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-25 14:16 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Frederic Weisbecker, linux-kernel, mingo, tglx

On Tue, Jun 25, 2019 at 04:05:38PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 25, 2019 at 06:54:30AM -0700, Paul E. McKenney wrote:
> > And it allows dispensing with the initialization.  How about like
> > the following?
> 
> Looks good to me!

Limited rcutorture testing looking good thus far.  Here is hoping!

Frederic, you OK with this approach?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-25 13:54                                 ` Paul E. McKenney
@ 2019-06-25 14:05                                   ` Peter Zijlstra
  2019-06-25 14:16                                     ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2019-06-25 14:05 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel, mingo, tglx

On Tue, Jun 25, 2019 at 06:54:30AM -0700, Paul E. McKenney wrote:
> And it allows dispensing with the initialization.  How about like
> the following?

Looks good to me!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-25 12:25                               ` Frederic Weisbecker
@ 2019-06-25 13:54                                 ` Paul E. McKenney
  2019-06-25 14:05                                   ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-25 13:54 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Peter Zijlstra, linux-kernel, mingo, tglx

On Tue, Jun 25, 2019 at 02:25:16PM +0200, Frederic Weisbecker wrote:
> On Tue, Jun 25, 2019 at 09:51:39AM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 25, 2019 at 02:43:00AM +0200, Frederic Weisbecker wrote:
> > > Yeah, unfortunately there is no atomic_add_not_zero_return().
> > 
> > There is atomic_fetch_add_unless().
> 
> Ah, that could work!

And it allows dispensing with the initialization.  How about like
the following?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 102dfcf0a29a..d7be6d4b6a0a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3050,8 +3050,36 @@ void scheduler_tick(void)
 
 struct tick_work {
 	int			cpu;
+	atomic_t		state;
 	struct delayed_work	work;
 };
+/* Values for ->state, see diagram below. */
+#define TICK_SCHED_REMOTE_OFFLINE	0
+#define TICK_SCHED_REMOTE_OFFLINING	1
+#define TICK_SCHED_REMOTE_RUNNING	2
+
+/*
+ * State diagram for ->state:
+ *
+ *
+ *          TICK_SCHED_REMOTE_OFFLINE
+ *                    |   ^
+ *                    |   |
+ *                    |   | sched_tick_remote()
+ *                    |   |
+ *                    |   |
+ *                    +--TICK_SCHED_REMOTE_OFFLINING
+ *                    |   ^
+ *                    |   |
+ * sched_tick_start() |   | sched_tick_stop()
+ *                    |   |
+ *                    V   |
+ *          TICK_SCHED_REMOTE_RUNNING
+ *
+ *
+ * Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
+ * and sched_tick_start() are happy to leave the state in RUNNING.
+ */
 
 static struct tick_work __percpu *tick_work_cpu;
 
@@ -3064,6 +3092,7 @@ static void sched_tick_remote(struct work_struct *work)
 	struct task_struct *curr;
 	struct rq_flags rf;
 	u64 delta;
+	int os;
 
 	/*
 	 * Handle the tick only if it appears the remote CPU is running in full
@@ -3077,7 +3106,7 @@ static void sched_tick_remote(struct work_struct *work)
 
 	rq_lock_irq(rq, &rf);
 	curr = rq->curr;
-	if (is_idle_task(curr))
+	if (is_idle_task(curr) || cpu_is_offline(cpu))
 		goto out_unlock;
 
 	update_rq_clock(rq);
@@ -3097,13 +3126,18 @@ static void sched_tick_remote(struct work_struct *work)
 	/*
 	 * Run the remote tick once per second (1Hz). This arbitrary
 	 * frequency is large enough to avoid overload but short enough
-	 * to keep scheduler internal stats reasonably up to date.
+	 * to keep scheduler internal stats reasonably up to date.  But
+	 * first update state to reflect hotplug activity if required.
 	 */
-	queue_delayed_work(system_unbound_wq, dwork, HZ);
+	os = atomic_fetch_add_unless(&twork->state, -1, TICK_SCHED_REMOTE_RUNNING);
+	WARN_ON_ONCE(os == TICK_SCHED_REMOTE_OFFLINE);
+	if (os == TICK_SCHED_REMOTE_RUNNING)
+		queue_delayed_work(system_unbound_wq, dwork, HZ);
 }
 
 static void sched_tick_start(int cpu)
 {
+	int os;
 	struct tick_work *twork;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
@@ -3112,15 +3146,20 @@ static void sched_tick_start(int cpu)
 	WARN_ON_ONCE(!tick_work_cpu);
 
 	twork = per_cpu_ptr(tick_work_cpu, cpu);
-	twork->cpu = cpu;
-	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
-	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+	os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);
+	WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);
+	if (os == TICK_SCHED_REMOTE_OFFLINE) {
+		twork->cpu = cpu;
+		INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
+		queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+	}
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
 static void sched_tick_stop(int cpu)
 {
 	struct tick_work *twork;
+	int os;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
 		return;
@@ -3128,7 +3167,10 @@ static void sched_tick_stop(int cpu)
 	WARN_ON_ONCE(!tick_work_cpu);
 
 	twork = per_cpu_ptr(tick_work_cpu, cpu);
-	cancel_delayed_work_sync(&twork->work);
+	/* There cannot be competing actions, but don't rely on stop-machine. */
+	os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_OFFLINING);
+	WARN_ON_ONCE(os != TICK_SCHED_REMOTE_RUNNING);
+	/* Don't cancel, as this would mess up the state machine. */
 }
 #endif /* CONFIG_HOTPLUG_CPU */
 
@@ -3136,7 +3178,6 @@ int __init sched_tick_offload_init(void)
 {
 	tick_work_cpu = alloc_percpu(struct tick_work);
 	BUG_ON(!tick_work_cpu);
-
 	return 0;
 }
 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-25  7:51                             ` Peter Zijlstra
@ 2019-06-25 12:25                               ` Frederic Weisbecker
  2019-06-25 13:54                                 ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Frederic Weisbecker @ 2019-06-25 12:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Paul E. McKenney, linux-kernel, mingo, tglx

On Tue, Jun 25, 2019 at 09:51:39AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 25, 2019 at 02:43:00AM +0200, Frederic Weisbecker wrote:
> > Yeah, unfortunately there is no atomic_add_not_zero_return().
> 
> There is atomic_fetch_add_unless().

Ah, that could work!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-25  0:43                           ` Frederic Weisbecker
  2019-06-25  2:05                             ` Paul E. McKenney
@ 2019-06-25  7:51                             ` Peter Zijlstra
  2019-06-25 12:25                               ` Frederic Weisbecker
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2019-06-25  7:51 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Paul E. McKenney, linux-kernel, mingo, tglx

On Tue, Jun 25, 2019 at 02:43:00AM +0200, Frederic Weisbecker wrote:
> Yeah, unfortunately there is no atomic_add_not_zero_return().

There is atomic_fetch_add_unless().

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-25  0:43                           ` Frederic Weisbecker
@ 2019-06-25  2:05                             ` Paul E. McKenney
  2019-06-25  7:51                             ` Peter Zijlstra
  1 sibling, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-25  2:05 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Peter Zijlstra, linux-kernel, mingo, tglx

On Tue, Jun 25, 2019 at 02:43:00AM +0200, Frederic Weisbecker wrote:
> On Mon, Jun 24, 2019 at 04:44:22PM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 25, 2019 at 01:12:23AM +0200, Frederic Weisbecker wrote:
> > > On Fri, Jun 21, 2019 at 04:46:02PM -0700, Paul E. McKenney wrote:
> > > > @@ -3097,13 +3126,21 @@ static void sched_tick_remote(struct work_struct *work)
> > > >  	/*
> > > >  	 * Run the remote tick once per second (1Hz). This arbitrary
> > > >  	 * frequency is large enough to avoid overload but short enough
> > > > -	 * to keep scheduler internal stats reasonably up to date.
> > > > +	 * to keep scheduler internal stats reasonably up to date.  But
> > > > +	 * first update state to reflect hotplug activity if required.
> > > >  	 */
> > > > +	os = atomic_read(&twork->state);
> > > > +	if (os) {
> > > > +		WARN_ON_ONCE(os != TICK_SCHED_REMOTE_OFFLINING);
> > > > +		if (atomic_inc_not_zero(&twork->state))
> > > > +			return;
> > > 
> > > Using inc makes me a bit nervous here. If we do so, we should somewhow
> > > make sure that we never exceed a value higher than TICK_SCHED_REMOTE_OFFLINE
> > > by accident.
> > > 
> > > atomic_xchg() is probably a bit costlier but also safer as it allows
> > > us to check both the old and the new value. That path shouldn't be critically fast
> > > after all.
> > 
> > It would need to be cmpxchg() to avoid messing with the state if
> > the state were somehow TICK_SCHED_REMOTE_RUNNING, right?
> 
> Ah indeed! Nevermind, let's keep things as they are then.
> 
> > > > +	}
> > > >  	queue_delayed_work(system_unbound_wq, dwork, HZ);
> > > >  }
> > > >  
> > > >  static void sched_tick_start(int cpu)
> > > >  {
> > > > +	int os;
> > > >  	struct tick_work *twork;
> > > >  
> > > >  	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > > > @@ -3112,15 +3149,20 @@ static void sched_tick_start(int cpu)
> > > >  	WARN_ON_ONCE(!tick_work_cpu);
> > > >  
> > > >  	twork = per_cpu_ptr(tick_work_cpu, cpu);
> > > > -	twork->cpu = cpu;
> > > > -	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > > > -	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > > > +	os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);
> > > > +	WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);
> > > 
> > > See if we use atomic_inc(), we would need to also WARN(os > TICK_SCHED_REMOTE_OFFLINE).
> > 
> > How about if I put that WARN() between the atomic_inc_not_zero() and
> > the return, presumably also adding braces?
> 
> Yeah, unfortunately there is no atomic_add_not_zero_return().
> I guess we can live with a check using atomic_read(). In the best
> case it returns the fresh increment, otherwise it should be REMOTE_RUNNING.
> 
> In any case the (os > TICK_SCHED_REMOTE_OFFLINE) check applies.

True, so with high probability a warning would be emitted.  Fair enough?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-24 23:44                         ` Paul E. McKenney
@ 2019-06-25  0:43                           ` Frederic Weisbecker
  2019-06-25  2:05                             ` Paul E. McKenney
  2019-06-25  7:51                             ` Peter Zijlstra
  0 siblings, 2 replies; 31+ messages in thread
From: Frederic Weisbecker @ 2019-06-25  0:43 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Peter Zijlstra, linux-kernel, mingo, tglx

On Mon, Jun 24, 2019 at 04:44:22PM -0700, Paul E. McKenney wrote:
> On Tue, Jun 25, 2019 at 01:12:23AM +0200, Frederic Weisbecker wrote:
> > On Fri, Jun 21, 2019 at 04:46:02PM -0700, Paul E. McKenney wrote:
> > > @@ -3097,13 +3126,21 @@ static void sched_tick_remote(struct work_struct *work)
> > >  	/*
> > >  	 * Run the remote tick once per second (1Hz). This arbitrary
> > >  	 * frequency is large enough to avoid overload but short enough
> > > -	 * to keep scheduler internal stats reasonably up to date.
> > > +	 * to keep scheduler internal stats reasonably up to date.  But
> > > +	 * first update state to reflect hotplug activity if required.
> > >  	 */
> > > +	os = atomic_read(&twork->state);
> > > +	if (os) {
> > > +		WARN_ON_ONCE(os != TICK_SCHED_REMOTE_OFFLINING);
> > > +		if (atomic_inc_not_zero(&twork->state))
> > > +			return;
> > 
> > Using inc makes me a bit nervous here. If we do so, we should somewhow
> > make sure that we never exceed a value higher than TICK_SCHED_REMOTE_OFFLINE
> > by accident.
> > 
> > atomic_xchg() is probably a bit costlier but also safer as it allows
> > us to check both the old and the new value. That path shouldn't be critically fast
> > after all.
> 
> It would need to be cmpxchg() to avoid messing with the state if
> the state were somehow TICK_SCHED_REMOTE_RUNNING, right?

Ah indeed! Nevermind, let's keep things as they are then.

> > > +	}
> > >  	queue_delayed_work(system_unbound_wq, dwork, HZ);
> > >  }
> > >  
> > >  static void sched_tick_start(int cpu)
> > >  {
> > > +	int os;
> > >  	struct tick_work *twork;
> > >  
> > >  	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > > @@ -3112,15 +3149,20 @@ static void sched_tick_start(int cpu)
> > >  	WARN_ON_ONCE(!tick_work_cpu);
> > >  
> > >  	twork = per_cpu_ptr(tick_work_cpu, cpu);
> > > -	twork->cpu = cpu;
> > > -	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > > -	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > > +	os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);
> > > +	WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);
> > 
> > See if we use atomic_inc(), we would need to also WARN(os > TICK_SCHED_REMOTE_OFFLINE).
> 
> How about if I put that WARN() between the atomic_inc_not_zero() and
> the return, presumably also adding braces?

Yeah, unfortunately there is no atomic_add_not_zero_return().
I guess we can live with a check using atomic_read(). In the best
case it returns the fresh increment, otherwise it should be REMOTE_RUNNING.

In any case the (os > TICK_SCHED_REMOTE_OFFLINE) check applies.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-24 23:12                       ` Frederic Weisbecker
@ 2019-06-24 23:44                         ` Paul E. McKenney
  2019-06-25  0:43                           ` Frederic Weisbecker
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-24 23:44 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Peter Zijlstra, linux-kernel, mingo, tglx

On Tue, Jun 25, 2019 at 01:12:23AM +0200, Frederic Weisbecker wrote:
> On Fri, Jun 21, 2019 at 04:46:02PM -0700, Paul E. McKenney wrote:
> > @@ -3097,13 +3126,21 @@ static void sched_tick_remote(struct work_struct *work)
> >  	/*
> >  	 * Run the remote tick once per second (1Hz). This arbitrary
> >  	 * frequency is large enough to avoid overload but short enough
> > -	 * to keep scheduler internal stats reasonably up to date.
> > +	 * to keep scheduler internal stats reasonably up to date.  But
> > +	 * first update state to reflect hotplug activity if required.
> >  	 */
> > +	os = atomic_read(&twork->state);
> > +	if (os) {
> > +		WARN_ON_ONCE(os != TICK_SCHED_REMOTE_OFFLINING);
> > +		if (atomic_inc_not_zero(&twork->state))
> > +			return;
> 
> Using inc makes me a bit nervous here. If we do so, we should somewhow
> make sure that we never exceed a value higher than TICK_SCHED_REMOTE_OFFLINE
> by accident.
> 
> atomic_xchg() is probably a bit costlier but also safer as it allows
> us to check both the old and the new value. That path shouldn't be critically fast
> after all.

It would need to be cmpxchg() to avoid messing with the state if
the state were somehow TICK_SCHED_REMOTE_RUNNING, right?

> > +	}
> >  	queue_delayed_work(system_unbound_wq, dwork, HZ);
> >  }
> >  
> >  static void sched_tick_start(int cpu)
> >  {
> > +	int os;
> >  	struct tick_work *twork;
> >  
> >  	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > @@ -3112,15 +3149,20 @@ static void sched_tick_start(int cpu)
> >  	WARN_ON_ONCE(!tick_work_cpu);
> >  
> >  	twork = per_cpu_ptr(tick_work_cpu, cpu);
> > -	twork->cpu = cpu;
> > -	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > -	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > +	os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);
> > +	WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);
> 
> See if we use atomic_inc(), we would need to also WARN(os > TICK_SCHED_REMOTE_OFFLINE).

How about if I put that WARN() between the atomic_inc_not_zero() and
the return, presumably also adding braces?

							Thanx, Paul

> > +	if (os == TICK_SCHED_REMOTE_OFFLINE) {
> > +		twork->cpu = cpu;
> > +		INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > +		queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > +	}
> >  }
> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-21 23:46                     ` Paul E. McKenney
@ 2019-06-24 23:12                       ` Frederic Weisbecker
  2019-06-24 23:44                         ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Frederic Weisbecker @ 2019-06-24 23:12 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Peter Zijlstra, linux-kernel, mingo, tglx

On Fri, Jun 21, 2019 at 04:46:02PM -0700, Paul E. McKenney wrote:
> @@ -3097,13 +3126,21 @@ static void sched_tick_remote(struct work_struct *work)
>  	/*
>  	 * Run the remote tick once per second (1Hz). This arbitrary
>  	 * frequency is large enough to avoid overload but short enough
> -	 * to keep scheduler internal stats reasonably up to date.
> +	 * to keep scheduler internal stats reasonably up to date.  But
> +	 * first update state to reflect hotplug activity if required.
>  	 */
> +	os = atomic_read(&twork->state);
> +	if (os) {
> +		WARN_ON_ONCE(os != TICK_SCHED_REMOTE_OFFLINING);
> +		if (atomic_inc_not_zero(&twork->state))
> +			return;

Using inc makes me a bit nervous here. If we do so, we should somewhow
make sure that we never exceed a value higher than TICK_SCHED_REMOTE_OFFLINE
by accident.

atomic_xchg() is probably a bit costlier but also safer as it allows
us to check both the old and the new value. That path shouldn't be critically fast
after all.

> +	}
>  	queue_delayed_work(system_unbound_wq, dwork, HZ);
>  }
>  
>  static void sched_tick_start(int cpu)
>  {
> +	int os;
>  	struct tick_work *twork;
>  
>  	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> @@ -3112,15 +3149,20 @@ static void sched_tick_start(int cpu)
>  	WARN_ON_ONCE(!tick_work_cpu);
>  
>  	twork = per_cpu_ptr(tick_work_cpu, cpu);
> -	twork->cpu = cpu;
> -	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> -	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +	os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);
> +	WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);

See if we use atomic_inc(), we would need to also WARN(os > TICK_SCHED_REMOTE_OFFLINE).

> +	if (os == TICK_SCHED_REMOTE_OFFLINE) {
> +		twork->cpu = cpu;
> +		INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> +		queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +	}
>  }

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-21 17:50                   ` Paul E. McKenney
@ 2019-06-21 23:46                     ` Paul E. McKenney
  2019-06-24 23:12                       ` Frederic Weisbecker
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-21 23:46 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, frederic, tglx

On Fri, Jun 21, 2019 at 10:50:27AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 21, 2019 at 10:41:04AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 21, 2019 at 06:34:14AM -0700, Paul E. McKenney wrote:
> > > On Fri, Jun 21, 2019 at 02:29:27PM +0200, Peter Zijlstra wrote:
> > > > On Fri, Jun 21, 2019 at 05:16:30AM -0700, Paul E. McKenney wrote:
> > > > > A pair of full hangs at boot (TASKS03 and TREE04), no console output
> > > > > whatsoever.  Not sure how these changes could cause that, but suspicion
> > > > > falls on sched_tick_offload_init().  Though even that is a bit strange
> > > > > because if so, why didn't TREE01 and TREE07 also hang?  Again, looking
> > > > > into it.
> > > > 
> > > > Pesky details ;-)
> > > 
> > > And backing out to the earlier patch removes the hangs, though statistical
> > > insignificance and all that.
> > 
> > And purists might argue that four failures out of four attempts does not
> > constitute true statistical significance, but too bad.  If I interpose
> > a twork pointer in sched_tick_offload_init()'s initialization, it seems
> > to work fine, give or take lack of statistical significance.  This is
> > surprising, so I am rerunning with added parentheses in the atomic_set()
> > expression.
> 
> Huh.  This works, albeit only once:
> 
> 	int __init sched_tick_offload_init(void)
> 	{
> 		struct tick_work *twork;
> 		int cpu;
> 
> 		tick_work_cpu = alloc_percpu(struct tick_work);
> 		BUG_ON(!tick_work_cpu);
> 		for_each_possible_cpu(cpu) {
> 			twork = per_cpu_ptr(tick_work_cpu, cpu);
> 			atomic_set(&twork->state, TICK_SCHED_REMOTE_OFFLINE);
> 		}
> 
> 		return 0;
> 	}
> 
> This does not work:
> 
> 	int __init sched_tick_offload_init(void)
> 	{
> 		int cpu;
> 
> 		tick_work_cpu = alloc_percpu(struct tick_work);
> 		BUG_ON(!tick_work_cpu);
> 		for_each_possible_cpu(cpu)
> 			atomic_set(&(per_cpu(tick_work_cpu, cpu)->state), TICK_SCHED_REMOTE_OFFLINE);
> 
> 		return 0;
> 	}
> 
> I will run more tests on the one that worked only once.  In the meantime,
> feel free to tell me what stupid thing I did with the parentheses.

Newer compilers are OK with either, it appears.  I took the version that
worked with older compilers, please see below.

I will be testing this more heavily over the weekend.  In the meantime,
thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit 373fb478813e58fea04c87c898959835abb12c8f
Author: Paul E. McKenney <paulmck@linux.ibm.com>
Date:   Thu May 30 05:39:25 2019 -0700

    time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
    
    The TASKS03 and TREE04 rcutorture scenarios produce the following
    lockdep complaint:
    
    ------------------------------------------------------------------------
    
    ================================
    WARNING: inconsistent lock state
    5.2.0-rc1+ #513 Not tainted
    --------------------------------
    inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
    migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
    {IN-HARDIRQ-W} state was registered at:
      lock_acquire+0xb0/0x1c0
      _raw_spin_lock_irqsave+0x3c/0x50
      tick_broadcast_switch_to_oneshot+0xd/0x40
      tick_switch_to_oneshot+0x4f/0xd0
      hrtimer_run_queues+0xf3/0x130
      run_local_timers+0x1c/0x50
      update_process_times+0x1c/0x50
      tick_periodic+0x26/0xc0
      tick_handle_periodic+0x1a/0x60
      smp_apic_timer_interrupt+0x80/0x2a0
      apic_timer_interrupt+0xf/0x20
      _raw_spin_unlock_irqrestore+0x4e/0x60
      rcu_nocb_gp_kthread+0x15d/0x590
      kthread+0xf3/0x130
      ret_from_fork+0x3a/0x50
    irq event stamp: 171
    hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
    hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
    softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
    softirqs last disabled at (0): [<0000000000000000>] 0x0
    
    other info that might help us debug this:
     Possible unsafe locking scenario:
    
           CPU0
           ----
      lock(tick_broadcast_lock);
      <Interrupt>
        lock(tick_broadcast_lock);
    
     *** DEADLOCK ***
    
    1 lock held by migration/1/14:
     #0: (____ptrval____) (clockevents_lock){+.+.}, at: tick_offline_cpu+0xf/0x30
    
    stack backtrace:
    CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.2.0-rc1+ #513
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
    Call Trace:
     dump_stack+0x5e/0x8b
     print_usage_bug+0x1fc/0x216
     ? print_shortest_lock_dependencies+0x1b0/0x1b0
     mark_lock+0x1f2/0x280
     __lock_acquire+0x1e0/0x18f0
     ? __lock_acquire+0x21b/0x18f0
     ? _raw_spin_unlock_irqrestore+0x4e/0x60
     lock_acquire+0xb0/0x1c0
     ? tick_broadcast_offline+0xf/0x70
     _raw_spin_lock+0x33/0x40
     ? tick_broadcast_offline+0xf/0x70
     tick_broadcast_offline+0xf/0x70
     tick_offline_cpu+0x16/0x30
     take_cpu_down+0x7d/0xa0
     multi_cpu_stop+0xa2/0xe0
     ? cpu_stop_queue_work+0xc0/0xc0
     cpu_stopper_thread+0x6d/0x100
     smpboot_thread_fn+0x169/0x240
     kthread+0xf3/0x130
     ? sort_range+0x20/0x20
     ? kthread_cancel_delayed_work_sync+0x10/0x10
     ret_from_fork+0x3a/0x50
    
    ------------------------------------------------------------------------
    
    To reproduce, run the following rcutorture test:
    
            tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04"
    
    It turns out that tick_broadcast_offline() was an innocent bystander.
    After all, interrupts are supposed to be disabled throughout
    take_cpu_down(), and therefore should have been disabled upon entry to
    tick_offline_cpu() and thus to tick_broadcast_offline().  This suggests
    that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
    and leaving them enabled on return.
    
    Some debugging code showed that the culprit was sched_cpu_dying().
    It had irqs enabled after return from sched_tick_stop().  Which in turn
    had irqs enabled after return from cancel_delayed_work_sync().  Which is a
    wrapper around __cancel_work_timer().  Which can sleep in the case where
    something else is concurrently trying to cancel the same delayed work,
    and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
    idea when you are invoked from take_cpu_down(), regardless of the state
    you leave interrupts in upon return.
    
    Code inspection located no reason why the delayed work absolutely
    needed to be canceled from sched_tick_stop():  The work is not
    bound to the outgoing CPU by design, given that the whole point is
    to collect statistics without disturbing the outgoing CPU.
    
    This commit therefore simply drops the cancel_delayed_work_sync() from
    sched_tick_stop().  Instead, a new ->state field is added to the tick_work
    structure so that the delayed-work handler function sched_tick_remote()
    can avoid reposting itself.  A cpu_is_offline() check is also added to
    sched_tick_remote() to avoid mucking with the state of an offlined CPU
    (though it does appear safe to do so).  The sched_tick_start() and
    sched_tick_stop() functions also update ->state, and sched_tick_start()
    also schedules the delayed work if ->state indicates that it is not
    already in flight.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
    [ paulmck: Apply Peter Zijlstra synchronization feedback. ]
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 102dfcf0a29a..93a72926d8cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3050,8 +3050,36 @@ void scheduler_tick(void)
 
 struct tick_work {
 	int			cpu;
+	atomic_t		state;
 	struct delayed_work	work;
 };
+/* Values for ->state, see diagram below. */
+#define TICK_SCHED_REMOTE_RUNNING	0
+#define TICK_SCHED_REMOTE_OFFLINING	1
+#define TICK_SCHED_REMOTE_OFFLINE	2
+
+/*
+ * State diagram for ->state:
+ *
+ *
+ *          TICK_SCHED_REMOTE_OFFLINE
+ *                    |   ^
+ *                    |   |
+ *                    |   | sched_tick_remote()
+ *                    |   |
+ *                    |   |
+ *                    +--TICK_SCHED_REMOTE_OFFLINING
+ *                    |   ^
+ *                    |   |
+ * sched_tick_start() |   | sched_tick_stop()
+ *                    |   |
+ *                    V   |
+ *          TICK_SCHED_REMOTE_RUNNING
+ *
+ *
+ * Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
+ * and sched_tick_start() are happy to leave the state in RUNNING.
+ */
 
 static struct tick_work __percpu *tick_work_cpu;
 
@@ -3064,6 +3092,7 @@ static void sched_tick_remote(struct work_struct *work)
 	struct task_struct *curr;
 	struct rq_flags rf;
 	u64 delta;
+	int os;
 
 	/*
 	 * Handle the tick only if it appears the remote CPU is running in full
@@ -3077,7 +3106,7 @@ static void sched_tick_remote(struct work_struct *work)
 
 	rq_lock_irq(rq, &rf);
 	curr = rq->curr;
-	if (is_idle_task(curr))
+	if (is_idle_task(curr) || cpu_is_offline(cpu))
 		goto out_unlock;
 
 	update_rq_clock(rq);
@@ -3097,13 +3126,21 @@ static void sched_tick_remote(struct work_struct *work)
 	/*
 	 * Run the remote tick once per second (1Hz). This arbitrary
 	 * frequency is large enough to avoid overload but short enough
-	 * to keep scheduler internal stats reasonably up to date.
+	 * to keep scheduler internal stats reasonably up to date.  But
+	 * first update state to reflect hotplug activity if required.
 	 */
+	os = atomic_read(&twork->state);
+	if (os) {
+		WARN_ON_ONCE(os != TICK_SCHED_REMOTE_OFFLINING);
+		if (atomic_inc_not_zero(&twork->state))
+			return;
+	}
 	queue_delayed_work(system_unbound_wq, dwork, HZ);
 }
 
 static void sched_tick_start(int cpu)
 {
+	int os;
 	struct tick_work *twork;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
@@ -3112,15 +3149,20 @@ static void sched_tick_start(int cpu)
 	WARN_ON_ONCE(!tick_work_cpu);
 
 	twork = per_cpu_ptr(tick_work_cpu, cpu);
-	twork->cpu = cpu;
-	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
-	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+	os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);
+	WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);
+	if (os == TICK_SCHED_REMOTE_OFFLINE) {
+		twork->cpu = cpu;
+		INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
+		queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+	}
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
 static void sched_tick_stop(int cpu)
 {
 	struct tick_work *twork;
+	int os;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
 		return;
@@ -3128,14 +3170,24 @@ static void sched_tick_stop(int cpu)
 	WARN_ON_ONCE(!tick_work_cpu);
 
 	twork = per_cpu_ptr(tick_work_cpu, cpu);
-	cancel_delayed_work_sync(&twork->work);
+	/* There cannot be competing actions, but don't rely on stop-machine. */
+	os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_OFFLINING);
+	WARN_ON_ONCE(os != TICK_SCHED_REMOTE_RUNNING);
+	/* Don't cancel, as this would mess up the state machine. */
 }
 #endif /* CONFIG_HOTPLUG_CPU */
 
 int __init sched_tick_offload_init(void)
 {
+	struct tick_work *twork;
+	int cpu;
+
 	tick_work_cpu = alloc_percpu(struct tick_work);
 	BUG_ON(!tick_work_cpu);
+	for_each_possible_cpu(cpu) {
+		twork = per_cpu_ptr(tick_work_cpu, cpu);
+		atomic_set(&twork->state, TICK_SCHED_REMOTE_OFFLINE);
+	}
 
 	return 0;
 }

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-21 17:41                 ` Paul E. McKenney
@ 2019-06-21 17:50                   ` Paul E. McKenney
  2019-06-21 23:46                     ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-21 17:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, frederic, tglx

On Fri, Jun 21, 2019 at 10:41:04AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 21, 2019 at 06:34:14AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 21, 2019 at 02:29:27PM +0200, Peter Zijlstra wrote:
> > > On Fri, Jun 21, 2019 at 05:16:30AM -0700, Paul E. McKenney wrote:
> > > > A pair of full hangs at boot (TASKS03 and TREE04), no console output
> > > > whatsoever.  Not sure how these changes could cause that, but suspicion
> > > > falls on sched_tick_offload_init().  Though even that is a bit strange
> > > > because if so, why didn't TREE01 and TREE07 also hang?  Again, looking
> > > > into it.
> > > 
> > > Pesky details ;-)
> > 
> > And backing out to the earlier patch removes the hangs, though statistical
> > insignificance and all that.
> 
> And purists might argue that four failures out of four attempts does not
> constitute true statistical significance, but too bad.  If I interpose
> a twork pointer in sched_tick_offload_init()'s initialization, it seems
> to work fine, give or take lack of statistical significance.  This is
> surprising, so I am rerunning with added parentheses in the atomic_set()
> expression.

Huh.  This works, albeit only once:

	int __init sched_tick_offload_init(void)
	{
		struct tick_work *twork;
		int cpu;

		tick_work_cpu = alloc_percpu(struct tick_work);
		BUG_ON(!tick_work_cpu);
		for_each_possible_cpu(cpu) {
			twork = per_cpu_ptr(tick_work_cpu, cpu);
			atomic_set(&twork->state, TICK_SCHED_REMOTE_OFFLINE);
		}

		return 0;
	}

This does not work:

	int __init sched_tick_offload_init(void)
	{
		int cpu;

		tick_work_cpu = alloc_percpu(struct tick_work);
		BUG_ON(!tick_work_cpu);
		for_each_possible_cpu(cpu)
			atomic_set(&(per_cpu(tick_work_cpu, cpu)->state), TICK_SCHED_REMOTE_OFFLINE);

		return 0;
	}

I will run more tests on the one that worked only once.  In the meantime,
feel free to tell me what stupid thing I did with the parentheses.

								Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-21 13:34               ` Paul E. McKenney
@ 2019-06-21 17:41                 ` Paul E. McKenney
  2019-06-21 17:50                   ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-21 17:41 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, frederic, tglx

On Fri, Jun 21, 2019 at 06:34:14AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 21, 2019 at 02:29:27PM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 21, 2019 at 05:16:30AM -0700, Paul E. McKenney wrote:
> > > A pair of full hangs at boot (TASKS03 and TREE04), no console output
> > > whatsoever.  Not sure how these changes could cause that, but suspicion
> > > falls on sched_tick_offload_init().  Though even that is a bit strange
> > > because if so, why didn't TREE01 and TREE07 also hang?  Again, looking
> > > into it.
> > 
> > Pesky details ;-)
> 
> And backing out to the earlier patch removes the hangs, though statistical
> insignificance and all that.

And purists might argue that four failures out of four attempts does not
constitute true statistical significance, but too bad.  If I interpose
a twork pointer in sched_tick_offload_init()'s initialization, it seems
to work fine, give or take lack of statistical significance.  This is
surprising, so I am rerunning with added parentheses in the atomic_set()
expression.

			Thanx, Paul "I hate initialization" McKenney

> Ah, in answer to your earlier question, if you want it in v5.3, you
> will need to take it (but I do humbly request that you wait until it
> actually works).  If you don't take it, I won't be submitting it earlier
> than v5.4.  Either way, your choice!
> 
> 							Thanx, Paul


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-21 12:29             ` Peter Zijlstra
@ 2019-06-21 13:34               ` Paul E. McKenney
  2019-06-21 17:41                 ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-21 13:34 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, frederic, tglx

On Fri, Jun 21, 2019 at 02:29:27PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 21, 2019 at 05:16:30AM -0700, Paul E. McKenney wrote:
> > A pair of full hangs at boot (TASKS03 and TREE04), no console output
> > whatsoever.  Not sure how these changes could cause that, but suspicion
> > falls on sched_tick_offload_init().  Though even that is a bit strange
> > because if so, why didn't TREE01 and TREE07 also hang?  Again, looking
> > into it.
> 
> Pesky details ;-)

And backing out to the earlier patch removes the hangs, though statistical
insignificance and all that.

Ah, in answer to your earlier question, if you want it in v5.3, you
will need to take it (but I do humbly request that you wait until it
actually works).  If you don't take it, I won't be submitting it earlier
than v5.4.  Either way, your choice!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-21 12:16           ` Paul E. McKenney
@ 2019-06-21 12:29             ` Peter Zijlstra
  2019-06-21 13:34               ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2019-06-21 12:29 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel, mingo, frederic, tglx

On Fri, Jun 21, 2019 at 05:16:30AM -0700, Paul E. McKenney wrote:
> A pair of full hangs at boot (TASKS03 and TREE04), no console output
> whatsoever.  Not sure how these changes could cause that, but suspicion
> falls on sched_tick_offload_init().  Though even that is a bit strange
> because if so, why didn't TREE01 and TREE07 also hang?  Again, looking
> into it.

Pesky details ;-)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-21 10:55         ` Peter Zijlstra
@ 2019-06-21 12:16           ` Paul E. McKenney
  2019-06-21 12:29             ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-21 12:16 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, frederic, tglx

On Fri, Jun 21, 2019 at 12:55:03PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 20, 2019 at 03:13:36PM -0700, Paul E. McKenney wrote:
> > So how about the following patch, which passes very light rcutorture
> > testing but should otherwise be regarded as being under suspicion?
> 
> Looks good to me,
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thank you!  But...

> Or, if you want me to apply it, I can do that too ;-)

Last night's rcutorture run was not amused.  Looking into it...

A pair of full hangs at boot (TASKS03 and TREE04), no console output
whatsoever.  Not sure how these changes could cause that, but suspicion
falls on sched_tick_offload_init().  Though even that is a bit strange
because if so, why didn't TREE01 and TREE07 also hang?  Again, looking
into it.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-20 22:13       ` Paul E. McKenney
@ 2019-06-21 10:55         ` Peter Zijlstra
  2019-06-21 12:16           ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2019-06-21 10:55 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel, mingo, frederic, tglx

On Thu, Jun 20, 2019 at 03:13:36PM -0700, Paul E. McKenney wrote:
> So how about the following patch, which passes very light rcutorture
> testing but should otherwise be regarded as being under suspicion?

Looks good to me,

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Or, if you want me to apply it, I can do that too ;-)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-20 21:10     ` Peter Zijlstra
@ 2019-06-20 22:13       ` Paul E. McKenney
  2019-06-21 10:55         ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-20 22:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, frederic, tglx

On Thu, Jun 20, 2019 at 11:10:19PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 20, 2019 at 09:01:18AM -0700, Paul E. McKenney wrote:
> 
> > > > +#define TICK_SCHED_REMOTE_OFFLINE	0
> > > > +#define TICK_SCHED_REMOTE_RUNNING	1
> > > > +#define TICK_SCHED_REMOTE_OFFLINING	2
> > > 
> > > That seems a daft set of values; consider { RUNNING, OFFLINING, OFFLINE }
> > > and see below.
> > 
> > As in make it an enum?  I could do that.
> 
> Enum or define, I don't much care, but the 'natural' ordering of the
> states is either: running -> offlining -> offline, or the other way
> around, the given order in the patch just didn't make sense.
> 
> The one with running=0 just seems to work out nicer.

Aside from now needing to be initialized, but so it goes.  Which is
why TICK_SCHED_REMOTE_OFFLINE was zero in my version, in case you were
wondering.  ;-)

> > > > +
> > > > +// State diagram for ->state:
> > > > +//
> > > > +//
> > > > +//      +----->OFFLINE--------------------------+
> > > > +//      |                                       |
> > > > +//      |                                       |
> > > > +//      |                                       | sched_tick_start()
> > > > +//      | sched_tick_remote()                   |
> > > > +//      |                                       |
> > > > +//      |                                       V
> > > > +//      |                        +---------->RUNNING
> > > > +//      |                        |              |
> > > > +//      |                        |              |
> > > > +//      |                        |              |
> > > > +//      |     sched_tick_start() |              | sched_tick_stop()
> > > > +//      |                        |              |
> > > > +//      |                        |              |
> > > > +//      |                        |              |
> > > > +//      +--------------------OFFLINING<---------+
> > > > +//
> > > > +//
> > > > +// Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
> > > > +// and sched_tick_start() are happy to leave the state in RUNNING.
> 
> > > Also, I find it harder to read that needed, maybe a little something
> > > like so:
> > > 
> > > /*
> > >  *              OFFLINE
> > >  *               |   ^
> > >  *               |   | tick_remote()
> > >  *               |   |
> > >  *               +--OFFLINING
> > >  *               |   ^
> > >  *  tick_start() |   | tick_stop()
> > >  *               v   |
> > >  *              RUNNING
> > >  */
> > 
> > As in remove the leading "sched_" from the function names?  (The names
> > were already there, so I left them be.)
> 
> That was just me being lazy, the main part being getting the states in a
> linear order, instead of spread around a 2d grid.

OK, I will expand them.  Including the ones where I was being lazy.

> > > While not wrong, it seems overly complicated; can't we do something
> > > like:
> > > 
> > > tick:
> > 
> > As in sched_tick_remote(), right?
> > 
> > > 	state = atomic_read(->state);
> > > 	if (state) {
> > 
> > You sure you don't want "if (state != RUNNING)"?  But I guess you need
> > to understand that RUNNING==0 to understand the atomic_inc_not_zero().
> 
> Right..
> 
> > 
> > > 		WARN_ON_ONCE(state != OFFLINING);
> > > 		if (atomic_inc_not_zero(->state))
> > 
> > This assumes that there cannot be concurrent calls to sched_tick_remote(),
> > otherwise, you can end up with ->state==3.  Which is a situation that
> > my version does a WARN_ON_ONCE() for, so I guess the only difference is
> > that mine would be guaranteed to complain and yours would complain with
> > high probability.  So fair enough!
> 
> I was assuming there was only a single work per CPU and there'd not be
> concurrency on this path.
> 
> > > 			return;
> > > 	}
> > > 	queue_delayed_work();
> > > 
> > > 
> > > stop:
> > > 	/*
> > > 	 * This is hotplug; even without stop-machine, there cannot be
> > > 	 * concurrency on offlining specific CPUs.
> > > 	 */
> > > 	state = atomic_read(->state);
> > 
> > There cannot be a sched_tick_stop() or sched_tick_stop(), but there really
> > can be a sched_tick_remote() right here in the absence of stop-machine,
> > can't there?  Or am I missing something other than stop-machine that
> > prevents this?
> 
> There can be a remote tick, indeed.
> 
> > Now, you could argue that concurrency is safe: Either sched_tick_remote()
> > sees RUNNING and doesn't touch the value, or it sees offlining and
> > sched_tick_stop() makes no further changes,
> 
> That was indeed the thinking.
> 
> > but I am not sure that this qualifies as simpler...
> 
> There is that I suppose. I think my initial version was a little more
> complicated, but after a few passes this happened :-)
> 
> > > 	WARN_ON_ONCE(state != RUNNING);
> > > 	atomic_set(->state, OFFLINING);
> > 
> > Another option would be to use atomic_xchg() as below instead of the
> > atomic_read()/atomic_set() pair.  Would that work for you?
> 
> Yes, that works I suppose. Is more expensive, but I don't think we
> particularly care about that here.

Not on this code path, agreed.

> > > start:
> > > 	state = atomic_xchg(->state, RUNNING);
> > > 	WARN_ON_ONCE(state == RUNNING);
> > > 	if (state == OFFLINE) {
> > > 		// ...
> > > 		queue_delayed_work();
> > > 	}
> > 
> > This one looks to be an improvement on mine regardless of the other two.

So how about the following patch, which passes very light rcutorture
testing but should otherwise be regarded as being under suspicion?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427742a9..36631d2eff05 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3051,8 +3051,36 @@ void scheduler_tick(void)
 
 struct tick_work {
 	int			cpu;
+	atomic_t		state;
 	struct delayed_work	work;
 };
+/* Values for ->state, see diagram below. */
+#define TICK_SCHED_REMOTE_RUNNING	0
+#define TICK_SCHED_REMOTE_OFFLINING	1
+#define TICK_SCHED_REMOTE_OFFLINE	2
+
+/*
+ * State diagram for ->state:
+ *
+ *
+ *          TICK_SCHED_REMOTE_OFFLINE
+ *                    |   ^
+ *                    |   |
+ *                    |   | sched_tick_remote()
+ *                    |   |
+ *                    |   |
+ *                    +--TICK_SCHED_REMOTE_OFFLINING
+ *                    |   ^
+ *                    |   |
+ * sched_tick_start() |   | sched_tick_stop()
+ *                    |   |
+ *                    V   |
+ *          TICK_SCHED_REMOTE_RUNNING
+ *
+ *
+ * Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
+ * and sched_tick_start() are happy to leave the state in RUNNING.
+ */
 
 static struct tick_work __percpu *tick_work_cpu;
 
@@ -3065,6 +3093,7 @@ static void sched_tick_remote(struct work_struct *work)
 	struct task_struct *curr;
 	struct rq_flags rf;
 	u64 delta;
+	int os;
 
 	/*
 	 * Handle the tick only if it appears the remote CPU is running in full
@@ -3078,7 +3107,7 @@ static void sched_tick_remote(struct work_struct *work)
 
 	rq_lock_irq(rq, &rf);
 	curr = rq->curr;
-	if (is_idle_task(curr))
+	if (is_idle_task(curr) || cpu_is_offline(cpu))
 		goto out_unlock;
 
 	update_rq_clock(rq);
@@ -3098,13 +3127,21 @@ static void sched_tick_remote(struct work_struct *work)
 	/*
 	 * Run the remote tick once per second (1Hz). This arbitrary
 	 * frequency is large enough to avoid overload but short enough
-	 * to keep scheduler internal stats reasonably up to date.
+	 * to keep scheduler internal stats reasonably up to date.  But
+	 * first update state to reflect hotplug activity if required.
 	 */
+	os = atomic_read(&twork->state);
+	if (os) {
+		WARN_ON_ONCE(os != TICK_SCHED_REMOTE_OFFLINING);
+		if (atomic_inc_not_zero(&twork->state))
+			return;
+	}
 	queue_delayed_work(system_unbound_wq, dwork, HZ);
 }
 
 static void sched_tick_start(int cpu)
 {
+	int os;
 	struct tick_work *twork;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
@@ -3113,15 +3150,20 @@ static void sched_tick_start(int cpu)
 	WARN_ON_ONCE(!tick_work_cpu);
 
 	twork = per_cpu_ptr(tick_work_cpu, cpu);
-	twork->cpu = cpu;
-	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
-	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+	os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING);
+	WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING);
+	if (os == TICK_SCHED_REMOTE_OFFLINE) {
+		twork->cpu = cpu;
+		INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
+		queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+	}
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
 static void sched_tick_stop(int cpu)
 {
 	struct tick_work *twork;
+	int os;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
 		return;
@@ -3129,14 +3171,21 @@ static void sched_tick_stop(int cpu)
 	WARN_ON_ONCE(!tick_work_cpu);
 
 	twork = per_cpu_ptr(tick_work_cpu, cpu);
-	cancel_delayed_work_sync(&twork->work);
+	/* There cannot be competing actions, but don't rely on stop-machine. */
+	os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_OFFLINING);
+	WARN_ON_ONCE(os != TICK_SCHED_REMOTE_RUNNING);
+	/* Don't cancel, as this would mess up the state machine. */
 }
 #endif /* CONFIG_HOTPLUG_CPU */
 
 int __init sched_tick_offload_init(void)
 {
+	int cpu;
+
 	tick_work_cpu = alloc_percpu(struct tick_work);
 	BUG_ON(!tick_work_cpu);
+	for_each_possible_cpu(cpu)
+		atomic_set(&per_cpu(tick_work_cpu, cpu)->state, TICK_SCHED_REMOTE_OFFLINE);
 
 	return 0;
 }


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-20 16:01   ` Paul E. McKenney
@ 2019-06-20 21:10     ` Peter Zijlstra
  2019-06-20 22:13       ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2019-06-20 21:10 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel, mingo, frederic, tglx

On Thu, Jun 20, 2019 at 09:01:18AM -0700, Paul E. McKenney wrote:

> > > +#define TICK_SCHED_REMOTE_OFFLINE	0
> > > +#define TICK_SCHED_REMOTE_RUNNING	1
> > > +#define TICK_SCHED_REMOTE_OFFLINING	2
> > 
> > That seems a daft set of values; consider { RUNNING, OFFLINING, OFFLINE }
> > and see below.
> 
> As in make it an enum?  I could do that.

Enum or define, I don't much care, but the 'natural' ordering of the
states is either: running -> offlining -> offline, or the other way
around, the given order in the patch just didn't make sense.

The one with running=0 just seems to work out nicer.

> > > +
> > > +// State diagram for ->state:
> > > +//
> > > +//
> > > +//      +----->OFFLINE--------------------------+
> > > +//      |                                       |
> > > +//      |                                       |
> > > +//      |                                       | sched_tick_start()
> > > +//      | sched_tick_remote()                   |
> > > +//      |                                       |
> > > +//      |                                       V
> > > +//      |                        +---------->RUNNING
> > > +//      |                        |              |
> > > +//      |                        |              |
> > > +//      |                        |              |
> > > +//      |     sched_tick_start() |              | sched_tick_stop()
> > > +//      |                        |              |
> > > +//      |                        |              |
> > > +//      |                        |              |
> > > +//      +--------------------OFFLINING<---------+
> > > +//
> > > +//
> > > +// Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
> > > +// and sched_tick_start() are happy to leave the state in RUNNING.

> > Also, I find it harder to read that needed, maybe a little something
> > like so:
> > 
> > /*
> >  *              OFFLINE
> >  *               |   ^
> >  *               |   | tick_remote()
> >  *               |   |
> >  *               +--OFFLINING
> >  *               |   ^
> >  *  tick_start() |   | tick_stop()
> >  *               v   |
> >  *              RUNNING
> >  */
> 
> As in remove the leading "sched_" from the function names?  (The names
> were already there, so I left them be.)

That was just me being lazy, the main part being getting the states in a
linear order, instead of spread around a 2d grid.

> > While not wrong, it seems overly complicated; can't we do something
> > like:
> > 
> > tick:
> 
> As in sched_tick_remote(), right?
> 
> > 	state = atomic_read(->state);
> > 	if (state) {
> 
> You sure you don't want "if (state != RUNNING)"?  But I guess you need
> to understand that RUNNING==0 to understand the atomic_inc_not_zero().

Right..

> 
> > 		WARN_ON_ONCE(state != OFFLINING);
> > 		if (atomic_inc_not_zero(->state))
> 
> This assumes that there cannot be concurrent calls to sched_tick_remote(),
> otherwise, you can end up with ->state==3.  Which is a situation that
> my version does a WARN_ON_ONCE() for, so I guess the only difference is
> that mine would be guaranteed to complain and yours would complain with
> high probability.  So fair enough!

I was assuming there was only a single work per CPU and there'd not be
concurrency on this path.

> > 			return;
> > 	}
> > 	queue_delayed_work();
> > 
> > 
> > stop:
> > 	/*
> > 	 * This is hotplug; even without stop-machine, there cannot be
> > 	 * concurrency on offlining specific CPUs.
> > 	 */
> > 	state = atomic_read(->state);
> 
> There cannot be a sched_tick_stop() or sched_tick_stop(), but there really
> can be a sched_tick_remote() right here in the absence of stop-machine,
> can't there?  Or am I missing something other than stop-machine that
> prevents this?

There can be a remote tick, indeed.

> Now, you could argue that concurrency is safe: Either sched_tick_remote()
> sees RUNNING and doesn't touch the value, or it sees offlining and
> sched_tick_stop() makes no further changes,

That was indeed the thinking.

> but I am not sure that this qualifies as simpler...

There is that I suppose. I think my initial version was a little more
complicated, but after a few passes this happened :-)

> > 	WARN_ON_ONCE(state != RUNNING);
> > 	atomic_set(->state, OFFLINING);
> 
> Another option would be to use atomic_xchg() as below instead of the
> atomic_read()/atomic_set() pair.  Would that work for you?

Yes, that works I suppose. Is more expensive, but I don't think we
particularly care about that here.

> > start:
> > 	state = atomic_xchg(->state, RUNNING);
> > 	WARN_ON_ONCE(state == RUNNING);
> > 	if (state == OFFLINE) {
> > 		// ...
> > 		queue_delayed_work();
> > 	}
> 
> This one looks to be an improvement on mine regardless of the other two.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-20 12:10 ` Peter Zijlstra
@ 2019-06-20 16:01   ` Paul E. McKenney
  2019-06-20 21:10     ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-20 16:01 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, frederic, tglx

On Thu, Jun 20, 2019 at 02:10:32PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 19, 2019 at 11:19:03AM -0700, Paul E. McKenney wrote:
> > [ Hearing no objections and given no test failures in multiple weeks of
> >   rcutorture testing, I intend to submit this to the upcoming merge
> >   window.  Thoughts? ]
> 
> I can't remember seeing this before; but then, there's a ton of unread
> email in the inbox :-(

I have no idea whether or not you were on CC in the earlier thread.
But you are now!  ;-)

> > Some debugging code showed that the culprit was sched_cpu_dying().
> > It had irqs enabled after return from sched_tick_stop().  Which in turn
> > had irqs enabled after return from cancel_delayed_work_sync().  Which is a
> > wrapper around __cancel_work_timer().  Which can sleep in the case where
> > something else is concurrently trying to cancel the same delayed work,
> > and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
> > idea when you are invoked from take_cpu_down(), regardless of the state
> > you leave interrupts in upon return.
> > 
> > Code inspection located no reason why the delayed work absolutely
> > needed to be canceled from sched_tick_stop():  The work is not
> > bound to the outgoing CPU by design, given that the whole point is
> > to collect statistics without disturbing the outgoing CPU.
> > 
> > This commit therefore simply drops the cancel_delayed_work_sync() from
> > sched_tick_stop().  Instead, a new ->state field is added to the tick_work
> > structure so that the delayed-work handler function sched_tick_remote()
> > can avoid reposting itself.  A cpu_is_offline() check is also added to
> > sched_tick_remote() to avoid mucking with the state of an offlined CPU
> > (though it does appear safe to do so).  The sched_tick_start() and
> > sched_tick_stop() functions also update ->state, and sched_tick_start()
> > also schedules the delayed work if ->state indicates that it is not
> > already in flight.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 102dfcf0a29a..8409c83aa5fa 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3050,14 +3050,44 @@ void scheduler_tick(void)
> >  
> >  struct tick_work {
> >  	int			cpu;
> > +	int			state;
> >  	struct delayed_work	work;
> >  };
> > +// Values for ->state, see diagram below.
> > +#define TICK_SCHED_REMOTE_OFFLINE	0
> > +#define TICK_SCHED_REMOTE_RUNNING	1
> > +#define TICK_SCHED_REMOTE_OFFLINING	2
> 
> That seems a daft set of values; consider { RUNNING, OFFLINING, OFFLINE }
> and see below.

As in make it an enum?  I could do that.

> > +
> > +// State diagram for ->state:
> > +//
> > +//
> > +//      +----->OFFLINE--------------------------+
> > +//      |                                       |
> > +//      |                                       |
> > +//      |                                       | sched_tick_start()
> > +//      | sched_tick_remote()                   |
> > +//      |                                       |
> > +//      |                                       V
> > +//      |                        +---------->RUNNING
> > +//      |                        |              |
> > +//      |                        |              |
> > +//      |                        |              |
> > +//      |     sched_tick_start() |              | sched_tick_stop()
> > +//      |                        |              |
> > +//      |                        |              |
> > +//      |                        |              |
> > +//      +--------------------OFFLINING<---------+
> > +//
> > +//
> > +// Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
> > +// and sched_tick_start() are happy to leave the state in RUNNING.
> 
> Can we please stick to old skool C comments?

Your file, your rules!

> Also, I find it harder to read that needed, maybe a little something
> like so:
> 
> /*
>  *              OFFLINE
>  *               |   ^
>  *               |   | tick_remote()
>  *               |   |
>  *               +--OFFLINING
>  *               |   ^
>  *  tick_start() |   | tick_stop()
>  *               v   |
>  *              RUNNING
>  */

As in remove the leading "sched_" from the function names?  (The names
were already there, so I left them be.)

> >  static struct tick_work __percpu *tick_work_cpu;
> >  
> >  static void sched_tick_remote(struct work_struct *work)
> >  {
> >  	struct delayed_work *dwork = to_delayed_work(work);
> > +	int os;
> 
> this should go at the end, reverse xmas tree preference and all that.

Alphabetical by variable name for me, but your file, your rules!

> >  	struct tick_work *twork = container_of(dwork, struct tick_work, work);
> >  	int cpu = twork->cpu;
> >  	struct rq *rq = cpu_rq(cpu);
> > @@ -3077,7 +3107,7 @@ static void sched_tick_remote(struct work_struct *work)
> >  
> >  	rq_lock_irq(rq, &rf);
> >  	curr = rq->curr;
> > -	if (is_idle_task(curr))
> > +	if (is_idle_task(curr) || cpu_is_offline(cpu))
> >  		goto out_unlock;
> >  
> >  	update_rq_clock(rq);
> > @@ -3097,13 +3127,22 @@ static void sched_tick_remote(struct work_struct *work)
> >  	/*
> >  	 * Run the remote tick once per second (1Hz). This arbitrary
> >  	 * frequency is large enough to avoid overload but short enough
> > -	 * to keep scheduler internal stats reasonably up to date.
> > +	 * to keep scheduler internal stats reasonably up to date.  But
> > +	 * first update state to reflect hotplug activity if required.
> >  	 */
> > -	queue_delayed_work(system_unbound_wq, dwork, HZ);
> > +	do {
> > +		os = READ_ONCE(twork->state);
> > +		WARN_ON_ONCE(os == TICK_SCHED_REMOTE_OFFLINE);
> > +		if (os == TICK_SCHED_REMOTE_RUNNING)
> > +			break;
> > +	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_OFFLINE) != os);
> > +	if (os == TICK_SCHED_REMOTE_RUNNING)
> > +		queue_delayed_work(system_unbound_wq, dwork, HZ);
> >  }
> >  
> >  static void sched_tick_start(int cpu)
> >  {
> > +	int os;
> >  	struct tick_work *twork;
> >  
> >  	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > @@ -3112,14 +3151,23 @@ static void sched_tick_start(int cpu)
> >  	WARN_ON_ONCE(!tick_work_cpu);
> >  
> >  	twork = per_cpu_ptr(tick_work_cpu, cpu);
> > -	twork->cpu = cpu;
> > -	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > -	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > +	do {
> > +		os = READ_ONCE(twork->state);
> > +		if (WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING))
> > +			break;
> > +		// Either idle or offline for a short period
> > +	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_RUNNING) != os);
> > +	if (os == TICK_SCHED_REMOTE_OFFLINE) {
> > +		twork->cpu = cpu;
> > +		INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > +		queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > +	}
> >  }
> >  
> >  #ifdef CONFIG_HOTPLUG_CPU
> >  static void sched_tick_stop(int cpu)
> >  {
> > +	int os;
> >  	struct tick_work *twork;
> >  
> >  	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > @@ -3128,7 +3176,13 @@ static void sched_tick_stop(int cpu)
> >  	WARN_ON_ONCE(!tick_work_cpu);
> >  
> >  	twork = per_cpu_ptr(tick_work_cpu, cpu);
> > -	cancel_delayed_work_sync(&twork->work);
> > +	// There cannot be competing actions, but don't rely on stop_machine.
> > +	do {
> > +		os = READ_ONCE(twork->state);
> > +		WARN_ON_ONCE(os != TICK_SCHED_REMOTE_RUNNING);
> > +		// Either idle or offline for a short period
> > +	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_OFFLINING) != os);
> > +	// Don't cancel, as this would mess up the state machine.
> >  }
> >  #endif /* CONFIG_HOTPLUG_CPU */
> 
> While not wrong, it seems overly complicated; can't we do something
> like:
> 
> tick:

As in sched_tick_remote(), right?

> 	state = atomic_read(->state);
> 	if (state) {

You sure you don't want "if (state != RUNNING)"?  But I guess you need
to understand that RUNNING==0 to understand the atomic_inc_not_zero().

> 		WARN_ON_ONCE(state != OFFLINING);
> 		if (atomic_inc_not_zero(->state))

This assumes that there cannot be concurrent calls to sched_tick_remote(),
otherwise, you can end up with ->state==3.  Which is a situation that
my version does a WARN_ON_ONCE() for, so I guess the only difference is
that mine would be guaranteed to complain and yours would complain with
high probability.  So fair enough!

> 			return;
> 	}
> 	queue_delayed_work();
> 
> 
> stop:
> 	/*
> 	 * This is hotplug; even without stop-machine, there cannot be
> 	 * concurrency on offlining specific CPUs.
> 	 */
> 	state = atomic_read(->state);

There cannot be a sched_tick_stop() or sched_tick_stop(), but there really
can be a sched_tick_remote() right here in the absence of stop-machine,
can't there?  Or am I missing something other than stop-machine that
prevents this?

Now, you could argue that concurrency is safe: Either sched_tick_remote()
sees RUNNING and doesn't touch the value, or it sees offlining and
sched_tick_stop() makes no further changes, but I am not sure that
this qualifies as simpler...

> 	WARN_ON_ONCE(state != RUNNING);
> 	atomic_set(->state, OFFLINING);

Another option would be to use atomic_xchg() as below instead of the
atomic_read()/atomic_set() pair.  Would that work for you?

> start:
> 	state = atomic_xchg(->state, RUNNING);
> 	WARN_ON_ONCE(state == RUNNING);
> 	if (state == OFFLINE) {
> 		// ...
> 		queue_delayed_work();
> 	}

This one looks to be an improvement on mine regardless of the other two.

								Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
  2019-06-19 18:19 Paul E. McKenney
@ 2019-06-20 12:10 ` Peter Zijlstra
  2019-06-20 16:01   ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2019-06-20 12:10 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel, mingo, frederic, tglx

On Wed, Jun 19, 2019 at 11:19:03AM -0700, Paul E. McKenney wrote:
> [ Hearing no objections and given no test failures in multiple weeks of
>   rcutorture testing, I intend to submit this to the upcoming merge
>   window.  Thoughts? ]

I can't remember seeing this before; but then, there's a ton of unread
email in the inbox :-(

> Some debugging code showed that the culprit was sched_cpu_dying().
> It had irqs enabled after return from sched_tick_stop().  Which in turn
> had irqs enabled after return from cancel_delayed_work_sync().  Which is a
> wrapper around __cancel_work_timer().  Which can sleep in the case where
> something else is concurrently trying to cancel the same delayed work,
> and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
> idea when you are invoked from take_cpu_down(), regardless of the state
> you leave interrupts in upon return.
> 
> Code inspection located no reason why the delayed work absolutely
> needed to be canceled from sched_tick_stop():  The work is not
> bound to the outgoing CPU by design, given that the whole point is
> to collect statistics without disturbing the outgoing CPU.
> 
> This commit therefore simply drops the cancel_delayed_work_sync() from
> sched_tick_stop().  Instead, a new ->state field is added to the tick_work
> structure so that the delayed-work handler function sched_tick_remote()
> can avoid reposting itself.  A cpu_is_offline() check is also added to
> sched_tick_remote() to avoid mucking with the state of an offlined CPU
> (though it does appear safe to do so).  The sched_tick_start() and
> sched_tick_stop() functions also update ->state, and sched_tick_start()
> also schedules the delayed work if ->state indicates that it is not
> already in flight.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 102dfcf0a29a..8409c83aa5fa 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3050,14 +3050,44 @@ void scheduler_tick(void)
>  
>  struct tick_work {
>  	int			cpu;
> +	int			state;
>  	struct delayed_work	work;
>  };
> +// Values for ->state, see diagram below.
> +#define TICK_SCHED_REMOTE_OFFLINE	0
> +#define TICK_SCHED_REMOTE_RUNNING	1
> +#define TICK_SCHED_REMOTE_OFFLINING	2

That seems a daft set of values; consider { RUNNING, OFFLINING, OFFLINE }
and see below.

> +
> +// State diagram for ->state:
> +//
> +//
> +//      +----->OFFLINE--------------------------+
> +//      |                                       |
> +//      |                                       |
> +//      |                                       | sched_tick_start()
> +//      | sched_tick_remote()                   |
> +//      |                                       |
> +//      |                                       V
> +//      |                        +---------->RUNNING
> +//      |                        |              |
> +//      |                        |              |
> +//      |                        |              |
> +//      |     sched_tick_start() |              | sched_tick_stop()
> +//      |                        |              |
> +//      |                        |              |
> +//      |                        |              |
> +//      +--------------------OFFLINING<---------+
> +//
> +//
> +// Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
> +// and sched_tick_start() are happy to leave the state in RUNNING.

Can we please stick to old skool C comments?

Also, I find it harder to read that needed, maybe a little something
like so:

/*
 *              OFFLINE
 *               |   ^
 *               |   | tick_remote()
 *               |   |
 *               +--OFFLINING
 *               |   ^
 *  tick_start() |   | tick_stop()
 *               v   |
 *              RUNNING
 */

>  static struct tick_work __percpu *tick_work_cpu;
>  
>  static void sched_tick_remote(struct work_struct *work)
>  {
>  	struct delayed_work *dwork = to_delayed_work(work);
> +	int os;

this should go at the end, reverse xmas tree preference and all that.

>  	struct tick_work *twork = container_of(dwork, struct tick_work, work);
>  	int cpu = twork->cpu;
>  	struct rq *rq = cpu_rq(cpu);
> @@ -3077,7 +3107,7 @@ static void sched_tick_remote(struct work_struct *work)
>  
>  	rq_lock_irq(rq, &rf);
>  	curr = rq->curr;
> -	if (is_idle_task(curr))
> +	if (is_idle_task(curr) || cpu_is_offline(cpu))
>  		goto out_unlock;
>  
>  	update_rq_clock(rq);
> @@ -3097,13 +3127,22 @@ static void sched_tick_remote(struct work_struct *work)
>  	/*
>  	 * Run the remote tick once per second (1Hz). This arbitrary
>  	 * frequency is large enough to avoid overload but short enough
> -	 * to keep scheduler internal stats reasonably up to date.
> +	 * to keep scheduler internal stats reasonably up to date.  But
> +	 * first update state to reflect hotplug activity if required.
>  	 */
> -	queue_delayed_work(system_unbound_wq, dwork, HZ);
> +	do {
> +		os = READ_ONCE(twork->state);
> +		WARN_ON_ONCE(os == TICK_SCHED_REMOTE_OFFLINE);
> +		if (os == TICK_SCHED_REMOTE_RUNNING)
> +			break;
> +	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_OFFLINE) != os);
> +	if (os == TICK_SCHED_REMOTE_RUNNING)
> +		queue_delayed_work(system_unbound_wq, dwork, HZ);
>  }
>  
>  static void sched_tick_start(int cpu)
>  {
> +	int os;
>  	struct tick_work *twork;
>  
>  	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> @@ -3112,14 +3151,23 @@ static void sched_tick_start(int cpu)
>  	WARN_ON_ONCE(!tick_work_cpu);
>  
>  	twork = per_cpu_ptr(tick_work_cpu, cpu);
> -	twork->cpu = cpu;
> -	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> -	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +	do {
> +		os = READ_ONCE(twork->state);
> +		if (WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING))
> +			break;
> +		// Either idle or offline for a short period
> +	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_RUNNING) != os);
> +	if (os == TICK_SCHED_REMOTE_OFFLINE) {
> +		twork->cpu = cpu;
> +		INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> +		queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +	}
>  }
>  
>  #ifdef CONFIG_HOTPLUG_CPU
>  static void sched_tick_stop(int cpu)
>  {
> +	int os;
>  	struct tick_work *twork;
>  
>  	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> @@ -3128,7 +3176,13 @@ static void sched_tick_stop(int cpu)
>  	WARN_ON_ONCE(!tick_work_cpu);
>  
>  	twork = per_cpu_ptr(tick_work_cpu, cpu);
> -	cancel_delayed_work_sync(&twork->work);
> +	// There cannot be competing actions, but don't rely on stop_machine.
> +	do {
> +		os = READ_ONCE(twork->state);
> +		WARN_ON_ONCE(os != TICK_SCHED_REMOTE_RUNNING);
> +		// Either idle or offline for a short period
> +	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_OFFLINING) != os);
> +	// Don't cancel, as this would mess up the state machine.
>  }
>  #endif /* CONFIG_HOTPLUG_CPU */

While not wrong, it seems overly complicated; can't we do something
like:

tick:
	state = atomic_read(->state);
	if (state) {
		WARN_ON_ONCE(state != OFFLINING);
		if (atomic_inc_not_zero(->state))
			return;
	}
	queue_delayed_work();


stop:
	/*
	 * This is hotplug; even without stop-machine, there cannot be
	 * concurrency on offlining specific CPUs.
	 */
	state = atomic_read(->state);
	WARN_ON_ONCE(state != RUNNING);
	atomic_set(->state, OFFLINING);


start:
	state = atomic_xchg(->state, RUNNING);
	WARN_ON_ONCE(state == RUNNING);
	if (state == OFFLINE) {
		// ...
		queue_delayed_work();
	}



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
@ 2019-06-19 18:19 Paul E. McKenney
  2019-06-20 12:10 ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-06-19 18:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, peterz, frederic, tglx

[ Hearing no objections and given no test failures in multiple weeks of
  rcutorture testing, I intend to submit this to the upcoming merge
  window.  Thoughts? ]

The TASKS03 and TREE04 rcutorture scenarios produce the following
lockdep complaint:

------------------------------------------------------------------------

================================
WARNING: inconsistent lock state
5.2.0-rc1+ #513 Not tainted
--------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
(____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
{IN-HARDIRQ-W} state was registered at:
  lock_acquire+0xb0/0x1c0
  _raw_spin_lock_irqsave+0x3c/0x50
  tick_broadcast_switch_to_oneshot+0xd/0x40
  tick_switch_to_oneshot+0x4f/0xd0
  hrtimer_run_queues+0xf3/0x130
  run_local_timers+0x1c/0x50
  update_process_times+0x1c/0x50
  tick_periodic+0x26/0xc0
  tick_handle_periodic+0x1a/0x60
  smp_apic_timer_interrupt+0x80/0x2a0
  apic_timer_interrupt+0xf/0x20
  _raw_spin_unlock_irqrestore+0x4e/0x60
  rcu_nocb_gp_kthread+0x15d/0x590
  kthread+0xf3/0x130
  ret_from_fork+0x3a/0x50
irq event stamp: 171
hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
softirqs last disabled at (0): [<0000000000000000>] 0x0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(tick_broadcast_lock);
  <Interrupt>
    lock(tick_broadcast_lock);

 *** DEADLOCK ***

1 lock held by migration/1/14:
 #0: (____ptrval____) (clockevents_lock){+.+.}, at: tick_offline_cpu+0xf/0x30

stack backtrace:
CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.2.0-rc1+ #513
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
Call Trace:
 dump_stack+0x5e/0x8b
 print_usage_bug+0x1fc/0x216
 ? print_shortest_lock_dependencies+0x1b0/0x1b0
 mark_lock+0x1f2/0x280
 __lock_acquire+0x1e0/0x18f0
 ? __lock_acquire+0x21b/0x18f0
 ? _raw_spin_unlock_irqrestore+0x4e/0x60
 lock_acquire+0xb0/0x1c0
 ? tick_broadcast_offline+0xf/0x70
 _raw_spin_lock+0x33/0x40
 ? tick_broadcast_offline+0xf/0x70
 tick_broadcast_offline+0xf/0x70
 tick_offline_cpu+0x16/0x30
 take_cpu_down+0x7d/0xa0
 multi_cpu_stop+0xa2/0xe0
 ? cpu_stop_queue_work+0xc0/0xc0
 cpu_stopper_thread+0x6d/0x100
 smpboot_thread_fn+0x169/0x240
 kthread+0xf3/0x130
 ? sort_range+0x20/0x20
 ? kthread_cancel_delayed_work_sync+0x10/0x10
 ret_from_fork+0x3a/0x50

------------------------------------------------------------------------

To reproduce, run the following rcutorture test:

        tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04"

It turns out that tick_broadcast_offline() was an innocent bystander.
After all, interrupts are supposed to be disabled throughout
take_cpu_down(), and therefore should have been disabled upon entry to
tick_offline_cpu() and thus to tick_broadcast_offline().  This suggests
that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
and leaving them enabled on return.

Some debugging code showed that the culprit was sched_cpu_dying().
It had irqs enabled after return from sched_tick_stop().  Which in turn
had irqs enabled after return from cancel_delayed_work_sync().  Which is a
wrapper around __cancel_work_timer().  Which can sleep in the case where
something else is concurrently trying to cancel the same delayed work,
and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
idea when you are invoked from take_cpu_down(), regardless of the state
you leave interrupts in upon return.

Code inspection located no reason why the delayed work absolutely
needed to be canceled from sched_tick_stop():  The work is not
bound to the outgoing CPU by design, given that the whole point is
to collect statistics without disturbing the outgoing CPU.

This commit therefore simply drops the cancel_delayed_work_sync() from
sched_tick_stop().  Instead, a new ->state field is added to the tick_work
structure so that the delayed-work handler function sched_tick_remote()
can avoid reposting itself.  A cpu_is_offline() check is also added to
sched_tick_remote() to avoid mucking with the state of an offlined CPU
(though it does appear safe to do so).  The sched_tick_start() and
sched_tick_stop() functions also update ->state, and sched_tick_start()
also schedules the delayed work if ->state indicates that it is not
already in flight.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 102dfcf0a29a..8409c83aa5fa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3050,14 +3050,44 @@ void scheduler_tick(void)
 
 struct tick_work {
 	int			cpu;
+	int			state;
 	struct delayed_work	work;
 };
+// Values for ->state, see diagram below.
+#define TICK_SCHED_REMOTE_OFFLINE	0
+#define TICK_SCHED_REMOTE_RUNNING	1
+#define TICK_SCHED_REMOTE_OFFLINING	2
+
+// State diagram for ->state:
+//
+//
+//      +----->OFFLINE--------------------------+
+//      |                                       |
+//      |                                       |
+//      |                                       | sched_tick_start()
+//      | sched_tick_remote()                   |
+//      |                                       |
+//      |                                       V
+//      |                        +---------->RUNNING
+//      |                        |              |
+//      |                        |              |
+//      |                        |              |
+//      |     sched_tick_start() |              | sched_tick_stop()
+//      |                        |              |
+//      |                        |              |
+//      |                        |              |
+//      +--------------------OFFLINING<---------+
+//
+//
+// Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
+// and sched_tick_start() are happy to leave the state in RUNNING.
 
 static struct tick_work __percpu *tick_work_cpu;
 
 static void sched_tick_remote(struct work_struct *work)
 {
 	struct delayed_work *dwork = to_delayed_work(work);
+	int os;
 	struct tick_work *twork = container_of(dwork, struct tick_work, work);
 	int cpu = twork->cpu;
 	struct rq *rq = cpu_rq(cpu);
@@ -3077,7 +3107,7 @@ static void sched_tick_remote(struct work_struct *work)
 
 	rq_lock_irq(rq, &rf);
 	curr = rq->curr;
-	if (is_idle_task(curr))
+	if (is_idle_task(curr) || cpu_is_offline(cpu))
 		goto out_unlock;
 
 	update_rq_clock(rq);
@@ -3097,13 +3127,22 @@ static void sched_tick_remote(struct work_struct *work)
 	/*
 	 * Run the remote tick once per second (1Hz). This arbitrary
 	 * frequency is large enough to avoid overload but short enough
-	 * to keep scheduler internal stats reasonably up to date.
+	 * to keep scheduler internal stats reasonably up to date.  But
+	 * first update state to reflect hotplug activity if required.
 	 */
-	queue_delayed_work(system_unbound_wq, dwork, HZ);
+	do {
+		os = READ_ONCE(twork->state);
+		WARN_ON_ONCE(os == TICK_SCHED_REMOTE_OFFLINE);
+		if (os == TICK_SCHED_REMOTE_RUNNING)
+			break;
+	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_OFFLINE) != os);
+	if (os == TICK_SCHED_REMOTE_RUNNING)
+		queue_delayed_work(system_unbound_wq, dwork, HZ);
 }
 
 static void sched_tick_start(int cpu)
 {
+	int os;
 	struct tick_work *twork;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
@@ -3112,14 +3151,23 @@ static void sched_tick_start(int cpu)
 	WARN_ON_ONCE(!tick_work_cpu);
 
 	twork = per_cpu_ptr(tick_work_cpu, cpu);
-	twork->cpu = cpu;
-	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
-	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+	do {
+		os = READ_ONCE(twork->state);
+		if (WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING))
+			break;
+		// Either idle or offline for a short period
+	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_RUNNING) != os);
+	if (os == TICK_SCHED_REMOTE_OFFLINE) {
+		twork->cpu = cpu;
+		INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
+		queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+	}
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
 static void sched_tick_stop(int cpu)
 {
+	int os;
 	struct tick_work *twork;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
@@ -3128,7 +3176,13 @@ static void sched_tick_stop(int cpu)
 	WARN_ON_ONCE(!tick_work_cpu);
 
 	twork = per_cpu_ptr(tick_work_cpu, cpu);
-	cancel_delayed_work_sync(&twork->work);
+	// There cannot be competing actions, but don't rely on stop_machine.
+	do {
+		os = READ_ONCE(twork->state);
+		WARN_ON_ONCE(os != TICK_SCHED_REMOTE_RUNNING);
+		// Either idle or offline for a short period
+	} while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_OFFLINING) != os);
+	// Don't cancel, as this would mess up the state machine.
 }
 #endif /* CONFIG_HOTPLUG_CPU */
 


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2019-06-28 12:18 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-27 14:39 [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint Paul E. McKenney
2019-05-28 20:07 ` Thomas Gleixner
2019-05-29 18:19   ` Paul E. McKenney
2019-05-30 12:58     ` Paul E. McKenney
2019-05-31  1:36       ` Frederic Weisbecker
2019-05-31 13:43         ` Paul E. McKenney
2019-06-19 18:19 Paul E. McKenney
2019-06-20 12:10 ` Peter Zijlstra
2019-06-20 16:01   ` Paul E. McKenney
2019-06-20 21:10     ` Peter Zijlstra
2019-06-20 22:13       ` Paul E. McKenney
2019-06-21 10:55         ` Peter Zijlstra
2019-06-21 12:16           ` Paul E. McKenney
2019-06-21 12:29             ` Peter Zijlstra
2019-06-21 13:34               ` Paul E. McKenney
2019-06-21 17:41                 ` Paul E. McKenney
2019-06-21 17:50                   ` Paul E. McKenney
2019-06-21 23:46                     ` Paul E. McKenney
2019-06-24 23:12                       ` Frederic Weisbecker
2019-06-24 23:44                         ` Paul E. McKenney
2019-06-25  0:43                           ` Frederic Weisbecker
2019-06-25  2:05                             ` Paul E. McKenney
2019-06-25  7:51                             ` Peter Zijlstra
2019-06-25 12:25                               ` Frederic Weisbecker
2019-06-25 13:54                                 ` Paul E. McKenney
2019-06-25 14:05                                   ` Peter Zijlstra
2019-06-25 14:16                                     ` Paul E. McKenney
2019-06-25 16:20                                       ` Frederic Weisbecker
2019-06-25 16:52                                         ` Paul E. McKenney
2019-06-28  7:37                                           ` Peter Zijlstra
2019-06-28 12:17                                             ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).