LKML Archive on lore.kernel.org
* sched: memory corruption on completing completions
@ 2015-02-04 23:24 Sasha Levin
  2015-02-04 23:46 ` Andrew Morton
  2015-02-05  0:16 ` Linus Torvalds
  0 siblings, 2 replies; 14+ messages in thread
From: Sasha Levin @ 2015-02-04 23:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrew Morton, Linus Torvalds, Andrey Ryabinin, Dave Jones, LKML

Hi all,

I was fuzzing with trinity on a -next kernel with the KASan patchset, and
got what initially appeared to be a rather odd trace:

[  856.817966] BUG: AddressSanitizer: out of bounds on stack in do_raw_spin_unlock+0x417/0x4f0 at addr ffff8803875c7c42
[  856.817966] Read of size 2 by task migration/15/140
[  856.821565] page:ffffea000e1d71c0 count:0 mapcount:0 mapping:          (null) index:0x0
[  856.821565] flags: 0x2efffff80000000()
[  856.821565] page dumped because: kasan: bad access detected
[  856.821565] CPU: 15 PID: 140 Comm: migration/15 Not tainted 3.19.0-rc5-next-20150123-sasha-00061-g527ff0d-dirty #1814
[  856.841712]  0000000000000000 0000000000000000 ffff8804d0447aa0 ffff8804d04479e8
[  856.843478]  ffffffff92eb033b 1ffffd4001c3ae3f ffffea000e1d71c0 ffff8804d0447a88
[  856.843478]  ffffffff81b4c292 ffff8804d0447a58 ffffffff815ef1e1 0000000000000000
[  856.843478] Call Trace:
[  856.843478] dump_stack (lib/dump_stack.c:52)
[  856.843478] kasan_report_error (mm/kasan/report.c:136 mm/kasan/report.c:194)
[  856.843478] ? __lock_is_held (kernel/locking/lockdep.c:3518)
[  856.843478] ? sched_ttwu_pending (kernel/sched/core.c:4921)
[  856.843478] __asan_report_load2_noabort (mm/kasan/report.c:234)
[  856.843478] ? do_raw_spin_unlock (./arch/x86/include/asm/spinlock.h:157 kernel/locking/spinlock_debug.c:159)
[  856.843478] do_raw_spin_unlock (./arch/x86/include/asm/spinlock.h:157 kernel/locking/spinlock_debug.c:159)
[  856.843478] ? do_raw_spin_trylock (kernel/locking/spinlock_debug.c:157)
[  856.843478] _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:162 kernel/locking/spinlock.c:191)
[  856.843478] complete (kernel/sched/completion.c:37)
[  856.843478] cpu_stop_signal_done (kernel/stop_machine.c:69)
[  856.843478] cpu_stopper_thread (include/linux/spinlock.h:342 kernel/stop_machine.c:456)
[  856.843478] ? irq_cpu_stop_queue_work (kernel/stop_machine.c:449)
[  856.843478] ? do_raw_spin_trylock (kernel/locking/spinlock_debug.c:157)
[  856.843478] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/paravirt.h:809 include/linux/spinlock_api_smp.h:162 kernel/locking/spinlock.c:191)
[  856.843478] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
[  856.843478] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:163 kernel/locking/spinlock.c:191)
[  856.843478] smpboot_thread_fn (kernel/smpboot.c:161)
[  856.843478] ? smpboot_unpark_thread (kernel/smpboot.c:105)
[  856.843478] ? schedule (kernel/sched/core.c:2853)
[  856.843478] ? __kthread_parkme (./arch/x86/include/asm/current.h:14 kernel/kthread.c:164)
[  856.843478] ? smpboot_unpark_thread (kernel/smpboot.c:105)
[  856.843478] kthread (kernel/kthread.c:207)
[  856.843478] ? flush_kthread_work (kernel/kthread.c:176)
[  856.843478] ? schedule_tail (./arch/x86/include/asm/preempt.h:95 kernel/sched/core.c:2318)
[  856.843478] ? flush_kthread_work (kernel/kthread.c:176)
[  856.843478] ret_from_fork (arch/x86/kernel/entry_64.S:349)
[  856.843478] ? flush_kthread_work (kernel/kthread.c:176)
[  856.843478] Memory state around the buggy address:
[  856.843478]  ffff8803875c7b00: f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  856.843478]  ffff8803875c7b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  856.843478] >ffff8803875c7c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1
[  856.843478]                                            ^
[  856.843478]  ffff8803875c7c80: f1 f1 f1 00 f4 f4 f4 f2 f2 f2 f2 00 00 f4 f4 00
[  856.843478]  ffff8803875c7d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

We see this trace only now because Andrey has recently made KASan able to detect issues
with memory on the stack, rather than just kmalloc()ed memory, so the underlying bug has
probably been there for a while.

Initially we couldn't explain it. It was reproducing often, and always on the
spin_unlock_irqrestore() call at the end of the complete() code.

I now have a theory for why it happens:

Thread A				Thread B
----------------------------------------------------------

[Enter function]
DECLARE_COMPLETION_ONSTACK(x)
wait_for_completion(x)
					complete(x)
					[In complete(x):]
				        spin_lock_irqsave(&x->wait.lock, flags);
				        x->done++;
				        __wake_up_locked(&x->wait, TASK_NORMAL, 1);
[Done waiting, wakes up]
[Exit function]
				        spin_unlock_irqrestore(&x->wait.lock, flags);



So the spin_unlock_irqrestore() at the end of complete() would proceed to corrupt
the stack of thread A.


Thanks,
Sasha

* Re: sched: memory corruption on completing completions
  2015-02-04 23:24 sched: memory corruption on completing completions Sasha Levin
@ 2015-02-04 23:46 ` Andrew Morton
  2015-02-05  0:16 ` Linus Torvalds
  1 sibling, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2015-02-04 23:46 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Andrey Ryabinin,
	Dave Jones, LKML

On Wed, 04 Feb 2015 18:24:06 -0500 Sasha Levin <sasha.levin@oracle.com> wrote:

> Hi all,
> 
> I was fuzzing with trinity on a -next kernel with the KASan patchset, and
> got what initially appeared to be a rather odd trace:
> 
> ...
>
> 
> I now have a theory for why it happens:
> 
> Thread A				Thread B
> ----------------------------------------------------------
> 
> [Enter function]
> DECLARE_COMPLETION_ONSTACK(x)
> wait_for_completion(x)
> 					complete(x)
> 					[In complete(x):]
> 				        spin_lock_irqsave(&x->wait.lock, flags);
> 				        x->done++;
> 				        __wake_up_locked(&x->wait, TASK_NORMAL, 1);
> [Done waiting, wakes up]
> [Exit function]
> 				        spin_unlock_irqrestore(&x->wait.lock, flags);
> 
> 
> 
> So the spin_unlock_irqrestore() at the end of complete() would proceed to corrupt
> the stack of thread A.

But wait_for_completion() takes ->wait.lock as well, which should
provide the needed synchronization (__wait_for_common,
do_wait_for_common).  I'm not seeing a hole in the logic, but it looks
like there might be one.

* Re: sched: memory corruption on completing completions
  2015-02-04 23:24 sched: memory corruption on completing completions Sasha Levin
  2015-02-04 23:46 ` Andrew Morton
@ 2015-02-05  0:16 ` Linus Torvalds
  2015-02-05  7:10   ` Ingo Molnar
                     ` (2 more replies)
  1 sibling, 3 replies; 14+ messages in thread
From: Linus Torvalds @ 2015-02-05  0:16 UTC (permalink / raw)
  To: Sasha Levin, Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Andrey Ryabinin,
	Dave Jones, LKML

On Wed, Feb 4, 2015 at 3:24 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>
> I now have a theory for why it happens:
>
> Thread A                                Thread B
> ----------------------------------------------------------
>
> [Enter function]
> DECLARE_COMPLETION_ONSTACK(x)
> wait_for_completion(x)
>                                         complete(x)
>                                         [In complete(x):]
>                                         spin_lock_irqsave(&x->wait.lock, flags);
>                                         x->done++;
>                                         __wake_up_locked(&x->wait, TASK_NORMAL, 1);
> [Done waiting, wakes up]
> [Exit function]
>                                         spin_unlock_irqrestore(&x->wait.lock, flags);
>
> So the spin_unlock_irqrestore() at the end of complete() would proceed to corrupt
> the stack of thread A.

We have indeed had bugs like that, but it *shouldn't* be the case any
more. The hard rule for spinlocks is:

 - the last thing *unlock* does is to release the lock with a single
store. After that has happened, it will not touch it, and before that
has happened, nobody else can get the spinlock

which means that you cannot actually get the case you are talking
about (because the "done waiting, wakes up" in thread A happens with
the spinlock held).

That said, as mentioned, we have had bugs in spinlock debugging code
in particular, where the last unlock does other things after it
releases the low-level lock. And you are clearly not using normal
spinlocks, since your error trace has this:

   do_raw_spin_unlock+0x417/0x4f0

which means that do_raw_spin_unlock() is 1250+ bytes in size.  A
"normal" do_raw_spin_unlock() is inlined because it's a single store
operation.

Looks like the debug version of do_raw_spin_unlock() is slightly
larger than that ;) Anyway, apparently that version of
do_raw_spin_unlock() isn't just huge, it's also buggy. Although I
thought we fixed that bug some time ago, and looking at the code it
does

    void do_raw_spin_unlock(raw_spinlock_t *lock)
    {
            debug_spin_unlock(lock);
            arch_spin_unlock(&lock->raw_lock);
    }

so it *looks* ok in the generic code.

So off to look at the arch code...

And looking at the arch version, I think the paravirtualized code is crap.

It does:

                prev = *lock;
                add_smp(&lock->tickets.head, TICKET_LOCK_INC);

                /* add_smp() is a full mb() */

                if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
                        __ticket_unlock_slowpath(lock, prev);


which is *exactly* the kind of thing you cannot do with spinlocks,
because after you've done the "add_smp()" and released the spinlock
for the fast-path, you can't access the spinlock any more.  Exactly
because a fast-path lock might come in, and release the whole data
structure.

As usual, the paravirt code is a horribly buggy heap of crud. Film at 11.

Why did I think we had this bug but already fixed it ? Maybe it's one
of those things that Waiman fixed in his long delayed qspinlock
series? Waiman? Or maybe I just remember the fixes where we changed
from a mutex to a spinlock, which fixed a very similar case for
non-paravirtualized cases.

                             Linus

* Re: sched: memory corruption on completing completions
  2015-02-05  0:16 ` Linus Torvalds
@ 2015-02-05  7:10   ` Ingo Molnar
  2015-02-05  9:30   ` Peter Zijlstra
  2015-02-05 20:59   ` Davidlohr Bueso
  2 siblings, 0 replies; 14+ messages in thread
From: Ingo Molnar @ 2015-02-05  7:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sasha Levin, Waiman Long, Peter Zijlstra, Andrew Morton,
	Andrey Ryabinin, Dave Jones, LKML


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [...]
> 
> As usual, the paravirt code is a horribly buggy heap of crud. 
> Film at 11.
> 
> Why did I think we had this bug but already fixed it ? Maybe 
> it's one of those things that Waiman fixed in his long delayed 
> qspinlock series? Waiman? Or maybe I just remember the fixes 
> where we changed from a mutex to a spinlock, which fixed a very 
> similar case for non-paravirtualized cases.

We definitely had such a high profile bug in the native case, 
years before the qspinlock series came along. I don't remember 
the specifics anymore, but maybe it was in the VFS code?

Thanks,

	Ingo

* Re: sched: memory corruption on completing completions
  2015-02-05  0:16 ` Linus Torvalds
  2015-02-05  7:10   ` Ingo Molnar
@ 2015-02-05  9:30   ` Peter Zijlstra
  2015-02-05 20:44     ` Sasha Levin
  2015-02-05 20:59   ` Davidlohr Bueso
  2 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2015-02-05  9:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sasha Levin, Waiman Long, Ingo Molnar, Andrew Morton,
	Andrey Ryabinin, Dave Jones, LKML

On Wed, Feb 04, 2015 at 04:16:54PM -0800, Linus Torvalds wrote:
> 
> Why did I think we had this bug but already fixed it ? Maybe it's one
> of those things that Waiman fixed in his long delayed qspinlock
> series? Waiman? 

ISTR that that would do the exact same thing, but I need to go look at
the latest paravirt code -- that's the part that we all were still
bothered with.

> Or maybe I just remember the fixes where we changed
> from a mutex to a spinlock, which fixed a very similar case for
> non-paravirtualized cases.

I think we had a non-paravirt mutex case last year or so.

* Re: sched: memory corruption on completing completions
  2015-02-05  9:30   ` Peter Zijlstra
@ 2015-02-05 20:44     ` Sasha Levin
  0 siblings, 0 replies; 14+ messages in thread
From: Sasha Levin @ 2015-02-05 20:44 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Waiman Long, Ingo Molnar, Andrew Morton, Andrey Ryabinin,
	Dave Jones, LKML

On 02/05/2015 04:30 AM, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 04:16:54PM -0800, Linus Torvalds wrote:
>> > Why did I think we had this bug but already fixed it ? Maybe it's one
>> > of those things that Waiman fixed in his long delayed qspinlock
>> > series? Waiman? 
> ISTR that that would do the exact same thing, but I need to go look a
> the latest paravirt code -- that's the part that we all were still
> bothered with.

Testing Linus's explanation, I tried simply:


diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 7050d86..54454da 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -142,7 +142,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
                __ticket_unlock_kick(lock, old.tickets.head);
        }
 }
-
+static inline int arch_spin_is_locked(arch_spinlock_t *lock);
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
        if (TICKET_SLOWPATH_FLAG &&
@@ -153,7 +153,7 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
                add_smp(&lock->tickets.head, TICKET_LOCK_INC);

                /* add_smp() is a full mb() */
-
+               WARN_ON(arch_spin_is_locked(lock));
                if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
                        __ticket_unlock_slowpath(lock, prev);
        } else


And the warnings confirmed that the lock is indeed "unlocked" before we finished
arch_spin_unlock()...


Thanks,
Sasha

* Re: sched: memory corruption on completing completions
  2015-02-05  0:16 ` Linus Torvalds
  2015-02-05  7:10   ` Ingo Molnar
  2015-02-05  9:30   ` Peter Zijlstra
@ 2015-02-05 20:59   ` Davidlohr Bueso
  2015-02-05 21:02     ` Sasha Levin
  2 siblings, 1 reply; 14+ messages in thread
From: Davidlohr Bueso @ 2015-02-05 20:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sasha Levin, Waiman Long, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Andrey Ryabinin, Dave Jones, LKML,
	Raghavendra K T

On Wed, 2015-02-04 at 16:16 -0800, Linus Torvalds wrote:
> And looking at the arch version, I think the paravirtualized code is crap.
> 
> It does:
> 
>                 prev = *lock;
>                 add_smp(&lock->tickets.head, TICKET_LOCK_INC);
> 
>                 /* add_smp() is a full mb() */
> 
>                 if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
>                         __ticket_unlock_slowpath(lock, prev);
> 
> 
> which is *exactly* the kind of thing you cannot do with spinlocks,
> because after you've done the "add_smp()" and released the spinlock
> for the fast-path, you can't access the spinlock any more.  Exactly
> because a fast-path lock might come in, and release the whole data
> structure.
> 
> As usual, the paravirt code is a horribly buggy heap of crud. Film at 11.

See http://lwn.net/Articles/495597/, which clearly describes the intent
of the slowpath unlocking. Cc'ing Raghavendra.

Thanks,
Davidlohr


* Re: sched: memory corruption on completing completions
  2015-02-05 20:59   ` Davidlohr Bueso
@ 2015-02-05 21:02     ` Sasha Levin
  2015-02-05 21:34       ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Sasha Levin @ 2015-02-05 21:02 UTC (permalink / raw)
  To: Davidlohr Bueso, Linus Torvalds
  Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Andrey Ryabinin, Dave Jones, LKML, Raghavendra K T

On 02/05/2015 03:59 PM, Davidlohr Bueso wrote:
> On Wed, 2015-02-04 at 16:16 -0800, Linus Torvalds wrote:
>> And looking at the arch version, I think the paravirtualized code is crap.
>>
>> It does:
>>
>>                 prev = *lock;
>>                 add_smp(&lock->tickets.head, TICKET_LOCK_INC);
>>
>>                 /* add_smp() is a full mb() */
>>
>>                 if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
>>                         __ticket_unlock_slowpath(lock, prev);
>>
>>
>> which is *exactly* the kind of thing you cannot do with spinlocks,
>> because after you've done the "add_smp()" and released the spinlock
>> for the fast-path, you can't access the spinlock any more.  Exactly
>> because a fast-path lock might come in, and release the whole data
>> structure.
>>
>> As usual, the paravirt code is a horribly buggy heap of crud. Film at 11.
> 
> Per http://lwn.net/Articles/495597/ which clearly describes the intent
> of the slowpath unlocking. Cc'ing Raghavendra.

Interestingly enough, according to that article this behaviour seems to be
"by design":

"""
This version of the patch uses a locked add to do this, followed by a test
to see if the slowflag is set.  The lock prefix acts as a full memory barrier,
so we can be sure that other CPUs will have seen the unlock before we read
the flag
"""

Thanks,
Sasha


* Re: sched: memory corruption on completing completions
  2015-02-05 21:02     ` Sasha Levin
@ 2015-02-05 21:34       ` Linus Torvalds
  2015-02-05 22:37         ` Davidlohr Bueso
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2015-02-05 21:34 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Davidlohr Bueso, Waiman Long, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Andrey Ryabinin, Dave Jones, LKML,
	Raghavendra K T

On Thu, Feb 5, 2015 at 1:02 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>
> Interestingly enough, according to that article this behaviour seems to be
> "by design":

Oh, it's definitely by design, it's just that the design looked at
spinlocks without the admittedly very subtle issue of lifetime vs
unlocking.

Spinlocks (and completions) are special - for other locks we have
basically allowed lifetimes to be separate from the lock state, and if
you have a data structure with a mutex in it, you'll have to have some
separate lifetime rule outside of the lock itself. But spinlocks and
completions have their locking state tied into their lifetime.
Completions are so very much by design (because dynamic completions on
the stack is one of the core use cases), and spinlocks do it because
in some cases you cannot sanely avoid it (and one of those cases is
the implementation of completions - they aren't actually first-class
locking primitives of their own, although they actually *used* to be,
originally).

It is possible that the paravirt spinlocks could be saved by:

 - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath locking code.

 - making sure that the "unlock" path never does a *write* to the
possibly stale lock. KASan would still complain about the read, but we
could just say that it's a speculative read - bad form, but not
disastrous.

but right now they are clearly horribly broken. Because the unlock
path doesn't just read the value, it can end up writing to it too.

Looking at the Xen version, for example, the unlock_kick part looks
fine. It just compares the pointer to the lock against its data
structures, and doesn't touch the lock itself. Which is fine.

But right now, the real problem is that "__ticket_unlock_slowpath()"
will clear the TICKET_SLOWPATH_FLAG on a lock that has already been
unlocked - which means that it may actually *change* a byte that has
already been freed. It might not be a spinlock any more, it might have
been reused for something else, and might be anything else, and
clearing TICKET_SLOWPATH_FLAG might be clearing a random bit in a
kernel pointer or other value..

It is *very* hard to hit, though. And it obviously doesn't happen in
raw hardware, but paravirt people need to think hard about this.

                       Linus

* Re: sched: memory corruption on completing completions
  2015-02-05 21:34       ` Linus Torvalds
@ 2015-02-05 22:37         ` Davidlohr Bueso
  2015-02-05 22:57           ` Linus Torvalds
  2015-02-06 12:29           ` Raghavendra K T
  0 siblings, 2 replies; 14+ messages in thread
From: Davidlohr Bueso @ 2015-02-05 22:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sasha Levin, Waiman Long, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Andrey Ryabinin, Dave Jones, LKML,
	Raghavendra K T

On Thu, 2015-02-05 at 13:34 -0800, Linus Torvalds wrote:
> On Thu, Feb 5, 2015 at 1:02 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
> >
> > Interestingly enough, according to that article this behaviour seems to be
> > "by design":
> 
> Oh, it's definitely by design, it's just that the design looked at
> spinlocks without the admittedly very subtle issue of lifetime vs
> unlocking.
> 
> Spinlocks (and completions) are special - for other locks we have
> basically allowed lifetimes to be separate from the lock state, and if
> you have a data structure with a mutex in it, you'll have to have some
> separate lifetime rule outside of the lock itself. But spinlocks and
> completions have their locking state tied into their lifetime.

For spinlocks I find this very much a virtue. Tight lifetimes allow the
overall locking logic to be *simple* - keeping people from being "smart"
and bloating up spinlocks. Similarly, I hate how the paravirt
alternative blends in with regular (sane) bare metal code. What was
preventing this instead??

#ifdef CONFIG_PARAVIRT_SPINLOCKS
static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
	if (!static_key_false(&paravirt_ticketlocks_enabled))
		return;

	add_smp(&lock->tickets.head, TICKET_LOCK_INC);
	/* Do slowpath tail stuff... */
}
#else
static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
	__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
}
#endif

I just don't see the point to all this TICKET_SLOWPATH_FLAG:

#ifdef CONFIG_PARAVIRT_SPINLOCKS
#define __TICKET_LOCK_INC	2
#define TICKET_SLOWPATH_FLAG	((__ticket_t)1)
#else
#define __TICKET_LOCK_INC	1
#define TICKET_SLOWPATH_FLAG	((__ticket_t)0)
#endif

when it is only for paravirt -- and the word slowpath implies the
general steps as part of the generic algorithm. Let's keep code for
simple locks simple.

> Completions are so very much by design (because dynamic completions on
> the stack is one of the core use cases), and spinlocks do it because
> in some cases you cannot sanely avoid it (and one of those cases is
> the implementation of completions - they aren't actually first-class
> locking primitives of their own, although they actually *used* to be,
> originally).
> 
> It is possible that the paravirt spinlocks could be saved by:
> 
>  - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath locking code.

Ouch, to avoid deadlocks they explicitly need the unlock to occur before
the slowpath tail flag is read.

>  - making sure that the "unlock" path never does a *write* to the
> possibly stale lock. KASan would still complain about the read, but we
> could just say that it's a speculative read - bad form, but not
> disastrous.

Yeah, you just cannot have a slowpath without reads or writes :D

Thanks,
Davidlohr


* Re: sched: memory corruption on completing completions
  2015-02-05 22:37         ` Davidlohr Bueso
@ 2015-02-05 22:57           ` Linus Torvalds
  2015-02-06  6:48             ` Raghavendra K T
  2015-02-06 12:29           ` Raghavendra K T
  1 sibling, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2015-02-05 22:57 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Sasha Levin, Waiman Long, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Andrey Ryabinin, Dave Jones, LKML,
	Raghavendra K T

On Thu, Feb 5, 2015 at 2:37 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>>
>> It is possible that the paravirt spinlocks could be saved by:
>>
>>  - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath locking code.
>
> Ouch, to avoid deadlocks they explicitly need the unlock to occur before
> the slowpath tail flag is read.

Well, just make the unlock do the actual real unlock operation
("fastpath"), leaving the TICKET_SLOWPATH_FLAG alone. The unlock path
can *test* the flag and do whatever kicking it needs to get people
started again, but not clear it again.

Then, the clearing could be done by the lockers. That way the unlock
path at least doesn't change the spinlock any more - since the
spinlock might have been free'd at any time after the actual unlock.

Or something. I'm handwaving. I really dislike all the paravirt crap
we do. It tends to be really ugly, and I'm not just talking about
spinlocks. TLB flushing etc tends to get interesting too.

> Yeah, you just cannot have a slowpath without reads or writes :D

Well, you *could* do the slow-path while holding the lock, and making
the actual lock release be the last part of the operation (and do the
final unlock with a "cmpxchg" that fails and re-does the slowpath if
something changed). Then the slowpath can read and write the lock all
it wants.

But people generally want to avoid that, since it makes the hold times
longer. So the "drop the lock, *then* test if we should do some slow
path fixup" tends to be the preferred model, it's just that that model
is broken due to the lifetime rules. Making it read-only would at
least make it a lot less broken..

                             Linus

* Re: sched: memory corruption on completing completions
  2015-02-05 22:57           ` Linus Torvalds
@ 2015-02-06  6:48             ` Raghavendra K T
  2015-02-06 15:00               ` Raghavendra K T
  0 siblings, 1 reply; 14+ messages in thread
From: Raghavendra K T @ 2015-02-06  6:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Sasha Levin, Waiman Long, Peter Zijlstra,
	Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML

On 02/06/2015 04:27 AM, Linus Torvalds wrote:
> On Thu, Feb 5, 2015 at 2:37 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>>>
>>> It is possible that the paravirt spinlocks could be saved by:
>>>
>>>   - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath locking code.
>>
>> Ouch, to avoid deadlocks they explicitly need the unlock to occur before
>> the slowpath tail flag is read.
>
> Well, just make the unlock do the actual real unlock operation
> ("fastpath"), leaving the TICKET_SLOWPATH_FLAG alone. The unlock path
> can *test* the flag and do whatever kicking it needs to get people
> started again, but not clear it again.
>

This is definitely a good idea, will think more on this
(especially since any remote possibility of forgetting to wake up the
lock-waiter would result in an eventual hang of the KVM guest).
I hope to come up with a solution soon.

/me agreeing with the fact that we did not have the 'lifetime' in mind
during the design :(.


* Re: sched: memory corruption on completing completions
  2015-02-05 22:37         ` Davidlohr Bueso
  2015-02-05 22:57           ` Linus Torvalds
@ 2015-02-06 12:29           ` Raghavendra K T
  1 sibling, 0 replies; 14+ messages in thread
From: Raghavendra K T @ 2015-02-06 12:29 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linus Torvalds, Sasha Levin, Waiman Long, Peter Zijlstra,
	Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML

On 02/06/2015 04:07 AM, Davidlohr Bueso wrote:
> On Thu, 2015-02-05 at 13:34 -0800, Linus Torvalds wrote:
>> On Thu, Feb 5, 2015 at 1:02 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>>>
>>> Interestingly enough, according to that article this behaviour seems to be
>>> "by design":
>>
>> Oh, it's definitely by design, it's just that the design looked at
>> spinlocks without the admittedly very subtle issue of lifetime vs
>> unlocking.
>>
>> Spinlocks (and completions) are special - for other locks we have
>> basically allowed lifetimes to be separate from the lock state, and if
>> you have a data structure with a mutex in it, you'll have to have some
>> separate lifetime rule outside of the lock itself. But spinlocks and
>> completions have their locking state tied into their lifetime.
>
> For spinlocks I find this very much a virtue. Tight lifetimes allow the
> overall locking logic to be *simple* - keeping people from being "smart"
> and bloating up spinlocks. Similarly, I hate how the paravirt
> alternative blends in with regular (sane) bare metal code. What was
> preventing this instead??
>
> #ifdef CONFIG_PARAVIRT_SPINLOCKS
> static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
> 	if (!static_key_false(&paravirt_ticketlocks_enabled))
> 		return;
>
> 	add_smp(&lock->tickets.head, TICKET_LOCK_INC);
> 	/* Do slowpath tail stuff... */
> }
> #else
> static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
> 	__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
> }
> #endif
>
> I just don't see the point to all this TICKET_SLOWPATH_FLAG:
>
> #ifdef CONFIG_PARAVIRT_SPINLOCKS
> #define __TICKET_LOCK_INC	2
> #define TICKET_SLOWPATH_FLAG	((__ticket_t)1)
> #else
> #define __TICKET_LOCK_INC	1
> #define TICKET_SLOWPATH_FLAG	((__ticket_t)0)
> #endif
>
> when it is only for paravirt -- and the word slowpath implies the
> general steps as part of the generic algorithm. Lets keep code for
> simple locks simple.
>

Good point, I will send this as a separate cleanup once I test the
patch I have to correct the current problem.



* Re: sched: memory corruption on completing completions
  2015-02-06  6:48             ` Raghavendra K T
@ 2015-02-06 15:00               ` Raghavendra K T
  0 siblings, 0 replies; 14+ messages in thread
From: Raghavendra K T @ 2015-02-06 15:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Sasha Levin, Waiman Long, Peter Zijlstra,
	Ingo Molnar, Andrew Morton, Andrey Ryabinin, Dave Jones, LKML

On 02/06/2015 12:18 PM, Raghavendra K T wrote:
> On 02/06/2015 04:27 AM, Linus Torvalds wrote:
>> On Thu, Feb 5, 2015 at 2:37 PM, Davidlohr Bueso <dave@stgolabs.net>
>> wrote:
>>>>
>>>> It is possible that the paravirt spinlocks could be saved by:
>>>>
>>>>   - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath
>>>> locking code.
>>>
>>> Ouch, to avoid deadlocks they explicitly need the unlock to occur before
>>> the slowpath tail flag is read.
>>
>> Well, just make the unlock do the actual real unlock operation
>> ("fastpath"), leaving the TICKET_SLOWPATH_FLAG alone. The unlock path
>> can *test* the flag and do whatever kicking it needs to get people
>> started again, but not clear it again.
>>
>
> This is definitely a good idea, will think more on this.
>

Just sent the patch with that idea, also keeping in mind that
head == tail may not hold for the trylock check
(the last unlock might have left TICKET_SLOWPATH_FLAG set):

http://article.gmane.org/gmane.linux.kernel/1883900

