LKML Archive on lore.kernel.org
* frequent lockups in 3.18rc4
@ 2014-11-14 21:31 Dave Jones
  2014-11-14 22:01 ` Linus Torvalds
  2014-11-17 15:07 ` Don Zickus
  0 siblings, 2 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-14 21:31 UTC (permalink / raw)
  To: Linux Kernel; +Cc: Linus Torvalds

I'm not sure how long this goes back (3.17 was fine, AFAIR) but I'm
seeing these several times a day lately.


NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c129:25570]
irq event stamp: 74224
hardirqs last  enabled at (74223): [<ffffffff9c875664>] restore_args+0x0/0x30
hardirqs last disabled at (74224): [<ffffffff9c8759aa>] apic_timer_interrupt+0x6a/0x80
softirqs last  enabled at (74222): [<ffffffff9c07f43a>] __do_softirq+0x26a/0x6f0
softirqs last disabled at (74209): [<ffffffff9c07fb4d>] irq_exit+0x13d/0x170
CPU: 3 PID: 25570 Comm: trinity-c129 Not tainted 3.18.0-rc4+ #83 [loadavg: 198.04 186.66 181.58 24/442 26708]
task: ffff880213442f00 ti: ffff8801ea714000 task.ti: ffff8801ea714000
RIP: 0010:[<ffffffff9c11e98a>]  [<ffffffff9c11e98a>] generic_exec_single+0xea/0x1d0
RSP: 0018:ffff8801ea717a08  EFLAGS: 00000202
RAX: ffff880213442f00 RBX: ffffffff9c875664 RCX: 0000000000000006
RDX: 0000000000001370 RSI: ffff880213443790 RDI: ffff880213442f00
RBP: ffff8801ea717a68 R08: ffff880242b56690 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801ea717978
R13: ffff880213442f00 R14: ffff8801ea714000 R15: ffff880213442f00
FS:  00007f240994e700(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000004 CR3: 000000019a017000 CR4: 00000000001407e0
DR0: 00007fb3367e0000 DR1: 00007f82542ab000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 ffffffff9ce4c620 0000000000000000 ffffffff9c048b20 ffff8801ea717b18
 0000000000000003 0000000052e0da3d ffffffff9cc7ef3c 0000000000000002
 ffffffff9c048b20 ffff8801ea717b18 0000000000000001 0000000000000003
Call Trace:
 [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
 [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
 [<ffffffff9c11ead6>] smp_call_function_single+0x66/0x110
 [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
 [<ffffffff9c11f021>] smp_call_function_many+0x2f1/0x390
 [<ffffffff9c049300>] flush_tlb_mm_range+0xe0/0x370
 [<ffffffff9c1d95a2>] tlb_flush_mmu_tlbonly+0x42/0x50
 [<ffffffff9c1d9cb5>] tlb_finish_mmu+0x45/0x50
 [<ffffffff9c1daf59>] zap_page_range_single+0x119/0x170
 [<ffffffff9c1db140>] unmap_mapping_range+0x140/0x1b0
 [<ffffffff9c1c7edd>] shmem_fallocate+0x43d/0x540
 [<ffffffff9c0b111b>] ? preempt_count_sub+0xab/0x100
 [<ffffffff9c0cdac7>] ? prepare_to_wait+0x27/0x80
 [<ffffffff9c2287f3>] ? __sb_start_write+0x103/0x1d0
 [<ffffffff9c223aba>] do_fallocate+0x12a/0x1c0
 [<ffffffff9c1f0bd3>] SyS_madvise+0x3d3/0x890
 [<ffffffff9c1a40d2>] ? context_tracking_user_exit+0x52/0x260
 [<ffffffff9c013ebd>] ? syscall_trace_enter_phase2+0x10d/0x3d0
 [<ffffffff9c874c89>] tracesys_phase2+0xd4/0xd9
Code: 63 c7 48 89 de 48 89 df 48 c7 c2 c0 50 1d 00 48 03 14 c5 40 b9 f2 9c e8 d5 ea 2b 00 84 c0 74 0b e9 bc 00 00 00 0f 1f 40 00 f3 90 <f6> 43 18 01 75 f8 31 c0 48 8b 4d c8 65 48 33 0c 25 28 00 00 00 
Kernel panic - not syncing: softlockup: hung tasks


I've got a local hack to dump loadavg on traces, and as you can see in that
example, the machine was really busy, but we were at least making progress
before the trace spewed, and the machine rebooted. (I have the reboot-on-lockup sysctl
set; without it, the machine just wedges indefinitely shortly after the spew).
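
For reference, that sysctl combination is roughly the following (the
timeout value here is illustrative):

    sysctl kernel.softlockup_panic=1   # panic on soft lockup instead of wedging
    sysctl kernel.panic=10             # reboot 10 seconds after any panic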

The trace doesn't really enlighten me as to what we should be doing
to prevent this though.

ideas?
I can try to bisect it, but it takes hours before it happens,
so it might take days to complete, and the next few weeks are
complicated timewise.

	Dave



* Re: frequent lockups in 3.18rc4
  2014-11-14 21:31 frequent lockups in 3.18rc4 Dave Jones
@ 2014-11-14 22:01 ` Linus Torvalds
  2014-11-14 22:30   ` Dave Jones
                     ` (2 more replies)
  2014-11-17 15:07 ` Don Zickus
  1 sibling, 3 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-14 22:01 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel; +Cc: the arch/x86 maintainers

On Fri, Nov 14, 2014 at 1:31 PM, Dave Jones <davej@redhat.com> wrote:
> I'm not sure how long this goes back (3.17 was fine, AFAIR) but I'm
> seeing these several times a day lately.

Hmm. I don't see what would have changed in this area since v3.17.
There's a TLB range fix in mm/memory.c, but for the life of me I can't
see how that would possibly matter the way x86 does TLB flushing (if
the range fix does something bad and the range goes too large, x86
will just end up doing a full TLB invalidate instead).

Plus, judging by the fact that there's a stale "leave_mm+0x210/0x210"
(wouldn't that be the *next* function, namely do_flush_tlb_all())
pointer on the stack, I suspect that whole range-flushing doesn't even
trigger, and we are flushing everything.

But since you say "several times a day", just for fun, can you test
the follow-up patch to that one-liner fix that Will Deacon posted
today (Subject: "[PATCH] mmu_gather: move minimal range calculations
into generic code"). That does some further cleanup in this area.

I don't see any changes to the x86 IPI or TLB flush handling, but
maybe I'm missing something, so I'm adding the x86 maintainers to the
cc.

> I've got a local hack to dump loadavg on traces, and as you can see in that
> example, the machine was really busy, but we were at least making progress
> before the trace spewed, and the machine rebooted. (I have the reboot-on-lockup sysctl
> set; without it, the machine just wedges indefinitely shortly after the spew).
>
> The trace doesn't really enlighten me as to what we should be doing
> to prevent this though.
>
> ideas?

I can't say I have any ideas except to point at the TLB range patch,
and quite frankly, I don't see how that would matter.

If Will's patch doesn't make a difference, what about reverting that
ce9ec37bddb6? Although it really *is* an "obvious bugfix", and I really
don't see why any of this would be noticeable on x86 (it triggered
issues on ARM64, but that was because ARM64 cared much more about the
exact range).

> I can try to bisect it, but it takes hours before it happens,
> so it might take days to complete, and the next few weeks are
 > complicated timewise.

Hmm. Even narrowing it down a bit might help, i.e. if you could get, say,
four bisections in over a day, and see if that at least says "ok, it's
likely one of these pulls".

But yeah, I can see it being painful, so maybe a quick check of the
TLB ones, even if I can't for the life of me see why they would possibly
matter.
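
A narrowing run would look something like the usual dance (assuming
v3.17 really was the last known-good point):

    git bisect start
    git bisect bad v3.18-rc4
    git bisect good v3.17
    # build, boot, beat on it with trinity; then, depending on whether
    # it locked up or survived long enough to be trusted:
    git bisect bad    # or: git bisect good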

                 Linus

---
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c129:25570]
> irq event stamp: 74224
> hardirqs last  enabled at (74223): [<ffffffff9c875664>] restore_args+0x0/0x30
> hardirqs last disabled at (74224): [<ffffffff9c8759aa>] apic_timer_interrupt+0x6a/0x80
> softirqs last  enabled at (74222): [<ffffffff9c07f43a>] __do_softirq+0x26a/0x6f0
> softirqs last disabled at (74209): [<ffffffff9c07fb4d>] irq_exit+0x13d/0x170
> CPU: 3 PID: 25570 Comm: trinity-c129 Not tainted 3.18.0-rc4+ #83 [loadavg: 198.04 186.66 181.58 24/442 26708]
> RIP: 0010:[<ffffffff9c11e98a>]  [<ffffffff9c11e98a>] generic_exec_single+0xea/0x1d0
> Call Trace:
>  [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
>  [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
>  [<ffffffff9c11ead6>] smp_call_function_single+0x66/0x110
>  [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
>  [<ffffffff9c11f021>] smp_call_function_many+0x2f1/0x390
>  [<ffffffff9c049300>] flush_tlb_mm_range+0xe0/0x370
>  [<ffffffff9c1d95a2>] tlb_flush_mmu_tlbonly+0x42/0x50
>  [<ffffffff9c1d9cb5>] tlb_finish_mmu+0x45/0x50
>  [<ffffffff9c1daf59>] zap_page_range_single+0x119/0x170
>  [<ffffffff9c1db140>] unmap_mapping_range+0x140/0x1b0
>  [<ffffffff9c1c7edd>] shmem_fallocate+0x43d/0x540
>  [<ffffffff9c223aba>] do_fallocate+0x12a/0x1c0
>  [<ffffffff9c1f0bd3>] SyS_madvise+0x3d3/0x890
>  [<ffffffff9c874c89>] tracesys_phase2+0xd4/0xd9
> Kernel panic - not syncing: softlockup: hung tasks


* Re: frequent lockups in 3.18rc4
  2014-11-14 22:01 ` Linus Torvalds
@ 2014-11-14 22:30   ` Dave Jones
  2014-11-14 22:55   ` Thomas Gleixner
  2014-11-15 21:34   ` Dave Jones
  2 siblings, 0 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-14 22:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel, the arch/x86 maintainers

On Fri, Nov 14, 2014 at 02:01:27PM -0800, Linus Torvalds wrote:
 
 > Plus, judging by the fact that there's a stale "leave_mm+0x210/0x210"
 > (wouldn't that be the *next* function, namely do_flush_tlb_all())
 > pointer on the stack, I suspect that whole range-flushing doesn't even
 > trigger, and we are flushing everything.
 > 
 > But since you say "several times a day", just for fun, can you test
 > the follow-up patch to that one-liner fix that Will Deacon posted
 > today (Subject: "[PATCH] mmu_gather: move minimal range calculations
 > into generic code"). That does some further cleanup in this area.

I'll give it a shot. Should know by the morning if it changes anything.

 > > The trace doesn't really enlighten me as to what we should be doing
 > > to prevent this though.
 > >
 > > ideas?
 > 
 > I can't say I have any ideas except to point at the TLB range patch,
 > and quite frankly, I don't see how that would matter.
 > 
 > If Will's patch doesn't make a difference, what about reverting that
 > ce9ec37bddb6? Although it really *is* an "obvious bugfix", and I really
 > don't see why any of this would be noticeable on x86 (it triggered
 > issues on ARM64, but that was because ARM64 cared much more about the
 > exact range).

Digging through the serial console logs, there was one other trace variant,
which is even less informative:

NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c104:19168]
irq event stamp: 223186
hardirqs last  enabled at (223185): [<ffffffff941a4092>] context_tracking_user_exit+0x52/0x260
hardirqs last disabled at (223186): [<ffffffff948756aa>] apic_timer_interrupt+0x6a/0x80
softirqs last  enabled at (187030): [<ffffffff9407f43a>] __do_softirq+0x26a/0x6f0
softirqs last disabled at (187017): [<ffffffff9407fb4d>] irq_exit+0x13d/0x170
CPU: 3 PID: 19168 Comm: trinity-c104 Not tainted 3.18.0-rc4+ #82 [loadavg: 99.30 85.88 82.88 9/303 19302]
task: ffff88023f8b4680 ti: ffff880157418000 task.ti: ffff880157418000
RIP: 0010:[<ffffffff941a4094>]  [<ffffffff941a4094>] context_tracking_user_exit+0x54/0x260
RSP: 0018:ffff88015741bee8  EFLAGS: 00000246
RAX: ffff88023f8b4680 RBX: ffffffff940b111b RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88023f8b4680
RBP: ffff88015741bef8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff88015741bf58 R14: ffff88023f8b4ae8 R15: ffff88023f8b4b18
FS:  00007f9a0789b740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000003dfa1b7c90 CR3: 0000000165f3c000 CR4: 00000000001407e0
DR0: 00000000ffffffbf DR1: 00007f2c0c3d9000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 0000000000080000 ffff88015741c000 ffff88015741bf78 ffffffff94013d35
 ffff88015741bf28 ffffffff940d865d 0000000000004b02 0000000000000000
 00007f9a071bb000 ffffffff943d816b 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff94013d35>] syscall_trace_enter_phase1+0x125/0x1a0
 [<ffffffff940d865d>] ? trace_hardirqs_on_caller+0x16d/0x210
 [<ffffffff943d816b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff9487487f>] tracesys+0x14/0x4a
Code: fa e8 51 0a f3 ff 48 c7 c7 26 52 cd 94 e8 f5 21 24 00 65 8b 04 25 f4 f8 1c 00 83 f8 01 74 28 f6 c7 02 74 13 e8 6e 46 f3 ff 53 9d <5b> 41 5c 5d c3 0f 1f 80 00 00 00 00 53 9d e8 19 0a f3 ff eb eb 

It looks like I've been seeing these since 3.18-rc1, though for those
the machine crashed before the trace even made it over usb-serial,
leaving just the "NMI watchdog" line.


 > > I can try to bisect it, but it takes hours before it happens,
 > > so it might take days to complete, and the next few weeks are
 > > complicated timewise.
 > 
 > Hmm. Even narrowing it down a bit might help, i.e. if you could get, say,
 > four bisections in over a day, and see if that at least says "ok, it's
 > likely one of these pulls".
 > 
 > But yeah, I can see it being painful, so maybe a quick check of the
 > TLB ones, even if I can't for the life of me see why they would possibly
 > matter.

Assuming the NMI watchdog traces I saw in rc1 are the same problem,
I'll see if I can bisect between .17 and .18rc1 on Monday, and see
if that yields anything interesting.

	Dave



* Re: frequent lockups in 3.18rc4
  2014-11-14 22:01 ` Linus Torvalds
  2014-11-14 22:30   ` Dave Jones
@ 2014-11-14 22:55   ` Thomas Gleixner
  2014-11-14 23:32     ` Dave Jones
  2014-11-15  1:59     ` Linus Torvalds
  2014-11-15 21:34   ` Dave Jones
  2 siblings, 2 replies; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-14 22:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, the arch/x86 maintainers

On Fri, 14 Nov 2014, Linus Torvalds wrote:
> On Fri, Nov 14, 2014 at 1:31 PM, Dave Jones <davej@redhat.com> wrote:
> > I'm not sure how long this goes back (3.17 was fine, AFAIR) but I'm
> > seeing these several times a day lately.
>
> Plus, judging by the fact that there's a stale "leave_mm+0x210/0x210"
> (wouldn't that be the *next* function, namely do_flush_tlb_all())
> pointer on the stack, I suspect that whole range-flushing doesn't even
> trigger, and we are flushing everything.

This stale entry is not relevant here because the thing is stuck in
generic_exec_single().
 
> > NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c129:25570]
> > RIP: 0010:[<ffffffff9c11e98a>]  [<ffffffff9c11e98a>] generic_exec_single+0xea/0x1d0

> > Call Trace:
> >  [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
> >  [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
> >  [<ffffffff9c11ead6>] smp_call_function_single+0x66/0x110
> >  [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
> >  [<ffffffff9c11f021>] smp_call_function_many+0x2f1/0x390
> >  [<ffffffff9c049300>] flush_tlb_mm_range+0xe0/0x370

flush_tlb_mm_range()
	.....
out:
        if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
                flush_tlb_others(mm_cpumask(mm), mm, start, end);

which calls

      smp_call_function_many() via native_flush_tlb_others()

which is either inlined, or not on the stack because the invocation of
smp_call_function_many() is a tail call.

So from smp_call_function_many() we end up via
smp_call_function_single() in generic_exec_single().

So the only ways to get stuck there are:

     csd_lock(csd);
and
     csd_lock_wait(csd);
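
Both of those spin on the same flag bit. For reference, the relevant
helpers in kernel/smp.c look more or less like this (paraphrased from
the 3.18-era source, so treat it as a sketch rather than the exact code):

    static void csd_lock_wait(struct call_single_data *csd)
    {
            /* spin until the current owner drops CSD_FLAG_LOCK */
            while (csd->flags & CSD_FLAG_LOCK)
                    cpu_relax();
    }

    static void csd_lock(struct call_single_data *csd)
    {
            csd_lock_wait(csd);
            csd->flags |= CSD_FLAG_LOCK;
            /* order the flag store against the later func/info stores */
            smp_mb();
    }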

The called function is flush_tlb_func() and I really can't see why
that would get stuck at all.

So this looks more like a smp function call fuckup.

I assume Dave is running that stuff on KVM. So it might be worthwhile
to look at the IPI magic there.

Thanks,

	tglx



* Re: frequent lockups in 3.18rc4
  2014-11-14 22:55   ` Thomas Gleixner
@ 2014-11-14 23:32     ` Dave Jones
  2014-11-15  0:36       ` Thomas Gleixner
  2014-11-15  1:59     ` Linus Torvalds
  1 sibling, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-14 23:32 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Fri, Nov 14, 2014 at 11:55:30PM +0100, Thomas Gleixner wrote:
 
 > So this looks more like a smp function call fuckup.
 > 
 > I assume Dave is running that stuff on KVM. So it might be worthwhile
 > to look at the IPI magic there.

no, bare metal.

    Dave


* Re: frequent lockups in 3.18rc4
  2014-11-14 23:32     ` Dave Jones
@ 2014-11-15  0:36       ` Thomas Gleixner
  2014-11-15  2:40         ` Dave Jones
  0 siblings, 1 reply; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-15  0:36 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Fri, 14 Nov 2014, Dave Jones wrote:

> On Fri, Nov 14, 2014 at 11:55:30PM +0100, Thomas Gleixner wrote:
>  
>  > So this looks more like a smp function call fuckup.
>  > 
>  > I assume Dave is running that stuff on KVM. So it might be worthwhile
>  > to look at the IPI magic there.
> 
> no, bare metal.

Ok, but that does not change the fact that we are stuck in
smp_function_call land.

Enabling softlockup_all_cpu_backtrace will probably not help much as
we will end up waiting for csd_lock again :(

Is the machine still accessible when this happens? If yes, we might
enable a few trace points and functions and read out the trace
buffer. If not, we could just panic the machine and dump the trace
buffer over serial.
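
A sketch of the latter, using the stock ftrace interface (the filter
list is just a guess at what would be interesting here):

    cd /sys/kernel/debug/tracing
    echo smp_call_function_single smp_call_function_many > set_ftrace_filter
    echo function > current_tracer
    # spew the trace buffer to the console when the watchdog panics:
    echo 1 > /proc/sys/kernel/ftrace_dump_on_oops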

Sigh

	tglx


* Re: frequent lockups in 3.18rc4
  2014-11-14 22:55   ` Thomas Gleixner
  2014-11-14 23:32     ` Dave Jones
@ 2014-11-15  1:59     ` Linus Torvalds
  2014-11-17 21:22       ` Linus Torvalds
  1 sibling, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-15  1:59 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Dave Jones, Linux Kernel, the arch/x86 maintainers

On Fri, Nov 14, 2014 at 2:55 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> This stale entry is not relevant here because the thing is stuck in
> generic_exec_single().

That wasn't really my argument. The fact that "do_flush_tlb_all()" was
left over on the stack frame implies that we're not doing the
range-flush, and if it was some odd bug with a negative range or
something like that (due to the fix in commit ce9ec37bddb6), I'd
expect the lockup to be due to a hung do_kernel_range_flush() or
something. But the range flushing never even happens.

> So from smp_call_function_many() we end up via
> smp_call_function_single() in generic_exec_single().
>
> So the only ways to get stuck there are:
>
>      csd_lock(csd);
> and
>      csd_lock_wait(csd);

Judging by the code disassembly, it's the "csd_lock_wait(csd)" at the
end. The disassembly looks like

  29: f3 90                 pause
  2b:* f6 43 18 01           testb  $0x1,0x18(%rbx) <-- trapping instruction
  2f: 75 f8                 jne    0x29
  31: 31 c0                 xor    %eax,%eax

and that "xor %eax,%eax" seems to be part of the "return 0"
immediately afterwards.
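
That also matches what the csd_lock_wait() spin loop should compile to:
cpu_relax() is the "pause", and the "testb $0x1,0x18(%rbx)" would be
testing CSD_FLAG_LOCK in csd->flags (offset and flag value inferred from
the 3.18 source, not from the objdump of this particular build).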

But that's not entirely conclusive, it's just a strong hint.

It does sound like there might be some IPI issue. I just don't see
*any* changes in this area since 3.17. Some unrelated APIC change? I
don't see that either. As you noted, there are KVM changes, but
apparently that isn't involved either.

                 Linus


* Re: frequent lockups in 3.18rc4
  2014-11-15  0:36       ` Thomas Gleixner
@ 2014-11-15  2:40         ` Dave Jones
  2014-11-16 12:16           ` Thomas Gleixner
  0 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-15  2:40 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Sat, Nov 15, 2014 at 01:36:41AM +0100, Thomas Gleixner wrote:
 > On Fri, 14 Nov 2014, Dave Jones wrote:
 > 
 > > On Fri, Nov 14, 2014 at 11:55:30PM +0100, Thomas Gleixner wrote:
 > >  
 > >  > So this looks more like a smp function call fuckup.
 > >  > 
 > >  > I assume Dave is running that stuff on KVM. So it might be worthwhile
 > >  > to look at the IPI magic there.
 > > 
 > > no, bare metal.
 > 
 > Ok, but that does not change the fact that we are stuck in
 > smp_function_call land.
 > 
 > Enabling softlockup_all_cpu_backtrace will probably not help much as
 > we will end up waiting for csd_lock again :(
 > 
 > Is the machine still accessible when this happens? If yes, we might
 > enable a few trace points and functions and read out the trace
 > buffer. If not, we could just panic the machine and dump the trace
 > buffer over serial.

No, it wedges solid. Even though it says something like "CPU3 locked up",
apparently all cores also get stuck.
9 times out of 10 it doesn't stay alive long enough to even get the full
trace out over usb-serial.

	Dave



* Re: frequent lockups in 3.18rc4
  2014-11-14 22:01 ` Linus Torvalds
  2014-11-14 22:30   ` Dave Jones
  2014-11-14 22:55   ` Thomas Gleixner
@ 2014-11-15 21:34   ` Dave Jones
  2014-11-16  1:40     ` Dave Jones
  2 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-15 21:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel, the arch/x86 maintainers

On Fri, Nov 14, 2014 at 02:01:27PM -0800, Linus Torvalds wrote:

 > But since you say "several times a day", just for fun, can you test
 > the follow-up patch to that one-liner fix that Will Deacon posted
 > today (Subject: "[PATCH] mmu_gather: move minimal range calculations
 > into generic code"). That does some further cleanup in this area.

A few hours ago it hit the NMI watchdog again with that patch applied.
Incomplete trace, but it looks different based on what did make it over.
Different RIP at least.

[65155.054155] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [trinity-c127:12559]
[65155.054573] irq event stamp: 296752
[65155.054589] hardirqs last  enabled at (296751): [<ffffffff9d87403d>] _raw_spin_unlock_irqrestore+0x5d/0x80
[65155.054625] hardirqs last disabled at (296752): [<ffffffff9d875cea>] apic_timer_interrupt+0x6a/0x80
[65155.054657] softirqs last  enabled at (296188): [<ffffffff9d259943>] bdi_queue_work+0x83/0x270
[65155.054688] softirqs last disabled at (296184): [<ffffffff9d259920>] bdi_queue_work+0x60/0x270
[65155.054721] CPU: 1 PID: 12559 Comm: trinity-c127 Not tainted 3.18.0-rc4+ #84 [loadavg: 209.68 187.90 185.33 34/431 17515]
[65155.054795] task: ffff88023f664680 ti: ffff8801649f0000 task.ti: ffff8801649f0000
[65155.054820] RIP: 0010:[<ffffffff9d87403f>]  [<ffffffff9d87403f>] _raw_spin_unlock_irqrestore+0x5f/0x80
[65155.054852] RSP: 0018:ffff8801649f3be8  EFLAGS: 00000292
[65155.054872] RAX: ffff88023f664680 RBX: 0000000000000007 RCX: 0000000000000007
[65155.054895] RDX: 00000000000029e0 RSI: ffff88023f664ea0 RDI: ffff88023f664680
[65155.054919] RBP: ffff8801649f3bf8 R08: 0000000000000000 R09: 0000000000000000
[65155.055956] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[65155.056985] R13: ffff8801649f3b58 R14: ffffffff9d3e7d0e R15: 00000000000003e0
[65155.058037] FS:  00007f0dc957c700(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
[65155.059083] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[65155.060121] CR2: 00007f0dc958e000 CR3: 000000022f31e000 CR4: 00000000001407e0
[65155.061152] DR0: 00007f54162bc000 DR1: 00007feb92c3d000 DR2: 0000000000000000
[65155.062180] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[65155.063202] Stack:

And that's all she wrote.

 > If Will's patch doesn't make a difference, what about reverting that
 > ce9ec37bddb6? Although it really *is* an "obvious bugfix", and I really
 > don't see why any of this would be noticeable on x86 (it triggered
 > issues on ARM64, but that was because ARM64 cared much more about the
 > exact range).

I'll try that next, and check in on it tomorrow.

	Dave


* Re: frequent lockups in 3.18rc4
  2014-11-15 21:34   ` Dave Jones
@ 2014-11-16  1:40     ` Dave Jones
  2014-11-16  6:33       ` Linus Torvalds
  2014-11-20 15:28       ` Frederic Weisbecker
  0 siblings, 2 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-16  1:40 UTC (permalink / raw)
  To: Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Sat, Nov 15, 2014 at 04:34:05PM -0500, Dave Jones wrote:
 > On Fri, Nov 14, 2014 at 02:01:27PM -0800, Linus Torvalds wrote:
 > 
 >  > But since you say "several times a day", just for fun, can you test
 >  > the follow-up patch to that one-liner fix that Will Deacon posted
 >  > today (Subject: "[PATCH] mmu_gather: move minimal range calculations
 >  > into generic code"). That does some further cleanup in this area.
 > 
 > A few hours ago it hit the NMI watchdog again with that patch applied.
 > Incomplete trace, but it looks different based on what did make it over.
 > Different RIP at least.
 > 
 > [65155.054155] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [trinity-c127:12559]
 > [65155.054573] irq event stamp: 296752
 > [65155.054589] hardirqs last  enabled at (296751): [<ffffffff9d87403d>] _raw_spin_unlock_irqrestore+0x5d/0x80
 > [65155.054625] hardirqs last disabled at (296752): [<ffffffff9d875cea>] apic_timer_interrupt+0x6a/0x80
 > [65155.054657] softirqs last  enabled at (296188): [<ffffffff9d259943>] bdi_queue_work+0x83/0x270
 > [65155.054688] softirqs last disabled at (296184): [<ffffffff9d259920>] bdi_queue_work+0x60/0x270
 > [65155.054721] CPU: 1 PID: 12559 Comm: trinity-c127 Not tainted 3.18.0-rc4+ #84 [loadavg: 209.68 187.90 185.33 34/431 17515]
 > [65155.054795] task: ffff88023f664680 ti: ffff8801649f0000 task.ti: ffff8801649f0000
 > [65155.054820] RIP: 0010:[<ffffffff9d87403f>]  [<ffffffff9d87403f>] _raw_spin_unlock_irqrestore+0x5f/0x80
 > [65155.054852] RSP: 0018:ffff8801649f3be8  EFLAGS: 00000292
 > [65155.054872] RAX: ffff88023f664680 RBX: 0000000000000007 RCX: 0000000000000007
 > [65155.054895] RDX: 00000000000029e0 RSI: ffff88023f664ea0 RDI: ffff88023f664680
 > [65155.054919] RBP: ffff8801649f3bf8 R08: 0000000000000000 R09: 0000000000000000
 > [65155.055956] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
 > [65155.056985] R13: ffff8801649f3b58 R14: ffffffff9d3e7d0e R15: 00000000000003e0
 > [65155.058037] FS:  00007f0dc957c700(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
 > [65155.059083] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 > [65155.060121] CR2: 00007f0dc958e000 CR3: 000000022f31e000 CR4: 00000000001407e0
 > [65155.061152] DR0: 00007f54162bc000 DR1: 00007feb92c3d000 DR2: 0000000000000000
 > [65155.062180] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
 > [65155.063202] Stack:
 > 
 > And that's all she wrote.
 > 
 >  > If Will's patch doesn't make a difference, what about reverting that
 >  > ce9ec37bddb6? Although it really *is* an "obvious bugfix", and I really
 >  > don't see why any of this would be noticeable on x86 (it triggered
 >  > issues on ARM64, but that was because ARM64 cared much more about the
 >  > exact range).
 > 
 > I'll try that next, and check in on it tomorrow.

No luck. Died even faster this time.

[  772.459481] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [modprobe:31400]
[  772.459858] irq event stamp: 3362
[  772.459872] hardirqs last  enabled at (3361): [<ffffffff941a437c>] context_tracking_user_enter+0x9c/0x2c0
[  772.459907] hardirqs last disabled at (3362): [<ffffffff94875bea>] apic_timer_interrupt+0x6a/0x80
[  772.459937] softirqs last  enabled at (0): [<ffffffff940764d5>] copy_process.part.26+0x635/0x1d80
[  772.459968] softirqs last disabled at (0): [<          (null)>]           (null)
[  772.459996] CPU: 3 PID: 31400 Comm: modprobe Not tainted 3.18.0-rc4+ #85 [loadavg: 207.70 163.33 92.64 11/433 31547]
[  772.460086] task: ffff88022f0b2f00 ti: ffff88019a944000 task.ti: ffff88019a944000
[  772.460110] RIP: 0010:[<ffffffff941a437e>]  [<ffffffff941a437e>] context_tracking_user_enter+0x9e/0x2c0
[  772.460142] RSP: 0018:ffff88019a947f00  EFLAGS: 00000282
[  772.460161] RAX: ffff88022f0b2f00 RBX: 0000000000000000 RCX: 0000000000000000
[  772.460184] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88022f0b2f00
[  772.460207] RBP: ffff88019a947f10 R08: 0000000000000000 R09: 0000000000000000
[  772.460229] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88019a947e90
[  772.460252] R13: ffffffff940f6d04 R14: ffff88019a947ec0 R15: ffff8802447cd640
[  772.460294] FS:  00007f3b71ee4700(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
[  772.460362] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  772.460391] CR2: 00007fffdad5af58 CR3: 000000011608e000 CR4: 00000000001407e0
[  772.460424] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  772.460447] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  772.460470] Stack:
[  772.460480]  ffff88019a947f58 00000000006233a8 ffff88019a947f40 ffffffff9401429d
[  772.460512]  00000000006233a8 000000000041d68a 00000000006233a8 0000000000000000
[  772.460543]  00000000006233a0 ffffffff94874fa4 000000001008feff 000507d93d73a434
[  772.460574] Call Trace:
[  772.461576]  [<ffffffff9401429d>] syscall_trace_leave+0xad/0x2e0
[  772.462572]  [<ffffffff94874fa4>] int_check_syscall_exit_work+0x34/0x3d
[  772.463575] Code: f8 1c 00 84 c0 75 46 48 c7 c7 51 53 cd 94 e8 aa 23 24 00 65 c7 04 25 f4 f8 1c 00 01 00 00 00 f6 c7 02 74 19 e8 84 43 f3 ff 53 9d <5b> 41 5c 5d c3 0f 1f 44 00 00 c3 0f 1f 80 00 00 00 00 53 9d e8 
[  772.465797] Kernel panic - not syncing: softlockup: hung tasks
[  772.466821] CPU: 3 PID: 31400 Comm: modprobe Tainted: G             L 3.18.0-rc4+ #85 [loadavg: 207.70 163.33 92.64 11/433 31547]
[  772.468915]  ffff88022f0b2f00 00000000de65d5f5 ffff880244603dc8 ffffffff94869e01
[  772.470031]  0000000000000000 ffffffff94c7599b ffff880244603e48 ffffffff94866b21
[  772.471085]  ffff880200000008 ffff880244603e58 ffff880244603df8 00000000de65d5f5
[  772.472141] Call Trace:
[  772.473183]  <IRQ>  [<ffffffff94869e01>] dump_stack+0x4f/0x7c
[  772.474253]  [<ffffffff94866b21>] panic+0xcf/0x202
[  772.475346]  [<ffffffff94154d1e>] watchdog_timer_fn+0x27e/0x290
[  772.476414]  [<ffffffff94106297>] __run_hrtimer+0xe7/0x740
[  772.477475]  [<ffffffff94106b64>] ? hrtimer_interrupt+0x94/0x270
[  772.478555]  [<ffffffff94154aa0>] ? watchdog+0x40/0x40
[  772.479627]  [<ffffffff94106be7>] hrtimer_interrupt+0x117/0x270
[  772.480703]  [<ffffffff940303db>] local_apic_timer_interrupt+0x3b/0x70
[  772.481777]  [<ffffffff948777f3>] smp_apic_timer_interrupt+0x43/0x60
[  772.482856]  [<ffffffff94875bef>] apic_timer_interrupt+0x6f/0x80
[  772.483915]  <EOI>  [<ffffffff941a437e>] ? context_tracking_user_enter+0x9e/0x2c0
[  772.484972]  [<ffffffff9401429d>] syscall_trace_leave+0xad/0x2e0
[  772.486042]  [<ffffffff94874fa4>] int_check_syscall_exit_work+0x34/0x3d
[  772.487187] Kernel Offset: 0x13000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)


	Dave



* Re: frequent lockups in 3.18rc4
  2014-11-16  1:40     ` Dave Jones
@ 2014-11-16  6:33       ` Linus Torvalds
  2014-11-16 10:06         ` Markus Trippelsdorf
                           ` (2 more replies)
  2014-11-20 15:28       ` Frederic Weisbecker
  1 sibling, 3 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-16  6:33 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Sat, Nov 15, 2014 at 5:40 PM, Dave Jones <davej@redhat.com> wrote:
>  >
>  > I'll try that next, and check in on it tomorrow.
>
> No luck. Died even faster this time.

Yeah, and your other lockups haven't even been TLB related. Not that
they look like anything else *either*.

I have no ideas left. I'd go for a bisection - rather than try random
things, at least bisection will get us a smaller set of suspects if
you can go through a few cycles of it. Even if you decide that you
want to run for most of a day before you are convinced it's all good,
a couple of days should get you a handful of bisection points (that's
assuming you hit a couple of bad ones too that turn bad in a shorter
while). And 4 or five bisections should get us from 11k commits down
to the ~600 commit range. That would be a huge improvement.

                   Linus


* Re: frequent lockups in 3.18rc4
  2014-11-16  6:33       ` Linus Torvalds
@ 2014-11-16 10:06         ` Markus Trippelsdorf
  2014-11-16 18:33           ` Linus Torvalds
  2014-11-17 17:03         ` Dave Jones
  2014-11-26  0:25         ` Dave Jones
  2 siblings, 1 reply; 486+ messages in thread
From: Markus Trippelsdorf @ 2014-11-16 10:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Linux Kernel, the arch/x86 maintainers

On 2014.11.15 at 22:33 -0800, Linus Torvalds wrote:
> On Sat, Nov 15, 2014 at 5:40 PM, Dave Jones <davej@redhat.com> wrote:
> >  >
> >  > I'll try that next, and check in on it tomorrow.
> >
> > No luck. Died even faster this time.
> 
> Yeah, and your other lockups haven't even been TLB related. Not that
> they look like anything else *either*.
> 
> I have no ideas left. I'd go for a bisection

Before starting a bisection you could try disabling transparent hugepages.
There are strange bugs in this area that were introduced during this merge
window. See: https://lkml.org/lkml/2014/11/4/144
https://lkml.org/lkml/2014/11/4/904
http://thread.gmane.org/gmane.linux.kernel.mm/124451
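
Disabling it at runtime should just be (assuming the usual sysfs knob is
compiled in):

    echo never > /sys/kernel/mm/transparent_hugepage/enabled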

-- 
Markus


* Re: frequent lockups in 3.18rc4
  2014-11-15  2:40         ` Dave Jones
@ 2014-11-16 12:16           ` Thomas Gleixner
  0 siblings, 0 replies; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-16 12:16 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Fri, 14 Nov 2014, Dave Jones wrote:
> On Sat, Nov 15, 2014 at 01:36:41AM +0100, Thomas Gleixner wrote:
>  > On Fri, 14 Nov 2014, Dave Jones wrote:
>  > 
>  > > On Fri, Nov 14, 2014 at 11:55:30PM +0100, Thomas Gleixner wrote:
>  > >  
>  > >  > So this looks more like a smp function call fuckup.
>  > >  > 
>  > >  > I assume Dave is running that stuff on KVM. So it might be worthwhile
>  > >  > to look at the IPI magic there.
>  > > 
>  > > no, bare metal.
>  > 
>  > Ok, but that does not change the fact that we are stuck in
>  > smp_function_call land.
>  > 
>  > Enabling softlockup_all_cpu_backtrace will probably not help much as
>  > we will end up waiting for csd_lock again :(
>  > 
>  > Is the machine still accessible when this happens? If yes, we might
>  > enable a few trace points and functions and read out the trace
>  > buffer. If not, we could just panic the machine and dump the trace
>  > buffer over serial.
> 
> No, it wedges solid. Even though it says something like "CPU3 locked up",
> apparently all cores also get stuck.

Does not surprise me. Once the smp function call machinery is wedged...

> 9 times out of 10 it doesn't stay alive long enough to even get the full
> trace out over usb-serial.

usb-serial is definitely not the best tool for stuff like this. I
wonder whether netconsole might give us some more info.
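
For reference, a minimal netconsole setup would be something like the
following (all addresses below are made up):

    # sender: log to UDP port 6666 on the receiving box
    modprobe netconsole \
        netconsole=6665@192.168.1.5/eth0,6666@192.168.1.2/00:11:22:33:44:55
    # receiver (192.168.1.2):
    nc -u -l 6666        # older netcats want "nc -u -l -p 6666"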

Last time I looked into something like that on my laptop I had to
resort to a crash kernel to get anything useful out of the box.

Thanks,

	tglx


* Re: frequent lockups in 3.18rc4
  2014-11-16 10:06         ` Markus Trippelsdorf
@ 2014-11-16 18:33           ` Linus Torvalds
  0 siblings, 0 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-16 18:33 UTC (permalink / raw)
  To: Markus Trippelsdorf; +Cc: Dave Jones, Linux Kernel, the arch/x86 maintainers

On Sun, Nov 16, 2014 at 2:06 AM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
>
> Before starting a bisection you could try disabling transparent hugepages.
> There are strange bugs in this area that were introduced during this merge
> window. See: https://lkml.org/lkml/2014/11/4/144
> https://lkml.org/lkml/2014/11/4/904
> http://thread.gmane.org/gmane.linux.kernel.mm/124451

Those look different, and hopefully should be fixed by commit
1d5bfe1ffb5b ("mm, compaction: prevent infinite loop in
compact_zone"). Which admittedly isn't in -rc4 (it went in on
Thursday), but I think Dave tends to run git-of-the-day rather than
last rc, so he probably already had it.

I *think* that if it was the infinite compaction problem, you'd have
the soft-lockup reports showing that. Dave's are in random places.
Which is odd.

               Linus


* Re: frequent lockups in 3.18rc4
  2014-11-14 21:31 frequent lockups in 3.18rc4 Dave Jones
  2014-11-14 22:01 ` Linus Torvalds
@ 2014-11-17 15:07 ` Don Zickus
  1 sibling, 0 replies; 486+ messages in thread
From: Don Zickus @ 2014-11-17 15:07 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, Linus Torvalds

On Fri, Nov 14, 2014 at 04:31:24PM -0500, Dave Jones wrote:
> I'm not sure how long this goes back (3.17 was fine, AFAIR) but I'm
> seeing these several times a day lately.
> 
> 
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c129:25570]
> irq event stamp: 74224
> hardirqs last  enabled at (74223): [<ffffffff9c875664>] restore_args+0x0/0x30
> hardirqs last disabled at (74224): [<ffffffff9c8759aa>] apic_timer_interrupt+0x6a/0x80
> softirqs last  enabled at (74222): [<ffffffff9c07f43a>] __do_softirq+0x26a/0x6f0
> softirqs last disabled at (74209): [<ffffffff9c07fb4d>] irq_exit+0x13d/0x170
> CPU: 3 PID: 25570 Comm: trinity-c129 Not tainted 3.18.0-rc4+ #83 [loadavg: 198.04 186.66 181.58 24/442 26708]
> task: ffff880213442f00 ti: ffff8801ea714000 task.ti: ffff8801ea714000
> RIP: 0010:[<ffffffff9c11e98a>]  [<ffffffff9c11e98a>] generic_exec_single+0xea/0x1d0
> RSP: 0018:ffff8801ea717a08  EFLAGS: 00000202
> RAX: ffff880213442f00 RBX: ffffffff9c875664 RCX: 0000000000000006
> RDX: 0000000000001370 RSI: ffff880213443790 RDI: ffff880213442f00
> RBP: ffff8801ea717a68 R08: ffff880242b56690 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801ea717978
> R13: ffff880213442f00 R14: ffff8801ea714000 R15: ffff880213442f00
> FS:  00007f240994e700(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000004 CR3: 000000019a017000 CR4: 00000000001407e0
> DR0: 00007fb3367e0000 DR1: 00007f82542ab000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
>  ffffffff9ce4c620 0000000000000000 ffffffff9c048b20 ffff8801ea717b18
>  0000000000000003 0000000052e0da3d ffffffff9cc7ef3c 0000000000000002
>  ffffffff9c048b20 ffff8801ea717b18 0000000000000001 0000000000000003
> Call Trace:
>  [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
>  [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
>  [<ffffffff9c11ead6>] smp_call_function_single+0x66/0x110
>  [<ffffffff9c048b20>] ? leave_mm+0x210/0x210
>  [<ffffffff9c11f021>] smp_call_function_many+0x2f1/0x390


Hi Dave,

When I see stuff like this, it is usually because another CPU is blocking
the IPI from smp_call_function_many from finishing, so this CPU waits
forever.

The problem usually becomes obvious with a dump of all CPUs at the time
the lockup is detected.

Can you try adding 'softlockup_all_cpu_backtrace=1' to the kernel
command line?  That should dump all the CPUs to see if anything stands out.
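
If rebooting with a new command line is a hassle, the same knob should
also be flippable at runtime (assuming this kernel has the sysctl; it
went in around 3.16):

    echo 1 > /proc/sys/kernel/softlockup_all_cpu_backtrace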

Though I don't normally see it traverse down to smp_call_function_single.

Anyway something to try.

Cheers,
Don

>  [<ffffffff9c049300>] flush_tlb_mm_range+0xe0/0x370
>  [<ffffffff9c1d95a2>] tlb_flush_mmu_tlbonly+0x42/0x50
>  [<ffffffff9c1d9cb5>] tlb_finish_mmu+0x45/0x50
>  [<ffffffff9c1daf59>] zap_page_range_single+0x119/0x170
>  [<ffffffff9c1db140>] unmap_mapping_range+0x140/0x1b0
>  [<ffffffff9c1c7edd>] shmem_fallocate+0x43d/0x540
>  [<ffffffff9c0b111b>] ? preempt_count_sub+0xab/0x100
>  [<ffffffff9c0cdac7>] ? prepare_to_wait+0x27/0x80
>  [<ffffffff9c2287f3>] ? __sb_start_write+0x103/0x1d0
>  [<ffffffff9c223aba>] do_fallocate+0x12a/0x1c0
>  [<ffffffff9c1f0bd3>] SyS_madvise+0x3d3/0x890
>  [<ffffffff9c1a40d2>] ? context_tracking_user_exit+0x52/0x260
>  [<ffffffff9c013ebd>] ? syscall_trace_enter_phase2+0x10d/0x3d0
>  [<ffffffff9c874c89>] tracesys_phase2+0xd4/0xd9
> Code: 63 c7 48 89 de 48 89 df 48 c7 c2 c0 50 1d 00 48 03 14 c5 40 b9 f2 9c e8 d5 ea 2b 00 84 c0 74 0b e9 bc 00 00 00 0f 1f 40 00 f3 90 <f6> 43 18 01 75 f8 31 c0 48 8b 4d c8 65 48 33 0c 25 28 00 00 00 
> Kernel panic - not syncing: softlockup: hung tasks
> 
> 
> I've got a local hack to dump loadavg on traces, and as you can see in that
> example, the machine was really busy, but we were at least making progress
> before the trace spewed, and the machine rebooted. (I have reboot-on-lockup sysctl
> set, without it, the machine just wedges indefinitely shortly after the spew).
> 
> The trace doesn't really enlighten me as to what we should be doing
> to prevent this though.
> 
> ideas?
> I can try to bisect it, but it takes hours before it happens,
> so it might take days to complete, and the next few weeks are
> complicated timewise.
> 
> 	Dave


* Re: frequent lockups in 3.18rc4
  2014-11-16  6:33       ` Linus Torvalds
  2014-11-16 10:06         ` Markus Trippelsdorf
@ 2014-11-17 17:03         ` Dave Jones
  2014-11-17 19:59           ` Linus Torvalds
  2014-11-20 15:08           ` Frederic Weisbecker
  2014-11-26  0:25         ` Dave Jones
  2 siblings, 2 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-17 17:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel, the arch/x86 maintainers

On Sat, Nov 15, 2014 at 10:33:19PM -0800, Linus Torvalds wrote:
 
 > >  > I'll try that next, and check in on it tomorrow.
 > >
 > > No luck. Died even faster this time.
 > 
 > Yeah, and your other lockups haven't even been TLB related. Not that
 > they look like anything else *either*.
 > 
 > I have no ideas left. I'd go for a bisection - rather than try random
 > things, at least bisection will get us a smaller set of suspects if
 > you can go through a few cycles of it. Even if you decide that you
 > want to run for most of a day before you are convinced it's all good,
 > a couple of days should get you a handful of bisection points (that's
 > assuming you hit a couple of bad ones too that turn bad in a shorter
 > while). And 4 or five bisections should get us from 11k commits down
 > to the ~600 commit range. That would be a huge improvement.

Great start to the week: I decided to confirm my recollection that .17
was ok, only to hit this within 10 minutes.

Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
CPU: 3 PID: 17176 Comm: trinity-c95 Not tainted 3.17.0+ #87
 0000000000000000 00000000f3a61725 ffff880244606bf0 ffffffff9583e9fa
 ffffffff95c67918 ffff880244606c78 ffffffff9583bcc0 0000000000000010
 ffff880244606c88 ffff880244606c20 00000000f3a61725 0000000000000000
Call Trace:
 <NMI>  [<ffffffff9583e9fa>] dump_stack+0x4e/0x7a
 [<ffffffff9583bcc0>] panic+0xd4/0x207
 [<ffffffff95150908>] watchdog_overflow_callback+0x118/0x120
 [<ffffffff95193dbe>] __perf_event_overflow+0xae/0x340
 [<ffffffff95192230>] ? perf_event_task_disable+0xa0/0xa0
 [<ffffffff9501a7bf>] ? x86_perf_event_set_period+0xbf/0x150
 [<ffffffff95194be4>] perf_event_overflow+0x14/0x20
 [<ffffffff95020676>] intel_pmu_handle_irq+0x206/0x410
 [<ffffffff9501966b>] perf_event_nmi_handler+0x2b/0x50
 [<ffffffff95007bb2>] nmi_handle+0xd2/0x390
 [<ffffffff95007ae5>] ? nmi_handle+0x5/0x390
 [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
 [<ffffffff950080a2>] default_do_nmi+0x72/0x1c0
 [<ffffffff950082a8>] do_nmi+0xb8/0x100
 [<ffffffff9584b9aa>] end_repeat_nmi+0x1e/0x2e
 [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
 [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
 [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
 <<EOE>>  <IRQ>  [<ffffffff95101685>] lock_hrtimer_base.isra.18+0x25/0x50
 [<ffffffff951019d3>] hrtimer_try_to_cancel+0x33/0x1f0
 [<ffffffff95101baa>] hrtimer_cancel+0x1a/0x30
 [<ffffffff95113557>] tick_nohz_restart+0x17/0x90
 [<ffffffff95114533>] __tick_nohz_full_check+0xc3/0x100
 [<ffffffff9511457e>] nohz_full_kick_work_func+0xe/0x10
 [<ffffffff95188894>] irq_work_run_list+0x44/0x70
 [<ffffffff951888ea>] irq_work_run+0x2a/0x50
 [<ffffffff9510109b>] update_process_times+0x5b/0x70
 [<ffffffff95113325>] tick_sched_handle.isra.20+0x25/0x60
 [<ffffffff95113801>] tick_sched_timer+0x41/0x60
 [<ffffffff95102281>] __run_hrtimer+0x81/0x480
 [<ffffffff951137c0>] ? tick_sched_do_timer+0xb0/0xb0
 [<ffffffff95102977>] hrtimer_interrupt+0x117/0x270
 [<ffffffff950346d7>] local_apic_timer_interrupt+0x37/0x60
 [<ffffffff9584c44f>] smp_apic_timer_interrupt+0x3f/0x50
 [<ffffffff9584a86f>] apic_timer_interrupt+0x6f/0x80
 <EOI>  [<ffffffff950d3f3a>] ? lock_release_holdtime.part.28+0x9a/0x160
 [<ffffffff950ef3b7>] ? rcu_is_watching+0x27/0x60
 [<ffffffff9508cb75>] kill_pid_info+0xf5/0x130
 [<ffffffff9508ca85>] ? kill_pid_info+0x5/0x130
 [<ffffffff9508ccd3>] SYSC_kill+0x103/0x330
 [<ffffffff9508cc7c>] ? SYSC_kill+0xac/0x330
 [<ffffffff9519b592>] ? context_tracking_user_exit+0x52/0x1a0
 [<ffffffff950d6f1d>] ? trace_hardirqs_on_caller+0x16d/0x210
 [<ffffffff950d6fcd>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff950137ad>] ? syscall_trace_enter+0x14d/0x330
 [<ffffffff9508f44e>] SyS_kill+0xe/0x10
 [<ffffffff95849b24>] tracesys+0xdd/0xe2
Kernel Offset: 0x14000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

It could be a completely different cause of lockup, but seeing this now
has me wondering if perhaps it's something unrelated to the kernel.
I have a recollection of running late .17rc's for days without incident,
and I'm pretty sure .17 was ok too.  But a few weeks ago I did upgrade
that test box to the Fedora 21 beta.  Which means I have a new gcc.
I'm not sure I really trust 4.9.1 yet, so maybe I'll see if I can
get 4.8 back on there and see if that's any better.

	Dave



* Re: frequent lockups in 3.18rc4
  2014-11-17 17:03         ` Dave Jones
@ 2014-11-17 19:59           ` Linus Torvalds
  2014-11-18  2:09             ` Dave Jones
  2014-11-20 15:08           ` Frederic Weisbecker
  1 sibling, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-17 19:59 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Mon, Nov 17, 2014 at 9:03 AM, Dave Jones <davej@redhat.com> wrote:
>
> It could be a completely different cause of lockup, but seeing this now
> has me wondering if perhaps it's something unrelated to the kernel.
> I have a recollection of running late .17rc's for days without incident,
> and I'm pretty sure .17 was ok too.  But a few weeks ago I did upgrade
> that test box to the Fedora 21 beta.  Which means I have a new gcc.
> I'm not sure I really trust 4.9.1 yet, so maybe I'll see if I can
> get 4.8 back on there and see if that's any better.

I'm not sure if I should be relieved or horrified.

Horrified, I think.

It really would be a wonderful thing to have some kind of "compiler
bisection" with mixed object files to see exactly which file it
miscompiles (and by "miscompiles" it might just be a kernel bug where
we are missing a barrier or something, and older gcc's just happened
to not show it - so it could still easily be a kernel problem).

                  Linus


* Re: frequent lockups in 3.18rc4
  2014-11-15  1:59     ` Linus Torvalds
@ 2014-11-17 21:22       ` Linus Torvalds
  2014-11-17 22:31         ` Thomas Gleixner
  2014-11-17 23:04         ` Jens Axboe
  0 siblings, 2 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-17 21:22 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe, Ingo Molnar
  Cc: Dave Jones, Linux Kernel, the arch/x86 maintainers

[-- Attachment #1: Type: text/plain, Size: 1762 bytes --]

On Fri, Nov 14, 2014 at 5:59 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Judging by the code disassembly, it's the "csd_lock_wait(csd)" at the
> end.

Btw, looking at this, I grew really suspicious of this code in csd_unlock():

        WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));

because that makes no sense at all. It basically removes a sanity
check, yet that sanity check makes a hell of a lot of sense. Unlocking
a CSD that is not locked is *wrong*.

The crazy code comes from commit c84a83e2aaab ("smp: don't warn
about csd->flags having CSD_FLAG_LOCK cleared for !wait") by Jens, but
the explanation and the code are pure crap.

There is no way in hell that it is ever correct to unlock an entry
that isn't locked, so that whole CSD_FLAG_WAIT thing is buggy as hell.

The explanation in commit c84a83e2aaab says that  "blk-mq reuses the
request potentially immediately" and claims that that is somehow ok,
but that's utter BS. Even if you don't ever wait for it, the CSD lock
bit fundamentally also protects the "csd->llist" pointer. So what that
commit actually does is to just remove a safety check, and do so in a
very unsafe manner. And apparently block-mq re-uses something THAT IS
STILL ACTIVELY IN USE. That's just horrible.

Now, I think we might do this differently, by doing the "csd_unlock()"
after we have loaded everything from the csd, but *before* actually
calling the callback function. That would seem to be equivalent
(interrupts are disabled, so this will not result in the func()
possibly called twice), more efficient, _and_  not remove a useful
check.

Hmm? Completely untested patch attached. Jens, does this still work for you?

Am I missing something?

                    Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1215 bytes --]

 kernel/smp.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index f38a1e692259..fbeb9827bdae 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,7 +19,6 @@
 
 enum {
 	CSD_FLAG_LOCK		= 0x01,
-	CSD_FLAG_WAIT		= 0x02,
 };
 
 struct call_function_data {
@@ -126,7 +125,7 @@ static void csd_lock(struct call_single_data *csd)
 
 static void csd_unlock(struct call_single_data *csd)
 {
-	WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+	WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
 
 	/*
 	 * ensure we're all done before releasing data:
@@ -173,9 +172,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
 	csd->func = func;
 	csd->info = info;
 
-	if (wait)
-		csd->flags |= CSD_FLAG_WAIT;
-
 	/*
 	 * The list addition should be visible before sending the IPI
 	 * handler locks the list to pull the entry off it because of
@@ -250,8 +246,11 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 	}
 
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
-		csd->func(csd->info);
+		smp_call_func_t func = csd->func;
+		void *info = csd->info;
 		csd_unlock(csd);
+
+		func(info);
 	}
 
 	/*


* Re: frequent lockups in 3.18rc4
  2014-11-17 21:22       ` Linus Torvalds
@ 2014-11-17 22:31         ` Thomas Gleixner
  2014-11-17 22:43           ` Thomas Gleixner
  2014-11-17 23:04         ` Jens Axboe
  1 sibling, 1 reply; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-17 22:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Ingo Molnar, Dave Jones, Linux Kernel,
	the arch/x86 maintainers

On Mon, 17 Nov 2014, Linus Torvalds wrote:
> On Fri, Nov 14, 2014 at 5:59 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Judging by the code disassembly, it's the "csd_lock_wait(csd)" at the
> > end.
> 
> Btw, looking at this, I grew really suspicious of this code in csd_unlock():
> 
>         WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
> 
> because that makes no sense at all. It basically removes a sanity
> check, yet that sanity check makes a hell of a lot of sense. Unlocking
> a CSD that is not locked is *wrong*.
> 
> The crazy code comes from commit c84a83e2aaab ("smp: don't warn
> about csd->flags having CSD_FLAG_LOCK cleared for !wait") by Jens, but
> the explanation and the code are pure crap.
> 
> There is no way in hell that it is ever correct to unlock an entry
> that isn't locked, so that whole CSD_FLAG_WAIT thing is buggy as hell.
> 
> The explanation in commit c84a83e2aaab says that  "blk-mq reuses the
> request potentially immediately" and claims that that is somehow ok,
> but that's utter BS. Even if you don't ever wait for it, the CSD lock
> bit fundamentally also protects the "csd->llist" pointer. So what that
> commit actually does is to just remove a safety check, and do so in a
> very unsafe manner. And apparently block-mq re-uses something THAT IS
> STILL ACTIVELY IN USE. That's just horrible.
>  
> Now, I think we might do this differently, by doing the "csd_unlock()"
> after we have loaded everything from the csd, but *before* actually
> calling the callback function. That would seem to be equivalent
> (interrupts are disabled, so this will not result in the func()
> possibly called twice), more efficient, _and_  not remove a useful
> check.
> 
> Hmm? Completely untested patch attached. Jens, does this still work for you?
> 
> Am I missing something?

Yes. :)

> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -19,7 +19,6 @@
>  
>  enum {
>  	CSD_FLAG_LOCK		= 0x01,
> -	CSD_FLAG_WAIT		= 0x02,
>  };
>  
>  struct call_function_data {
> @@ -126,7 +125,7 @@ static void csd_lock(struct call_single_data *csd)
>  
>  static void csd_unlock(struct call_single_data *csd)
>  {
> -	WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
> +	WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
>  
>  	/*
>  	 * ensure we're all done before releasing data:
> @@ -173,9 +172,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
>  	csd->func = func;
>  	csd->info = info;
>  
> -	if (wait)
> -		csd->flags |= CSD_FLAG_WAIT;
> -
>  	/*
>  	 * The list addition should be visible before sending the IPI
>  	 * handler locks the list to pull the entry off it because of
> @@ -250,8 +246,11 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
>  	}
>  
>  	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> -		csd->func(csd->info);
> +		smp_call_func_t func = csd->func;
> +		void *info = csd->info;
>  		csd_unlock(csd);
> +
> +		func(info);

No, that won't work for synchronous calls:

    CPU 0      	    		CPU 1

    csd_lock(csd);
    queue_csd();
    ipi();
				func = csd->func;
				info = csd->info;
				csd_unlock(csd);
    csd_lock_wait();    
				func(info);
   
The csd_lock_wait() side will succeed and therefore assume that the
call has been completed while the function has not been called at
all. Interesting explosions to follow.

The proper solution is to revert that commit and properly analyze the
problem which Jens was trying to solve and work from there.

Thanks,

	tglx


* Re: frequent lockups in 3.18rc4
  2014-11-17 22:31         ` Thomas Gleixner
@ 2014-11-17 22:43           ` Thomas Gleixner
  2014-11-17 22:58             ` Jens Axboe
  2014-11-17 23:59             ` Linus Torvalds
  0 siblings, 2 replies; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-17 22:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Ingo Molnar, Dave Jones, Linux Kernel,
	the arch/x86 maintainers

On Mon, 17 Nov 2014, Thomas Gleixner wrote:
> On Mon, 17 Nov 2014, Linus Torvalds wrote:
> >  	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> > -		csd->func(csd->info);
> > +		smp_call_func_t func = csd->func;
> > +		void *info = csd->info;
> >  		csd_unlock(csd);
> > +
> > +		func(info);
> 
> No, that won't work for synchronous calls:
> 
>     CPU 0      	    		CPU 1
> 
>     csd_lock(csd);
>     queue_csd();
>     ipi();
> 				func = csd->func;
> 				info = csd->info;
> 				csd_unlock(csd);
>     csd_lock_wait();    
> 				func(info);
>    
> The csd_lock_wait() side will succeed and therefore assume that the
> call has been completed while the function has not been called at
> all. Interesting explosions to follow.
> 
> The proper solution is to revert that commit and properly analyze the
> problem which Jens was trying to solve and work from there.

So a combo of both (Jens and yours) might do the trick. Patch below.

I think what Jens was trying to solve is:

     CPU 0      	    		CPU 1
 
     csd_lock(csd);
     queue_csd();
     ipi();
 				csd->func(csd->info);
     wait_for_completion(csd);
				   complete(csd);
     reuse_csd(csd);		
				csd_unlock(csd);

Thanks,

	tglx	

Index: linux/kernel/smp.c
===================================================================
--- linux.orig/kernel/smp.c
+++ linux/kernel/smp.c
@@ -126,7 +126,7 @@ static void csd_lock(struct call_single_
 
 static void csd_unlock(struct call_single_data *csd)
 {
-	WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+	WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
 
 	/*
 	 * ensure we're all done before releasing data:
@@ -250,8 +250,23 @@ static void flush_smp_call_function_queu
 	}
 
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
-		csd->func(csd->info);
-		csd_unlock(csd);
+
+		/*
+		 * For synchronous calls we are not allowed to unlock
 +		 * before the callback has returned. For the async case
 +		 * it's the responsibility of the caller to keep
+		 * csd->info consistent while the callback runs.
+		 */
+		if (csd->flags & CSD_FLAG_WAIT) {
+			csd->func(csd->info);
+			csd_unlock(csd);
+		} else {
+			smp_call_func_t func = csd->func;
+			void *info = csd->info;
+
+			csd_unlock(csd);
+			func(info);
+		}
 	}
 
 	/*

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-17 22:43           ` Thomas Gleixner
@ 2014-11-17 22:58             ` Jens Axboe
  2014-11-17 23:59             ` Linus Torvalds
  1 sibling, 0 replies; 486+ messages in thread
From: Jens Axboe @ 2014-11-17 22:58 UTC (permalink / raw)
  To: Thomas Gleixner, Linus Torvalds
  Cc: Ingo Molnar, Dave Jones, Linux Kernel, the arch/x86 maintainers

On 11/17/2014 03:43 PM, Thomas Gleixner wrote:
> On Mon, 17 Nov 2014, Thomas Gleixner wrote:
>> On Mon, 17 Nov 2014, Linus Torvalds wrote:
>>>  	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
>>> -		csd->func(csd->info);
>>> +		smp_call_func_t func = csd->func;
>>> +		void *info = csd->info;
>>>  		csd_unlock(csd);
>>> +
>>> +		func(info);
>>
>> No, that won't work for synchronous calls:
>>
>>     CPU 0      	    		CPU 1
>>
>>     csd_lock(csd);
>>     queue_csd();
>>     ipi();
>> 				func = csd->func;
>> 				info = csd->info;
>> 				csd_unlock(csd);
>>     csd_lock_wait();    
>> 				func(info);
>>    
>> The csd_lock_wait() side will succeed and therefore assume that the
>> call has been completed while the function has not been called at
>> all. Interesting explosions to follow.
>>
>> The proper solution is to revert that commit and properly analyze the
>> problem which Jens was trying to solve and work from there.
> 
> So a combo of both (Jens and yours) might do the trick. Patch below.
> 
> I think what Jens was trying to solve is:
> 
>      CPU 0      	    		CPU 1
>  
>      csd_lock(csd);
>      queue_csd();
>      ipi();
>  				csd->func(csd->info);
>      wait_for_completion(csd);
> 				   complete(csd);
>      reuse_csd(csd);		
> 				csd_unlock(csd);

Maybe... The above looks ok to me from a functional point of view, but
now I can't convince myself that the blk-mq use case is correct.

I'll try to back out the original patch and reproduce the issue; that
should jog my memory and give me a full understanding of the issue I
faced back then.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-17 21:22       ` Linus Torvalds
  2014-11-17 22:31         ` Thomas Gleixner
@ 2014-11-17 23:04         ` Jens Axboe
  2014-11-17 23:17           ` Thomas Gleixner
  1 sibling, 1 reply; 486+ messages in thread
From: Jens Axboe @ 2014-11-17 23:04 UTC (permalink / raw)
  To: Linus Torvalds, Thomas Gleixner, Ingo Molnar
  Cc: Dave Jones, Linux Kernel, the arch/x86 maintainers

On 11/17/2014 02:22 PM, Linus Torvalds wrote:
> On Fri, Nov 14, 2014 at 5:59 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Judging by the code disassembly, it's the "csd_lock_wait(csd)" at the
>> end.
> 
> Btw, looking at this, I grew really suspicious of this code in csd_unlock():
> 
>         WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
> 
> because that makes no sense at all. It basically removes a sanity
> check, yet that sanity check makes a hell of a lot of sense. Unlocking
> a CSD that is not locked is *wrong*.
> 
> The crazy code comes from commit c84a83e2aaab ("smp: don't warn
> about csd->flags having CSD_FLAG_LOCK cleared for !wait") by Jens, but
> the explanation and the code is pure crap.
> 
> There is no way in hell that it is ever correct to unlock an entry
> that isn't locked, so that whole CSD_FLAG_WAIT thing is buggy as hell.
> 
> The explanation in commit c84a83e2aaab says that  "blk-mq reuses the
> request potentially immediately" and claims that that is somehow ok,
> but that's utter BS. Even if you don't ever wait for it, the CSD lock
> bit fundamentally also protects the "csd->llist" pointer. So what that
> commit actually does is to just remove a safety check, and do so in a
> very unsafe manner. And apparently block-mq re-uses something THAT IS
> STILL ACTIVELY IN USE. That's just horrible.

I agree that this description is probably utter crap. And now I do
actually remember the issue at hand. The resource here is the tag, which
decides what request we'll use, and subsequently what call_single_data
storage is used. When this was originally done, blk-mq cleared the
request from the function callback, instead of doing it at allocation
time. The assumption here was cache hotness. That in turn also cleared
->csd, which meant that the flags got zeroed and csd_unlock() was
naturally unhappy. THAT was the reuse case, not that the request would
get reused before we had finished the IPI fn callback since that would
obviously create other badness. Now I'm not sure what made me create
that patch, which in retrospect is a bad hammer for this problem.
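
In code terms, the pattern that tripped the WARN_ON was roughly the
following (a reconstruction for illustration only; the function name is
made up, though rq->csd and q->softirq_done_fn are real fields):

/* csd->func, run from the IPI on the submitting CPU */
static void blk_mq_ipi_complete(void *data)
{
	struct request *rq = data;

	rq->q->softirq_done_fn(rq);

	/* init-at-finish for cache hotness: zeroing the request also
	 * zeroes the embedded rq->csd, so CSD_FLAG_LOCK is already
	 * gone when the smp core calls csd_unlock() afterwards. */
	memset(rq, 0, sizeof(*rq));
}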

blk-mq doesn't do the init-at-finish thing anymore, so it should not be
hit by the issue. But if we do bring that back, then it would still work
fine with Thomas' patch, since we unlock prior to running the callback.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-17 23:04         ` Jens Axboe
@ 2014-11-17 23:17           ` Thomas Gleixner
  2014-11-18  2:23             ` Jens Axboe
  0 siblings, 1 reply; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-17 23:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Ingo Molnar, Dave Jones, Linux Kernel,
	the arch/x86 maintainers

On Mon, 17 Nov 2014, Jens Axboe wrote:
> On 11/17/2014 02:22 PM, Linus Torvalds wrote:
> > On Fri, Nov 14, 2014 at 5:59 PM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> Judging by the code disassembly, it's the "csd_lock_wait(csd)" at the
> >> end.
> > 
> > Btw, looking at this, I grew really suspicious of this code in csd_unlock():
> > 
> >         WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
> > 
> > because that makes no sense at all. It basically removes a sanity
> > check, yet that sanity check makes a hell of a lot of sense. Unlocking
> > a CSD that is not locked is *wrong*.
> > 
> > The crazy code comes from commit c84a83e2aaab ("smp: don't warn
> > about csd->flags having CSD_FLAG_LOCK cleared for !wait") by Jens, but
> > the explanation and the code is pure crap.
> > 
> > There is no way in hell that it is ever correct to unlock an entry
> > that isn't locked, so that whole CSD_FLAG_WAIT thing is buggy as hell.
> > 
> > The explanation in commit c84a83e2aaab says that  "blk-mq reuses the
> > request potentially immediately" and claims that that is somehow ok,
> > but that's utter BS. Even if you don't ever wait for it, the CSD lock
> > bit fundamentally also protects the "csd->llist" pointer. So what that
> > commit actually does is to just remove a safety check, and do so in a
> > very unsafe manner. And apparently block-mq re-uses something THAT IS
> > STILL ACTIVELY IN USE. That's just horrible.
> 
> I agree that this description is probably utter crap. And now I do
> actually remember the issue at hand. The resource here is the tag, which
> decides what request we'll use, and subsequently what call_single_data
> storage is used. When this was originally done, blk-mq cleared the
> request from the function callback, instead of doing it at allocation
> time. The assumption here was cache hotness. That in turn also cleared
> ->csd, which meant that the flags got zeroed and csd_unlock() was
> naturally unhappy.

So that's exactly what I described in my other reply.

     csd_lock(csd);
     queue_csd();
     ipi();
				csd->func(csd->info);
     wait_for_completion(csd);
				  complete(csd);
     reuse_csd(csd);		
				csd_unlock(csd);

When you call complete() nothing can rely on csd anymore, except for
the smp core code ....

> THAT was the reuse case, not that the request would get reused
> before we had finished the IPI fn callback since that would
> obviously create other badness. Now I'm not sure what made me create
> that patch, which in retrospect is a bad hammer for this problem.

Performance blindness?
 
> blk-mq doesn't do the init-at-finish thing anymore, so it should not be
> hit by the issue. But if we do bring that back, then it would still work
> fine with Thomas' patch, since we unlock prior to running the callback.

So if blk-mq is not relying on that, then we really should back out
that stuff for 3.18 and tag it for stable.

Treating sync and async function calls differently makes sense,
because any async caller which cannot deal with the unlock before call
scheme is broken by definition already today. But that's material for
next.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-17 22:43           ` Thomas Gleixner
  2014-11-17 22:58             ` Jens Axboe
@ 2014-11-17 23:59             ` Linus Torvalds
  2014-11-18  0:15               ` Thomas Gleixner
  1 sibling, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-17 23:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jens Axboe, Ingo Molnar, Dave Jones, Linux Kernel,
	the arch/x86 maintainers

On Mon, Nov 17, 2014 at 2:43 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> No, that won't work for synchronous calls:

Right you are.

> So a combo of both (Jens and yours) might do the trick. Patch below.

Yeah, I guess that would work. The important part is that *if*
somebody really reuses the csd, we'd better have a release barrier
(which csd_unlock() does, although badly - but this probably isn't
that performance-critical) *before* we call the function, because
otherwise there's no real serialization for the reuse.

Of course, most of these things are presumably always per-cpu data
structures, so the whole worry about "csd" being accessed from
different CPU's probably doesn't even exist, and this all works fine
as-is anyway, even in the presence of odd memory ordering issues.

Judging from Jens' later email, it looks like we simply don't need
this code at all any more, though, and we could just revert the
commit.

NOTE! I don't think this actually has anything to do with the actual
problem that Dave saw. I just reacted to that WARN_ON() when I was
looking at the code, and it made me go "that looks extremely
suspicious".

Particularly on x86, with strong memory ordering, I don't think that
any random accesses to 'csd' after the call to 'csd->func()' could
actually matter. I just felt very nervous about the claim that
somebody can reuse the csd immediately, that smelled bad to me from a
*conceptual* standpoint, even if I suspect it works perfectly fine in
practice.

Anyway, I've found *another* race condition, which (again) doesn't
actually seem to be an issue on x86.

In particular, "csd_lock()" does things pretty well, in that it does a
smp_mb() after setting the lock bit, so certainly nothing afterwards
will leak out of that locked region.

But look at csd_lock_wait(). It just does

        while (csd->flags & CSD_FLAG_LOCK)
                cpu_relax();

and basically there are no memory barriers there. Now, on x86, this is a
non-issue, since all reads act as an acquire, but at least in *theory*
we have this completely unordered read going on. So any subsequent
memory operations (ie after the return from generic_exec_single())
could in theory see data from *before* the read.

So that whole kernel/smp.c locking looks rather dubious. The smp_mb()
in csd_lock() is overkill (a "smp_store_release()" should be
sufficient), and I think that the read of csd->flags in csd_lock_wait()
should be a smp_load_acquire().
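
Spelled out, that would be something like this (an untested sketch
against the 3.18-era helpers):

static void csd_lock_wait(struct call_single_data *csd)
{
	/* Acquire: no later access can be hoisted above this read. */
	while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
		cpu_relax();
}

static void csd_unlock(struct call_single_data *csd)
{
	WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

	/* Release: everything the callback wrote is visible before
	 * the lock bit clears and the csd becomes reusable. */
	smp_store_release(&csd->flags, 0);
}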

Again, none of this has anything to do with Dave's problem. The memory
ordering issues really cannot be an issue on x86, I'm just saying that
there's code there that makes me a bit uncomfortable.

                     Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-17 23:59             ` Linus Torvalds
@ 2014-11-18  0:15               ` Thomas Gleixner
  0 siblings, 0 replies; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-18  0:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Ingo Molnar, Dave Jones, Linux Kernel,
	the arch/x86 maintainers

On Mon, 17 Nov 2014, Linus Torvalds wrote:
> On Mon, Nov 17, 2014 at 2:43 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > So a combo of both (Jens and yours) might do the trick. Patch below.
> 
> Yeah, I guess that would work. The important part is that *if*
> somebody really reuses the csd, we'd better have a release barrier
> (which csd_unlock() does, although badly - but this probably isn't
> that performance-critical) *before* we call the function, because
> otherwise there's no real serialization for the reuse.

Indeed.
 
> Of course, most of these things are presumably always per-cpu data
> structures, so the whole worry about "csd" being accessed from
> different CPU's probably doesn't even exist, and this all works fine
> as-is anyway, even in the presence of odd memory ordering issues.
> 
> Judging from Jens' later email, it looks like we simply don't need
> this code at all any more, though, and we could just revert the
> commit.

Right. Reverting it is the proper solution for now. Though we should
really think about the async separation later. It makes a lot of
sense.

> NOTE! I don't think this actually has anything to do with the actual
> problem that Dave saw. I just reacted to that WARN_ON() when I was
> looking at the code, and it made me go "that looks extremely
> suspicious".

One thing I was looking into today is the increased use of irq_work,
which uses IPIs as well. Not sure whether that's related, but it's
not off my radar yet.
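
For reference, irq_work raises its self-IPI roughly like this
(simplified from the 3.18-era kernel/irq_work.c; the lazy variant is
omitted):

bool irq_work_queue(struct irq_work *work)
{
	/* Only queue if not already pending */
	if (!irq_work_claim(work))
		return false;

	preempt_disable();

	/* The first enqueue on this cpu raises the self-IPI. */
	if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
		arch_irq_work_raise();

	preempt_enable();

	return true;
}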

But the possible compiler wreckage (or exposed kernel wreckage) is
frightening in several aspects ...

> Particularly on x86, with strong memory ordering, I don't think that
> any random accesses to 'csd' after the call to 'csd->func()' could
> actually matter. I just felt very nervous about the claim that
> somebody can reuse the csd immediately, that smelled bad to me from a
> *conceptual* standpoint, even if I suspect it works perfectly fine in
> practice.
> 
> Anyway, I've found *another* race condition, which (again) doesn't
> actually seem to be an issue on x86.
> 
> In particular, "csd_lock()" does things pretty well, in that it does a
> smp_mb() after setting the lock bit, so certainly nothing afterwards
> will leak out of that locked region.
> 
> But look at csd_lock_wait(). It just does
> 
>         while (csd->flags & CSD_FLAG_LOCK)
>                 cpu_relax();
> 
> and basically there are no memory barriers there. Now, on x86, this is a
> non-issue, since all reads act as an acquire, but at least in *theory*
> we have this completely unordered read going on. So any subsequent
> memory operations (ie after the return from generic_exec_single())
> could in theory see data from *before* the read.

True.
 
> So that whole kernel/smp.c locking looks rather dubious. The smp_mb()
> in csd_lock() is overkill (a "smp_store_release()" should be
> sufficient), and I think that the read of csd->flags in csd_lock_wait()
> should be a smp_load_acquire().
> 
> Again, none of this has anything to do with Dave's problem. The memory
> ordering issues really cannot be an issue on x86, I'm just saying that
> there's code there that makes me a bit uncomfortable.

Right you are and we should fix it asap.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-17 19:59           ` Linus Torvalds
@ 2014-11-18  2:09             ` Dave Jones
  2014-11-18  2:21               ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-18  2:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel, the arch/x86 maintainers

On Mon, Nov 17, 2014 at 11:59:34AM -0800, Linus Torvalds wrote:
 > On Mon, Nov 17, 2014 at 9:03 AM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > It could be a completely different cause for lockup, but seeing this now
 > > has me wondering if perhaps it's something unrelated to the kernel.
 > > I have recollection of running late .17rc's for days without incident,
 > > and I'm pretty sure .17 was ok too.  But a few weeks ago I did upgrade
 > > that test box to the Fedora 21 beta.  Which means I have a new gcc.
 > > I'm not sure I really trust 4.9.1 yet, so maybe I'll see if I can
 > > get 4.8 back on there and see if that's any better.
 > 
 > I'm not sure if I should be relieved or horrified.
 > 
 > Horrified, I think.
 > 
 > It really would be a wonderful thing to have some kind of "compiler
 > bisection" with mixed object files to see exactly which file it
 > miscompiles (and by "miscompiles" it might just be a kernel bug where
 > we are missing a barrier or something, and older gcc's just happened
 > to not show it - so it could still easily be a kernel problem).

After wasting countless hours rolling back to Fedora 20 and gcc 4.8.1,
I saw the exact same trace on 3.17, so now I don't know what to think.

So it's great that it's not a regression vs .17, but otoh, who knows
how far back this goes. This looks like a nightmarish bisect case, and
I've no idea why it's now happening so often.

I'll give Don's softlockup_all_cpu_backtrace=1 idea a try on 3.18rc5
and see if that shines any more light on this.

Deeply puzzling.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-18  2:09             ` Dave Jones
@ 2014-11-18  2:21               ` Linus Torvalds
  2014-11-18  2:39                 ` Dave Jones
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-18  2:21 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Mon, Nov 17, 2014 at 6:09 PM, Dave Jones <davej@redhat.com> wrote:
>
> After wasting countless hours rolling back to Fedora 20 and gcc 4.8.1,
> I saw the exact same trace on 3.17, so now I don't know what to think.

Uhhuh.

Has anything else changed? New trinity tests? If it has happened in as
little as ten minutes, and you don't recall having seen this until
about a week ago, it does sound like something changed.

But yeah, try the softlockup_all_cpu_backtrace, maybe there's a
pattern somewhere..

              Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-17 23:17           ` Thomas Gleixner
@ 2014-11-18  2:23             ` Jens Axboe
  0 siblings, 0 replies; 486+ messages in thread
From: Jens Axboe @ 2014-11-18  2:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Ingo Molnar, Dave Jones, Linux Kernel,
	the arch/x86 maintainers

On 11/17/2014 04:17 PM, Thomas Gleixner wrote:
> On Mon, 17 Nov 2014, Jens Axboe wrote:
>> On 11/17/2014 02:22 PM, Linus Torvalds wrote:
>>> On Fri, Nov 14, 2014 at 5:59 PM, Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>>
>>>> Judging by the code disassembly, it's the "csd_lock_wait(csd)" at the
>>>> end.
>>>
>>> Btw, looking at this, I grew really suspicious of this code in csd_unlock():
>>>
>>>          WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
>>>
>>> because that makes no sense at all. It basically removes a sanity
>>> check, yet that sanity check makes a hell of a lot of sense. Unlocking
>>> a CSD that is not locked is *wrong*.
>>>
>>> The crazy code comes from commit c84a83e2aaab ("smp: don't warn
>>> about csd->flags having CSD_FLAG_LOCK cleared for !wait") by Jens, but
>>> the explanation and the code is pure crap.
>>>
>>> There is no way in hell that it is ever correct to unlock an entry
>>> that isn't locked, so that whole CSD_FLAG_WAIT thing is buggy as hell.
>>>
>>> The explanation in commit c84a83e2aaab says that  "blk-mq reuses the
>>> request potentially immediately" and claims that that is somehow ok,
>>> but that's utter BS. Even if you don't ever wait for it, the CSD lock
>>> bit fundamentally also protects the "csd->llist" pointer. So what that
>>> commit actually does is to just remove a safety check, and do so in a
>>> very unsafe manner. And apparently block-mq re-uses something THAT IS
>>> STILL ACTIVELY IN USE. That's just horrible.
>>
>> I agree that this description is probably utter crap. And now I do
>> actually remember the issue at hand. The resource here is the tag, which
>> decides what request we'll use, and subsequently what call_single_data
>> storage is used. When this was originally done, blk-mq cleared the
>> request from the function callback, instead of doing it at allocation
>> time. The assumption here was cache hotness. That in turn also cleared
>> ->csd, which meant that the flags got zeroed and csd_unlock() was
>> naturally unhappy.
>
> So that's exactly what I described in my other reply.
>
>       csd_lock(csd);
>       queue_csd();
>       ipi();
> 				csd->func(csd->info);
>       wait_for_completion(csd);
> 				  complete(csd);
>       reuse_csd(csd);		
> 				csd_unlock(csd);
>
> When you call complete() nothing can rely on csd anymore, except for
> the smp core code ....

Right, and I didn't. It was the core use of csd->flags afterwards that 
complained. blk-mq merely cleared ->flags in csd->func(), which 
(granted) was a bit weird. So it was just storing to csd (before 
unlock), but in an inappropriate way. It would obviously have broken a 
sync invocation, but the block layer never does that.

>> THAT was the reuse case, not that the request would get reused
>> before we had finished the IPI fn callback since that would
>> obviously create other badness. Now I'm not sure what made me create
>> that patch, which in retrospect is a bad hammer for this problem.
>
> Performance blindness?

Possibly...

>> blk-mq doesn't do the init-at-finish thing anymore, so it should not be
>> hit by the issue. But if we do bring that back, then it would still work
>> fine with Thomas' patch, since we unlock prior to running the callback.
>
> So if blk-mq is not relying on that, then we really should back out
> that stuff for 3.18 and tag it for stable.

Yeah, I'd be fine with doing that. I don't recall off the top of my head 
when we stopped doing the clear at free time, but it was relatively 
early. OK, so I checked: 3.15 does init at free time, and 3.16 and 
later do it at allocation time. So the revert can only safely be 
applied to 3.16 and later...

> Treating sync and async function calls differently makes sense,
> because any async caller which cannot deal with the unlock before call
> scheme is broken by definition already today. But that's material for
> next.

Agree.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-18  2:21               ` Linus Torvalds
@ 2014-11-18  2:39                 ` Dave Jones
  2014-11-18  2:51                   ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-18  2:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel, the arch/x86 maintainers

On Mon, Nov 17, 2014 at 06:21:08PM -0800, Linus Torvalds wrote:
 > On Mon, Nov 17, 2014 at 6:09 PM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > After wasting countless hours rolling back to Fedora 20 and gcc 4.8.1,
 > > I saw the exact same trace on 3.17, so now I don't know what to think.
 > 
 > Uhhuh.
 > 
 > Has anything else changed? New trinity tests? If it has happened in as
 > little as ten minutes, and you don't recall having seen this until
 > about a week ago, it does sound like something changed.

Looking at the trinity commits over the last month or so, there's a few
new things, but nothing that sounds like it would trip up a bug like
this. "generate random ascii strings" and "mess with fcntl's after
opening fd's on startup" being the stand-outs. Everything else is pretty
much cleanups and code-motion. There was a lot of work on the code
that tracks mmaps about a month ago, but that shouldn't have had any
visible runtime differences.

<runs git diff>

hm, something I changed not that long ago, which I didn't commit yet,
was that it now runs more child processes than it used to (was 64, now 256).
I've been running like that for a while though. I want to say that was
before .17, but I'm not 100% sure.

So it could be that I'm just generating a lot more load now.
I could drop that back down and see if it 'goes away' or at least
happens less, but it strikes me that there's something here that needs
fixing regardless.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-18  2:39                 ` Dave Jones
@ 2014-11-18  2:51                   ` Linus Torvalds
  2014-11-18 14:52                     ` Dave Jones
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-18  2:51 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Mon, Nov 17, 2014 at 6:39 PM, Dave Jones <davej@redhat.com> wrote:
>
> So it could be that I'm just generating a lot more load now.
> I could drop that back down and see if it 'goes away' or at least
> happens less, but it strikes me that there's something here that needs
> fixing regardless.

Oh, absolutely. It's more a question of "maybe what changed can give us a clue".

But if it' something like "more load", that's not going to help
pinpoint, and you might be better off just doing the all-cpu-backtrace
thing and hope that gives some pattern to appreciate..

                  Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-18  2:51                   ` Linus Torvalds
@ 2014-11-18 14:52                     ` Dave Jones
  2014-11-18 17:20                       ` Linus Torvalds
  2014-11-18 18:54                       ` Thomas Gleixner
  0 siblings, 2 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-18 14:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel, the arch/x86 maintainers

On Mon, Nov 17, 2014 at 06:51:25PM -0800, Linus Torvalds wrote:
 > On Mon, Nov 17, 2014 at 6:39 PM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > So it could be that I'm just generating a lot more load now.
 > > I could drop that back down and see if it 'goes away' or at least
 > > happens less, but it strikes me that there's something here that needs
 > > fixing regardless.
 > 
 > Oh, absolutely. It's more a question of "maybe what changed can give us a clue".
 > 
But if it's something like "more load", that's not going to help
pinpoint it, and you might be better off just doing the all-cpu-backtrace
thing and hoping that gives some pattern to appreciate..

Here's the first hit. Curiously, one cpu is missing.


NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [trinity-c180:17837]
Modules linked in: dlci snd_seq_dummy fuse tun rfcomm bnep hidp scsi_transport_iscsi af_key llc2 can_raw nfnetlink can_bcm sctp libcrc32c nfc caif_socket caif af_802154 ieee802154 phonet af_rxrpc bluetooth can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic crct10dif_pclmul crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr usb_debug snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e snd_timer ptp shpchp snd pps_core soundcore nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc
irq event stamp: 2258092
hardirqs last  enabled at (2258091): [<ffffffffa91a58b5>] get_page_from_freelist+0x555/0xaa0
hardirqs last disabled at (2258092): [<ffffffffa985396a>] apic_timer_interrupt+0x6a/0x80
softirqs last  enabled at (2244380): [<ffffffffa907b87f>] __do_softirq+0x24f/0x6f0
softirqs last disabled at (2244377): [<ffffffffa907c0dd>] irq_exit+0x13d/0x160
CPU: 1 PID: 17837 Comm: trinity-c180 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 35/402 20526]
task: ffff8801575e4680 ti: ffff880202434000 task.ti: ffff880202434000
RIP: 0010:[<ffffffffa91a0db0>]  [<ffffffffa91a0db0>] bad_range+0x0/0x90
RSP: 0018:ffff8802024377a0  EFLAGS: 00000246
RAX: ffff8801575e4680 RBX: 0000000000000007 RCX: 0000000000000006
RDX: 0000000000002a20 RSI: ffffea0000887fc0 RDI: ffff88024d64c740
RBP: ffff880202437898 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000020 R14: 00000000001d8608 R15: 00000000001d8668
FS:  00007fd3b8960740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd3b5ea0777 CR3: 00000001027cd000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 ffffffffa91a58c4 00000000000009e4 ffff8801575e4680 0000000000000001
 ffff88024d64dd08 0000010000000000 0000000000000000 ffff8802024377f8
 0000000000000000 ffff88024d64dd00 ffffffffa90ac411 ffffffff00000003
Call Trace:
 [<ffffffffa91a58c4>] ? get_page_from_freelist+0x564/0xaa0
 [<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
 [<ffffffffa91a6030>] __alloc_pages_nodemask+0x230/0xd20
 [<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
 [<ffffffffa90d1e45>] ? mark_held_locks+0x75/0xa0
 [<ffffffffa91f400e>] alloc_pages_vma+0xee/0x1b0
 [<ffffffffa91b643e>] ? shmem_alloc_page+0x6e/0xc0
 [<ffffffffa91b643e>] shmem_alloc_page+0x6e/0xc0
 [<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
 [<ffffffffa90ac58b>] ? preempt_count_sub+0x7b/0x100
 [<ffffffffa93dcc46>] ? __percpu_counter_add+0x86/0xb0
 [<ffffffffa91d50d6>] ? __vm_enough_memory+0x66/0x1c0
 [<ffffffffa919ad65>] ? find_get_entry+0x5/0x230
 [<ffffffffa933b10c>] ? cap_vm_enough_memory+0x4c/0x60
 [<ffffffffa91b8ff0>] shmem_getpage_gfp+0x630/0xa40
 [<ffffffffa90cee01>] ? match_held_lock+0x111/0x160
 [<ffffffffa91b9442>] shmem_write_begin+0x42/0x70
 [<ffffffffa919a684>] generic_perform_write+0xd4/0x1f0
 [<ffffffffa919d5d2>] __generic_file_write_iter+0x162/0x350
 [<ffffffffa92154a0>] ? new_sync_read+0xd0/0xd0
 [<ffffffffa919d7ff>] generic_file_write_iter+0x3f/0xb0
 [<ffffffffa919d7c0>] ? __generic_file_write_iter+0x350/0x350
 [<ffffffffa92155e8>] do_iter_readv_writev+0x78/0xc0
 [<ffffffffa9216e18>] do_readv_writev+0xd8/0x2a0
 [<ffffffffa919d7c0>] ? __generic_file_write_iter+0x350/0x350
 [<ffffffffa90cf426>] ? lock_release_holdtime.part.28+0xe6/0x160
 [<ffffffffa919d7c0>] ? __generic_file_write_iter+0x350/0x350
 [<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
 [<ffffffffa90ac58b>] ? preempt_count_sub+0x7b/0x100
 [<ffffffffa90e782e>] ? rcu_read_lock_held+0x6e/0x80
 [<ffffffffa921706c>] vfs_writev+0x3c/0x50
 [<ffffffffa92171dc>] SyS_writev+0x5c/0x100
 [<ffffffffa9852c49>] tracesys_phase2+0xd4/0xd9
Code: 09 48 83 f2 01 83 e2 01 eb a3 90 48 c7 c7 a0 8c e4 a9 e8 44 e1 f2 ff 85 c0 75 d2 eb c1 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 <0f> 1f 44 00 00 48 b8 00 00 00 00 00 16 00 00 55 4c 8b 47 68 48 
sending NMI to other CPUs:
NMI backtrace for cpu 2
CPU: 2 PID: 15913 Comm: trinity-c141 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 35/402 20526]
task: ffff880223229780 ti: ffff8801afca0000 task.ti: ffff8801afca0000
RIP: 0010:[<ffffffffa9116dbe>]  [<ffffffffa9116dbe>] generic_exec_single+0xee/0x1a0
RSP: 0018:ffff8801afca3928  EFLAGS: 00000202
RAX: ffff8802443d9d00 RBX: ffff8801afca3930 RCX: ffff8802443d9dc0
RDX: ffff8802443d4d80 RSI: ffff8801afca3930 RDI: ffff8801afca3930
RBP: ffff8801afca3988 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000001 R14: ffff8801afca3a48 R15: ffffffffa9045bb0
FS:  00007fd3b8960740(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000022f8bd000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 ffff8801afca3a08 ffff8802443d9dc0 ffffffffa9045bb0 ffff8801afca3a48
 0000000000000003 000000007b19adc3 0000000000000001 00000000ffffffff
 0000000000000001 ffffffffa9045bb0 ffff8801afca3a48 0000000000000001
Call Trace:
 [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
 [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
 [<ffffffffa9116f3a>] smp_call_function_single+0x6a/0xe0
 [<ffffffffa93b2e1f>] ? cpumask_next_and+0x4f/0xb0
 [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
 [<ffffffffa9117679>] smp_call_function_many+0x2b9/0x320
 [<ffffffffa9046370>] flush_tlb_mm_range+0xe0/0x370
 [<ffffffffa91cc762>] tlb_flush_mmu_tlbonly+0x42/0x50
 [<ffffffffa91cdd28>] unmap_single_vma+0x6b8/0x900
 [<ffffffffa91ce06c>] zap_page_range_single+0xfc/0x160
 [<ffffffffa91ce254>] unmap_mapping_range+0x134/0x190
 [<ffffffffa91bb9dd>] shmem_fallocate+0x4fd/0x520
 [<ffffffffa90c7c77>] ? prepare_to_wait+0x27/0x90
 [<ffffffffa9213bc2>] do_fallocate+0x132/0x1d0
 [<ffffffffa91e3228>] SyS_madvise+0x398/0x870
 [<ffffffffa983f6c0>] ? rcu_read_lock_sched_held+0x4e/0x6a
 [<ffffffffa9013877>] ? syscall_trace_enter_phase2+0xa7/0x2b0
 [<ffffffffa9852c49>] tracesys_phase2+0xd4/0xd9
Code: 48 89 de 48 03 14 c5 60 74 f1 a9 48 89 df e8 0a fa 2a 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 f6 43 18 01 <75> f8 31 c0 48 8b 4d c8 65 48 33 0c 25 28 00 00 00 0f 85 8e 00 
NMI backtrace for cpu 0
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 37.091 msecs
CPU: 0 PID: 15851 Comm: trinity-c80 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 36/402 20526]
task: ffff8801874e8000 ti: ffff88022baec000 task.ti: ffff88022baec000
RIP: 0010:[<ffffffffa90ac450>]  [<ffffffffa90ac450>] preempt_count_add+0x0/0xc0
RSP: 0000:ffff880244003c30  EFLAGS: 00000092
RAX: 0000000000000001 RBX: ffffffffa9edb560 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000001
RBP: ffff880244003c48 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: ffff8801874e88c8 [23543.271956] NMI backtrace for cpu 3
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 100.612 msecs
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 37/402 20526]
task: ffff880242b5c680 ti: ffff880242b78000 task.ti: ffff880242b78000
RIP: 0010:[<ffffffffa94251b5>]  [<ffffffffa94251b5>] intel_idle+0xd5/0x180
RSP: 0018:ffff880242b7bdf8  EFLAGS: 00000046
RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff880242b7bfd8 RDI: 0000000000000003
RBP: ffff880242b7be28 R08: 000000008baf8f3d R09: 0000000000000000
R10: 0000000000000000 R11: ffff880242b5cea0 R12: 0000000000000005
R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242b78000
FS:  0000000000000000(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000b1c9ac CR3: 0000000029e11000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 0000000342b7be28 afc453cb003f4590 ffffe8ffff402200 0000000000000005
 ffffffffa9eaa0c0 0000000000000003 ffff880242b7be78 ffffffffa96bbb45
 0000156cc07cf6e3 ffffffffa9eaa290 0000000000000096 ffffffffa9f197b0
Call Trace:
 [<ffffffffa96bbb45>] cpuidle_enter_state+0x55/0x300
 [<ffffffffa96bbea7>] cpuidle_enter+0x17/0x20
 [<ffffffffa90c88f5>] cpu_startup_entry+0x4e5/0x630
 [<ffffffffa902d523>] start_secondary+0x1a3/0x220
Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48 
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 125.739 msecs


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-18 14:52                     ` Dave Jones
@ 2014-11-18 17:20                       ` Linus Torvalds
  2014-11-18 19:28                         ` Thomas Gleixner
  2014-11-18 18:54                       ` Thomas Gleixner
  1 sibling, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-18 17:20 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, the arch/x86 maintainers, Don Zickus

On Tue, Nov 18, 2014 at 6:52 AM, Dave Jones <davej@redhat.com> wrote:
>
> Here's the first hit. Curiously, one cpu is missing.

That might be the CPU3 that isn't responding to IPIs due to some bug..

> NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [trinity-c180:17837]
> RIP: 0010:[<ffffffffa91a0db0>]  [<ffffffffa91a0db0>] bad_range+0x0/0x90

Hmm. Something looping in the page allocator? Not waiting for a lock,
but livelocked? I'm not seeing anything here that should trigger the
NMI watchdog at all.

Can the NMI watchdog get confused somehow?

> Call Trace:
>  [<ffffffffa91a6030>] __alloc_pages_nodemask+0x230/0xd20
>  [<ffffffffa91f400e>] alloc_pages_vma+0xee/0x1b0
>  [<ffffffffa91b643e>] shmem_alloc_page+0x6e/0xc0
>  [<ffffffffa91b8ff0>] shmem_getpage_gfp+0x630/0xa40
>  [<ffffffffa91b9442>] shmem_write_begin+0x42/0x70
>  [<ffffffffa919a684>] generic_perform_write+0xd4/0x1f0
>  [<ffffffffa919d5d2>] __generic_file_write_iter+0x162/0x350
>  [<ffffffffa919d7ff>] generic_file_write_iter+0x3f/0xb0
>  [<ffffffffa92155e8>] do_iter_readv_writev+0x78/0xc0
>  [<ffffffffa9216e18>] do_readv_writev+0xd8/0x2a0
>  [<ffffffffa90cf426>] ? lock_release_holdtime.part.28+0xe6/0x160
>  [<ffffffffa921706c>] vfs_writev+0x3c/0x50

And CPU2 is in that TLB flusher again:

> NMI backtrace for cpu 2
> RIP: 0010:[<ffffffffa9116dbe>]  [<ffffffffa9116dbe>] generic_exec_single+0xee/0x1a0
> Call Trace:
>  [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
>  [<ffffffffa9116f3a>] smp_call_function_single+0x6a/0xe0
>  [<ffffffffa9117679>] smp_call_function_many+0x2b9/0x320
>  [<ffffffffa9046370>] flush_tlb_mm_range+0xe0/0x370
>  [<ffffffffa91cc762>] tlb_flush_mmu_tlbonly+0x42/0x50
>  [<ffffffffa91cdd28>] unmap_single_vma+0x6b8/0x900
>  [<ffffffffa91ce06c>] zap_page_range_single+0xfc/0x160
>  [<ffffffffa91ce254>] unmap_mapping_range+0x134/0x190

.. and the code line implies that it's in that csd_lock_wait() loop,
again consistent with waiting for some other CPU. Presumably the
missing CPU3.

> NMI backtrace for cpu 0
> RIP: 0010:[<ffffffffa90ac450>]  [<ffffffffa90ac450>] preempt_count_add+0x0/0xc0
> Call Trace:
>  [<ffffffffa96bbb45>] cpuidle_enter_state+0x55/0x300
>  [<ffffffffa96bbea7>] cpuidle_enter+0x17/0x20
>  [<ffffffffa90c88f5>] cpu_startup_entry+0x4e5/0x630
>  [<ffffffffa902d523>] start_secondary+0x1a3/0x220

And CPU0 is just in the idle loop (that RIP is literally the
instruction after the "mwait" according to the code line).

> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 125.739 msecs

.. and that's us giving up on CPU3.

So it does look like CPU3 is the problem, but sadly, CPU3 is
apparently not listening, and doesn't even react to the NMI, much less
a TLB flush IPI.

Not reacting to NMI could be:
 (a) some APIC state issue
 (b) we're already stuck in a loop in the previous NMI handler
 (c) what?

Anybody?

                     Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-18 14:52                     ` Dave Jones
  2014-11-18 17:20                       ` Linus Torvalds
@ 2014-11-18 18:54                       ` Thomas Gleixner
  2014-11-18 21:55                         ` Don Zickus
  1 sibling, 1 reply; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-18 18:54 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Tue, 18 Nov 2014, Dave Jones wrote:
> Here's the first hit. Curiously, one cpu is missing.

I don't think so

> NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [trinity-c180:17837]

> irq event stamp: 2258092
> hardirqs last  enabled at (2258091): [<ffffffffa91a58b5>] get_page_from_freelist+0x555/0xaa0
> hardirqs last disabled at (2258092): [<ffffffffa985396a>] apic_timer_interrupt+0x6a/0x80

So that means we are in the timer interrupt and handling
watchdog_timer_fn.

> CPU: 1 PID: 17837 Comm: trinity-c180 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 35/402 20526]
> task: ffff8801575e4680 ti: ffff880202434000 task.ti: ffff880202434000
> RIP: 0010:[<ffffffffa91a0db0>]  [<ffffffffa91a0db0>] bad_range+0x0/0x90

So the softlockup tells us that the high-priority watchdog thread was
not able to touch the watchdog timestamp. That means this task was
hogging the CPU for 20+ seconds. I have no idea how that happens in
that call chain.
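
The check that fires here is roughly the following (simplified from
the 3.18-era kernel/watchdog.c):

/* Called from the watchdog hrtimer, i.e. from the timer interrupt. */
static int is_softlockup(unsigned long touch_ts)
{
	unsigned long now = get_timestamp();

	/* Has the [watchdog/N] thread been starved past the threshold? */
	if (time_after(now, touch_ts + get_softlockup_thresh()))
		return now - touch_ts;

	return 0;
}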

Call Trace:
 [<ffffffffa91a58c4>] ? get_page_from_freelist+0x564/0xaa0
 [<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
 [<ffffffffa91a6030>] __alloc_pages_nodemask+0x230/0xd20
 [<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
 [<ffffffffa90d1e45>] ? mark_held_locks+0x75/0xa0
 [<ffffffffa91f400e>] alloc_pages_vma+0xee/0x1b0
 [<ffffffffa91b643e>] ? shmem_alloc_page+0x6e/0xc0
 [<ffffffffa91b643e>] shmem_alloc_page+0x6e/0xc0
 [<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
 [<ffffffffa90ac58b>] ? preempt_count_sub+0x7b/0x100
 [<ffffffffa93dcc46>] ? __percpu_counter_add+0x86/0xb0
 [<ffffffffa91d50d6>] ? __vm_enough_memory+0x66/0x1c0
 [<ffffffffa919ad65>] ? find_get_entry+0x5/0x230
 [<ffffffffa933b10c>] ? cap_vm_enough_memory+0x4c/0x60
 [<ffffffffa91b8ff0>] shmem_getpage_gfp+0x630/0xa40
 [<ffffffffa90cee01>] ? match_held_lock+0x111/0x160
 [<ffffffffa91b9442>] shmem_write_begin+0x42/0x70
 [<ffffffffa919a684>] generic_perform_write+0xd4/0x1f0
 [<ffffffffa919d5d2>] __generic_file_write_iter+0x162/0x350
 [<ffffffffa92154a0>] ? new_sync_read+0xd0/0xd0
 [<ffffffffa919d7ff>] generic_file_write_iter+0x3f/0xb0
 [<ffffffffa919d7c0>] ? __generic_file_write_iter+0x350/0x350
 [<ffffffffa92155e8>] do_iter_readv_writev+0x78/0xc0
 [<ffffffffa9216e18>] do_readv_writev+0xd8/0x2a0
 [<ffffffffa919d7c0>] ? __generic_file_write_iter+0x350/0x350
 [<ffffffffa90cf426>] ? lock_release_holdtime.part.28+0xe6/0x160
 [<ffffffffa919d7c0>] ? __generic_file_write_iter+0x350/0x350
 [<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
 [<ffffffffa90ac58b>] ? preempt_count_sub+0x7b/0x100
 [<ffffffffa90e782e>] ? rcu_read_lock_held+0x6e/0x80
 [<ffffffffa921706c>] vfs_writev+0x3c/0x50
 [<ffffffffa92171dc>] SyS_writev+0x5c/0x100
 [<ffffffffa9852c49>] tracesys_phase2+0xd4/0xd9

But this gets pages for a write into shmem and the other one below
does a madvise on a shmem map. Coincidence?

> sending NMI to other CPUs:

So here we kick the other cpus

> NMI backtrace for cpu 2
> CPU: 2 PID: 15913 Comm: trinity-c141 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 35/402 20526]
> task: ffff880223229780 ti: ffff8801afca0000 task.ti: ffff8801afca0000
> RIP: 0010:[<ffffffffa9116dbe>]  [<ffffffffa9116dbe>] generic_exec_single+0xee/0x1a0
>  [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
>  [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
>  [<ffffffffa9116f3a>] smp_call_function_single+0x6a/0xe0
>  [<ffffffffa93b2e1f>] ? cpumask_next_and+0x4f/0xb0
>  [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
>  [<ffffffffa9117679>] smp_call_function_many+0x2b9/0x320
>  [<ffffffffa9046370>] flush_tlb_mm_range+0xe0/0x370
>  [<ffffffffa91cc762>] tlb_flush_mmu_tlbonly+0x42/0x50
>  [<ffffffffa91cdd28>] unmap_single_vma+0x6b8/0x900
>  [<ffffffffa91ce06c>] zap_page_range_single+0xfc/0x160
>  [<ffffffffa91ce254>] unmap_mapping_range+0x134/0x190
>  [<ffffffffa91bb9dd>] shmem_fallocate+0x4fd/0x520
>  [<ffffffffa90c7c77>] ? prepare_to_wait+0x27/0x90
>  [<ffffffffa9213bc2>] do_fallocate+0x132/0x1d0
>  [<ffffffffa91e3228>] SyS_madvise+0x398/0x870
>  [<ffffffffa983f6c0>] ? rcu_read_lock_sched_held+0x4e/0x6a
>  [<ffffffffa9013877>] ? syscall_trace_enter_phase2+0xa7/0x2b0
>  [<ffffffffa9852c49>] tracesys_phase2+0xd4/0xd9

We've seen that before

> NMI backtrace for cpu 0
> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 37.091 msecs

So it complains that the backtrace handler took 37 msec, which is
indeed long for just dumping a stack trace.

> CPU: 0 PID: 15851 Comm: trinity-c80 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 36/402 20526]
> task: ffff8801874e8000 ti: ffff88022baec000 task.ti: ffff88022baec000
> RIP: 0010:[<ffffffffa90ac450>]  [<ffffffffa90ac450>] preempt_count_add+0x0/0xc0
> RSP: 0000:ffff880244003c30  EFLAGS: 00000092
> RAX: 0000000000000001 RBX: ffffffffa9edb560 RCX: 0000000000000001
> RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000001
> RBP: ffff880244003c48 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000000 R11: ffff8801874e88c8 [23543.271956] NMI backtrace for cpu 3

So here we mangle CPU3 in and lose the backtrace for cpu0, which might
be the really interesting one ....

> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 100.612 msecs

This one takes 100ms.

> CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 37/402 20526]
> task: ffff880242b5c680 ti: ffff880242b78000 task.ti: ffff880242b78000
> RIP: 0010:[<ffffffffa94251b5>]  [<ffffffffa94251b5>] intel_idle+0xd5/0x180

So that one is simply idle.

> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 125.739 msecs
> 

And we get another backtrace handler taking too long. Of course we
cannot tell which of the 3 complaints comes from which cpu, because
the printk lacks a cpuid.
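
For the record, tagging that message with a cpu id would be a
one-liner in nmi_handle() in 3.18's arch/x86/kernel/nmi.c (untested
sketch):

	printk_ratelimited(KERN_INFO
		"INFO: CPU%d: NMI handler (%ps) took too long to run: %lld.%03d msecs\n",
		smp_processor_id(), a->handler, whole_msecs, decimal_msecs);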

Thanks,

	tglx




^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-18 17:20                       ` Linus Torvalds
@ 2014-11-18 19:28                         ` Thomas Gleixner
  2014-11-18 21:25                           ` Don Zickus
  0 siblings, 1 reply; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-18 19:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Linux Kernel, the arch/x86 maintainers, Don Zickus

On Tue, 18 Nov 2014, Linus Torvalds wrote:
> On Tue, Nov 18, 2014 at 6:52 AM, Dave Jones <davej@redhat.com> wrote:
> >
> > Here's the first hit. Curiously, one cpu is missing.
> 
> That might be the CPU3 that isn't responding to IPIs due to some bug..
> 
> > NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [trinity-c180:17837]
> > RIP: 0010:[<ffffffffa91a0db0>]  [<ffffffffa91a0db0>] bad_range+0x0/0x90
> 
> Hmm. Something looping in the page allocator? Not waiting for a lock,
> but livelocked? I'm not seeing anything here that should trigger the
> NMI watchdog at all.
> 
> Can the NMI watchdog get confused somehow?

That's the soft lockup detector which runs from the timer interrupt
not from NMI.
 
> So it does look like CPU3 is the problem, but sadly, CPU3 is
> apparently not listening, and doesn't even react to the NMI, much less

As I said in the other mail, it gets the NMI and reacts to it. It's
just mangled into the CPU0 backtrace.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-18 19:28                         ` Thomas Gleixner
@ 2014-11-18 21:25                           ` Don Zickus
  2014-11-18 21:31                             ` Dave Jones
  0 siblings, 1 reply; 486+ messages in thread
From: Don Zickus @ 2014-11-18 21:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Dave Jones, Linux Kernel, the arch/x86 maintainers

On Tue, Nov 18, 2014 at 08:28:01PM +0100, Thomas Gleixner wrote:
> On Tue, 18 Nov 2014, Linus Torvalds wrote:
> > On Tue, Nov 18, 2014 at 6:52 AM, Dave Jones <davej@redhat.com> wrote:
> > >
> > > Here's the first hit. Curiously, one cpu is missing.
> > 
> > That might be the CPU3 that isn't responding to IPIs due to some bug..
> > 
> > > NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [trinity-c180:17837]
> > > RIP: 0010:[<ffffffffa91a0db0>]  [<ffffffffa91a0db0>] bad_range+0x0/0x90
> > 
> > Hmm. Something looping in the page allocator? Not waiting for a lock,
> > but livelocked? I'm not seeing anything here that should trigger the
> > NMI watchdog at all.
> > 
> > Can the NMI watchdog get confused somehow?
> 
> That's the soft lockup detector which runs from the timer interrupt
> not from NMI.
>  
> > So it does look like CPU3 is the problem, but sadly, CPU3 is
> > apparently not listening, and doesn't even react to the NMI, much less
> 
> As I said in the other mail, it gets the NMI and reacts to it. It's
> just mangled into the CPU0 backtrace.

I was going to reply about both points too. :-)  Though the mangling looks
odd because we have spin_locks serializing the output for each cpu.

Another thing I wanted to ask DaveJ: did you recently turn on
CONFIG_PREEMPT?  That would explain why you are seeing the softlockups
now.  If you disable CONFIG_PREEMPT, do the softlockups disappear?

Cheers,
Don

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-18 21:25                           ` Don Zickus
@ 2014-11-18 21:31                             ` Dave Jones
  0 siblings, 0 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-18 21:31 UTC (permalink / raw)
  To: Don Zickus
  Cc: Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Tue, Nov 18, 2014 at 04:25:53PM -0500, Don Zickus wrote:
 
 > I was going to reply about both points too. :-)  Though the mangling looks
 > odd because we have spin_locks serializing the output for each cpu.
 > 
 > Another thing I wanted to ask DaveJ: did you recently turn on
 > CONFIG_PREEMPT?  That would explain why you are seeing the softlockups
 > now.  If you disable CONFIG_PREEMPT, do the softlockups disappear?

I've had it on on my test box forever.  I'll add turning it off
to the list of things to try.

	Dave

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-18 18:54                       ` Thomas Gleixner
@ 2014-11-18 21:55                         ` Don Zickus
  2014-11-18 22:02                           ` Dave Jones
  2014-11-19  2:19                           ` Dave Jones
  0 siblings, 2 replies; 486+ messages in thread
From: Don Zickus @ 2014-11-18 21:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Tue, Nov 18, 2014 at 07:54:17PM +0100, Thomas Gleixner wrote:
> On Tue, 18 Nov 2014, Dave Jones wrote:
> > Here's the first hit. Curiously, one cpu is missing.
> 
> I don't think so
> 
> > NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [trinity-c180:17837]
> 
> > irq event stamp: 2258092
> > hardirqs last  enabled at (2258091): [<ffffffffa91a58b5>] get_page_from_freelist+0x555/0xaa0
> > hardirqs last disabled at (2258092): [<ffffffffa985396a>] apic_timer_interrupt+0x6a/0x80
> 
> So that means we are in the timer interrupt and handling
> watchdog_timer_fn.
> 
> > CPU: 1 PID: 17837 Comm: trinity-c180 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 35/402 20526]
> > task: ffff8801575e4680 ti: ffff880202434000 task.ti: ffff880202434000
> > RIP: 0010:[<ffffffffa91a0db0>]  [<ffffffffa91a0db0>] bad_range+0x0/0x90
> 
> So the softlockup tells us that the high-priority watchdog thread was
> not able to touch the watchdog timestamp. That means this task was
> hogging the CPU for 20+ seconds. I have no idea how that happens in
> that call chain.
> 
> Call Trace:
>  [<ffffffffa91a58c4>] ? get_page_from_freelist+0x564/0xaa0
>  [<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
>  [<ffffffffa91a6030>] __alloc_pages_nodemask+0x230/0xd20
>  [<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
>  [<ffffffffa90d1e45>] ? mark_held_locks+0x75/0xa0
>  [<ffffffffa91f400e>] alloc_pages_vma+0xee/0x1b0
>  [<ffffffffa91b643e>] ? shmem_alloc_page+0x6e/0xc0
>  [<ffffffffa91b643e>] shmem_alloc_page+0x6e/0xc0
>  [<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
>  [<ffffffffa90ac58b>] ? preempt_count_sub+0x7b/0x100
>  [<ffffffffa93dcc46>] ? __percpu_counter_add+0x86/0xb0
>  [<ffffffffa91d50d6>] ? __vm_enough_memory+0x66/0x1c0
>  [<ffffffffa919ad65>] ? find_get_entry+0x5/0x230
>  [<ffffffffa933b10c>] ? cap_vm_enough_memory+0x4c/0x60
>  [<ffffffffa91b8ff0>] shmem_getpage_gfp+0x630/0xa40
>  [<ffffffffa90cee01>] ? match_held_lock+0x111/0x160
>  [<ffffffffa91b9442>] shmem_write_begin+0x42/0x70
>  [<ffffffffa919a684>] generic_perform_write+0xd4/0x1f0
>  [<ffffffffa919d5d2>] __generic_file_write_iter+0x162/0x350
>  [<ffffffffa92154a0>] ? new_sync_read+0xd0/0xd0
>  [<ffffffffa919d7ff>] generic_file_write_iter+0x3f/0xb0
>  [<ffffffffa919d7c0>] ? __generic_file_write_iter+0x350/0x350
>  [<ffffffffa92155e8>] do_iter_readv_writev+0x78/0xc0
>  [<ffffffffa9216e18>] do_readv_writev+0xd8/0x2a0
>  [<ffffffffa919d7c0>] ? __generic_file_write_iter+0x350/0x350
>  [<ffffffffa90cf426>] ? lock_release_holdtime.part.28+0xe6/0x160
>  [<ffffffffa919d7c0>] ? __generic_file_write_iter+0x350/0x350
>  [<ffffffffa90ac411>] ? get_parent_ip+0x11/0x50
>  [<ffffffffa90ac58b>] ? preempt_count_sub+0x7b/0x100
>  [<ffffffffa90e782e>] ? rcu_read_lock_held+0x6e/0x80
>  [<ffffffffa921706c>] vfs_writev+0x3c/0x50
>  [<ffffffffa92171dc>] SyS_writev+0x5c/0x100
>  [<ffffffffa9852c49>] tracesys_phase2+0xd4/0xd9
> 
> But this gets pages for a write into shmem and the other one below
> does a madvise on a shmem map. Coincidence?
> 
> > sending NMI to other CPUs:
> 
> So here we kick the other cpus
> 
> > NMI backtrace for cpu 2
> > CPU: 2 PID: 15913 Comm: trinity-c141 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 35/402 20526]
> > task: ffff880223229780 ti: ffff8801afca0000 task.ti: ffff8801afca0000
> > RIP: 0010:[<ffffffffa9116dbe>]  [<ffffffffa9116dbe>] generic_exec_single+0xee/0x1a0
> >  [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
> >  [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
> >  [<ffffffffa9116f3a>] smp_call_function_single+0x6a/0xe0
> >  [<ffffffffa93b2e1f>] ? cpumask_next_and+0x4f/0xb0
> >  [<ffffffffa9045bb0>] ? do_flush_tlb_all+0x60/0x60
> >  [<ffffffffa9117679>] smp_call_function_many+0x2b9/0x320
> >  [<ffffffffa9046370>] flush_tlb_mm_range+0xe0/0x370
> >  [<ffffffffa91cc762>] tlb_flush_mmu_tlbonly+0x42/0x50
> >  [<ffffffffa91cdd28>] unmap_single_vma+0x6b8/0x900
> >  [<ffffffffa91ce06c>] zap_page_range_single+0xfc/0x160
> >  [<ffffffffa91ce254>] unmap_mapping_range+0x134/0x190
> >  [<ffffffffa91bb9dd>] shmem_fallocate+0x4fd/0x520
> >  [<ffffffffa90c7c77>] ? prepare_to_wait+0x27/0x90
> >  [<ffffffffa9213bc2>] do_fallocate+0x132/0x1d0
> >  [<ffffffffa91e3228>] SyS_madvise+0x398/0x870
> >  [<ffffffffa983f6c0>] ? rcu_read_lock_sched_held+0x4e/0x6a
> >  [<ffffffffa9013877>] ? syscall_trace_enter_phase2+0xa7/0x2b0
> >  [<ffffffffa9852c49>] tracesys_phase2+0xd4/0xd9
> 
> We've seen that before
> 
> > NMI backtrace for cpu 0
> > INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 37.091 msecs
> 
> So it complains that the backtrace handler took 37 msec, which is
> indeed long for just dumping a stack trace.
> 
> > CPU: 0 PID: 15851 Comm: trinity-c80 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 36/402 20526]
> > task: ffff8801874e8000 ti: ffff88022baec000 task.ti: ffff88022baec000
> > RIP: 0010:[<ffffffffa90ac450>]  [<ffffffffa90ac450>] preempt_count_add+0x0/0xc0
> > RSP: 0000:ffff880244003c30  EFLAGS: 00000092
> > RAX: 0000000000000001 RBX: ffffffffa9edb560 RCX: 0000000000000001
> > RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000001
> > RBP: ffff880244003c48 R08: 0000000000000000 R09: 0000000000000001
> > R10: 0000000000000000 R11: ffff8801874e88c8 [23543.271956] NMI backtrace for cpu 3
> 
> So here we mangle CPU3 in and lose the backtrace for cpu0, which might
> be the real interesting one ....


Dave,

Can you provide another dump?  The hope is we get something that isn't mangled.

The other option we have done in RHEL is panic the system and let kdump
capture the memory.  Then we can analyze the vmcore for the stack trace
cpu0 stored in memory to get a rough idea where it might be if the cpu
isn't responding very well.
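
Roughly, assuming the crash(8) utility and a vmlinux with debuginfo that
matches the crashed kernel (paths below are just illustrative), that looks
like:

  # crash /usr/lib/debug/lib/modules/`uname -r`/vmlinux /var/crash/<date>/vmcore
  crash> bt -a

"bt -a" prints the stack of the active task on every cpu, which is what
we'd want from the cpu that isn't responding.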

Cheers,
Don

> 
> > INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 100.612 msecs
> 
> This one takes 100ms.
> 
> > CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.18.0-rc5+ #90 [loadavg: 199.00 178.81 173.92 37/402 20526]
> > task: ffff880242b5c680 ti: ffff880242b78000 task.ti: ffff880242b78000
> > RIP: 0010:[<ffffffffa94251b5>]  [<ffffffffa94251b5>] intel_idle+0xd5/0x180
> 
> So that one is simply idle.
> 
> > INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 125.739 msecs
> > 
> 
> And we get another backtrace handler taking too long. Of course we
> cannot tell which of the 3 complaints comes from which cpu, because
> the printk lacks a cpuid.
> 
> Thanks,
> 
> 	tglx
> 
> 
> 

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-18 21:55                         ` Don Zickus
@ 2014-11-18 22:02                           ` Dave Jones
  2014-11-19 14:41                             ` Don Zickus
  2014-11-19  2:19                           ` Dave Jones
  1 sibling, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-18 22:02 UTC (permalink / raw)
  To: Don Zickus
  Cc: Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Tue, Nov 18, 2014 at 04:55:40PM -0500, Don Zickus wrote:

 > > So here we mangle CPU3 in and lose the backtrace for cpu0, which might
 > > be the real interesting one ....
 > 
 > Can you provide another dump?  The hope is we get something that isn't mangled.

Working on it..

 > The other option we have done in RHEL is panic the system and let kdump
 > capture the memory.  Then we can analyze the vmcore for the stack trace
 > cpu0 stored in memory to get a rough idea where it might be if the cpu
 > isn't responding very well.

I don't know if it's because of the debug options I typically run with,
or that I'm perpetually cursed, but I've never managed to get kdump to
do anything useful. (The last time I tried it was actively harmful in
that not only did it fail to dump anything, it wedged the machine so
it didn't reboot after panic).

Unless there's some magic step missing from the documentation at
http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
then I'm not optimistic it'll be useful.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-18 21:55                         ` Don Zickus
  2014-11-18 22:02                           ` Dave Jones
@ 2014-11-19  2:19                           ` Dave Jones
  2014-11-19  4:40                             ` Linus Torvalds
  1 sibling, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-19  2:19 UTC (permalink / raw)
  To: Don Zickus
  Cc: Thomas Gleixner, Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Tue, Nov 18, 2014 at 04:55:40PM -0500, Don Zickus wrote:
 
 > Can you provide another dump?  The hope is we get something that isn't mangled.
 
Ok, here's another instance.
This time around, we got all 4 cpu traces.

NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
task: ffff88023fda4680 ti: ffff880101ee0000 task.ti: ffff880101ee0000
RIP: 0010:[<ffffffff8a1798b4>]  [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
RSP: 0018:ffff880101ee3f00  EFLAGS: 00000282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8802445d1258
RDX: 0000000000000001 RSI: ffffffff8aac2c64 RDI: ffffffff8aa94505
RBP: ffff880101ee3f10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880101ee3ec0
R13: ffffffff8a38c577 R14: ffff880101ee3e70 R15: ffffffff8aa9ee99
FS:  00007f706c089740(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001bf67f000 CR4: 00000000001407e0
DR0: 00007f0b19510000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 ffff880101ee3f58 00007f706bcd0000 ffff880101ee3f40 ffffffff8a012fc5
 0000000000000000 0000000000007c1b 00007f706bcd0000 00007f706bcd0068
 0000000000000000 ffffffff8a7d8624 000000001008feff 0000000000000000
Call Trace:
 [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
 [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
Code: 75 4d 48 c7 c7 64 2c ac 8a e8 e9 2c 21 00 65 c7 04 25 54 f7 1c 00 01 00 00 00 41 f7 c4 00 02 00 00 74 1c e8 8f 44 fd ff 41 54 9d <5b> 41 5c 5d c3 0f 1f 80 00 00 00 00 f3 c3 66 0f 1f 44 00 00 41 
sending NMI to other CPUs:
NMI backtrace for cpu 0
CPU: 0 PID: 27716 Comm: kworker/0:1 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
Workqueue: events nohz_kick_work_fn
task: ffff88017c358000 ti: ffff8801d7124000 task.ti: ffff8801d7124000
RIP: 0010:[<ffffffff8a0ffb52>]  [<ffffffff8a0ffb52>] smp_call_function_many+0x1b2/0x320
RSP: 0018:ffff8801d7127cb8  EFLAGS: 00000202
RAX: 0000000000000002 RBX: ffff8802441d4dc0 RCX: 0000000000000038
RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8802445d84e8
RBP: ffff8801d7127d08 R08: ffff880243c3aa80 R09: 0000000000000000
R10: ffff880243c3aa80 R11: 0000000000000000 R12: ffffffff8a0fa2c0
R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000002
FS:  0000000000000000(0000) GS:ffff880244000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f70688ff000 CR3: 000000000ac11000 CR4: 00000000001407f0
DR0: 00007f0b19510000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 ffff8801d7127d28 0000000000000246 00000000d7127ce8 00000000001d4d80
 ffff880206b99780 ffff88017f6ccc30 ffff8802441d3640 ffff8802441d9e00
 ffffffff8ac4e340 0000000000000000 ffff8801d7127d18 ffffffff8a0fa3f5
Call Trace:
 [<ffffffff8a0fa3f5>] tick_nohz_full_kick_all+0x35/0x70
 [<ffffffff8a0ec8fe>] nohz_kick_work_fn+0xe/0x10
 [<ffffffff8a08e61d>] process_one_work+0x1fd/0x590
 [<ffffffff8a08e597>] ? process_one_work+0x177/0x590
 [<ffffffff8a0c112e>] ? put_lock_stats.isra.23+0xe/0x30
 [<ffffffff8a08eacb>] worker_thread+0x11b/0x490
 [<ffffffff8a08e9b0>] ? process_one_work+0x590/0x590
 [<ffffffff8a0942e9>] kthread+0xf9/0x110
 [<ffffffff8a0c112e>] ? put_lock_stats.isra.23+0xe/0x30
 [<ffffffff8a0941f0>] ? kthread_create_on_node+0x250/0x250
 [<ffffffff8a7d82ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff8a0941f0>] ? kthread_create_on_node+0x250/0x250
Code: a5 c1 00 49 89 c7 41 89 c6 7d 7e 49 63 c7 48 8b 3b 48 03 3c c5 e0 6b d1 8a 0f b7 57 18 f6 c2 01 74 12 0f 1f 80 00 00 00 00 f3 90 <0f> b7 57 18 f6 c2 01 75 f5 83 ca 01 66 89 57 18 0f ae f0 48 8b 
NMI backtrace for cpu 1
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 34.478 msecs
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 10/411 32140]
task: ffff880242b5de00 ti: ffff880242b64000 task.ti: ffff880242b64000
RIP: 0010:[<ffffffff8a3e14a5>]  [<ffffffff8a3e14a5>] intel_idle+0xd5/0x180
RSP: 0018:ffff880242b67df8  EFLAGS: 00000046
RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff880242b67fd8 RDI: 0000000000000001
RBP: ffff880242b67e28 R08: 000000008baf93be R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242b64000
FS:  0000000000000000(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcd3a34d000 CR3: 000000000ac11000 CR4: 00000000001407e0
DR0: 00007f0b19510000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 0000000142b67e28 84d4d6ff01c7f4fd ffffe8ffff002200 0000000000000005
 ffffffff8acaa080 0000000000000001 ffff880242b67e78 ffffffff8a666075
 000011f1e9f6e882 ffffffff8acaa250 ffff880242b64000 ffffffff8ad18f30
Call Trace:
 [<ffffffff8a666075>] cpuidle_enter_state+0x55/0x1c0
 [<ffffffff8a666297>] cpuidle_enter+0x17/0x20
 [<ffffffff8a0bb323>] cpu_startup_entry+0x433/0x4e0
 [<ffffffff8a02b763>] start_secondary+0x1a3/0x220
Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48 
NMI backtrace for cpu 3
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 62.461 msecs
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 10/411 32140]
task: ffff880242b5c680 ti: ffff880242b78000 task.ti: ffff880242b78000
RIP: 0010:[<ffffffff8a3e14a5>]  [<ffffffff8a3e14a5>] intel_idle+0xd5/0x180
RSP: 0018:ffff880242b7bdf8  EFLAGS: 00000046
RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff880242b7bfd8 RDI: ffffffff8ac11000
RBP: ffff880242b7be28 R08: 000000008baf93be R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242b78000
FS:  0000000000000000(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000ac11000 CR4: 00000000001407e0
DR0: 00007f0b19510000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 0000000342b7be28 43db433bbd06c740 ffffe8ffff402200 0000000000000005
 ffffffff8acaa080 0000000000000003 ffff880242b7be78 ffffffff8a666075
 000011f1d69aaf37 ffffffff8acaa250 ffff880242b78000 ffffffff8ad18f30
Call Trace:
 [<ffffffff8a666075>] cpuidle_enter_state+0x55/0x1c0
 [<ffffffff8a666297>] cpuidle_enter+0x17/0x20
 [<ffffffff8a0bb323>] cpu_startup_entry+0x433/0x4e0
 [<ffffffff8a02b763>] start_secondary+0x1a3/0x220
Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48 
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 89.635 msecs


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19  2:19                           ` Dave Jones
@ 2014-11-19  4:40                             ` Linus Torvalds
  2014-11-19  4:59                               ` Dave Jones
                                                 ` (2 more replies)
  0 siblings, 3 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-19  4:40 UTC (permalink / raw)
  To: Dave Jones, Don Zickus, Thomas Gleixner, Linus Torvalds,
	Linux Kernel, the arch/x86 maintainers

On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <davej@redhat.com> wrote:
>
> NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
> CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
> RIP: 0010:[<ffffffff8a1798b4>]  [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
> Call Trace:
>  [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
>  [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d

Hmm, if we are getting soft-lockups here, maybe it suggests too much exit-work.

Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?

That makes me wonder: does the problem go away if you disable NOHZ?

> CPU: 0 PID: 27716 Comm: kworker/0:1 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
> Workqueue: events nohz_kick_work_fn
> RIP: 0010:[<ffffffff8a0ffb52>]  [<ffffffff8a0ffb52>] smp_call_function_many+0x1b2/0x320
> Call Trace:
>  [<ffffffff8a0fa3f5>] tick_nohz_full_kick_all+0x35/0x70
>  [<ffffffff8a0ec8fe>] nohz_kick_work_fn+0xe/0x10
>  [<ffffffff8a08e61d>] process_one_work+0x1fd/0x590
>  [<ffffffff8a08eacb>] worker_thread+0x11b/0x490
>  [<ffffffff8a0942e9>] kthread+0xf9/0x110
>  [<ffffffff8a7d82ac>] ret_from_fork+0x7c/0xb0

Yeah, there's certainly some NOHZ work going on on CPU0 too.


> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 10/411 32140]
> RIP: 0010:[<ffffffff8a3e14a5>]  [<ffffffff8a3e14a5>] intel_idle+0xd5/0x180
> Call Trace:
>  [<ffffffff8a666075>] cpuidle_enter_state+0x55/0x1c0
>  [<ffffffff8a666297>] cpuidle_enter+0x17/0x20
>  [<ffffffff8a0bb323>] cpu_startup_entry+0x433/0x4e0
>  [<ffffffff8a02b763>] start_secondary+0x1a3/0x220

Nothing.

> CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 10/411 32140]
> RIP: 0010:[<ffffffff8a3e14a5>]  [<ffffffff8a3e14a5>] intel_idle+0xd5/0x180
>  [<ffffffff8a666075>] cpuidle_enter_state+0x55/0x1c0
>  [<ffffffff8a666297>] cpuidle_enter+0x17/0x20
>  [<ffffffff8a0bb323>] cpu_startup_entry+0x433/0x4e0
>  [<ffffffff8a02b763>] start_secondary+0x1a3/0x220

Nothing.

Hmm. NOHZ?

                     Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19  4:40                             ` Linus Torvalds
@ 2014-11-19  4:59                               ` Dave Jones
  2014-11-19  5:15                               ` Dave Jones
  2014-11-19 14:59                               ` Dave Jones
  2 siblings, 0 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-19  4:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers

On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
 > On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
 > > CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
 > > RIP: 0010:[<ffffffff8a1798b4>]  [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
 > > Call Trace:
 > >  [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
 > >  [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
 > 
 > Hmm, if we are getting soft-lockups here, maybe it suggests too much exit-work.
 > 
 > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?

I do.

 > That makes me wonder: does the problem go away if you disable NOHZ?

I'll give it a try, and see what falls out overnight.

	Dave

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19  4:40                             ` Linus Torvalds
  2014-11-19  4:59                               ` Dave Jones
@ 2014-11-19  5:15                               ` Dave Jones
  2014-11-20 14:36                                 ` Frederic Weisbecker
  2014-11-19 14:59                               ` Dave Jones
  2 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-19  5:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Don Zickus, Thomas Gleixner, Linux Kernel, the arch/x86 maintainers

On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:

 > Hmm, if we are getting soft-lockups here, maybe it suggests too much exit-work.
 > 
 > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
 > 
 > That makes me wonder: does the problem go away if you disable NOHZ?

Does nohz=off do enough? I couldn't convince myself after looking at
dmesg, and still seeing dynticks stuff in there.

I'll do a rebuild with all the CONFIG_NO_HZ stuff off, though it also changes
some other config stuff wrt timers.
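
(If I read Kconfig right, that means flipping the timer tick choice from
CONFIG_NO_HZ_FULL/CONFIG_NO_HZ_IDLE over to CONFIG_HZ_PERIODIC.)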

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-18 22:02                           ` Dave Jones
@ 2014-11-19 14:41                             ` Don Zickus
  2014-11-19 15:03                               ` Vivek Goyal
  2014-11-20  9:54                               ` Dave Young
  0 siblings, 2 replies; 486+ messages in thread
From: Don Zickus @ 2014-11-19 14:41 UTC (permalink / raw)
  To: Dave Jones, Thomas Gleixner, Linus Torvalds, Linux Kernel,
	the arch/x86 maintainers, vgoyal

On Tue, Nov 18, 2014 at 05:02:54PM -0500, Dave Jones wrote:
> On Tue, Nov 18, 2014 at 04:55:40PM -0500, Don Zickus wrote:
> 
>  > > So here we mangle CPU3 in and lose the backtrace for cpu0, which might
>  > > be the real interesting one ....
>  > 
>  > Can you provide another dump?  The hope is we get something that isn't mangled.
> 
> Working on it..
> 
>  > The other option we have done in RHEL is panic the system and let kdump
>  > capture the memory.  Then we can analyze the vmcore for the stack trace
>  > cpu0 stored in memory to get a rough idea where it might be if the cpu
>  > isn't responding very well.
> 
> I don't know if it's because of the debug options I typically run with,
> or that I'm perpetually cursed, but I've never managed to get kdump to
> do anything useful. (The last time I tried it was actively harmful in
> that not only did it fail to dump anything, it wedged the machine so
> it didn't reboot after panic).
> 
> Unless there's some magic step missing from the documentation at
> http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
> then I'm not optimistic it'll be useful.

Well, I don't know when the last time you ran it was, but I know the RH
kexec folks started pursuing a Fedora-first package patch rule a couple of
years ago to ensure Fedora has a working kexec/kdump solution.

As for the wedging part, it was a common problem to have the kernel hang
while trying to boot the second kernel (and before console output
happened).  So the problem makes sense and is unfortunate.  I would
encourage you to try again.  :-)

Though, kexec is transitioning to have the loading done inside the kernel
to deal with the whole secure boot thing, so that might be another can of
worms.

I cc'd Vivek and he can let us know how well it works with F21.

Cheers,
Don

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19  4:40                             ` Linus Torvalds
  2014-11-19  4:59                               ` Dave Jones
  2014-11-19  5:15                               ` Dave Jones
@ 2014-11-19 14:59                               ` Dave Jones
  2014-11-19 17:22                                 ` Linus Torvalds
                                                   ` (2 more replies)
  2 siblings, 3 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-19 14:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Don Zickus, Thomas Gleixner, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra

On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
 > On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
 > > CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
 > > RIP: 0010:[<ffffffff8a1798b4>]  [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
 > > Call Trace:
 > >  [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
 > >  [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
 > 
 > Hmm, if we are getting soft-lockups here, maybe it suggests too much exit-work.
 > 
 > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
 > 
 > That makes me wonder: does the problem go away if you disable NOHZ?

Apparently not.

NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
task: ffff8800364e44d0 ti: ffff880192d2c000 task.ti: ffff880192d2c000
RIP: 0010:[<ffffffff94175be7>]  [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120
RSP: 0018:ffff880192d2fee8  EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47
RDX: 0000000000000001 RSI: ffffffff94ac1e84 RDI: ffffffff94a93725
RBP: ffff880192d2fef8 R08: 00007f9b74d0b740 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffff940d8503
R13: ffff880192d2fe98 R14: ffffffff943884e7 R15: ffff880192d2fe48
FS:  00007f9b74d0b740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000336f1b7740 CR3: 0000000229a95000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 ffff880192d30000 0000000000080000 ffff880192d2ff78 ffffffff94012c25
 00007f9b747a5000 00007f9b747a5068 0000000000000000 0000000000000000
 0000000000000000 ffffffff9437b3be 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff94012c25>] syscall_trace_enter_phase1+0x125/0x1a0
 [<ffffffff9437b3be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff947d41bf>] tracesys+0x14/0x4a
Code: 42 fd ff 48 c7 c7 7a 1e ac 94 e8 25 29 21 00 65 8b 04 25 34 f7 1c 00 83 f8 01 74 28 f6 c7 02 74 13 0f 1f 00 e8 bb 43 fd ff 53 9d <5b> 41 5c 5d c3 0f 1f 40 00 53 9d e8 89 42 fd ff eb ee 0f 1f 80 
sending NMI to other CPUs:
NMI backtrace for cpu 1
CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
task: ffff88011600dbc0 ti: ffff8801a99a4000 task.ti: ffff8801a99a4000
RIP: 0010:[<ffffffff940fb71e>]  [<ffffffff940fb71e>] generic_exec_single+0xee/0x1a0
RSP: 0018:ffff8801a99a7d18  EFLAGS: 00000202
RAX: 0000000000000000 RBX: ffff8801a99a7d20 RCX: 0000000000000038
RDX: 00000000000000ff RSI: 0000000000000008 RDI: 0000000000000000
RBP: ffff8801a99a7d78 R08: ffff880242b57ce0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
R13: 0000000000000001 R14: ffff880083c28948 R15: ffffffff94166aa0
FS:  00007f9b74d0b740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000001 CR3: 00000001d8611000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 ffff8801a99a7d28 0000000000000000 ffffffff94166aa0 ffff880083c28948
 0000000000000003 00000000e38f9aac ffff880083c28948 00000000ffffffff
 0000000000000003 ffffffff94166aa0 ffff880083c28948 0000000000000001
Call Trace:
 [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
 [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
 [<ffffffff940fb89a>] smp_call_function_single+0x6a/0xe0
 [<ffffffff940a172b>] ? preempt_count_sub+0x7b/0x100
 [<ffffffff941671aa>] perf_event_read+0xca/0xd0
 [<ffffffff94167240>] perf_event_read_value+0x90/0xe0
 [<ffffffff941689c6>] perf_read+0x226/0x370
 [<ffffffff942fbfb7>] ? security_file_permission+0x87/0xa0
 [<ffffffff941eafff>] vfs_read+0x9f/0x180
 [<ffffffff941ebbd8>] SyS_read+0x58/0xd0
 [<ffffffff947d42c9>] tracesys_phase2+0xd4/0xd9
Code: 48 89 de 48 03 14 c5 20 65 d1 94 48 89 df e8 8a 4b 28 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 f6 43 18 01 <75> f8 31 c0 48 8b 4d c8 65 48 33 0c 25 28 00 00 00 0f 85 8e 00 
NMI backtrace for cpu 0
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 35.055 msecs
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 11/410 27945]
task: ffffffff94c164c0 ti: ffffffff94c00000 task.ti: ffffffff94c00000
RIP: 0010:[<ffffffff943dd415>]  [<ffffffff943dd415>] intel_idle+0xd5/0x180
RSP: 0018:ffffffff94c03e28  EFLAGS: 00000046
RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffff94c03fd8 RDI: 0000000000000000
RBP: ffffffff94c03e58 R08: 000000008baf8b86 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
R13: 0000000000000032 R14: 0000000000000004 R15: ffffffff94c00000
FS:  0000000000000000(0000) GS:ffff880244000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f162e060000 CR3: 0000000014c11000 CR4: 00000000001407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 0000000094c03e58 5955c5b31ad5e8cf ffffe8fffee031a8 0000000000000005
 ffffffff94ca9dc0 0000000000000000 ffffffff94c03ea8 ffffffff94661f05
 00001cb7dcf6fd93 ffffffff94ca9f90 ffffffff94c00000 ffffffff94d18870
Call[31557.908912] NMI backtrace for cpu 2
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 68.178 msecs
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 10/410 27945]
task: ffff880242b596f0 ti: ffff880242b6c000 task.ti: ffff880242b6c000
RIP: 0010:[<ffffffff943dd415>]  [<ffffffff943dd415>] intel_idle+0xd5/0x180
RSP: 0018:ffff880242b6fdf8  EFLAGS: 00000046
RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff880242b6ffd8 RDI: 0000000000000002
RBP: ffff880242b6fe28 R08: 000000008baf8b86 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242b6c000
FS:  0000000000000000(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000014c11000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 0000000242b6fe28 da97aa9b9f42090a ffffe8ffff2031a8 0000000000000005
 ffffffff94ca9dc0 0000000000000002 ffff880242b6fe78 ffffffff94661f05
 00001cb7dcdd1af6 ffffffff94ca9f90 ffff880242b6c000 ffffffff94d18870
Call Trace:
 [<ffffffff94661f05>] cpuidle_enter_state+0x55/0x1c0
 [<ffffffff94662127>] cpuidle_enter+0x17/0x20
 [<ffffffff940b94a3>] cpu_startup_entry+0x423/0x4d0
 [<ffffffff9402b763>] start_secondary+0x1a3/0x220
Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48 
INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 95.994 msecs


<tangent>

Also, today I learned we can reach the perf_event_read code from
read(). Given I had /proc/sys/kernel/perf_event_paranoid set to 1,
I'm not sure how this is even possible. The only user of perf_fops
is the perf_event_open syscall, _after_ it's checked that sysctl.

Oh, there's an ioctl path to perf too. Though trinity
doesn't know anything about it, so I'd find it surprising if it
managed to pull the right combination of entropy to make that
do the right thing.  Still, that ioctl path probably needs
to also be checking that sysctl, shouldn't it?
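
To illustrate what I mean (a rough, untested sketch):

	#include <linux/perf_event.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		struct perf_event_attr attr;
		long long count;
		int fd;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_HARDWARE;
		attr.config = PERF_COUNT_HW_CPU_CYCLES;

		/* perf_event_paranoid is consulted here, at open time... */
		fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

		/* ...but not here: read() on the fd goes straight to
		 * perf_read()/perf_event_read() with no sysctl check. */
		read(fd, &count, sizeof(count));
		return 0;
	}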

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 14:41                             ` Don Zickus
@ 2014-11-19 15:03                               ` Vivek Goyal
  2014-11-19 15:38                                 ` Dave Jones
  2014-11-20  9:54                               ` Dave Young
  1 sibling, 1 reply; 486+ messages in thread
From: Vivek Goyal @ 2014-11-19 15:03 UTC (permalink / raw)
  To: Don Zickus
  Cc: Dave Jones, Thomas Gleixner, Linus Torvalds, Linux Kernel,
	the arch/x86 maintainers

On Wed, Nov 19, 2014 at 09:41:05AM -0500, Don Zickus wrote:
> On Tue, Nov 18, 2014 at 05:02:54PM -0500, Dave Jones wrote:
> > On Tue, Nov 18, 2014 at 04:55:40PM -0500, Don Zickus wrote:
> > 
> >  > > So here we mangle CPU3 in and lose the backtrace for cpu0, which might
> >  > > be the real interesting one ....
> >  > 
> >  > Can you provide another dump?  The hope is we get something that isn't mangled.
> > 
> > Working on it..
> > 
> >  > The other option we have done in RHEL is panic the system and let kdump
> >  > capture the memory.  Then we can analyze the vmcore for the stack trace
> >  > cpu0 stored in memory to get a rough idea where it might be if the cpu
> >  > isn't responding very well.
> > 
> > I don't know if it's because of the debug options I typically run with,
> > or that I'm perpetually cursed, but I've never managed to get kdump to
> > do anything useful. (The last time I tried it was actively harmful in
> > that not only did it fail to dump anything, it wedged the machine so
> > it didn't reboot after panic).

Hi Dave Jones,

Not being able to capture the dump I can understand, but having wedged
the machine so that it does not reboot after a dump failure sounds bad.
So you could not get the machine to boot even after a power cycle? Do you
remember what was failing? I am curious to know what kdump did to make the
machine unbootable.

> > 
> > Unless there's some magic step missing from the documentation at
> > http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
> > then I'm not optimistic it'll be useful.

I had a quick look at it and it basically looks fine. In Fedora it is
ideally just a two-step process:

- Reserve memory using crashkernel. Say crashkernel=160M
- systemctl start kdump
- Crash the system or wait for it to crash.

So despite your bad experience in the past, I would encourage you to
give it a try.
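
Concretely, something like the following (the crashkernel size is just an
example, and the test crash assumes sysrq is enabled):

  grubby --update-kernel=ALL --args="crashkernel=160M"   # then reboot
  systemctl start kdump
  echo c > /proc/sysrq-trigger    # test: vmcore should land under /var/crash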

> 
> Well, I don't know when the last time you ran it was, but I know the RH
> kexec folks started pursuing a Fedora-first package patch rule a couple of
> years ago to ensure Fedora has a working kexec/kdump solution.

Yep, now we are putting everything in Fedora first, so it should be much
better. Hard to say the same thing about driver authors. Sometimes they
might have a driver working in RHEL but not necessarily upstream. I am
not sure if you ran into one of those issues.

Recently I have also seen issues with graphics drivers.

> 
> As for the wedging part, it was a common problem to have the kernel hang
> while trying to boot the second kernel (and before console output
> happened).  So the problem makes sense and is unfortunate.  I would
> encourage you to try again.  :-)
> 
> Though, it is transitioning to have the app built into the kernel to deal
> with the whole secure boot thing, so that might be another can of worms.

I doubt that secureboot bits will contribute to the failure.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 15:03                               ` Vivek Goyal
@ 2014-11-19 15:38                                 ` Dave Jones
  2014-11-19 16:28                                   ` Vivek Goyal
  0 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-19 15:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Don Zickus, Thomas Gleixner, Linus Torvalds, Linux Kernel,
	the arch/x86 maintainers

On Wed, Nov 19, 2014 at 10:03:33AM -0500, Vivek Goyal wrote:

 > Not being able to capture the dump I can understand, but having wedged
 > the machine so that it does not reboot after a dump failure sounds bad.
 > So you could not get the machine to boot even after a power cycle? Do you
 > remember what was failing? I am curious to know what kdump did to make the
 > machine unbootable.

Power cycling was fine, because then it booted into the non-kdump kernel.
The issue was that when I caused that kernel to panic, it would just sit
there wedged, with no indication it had even tried to switch to the kdump
kernel.

 > > > Unless there's some magic step missing from the documentation at
 > > > http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
 > > > then I'm not optimistic it'll be useful.
 > 
 > I had a quick look at it and it basically looks fine. In fedora ideally
 > it is just two steps process.
 > 
 > - Reserve memory using crashkernel. Say crashkernel=160M
 > - systemctl start kdump
 > - Crash the system or wait for it to crash.
 > 
 > So despite your bad experience in the past, I would encourage you to
 > give it a try.

'the past' here is two weeks ago, on Fedora 21.

But, since then, I've reinstalled that box with Fedora 20 because I didn't
trust gcc 4.9, and on f20 things are actually even worse.

Right now it doesn't even create the image correctly:

dracut: *** Stripping files done ***
dracut: *** Store current command line parameters ***
dracut: *** Creating image file ***
dracut: *** Creating image file done ***
kdumpctl: cat: write error: Broken pipe
kdumpctl: kexec: failed to load kdump kernel
kdumpctl: Starting kdump: [FAILED]

It works if I run a Fedora kernel, but not with a self-built one.
And there's zero information as to what I'm doing wrong.

I saw something similar on F21, got past it somehow a few weeks ago,
but I can't remember what I had to do. Unfortunately that was still
fruitless as it didn't actually dump anything, leading to my frustration
with the state of kdump.

I'll try again when I put F21 back on that machine, but I'm
not particularly optimistic tbh.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 15:38                                 ` Dave Jones
@ 2014-11-19 16:28                                   ` Vivek Goyal
  2014-11-20 16:10                                     ` Dave Jones
  0 siblings, 1 reply; 486+ messages in thread
From: Vivek Goyal @ 2014-11-19 16:28 UTC (permalink / raw)
  To: Dave Jones, Don Zickus, Thomas Gleixner, Linus Torvalds,
	Linux Kernel, the arch/x86 maintainers
  Cc: WANG Chao, Baoquan He, Dave Young

On Wed, Nov 19, 2014 at 10:38:52AM -0500, Dave Jones wrote:
> On Wed, Nov 19, 2014 at 10:03:33AM -0500, Vivek Goyal wrote:
> 
>  > Not being able to capture the dump I can understand, but having wedged
>  > the machine so that it does not reboot after a dump failure sounds bad.
>  > So you could not get the machine to boot even after a power cycle? Do you
>  > remember what was failing? I am curious to know what kdump did to make the
>  > machine unbootable.
> 
> Power cycling was fine, because then it booted into the non-kdump kernel.
> The issue was that when I caused that kernel to panic, it would just sit
> there wedged, with no indication it had even tried to switch to the kdump
> kernel.

I have seen cases where we fail to boot into the second kernel, and the
failure can often happen very early, without any information on the
graphics console. I always have to hook up a serial console to get an idea
of what went wrong that early. It is not an ideal situation, but at the
same time I don't know how to improve it.

I am wondering whether in some cases we panic in the second kernel and
just sit there. Probably we should automatically append a kernel command
line option, say "panic=1", so that the second kernel reboots itself if it
panics.

By any chance, have you enabled "CONFIG_RANDOMIZE_BASE"? If yes, please
disable it, as the kexec/kdump code currently does not work with it. It
hangs very early in the boot process, and I had to hook up a serial
console to get the following message:

arch/x86/boot/compressed/misc.c
error("32-bit relocation outside of kernel!\n");

I noticed that error() halts in a while loop after the error message.
Maybe there could be some way for it to reboot instead of halting in a
while loop.
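
A quick way to check in your build tree:

  grep RANDOMIZE_BASE .config

If it is enabled you will see CONFIG_RANDOMIZE_BASE=y instead of
"# CONFIG_RANDOMIZE_BASE is not set".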

> 
>  > > > Unless there's some magic step missing from the documentation at
>  > > > http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
>  > > > then I'm not optimistic it'll be useful.
>  > 
>  > I had a quick look at it and it basically looks fine. In fedora ideally
>  > it is just two steps process.
>  > 
>  > - Reserve memory using crashkernel. Say crashkernel=160M
>  > - systemctl start kdump
>  > - Crash the system or wait for it to crash.
>  > 
>  > So despite your bad experience in the past, I would encourage you to
>  > give it a try.
> 
> 'the past' here is two weeks ago, on Fedora 21.
> 
> But, since then, I've reinstalled that box with Fedora 20 because I didn't
> trust gcc 4.9, and on f20 things are actually even worse.
> 
> Right now it doesn't even create the image correctly:
> 
> dracut: *** Stripping files done ***
> dracut: *** Store current command line parameters ***
> dracut: *** Creating image file ***
> dracut: *** Creating image file done ***
> kdumpctl: cat: write error: Broken pipe
> kdumpctl: kexec: failed to load kdump kernel
> kdumpctl: Starting kdump: [FAILED]

Hmmm... can you please enable debugging in kdumpctl using "set -x", do
"touch /etc/kdump.conf; kdumpctl restart", and send the debug output to me?

> 
> It works if I run a Fedora kernel, but not with a self-built one.
> And there's zero information as to what I'm doing wrong.

I just tested F20 kdump on my box and it worked fine for me.

So for you the second kernel hangs and there is no info on the console?
Is there any possibility to hook up a serial console, enable early printk,
and see if something shows up there?

Apart from this, if you run into kdump issues in Fedora, please cc the
Fedora kexec mailing list too so that we are aware of it.

https://lists.fedoraproject.org/mailman/listinfo/kexec

Thanks
Vivek

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 14:59                               ` Dave Jones
@ 2014-11-19 17:22                                 ` Linus Torvalds
  2014-11-19 17:40                                   ` Linus Torvalds
  2014-11-19 19:15                                   ` Andy Lutomirski
  2014-11-19 21:01                                 ` Andy Lutomirski
  2014-11-20 15:04                                 ` Frederic Weisbecker
  2 siblings, 2 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-19 17:22 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Frédéric Weisbecker, Andy Lutomirski,
	Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 6:59 AM, Dave Jones <davej@redhat.com> wrote:
> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
>  >
>  > That makes me wonder: does the problem go away if you disable NOHZ?
>
> Apparently not.
>
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
> CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> RIP: 0010:[<ffffffff94175be7>]  [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120
> Call Trace:
>  [<ffffffff94012c25>] syscall_trace_enter_phase1+0x125/0x1a0
>  [<ffffffff9437b3be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>  [<ffffffff947d41bf>] tracesys+0x14/0x4a

Ok, that's just crazy. This is the system call *entry* portion.

Last time it was the system call exit side, which at least made some
sense, because the "return to user space" thing actually has loops in
it to handle all events before we return to user space.

But the whole 'tracesys' part is only entered once, at the very
beginning of a system call. There's no loop over the work. That whole
call trace implies that the lockup happened just after we entered the
system call path from _user_ space.

And in fact, exactly like last time, the code line implies that the
timer interrupt happened on the return from the instruction, and
indeed in both cases the code looked like this (the registers
differed, but the "restore flags, start popping saved regs" was the
exact same):

  26: 53                   push   %rbx
  27: 9d                   popfq
  28:* 5b                   pop    %rbx <-- trapping instruction
  29: 41 5c                 pop    %r12

in both cases, the timer interrupt happened right after the "popfq",
but in both cases the value in the register that was used to restore
eflags was invalid. Here %rbx was 0x0000000100000046 (which is a valid
eflags value, but not the one we've actually restored!), and in your
previous oops (where it was %r12) it was completely invalid.

So it hasn't actually done the "push %rbx; popfq" part - there must be
a label at the return part, and context_tracking_user_exit() never
actually did the local_irq_save/restore at all. Which means that it
took one of the early exits instead:

        if (!context_tracking_is_enabled())
                return;

        if (in_interrupt())
                return;

So not only does this happen at early system call entry time, the
function that is claimed to lock up doesn't actually *do* anything.

Ho humm..

Oh, and to make matters worse, the only way this call chain can happen
is this in syscall_trace_enter_phase1():

        if (work & _TIF_NOHZ) {
                user_exit();
                work &= ~TIF_NOHZ;
        }

so there's still some NOHZ confusion there. It looks like TIF_NOHZ
gets set regardless of whether NOHZ is enabled or not..

I'm adding Frederic explicitly to the cc too, because this is just
fishy.  I am starting to blame context tracking, because it has now
shown up twice in different guises, and TIF_NOHZ seems to be
implicated.

> CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> RIP: 0010:[<ffffffff940fb71e>]  [<ffffffff940fb71e>] generic_exec_single+0xee/0x1a0
> Call Trace:
>  [<ffffffff940fb89a>] smp_call_function_single+0x6a/0xe0
>  [<ffffffff941671aa>] perf_event_read+0xca/0xd0
>  [<ffffffff94167240>] perf_event_read_value+0x90/0xe0
>  [<ffffffff941689c6>] perf_read+0x226/0x370
>  [<ffffffff941eafff>] vfs_read+0x9f/0x180

Hmm.. We've certainly seen a lot of smp_call, for various different
reasons in your traces..

I'm wondering if the smp-call ended up corrupting something on CPU3.
Because even _with_ TIF_NOHZ confusion, I don't see how system call
*entry* could cause a watchdog event. There are no loops, there are no
locks I see, there is just *nothing* there I can see.

Let's add Andy L to the cc too, in case he hasn't seen this.  He's
been changing the lowlevel asm code, including very much this whole
"syscall_trace_enter_phase1" thing. Maybe he sees something I don't.

Andy?

> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 11/410 27945]
> RIP: 0010:[<ffffffff943dd415>]  [<ffffffff943dd415>] intel_idle+0xd5/0x180
> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 10/410 27945]
> RIP: 0010:[<ffffffff943dd415>]  [<ffffffff943dd415>] intel_idle+0xd5/0x180

Nothing there.

> Also, today I learned we can reach the perf_event_read code from
> read(). Given I had /proc/sys/kernel/perf_event_paranoid set to 1,
> I'm not sure how this is even possible. The only user of perf_fops
> is the perf_event_open syscall, _after_ it's checked that sysctl.
>
> Oh, there's an ioctl path to perf too. Though trinity
> doesn't know anything about it, so I'd find it surprising if it
> managed to pull the right combination of entropy to make that
> do the right thing.  Still, that ioctl path probably needs
> to also be checking that sysctl, shouldn't it?

Hmm. Perf people are already mostly on the list. Peter/Ingo/Arnaldo?

                      Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 17:22                                 ` Linus Torvalds
@ 2014-11-19 17:40                                   ` Linus Torvalds
  2014-11-19 19:02                                     ` Frederic Weisbecker
  2014-11-19 19:15                                   ` Andy Lutomirski
  1 sibling, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-19 17:40 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Frédéric Weisbecker, Andy Lutomirski,
	Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 9:22 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So it hasn't actually done the "push %rbx; popfq" part - there must be
> a label at the return part, and context_tracking_user_exit() never
> actually did the local_irq_save/restore at all. Which means that it
> took one of the early exits instead:
>
>         if (!context_tracking_is_enabled())
>                 return;
>
>         if (in_interrupt())
>                 return;

Ho humm. Interesting. Neither of those should possibly have happened.

We "know" that "context_tracking_is_enabled()" must be true, because
the only way we get to context_tracking_user_exit() in the first place
is through "user_exit()", which does:

        if (context_tracking_is_enabled())
                context_tracking_user_exit();

and we know we shouldn't be in_interrupt(), because the backtrace is
the system call entry path, for chrissake!

So we definitely have some corruption going on. A few possibilities:

 - either the register contents are corrupted (%rbx in your dump said
"0x0000000100000046", but the eflags we restored was 0x246)

 - in_interrupt() is wrong, and we've had some irq_count() corruption.
I'd expect that to result in "scheduling while atomic" messages,
though, especially if it goes on long enough that you get a watchdog
event..

 - there is something rotten in the land of
context_tracking_is_enabled(), which uses a static key.

 - I have misread the whole trace, and am a moron. But your earlier
report really had some very similar things, just in
context_tracking_user_enter() instead of exit.

In your previous oops, the register that was allegedly used to
restore %eflags was %r12:

  28: 41 54                 push   %r12
  2a: 9d                   popfq
  2b:* 5b                   pop    %rbx <-- trapping instruction
  2c: 41 5c                 pop    %r12
  2e: 5d                   pop    %rbp
  2f: c3                   retq

but:

  R12: ffff880101ee3ec0
  EFLAGS: 00000282

so again, it looks like we never actually did that "popfq"
instruction, and it would have exited through the (same) early exits.

But what an odd coincidence that it ended up in both of your reports
being *exactly* at that instruction after the "popf". If it had
actually *taken* the popf, I'd not be so surprised ("ok, popf enabled
interrupts, and there was an interrupt pending"), but since everything
seems to say that it came there through some control flow that did
*not* go through the popf, that's just a very odd coincidence.

And both context_tracking_user_enter() and exit() have that exact same
issue with the early returns. They shouldn't have happened in the
first place.

                      Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 17:40                                   ` Linus Torvalds
@ 2014-11-19 19:02                                     ` Frederic Weisbecker
  2014-11-19 19:03                                       ` Andy Lutomirski
  2014-11-19 21:56                                       ` Thomas Gleixner
  0 siblings, 2 replies; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-19 19:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Don Zickus, Thomas Gleixner, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra, Andy Lutomirski,
	Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 09:40:26AM -0800, Linus Torvalds wrote:
> On Wed, Nov 19, 2014 at 9:22 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So it hasn't actually done the "push %rbx; popfq" part - there must be
> > a label at the return part, and context_tracking_user_exit() never
> > actually did the local_irq_save/restore at all. Which means that it
> > took one of the early exits instead:
> >
> >         if (!context_tracking_is_enabled())
> >                 return;
> >
> >         if (in_interrupt())
> >                 return;
> 
> Ho humm. Interesting. Neither of those should possibly have happened.
> 
> We "know" that "context_tracking_is_enabled()" must be true, because
> the only way we get to context_tracking_user_exit() in the first place
> is through "user_exit()", which does:
> 
>         if (context_tracking_is_enabled())
>                 context_tracking_user_exit();
> 
> and we know we shouldn't be in_interrupt(), because the backtrace is
> the system call entry path, for chrissake!
> 
> So we definitely have some corruption going on. A few possibilities:
> 
>  - either the register contents are corrupted (%rbx in your dump said
> "0x0000000100000046", but the eflags we restored was 0x246)
> 
>  - in_interrupt() is wrong, and we've had some irq_count() corruption.
> I'd expect that to result in "scheduling while atomic" messages,
> though, especially if it goes on long enough that you get a watchdog
> event..
> 
>  - there is something rotten in the land of
> context_tracking_is_enabled(), which uses a static key.
> 
>  - I have misread the whole trace, and am a moron. But your earlier
> report really had some very similar things, just in
> context_tracking_user_enter() instead of exit.
> 
> In your previous oops, the register that was allegedly used to
> restore %eflags was %r12:
> 
>   28: 41 54                 push   %r12
>   2a: 9d                   popfq
>   2b:* 5b                   pop    %rbx <-- trapping instruction
>   2c: 41 5c                 pop    %r12
>   2e: 5d                   pop    %rbp
>   2f: c3                   retq
> 
> but:
> 
>   R12: ffff880101ee3ec0
>   EFLAGS: 00000282
> 
> so again, it looks like we never actually did that "popfq"
> instruction, and it would have exited through the (same) early exits.
> 
> But what an odd coincidence that it ended up in both of your reports
> being *exactly* at that instruction after the "popf". If it had
> actually *taken* the popf, I'd not be so surprised ("ok, popf enabled
> interrupts, and there was an interrupt pending"), but since everything
> seems to say that it came there through some control flow that did
> *not* go through the popf, that's just a very odd coincidence.
> 
> And both context_tracking_user_enter() and exit() have that exact same
> issue with the early returns. They shouldn't have happened in the
> first place.

I got a report lately involving context tracking. Not sure if it's
the same here, but the issue was that context tracking uses per-cpu data,
per-cpu allocation uses vmalloc, and vmalloc'ed areas can fault due to
lazy paging.

With that in mind, just about anything can happen. Some parts of the
context tracking code really aren't fault-safe (or more generally
exception-safe). That's because context tracking itself tracks exceptions.

So for example, if we enter a syscall, we go to context_tracking_user_exit()
and then vtime_user_enter(), which _takes a lock_ with write_seqlock().

If an exception occurs before we unlock the seqlock (it's possible, for
example, because account_user_time() -> task_group_account_field() ->
cpuacct_account_field() accesses a dynamically allocated per-cpu area,
which can fault), then the fault calls exception_enter() and then
user_exit(), which does all the same again and deadlocks.
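
Roughly, the scenario I have in mind looks like this (a sketch of the call
chain, not exact code):

  syscall entry
    context_tracking_user_exit()
      vtime_user_enter()
        write_seqlock()                      <- seqlock taken
        account_user_time()
          task_group_account_field()
            cpuacct_account_field()          <- faults on vmalloc'ed per-cpu data
              exception_enter()
                user_exit()
                  context_tracking_user_exit()
                    vtime_user_enter()
                      write_seqlock()        <- same seqlock, deadlock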

I can certainly fix that with a bit of recursion protection.

Now we just need to determine if the current case has the same cause.

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 19:02                                     ` Frederic Weisbecker
@ 2014-11-19 19:03                                       ` Andy Lutomirski
  2014-11-19 23:00                                         ` Frederic Weisbecker
  2014-11-19 21:56                                       ` Thomas Gleixner
  1 sibling, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-19 19:03 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, Dave Jones, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 11:02 AM, Frederic Weisbecker
<fweisbec@gmail.com> wrote:
> On Wed, Nov 19, 2014 at 09:40:26AM -0800, Linus Torvalds wrote:
>> On Wed, Nov 19, 2014 at 9:22 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> >
>> > So it hasn't actually done the "push %rbx; popfq" part - there must be
>> > a label at the return part, and context_tracking_user_exit() never
>> > actually did the local_irq_save/restore at all. Which means that it
>> > took one of the early exits instead:
>> >
>> >         if (!context_tracking_is_enabled())
>> >                 return;
>> >
>> >         if (in_interrupt())
>> >                 return;
>>
>> Ho humm. Interesting. Neither of those should possibly have happened.
>>
>> We "know" that "context_tracking_is_enabled()" must be true, because
>> the only way we get to context_tracking_user_exit() in the first place
>> is through "user_exit()", which does:
>>
>>         if (context_tracking_is_enabled())
>>                 context_tracking_user_exit();
>>
>> and we know we shouldn't be in_interrupt(), because the backtrace is
>> the system call entry path, for chrissake!
>>
>> So we definitely have some corruption going on. A few possibilities:
>>
>>  - either the register contents are corrupted (%rbx in your dump said
>> "0x0000000100000046", but the eflags we restored was 0x246)
>>
>>  - in_interrupt() is wrong, and we've had some irq_count() corruption.
>> I'd expect that to result in "scheduling while atomic" messages,
>> though, especially if it goes on long enough that you get a watchdog
>> event..
>>
>>  - there is something rotten in the land of
>> context_tracking_is_enabled(), which uses a static key.
>>
>>  - I have misread the whole trace, and am a moron. But your earlier
>> report really had some very similar things, just in
>> context_tracking_user_enter() instead of exit.
>>
>> In your previous oops, the register that was allegedly used to
>> restore %eflags was %r12:
>>
>>   28: 41 54                 push   %r12
>>   2a: 9d                   popfq
>>   2b:* 5b                   pop    %rbx <-- trapping instruction
>>   2c: 41 5c                 pop    %r12
>>   2e: 5d                   pop    %rbp
>>   2f: c3                   retq
>>
>> but:
>>
>>   R12: ffff880101ee3ec0
>>   EFLAGS: 00000282
>>
>> so again, it looks like we never actually did that "popfq"
>> instruction, and it would have exited through the (same) early exits.
>>
>> But what an odd coincidence that it ended up in both of your reports
>> being *exactly* at that instruction after the "popf". If it had
>> actually *taken* the popf, I'd not be so surprised ("ok, popf enabled
>> interrupts, and there was an interrupt pending"), but since everything
>> seems to say that it came there through some control flow that did
>> *not* go through the popf, that's just a very odd coincidence.
>>
>> And both context_tracking_user_enter() and exit() have that exact same
>> issue with the early returns. They shouldn't have happened in the
>> first place.
>
> I got a report lately involving context tracking. Not sure if it's
> the same here, but the issue was that context tracking uses per-cpu data,
> per-cpu allocation uses vmalloc, and vmalloc'ed areas can fault due to
> lazy paging.

Wait, what?  If something like kernel_stack ends up with an unmapped pmd,
we are well and truly screwed.

--Andy

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 17:22                                 ` Linus Torvalds
  2014-11-19 17:40                                   ` Linus Torvalds
@ 2014-11-19 19:15                                   ` Andy Lutomirski
  2014-11-19 19:38                                     ` Linus Torvalds
  1 sibling, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-19 19:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Don Zickus, Thomas Gleixner, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 9:22 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 6:59 AM, Dave Jones <davej@redhat.com> wrote:
>> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
>>  >
>>  > That makes me wonder: does the problem go away if you disable NOHZ?
>>
>> Apparently not.
>>
>> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
>> CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
>> RIP: 0010:[<ffffffff94175be7>]  [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120
>> Call Trace:
>>  [<ffffffff94012c25>] syscall_trace_enter_phase1+0x125/0x1a0
>>  [<ffffffff9437b3be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>>  [<ffffffff947d41bf>] tracesys+0x14/0x4a
>
> Ok, that's just crazy. This is the system call *entry* portion.
>
> Last time it was the system call exit side, which at least made some
> sense, because the "return to user space" thing actually has loops in
> it to handle all events before we return to user space.
>
> But the whole 'tracesys' part is only entered once, at the very
> beginning of a system call. There's no loop over the work. That whole
> call trace implies that the lockup happened just after we entered the
> system call path from _user_ space.

I suspect that the regression was triggered by the seccomp pull, since
that reworked a lot of this code.

>
> And in fact, exactly like last time, the code line implies that the
> timer interrupt happened on the return from the instruction, and
> indeed in both cases the code looked like this (the registers
> differed, but the "restore flags, start popping saved regs" was the
> exact same):
>
>   26: 53                   push   %rbx
>   27: 9d                   popfq
>   28:* 5b                   pop    %rbx <-- trapping instruction
>   29: 41 5c                 pop    %r12

Just to make sure I understand: it says "NMI watchdog", but this trace
is from a timer interrupt, not NMI, right?

>
> in both cases, the timer interrupt happened right after the "popfq",
> but in both cases the value in the register that was used to restore
> eflags was invalid. Here %rbx was 0x0000000100000046 (which is a valid
> eflags value, but not the one we've actually restored!), and in your
> previous oops (where it was %r12) it was completely invalid.
>
> So it hasn't actually done the "push %rbx; popfq" part - there must be
> a label at the return part, and context_tracking_user_exit() never
> actually did the local_irq_save/restore at all. Which means that it
> took one of the early exits instead:
>
>         if (!context_tracking_is_enabled())
>                 return;
>
>         if (in_interrupt())
>                 return;
>
> So not only does this happen at early system call entry time, the
> function that is claimed to lock up doesn't actually *do* anything.
>
> Ho humm..
>
> Oh, and to make matters worse, the only way this call chain can happen
> is this in syscall_trace_enter_phase1():
>
>         if (work & _TIF_NOHZ) {
>                 user_exit();
>                 work &= ~TIF_NOHZ;
>         }
>
> so there's still some NOHZ confusion there. It looks like TIF_NOHZ
> gets set regardless of whether NOHZ is enabled or not..
>
> I'm adding Frederic explicitly to the cc too, because this is just
> fishy.  I am starting to blame context tracking, because it has now
> shown up twice in different guises, and TIF_NOHZ seems to be
> implicated.

Is it possible that we've managed to return to userspace with
interrupts off somehow?  A loop in userspace that somehow has
interrupts off can cause all kinds of fun lockups.

I don't understand the logic of what enables TIF_NOHZ.  That being
said, in the new 3.18 code, if TIF_NOHZ is set, we use part of the fast
path instead of the full syscall slow path, which means that we
meander differently through the asm than we used to: we do
syscall_trace_enter_phase1, then a fast-path syscall, then we get to
sysret_careful, which does this:

    /*
     * We have a signal, or exit tracing or single-step.
     * These all wind up with the iret return path anyway,
     * so just join that path right now.
     */
    FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
    jmp int_check_syscall_exit_work


In 3.17, I don't think that code would run with context tracking on,
although I don't immediately see any bugs here.
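
For reference, the phase1 logic in question looks roughly like this
(simplified from 3.18's arch/x86/kernel/ptrace.c, from memory, so
details are approximate):

        unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
        {
                unsigned long work = ACCESS_ONCE(current_thread_info()->flags) &
                                     _TIF_WORK_SYSCALL_ENTRY;

                if (work & _TIF_NOHZ) {
                        user_exit();            /* the context tracking hook */
                        work &= ~TIF_NOHZ;
                }

                /* ... seccomp, SYSCALL_EMU etc. ... */

                return work;    /* nonzero => fall into the phase2 slow path */
        }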

>
>> CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
>> RIP: 0010:[<ffffffff940fb71e>]  [<ffffffff940fb71e>] generic_exec_single+0xee/0x1a0
>> Call Trace:
>>  [<ffffffff940fb89a>] smp_call_function_single+0x6a/0xe0
>>  [<ffffffff941671aa>] perf_event_read+0xca/0xd0
>>  [<ffffffff94167240>] perf_event_read_value+0x90/0xe0
>>  [<ffffffff941689c6>] perf_read+0x226/0x370
>>  [<ffffffff941eafff>] vfs_read+0x9f/0x180
>
> Hmm.. We've certainly seen a lot of smp_call, for various different
> reasons in your traces..
>
> I'm wondering if the smp-call ended up corrupting something on CPU3.
> Because even _with_ TIF_NOHZ confusion, I don't see how system call
> *entry* could cause a watchdog event. There are no loops, there are no
> locks I see, there is just *nothing* there I can see.
>

If we ever landed in userspace with interrupts off, this could happen
quite easily.  It should be straightforward to add an assertion for
that, in trinity or in the kernel.
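
Something like this (hypothetical and untested) would do for the
userspace side -- pushfq in ring 3 shows the real IF, so trinity could
assert it after every syscall:

        #include <assert.h>
        #include <stdint.h>

        static inline void assert_irqs_enabled(void)
        {
                uint64_t flags;

                /* EFLAGS.IF is bit 9; it had better be set in userspace */
                asm volatile ("pushfq ; popq %0" : "=r" (flags));
                assert(flags & (1UL << 9));
        }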

--Andy

> Let's add Andy L to the cc too, in case he hasn't seen this.  He's
> been changing the lowlevel asm code, including very much this whole
> "syscall_trace_enter_phase1" thing. Maybe he sees something I don't.
>
> Andy?
>
>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 11/410 27945]
>> RIP: 0010:[<ffffffff943dd415>]  [<ffffffff943dd415>] intel_idle+0xd5/0x180
>> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 10/410 27945]
>> RIP: 0010:[<ffffffff943dd415>]  [<ffffffff943dd415>] intel_idle+0xd5/0x180
>
> Nothing there.
>
>> Also, today I learned we can reach the perf_event_read code from
>> read(). Given I had /proc/sys/kernel/perf_event_paranoid set to 1,
>> I'm not sure how this is even possible. The only user of perf_fops
>> is perf_event_open syscall _after_ it's checked that sysctl.
>>
>> Oh, there's an ioctl path to perf too. Though trinity
>> doesn't know anything about it, so I find it surprising if it
>> managed to pull the right combination of entropy to make that
>> do the right thing.  Still, that ioctl path probably needs
>> to also be checking that sysctl shouldn't it ?
>
> Hmm. Perf people are already mostly on the list. Peter/Ingo/Arnaldo?
>
>                       Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 19:15                                   ` Andy Lutomirski
@ 2014-11-19 19:38                                     ` Linus Torvalds
  2014-11-19 22:18                                       ` Dave Jones
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-19 19:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Jones, Don Zickus, Thomas Gleixner, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 11:15 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> I suspect that the regression was triggered by the seccomp pull, since
> that reworked a lot of this code.

Note that it turns out that Dave can apparently see the same problems
with 3.17, so it's not actually a regression. So it may have been
going on for a while.


> Just to make sure I understand: it says "NMI watchdog", but this trace
> is from a timer interrupt, not NMI, right?

Yeah. The kernel/watchdog.c code always says "NMI watchdog", but it's
actually just a regular timer function: watchdog_timer_fn(), started
with hrtimer_start().
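
Roughly (from memory, so the details may be off), the per-cpu setup in
kernel/watchdog.c is just:

        static void watchdog_enable(unsigned int cpu)
        {
                struct hrtimer *hrtimer = this_cpu_ptr(&watchdog_hrtimer);

                /* a plain hrtimer, despite the "NMI watchdog" wording */
                hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
                hrtimer->function = watchdog_timer_fn;
                hrtimer_start(hrtimer, ns_to_ktime(sample_period),
                              HRTIMER_MODE_REL_PINNED);
                /* ... */
        }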

> Is it possible that we've managed to return to userspace with
> interrupts off somehow?  A loop in userspace that somehow has
> interrupts off can cause all kinds of fun lockups.

That sounds unlikely, but it's conceivable if there is some stack
corruption going on.

However, it wouldn't even explain things, because even if interrupts
had been disabled in user space, and even if that popf got executed,
this wouldn't be where they got enabled. That would be the :"sti" in
the system call entry path (hidden behind the ENABLE_INTERRUPTS
macro).

Of course, maybe Dave has paravirtualization enabled (what a crock
_that_ is), and there is something wrong with that whole code.

> I don't understand the logic of what enables TIF_NOHZ.

Yeah, that makes two of us.  But..

> In 3.17, I don't think that code would run with context tracking on,
> although I don't immediately see any bugs here.

See above: the problem apparently isn't new. Although it is possible
that we have two different issues going on..

                      Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 14:59                               ` Dave Jones
  2014-11-19 17:22                                 ` Linus Torvalds
@ 2014-11-19 21:01                                 ` Andy Lutomirski
  2014-11-19 21:47                                   ` Dave Jones
                                                     ` (2 more replies)
  2014-11-20 15:04                                 ` Frederic Weisbecker
  2 siblings, 3 replies; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-19 21:01 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra

On 11/19/2014 06:59 AM, Dave Jones wrote:
> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
>  > On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <davej@redhat.com> wrote:
>  > >
>  > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
>  > > CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
>  > > RIP: 0010:[<ffffffff8a1798b4>]  [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
>  > > Call Trace:
>  > >  [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
>  > >  [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
>  > 
>  > Hmm, if we are getting soft-lockups here, maybe it suggest too much exit-work.
>  > 
>  > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
>  > 
>  > That makes me wonder: does the problem go away if you disable NOHZ?
> 
> Apparently not.

TIF_NOHZ is not the same thing as NOHZ.  Can you try a kernel with
CONFIG_CONTEXT_TRACKING=n?  Doing that may involve fiddling with RCU
settings a bit.  The normal no HZ idle stuff has nothing to do with
TIF_NOHZ, and you either have TIF_NOHZ set or you have some kind of
thread_info corruption going on here.

Hmm.  This isn't a stack overflow, is it?  That could cause all of these
problems quite easily, although I'd expect other symptoms, too.

> 
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
> CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff8800364e44d0 ti: ffff880192d2c000 task.ti: ffff880192d2c000
> RIP: 0010:[<ffffffff94175be7>]  [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120

This RIP should be impossible if context tracking is off.

> RSP: 0018:ffff880192d2fee8  EFLAGS: 00000246
> RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47

                                    ^^^^^^^^^

That is a strange coincidence.  Where did 0x46 | (1<<32) come from?
That's a sensible interrupts-disabled flags value with the high part set
to 0x1.  Those high bits are undefined, but they ought to all be zero.

> RDX: 0000000000000001 RSI: ffffffff94ac1e84 RDI: ffffffff94a93725
> RBP: ffff880192d2fef8 R08: 00007f9b74d0b740 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: ffffffff940d8503
> R13: ffff880192d2fe98 R14: ffffffff943884e7 R15: ffff880192d2fe48
> FS:  00007f9b74d0b740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000336f1b7740 CR3: 0000000229a95000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
>  ffff880192d30000 0000000000080000 ffff880192d2ff78 ffffffff94012c25
>  00007f9b747a5000 00007f9b747a5068 0000000000000000 0000000000000000
>  0000000000000000 ffffffff9437b3be 0000000000000000 0000000000000000
> Call Trace:
>  [<ffffffff94012c25>] syscall_trace_enter_phase1+0x125/0x1a0
>  [<ffffffff9437b3be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>  [<ffffffff947d41bf>] tracesys+0x14/0x4a
> Code: 42 fd ff 48 c7 c7 7a 1e ac 94 e8 25 29 21 00 65 8b 04 25 34 f7 1c 00 83 f8 01 74 28 f6 c7 02 74 13 0f 1f 00 e8 bb 43 fd ff 53 9d <5b> 41 5c 5d c3 0f 1f 40 00 53 9d e8 89 42 fd ff eb ee 0f 1f 80 
> sending NMI to other CPUs:
> NMI backtrace for cpu 1
> CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff88011600dbc0 ti: ffff8801a99a4000 task.ti: ffff8801a99a4000
> RIP: 0010:[<ffffffff940fb71e>]  [<ffffffff940fb71e>] generic_exec_single+0xee/0x1a0
> RSP: 0018:ffff8801a99a7d18  EFLAGS: 00000202
> RAX: 0000000000000000 RBX: ffff8801a99a7d20 RCX: 0000000000000038
> RDX: 00000000000000ff RSI: 0000000000000008 RDI: 0000000000000000
> RBP: ffff8801a99a7d78 R08: ffff880242b57ce0 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
> R13: 0000000000000001 R14: ffff880083c28948 R15: ffffffff94166aa0
> FS:  00007f9b74d0b740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000001 CR3: 00000001d8611000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
>  ffff8801a99a7d28 0000000000000000 ffffffff94166aa0 ffff880083c28948
>  0000000000000003 00000000e38f9aac ffff880083c28948 00000000ffffffff
>  0000000000000003 ffffffff94166aa0 ffff880083c28948 0000000000000001
> Call Trace:
>  [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
>  [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
>  [<ffffffff940fb89a>] smp_call_function_single+0x6a/0xe0
>  [<ffffffff940a172b>] ? preempt_count_sub+0x7b/0x100
>  [<ffffffff941671aa>] perf_event_read+0xca/0xd0
>  [<ffffffff94167240>] perf_event_read_value+0x90/0xe0
>  [<ffffffff941689c6>] perf_read+0x226/0x370
>  [<ffffffff942fbfb7>] ? security_file_permission+0x87/0xa0
>  [<ffffffff941eafff>] vfs_read+0x9f/0x180
>  [<ffffffff941ebbd8>] SyS_read+0x58/0xd0
>  [<ffffffff947d42c9>] tracesys_phase2+0xd4/0xd9

Riddle me this: what are we doing in tracesys_phase2?  This is a full
slow-path syscall.  TIF_NOHZ doesn't cause that, I think.  I'd love to
see the value of ti->flags here.  Is trinity using ptrace?
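
Something like this (hypothetical, untested) dropped into
syscall_trace_enter_phase1() would show it:

        /* debug hack: dump the entry-time thread_info flags */
        printk_ratelimited(KERN_WARNING "phase1: ti->flags = %08x\n",
                           ACCESS_ONCE(current_thread_info()->flags));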

Um.  There's a bug.  Patch coming after lunch.  No clue whether it will
help here.

--Andy

> Code: 48 89 de 48 03 14 c5 20 65 d1 94 48 89 df e8 8a 4b 28 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 f6 43 18 01 <75> f8 31 c0 48 8b 4d c8 65 48 33 0c 25 28 00 00 00 0f 85 8e 00 
> NMI backtrace for cpu 0
> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 35.055 msecs
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 11/410 27945]
> task: ffffffff94c164c0 ti: ffffffff94c00000 task.ti: ffffffff94c00000
> RIP: 0010:[<ffffffff943dd415>]  [<ffffffff943dd415>] intel_idle+0xd5/0x180
> RSP: 0018:ffffffff94c03e28  EFLAGS: 00000046
> RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
> RDX: 0000000000000000 RSI: ffffffff94c03fd8 RDI: 0000000000000000
> RBP: ffffffff94c03e58 R08: 000000008baf8b86 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
> R13: 0000000000000032 R14: 0000000000000004 R15: ffffffff94c00000
> FS:  0000000000000000(0000) GS:ffff880244000000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f162e060000 CR3: 0000000014c11000 CR4: 00000000001407f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
>  0000000094c03e58 5955c5b31ad5e8cf ffffe8fffee031a8 0000000000000005
>  ffffffff94ca9dc0 0000000000000000 ffffffff94c03ea8 ffffffff94661f05
>  00001cb7dcf6fd93 ffffffff94ca9f90 ffffffff94c00000 ffffffff94d18870
> Call[31557.908912] NMI backtrace for cpu 2
> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 68.178 msecs
> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 10/410 27945]
> task: ffff880242b596f0 ti: ffff880242b6c000 task.ti: ffff880242b6c000
> RIP: 0010:[<ffffffff943dd415>]  [<ffffffff943dd415>] intel_idle+0xd5/0x180
> RSP: 0018:ffff880242b6fdf8  EFLAGS: 00000046
> RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
> RDX: 0000000000000000 RSI: ffff880242b6ffd8 RDI: 0000000000000002
> RBP: ffff880242b6fe28 R08: 000000008baf8b86 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
> R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242b6c000
> FS:  0000000000000000(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 0000000014c11000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
>  0000000242b6fe28 da97aa9b9f42090a ffffe8ffff2031a8 0000000000000005
>  ffffffff94ca9dc0 0000000000000002 ffff880242b6fe78 ffffffff94661f05
>  00001cb7dcdd1af6 ffffffff94ca9f90 ffff880242b6c000 ffffffff94d18870
> Call Trace:
>  [<ffffffff94661f05>] cpuidle_enter_state+0x55/0x1c0
>  [<ffffffff94662127>] cpuidle_enter+0x17/0x20
>  [<ffffffff940b94a3>] cpu_startup_entry+0x423/0x4d0
>  [<ffffffff9402b763>] start_secondary+0x1a3/0x220
> Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48 
> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 95.994 msecs
> 
> 
> <tangent>
> 
> Also, today I learned we can reach the perf_event_read code from
> read(). Given I had /proc/sys/kernel/perf_event_paranoid set to 1,
> I'm not sure how this is even possible. The only user of perf_fops
> is perf_event_open syscall _after_ it's checked that sysctl.
> 
> Oh, there's an ioctl path to perf too. Though trinity
> doesn't know anything about it, so I find it surprising if it
> managed to pull the right combination of entropy to make that
> do the right thing.  Still, that ioctl path probably needs
> to also be checking that sysctl shouldn't it ?
> 
> 	Dave
> 


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 21:01                                 ` Andy Lutomirski
@ 2014-11-19 21:47                                   ` Dave Jones
  2014-11-19 21:58                                     ` Borislav Petkov
  2014-11-19 21:56                                   ` [PATCH] x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1 Andy Lutomirski
  2014-11-20 15:25                                   ` frequent lockups in 3.18rc4 Dave Jones
  2 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-19 21:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra

On Wed, Nov 19, 2014 at 01:01:36PM -0800, Andy Lutomirski wrote:
 
 > TIF_NOHZ is not the same thing as NOHZ.  Can you try a kernel with
 > CONFIG_CONTEXT_TRACKING=n?  Doing that may involve fiddling with RCU
 > settings a bit.  The normal no HZ idle stuff has nothing to do with
 > TIF_NOHZ, and you either have TIF_NOHZ set or you have some kind of
 > thread_info corruption going on here.

I'll try that next.

 > > RSP: 0018:ffff880192d2fee8  EFLAGS: 00000246
 > > RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47
 > 
 >                                     ^^^^^^^^^
 > 
 > That is a strange coincidence.  Where did 0x46 | (1<<32) come from?
 > That's a sensible interrupts-disabled flags value with the high part set
 > to 0x1.  Those high bits are undefined, but they ought to all be zero.

This box is usually pretty solid, but it's been in service as a 24/7
fuzzing box for over a year now, so it's not outside the realm of
possibility that this could all be a hardware fault if some memory
has gone bad or the like.  Unless we find something obvious in the
next few days, I'll try running memtest over the weekend (though
I've seen situations where that doesn't stress hardware enough to
manifest a problem, so it might not be entirely conclusive unless
it actually finds a fault).

I wish I had a second identical box to see if it would be reproducible.

 > >  [<ffffffff941689c6>] perf_read+0x226/0x370
 > >  [<ffffffff942fbfb7>] ? security_file_permission+0x87/0xa0
 > >  [<ffffffff941eafff>] vfs_read+0x9f/0x180
 > >  [<ffffffff941ebbd8>] SyS_read+0x58/0xd0
 > >  [<ffffffff947d42c9>] tracesys_phase2+0xd4/0xd9
 > 
 > Riddle me this: what are we doing in tracesys_phase2?  This is a full
 > slow-path syscall.  TIF_NOHZ doesn't cause that, I think.  I'd love to
 > see the value of ti->flags here.  Is trinity using ptrace?
 
That's one of the few syscalls we actually blacklist (mostly because it
requires some more thinking: just passing it crap can get the fuzzer
into a confused state where it thinks child processes are dead, when
they aren't etc).  So it shouldn't be calling ptrace ever.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* [PATCH] x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1
  2014-11-19 21:01                                 ` Andy Lutomirski
  2014-11-19 21:47                                   ` Dave Jones
@ 2014-11-19 21:56                                   ` Andy Lutomirski
  2014-11-19 22:13                                     ` Thomas Gleixner
  2014-11-20 22:04                                     ` [tip:x86/urgent] " tip-bot for Andy Lutomirski
  2014-11-20 15:25                                   ` frequent lockups in 3.18rc4 Dave Jones
  2 siblings, 2 replies; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-19 21:56 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds
  Cc: Don Zickus, Thomas Gleixner, Linux Kernel, x86, Peter Zijlstra,
	Andy Lutomirski

TIF_NOHZ is 19 (i.e. _TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME |
_TIF_SINGLESTEP), not (1<<19).

This code is involved in Dave's trinity lockup, but I don't see why
it would cause any of the problems he's seeing, except inadvertently
by causing a different path through entry_64.S's syscall handling.
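
To illustrate (constants as in the 3.18-era asm/thread_info.h):

        /* TIF_NOHZ is a bit *number*, _TIF_NOHZ the corresponding *mask* */
        work &= ~TIF_NOHZ;      /* clears bits 0, 1 and 4 (19 == 0x13), i.e.
                                   _TIF_SYSCALL_TRACE, _TIF_NOTIFY_RESUME
                                   and _TIF_SINGLESTEP */
        work &= ~_TIF_NOHZ;     /* clears bit 19, as intended */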

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/ptrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 749b0e423419..e510618b2e91 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1484,7 +1484,7 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	 */
 	if (work & _TIF_NOHZ) {
 		user_exit();
-		work &= ~TIF_NOHZ;
+		work &= ~_TIF_NOHZ;
 	}
 
 #ifdef CONFIG_SECCOMP
-- 
1.9.3


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 19:02                                     ` Frederic Weisbecker
  2014-11-19 19:03                                       ` Andy Lutomirski
@ 2014-11-19 21:56                                       ` Thomas Gleixner
  2014-11-19 22:56                                         ` Frederic Weisbecker
  1 sibling, 1 reply; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-19 21:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra, Andy Lutomirski,
	Arnaldo Carvalho de Melo

On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
> I got a report lately involving context tracking. Not sure if it's
> the same here but the issue was that context tracking uses per cpu data
> and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
> lazy paging.

This is complete nonsense. pcpu allocations are populated right
away. Otherwise no single line of kernel code which uses dynamically
allocated per cpu storage would be safe.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 21:47                                   ` Dave Jones
@ 2014-11-19 21:58                                     ` Borislav Petkov
  2014-11-19 22:18                                       ` Dave Jones
  0 siblings, 1 reply; 486+ messages in thread
From: Borislav Petkov @ 2014-11-19 21:58 UTC (permalink / raw)
  To: Dave Jones
  Cc: Andy Lutomirski, Linus Torvalds, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra

On Wed, Nov 19, 2014 at 04:47:43PM -0500, Dave Jones wrote:
> This box is usually pretty solid, but it's been in service as a 24/7
> fuzzing box for over a year now, so it's not outside the realm of
> possibility that this could all be a hardware fault if some memory
> has gone bad or the like.

You could grep old logs for "Hardware Error" and the usual suspects
coming from MCE/EDAC. Also /var/log/mcelog or something like that,
depending on what's running on that box.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: [PATCH] x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1
  2014-11-19 21:56                                   ` [PATCH] x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1 Andy Lutomirski
@ 2014-11-19 22:13                                     ` Thomas Gleixner
  2014-11-20 20:33                                       ` Linus Torvalds
  2014-11-20 22:04                                     ` [tip:x86/urgent] " tip-bot for Andy Lutomirski
  1 sibling, 1 reply; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-19 22:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Jones, Linus Torvalds, Don Zickus, Linux Kernel, x86,
	Peter Zijlstra

On Wed, 19 Nov 2014, Andy Lutomirski wrote:

> TIF_NOHZ is 19 (i.e. _TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME |
> _TIF_SINGLESTEP), not (1<<19).
> 
> This code is involved in Dave's trinity lockup, but I don't see why
> it would cause any of the problems he's seeing, except inadvertently
> by causing a different path through entry_64.S's syscall handling.

Right: while it is wrong, it does not explain the wreckage on 3.17,
which does not have that code.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 19:38                                     ` Linus Torvalds
@ 2014-11-19 22:18                                       ` Dave Jones
  0 siblings, 0 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-19 22:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Don Zickus, Thomas Gleixner, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 11:38:09AM -0800, Linus Torvalds wrote:

 > > Is it possible that we've managed to return to userspace with
 > > interrupts off somehow?  A loop in userspace that somehow has
 > > interrupts off can cause all kinds of fun lockups.
 > 
 > That sounds unlikely, but if there is some stack corruption going on.
 > 
 > However, it wouldn't even explain things, because even if interrupts
 > had been disabled in user space, and even if that popf got executed,
 > this wouldn't be where they got enabled. That would be the :"sti" in
 > the system call entry path (hidden behind the ENABLE_INTERRUPTS
 > macro).
 > 
 > Of course, maybe Dave has paravirtualization enabled (what a crock
 > _that_ is), and there is something wrong with that whole code.

I've had HYPERVISOR_GUEST disabled for a while, which also disables
the paravirt code afaics.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 21:58                                     ` Borislav Petkov
@ 2014-11-19 22:18                                       ` Dave Jones
  2014-11-20 10:33                                         ` Borislav Petkov
  0 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-19 22:18 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Linus Torvalds, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra

On Wed, Nov 19, 2014 at 10:58:14PM +0100, Borislav Petkov wrote:
 > On Wed, Nov 19, 2014 at 04:47:43PM -0500, Dave Jones wrote:
 > > This box is usually pretty solid, but it's been in service as a 24/7
 > > fuzzing box for over a year now, so it's not outside the realm of
 > > possibility that this could all be a hardware fault if some memory
 > > has gone bad or the like.
 > 
 > You could grep old logs for "Hardware Error" and the usual suspects
 > coming from MCE/EDAC. Also /var/log/mcelog or something like that,
 > depending on what's running on that box.

Nothing, but it wouldn't be the first time I'd seen a hardware fault
that didn't raise an MCE.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 21:56                                       ` Thomas Gleixner
@ 2014-11-19 22:56                                         ` Frederic Weisbecker
  2014-11-19 22:59                                           ` Andy Lutomirski
  2014-11-19 23:09                                           ` Thomas Gleixner
  0 siblings, 2 replies; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-19 22:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra, Andy Lutomirski,
	Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 10:56:26PM +0100, Thomas Gleixner wrote:
> On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
> > I got a report lately involving context tracking. Not sure if it's
> > the same here but the issue was that context tracking uses per cpu data
> > and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
> > lazy paging.
> 
> This is complete nonsense. pcpu allocations are populated right
> away. Otherwise no single line of kernel code which uses dynamically
> allocated per cpu storage would be safe.

Note this isn't faulting because part of the allocation is swapped
out. No, it's all reserved in physical memory, but it's a lazy
allocation: part of it isn't yet addressed in the P[UGM?]D. That's
what vmalloc_fault() is for.

So it's a non-blocking, non-sleeping fault, which is why it's probably
fine most of the time, except in code that isn't fault-safe. And I
suspect that most people assume kernel data won't fault, so some other
places probably have similar issues.

That's a long standing issue. We even had to convert the perf callchain
allocation to ad-hoc kmalloc() based per cpu allocation to get over vmalloc
faults. At that time, NMIs couldn't handle faults and many callchains were
populated in NMIs. We had serious crashes because of per cpu memory faults.

I think that lazy addressing is there for allocation performance
reasons. But still, having faultable per cpu memory is insane IMHO.
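
For reference, the lazy sync it does looks roughly like this
(simplified sketch of the 64-bit vmalloc_fault() in arch/x86/mm/fault.c,
from memory, so details are approximate):

        static noinline int vmalloc_fault(unsigned long address)
        {
                pgd_t *pgd, *pgd_ref;

                if (!(address >= VMALLOC_START && address < VMALLOC_END))
                        return -1;      /* not a vmalloc address */

                /* whatever pgd is live in cr3 vs. the init_mm reference */
                pgd = pgd_offset(current->active_mm, address);
                pgd_ref = pgd_offset_k(address);
                if (pgd_none(*pgd_ref))
                        return -1;      /* not mapped in init_mm either */

                if (pgd_none(*pgd))
                        set_pgd(pgd, *pgd_ref); /* non-sleeping lazy sync */

                /* ... the walk continues down pud/pmd/pte ... */
                return 0;
        }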


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 22:56                                         ` Frederic Weisbecker
@ 2014-11-19 22:59                                           ` Andy Lutomirski
  2014-11-19 23:07                                             ` Frederic Weisbecker
  2014-11-19 23:09                                           ` Thomas Gleixner
  1 sibling, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-19 22:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 2:56 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Wed, Nov 19, 2014 at 10:56:26PM +0100, Thomas Gleixner wrote:
>> On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
>> > I got a report lately involving context tracking. Not sure if it's
>> > the same here but the issue was that context tracking uses per cpu data
>> > and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
>> > lazy paging.
>>
>> This is complete nonsense. pcpu allocations are populated right
>> away. Otherwise no single line of kernel code which uses dynamically
>> allocated per cpu storage would be safe.
>
> Note this isn't faulting because part of the allocation is swapped. No
> it's all reserved in the physical memory, but it's a lazy allocation.
> Part of it isn't yet addressed in the P[UGM?]D. That's what vmalloc_fault() is for.
>
> So it's a non-blocking/sleeping fault which is why it's probably fine
> most of the time except on code that isn't fault-safe. And I suspect that
> most people assume that kernel data won't fault so probably some other
> places have similar issues.
>
> That's a long standing issue. We even had to convert the perf callchain
> allocation to ad-hoc kmalloc() based per cpu allocation to get over vmalloc
> faults. At that time, NMIs couldn't handle faults and many callchains were
> populated in NMIs. We had serious crashes because of per cpu memory faults.

Is there seriously more than 512GB of per-cpu virtual space or
whatever's needed to exceed a single pgd on x86_64?

And there are definitely places that access per-cpu data in contexts
in which a non-IST fault is not allowed.  Maybe not dynamic per-cpu
data, though.

--Andy

>
> I think that lazy addressing is there for allocation performance
> reasons. But still, having faultable per cpu memory is insane IMHO.
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 19:03                                       ` Andy Lutomirski
@ 2014-11-19 23:00                                         ` Frederic Weisbecker
  2014-11-19 23:07                                           ` Andy Lutomirski
  0 siblings, 1 reply; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-19 23:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Dave Jones, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 11:03:48AM -0800, Andy Lutomirski wrote:
> On Wed, Nov 19, 2014 at 11:02 AM, Frederic Weisbecker
> <fweisbec@gmail.com> wrote:
> > On Wed, Nov 19, 2014 at 09:40:26AM -0800, Linus Torvalds wrote:
> >> On Wed, Nov 19, 2014 at 9:22 AM, Linus Torvalds
> >> <torvalds@linux-foundation.org> wrote:
> >> [...]
> >>
> >> And both context_tracking_user_enter() and exit() have that exact same
> >> issue with the early returns. They shouldn't have happened in the
> >> first place.
> >
> > I got a report lately involving context tracking. Not sure if it's
> > the same here but the issue was that context tracking uses per cpu data
> > and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
> > lazy paging.
> 
> Wait, what?  If something like kernel_stack ends with an unmapped pmd,
> we are well and truly screwed.

Note those are non-sleeping faults. So probably most places are fine,
except a few that really don't want an exception to mess up some
state. I can imagine some entry code that really doesn't want that.

Is kernel stack allocated by vmalloc or alloc_percpu()?

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 23:00                                         ` Frederic Weisbecker
@ 2014-11-19 23:07                                           ` Andy Lutomirski
  2014-11-19 23:13                                             ` Frederic Weisbecker
  0 siblings, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-19 23:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, Dave Jones, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 3:00 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Wed, Nov 19, 2014 at 11:03:48AM -0800, Andy Lutomirski wrote:
>> On Wed, Nov 19, 2014 at 11:02 AM, Frederic Weisbecker
>> <fweisbec@gmail.com> wrote:
>> > On Wed, Nov 19, 2014 at 09:40:26AM -0800, Linus Torvalds wrote:
>> >> On Wed, Nov 19, 2014 at 9:22 AM, Linus Torvalds
>> >> <torvalds@linux-foundation.org> wrote:
>> >> [...]
>> >>
>> >> And both context_tracking_user_enter() and exit() have that exact same
>> >> issue with the early returns. They shouldn't have happened in the
>> >> first place.
>> >
>> > I got a report lately involving context tracking. Not sure if it's
>> > the same here but the issue was that context tracking uses per cpu data
>> > and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
>> > lazy paging.
>>
>> Wait, what?  If something like kernel_stack ends with an unmapped pmd,
>> we are well and truly screwed.
>
> Note those are non-sleeping faults. So probably most places are fine,
> except a few that really don't want an exception to mess up some
> state. I can imagine some entry code that really doesn't want that.

Any non-IST fault at all on the kernel_stack reference in system_call
is instant root on non-SMAP systems and instant double-fault or more
challenging root on SMAP systems.  The issue is that rsp is
user-controlled, so the CPU cannot deliver a non-IST fault safely.

>
> Is kernel stack allocated by vmalloc or alloc_percpu()?

DEFINE_PER_CPU(unsigned long, kernel_stack)

Note that I'm talking about kernel_stack, not the kernel stack itself.
The actual stack is regular linearly-mapped memory, although I plan on
trying to change that, complete with all kinds of care to avoid double
faults.

--Andy



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 22:59                                           ` Andy Lutomirski
@ 2014-11-19 23:07                                             ` Frederic Weisbecker
  0 siblings, 0 replies; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-19 23:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 02:59:01PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 19, 2014 at 2:56 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > On Wed, Nov 19, 2014 at 10:56:26PM +0100, Thomas Gleixner wrote:
> >> On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
> >> > I got a report lately involving context tracking. Not sure if it's
> >> > the same here but the issue was that context tracking uses per cpu data
> >> > and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
> >> > lazy paging.
> >>
> >> This is complete nonsense. pcpu allocations are populated right
> >> away. Otherwise no single line of kernel code which uses dynamically
> >> allocated per cpu storage would be safe.
> >
> > Note this isn't faulting because part of the allocation is swapped. No
> > it's all reserved in the physical memory, but it's a lazy allocation.
> > Part of it isn't yet addressed in the P[UGM?]D. That's what vmalloc_fault() is for.
> >
> > So it's a non-blocking/sleeping fault which is why it's probably fine
> > most of the time except on code that isn't fault-safe. And I suspect that
> > most people assume that kernel data won't fault so probably some other
> > places have similar issues.
> >
> > That's a long standing issue. We even had to convert the perf callchain
> > allocation to ad-hoc kmalloc() based per cpu allocation to get over vmalloc
> > faults. At that time, NMIs couldn't handle faults and many callchains were
> > populated in NMIs. We had serious crashes because of per cpu memory faults.
> 
> Is there seriously more than 512GB of per-cpu virtual space or
> whatever's needed to exceed a single pgd on x86_64?

No idea, I'm clueless about -mm details.

> 
> And there are definitely places that access per-cpu data in contexts
> in which a non-IST fault is not allowed.  Maybe not dynamic per-cpu
> data, though.

It probably happens to be fine because the code that first accesses the
related data is fault-safe. Or maybe not, and some state is silently
messed up somewhere.

This doesn't leave a comfortable feeling.

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 22:56                                         ` Frederic Weisbecker
  2014-11-19 22:59                                           ` Andy Lutomirski
@ 2014-11-19 23:09                                           ` Thomas Gleixner
  2014-11-19 23:50                                             ` Frederic Weisbecker
  2014-11-19 23:54                                             ` Andy Lutomirski
  1 sibling, 2 replies; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-19 23:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra, Andy Lutomirski,
	Arnaldo Carvalho de Melo

On Wed, 19 Nov 2014, Frederic Weisbecker wrote:

> On Wed, Nov 19, 2014 at 10:56:26PM +0100, Thomas Gleixner wrote:
> > On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
> > > I got a report lately involving context tracking. Not sure if it's
> > > the same here but the issue was that context tracking uses per cpu data
> > > and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
> > > lazy paging.
> > 
> > This is complete nonsense. pcpu allocations are populated right
> > away. Otherwise no single line of kernel code which uses dynamically
> > allocated per cpu storage would be safe.
> 
> Note this isn't faulting because part of the allocation is
> swapped. No it's all reserved in the physical memory, but it's a
> lazy allocation.  Part of it isn't yet addressed in the
> P[UGM?]D. That's what vmalloc_fault() is for.

Sorry, I can't follow your argumentation here.

pcpu_alloc()
   ....
area_found:
   ....

        /* clear the areas and return address relative to base address */
        for_each_possible_cpu(cpu)
                memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);

How would that memset fail to establish the mapping, which is
btw. already established via:

     pcpu_populate_chunk()
  
already before that memset?   	    
 
Are we talking about different per cpu allocators here or am I missing
something completely non obvious?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 23:07                                           ` Andy Lutomirski
@ 2014-11-19 23:13                                             ` Frederic Weisbecker
  0 siblings, 0 replies; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-19 23:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Dave Jones, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 03:07:17PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 19, 2014 at 3:00 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > Note those are non-sleeping faults. So probably most places are fine,
> > except a few that really don't want an exception to mess up some
> > state. I can imagine some entry code that really doesn't want that.
> 
> Any non-IST fault at all on the kernel_stack reference in system_call
> is instant root on non-SMAP systems and instant double-fault or more
> challenging root on SMAP systems.  The issue is that rsp is
> user-controlled, so the CPU cannot deliver a non-IST fault safely.

Heh.

> >
> > Is kernel stack allocated by vmalloc or alloc_percpu()?
> 
> DEFINE_PER_CPU(unsigned long, kernel_stack)
> 
> Note that I'm talking about kernel_stack, not the kernel stack itself.

Ah. Note, static allocation like DEFINE_PER_CPU() is probably fine. The
issue is with dynamic allocations: alloc_percpu().
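
I.e., as a rough sketch (the static one lives in the kernel image's
percpu section and is mapped at boot; the dynamic one is backed by a
chunk that may sit in the vmalloc area):

        /* static: placed in .data..percpu inside the kernel image */
        DEFINE_PER_CPU(unsigned long, kernel_stack);

        /* dynamic: comes from a percpu chunk in the vmalloc range */
        unsigned long __percpu *counter = alloc_percpu(unsigned long);

        if (counter)
                *this_cpu_ptr(counter) = 0;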

> The actual stack is regular linearly-mapped memory, although I plan on
> trying to change that, complete with all kinds of care to avoid double
> faults.

If you do so, you must really ensure that the resulting memory will never
fault.

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 23:09                                           ` Thomas Gleixner
@ 2014-11-19 23:50                                             ` Frederic Weisbecker
  2014-11-20 12:23                                               ` Tejun Heo
  2014-11-19 23:54                                             ` Andy Lutomirski
  1 sibling, 1 reply; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-19 23:50 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo
  Cc: Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra, Andy Lutomirski,
	Arnaldo Carvalho de Melo

On Thu, Nov 20, 2014 at 12:09:22AM +0100, Thomas Gleixner wrote:
> On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
> 
> > On Wed, Nov 19, 2014 at 10:56:26PM +0100, Thomas Gleixner wrote:
> > > On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
> > > > I got a report lately involving context tracking. Not sure if it's
> > > > the same here but the issue was that context tracking uses per cpu data
> > > > and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
> > > > lazy paging.
> > > 
> > > This is complete nonsense. pcpu allocations are populated right
> > > away. Otherwise no single line of kernel code which uses dynamically
> > > allocated per cpu storage would be safe.
> > 
> > Note this isn't faulting because part of the allocation is
> > swapped. No it's all reserved in the physical memory, but it's a
> > lazy allocation.  Part of it isn't yet addressed in the
> > P[UGM?]D. That's what vmalloc_fault() is for.
> 
> Sorry, I can't follow your argumentation here.
> 
> pcpu_alloc()
>    ....
> area_found:
>    ....
> 
>         /* clear the areas and return address relative to base address */
>         for_each_possible_cpu(cpu)
>                 memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);
> 
> How would that memset fail to establish the mapping, which is
> btw. already established via:
> 
>      pcpu_populate_chunk()
>   
> already before that memset?   	    
>  
> Are we talking about different per cpu allocators here or am I missing
> something completely non obvious?

That's the same allocator, yeah. So if the whole area is dereferenced
at allocation time, faults indeed shouldn't happen.

Maybe that was a bug a few years ago but not anymore.

I'm surprised because I got a report from Dave that very much suggested
a vmalloc fault. See the discussion "Deadlock in vtime_account_user() vs itself across a page fault":

http://marc.info/?l=linux-kernel&m=141047612120263&w=2

Is it possible that, somehow, some part isn't zeroed by pcpu_alloc()?
After all, it's allocated with vzalloc(), so that part could be skipped.
The memset(0) is passed the whole size though, so it looks like the
whole area is dereferenced.

(cc'ing Tejun just in case).

Now if faults on percpu memory don't happen anymore, perhaps we are accessing some
other vmalloc'ed area. In the above report from Dave, the fault happened somewhere
in account_user_time().

> 
> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 23:09                                           ` Thomas Gleixner
  2014-11-19 23:50                                             ` Frederic Weisbecker
@ 2014-11-19 23:54                                             ` Andy Lutomirski
  2014-11-20  0:00                                               ` Thomas Gleixner
  1 sibling, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-19 23:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Arnaldo Carvalho de Melo

On Wed, Nov 19, 2014 at 3:09 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
>
>> On Wed, Nov 19, 2014 at 10:56:26PM +0100, Thomas Gleixner wrote:
>> > On Wed, 19 Nov 2014, Frederic Weisbecker wrote:
>> > > I got a report lately involving context tracking. Not sure if it's
>> > > the same here but the issue was that context tracking uses per cpu data
>> > > and per cpu allocation use vmalloc and vmalloc'ed area can fault due to
>> > > lazy paging.
>> >
>> > This is complete nonsense. pcpu allocations are populated right
>> > away. Otherwise no single line of kernel code which uses dynamically
>> > allocated per cpu storage would be safe.
>>
>> Note this isn't faulting because part of the allocation is
>> swapped. No it's all reserved in the physical memory, but it's a
>> lazy allocation.  Part of it isn't yet addressed in the
>> P[UGM?]D. That's what vmalloc_fault() is for.
>
> Sorry, I can't follow your argumentation here.
>
> pcpu_alloc()
>    ....
> area_found:
>    ....
>
>         /* clear the areas and return address relative to base address */
>         for_each_possible_cpu(cpu)
>                 memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);
>
> How would that memset fail to establish the mapping, which is
> btw. already established via:
>
>      pcpu_populate_chunk()
>
> already before that memset?

I think that this will map them into init_mm->pgd and
current->active_mm->pgd, but it won't necessarily map them into the
rest of the pgds.

At the risk of suggesting something awful, if we preallocated all 256
or whatever kernel pmd pages at boot, this whole problem would go away
forever.  It would only waste slightly under 1 MB of RAM (less on
extremely large systems).
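
A hypothetical sketch (not existing code, names approximate) of that
boot-time preallocation, so that every pgd later copied from init_mm is
complete and vmalloc_fault() never triggers:

        static void __init preallocate_kernel_pgds(void)
        {
                pgd_t *pgd = init_mm.pgd;
                int i;

                /* the kernel half of the address space */
                for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
                        pud_t *pud;

                        if (!pgd_none(pgd[i]))
                                continue;

                        pud = (pud_t *)get_zeroed_page(GFP_KERNEL);
                        if (!pud)
                                panic("cannot preallocate kernel puds");
                        set_pgd(pgd + i, __pgd(__pa(pud) | _PAGE_TABLE));
                }
        }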

--Andy

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 23:54                                             ` Andy Lutomirski
@ 2014-11-20  0:00                                               ` Thomas Gleixner
  2014-11-20  0:30                                                 ` Andy Lutomirski
  0 siblings, 1 reply; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-20  0:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Arnaldo Carvalho de Melo

On Wed, 19 Nov 2014, Andy Lutomirski wrote:
> On Wed, Nov 19, 2014 at 3:09 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > Sorry, I can't follow your argumentation here.
> >
> > pcpu_alloc()
> >    ....
> > area_found:
> >    ....
> >
> >         /* clear the areas and return address relative to base address */
> >         for_each_possible_cpu(cpu)
> >                 memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);
> >
> > How would that memset fail to establish the mapping, which is
> > btw. already established via:
> >
> >      pcpu_populate_chunk()
> >
> > already before that memset?
> 
> I think that this will map them into init_mm->pgd and
> current->active_mm->pgd, but it won't necessarily map them into the
> rest of the pgds.

And why would mapping them into the kernel mapping, i.e. init_mm not
be sufficient?

We are talking about kernel memory and not some random user space
mapping.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20  0:00                                               ` Thomas Gleixner
@ 2014-11-20  0:30                                                 ` Andy Lutomirski
  2014-11-20  0:40                                                   ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-20  0:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Linus Torvalds, Frederic Weisbecker, Don Zickus, Dave Jones,
	the arch/x86 maintainers

On Nov 19, 2014 4:00 PM, "Thomas Gleixner" <tglx@linutronix.de> wrote:
>
> On Wed, 19 Nov 2014, Andy Lutomirski wrote:
> > On Wed, Nov 19, 2014 at 3:09 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > Sorry, I can't follow your argumentation here.
> > >
> > > pcpu_alloc()
> > >    ....
> > > area_found:
> > >    ....
> > >
> > >         /* clear the areas and return address relative to base address */
> > >         for_each_possible_cpu(cpu)
> > >                 memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);
> > >
> > > How would that memset fail to establish the mapping, which is
> > > btw. already established via:
> > >
> > >      pcpu_populate_chunk()
> > >
> > > already before that memset?
> >
> > I think that this will map them into init_mm->pgd and
> > current->active_mm->pgd, but it won't necessarily map them into the
> > rest of the pgds.
>
> And why would mapping them into the kernel mapping, i.e. init_mm not
> be sufficient?

Because the kernel can run with any pgd loaded into cr3, and we rely
on vmalloc_fault to lazily populate pgds in all the non-init pgds as
needed.  But this only happens if the first TLB-missing reference to
the pgd in question with any given cr3 value happens from a safe
context.
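
For reference, the lazy population described here amounts to something
like the following in the page-fault path. This is a simplified sketch
of the x86-64 vmalloc_fault() logic, not the verbatim code.

    /* Sketch: on a fault in the vmalloc range, copy the missing
     * top-level entry from the reference page table (init_mm) into
     * the current pgd.  This only works if the faulting context can
     * survive a page fault at all -- which is exactly the catch. */
    static int vmalloc_fault_sketch(unsigned long address)
    {
            pgd_t *pgd     = pgd_offset(current->active_mm, address);
            pgd_t *pgd_ref = pgd_offset_k(address);

            if (pgd_none(*pgd_ref))
                    return -1;               /* not a lazily-mapped area */
            if (pgd_none(*pgd))
                    set_pgd(pgd, *pgd_ref);  /* populate lazily */
            return 0;
    }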

This is why I think that the grsec kernels will crash on very large
memory systems.  They don't seem to get this right for the kernel
stack, and a page fault trying to access the stack is a big no-no.

--Andy

>
> We are talking about kernel memory and not some random user space
> mapping.
>
> Thanks,
>
>         tglx

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20  0:30                                                 ` Andy Lutomirski
@ 2014-11-20  0:40                                                   ` Linus Torvalds
  2014-11-20  0:49                                                     ` Andy Lutomirski
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-20  0:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, linux-kernel, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones,
	the arch/x86 maintainers

On Wed, Nov 19, 2014 at 4:30 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> This is why I think that the grsec kernels will crash on very large
> memory systems.  They don't seem to get this right for the kernel
> stack, and a page fault trying to access the stack is a big no-no.

For something like a stack, that's trivial, you could just probe it
before the actual task switch.

So I wouldn't worry about the kernel stack itself (although I think
vmallocing it isn't likely worth it), I'd worry more about some other
random dynamic percpu allocation. Although they arguably shouldn't
happen for low-level code that cannot handle the dynamic
pgd-population. And they generally don't.

It's really tracing that tends to be a special case not because of any
particular low-level code issue, but because instrumenting itself
recursively tends to be a bad idea.

                    Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20  0:40                                                   ` Linus Torvalds
@ 2014-11-20  0:49                                                     ` Andy Lutomirski
  2014-11-20  1:07                                                       ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-20  0:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, linux-kernel, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones,
	the arch/x86 maintainers

On Wed, Nov 19, 2014 at 4:40 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 4:30 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> This is why I think that the grsec kernels will crash on very large
>> memory systems.  They don't seem to get this right for the kernel
>> stack, and a page fault trying to access the stack is a big no-no.
>
> For something like a stack, that's trivial, you could just probe it
> before the actual task switch.

I thought so for a while, too, but now I disagree.  On PGE hardware,
it seems entirely possible that the new stack would be in the TLB even
if it's not visible via cr3.  Then, as soon as the TLB entry expires,
we double-fault.

>
> So I wouldn't worry about the kernel stack itself (although I think
> vmallocing it isn't likely worth it),

I don't want vmalloc to avoid low-order allocations -- I want it to
have guard pages.  The fact that a user-triggerable stack overflow is
basically root right now and doesn't reliably OOPS scares me.

> I'd worry more about some other
> random dynamic percpu allocation. Although they arguably shouldn't
> happen for low-level code that cannot handle the dynamic
> pgd-population. And they generally don't.

This issue ought to be limited to nokprobes code, and I doubt that any
of that code touches dynamic per-cpu things.

>
> It's really tracing that tends to be a special case not because of any
> particular low-level code issue, but because instrumenting itself
> recursively tends to be a bad idea.
>
>                     Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20  0:49                                                     ` Andy Lutomirski
@ 2014-11-20  1:07                                                       ` Linus Torvalds
  2014-11-20  1:16                                                         ` Andy Lutomirski
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-20  1:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, linux-kernel, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones,
	the arch/x86 maintainers

On Wed, Nov 19, 2014 at 4:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> I thought so for a while, too, but now I disagree.  On PGE hardware,
> it seems entirely possible that the new stack would be in the TLB even
> if it's not visible via cr3.  Then, as soon as the TLB entry expires,
> we double-fault.

Ahh. Good point.

> I don't want vmalloc to avoid low-order allocations -- I want it to
> have guard pages.  The fact that a user-triggerable stack overflow is
> basically root right now and doesn't reliably OOPS scares me.

Well, if you do that, you would have to make the double-fault handler
aware of the stack issue anyway, and then you could just do the same
PGD repopulation that a page fault does and return (for the case where
you didn't overflow the stack, just had the page tables unpopulated -
obviously an actual stack overflow should do something more drastic).

                   Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20  1:07                                                       ` Linus Torvalds
@ 2014-11-20  1:16                                                         ` Andy Lutomirski
  2014-11-20  2:42                                                           ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-20  1:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, linux-kernel, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones,
	the arch/x86 maintainers

On Wed, Nov 19, 2014 at 5:07 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 4:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> I thought so for a while, too, but now I disagree.  On PGE hardware,
>> it seems entirely possible that the new stack would be in the TLB even
>> if it's not visible via cr3.  Then, as soon as the TLB entry expires,
>> we double-fault.
>
> Ahh. Good point.
>
>> I don't want vmalloc to avoid low-order allocations -- I want it to
>> have guard pages.  The fact that a user-triggerable stack overflow is
>> basically root right now and doesn't reliably OOPS scares me.
>
> Well, if you do that, you would have to make the double-fault handler
> aware of the stack issue anyway, and then you could just do the same
> PGD repopulation that a page fault does and return (for the case where
> you didn't overflow the stack, just had the page tables unpopulated -
> obviously an actual stack overflow should do something more drastic).

And you were calling me crazy? :)

We could be restarting just about anything if that happens.  Except
that if we double-faulted on a trap gate entry instead of an interrupt
gate entry, then we can't restart, and, unless we can somehow decode
the error code usefully (it's woefully undocumented), int 0x80 and
int3 might be impossible to handle correctly if it double-faults.  And
please don't suggest moving int 0x80 to an IST stack :)

The SDM specifically says that you must not try to recover after a
double-fault.  We do, however, recover from a double-fault in the
specific case of an iret failure during espfix64 processing (and I
even have a nice test case for it), but I think that hpa had a long
conversation with one of the microcode architects before he was okay
with that.

--Andy

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20  1:16                                                         ` Andy Lutomirski
@ 2014-11-20  2:42                                                           ` Linus Torvalds
  2014-11-20  6:16                                                             ` Andy Lutomirski
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-20  2:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, linux-kernel, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones,
	the arch/x86 maintainers

On Wed, Nov 19, 2014 at 5:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> And you were calling me crazy? :)

Hey, I'm crazy like a fox.

> We could be restarting just about anything if that happens. Except
> that if we double-faulted on a trap gate entry instead of an interrupt
> gate entry, then we can't restart, and, unless we can somehow decode
> the error code usefully (it's woefully undocumented), int 0x80 and
> int3 might be impossible to handle correctly if it double-faults.  And
> please don't suggest moving int 0x80 to an IST stack :)

No, no.  So tell me if this won't work:

 - when forking a new process, make sure we allocate the vmalloc stack
*before* we copy the vm

 - this should guarantee that all new processes will at least have their
*own* stack always in their page tables, since vmalloc always fills in
the current page tables of the thread doing the vmalloc.

HOWEVER, that leaves the task switch *to* that process, and making
sure that the stack pointer is ok in between the "switch %rsp" and
"switch %cr3".

So then we make the rule be: switch %cr3 *before* switching %rsp, and
only in between those places can we get in trouble. Yes/no?

And that small section is all with interrupts disabled, and nothing
should take an exception. The C code might take a double fault on a
regular access to the old stack (the *new* stack is guaranteed to be
mapped, but the old stack is not), but that should be very similar to
what we already do with "iret". So we can just fill in the page tables
and return.

For safety, add a percpu counter that is cleared before the %cr3
setting, to make sure that we only do a *single* double-fault, but it
really sounds pretty safe. No?
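
As a sketch, that guard could look like the snippet below. All names
here are invented for illustration; this is not actual kernel code.

    /* Allow exactly one double-fault fixup per context switch. */
    DEFINE_PER_CPU(int, df_fixups_left);

    /* task-switch path, interrupts disabled: */
    this_cpu_write(df_fixups_left, 1);
    write_cr3(__pa(next->mm->pgd));   /* switch %cr3 first ... */
    /* ... then switch %rsp; touching the old stack in between may
     * double-fault, but at most once */

    /* double-fault handler: */
    if (this_cpu_read(df_fixups_left) > 0) {
            this_cpu_dec(df_fixups_left);
            /* fill in the missing PGD entry from init_mm and return */
    }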

The only deadly thing would be NMI, but that's an IST anyway, so not
an issue. No other traps should be able to happen except the double
page table miss.

But hey, maybe I'm not crazy like a fox. Maybe I'm just plain crazy,
and I missed something else.

And no, I don't think the above is necessarily a *good* idea. But it
doesn't seem really overly complicated either.

                      Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20  2:42                                                           ` Linus Torvalds
@ 2014-11-20  6:16                                                             ` Andy Lutomirski
  0 siblings, 0 replies; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-20  6:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, linux-kernel, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones,
	the arch/x86 maintainers

On Wed, Nov 19, 2014 at 6:42 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 5:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> And you were calling me crazy? :)
>
> Hey, I'm crazy like a fox.
>
>> We could be restarting just about anything if that happens. Except
>> that if we double-faulted on a trap gate entry instead of an interrupt
>> gate entry, then we can't restart, and, unless we can somehow decode
>> the error code usefully (it's woefully undocumented), int 0x80 and
>> int3 might be impossible to handle correctly if it double-faults.  And
>> please don't suggest moving int 0x80 to an IST stack :)
>
> No, no.  So tell me if this won't work:
>
>  - when forking a new process, make sure we allocate the vmalloc stack
> *before* we copy the vm
>
>  - this should guarantee that all new processes will at least have their
> *own* stack always in their page tables, since vmalloc always fills in
> the current page tables of the thread doing the vmalloc.

This gets interesting for kernel threads that don't really have an mm
in the first place, though.

>
> HOWEVER, that leaves the task switch *to* that process, and making
> sure that the stack pointer is ok in between the "switch %rsp" and
> "switch %cr3".
>
> So then we make the rule be: switch %cr3 *before* switching %rsp, and
> only in between those places can we get in trouble. Yes/no?
>

Kernel threads aside, sure.  And we do it in this order anyway, I think.

> And that small section is all with interrupts disabled, and nothing
> should take an exception. The C code might take a double fault on a
> regular access to the old stack (the *new* stack is guaranteed to be
> mapped, but the old stack is not), but that should be very similar to
> what we already do with "iret". So we can just fill in the page tables
> and return.

Unless we try to dump the stack from an NMI or something, but that
should be fine regardless.

>
> For safety, add a percpu counter that is cleared before the %cr3
> setting, to make sure that we only do a *single* double-fault, but it
> really sounds pretty safe. No?

I wouldn't be surprised if that's just as expensive as just fixing up
the pgd in the first place.  The fixup is just:

if (unlikely(pte_none(mm->pgd[pgd_address(rsp)]))) fix it;

or something like that.
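
A slightly more concrete form of that fixup, for the context-switch
path, might look like the sketch below (the helper name is invented;
note that at the top level the test would be pgd_none(), not
pte_none()):

    /* Hypothetical helper: if the kernel PGD entry covering the new
     * stack is still empty in the next mm, copy it from init_mm
     * before %rsp starts pointing into that stack. */
    static inline void fixup_stack_pgd(struct mm_struct *mm, unsigned long rsp)
    {
            pgd_t *pgd = mm->pgd + pgd_index(rsp);

            if (unlikely(pgd_none(*pgd)))
                    set_pgd(pgd, *pgd_offset_k(rsp));
    }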

>
> The only deadly thing would be NMI, but that's an IST anyway, so not
> an issue. No other traps should be able to happen except the double
> page table miss.
>
> But hey, maybe I'm not crazy like a fox. Maybe I'm just plain crazy,
> and I missed something else.

I actually kind of like it, other than the kernel thread issue.

We should arguably ditch lazy mm for kernel threads in favor of PCID,
but that's a different story.  Or we could beg Intel to give us
separate kernel and user page table hierarchies.

--Andy

>
> And no, I don't think the above is necessarily a *good* idea. But it
> doesn't seem really overly complicated either.
>
>                       Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 14:41                             ` Don Zickus
  2014-11-19 15:03                               ` Vivek Goyal
@ 2014-11-20  9:54                               ` Dave Young
  1 sibling, 0 replies; 486+ messages in thread
From: Dave Young @ 2014-11-20  9:54 UTC (permalink / raw)
  To: Don Zickus
  Cc: Dave Jones, Thomas Gleixner, Linus Torvalds, Linux Kernel,
	the arch/x86 maintainers, vgoyal

On 11/19/14 at 09:41am, Don Zickus wrote:
> On Tue, Nov 18, 2014 at 05:02:54PM -0500, Dave Jones wrote:
> > On Tue, Nov 18, 2014 at 04:55:40PM -0500, Don Zickus wrote:
> > 
> >  > > So here we mangle CPU3 in and lose the backtrace for cpu0, which might
> >  > > be the real interesting one ....
> >  > 
> >  > Can you provide another dump?  The hope is we get something not mangled?
> > 
> > Working on it..
> > 
> >  > The other option we have done in RHEL is panic the system and let kdump
> >  > capture the memory.  Then we can analyze the vmcore for the stack trace
> >  > cpu0 stored in memory to get a rough idea where it might be if the cpu
> >  > isn't responding very well.
> > 
> > I don't know if it's because of the debug options I typically run with,
> > or that I'm perpetually cursed, but I've never managed to get kdump to
> > do anything useful. (The last time I tried it was actively harmful in
> > that not only did it fail to dump anything, it wedged the machine so
> > it didn't reboot after panic).
> > 
> > Unless there's some magic step missing from the documentation at
> > http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
> > then I'm not optimistic it'll be useful.
> 
> Well, I don't know when the last time you ran it was, but I know the RH
> kexec folks started pursuing a Fedora-first package patch rule a couple
> of years ago to ensure Fedora had a working kexec/kdump solution.

It started from Fedora 17. I think for Fedora releases before F17 kdump support
was very limited; it has been getting better.

> 
> As for the wedging part, it was a common problem to have the kernel hang
> while trying to boot the second kernel (and before console output
> happened).  So the problem makes sense and is unfortunate.  I would
> encourage you to try again.  :-)

In Fedora we will have more such issues than RHEL because the kernel is updated
frequently. There are occasionally new problems in the upstream kernel, such as
the kaslr feature on x86.

The problem for Fedora is that it is not enabled by default, so the user needs
to explicitly specify the crashkernel reservation on the kernel cmdline and
enable the kdump service.

There are very few bugs reported from Fedora users, so I guess it is not well
tested in the Fedora community. Since Dave brought up this issue, I think it's
at least good news to us that someone is using it. We can address the problems
case by case then.

Probably a good way to get more testing is to add the kdump anaconda addon by
default at installation time, so the user can choose whether to enable kdump.

Thanks
Dave

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 22:18                                       ` Dave Jones
@ 2014-11-20 10:33                                         ` Borislav Petkov
  0 siblings, 0 replies; 486+ messages in thread
From: Borislav Petkov @ 2014-11-20 10:33 UTC (permalink / raw)
  To: Dave Jones
  Cc: Andy Lutomirski, Linus Torvalds, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra

On Wed, Nov 19, 2014 at 05:18:42PM -0500, Dave Jones wrote:
> Nothing, but it wouldn't be the first time I'd seen a hardware fault
> that didn't raise an MCE.

And maybe it tried but it didn't manage to come out due to hard wedging. :-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 23:50                                             ` Frederic Weisbecker
@ 2014-11-20 12:23                                               ` Tejun Heo
  2014-11-20 21:58                                                 ` Thomas Gleixner
  0 siblings, 1 reply; 486+ messages in thread
From: Tejun Heo @ 2014-11-20 12:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Andy Lutomirski, Arnaldo Carvalho de Melo

Hello,

On Thu, Nov 20, 2014 at 12:50:36AM +0100, Frederic Weisbecker wrote:
> > Are we talking about different per cpu allocators here or am I missing
> > something completely non obvious?
> 
> That's the same allocator yeah. So if the whole memory is dereferenced,
> faults shouldn't happen indeed.
> 
> Maybe that was a bug a few years ago but not anymore.

It has always been like that tho.  Percpu memory given out is always
populated and cleared.

> Is it possible that, somehow, some part isn't zeroed by pcpu_alloc()?
> After all it's allocated with vzalloc() so that part could be skipped. The memset(0)

The vzalloc call is for the internal allocation bitmap not the actual
percpu memory area.  The actual address areas for percpu memory are
obtained using pcpu_get_vm_areas() call and later get populated using
map_kernel_range_noflush() (flush is performed after mapping is
complete).

Trying to remember what happens with vmalloc_fault().  Ah okay, so
when a new PUD gets created for the vmalloc area, we don't go through all
PGDs and update them.  The PGD entries get faulted in lazily.  Percpu
memory allocator clearing or not clearing the allocated area doesn't
have anything to do with it.  The memory area is always fully
populated in the kernel page table.  It's just that the population
happened while a different PGD was active and this PGD hasn't been
populated with the new PUD yet.

So, yeap, vmalloc_fault() can always happen when accessing vmalloc
areas and the only way to avoid that would be removing lazy PGD
population - going through all PGDs and populating new PUDs
immediately.
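
For illustration, that eager alternative would look roughly like the
sketch below, modeled on x86-64's sync_global_pgds(); locking and error
handling are elided, and this is not the verbatim code.

    /* Sketch: when a new kernel PUD appears, propagate the PGD entry
     * to every process page table on pgd_list, so that no
     * vmalloc_fault() is ever needed for this range. */
    static void sync_kernel_pgds(unsigned long start, unsigned long end)
    {
            unsigned long addr;

            for (addr = start; addr <= end;
                 addr = ALIGN(addr + 1, PGDIR_SIZE)) {
                    pgd_t *ref = pgd_offset_k(addr);
                    struct page *page;

                    if (pgd_none(*ref))
                            continue;
                    list_for_each_entry(page, &pgd_list, lru) {
                            pgd_t *pgd = (pgd_t *)page_address(page)
                                         + pgd_index(addr);

                            if (pgd_none(*pgd))
                                    set_pgd(pgd, *ref);
                    }
            }
    }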

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19  5:15                               ` Dave Jones
@ 2014-11-20 14:36                                 ` Frederic Weisbecker
  0 siblings, 0 replies; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-20 14:36 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers

On Wed, Nov 19, 2014 at 12:15:24AM -0500, Dave Jones wrote:
> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
> 
>  > Hmm, if we are getting soft-lockups here, maybe it suggest too much exit-work.
>  > 
>  > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
>  > 
>  > That makes me wonder: does the problem go away if you disable NOHZ?
> 
> Does nohz=off do enough ? I couldn't convince myself after looking at
> dmesg, and still seeing dynticks stuff in there.
> 
> I'll do a rebuild with all the CONFIG_NO_HZ stuff off, though it also changes
> some other config stuff wrt timers.

You also need to disable context tracking. So you need to also deactivate
CONFIG_RCU_USER_QS and CONFIG_CONTEXT_TRACKING_FORCE, and make sure
nothing else is turning on CONFIG_CONTEXT_TRACKING.

You can keep CONFIG_NO_HZ_IDLE though, just not CONFIG_NO_HZ_FULL.
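
In .config terms, the end result should look something like this
(assuming nothing else in your config selects the context tracking
options):

    CONFIG_NO_HZ_IDLE=y
    # CONFIG_NO_HZ_FULL is not set
    # CONFIG_RCU_USER_QS is not set
    # CONFIG_CONTEXT_TRACKING_FORCE is not set
    # CONFIG_CONTEXT_TRACKING is not set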

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 14:59                               ` Dave Jones
  2014-11-19 17:22                                 ` Linus Torvalds
  2014-11-19 21:01                                 ` Andy Lutomirski
@ 2014-11-20 15:04                                 ` Frederic Weisbecker
  2 siblings, 0 replies; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-20 15:04 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra

On Wed, Nov 19, 2014 at 09:59:02AM -0500, Dave Jones wrote:
> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
>  > On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <davej@redhat.com> wrote:
>  > >
>  > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
>  > > CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
>  > > RIP: 0010:[<ffffffff8a1798b4>]  [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
>  > > Call Trace:
>  > >  [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
>  > >  [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
>  > 
>  > Hmm, if we are getting soft-lockups here, maybe it suggest too much exit-work.
>  > 
>  > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
>  > 
>  > That makes me wonder: does the problem go away if you disable NOHZ?
> 
> Aparently not.
> 
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
> CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff8800364e44d0 ti: ffff880192d2c000 task.ti: ffff880192d2c000
> RIP: 0010:[<ffffffff94175be7>]  [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120
> RSP: 0018:ffff880192d2fee8  EFLAGS: 00000246
> RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47
> RDX: 0000000000000001 RSI: ffffffff94ac1e84 RDI: ffffffff94a93725
> RBP: ffff880192d2fef8 R08: 00007f9b74d0b740 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: ffffffff940d8503
> R13: ffff880192d2fe98 R14: ffffffff943884e7 R15: ffff880192d2fe48
> FS:  00007f9b74d0b740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000336f1b7740 CR3: 0000000229a95000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
>  ffff880192d30000 0000000000080000 ffff880192d2ff78 ffffffff94012c25
>  00007f9b747a5000 00007f9b747a5068 0000000000000000 0000000000000000
>  0000000000000000 ffffffff9437b3be 0000000000000000 0000000000000000
> Call Trace:
>  [<ffffffff94012c25>] syscall_trace_enter_phase1+0x125/0x1a0
>  [<ffffffff9437b3be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>  [<ffffffff947d41bf>] tracesys+0x14/0x4a
> Code: 42 fd ff 48 c7 c7 7a 1e ac 94 e8 25 29 21 00 65 8b 04 25 34 f7 1c 00 83 f8 01 74 28 f6 c7 02 74 13 0f 1f 00 e8 bb 43 fd ff 53 9d <5b> 41 5c 5d c3 0f 1f 40 00 53 9d e8 89 42 fd ff eb ee 0f 1f 80 
> sending NMI to other CPUs:
> NMI backtrace for cpu 1
> CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff88011600dbc0 ti: ffff8801a99a4000 task.ti: ffff8801a99a4000
> RIP: 0010:[<ffffffff940fb71e>]  [<ffffffff940fb71e>] generic_exec_single+0xee/0x1a0
> RSP: 0018:ffff8801a99a7d18  EFLAGS: 00000202
> RAX: 0000000000000000 RBX: ffff8801a99a7d20 RCX: 0000000000000038
> RDX: 00000000000000ff RSI: 0000000000000008 RDI: 0000000000000000
> RBP: ffff8801a99a7d78 R08: ffff880242b57ce0 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
> R13: 0000000000000001 R14: ffff880083c28948 R15: ffffffff94166aa0
> FS:  00007f9b74d0b740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000001 CR3: 00000001d8611000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
>  ffff8801a99a7d28 0000000000000000 ffffffff94166aa0 ffff880083c28948
>  0000000000000003 00000000e38f9aac ffff880083c28948 00000000ffffffff
>  0000000000000003 ffffffff94166aa0 ffff880083c28948 0000000000000001
> Call Trace:
>  [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
>  [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
>  [<ffffffff940fb89a>] smp_call_function_single+0x6a/0xe0

One thing that happens a lot in your crashes is a CPU sending IPIs. Maybe
it's stuck polling on csd->lock or something. But it's not the CPU that
soft-locks up. At least not the first one that gets reported.

>  [<ffffffff940a172b>] ? preempt_count_sub+0x7b/0x100
>  [<ffffffff941671aa>] perf_event_read+0xca/0xd0
>  [<ffffffff94167240>] perf_event_read_value+0x90/0xe0
>  [<ffffffff941689c6>] perf_read+0x226/0x370
>  [<ffffffff942fbfb7>] ? security_file_permission+0x87/0xa0
>  [<ffffffff941eafff>] vfs_read+0x9f/0x180
>  [<ffffffff941ebbd8>] SyS_read+0x58/0xd0
>  [<ffffffff947d42c9>] tracesys_phase2+0xd4/0xd9

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-17 17:03         ` Dave Jones
  2014-11-17 19:59           ` Linus Torvalds
@ 2014-11-20 15:08           ` Frederic Weisbecker
  2014-11-20 16:19             ` Dave Jones
  1 sibling, 1 reply; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-20 15:08 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Mon, Nov 17, 2014 at 12:03:59PM -0500, Dave Jones wrote:
> On Sat, Nov 15, 2014 at 10:33:19PM -0800, Linus Torvalds wrote:
>  
>  > >  > I'll try that next, and check in on it tomorrow.
>  > >
>  > > No luck. Died even faster this time.
>  > 
>  > Yeah, and your other lockups haven't even been TLB related. Not that
>  > they look like anything else *either*.
>  > 
>  > I have no ideas left. I'd go for a bisection - rather than try random
>  > things, at least bisection will get us a smaller set of suspects if
>  > you can go through a few cycles of it. Even if you decide that you
>  > want to run for most of a day before you are convinced it's all good,
>  > a couple of days should get you a handful of bisection points (that's
>  > assuming you hit a couple of bad ones too that turn bad in a shorter
>  > while). And 4 or five bisections should get us from 11k commits down
>  > to the ~600 commit range. That would be a huge improvement.
> 
> Great start to the week: I decided to confirm my recollection that .17
> was ok, only to hit this within 10 minutes.
> 
> Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
> CPU: 3 PID: 17176 Comm: trinity-c95 Not tainted 3.17.0+ #87
>  0000000000000000 00000000f3a61725 ffff880244606bf0 ffffffff9583e9fa
>  ffffffff95c67918 ffff880244606c78 ffffffff9583bcc0 0000000000000010
>  ffff880244606c88 ffff880244606c20 00000000f3a61725 0000000000000000
> Call Trace:
>  <NMI>  [<ffffffff9583e9fa>] dump_stack+0x4e/0x7a
>  [<ffffffff9583bcc0>] panic+0xd4/0x207
>  [<ffffffff95150908>] watchdog_overflow_callback+0x118/0x120
>  [<ffffffff95193dbe>] __perf_event_overflow+0xae/0x340
>  [<ffffffff95192230>] ? perf_event_task_disable+0xa0/0xa0
>  [<ffffffff9501a7bf>] ? x86_perf_event_set_period+0xbf/0x150
>  [<ffffffff95194be4>] perf_event_overflow+0x14/0x20
>  [<ffffffff95020676>] intel_pmu_handle_irq+0x206/0x410
>  [<ffffffff9501966b>] perf_event_nmi_handler+0x2b/0x50
>  [<ffffffff95007bb2>] nmi_handle+0xd2/0x390
>  [<ffffffff95007ae5>] ? nmi_handle+0x5/0x390
>  [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
>  [<ffffffff950080a2>] default_do_nmi+0x72/0x1c0
>  [<ffffffff950082a8>] do_nmi+0xb8/0x100
>  [<ffffffff9584b9aa>] end_repeat_nmi+0x1e/0x2e
>  [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
>  [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
>  [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
>  <<EOE>>  <IRQ>  [<ffffffff95101685>] lock_hrtimer_base.isra.18+0x25/0x50
>  [<ffffffff951019d3>] hrtimer_try_to_cancel+0x33/0x1f0

Ah that one got fixed in the merge window and in -stable, right?

>  [<ffffffff95101baa>] hrtimer_cancel+0x1a/0x30
>  [<ffffffff95113557>] tick_nohz_restart+0x17/0x90
>  [<ffffffff95114533>] __tick_nohz_full_check+0xc3/0x100
>  [<ffffffff9511457e>] nohz_full_kick_work_func+0xe/0x10
>  [<ffffffff95188894>] irq_work_run_list+0x44/0x70
>  [<ffffffff951888ea>] irq_work_run+0x2a/0x50
>  [<ffffffff9510109b>] update_process_times+0x5b/0x70
>  [<ffffffff95113325>] tick_sched_handle.isra.20+0x25/0x60
>  [<ffffffff95113801>] tick_sched_timer+0x41/0x60
>  [<ffffffff95102281>] __run_hrtimer+0x81/0x480
>  [<ffffffff951137c0>] ? tick_sched_do_timer+0xb0/0xb0
>  [<ffffffff95102977>] hrtimer_interrupt+0x117/0x270
>  [<ffffffff950346d7>] local_apic_timer_interrupt+0x37/0x60
>  [<ffffffff9584c44f>] smp_apic_timer_interrupt+0x3f/0x50
>  [<ffffffff9584a86f>] apic_timer_interrupt+0x6f/0x80
>  <EOI>  [<ffffffff950d3f3a>] ? lock_release_holdtime.part.28+0x9a/0x160
>  [<ffffffff950ef3b7>] ? rcu_is_watching+0x27/0x60
>  [<ffffffff9508cb75>] kill_pid_info+0xf5/0x130
>  [<ffffffff9508ca85>] ? kill_pid_info+0x5/0x130
>  [<ffffffff9508ccd3>] SYSC_kill+0x103/0x330
>  [<ffffffff9508cc7c>] ? SYSC_kill+0xac/0x330
>  [<ffffffff9519b592>] ? context_tracking_user_exit+0x52/0x1a0
>  [<ffffffff950d6f1d>] ? trace_hardirqs_on_caller+0x16d/0x210
>  [<ffffffff950d6fcd>] ? trace_hardirqs_on+0xd/0x10
>  [<ffffffff950137ad>] ? syscall_trace_enter+0x14d/0x330
>  [<ffffffff9508f44e>] SyS_kill+0xe/0x10
>  [<ffffffff95849b24>] tracesys+0xdd/0xe2
> Kernel Offset: 0x14000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 
> It could a completely different cause for lockup, but seeing this now
> has me wondering if perhaps it's something unrelated to the kernel.
> I have recollection of running late .17rc's for days without incident,
> and I'm pretty sure .17 was ok too.  But a few weeks ago I did upgrade
> that test box to the Fedora 21 beta.  Which means I have a new gcc.
> I'm not sure I really trust 4.9.1 yet, so maybe I'll see if I can
> get 4.8 back on there and see if that's any better.
> 
> 	Dave
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 21:01                                 ` Andy Lutomirski
  2014-11-19 21:47                                   ` Dave Jones
  2014-11-19 21:56                                   ` [PATCH] x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1 Andy Lutomirski
@ 2014-11-20 15:25                                   ` Dave Jones
  2014-11-20 19:43                                     ` Linus Torvalds
  2014-11-25 12:22                                     ` Will Deacon
  2 siblings, 2 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-20 15:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Don Zickus, Thomas Gleixner, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra

On Wed, Nov 19, 2014 at 01:01:36PM -0800, Andy Lutomirski wrote:
 
 > TIF_NOHZ is not the same thing as NOHZ.  Can you try a kernel with
 > CONFIG_CONTEXT_TRACKING=n?  Doing that may involve fiddling with RCU
 > settings a bit.  The normal no HZ idle stuff has nothing to do with
 > TIF_NOHZ, and you either have TIF_NOHZ set or you have some kind of
 > thread_info corruption going on here.

Disabling CONTEXT_TRACKING didn't change the problem.
Unfortunately the full trace didn't make it over usb-serial this time. Grr.

Here's what came over serial..

NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [trinity-c35:11634]
CPU: 2 PID: 11634 Comm: trinity-c35 Not tainted 3.18.0-rc5+ #94 [loadavg: 164.79 157.30 155.90 37/409 11893]
task: ffff88014e0d96f0 ti: ffff880220eb4000 task.ti: ffff880220eb4000
RIP: 0010:[<ffffffff88379605>]  [<ffffffff88379605>] copy_user_enhanced_fast_string+0x5/0x10
RSP: 0018:ffff880220eb7ef0  EFLAGS: 00010283
RAX: ffff880220eb4000 RBX: ffffffff887dac64 RCX: 0000000000006a18
RDX: 000000000000e02f RSI: 00007f766f466620 RDI: ffff88016f6a7617
RBP: ffff880220eb7f78 R08: 8000000000000063 R09: 0000000000000004
R10: 0000000000000010 R11: 0000000000000000 R12: ffffffff880bf50d
R13: 0000000000000001 R14: ffff880220eb4000 R15: 0000000000000001
FS:  00007f766f459740(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f766f461000 CR3: 000000018b00e000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 ffffffff882f4225 ffff880183db5a00 0000000001743440 00007f766f0fb000
 fffffffffffffeff 0000000000000000 0000000000008d79 00007f766f45f000
 ffffffff8837adae 00ff880220eb7f38 000000003203f1ac 0000000000000001
Call Trace:
 [<ffffffff882f4225>] ? SyS_add_key+0xd5/0x240
 [<ffffffff8837adae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff887da092>] system_call_fastpath+0x12/0x17
Code: 48 ff c6 48 ff c7 ff c9 75 f2 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 1f 00 c3 0f 1f 80 00 00 00 00 0f 1f 00 89 d1 <f3> a4 31 c0 0f 1f 00 c3 90 90 90 0f 1f 00 83 fa 08 0f 82 95 00 
sending NMI to other CPUs:


Here's a crappy phonecam pic of the screen. 
http://codemonkey.org.uk/junk/IMG_4311.jpg
There's a bit of trace missing between the above and what was on
the screen, so we missed some CPUs.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-16  1:40     ` Dave Jones
  2014-11-16  6:33       ` Linus Torvalds
@ 2014-11-20 15:28       ` Frederic Weisbecker
  1 sibling, 0 replies; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-20 15:28 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel,
	the arch/x86 maintainers, Andi Lutomirski

On Sat, Nov 15, 2014 at 08:40:06PM -0500, Dave Jones wrote:
> On Sat, Nov 15, 2014 at 04:34:05PM -0500, Dave Jones wrote:
>  > On Fri, Nov 14, 2014 at 02:01:27PM -0800, Linus Torvalds wrote:
>  > 
>  >  > But since you say "several times a day", just for fun, can you test
>  >  > the follow-up patch to that one-liner fix that Will Deacon posted
>  >  > today (Subject: "[PATCH] mmu_gather: move minimal range calculations
>  >  > into generic code"). That does some further cleanup in this area.
>  > 
>  > A few hours ago it hit the NMI watchdog again with that patch applied.
>  > Incomplete trace, but it looks different based on what did make it over.
>  > Different RIP at least.
>  > 
>  > [65155.054155] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [trinity-c127:12559]
>  > [65155.054573] irq event stamp: 296752
>  > [65155.054589] hardirqs last  enabled at (296751): [<ffffffff9d87403d>] _raw_spin_unlock_irqrestore+0x5d/0x80
>  > [65155.054625] hardirqs last disabled at (296752): [<ffffffff9d875cea>] apic_timer_interrupt+0x6a/0x80
>  > [65155.054657] softirqs last  enabled at (296188): [<ffffffff9d259943>] bdi_queue_work+0x83/0x270
>  > [65155.054688] softirqs last disabled at (296184): [<ffffffff9d259920>] bdi_queue_work+0x60/0x270
>  > [65155.054721] CPU: 1 PID: 12559 Comm: trinity-c127 Not tainted 3.18.0-rc4+ #84 [loadavg: 209.68 187.90 185.33 34/431 17515]
>  > [65155.054795] task: ffff88023f664680 ti: ffff8801649f0000 task.ti: ffff8801649f0000
>  > [65155.054820] RIP: 0010:[<ffffffff9d87403f>]  [<ffffffff9d87403f>] _raw_spin_unlock_irqrestore+0x5f/0x80
>  > [65155.054852] RSP: 0018:ffff8801649f3be8  EFLAGS: 00000292
>  > [65155.054872] RAX: ffff88023f664680 RBX: 0000000000000007 RCX: 0000000000000007
>  > [65155.054895] RDX: 00000000000029e0 RSI: ffff88023f664ea0 RDI: ffff88023f664680
>  > [65155.054919] RBP: ffff8801649f3bf8 R08: 0000000000000000 R09: 0000000000000000
>  > [65155.055956] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>  > [65155.056985] R13: ffff8801649f3b58 R14: ffffffff9d3e7d0e R15: 00000000000003e0
>  > [65155.058037] FS:  00007f0dc957c700(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
>  > [65155.059083] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  > [65155.060121] CR2: 00007f0dc958e000 CR3: 000000022f31e000 CR4: 00000000001407e0
>  > [65155.061152] DR0: 00007f54162bc000 DR1: 00007feb92c3d000 DR2: 0000000000000000
>  > [65155.062180] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
>  > [65155.063202] Stack:
>  > 
>  > And that's all she wrote.
>  > 
>  >  > If Will's patch doesn't make a difference, what about reverting that
>  >  > ce9ec37bddb6? Although it really *is* a "obvious bugfix", and I really
>  >  > don't see why any of this would be noticeable on x86 (it triggered
>  >  > issues on ARM64, but that was because ARM64 cared much more about the
>  >  > exact range).
>  > 
>  > I'll try that next, and check in on it tomorrow.
> 
> No luck. Died even faster this time.
> 
> [  772.459481] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [modprobe:31400]
> [  772.459858] irq event stamp: 3362
> [  772.459872] hardirqs last  enabled at (3361): [<ffffffff941a437c>] context_tracking_user_enter+0x9c/0x2c0
> [  772.459907] hardirqs last disabled at (3362): [<ffffffff94875bea>] apic_timer_interrupt+0x6a/0x80
> [  772.459937] softirqs last  enabled at (0): [<ffffffff940764d5>] copy_process.part.26+0x635/0x1d80
> [  772.459968] softirqs last disabled at (0): [<          (null)>]           (null)
> [  772.459996] CPU: 3 PID: 31400 Comm: modprobe Not tainted 3.18.0-rc4+ #85 [loadavg: 207.70 163.33 92.64 11/433 31547]
> [  772.460086] task: ffff88022f0b2f00 ti: ffff88019a944000 task.ti: ffff88019a944000
> [  772.460110] RIP: 0010:[<ffffffff941a437e>]  [<ffffffff941a437e>] context_tracking_user_enter+0x9e/0x2c0
> [  772.460142] RSP: 0018:ffff88019a947f00  EFLAGS: 00000282
> [  772.460161] RAX: ffff88022f0b2f00 RBX: 0000000000000000 RCX: 0000000000000000
> [  772.460184] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88022f0b2f00
> [  772.460207] RBP: ffff88019a947f10 R08: 0000000000000000 R09: 0000000000000000
> [  772.460229] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88019a947e90
> [  772.460252] R13: ffffffff940f6d04 R14: ffff88019a947ec0 R15: ffff8802447cd640
> [  772.460294] FS:  00007f3b71ee4700(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
> [  772.460362] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  772.460391] CR2: 00007fffdad5af58 CR3: 000000011608e000 CR4: 00000000001407e0
> [  772.460424] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  772.460447] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  772.460470] Stack:
> [  772.460480]  ffff88019a947f58 00000000006233a8 ffff88019a947f40 ffffffff9401429d
> [  772.460512]  00000000006233a8 000000000041d68a 00000000006233a8 0000000000000000
> [  772.460543]  00000000006233a0 ffffffff94874fa4 000000001008feff 000507d93d73a434
> [  772.460574] Call Trace:
> [  772.461576]  [<ffffffff9401429d>] syscall_trace_leave+0xad/0x2e0
> [  772.462572]  [<ffffffff94874fa4>] int_check_syscall_exit_work+0x34/0x3d
> [  772.463575] Code: f8 1c 00 84 c0 75 46 48 c7 c7 51 53 cd 94 e8 aa 23 24 00 65 c7 04 25 f4 f8 1c 00 01 00 00 00 f6 c7 02 74 19 e8 84 43 f3 ff 53 9d <5b> 41 5c 5d c3 0f 1f 44 00 00 c3 0f 1f 80 00 00 00 00 53 9d e8 
> [  772.465797] Kernel panic - not syncing: softlockup: hung tasks
> [  772.466821] CPU: 3 PID: 31400 Comm: modprobe Tainted: G             L 3.18.0-rc4+ #85 [loadavg: 207.70 163.33 92.64 11/433 31547]
> [  772.468915]  ffff88022f0b2f00 00000000de65d5f5 ffff880244603dc8 ffffffff94869e01
> [  772.470031]  0000000000000000 ffffffff94c7599b ffff880244603e48 ffffffff94866b21
> [  772.471085]  ffff880200000008 ffff880244603e58 ffff880244603df8 00000000de65d5f5
> [  772.472141] Call Trace:
> [  772.473183]  <IRQ>  [<ffffffff94869e01>] dump_stack+0x4f/0x7c
> [  772.474253]  [<ffffffff94866b21>] panic+0xcf/0x202
> [  772.475346]  [<ffffffff94154d1e>] watchdog_timer_fn+0x27e/0x290
> [  772.476414]  [<ffffffff94106297>] __run_hrtimer+0xe7/0x740
> [  772.477475]  [<ffffffff94106b64>] ? hrtimer_interrupt+0x94/0x270
> [  772.478555]  [<ffffffff94154aa0>] ? watchdog+0x40/0x40
> [  772.479627]  [<ffffffff94106be7>] hrtimer_interrupt+0x117/0x270
> [  772.480703]  [<ffffffff940303db>] local_apic_timer_interrupt+0x3b/0x70
> [  772.481777]  [<ffffffff948777f3>] smp_apic_timer_interrupt+0x43/0x60
> [  772.482856]  [<ffffffff94875bef>] apic_timer_interrupt+0x6f/0x80
> [  772.483915]  <EOI>  [<ffffffff941a437e>] ? context_tracking_user_enter+0x9e/0x2c0
> [  772.484972]  [<ffffffff9401429d>] syscall_trace_leave+0xad/0x2e0

It looks like we are looping somewhere around syscall_trace_leave(). Maybe the
_TIF_WORK_SYSCALL_EXIT flags aren't cleared properly after some of them get
processed. Or something keeps setting one of those flags after they get
cleared, and we loop endlessly, jumping back to int_check_syscall_exit_work().
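
Conceptually -- this is a sketch, not the actual entry_64.S assembly --
the exit path behaves like the loop below, so a work flag that is never
cleared, or keeps being re-set, spins here forever:

    /* Sketch of the syscall-exit work loop: */
    while (current_thread_info()->flags & _TIF_WORK_SYSCALL_EXIT)
            syscall_trace_leave(regs);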

Andi did some work there lately. Cc'ing him.

> [  772.486042]  [<ffffffff94874fa4>] int_check_syscall_exit_work+0x34/0x3d
> [  772.487187] Kernel Offset: 0x13000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 
> 
> 	Dave
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-19 16:28                                   ` Vivek Goyal
@ 2014-11-20 16:10                                     ` Dave Jones
  2014-11-20 16:48                                       ` Vivek Goyal
  2014-11-20 16:54                                       ` Vivek Goyal
  0 siblings, 2 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-20 16:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Don Zickus, Thomas Gleixner, Linus Torvalds, Linux Kernel,
	the arch/x86 maintainers, WANG Chao, Baoquan He, Dave Young

On Wed, Nov 19, 2014 at 11:28:06AM -0500, Vivek Goyal wrote:
 
 > I am wondering whether, in some cases, we panic in the second kernel
 > and sit there. Probably we should automatically append a kernel
 > command line option, say "panic=1", so that it reboots itself if the
 > second kernel panics.
 > 
 > By any chance, have you enabled "CONFIG_RANDOMIZE_BASE"? If yes, please
 > disable that, as currently the kexec/kdump stuff does not work with it.
 > It hangs very early in the boot process and I had to hook up a serial
 > console to get the following message.

I did have that enabled. (Perhaps the kconfig should conflict?)

After rebuilding without it, this..

 > > dracut: *** Stripping files done ***
 > > dracut: *** Store current command line parameters ***
 > > dracut: *** Creating image file ***
 > > dracut: *** Creating image file done ***
 > > kdumpctl: cat: write error: Broken pipe
 > > kdumpctl: kexec: failed to load kdump kernel
 > > kdumpctl: Starting kdump: [FAILED]
 
went away. It generated the image, and things looked good.
I did echo c > /proc/sysrq-trigger and got this..

SysRq : Trigger a crash
BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1192
in_atomic(): 0, irqs_disabled(): 0, pid: 8860, name: bash
3 locks held by bash/8860:
 #0:  (sb_writers#5){......}, at: [<ffffffff811eac13>] vfs_write+0x1b3/0x1f0
 #1:  (rcu_read_lock){......}, at: [<ffffffff8144a435>] __handle_sysrq+0x5/0x1b0
 #2:  (&mm->mmap_sem){......}, at: [<ffffffff8103cb20>] __do_page_fault+0x140/0x600
Preemption disabled at:[<ffffffff817ca332>] printk+0x5c/0x72

CPU: 1 PID: 8860 Comm: bash Not tainted 3.18.0-rc5+ #95 [loadavg: 0.54 0.24 0.09 2/143 8909]
 00000000000004a8 00000000e1f75c1b ffff880236473c28 ffffffff817ce5c7
 0000000000000000 0000000000000000 ffff880236473c58 ffffffff8109af8a
 ffff880236473c58 0000000000000029 0000000000000000 ffff880236473d88
Call Trace:
 [<ffffffff817ce5c7>] dump_stack+0x4f/0x7c
 [<ffffffff8109af8a>] __might_sleep+0x12a/0x190
 [<ffffffff8103cb3b>] __do_page_fault+0x15b/0x600
 [<ffffffff811613b2>] ? irq_work_queue+0x62/0xd0
 [<ffffffff8137ad7d>] ? trace_hardirqs_off_thunk+0x3a/0x3f
 [<ffffffff8103cfec>] do_page_fault+0xc/0x10
 [<ffffffff817dbcf2>] page_fault+0x22/0x30
 [<ffffffff817ca332>] ? printk+0x5c/0x72
 [<ffffffff81449ce6>] ? sysrq_handle_crash+0x16/0x20
 [<ffffffff8144a567>] __handle_sysrq+0x137/0x1b0
 [<ffffffff8144a435>] ? __handle_sysrq+0x5/0x1b0
 [<ffffffff8144aa4a>] write_sysrq_trigger+0x4a/0x50
 [<ffffffff81259f2d>] proc_reg_write+0x3d/0x80
 [<ffffffff811eab1a>] vfs_write+0xba/0x1f0
 [<ffffffff811eb628>] SyS_write+0x58/0xd0
 [<ffffffff817da052>] system_call_fastpath+0x12/0x17
Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 1 PID: 8860 Comm: bash Not tainted 3.18.0-rc5+ #95 [loadavg: 0.54 0.24 0.09 1/143 8909]
task: ffff8800a1a60000 ti: ffff880236470000 task.ti: ffff880236470000
RIP: 0010:[<ffffffff81449ce6>]  [<ffffffff81449ce6>] sysrq_handle_crash+0x16/0x20
RSP: 0018:ffff880236473e38  EFLAGS: 00010246
RAX: 000000000000000f RBX: ffffffff81cb4a00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff817ca332 RDI: 0000000000000063
RBP: ffff880236473e38 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000358 R11: 0000000000000357 R12: 0000000000000063
R13: 0000000000000000 R14: 0000000000000007 R15: 0000000000000000
FS:  00007fc652f4e740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000023a3b2000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 ffff880236473e78 ffffffff8144a567 ffffffff8144a435 0000000000000002
 0000000000000002 00007fc652f51000 0000000000000002 ffff880236473f48
 ffff880236473ea8 ffffffff8144aa4a 0000000000000002 00007fc652f51000
Call Trace:
 [<ffffffff8144a567>] __handle_sysrq+0x137/0x1b0
 [<ffffffff8144a435>] ? __handle_sysrq+0x5/0x1b0
 [<ffffffff8144aa4a>] write_sysrq_trigger+0x4a/0x50
 [<ffffffff81259f2d>] proc_reg_write+0x3d/0x80
 [<ffffffff811eab1a>] vfs_write+0xba/0x1f0
 [<ffffffff811eb628>] SyS_write+0x58/0xd0
 [<ffffffff817da052>] system_call_fastpath+0x12/0x17
Code: 01 f4 45 39 a5 b4 00 00 00 75 e2 4c 89 ef e8 d2 f7 ff ff eb d8 0f 1f 44 00 00 55 c7 05 08 b7 7e 00 01 00 00 00 48 89 e5 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 0f 1f 44 00 00 55 31 c0 48 89 e5 
RIP  [<ffffffff81449ce6>] sysrq_handle_crash+0x16/0x20
 RSP <ffff880236473e38>
CR2: 0000000000000000

Which, aside from the sleeping-while-atomic thing (which isn't important),
does what I expected.  Shortly afterwards, it rebooted.

And then /var/crash was empty.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 15:08           ` Frederic Weisbecker
@ 2014-11-20 16:19             ` Dave Jones
  2014-11-20 16:42               ` Frederic Weisbecker
  0 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-20 16:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Thu, Nov 20, 2014 at 04:08:00PM +0100, Frederic Weisbecker wrote:
 
 > > Great start to the week: I decided to confirm my recollection that .17
 > > was ok, only to hit this within 10 minutes.
 > > 
 > > Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
 > > CPU: 3 PID: 17176 Comm: trinity-c95 Not tainted 3.17.0+ #87
 > >  0000000000000000 00000000f3a61725 ffff880244606bf0 ffffffff9583e9fa
 > >  ffffffff95c67918 ffff880244606c78 ffffffff9583bcc0 0000000000000010
 > >  ffff880244606c88 ffff880244606c20 00000000f3a61725 0000000000000000
 > > Call Trace:
 > >  <NMI>  [<ffffffff9583e9fa>] dump_stack+0x4e/0x7a
 > >  [<ffffffff9583bcc0>] panic+0xd4/0x207
 > >  [<ffffffff95150908>] watchdog_overflow_callback+0x118/0x120
 > >  [<ffffffff95193dbe>] __perf_event_overflow+0xae/0x340
 > >  [<ffffffff95192230>] ? perf_event_task_disable+0xa0/0xa0
 > >  [<ffffffff9501a7bf>] ? x86_perf_event_set_period+0xbf/0x150
 > >  [<ffffffff95194be4>] perf_event_overflow+0x14/0x20
 > >  [<ffffffff95020676>] intel_pmu_handle_irq+0x206/0x410
 > >  [<ffffffff9501966b>] perf_event_nmi_handler+0x2b/0x50
 > >  [<ffffffff95007bb2>] nmi_handle+0xd2/0x390
 > >  [<ffffffff95007ae5>] ? nmi_handle+0x5/0x390
 > >  [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
 > >  [<ffffffff950080a2>] default_do_nmi+0x72/0x1c0
 > >  [<ffffffff950082a8>] do_nmi+0xb8/0x100
 > >  [<ffffffff9584b9aa>] end_repeat_nmi+0x1e/0x2e
 > >  [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
 > >  [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
 > >  [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
 > >  <<EOE>>  <IRQ>  [<ffffffff95101685>] lock_hrtimer_base.isra.18+0x25/0x50
 > >  [<ffffffff951019d3>] hrtimer_try_to_cancel+0x33/0x1f0
 > 
 > Ah that one got fixed in the merge window and in -stable, right?
 
If that's true, that changes everything, and this might be more
bisectable.  I did the test above on 3.17, but perhaps I should
try a run on 3.17.3

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 16:19             ` Dave Jones
@ 2014-11-20 16:42               ` Frederic Weisbecker
  0 siblings, 0 replies; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-20 16:42 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Thu, Nov 20, 2014 at 11:19:25AM -0500, Dave Jones wrote:
> On Thu, Nov 20, 2014 at 04:08:00PM +0100, Frederic Weisbecker wrote:
>  
>  > > Great start to the week: I decided to confirm my recollection that .17
>  > > was ok, only to hit this within 10 minutes.
>  > > 
>  > > Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
>  > > CPU: 3 PID: 17176 Comm: trinity-c95 Not tainted 3.17.0+ #87
>  > >  0000000000000000 00000000f3a61725 ffff880244606bf0 ffffffff9583e9fa
>  > >  ffffffff95c67918 ffff880244606c78 ffffffff9583bcc0 0000000000000010
>  > >  ffff880244606c88 ffff880244606c20 00000000f3a61725 0000000000000000
>  > > Call Trace:
>  > >  <NMI>  [<ffffffff9583e9fa>] dump_stack+0x4e/0x7a
>  > >  [<ffffffff9583bcc0>] panic+0xd4/0x207
>  > >  [<ffffffff95150908>] watchdog_overflow_callback+0x118/0x120
>  > >  [<ffffffff95193dbe>] __perf_event_overflow+0xae/0x340
>  > >  [<ffffffff95192230>] ? perf_event_task_disable+0xa0/0xa0
>  > >  [<ffffffff9501a7bf>] ? x86_perf_event_set_period+0xbf/0x150
>  > >  [<ffffffff95194be4>] perf_event_overflow+0x14/0x20
>  > >  [<ffffffff95020676>] intel_pmu_handle_irq+0x206/0x410
>  > >  [<ffffffff9501966b>] perf_event_nmi_handler+0x2b/0x50
>  > >  [<ffffffff95007bb2>] nmi_handle+0xd2/0x390
>  > >  [<ffffffff95007ae5>] ? nmi_handle+0x5/0x390
>  > >  [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
>  > >  [<ffffffff950080a2>] default_do_nmi+0x72/0x1c0
>  > >  [<ffffffff950082a8>] do_nmi+0xb8/0x100
>  > >  [<ffffffff9584b9aa>] end_repeat_nmi+0x1e/0x2e
>  > >  [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
>  > >  [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
>  > >  [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
>  > >  <<EOE>>  <IRQ>  [<ffffffff95101685>] lock_hrtimer_base.isra.18+0x25/0x50
>  > >  [<ffffffff951019d3>] hrtimer_try_to_cancel+0x33/0x1f0
>  > 
>  > Ah that one got fixed in the merge window and in -stable, right?
>  
> If that's true, that changes everything, and this might be more
> bisectable.  I did the test above on 3.17, but perhaps I should
> try a run on 3.17.3

It might not be easier to bisect because stable is a seperate branch than the next -rc1.
And that above got fixed in -rc1, perhaps in the same merge window where the new different
issues were introduced. So you'll probably need to shutdown the above issue in order to
bisect the others.

What you can do is bisect and then, before every build, apply the patches that
fix the above issue in -stable, the ones I just enumerated to gregkh in our
discussion with him. There are only four. Just try to apply all of them before each
build, unless they are already applied.
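
Something like this (the patch directory is a placeholder, since I'm not
relisting the four fixes here):

  $ git bisect start v3.18-rc4 v3.17
  $ # at each bisect step, before building:
  $ bp=$(git rev-parse HEAD)                      # remember the bisect point
  $ for p in ~/stable-fixes/*.patch; do
  >         git apply --check "$p" && git am "$p" # skip already-applied ones
  > done
  $ make -j8          # then install, boot and run trinity
  $ git reset --hard "$bp"                        # drop the fixes again
  $ git bisect good                               # or "git bisect bad"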

I could give you a much simpler hack, but I fear it may apply chaotically depending
on whether the real fixes are applied, half-applied or not applied at all, with
unpredictable results. So let's rather stick to what we know works.

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 16:10                                     ` Dave Jones
@ 2014-11-20 16:48                                       ` Vivek Goyal
  2014-11-20 17:38                                         ` Dave Jones
  2014-11-20 16:54                                       ` Vivek Goyal
  1 sibling, 1 reply; 486+ messages in thread
From: Vivek Goyal @ 2014-11-20 16:48 UTC (permalink / raw)
  To: Dave Jones, Don Zickus, Thomas Gleixner, Linus Torvalds,
	Linux Kernel, the arch/x86 maintainers, WANG Chao, Baoquan He,
	Dave Young

On Thu, Nov 20, 2014 at 11:10:55AM -0500, Dave Jones wrote:
> On Wed, Nov 19, 2014 at 11:28:06AM -0500, Vivek Goyal wrote:
>  
>  > I am wondering whether maybe in some cases we panic in the second kernel and sit
>  > there. Probably we should automatically append a kernel command line parameter,
>  > say "panic=1", so that it reboots itself if the second kernel panics.
>  > 
>  > By any chance, have you enabled "CONFIG_RANDOMIZE_BASE"? If yes, please
>  > disable that as currently kexec/kdump stuff does not work with it. And
>  > it hangs very early in the boot process and I had to hook up a serial console
>  > to get the following message on the console.
> 
> I did have that enabled. (Perhaps the kconfig should conflict?)

Hi Dave,

Actually kexec/kdump allows booting into a different kernel than the running
kernel. So one could have KEXEC and CONFIG_RANDOMIZE_BASE enabled in
the kernel at the same time but still boot into a second kernel with
CONFIG_RANDOMIZE_BASE=n, and that should work. CONFIG_RANDOMIZE_BASE is
only a problem if it is enabled in the second kernel. So a kconfig conflict
might not be a good fit here.

> 
> After rebuilding without it, this..
> 
>  > > dracut: *** Stripping files done ***
>  > > dracut: *** Store current command line parameters ***
>  > > dracut: *** Creating image file ***
>  > > dracut: *** Creating image file done ***
>  > > kdumpctl: cat: write error: Broken pipe
>  > > kdumpctl: kexec: failed to load kdump kernel
>  > > kdumpctl: Starting kdump: [FAILED]
>  
> went away. It generated the image, and things looked good.
> I did echo c > /proc/sysrq-trigger and got this..
> 
> SysRq : Trigger a crash
> BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1192
> in_atomic(): 0, irqs_disabled(): 0, pid: 8860, name: bash
> 3 locks held by bash/8860:
>  #0:  (sb_writers#5){......}, at: [<ffffffff811eac13>] vfs_write+0x1b3/0x1f0
>  #1:  (rcu_read_lock){......}, at: [<ffffffff8144a435>] __handle_sysrq+0x5/0x1b0
>  #2:  (&mm->mmap_sem){......}, at: [<ffffffff8103cb20>] __do_page_fault+0x140/0x600
> Preemption disabled at:[<ffffffff817ca332>] printk+0x5c/0x72
> 
> CPU: 1 PID: 8860 Comm: bash Not tainted 3.18.0-rc5+ #95 [loadavg: 0.54 0.24 0.09 2/143 8909]
>  00000000000004a8 00000000e1f75c1b ffff880236473c28 ffffffff817ce5c7
>  0000000000000000 0000000000000000 ffff880236473c58 ffffffff8109af8a
>  ffff880236473c58 0000000000000029 0000000000000000 ffff880236473d88
> Call Trace:
>  [<ffffffff817ce5c7>] dump_stack+0x4f/0x7c
>  [<ffffffff8109af8a>] __might_sleep+0x12a/0x190
>  [<ffffffff8103cb3b>] __do_page_fault+0x15b/0x600
>  [<ffffffff811613b2>] ? irq_work_queue+0x62/0xd0
>  [<ffffffff8137ad7d>] ? trace_hardirqs_off_thunk+0x3a/0x3f
>  [<ffffffff8103cfec>] do_page_fault+0xc/0x10
>  [<ffffffff817dbcf2>] page_fault+0x22/0x30
>  [<ffffffff817ca332>] ? printk+0x5c/0x72
>  [<ffffffff81449ce6>] ? sysrq_handle_crash+0x16/0x20
>  [<ffffffff8144a567>] __handle_sysrq+0x137/0x1b0
>  [<ffffffff8144a435>] ? __handle_sysrq+0x5/0x1b0
>  [<ffffffff8144aa4a>] write_sysrq_trigger+0x4a/0x50
>  [<ffffffff81259f2d>] proc_reg_write+0x3d/0x80
>  [<ffffffff811eab1a>] vfs_write+0xba/0x1f0
>  [<ffffffff811eb628>] SyS_write+0x58/0xd0
>  [<ffffffff817da052>] system_call_fastpath+0x12/0x17
> Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU: 1 PID: 8860 Comm: bash Not tainted 3.18.0-rc5+ #95 [loadavg: 0.54 0.24 0.09 1/143 8909]
> task: ffff8800a1a60000 ti: ffff880236470000 task.ti: ffff880236470000
> RIP: 0010:[<ffffffff81449ce6>]  [<ffffffff81449ce6>] sysrq_handle_crash+0x16/0x20
> RSP: 0018:ffff880236473e38  EFLAGS: 00010246
> RAX: 000000000000000f RBX: ffffffff81cb4a00 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffffffff817ca332 RDI: 0000000000000063
> RBP: ffff880236473e38 R08: 0000000000000001 R09: 0000000000000001
> R10: 0000000000000358 R11: 0000000000000357 R12: 0000000000000063
> R13: 0000000000000000 R14: 0000000000000007 R15: 0000000000000000
> FS:  00007fc652f4e740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 000000023a3b2000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Stack:
>  ffff880236473e78 ffffffff8144a567 ffffffff8144a435 0000000000000002
>  0000000000000002 00007fc652f51000 0000000000000002 ffff880236473f48
>  ffff880236473ea8 ffffffff8144aa4a 0000000000000002 00007fc652f51000
> Call Trace:
>  [<ffffffff8144a567>] __handle_sysrq+0x137/0x1b0
>  [<ffffffff8144a435>] ? __handle_sysrq+0x5/0x1b0
>  [<ffffffff8144aa4a>] write_sysrq_trigger+0x4a/0x50
>  [<ffffffff81259f2d>] proc_reg_write+0x3d/0x80
>  [<ffffffff811eab1a>] vfs_write+0xba/0x1f0
>  [<ffffffff811eb628>] SyS_write+0x58/0xd0
>  [<ffffffff817da052>] system_call_fastpath+0x12/0x17
> Code: 01 f4 45 39 a5 b4 00 00 00 75 e2 4c 89 ef e8 d2 f7 ff ff eb d8 0f 1f 44 00 00 55 c7 05 08 b7 7e 00 01 00 00 00 48 89 e5 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 0f 1f 44 00 00 55 31 c0 48 89 e5 
> RIP  [<ffffffff81449ce6>] sysrq_handle_crash+0x16/0x20
>  RSP <ffff880236473e38>
> CR2: 0000000000000000
> 
> Which, aside from the sleeping-while-atomic thing, which isn't important,
> does what I expected.  Shortly afterwards, it rebooted.
> 
> And then /var/crash was empty.

These messages came from the first kernel. I think we have failed very early
in the second kernel's boot.

Can we try the following and see if some additional messages show
up on the console and help us narrow down the problem.

- Enable verbose boot messages. CONFIG_X86_VERBOSE_BOOTUP=y

- Enable early printk in second kernel. (earlyprintk=ttyS0,115200).

  You can either enable early printk in the first kernel and reboot. That way
  the second kernel will automatically have it enabled. Or you can edit
  "/etc/sysconfig/kdump" and append earlyprintk=<> to KDUMP_COMMANDLINE_APPEND.
  You will need to restart the kdump service after this.

- Enable some debug output during runtime from kexec purgatory. For that one
  needs to pass additional arguments to /sbin/kexec. You can edit the
  /etc/sysconfig/kdump file and modify "KEXEC_ARGS" to pass additional
  arguments to /sbin/kexec during kernel load. I use the following for my
  serial console.

  KEXEC_ARGS="--console-serial --serial=0x3f8 --serial-baud=115200"

  You will need to restart the kdump service.
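
  Put together, the kdump-side edits would look something like this
  (a sketch; the serial port and baud rate are from my box and will
  differ on yours):

    # /etc/sysconfig/kdump
    KDUMP_COMMANDLINE_APPEND="earlyprintk=ttyS0,115200"
    KEXEC_ARGS="--console-serial --serial=0x3f8 --serial-baud=115200"

    # then reload the crash kernel:
    systemctl restart kdump.service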

I hope the above gives us some information to work with and helps us figure
out where we failed while booting into the second kernel.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 16:10                                     ` Dave Jones
  2014-11-20 16:48                                       ` Vivek Goyal
@ 2014-11-20 16:54                                       ` Vivek Goyal
  1 sibling, 0 replies; 486+ messages in thread
From: Vivek Goyal @ 2014-11-20 16:54 UTC (permalink / raw)
  To: Dave Jones, Don Zickus, Thomas Gleixner, Linus Torvalds,
	Linux Kernel, the arch/x86 maintainers, WANG Chao, Baoquan He,
	Dave Young

On Thu, Nov 20, 2014 at 11:10:55AM -0500, Dave Jones wrote:
> On Wed, Nov 19, 2014 at 11:28:06AM -0500, Vivek Goyal wrote:
>  
>  > I am wondering whether maybe in some cases we panic in the second kernel and sit
>  > there. Probably we should automatically append a kernel command line parameter,
>  > say "panic=1", so that it reboots itself if the second kernel panics.
>  > 
>  > By any chance, have you enabled "CONFIG_RANDOMIZE_BASE"? If yes, please
>  > disable that as currently kexec/kdump stuff does not work with it. And
>  > it hangs very early in the boot process and I had to hook up a serial console
>  > to get the following message on the console.
> 
> I did have that enabled. (Perhaps the kconfig should conflict?)

Hi Dave,

Can you please also send me your kernel config file? I will try it on
my machine and see if I can reproduce the problem there.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 16:48                                       ` Vivek Goyal
@ 2014-11-20 17:38                                         ` Dave Jones
  2014-11-21  9:46                                           ` Dave Young
  0 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-20 17:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Don Zickus, Thomas Gleixner, Linus Torvalds, Linux Kernel,
	the arch/x86 maintainers, WANG Chao, Baoquan He, Dave Young

On Thu, Nov 20, 2014 at 11:48:09AM -0500, Vivek Goyal wrote:
 
 > Can we try the following and see if some additional messages show
 > up on the console and help us narrow down the problem.
 > 
 > - Enable verbose boot messages. CONFIG_X86_VERBOSE_BOOTUP=y
 > 
 > - Enable early printk in second kernel. (earlyprintk=ttyS0,115200).
 > 
 >   You can either enable early printk in the first kernel and reboot. That way
 >   the second kernel will automatically have it enabled. Or you can edit
 >   "/etc/sysconfig/kdump" and append earlyprintk=<> to KDUMP_COMMANDLINE_APPEND.
 >   You will need to restart the kdump service after this.
 > 
 > - Enable some debug output during runtime from kexec purgatory. For that one
 >   needs to pass additional arguments to /sbin/kexec. You can edit the
 >   /etc/sysconfig/kdump file and modify "KEXEC_ARGS" to pass additional
 >   arguments to /sbin/kexec during kernel load. I use the following for my
 >   serial console.
 > 
 >   KEXEC_ARGS="--console-serial --serial=0x3f8 --serial-baud=115200"
 > 
 >   You will need to restart the kdump service.

The only serial port on this machine is USB serial, which doesn't have I/O ports.

From my reading of the kexec man page, it doesn't look like I can tell
it to use ttyUSB0.

And because it relies on USB being initialized, this probably isn't
going to help too much with early boot.

earlyprintk=tty0 didn't show anything extra after the sysrq-c oops.
Likewise with =ttyUSB0.

I'm going to try bisecting the problem I'm debugging again, so I'm not
going to dig into this much more today.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 15:25                                   ` frequent lockups in 3.18rc4 Dave Jones
@ 2014-11-20 19:43                                     ` Linus Torvalds
  2014-11-20 20:06                                       ` Dave Jones
                                                         ` (2 more replies)
  2014-11-25 12:22                                     ` Will Deacon
  1 sibling, 3 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-20 19:43 UTC (permalink / raw)
  To: Dave Jones, Andy Lutomirski, Linus Torvalds, Don Zickus,
	Thomas Gleixner, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra

On Thu, Nov 20, 2014 at 7:25 AM, Dave Jones <davej@redhat.com> wrote:
>
> Disabling CONTEXT_TRACKING didn't change the problem.
> Unfortunately the full trace didn't make it over usb-serial this time. Grr.
>
> Here's what came over serial..
>
> NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [trinity-c35:11634]
> RIP: 0010:[<ffffffff88379605>]  [<ffffffff88379605>] copy_user_enhanced_fast_string+0x5/0x10
> RAX: ffff880220eb4000 RBX: ffffffff887dac64 RCX: 0000000000006a18
> RDX: 000000000000e02f RSI: 00007f766f466620 RDI: ffff88016f6a7617
> RBP: ffff880220eb7f78 R08: 8000000000000063 R09: 0000000000000004
> Call Trace:
>  [<ffffffff882f4225>] ? SyS_add_key+0xd5/0x240
>  [<ffffffff8837adae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>  [<ffffffff887da092>] system_call_fastpath+0x12/0x17

Ok, that's just about half-way in a ~57kB memory copy (you can see it
in the register state: %rdx contains the original size of the key
payload, rcx contains the current remaining size: 57kB total, 27kB
left).
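
(In decimal: %rdx = 0xe02f = 57391 bytes, %rcx = 0x6a18 = 27160 bytes,
which is where the ~57kB and ~27kB figures come from.)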

And it's holding absolutely zero locks, and not even doing anything
odd. It wasn't doing anything particularly odd before either, although
the kmalloc() of a 64kB area might just have caused a fair amount of
VM work, of course.

You know what? I'm seriously starting to think that these bugs aren't
actually real. Or rather, I don't think it's really a true softlockup,
because most of them seem to happen in totally harmless code.

So I'm wondering whether the real issue might not be just this:

   [loadavg: 164.79 157.30 155.90 37/409 11893]

together with possibly a scheduler issue and/or a bug in the smpboot
thread logic (that the watchdog uses) or similar.

That's *especially* true if it turns out that the 3.17 problem you saw
was actually a perf bug that has already been fixed and is in stable.
We've been looking at kernel/smp.c changes, and looking for x86 IPI or
APIC changes, and found some harmlessly (at least on x86) suspicious
code and this exercise might be worth it for that reason, but what if
it's really just a scheduler regression.

There's been a *lot* more scheduler changes since 3.17 than the small
things we've looked at for x86 entry or IPI handling. And the
scheduler changes have been about things like overloaded scheduling
groups etc, and I could easily imagine that some bug *there* ends up
causing the watchdog process not to schedule.

Hmm? Scheduler people?

                       Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 19:43                                     ` Linus Torvalds
@ 2014-11-20 20:06                                       ` Dave Jones
  2014-11-20 20:37                                       ` Don Zickus
  2014-11-21  6:37                                       ` Ingo Molnar
  2 siblings, 0 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-20 20:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Don Zickus, Thomas Gleixner, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra

On Thu, Nov 20, 2014 at 11:43:07AM -0800, Linus Torvalds wrote:
 
 > You know what? I'm seriously starting to think that these bugs aren't
 > actually real. Or rather, I don't think it's really a true softlockup,
 > because most of them seem to happen in totally harmless code.
 > 
 > So I'm wondering whether the real issue might not be just this:
 > 
 >    [loadavg: 164.79 157.30 155.90 37/409 11893]
 > 
 > together with possibly a scheduler issue and/or a bug in the smpboot
 > thread logic (that the watchdog uses) or similar.
 > 
 > That's *especially* true if it turns out that the 3.17 problem you saw
 > was actually a perf bug that has already been fixed and is in stable.
 > We've been looking at kernel/smp.c changes, and looking for x86 IPI or
 > APIC changes, and found some harmlessly (at least on x86) suspicious
 > code and this exercise might be worth it for that reason, but what if
 > it's really just a scheduler regression.

I started a run against 3.17 with the perf fixes. If that survives
today, I'll start a bisection tomorrow.

 > There's been a *lot* more scheduler changes since 3.17 than the small
 > things we've looked at for x86 entry or IPI handling. And the
 > scheduler changes have been about things like overloaded scheduling
 > groups etc, and I could easily imagine that some bug *there* ends up
 > causing the watchdog process not to schedule.

One other data point: I put another box into service for testing,
but it's considerably slower (a ~6 year old Xeon vs the Haswell).
Maybe it's just because it's so much slower that it'll take longer
(or it's slow enough that the bug is masked), but that machine hasn't had
a problem yet in almost a day of runtime.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: [PATCH] x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1
  2014-11-19 22:13                                     ` Thomas Gleixner
@ 2014-11-20 20:33                                       ` Linus Torvalds
  2014-11-20 22:07                                         ` Thomas Gleixner
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-20 20:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Dave Jones, Don Zickus, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra

On Wed, Nov 19, 2014 at 2:13 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Right, while it is wrong it does not explain the wreckage on 3.17,
> which does not have that code.

Thomas, I'm currently going off the assumption that I'll see this from
the x86 trees, and I can ignore the patch. It doesn't seem like this
is a particularly pressing bug.

If it's *not* going to show up as a pull request, holler, and I'll
just apply it.

                    Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 19:43                                     ` Linus Torvalds
  2014-11-20 20:06                                       ` Dave Jones
@ 2014-11-20 20:37                                       ` Don Zickus
  2014-11-20 20:51                                         ` Linus Torvalds
  2014-11-21  6:37                                       ` Ingo Molnar
  2 siblings, 1 reply; 486+ messages in thread
From: Don Zickus @ 2014-11-20 20:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Andy Lutomirski, Thomas Gleixner, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra

On Thu, Nov 20, 2014 at 11:43:07AM -0800, Linus Torvalds wrote:
> On Thu, Nov 20, 2014 at 7:25 AM, Dave Jones <davej@redhat.com> wrote:
> >
> > Disabling CONTEXT_TRACKING didn't change the problem.
> > Unfortunately the full trace didn't make it over usb-serial this time. Grr.
> >
> > Here's what came over serial..
> >
> > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [trinity-c35:11634]
> > RIP: 0010:[<ffffffff88379605>]  [<ffffffff88379605>] copy_user_enhanced_fast_string+0x5/0x10
> > RAX: ffff880220eb4000 RBX: ffffffff887dac64 RCX: 0000000000006a18
> > RDX: 000000000000e02f RSI: 00007f766f466620 RDI: ffff88016f6a7617
> > RBP: ffff880220eb7f78 R08: 8000000000000063 R09: 0000000000000004
> > Call Trace:
> >  [<ffffffff882f4225>] ? SyS_add_key+0xd5/0x240
> >  [<ffffffff8837adae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> >  [<ffffffff887da092>] system_call_fastpath+0x12/0x17
> 
> Ok, that's just about half-way in a ~57kB memory copy (you can see it
> in the register state: %rdx contains the original size of the key
> payload, rcx contains the current remaining size: 57kB total, 27kB
> left).
> 
> And it's holding absolutely zero locks, and not even doing anything
> odd. It wasn't doing anything particularly odd before either, although
> the kmalloc() of a 64kB area might just have caused a fair amount of
> VM work, of course.

Just for clarification, softlockups are processes hogging the cpu (thus
blocking the high priority per-cpu watchdog thread).

Hardlockups on the other hand are cpus with interrupts disabled for too
long (thus blocking the timer interrupt).
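
Roughly, as a sketch of the idea rather than the actual kernel/watchdog.c
code (names and thresholds here are approximated):

  static DEFINE_PER_CPU(unsigned long, touch_ts);      /* thread heartbeat */
  static DEFINE_PER_CPU(unsigned long, hrtimer_irqs);  /* timer heartbeat  */
  static DEFINE_PER_CPU(unsigned long, last_irqs);
  static unsigned long soft_thresh;                    /* e.g. 22 * HZ     */

  /* high-priority per-cpu thread: only runs if nothing hogs the cpu */
  static void watchdog_thread(unsigned int cpu)
  {
          __this_cpu_write(touch_ts, jiffies);
  }

  /* hrtimer callback: only fires if interrupts still get served */
  static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *h)
  {
          __this_cpu_inc(hrtimer_irqs);
          if (time_after(jiffies, __this_cpu_read(touch_ts) + soft_thresh))
                  pr_emerg("BUG: soft lockup ...\n");    /* thread starved */
          return HRTIMER_RESTART;
  }

  /* perf NMI callback: fires even with interrupts disabled */
  static void watchdog_nmi_cb(void)
  {
          if (__this_cpu_read(hrtimer_irqs) == __this_cpu_read(last_irqs))
                  panic("Watchdog detected hard LOCKUP");  /* irqs starved */
          __this_cpu_write(last_irqs, __this_cpu_read(hrtimer_irqs));
  }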

That might coincide with your scheduler theory below.  Don't know.

Cheers,
Don

> 
> You know what? I'm seriously starting to think that these bugs aren't
> actually real. Or rather, I don't think it's really a true softlockup,
> because most of them seem to happen in totally harmless code.
> 
> So I'm wondering whether the real issue might not be just this:
> 
>    [loadavg: 164.79 157.30 155.90 37/409 11893]
> 
> together with possibly a scheduler issue and/or a bug in the smpboot
> thread logic (that the watchdog uses) or similar.
> 
> That's *especially* true if it turns out that the 3.17 problem you saw
> was actually a perf bug that has already been fixed and is in stable.
> We've been looking at kernel/smp.c changes, and looking for x86 IPI or
> APIC changes, and found some harmlessly (at least on x86) suspicious
> code and this exercise might be worth it for that reason, but what if
> it's really just a scheduler regression.
> 
> There's been a *lot* more scheduler changes since 3.17 than the small
> things we've looked at for x86 entry or IPI handling. And the
> scheduler changes have been about things like overloaded scheduling
> groups etc, and I could easily imagine that some bug *there* ends up
> causing the watchdog process not to schedule.
> 
> Hmm? Scheduler people?
> 
>                        Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 20:37                                       ` Don Zickus
@ 2014-11-20 20:51                                         ` Linus Torvalds
  0 siblings, 0 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-20 20:51 UTC (permalink / raw)
  To: Don Zickus
  Cc: Dave Jones, Andy Lutomirski, Thomas Gleixner, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra

On Thu, Nov 20, 2014 at 12:37 PM, Don Zickus <dzickus@redhat.com> wrote:
>
> Just for clarification, softlockups are processes hogging the cpu (thus
> blocking the high priority per-cpu watchdog thread).

Right. And there is no actual sign of any CPU hogging going on.
There's a single system call with a small payload (I think it's safe
to call 64kB small these days), no hugely contended CPU-spinning
locking, nada.

                    Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 12:23                                               ` Tejun Heo
@ 2014-11-20 21:58                                                 ` Thomas Gleixner
  2014-11-20 22:06                                                   ` Andy Lutomirski
  2014-11-20 22:11                                                   ` Tejun Heo
  0 siblings, 2 replies; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-20 21:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Andy Lutomirski, Arnaldo Carvalho de Melo

On Thu, 20 Nov 2014, Tejun Heo wrote:
> On Thu, Nov 20, 2014 at 12:50:36AM +0100, Frederic Weisbecker wrote:
> > > Are we talking about different per cpu allocators here or am I missing
> > > something completely non obvious?
> > 
> > That's the same allocator yeah. So if the whole memory is dereferenced,
> > faults shouldn't happen indeed.
> > 
> > Maybe that was a bug a few years ago but not anymore.
> 
> It has always been like that tho.  Percpu memory given out is always
> populated and cleared.
> 
> > Is it possible that, somehow, some part isn't zeroed by pcpu_alloc()?
> > After all it's allocated with vzalloc() so that part could be skipped. The memset(0)
> 
> The vzalloc call is for the internal allocation bitmap not the actual
> percpu memory area.  The actual address areas for percpu memory are
> obtained using pcpu_get_vm_areas() call and later get populated using
> map_kernel_range_noflush() (flush is performed after mapping is
> complete).
> 
> Trying to remember what happens with vmalloc_fault().  Ah okay, so
> when a new PUD gets created for vmalloc area, we don't go through all
> PGDs and update them.  The PGD entries get faulted in lazily.  Percpu
> memory allocator clearing or not clearing the allocated area doesn't
> have anything to do with it.  The memory area is always fully
> populated in the kernel page table.  It's just that the population
> happened while a different PGD was active and this PGD hasn't been
> populated with the new PUD yet.
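
(For reference, the lazy population described above is roughly this,
simplified from the x86 vmalloc_fault() path:

   pgd_t *pgd     = pgd_offset(current->active_mm, address);
   pgd_t *pgd_ref = pgd_offset_k(address);    /* init_mm: always complete */

   if (pgd_none(*pgd_ref))
           return -1;                         /* not a vmalloc address */
   if (pgd_none(*pgd))
           set_pgd(pgd, *pgd_ref);            /* copy the missing entry in */

i.e. the page tables themselves are complete; only the per-process PGD
entry pointing to them may be missing.)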

It's completely undocumented behaviour, whether it has been that way
forever or not. And I agree with Frederic that it is insane. Actually
it's beyond insane, really.

> So, yeap, vmalloc_fault() can always happen when accessing vmalloc
> areas and the only way to avoid that would be removing lazy PGD
> population - going through all PGDs and populating new PUDs
> immediately.

There is no requirement to go through ALL PGDs and populate that stuff
immediately.

Let's look at the two types of allocations

   1) Kernel percpu allocations

   2) Per process/task percpu allocations

Of course we do not have a way to distinguish those, but we really
should have one.

#1 Kernel percpu allocations usually happen in the context of driver
   bringup, subsystem initialization, interrupt setup etc.

   So this is functionality which is not a hotpath and usually
   requires some form of synchronization versus the rest of the system
   anyway.

   The per cpu population stuff is serialized with a mutex anyway, so
   what's wrong with having a globally visible percpu sequence counter,
   which is incremented whenever a new allocation is populated or torn
   down?

   We can make that sequence counter a per cpu variable as well to
   avoid the issues of a global variable (preferably that's a
   compile/boot time allocated percpu variable to avoid the obvious
   circulus vitiosus)

   Now after that increment the allocation side needs to wait for a
   scheduling cycle on all cpus (we have mechanisms for that)
   
   So in the scheduler if the same task gets reselected you check that
   sequence count and update the PGD if different. If a task switch
   happens then you also need to check the sequence count and act
   accordingly.

   If we make the sequence counter a percpu variable as outlined above
   the overhead of checking this is just noise versus the other
   nonsense we do in schedule().


#2 That's process related statistics and instrumentation stuff.

   Now that just needs an immediate population on the process->mm->pgd
   aside of the init_mm.pgd, but that's really not a big deal.
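
A minimal sketch of how #1 could look; all names here are invented, this
is not existing kernel API:

   static DEFINE_PER_CPU(unsigned long, pcpu_map_seq);
   int cpu;

   /* allocation side, under the percpu mutex, after (de)population: */
   for_each_possible_cpu(cpu)
           per_cpu(pcpu_map_seq, cpu)++;
   synchronize_sched();          /* a scheduling cycle on all cpus */

   /* scheduler side, for both the reselect and the switch case: */
   if (unlikely(mm->context.pcpu_seq != __this_cpu_read(pcpu_map_seq))) {
           sync_kernel_pgd(mm->pgd);   /* pull new entries from init_mm */
           mm->context.pcpu_seq = __this_cpu_read(pcpu_map_seq);
   }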

Of course that does not solve the issues we have with the current
infrastructure retroactively, but it allows us to avoid fuckups like
the one Frederic was talking about, where perf invented its own kmalloc
based 'percpu' replacement just to work around the shortcoming in a
particular place.

What really frightens me is the well hidden fuckup potential which
lurks around the corner and the hard-to-debug once-in-a-while fallout
which might be caused by this.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 486+ messages in thread

* [tip:x86/urgent] x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1
  2014-11-19 21:56                                   ` [PATCH] x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1 Andy Lutomirski
  2014-11-19 22:13                                     ` Thomas Gleixner
@ 2014-11-20 22:04                                     ` tip-bot for Andy Lutomirski
  1 sibling, 0 replies; 486+ messages in thread
From: tip-bot for Andy Lutomirski @ 2014-11-20 22:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, dzickus, mingo, luto, davej, torvalds, hpa, linux-kernel, tglx

Commit-ID:  b5e212a3051b65e426a513901d9c7001681c7215
Gitweb:     http://git.kernel.org/tip/b5e212a3051b65e426a513901d9c7001681c7215
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Wed, 19 Nov 2014 13:56:19 -0800
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 20 Nov 2014 23:01:53 +0100

x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1

TIF_NOHZ is 19 (i.e. _TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME |
_TIF_SINGLESTEP), not (1<<19).

This code is involved in Dave's trinity lockup, but I don't see why
it would cause any of the problems he's seeing, except inadvertently
by causing a different path through entry_64.S's syscall handling.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/a6cd3b60a3f53afb6e1c8081b0ec30ff19003dd7.1416434075.git.luto@amacapital.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/ptrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 749b0e4..e510618 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1484,7 +1484,7 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
 	 */
 	if (work & _TIF_NOHZ) {
 		user_exit();
-		work &= ~TIF_NOHZ;
+		work &= ~_TIF_NOHZ;
 	}
 
 #ifdef CONFIG_SECCOMP
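
Spelled out, with TIF_NOHZ being 19 as the changelog says (the first
three defines are the usual x86 bit numbers, whose masks 1, 2 and 16
add up to 19):

   #define TIF_SYSCALL_TRACE  0
   #define TIF_NOTIFY_RESUME  1
   #define TIF_SINGLESTEP     4
   #define TIF_NOHZ          19                /* bit number */
   #define _TIF_NOHZ         (1 << TIF_NOHZ)   /* bit mask   */

   work &= ~TIF_NOHZ;   /* ~19 == ~0b10011: clears bits 0, 1 and 4, i.e.
                           TRACE, NOTIFY_RESUME and SINGLESTEP, and
                           leaves bit 19 alone */
   work &= ~_TIF_NOHZ;  /* clears only bit 19, as intended */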

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 21:58                                                 ` Thomas Gleixner
@ 2014-11-20 22:06                                                   ` Andy Lutomirski
  2014-11-20 22:11                                                   ` Tejun Heo
  1 sibling, 0 replies; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-20 22:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tejun Heo, Frederic Weisbecker, Linus Torvalds, Dave Jones,
	Don Zickus, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra, Arnaldo Carvalho de Melo

On Thu, Nov 20, 2014 at 1:58 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Thu, 20 Nov 2014, Tejun Heo wrote:
>> On Thu, Nov 20, 2014 at 12:50:36AM +0100, Frederic Weisbecker wrote:
>> > > Are we talking about different per cpu allocators here or am I missing
>> > > something completely non obvious?
>> >
>> > That's the same allocator yeah. So if the whole memory is dereferenced,
>> > faults shouldn't happen indeed.
>> >
>> > Maybe that was a bug a few years ago but not anymore.
>>
>> It has always been like that tho.  Percpu memory given out is always
>> populated and cleared.
>>
>> > Is it possible that, somehow, some part isn't zeroed by pcpu_alloc()?
>> > After all it's allocated with vzalloc() so that part could be skipped. The memset(0)
>>
>> The vzalloc call is for the internal allocation bitmap not the actual
>> percpu memory area.  The actual address areas for percpu memory are
>> obtained using pcpu_get_vm_areas() call and later get populated using
>> map_kernel_range_noflush() (flush is performed after mapping is
>> complete).
>>
>> Trying to remember what happens with vmalloc_fault().  Ah okay, so
>> when a new PUD gets created for vmalloc area, we don't go through all
>> PGDs and update them.  The PGD entries get faulted in lazily.  Percpu
>> memory allocator clearing or not clearing the allocated area doesn't
>> have anything to do with it.  The memory area is always fully
>> populated in the kernel page table.  It's just that the population
>> happened while a different PGD was active and this PGD hasn't been
>> populated with the new PUD yet.
>
> It's completely undocumented behaviour, whether it has been that way
> forever or not. And I agree with Frederic that it is insane. Actually
> it's beyond insane, really.
>
>> So, yeap, vmalloc_fault() can always happen when accessing vmalloc
>> areas and the only way to avoid that would be removing lazy PGD
>> population - going through all PGDs and populating new PUDs
>> immediately.
>
> There is no requirement to go through ALL PGDs and populate that stuff
> immediately.
>
> Let's look at the two types of allocations
>
>    1) Kernel percpu allocations
>
>    2) Per process/task percpu allocations
>
> Of course we do not have a way to distinguish those, but we really
> should have one.
>
> #1 Kernel percpu allocations usually happen in the context of driver
>    bringup, subsystem initialization, interrupt setup etc.
>
>    So this is functionality which is not a hotpath and usually
>    requires some form of synchronization versus the rest of the system
>    anyway.
>
>    The per cpu population stuff is serialized with a mutex anyway, so
>    what's wrong with having a globally visible percpu sequence counter,
>    which is incremented whenever a new allocation is populated or torn
>    down?
>
>    We can make that sequence counter a per cpu variable as well to
>    avoid the issues of a global variable (preferably that's a
>    compile/boot time allocated percpu variable to avoid the obvious
>    circulus vitiosus)
>
>    Now after that increment the allocation side needs to wait for a
>    scheduling cycle on all cpus (we have mechanisms for that)
>
>    So in the scheduler if the same task gets reselected you check that
>    sequence count and update the PGD if different. If a task switch
>    happens then you also need to check the sequence count and act
>    accordingly.
>
>    If we make the sequence counter a percpu variable as outlined above
>    the overhead of checking this is just noise versus the other
>    nonsense we do in schedule().

This seems like a reasonable idea, but I'd suggest a minor change:
rather than using a sequence number, track the number of kernel pgds.
That number should rarely change, and it's only one byte long.  That
means that we can easily stick it in mm_context_t without making it
any bigger.

The count for init_mm could be copied into cpu_tlbstate, which is
always hot on context switch.
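
As a sketch (the field and helper names are invented):

   typedef struct {
           /* ... existing fields ... */
           u8 nr_kernel_pgds;   /* populated kernel PGD entries */
   } mm_context_t;

   /* context switch: */
   if (unlikely(next->context.nr_kernel_pgds !=
                this_cpu_read(cpu_tlbstate.nr_kernel_pgds)))
           sync_kernel_pgds(next);   /* copy what's missing from init_mm */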

>
>
> #2 That's process related statistics and instrumentation stuff.
>
>    Now that just needs an immediate population on the process->mm->pgd
>    aside of the init_mm.pgd, but that's really not a big deal.
>
> Of course that does not solve the issues we have with the current
> infrastructure retroactively, but it allows us to avoid fuckups like
> the one Frederic was talking about, where perf invented its own kmalloc
> based 'percpu' replacement just to work around the shortcoming in a
> particular place.
>
> What really frightens me is the well hidden fuckup potential which
> lurks around the corner and the hard-to-debug once-in-a-while fallout
> which might be caused by this.

The annoying part of this is that pgd allocation is *so* rare that
bugs here can probably go unnoticed for a long time.

--Andy

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: [PATCH] x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1
  2014-11-20 20:33                                       ` Linus Torvalds
@ 2014-11-20 22:07                                         ` Thomas Gleixner
  0 siblings, 0 replies; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-20 22:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Dave Jones, Don Zickus, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra

On Thu, 20 Nov 2014, Linus Torvalds wrote:
> On Wed, Nov 19, 2014 at 2:13 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Right, while it is wrong it does not explain the wreckage on 3.17,
> > which does not have that code.
> 
> Thomas, I'm currently going off the assumption that I'll see this from
> the x86 trees, and I can ignore the patch. It doesn't seem like this
> is a particularly pressing bug.
> 
> If it's *not* going to show up as a pull request, holler, and I'll
> just apply it.

I'll send out an updated pull request for the one Ingo sent earlier
today in a second.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 21:58                                                 ` Thomas Gleixner
  2014-11-20 22:06                                                   ` Andy Lutomirski
@ 2014-11-20 22:11                                                   ` Tejun Heo
  2014-11-20 22:42                                                     ` Thomas Gleixner
  1 sibling, 1 reply; 486+ messages in thread
From: Tejun Heo @ 2014-11-20 22:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Andy Lutomirski, Arnaldo Carvalho de Melo

On Thu, Nov 20, 2014 at 10:58:26PM +0100, Thomas Gleixner wrote:
> It's completely undocumented behaviour, whether it has been that way
> forever or not. And I agree with Frederic that it is insane. Actually
> it's beyond insane, really.

This is exactly the same for any address in the vmalloc space.

..
>    So in the scheduler if the same task gets reselected you check that
>    sequence count and update the PGD if different. If a task switch
>    happens then you also need to check the sequence count and act
>    accordingly.

That isn't enough tho.  What if the percpu allocated pointer gets
passed to another CPU without task switching?  You'd at least need to
send IPIs to all CPUs so that all the active PGDs get updated
synchronously.

> What really frightens me is the well hidden fuckup potential which
> lurks around the corner and the hard-to-debug once-in-a-while fallout
> which might be caused by this.

Lazy vmalloc population through fault is something we accepted as
reasonable as it works fine for most of the kernel.  If the lazy
loading can be improved so that it doesn't depend on faulting, great.
For the time being, we can make percpu accessors complain when called
from nmi handlers so that the problematic ones can be easily
identified.
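
Something along these lines, as a sketch rather than a patch:

   static inline void percpu_nmi_check(const void *addr)
   {
           /* addr is the translated per-cpu address; dynamically
            * allocated percpu memory lives in the vmalloc range and
            * can thus fault on first touch from this PGD */
           WARN_ON_ONCE(in_nmi() && is_vmalloc_addr(addr));
   }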

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 22:11                                                   ` Tejun Heo
@ 2014-11-20 22:42                                                     ` Thomas Gleixner
  2014-11-20 23:05                                                       ` Tejun Heo
  0 siblings, 1 reply; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-20 22:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Andy Lutomirski, Arnaldo Carvalho de Melo

On Thu, 20 Nov 2014, Tejun Heo wrote:
> On Thu, Nov 20, 2014 at 10:58:26PM +0100, Thomas Gleixner wrote:
> > It's completely undocumented behaviour, whether it has been that way
> > forever or not. And I agree with Frederic that it is insane. Actually
> > it's beyond insane, really.
> 
> This is exactly the same for any address in the vmalloc space.

I know, but I really was not aware of the fact that dynamically
allocated percpu stuff is vmalloc based and therefore exposed to the
same issues.

The normal vmalloc space simply does not have the problems which are
generated by percpu allocations which have no documented access
restrictions.

You created a special case and that special case is clever but not
very well thought out considering the use cases of percpu variables
and the completely undocumented limitations you introduced silently.

Just admit it and don't try to educate me about trivial vmalloc
properties.

> ..
> >    So in the scheduler if the same task gets reselected you check that
> >    sequence count and update the PGD if different. If a task switch
> >    happens then you also need to check the sequence count and act
> >    accordingly.
> 
> That isn't enough tho.  What if the percpu allocated pointer gets
> passed to another CPU without task switching?  You'd at least need to
> send IPIs to all CPUs so that all the active PGDs get updated
> synchronously.

You obviously did not even take the time to carefully read what I
wrote:

   "Now after that increment the allocation side needs to wait for a
    scheduling cycle on all cpus (we have mechanisms for that)"

That's exactly stating what you claim to be 'not enough'. 

> > What really frightens me is the well hidden fuckup potential which
> > lurks around the corner and the hard-to-debug once-in-a-while fallout
> > which might be caused by this.
> 
> Lazy vmalloc population through fault is something we accepted as
> reasonable as it works fine for most of the kernel. 

Emphasis on most.

I'm well aware of the lazy vmalloc population, but I was definitely
not aware of the implications chosen by the dynamic percpu
allocator. I do not care about random discussion threads on LKML or
random slides you produced for a conference. All I care about is that
I cannot find a single word of documentation about that in the source
tree. Neither in the percpu implementation nor in Documentation/

> For the time being, we can make percpu accessors complain when
> called from nmi handlers so that the problematic ones can be easily
> identified.

You should have done that in the very first place instead of letting
other people run into issues which you should have thought of from the
very beginning.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 22:42                                                     ` Thomas Gleixner
@ 2014-11-20 23:05                                                       ` Tejun Heo
  2014-11-20 23:08                                                         ` Andy Lutomirski
  2014-11-21  0:54                                                         ` Thomas Gleixner
  0 siblings, 2 replies; 486+ messages in thread
From: Tejun Heo @ 2014-11-20 23:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Andy Lutomirski, Arnaldo Carvalho de Melo

Hello,

On Thu, Nov 20, 2014 at 11:42:42PM +0100, Thomas Gleixner wrote:
> On Thu, 20 Nov 2014, Tejun Heo wrote:
> > On Thu, Nov 20, 2014 at 10:58:26PM +0100, Thomas Gleixner wrote:
> > > It's completely undocumented behaviour, whether it has been that way
> > > forever or not. And I agree with Frederic that it is insane. Actually
> > > it's beyond insane, really.
> > 
> > This is exactly the same for any address in the vmalloc space.
> 
> I know, but I really was not aware of the fact that dynamically
> allocated percpu stuff is vmalloc based and therefore exposed to the
> same issues.
> 
> The normal vmalloc space simply does not have the problems which are
> generated by percpu allocations which have no documented access
> restrictions.
>
> You created a special case and that special case is clever but not
> very well thought out considering the use cases of percpu variables
> and the completely undocumented limitations you introduced silently.
> 
> Just admit it and don't try to educate me about trivial vmalloc
> properties.

Why are you always so overly dramatic?  How is this productive?  Sure,
this could have been better but I missed it at the beginning and this
is the first time I hear about this issue.  Shit happens and we fix
it.

> > That isn't enough tho.  What if the percpu allocated pointer gets
> > passed to another CPU without task switching?  You'd at least need to
> > send IPIs to all CPUs so that all the active PGDs get updated
> > synchronously.
> 
> You obviously did not even take the time to carefully read what I
> wrote:
> 
>    "Now after that increment the allocation side needs to wait for a
>     scheduling cycle on all cpus (we have mechanisms for that)"
> 
> That's exactly stating what you claim to be 'not enough'. 

Missed that.  Sorry.

> > For the time being, we can make percpu accessors complain when
> > called from nmi handlers so that the problematic ones can be easily
> > identified.
> 
> You should have done that in the very first place instead of letting
> other people run into issues which you should have thought of from the
> very beginning.

Sure, it would have been better if I noticed that from the get-go, but
I couldn't think of the NMI case at the time and neither did anybody who
reviewed the code.  It'd be awesome if we could have avoided it but it
didn't go that way, so let's fix it.  Can we please stay technical?

So, for now, all we need is adding nmi check in percpu accessors,
right?

-- 
tejun

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 23:05                                                       ` Tejun Heo
@ 2014-11-20 23:08                                                         ` Andy Lutomirski
  2014-11-20 23:34                                                           ` Linus Torvalds
  2014-11-20 23:39                                                           ` Tejun Heo
  2014-11-21  0:54                                                         ` Thomas Gleixner
  1 sibling, 2 replies; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-20 23:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Thomas Gleixner, Frederic Weisbecker, Linus Torvalds, Dave Jones,
	Don Zickus, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra, Arnaldo Carvalho de Melo

On Thu, Nov 20, 2014 at 3:05 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Thu, Nov 20, 2014 at 11:42:42PM +0100, Thomas Gleixner wrote:
>> On Thu, 20 Nov 2014, Tejun Heo wrote:
>> > On Thu, Nov 20, 2014 at 10:58:26PM +0100, Thomas Gleixner wrote:
>> > > It's completely undocumented behaviour, whether it has been that way
>> > > forever or not. And I agree with Frederic that it is insane. Actually
>> > > it's beyond insane, really.
>> >
>> > This is exactly the same for any address in the vmalloc space.
>>
>> I know, but I really was not aware of the fact that dynamically
>> allocated percpu stuff is vmalloc based and therefore exposed to the
>> same issues.
>>
>> The normal vmalloc space simply does not have the problems which are
>> generated by percpu allocations which have no documented access
>> restrictions.
>>
>> You created a special case and that special case is clever but not
>> very well thought out considering the use cases of percpu variables
>> and the completely undocumented limitations you introduced silently.
>>
>> Just admit it and don't try to educate me about trivial vmalloc
>> properties.
>
> Why are you always so overly dramatic?  How is this productive?  Sure,
> this could have been better but I missed it at the beginning and this
> is the first time I hear about this issue.  Shit happens and we fix
> it.
>
>> > That isn't enough tho.  What if the percpu allocated pointer gets
>> > passed to another CPU without task switching?  You'd at least need to
>> > send IPIs to all CPUs so that all the active PGDs get updated
>> > synchronously.
>>
>> You obviously did not even take the time to carefully read what I
>> wrote:
>>
>>    "Now after that increment the allocation side needs to wait for a
>>     scheduling cycle on all cpus (we have mechanisms for that)"
>>
>> That's exactly stating what you claim to be 'not enough'.
>
> Missed that.  Sorry.
>
>> > For the time being, we can make percpu accessors complain when
>> > called from nmi handlers so that the problematic ones can be easily
>> > identified.
>>
>> You should have done that in the very first place instead of letting
>> other people run into issues which you should have thought of from the
>> very beginning.
>
> Sure, it would have been better if I noticed that from the get-go, but
> I couldn't think of the NMI case that time and neither did anybody who
> reviewed the code.  It'd be awesome if we could have avoided it but it
> didn't go that way, so let's fix it.  Can we please stay technical?
>
> So, for now, all we need is adding nmi check in percpu accessors,
> right?
>

What's the issue with nmi?  Page faults are supposed to nest correctly
inside nmi, right?

--Andy

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 23:08                                                         ` Andy Lutomirski
@ 2014-11-20 23:34                                                           ` Linus Torvalds
  2014-11-20 23:39                                                           ` Tejun Heo
  1 sibling, 0 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-20 23:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tejun Heo, Thomas Gleixner, Frederic Weisbecker, Dave Jones,
	Don Zickus, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra, Arnaldo Carvalho de Melo

On Thu, Nov 20, 2014 at 3:08 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> What's the issue with nmi?  Page faults are supposed to nest correctly
> inside nmi, right?

They should, now, yes. There used to be issues with the whole "that
re-enables NMI".

Which reminds me. We never took your patches that use ljmp to handle
the return-to-kernel mode. You did them for performance reasons, but I
think the bigger deal was that it would have cleaned up that whole
special case.

Or did they have other problems? The ones to return to user space were
admittedly more fun, but just a tad too crazy (and not _quite_ in the
"crazy like a fox" camp ;)

            Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 23:08                                                         ` Andy Lutomirski
  2014-11-20 23:34                                                           ` Linus Torvalds
@ 2014-11-20 23:39                                                           ` Tejun Heo
  2014-11-20 23:55                                                             ` Andy Lutomirski
  2014-11-21  2:33                                                             ` Steven Rostedt
  1 sibling, 2 replies; 486+ messages in thread
From: Tejun Heo @ 2014-11-20 23:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Frederic Weisbecker, Linus Torvalds, Dave Jones,
	Don Zickus, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra, Arnaldo Carvalho de Melo

On Thu, Nov 20, 2014 at 03:08:03PM -0800, Andy Lutomirski wrote:
> > So, for now, all we need is adding nmi check in percpu accessors,
> > right?
> >
> 
> What's the issue with nmi?  Page faults are supposed to nest correctly
> inside nmi, right?

Thought they couldn't.  Looking at the trace that Frederic linked, it
looks like straight-out tracing function recursion due to an
unexpected fault while holding a lock.  I don't think this can be
annotated from percpu accessor side.  There's nothing special about
the context.  :(

Does this matter for anybody other than tracers?  Ultimately, the
solution would be removing the vmalloc area faulting as Thomas
suggested.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 23:39                                                           ` Tejun Heo
@ 2014-11-20 23:55                                                             ` Andy Lutomirski
  2014-11-21 16:27                                                               ` Tejun Heo
  2014-11-21  2:33                                                             ` Steven Rostedt
  1 sibling, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-20 23:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Thomas Gleixner, Frederic Weisbecker, Linus Torvalds, Dave Jones,
	Don Zickus, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra, Arnaldo Carvalho de Melo

On Thu, Nov 20, 2014 at 3:39 PM, Tejun Heo <tj@kernel.org> wrote:
> On Thu, Nov 20, 2014 at 03:08:03PM -0800, Andy Lutomirski wrote:
>> > So, for now, all we need is adding nmi check in percpu accessors,
>> > right?
>> >
>>
>> What's the issue with nmi?  Page faults are supposed to nest correctly
>> inside nmi, right?
>
> Thought they couldn't.  Looking at the trace that Frederic linked, it
> looks like straight-out tracing function recursion due to an
> unexpected fault while holding a lock.  I don't think this can be
> annotated from percpu accessor side.  There's nothing special about
> the context.  :(

That doesn't appear to have anything to do with nmi though, right?

Wouldn't this issue be fixed by moving the vmalloc_fault check into
do_page_fault before exception_enter?
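
Something like the following reordering, as an untested sketch:

   dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)
   {
           unsigned long address = read_cr2();
           enum ctx_state prev_state;

           /* handle lazy vmalloc/percpu PGD faults before any context
              tracking (and thus tracing) can run, so they can't recurse */
           if (unlikely(fault_in_kernel_space(address)) &&
               !(error_code & (PF_RSVD | PF_USER | PF_PROT)) &&
               vmalloc_fault(address) >= 0)
                   return;

           prev_state = exception_enter();
           __do_page_fault(regs, error_code, address);
           exception_exit(prev_state);
   }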

--Andy

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 23:05                                                       ` Tejun Heo
  2014-11-20 23:08                                                         ` Andy Lutomirski
@ 2014-11-21  0:54                                                         ` Thomas Gleixner
  2014-11-21 14:13                                                           ` Frederic Weisbecker
  1 sibling, 1 reply; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-21  0:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Linus Torvalds, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Andy Lutomirski, Arnaldo Carvalho de Melo

Tejun,

On Thu, 20 Nov 2014, Tejun Heo wrote:
> On Thu, Nov 20, 2014 at 11:42:42PM +0100, Thomas Gleixner wrote:
> > On Thu, 20 Nov 2014, Tejun Heo wrote:
> > > On Thu, Nov 20, 2014 at 10:58:26PM +0100, Thomas Gleixner wrote:
> > > > It's completely undocumented behaviour, whether it has been that way
> > > > forever or not. And I agree with Frederic that it is insane. Actually
> > > > it's beyond insane, really.
> > > 
> > > This is exactly the same for any address in the vmalloc space.
> > 
> > I know, but I really was not aware of the fact that dynamically
> > allocated percpu stuff is vmalloc based and therefore exposed to the
> > same issues.
> > 
> > The normal vmalloc space simply does not have the problems which are
> > generated by percpu allocations which have no documented access
> > restrictions.
> >
> > You created a special case and that special case is clever but not
> > very well thought out considering the use cases of percpu variables
> > and the completely undocumented limitations you introduced silently.
> > 
> > Just admit it and don't try to educate me about trivial vmalloc
> > properties.
> 
> Why are you always so overly dramatic?

This has nothing to do with being dramatic. It's a matter of fact that I do
not need an education on the basic properties of the vmalloc space.

I just refuse to accept that you try to tell me that I should be aware
of this:

> > > This is exactly the same for any address in the vmalloc space.

What I was not aware of, and remained unaware of even after staring into
that code for quite some time, is the fact that the whole percpu
business is vmalloc based and therefore exposed to the same limitations
as the vmalloc space in general.

I'm not a mm expert and without the slightest piece of documentation
except for the chunk allocator, which is completely irrelevant in this
context, there is not a single word of explanation about the design and
the resulting limitations of that in the kernel tree.

So, I'm overly dramatic, because I tell you that I'm well aware of the
general vmalloc approach, which is btw. well documented?

> How is this productive?

It's obviously very productive, because I'm AFAICT the first person
who did not take your design decisions as a given and sacrosanct.

> Sure, this could have been better but I missed it at the beginning
> and this is the first time I hear about this issue.

So the issues Frederic talked about in that very thread about
recursive faults and the need that perf had to emulate percpu stuff in
order to work around them have never been communicated to you?

If that's the case then that's not your problem, but a serious problem
in our overall process.

> Shit happens and we fix it.

I have no problem with that and I'm not trying to put blame on you.

As you might have noticed I spent quite some time thinking about a
possible solution and also clearly stated that, while it's not complex
to implement, it's perhaps not solving the issue at hand and might be
too complex to backport. The response I get from you is:

> > > That isn't enough tho.  What if the percpu allocated pointer gets
> > > passed to another CPU without task switching?  You'd at least need to
> > > send IPIs to all CPUs so that all the active PGDs get updated
> > > synchronously.
> > 
> > You obviously did not even take the time to carefully read what I
> > wrote:
> > 
> >    "Now after that increment the allocation side needs to wait for a
> >     scheduling cycle on all cpus (we have mechanisms for that)"
> > 
> > That's exactly stating what you claim to be 'not enough'. 
> 
> Missed that.  Sorry.

Apology accepted.
 
> So, for now, all we need is adding nmi check in percpu accessors,
> right?

s/all we need/all we can do/

I think is the proper technical expression for that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 23:39                                                           ` Tejun Heo
  2014-11-20 23:55                                                             ` Andy Lutomirski
@ 2014-11-21  2:33                                                             ` Steven Rostedt
  1 sibling, 0 replies; 486+ messages in thread
From: Steven Rostedt @ 2014-11-21  2:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andy Lutomirski, Thomas Gleixner, Frederic Weisbecker,
	Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra,
	Arnaldo Carvalho de Melo

On Thu, Nov 20, 2014 at 06:39:20PM -0500, Tejun Heo wrote:
> On Thu, Nov 20, 2014 at 03:08:03PM -0800, Andy Lutomirski wrote:
> > > So, for now, all we need is adding nmi check in percpu accessors,
> > > right?
> > >
> > 
> > What's the issue with nmi?  Page faults are supposed to nest correctly
> > inside nmi, right?
> 
> Thought they couldn't.  Looking at the trace that Frederic linked, it
> looks like straight-out tracing function recursion due to an
> unexpected fault while holding a lock.  I don't think this can be
> annotated from the percpu accessor side.  There's nothing special about
> the context.  :(

There used to be issues with page faults in NMI. One was that the iretq
from the page fault handler would re-enable NMIs, and if another NMI triggered
then it would stomp all over the stack of the initial NMI. But my triple
copy of the NMI stack frame solved that. You can read all about it here:

  http://lwn.net/Articles/484932/

The second bug was that if an NMI triggered right after a page fault, and
it took a page fault of its own, the content of the cr2 register (the faulting
address) would be lost for the page fault that was preempted by the NMI.
This too was solved by (cue irony) using per_cpu variables.
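
Roughly the idea, as a hand-wavy sketch (not the actual entry code;
sketch_do_nmi() and handle_nmi_work() are made-up names):

  /* <linux/percpu.h>, <asm/special_insns.h> */

  /* Boot-time (static) per_cpu slot, so saving cr2 itself can't fault. */
  static DEFINE_PER_CPU(unsigned long, nmi_saved_cr2);

  notrace void sketch_do_nmi(struct pt_regs *regs)
  {
          /* Stash the faulting address of whatever we interrupted. */
          this_cpu_write(nmi_saved_cr2, read_cr2());

          handle_nmi_work(regs);  /* may page-fault and clobber cr2 */

          /* Put cr2 back if the NMI's own fault changed it, so the
           * interrupted page fault handler still sees its address. */
          if (this_cpu_read(nmi_saved_cr2) != read_cr2())
                  write_cr2(this_cpu_read(nmi_saved_cr2));
  }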

Now I'm hoping that kernel boot time per_cpu variables never take any
faults, otherwise we are all f*cked!

> 
> Does this matter for anybody other than tracers?  Ultimately, the
> solution would be removing the vmalloc area faulting as Thomas
> suggested.

I don't know, but per_cpu variables are rather special and used all
over the place. Most other vmalloc-based code isn't used nearly as
widely as per_cpu is.

-- Steve


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 19:43                                     ` Linus Torvalds
  2014-11-20 20:06                                       ` Dave Jones
  2014-11-20 20:37                                       ` Don Zickus
@ 2014-11-21  6:37                                       ` Ingo Molnar
  2014-11-21 14:50                                         ` Dave Jones
  2 siblings, 1 reply; 486+ messages in thread
From: Ingo Molnar @ 2014-11-21  6:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Andy Lutomirski, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [...]
> 
> That's *especially* true if it turns out that the 3.17 problem 
> you saw was actually a perf bug that has already been fixed and 
> is in stable. We've been looking at kernel/smp.c changes, and 
> looking for x86 IPI or APIC changes, and found some harmlessly 
> (at least on x86) suspicious code and this exercise might be 
> worth it for that reason, but what if it's really just a 
> scheduler regression.
> 
> There's been a *lot* more scheduler changes since 3.17 than the 
> small things we've looked at for x86 entry or IPI handling. And 
> the scheduler changes have been about things like overloaded 
> scheduling groups etc, and I could easily imagine that some bug 
> *there* ends up causing the watchdog process not to schedule.
> 
> Hmm? Scheduler people?

Hm, that's a possibility, yes.

The watchdog threads are pretty simple beasts though, using 
SCHED_FIFO:

 kernel/watchdog.c:      watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1);

which is typically only affected by less than 10% of scheduler 
changes - but it's entirely possible still.
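
IIRC the helper is just a thin wrapper, roughly:

  static void watchdog_set_prio(unsigned int policy, unsigned int prio)
  {
          struct sched_param param = { .sched_priority = prio };

          sched_setscheduler(current, policy, &param);
  }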

It might make sense to disable the softlockup detector altogether 
and just see whether trinity finishes/wedges, whether a login 
over the console is still possible - etc.
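
For example, via either of the usual knobs (assuming the standard
watchdog sysctl and boot parameter, which should both exist here):

  # at runtime
  echo 0 > /proc/sys/kernel/watchdog

  # or on the kernel command line
  nowatchdog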

The softlockup messages in themselves are only analytical, unless 
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=1 is used.

Interesting bug.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 17:38                                         ` Dave Jones
@ 2014-11-21  9:46                                           ` Dave Young
  0 siblings, 0 replies; 486+ messages in thread
From: Dave Young @ 2014-11-21  9:46 UTC (permalink / raw)
  To: Dave Jones, Vivek Goyal, Don Zickus, Thomas Gleixner,
	Linus Torvalds, Linux Kernel, the arch/x86 maintainers,
	WANG Chao, Baoquan He

On 11/20/14 at 12:38pm, Dave Jones wrote:
> On Thu, Nov 20, 2014 at 11:48:09AM -0500, Vivek Goyal wrote:
>  
>  > Can we try following and retry and see if some additional messages show
>  > up on console and help us narrow down the problem.
>  > 
>  > - Enable verbose boot messages. CONFIG_X86_VERBOSE_BOOTUP=y
>  > 
>  > - Enable early printk in second kernel. (earlyprintk=ttyS0,115200).
>  > 
>  >   You can either enable early printk in first kernel and reboot. That way
>  >   second kernel will automatically have it enabled. Or you can edit
>  >   "/etc/sysconfig/kdump" and append earlyprintk=<> to KDUMP_COMMANDLINE_APPEND. 
>  >   You will need to restart kdump service after this.
>  > 
>  > - Enable some debug output during runtime from kexec purgatory. For that one
>  >   needs to pass additional arguments to /sbin/kexec. You can edit
>  >   /etc/sysconfig/kdump file and modify "KEXEC_ARGS" to pass additional
>  >   arguments to /sbin/kexec during kernel load. I use following for my
>  >   serial console.
>  > 
>  >   KEXEC_ARGS="--console-serial --serial=0x3f8 --serial-baud=115200"
>  > 
>  >   You will need to restart kdump service.
> 
> The only serial port on this machine is usb serial, which doesn't have io ports.
> 
> From my reading of the kexec man page, it doesn't look like I can tell
> it to use ttyUSB0.

Enabling ttyUSB0 still needs hacks in the dracut/kdump module to pack the
usb-serial .ko into the initramfs and load it early. We can work on that in
Fedora because it may help with other late-boot problems.

> 
> And because it relies on usb being initialized, this probably isn't
> going to help too much with early boot.
> 
> earlyprintk=tty0 didn't show anything extra after the sysrq-c oops.
> likewise, =ttyUSB0

earlyprintk=vga instead of tty0?
earlyprintk=efi in case of an EFI boot.
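
For example, following the /etc/sysconfig/kdump format Vivek quoted
above (the exact value is just an illustration):

  KDUMP_COMMANDLINE_APPEND="earlyprintk=vga"

and then restart the kdump service.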

earlyprintk=dbgp sometimes also helps, but it's a little hard to set up
because it needs a USB debug device. My Nokia N900 works well as a debugger,
but finding a usable USB debug port on a native host might fail, so this is
my last resort for earlyprintk :(

> 
> I'm going to try bisecting the problem I'm debugging again, so I'm not
> going to dig into this much more today.
> 

Another kdump kernel issue I know about is that nouveau sometimes does not
work. If that's the case here, you can try adding "rd.driver.blacklist=nouveau"
to the KDUMP_COMMANDLINE_APPEND field in /etc/sysconfig/kdump. Or just add
"nomodeset" to the 1st kernel's grub cmdline so that the 2nd kernel will reuse
it and avoid loading drm modules; then earlyprintk=vga could probably also
show something.

Thanks
Dave

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21  0:54                                                         ` Thomas Gleixner
@ 2014-11-21 14:13                                                           ` Frederic Weisbecker
  2014-11-21 16:25                                                             ` Tejun Heo
  0 siblings, 1 reply; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-21 14:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tejun Heo, Linus Torvalds, Dave Jones, Don Zickus, Linux Kernel,
	the arch/x86 maintainers, Peter Zijlstra, Andy Lutomirski,
	Arnaldo Carvalho de Melo

On Fri, Nov 21, 2014 at 01:54:00AM +0100, Thomas Gleixner wrote:
> On Thu, 20 Nov 2014, Tejun Heo wrote:
> > Sure, this could have been better but I missed it at the beginning
> > and this is the first time I hear about this issue.
> 
> So the issues Frederic talked about in that very thread about
> recursive faults and the need that perf had to emulate percpu stuff in
> order to work around them have never been communicated to you?
> 
> If that's the case then that's not your problem, but a serious problem
> in our overall process.

So when the issue arose 4 years ago, it was a problem only for NMIs.
Like Linus says: "what happens in NMI stays in NMI". Ok no that's not quite
what he says :-)  But NMIs happen to be a corner case for just about everything
and it's sometimes better to fix things from NMI itself, or have an NMI
special case rather than grow the whole infrastructure in complexity to
support this very corner case.

I'm not saying that's the only valid approach to take wrt. NMIs, but those
vmalloc faults seemed to be well established and generally known (except
perhaps for percpu), NMI was the only corner case, and we are used to that,
so fixing the issue for NMIs only felt like the right direction when we
fixed the callchain thing with the other perf developers.

I certainly should have talked to Tejun about that but it took a bit of time
for me to realize that randomly faultable memory is a dangerous behaviour.

Add to that a bit of the "take the infrastructure for granted" problem when
you're not experienced enough...

Anyway, I really hope we fix that, that's a bomb waiting to explode.

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21  6:37                                       ` Ingo Molnar
@ 2014-11-21 14:50                                         ` Dave Jones
  0 siblings, 0 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-21 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andy Lutomirski, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra

On Fri, Nov 21, 2014 at 07:37:42AM +0100, Ingo Molnar wrote:

 > It might make sense to disable the softlockup detector altogether 
 > and just see whether trinity finishes/wedges, whether a login 
 > over the console is still possible - etc.

I can give that a try later.

 > The softlockup messages in themselves are only analytical, unless 
 > CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=1 is used.

Hm, I don't recall why I had that set. That should make things easier
to debug if the machine stays alive a little longer rather than
panicking. At least it might make sure that I get the full traces
over usb-serial.

Additionally, it might make ftrace an option.

The last thing I tested was 3.17 plus the perf fixes Frederic pointed
out yesterday. It's survived 20 hours of runtime, so I'm back to
believing that this is a recent (ie, post-3.17) bug.

Running into the weekend though, so I'm not going to get to bisecting
until Monday probably. So maybe I'll try your idea at the top of this
mail in my over-the-weekend run.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 14:13                                                           ` Frederic Weisbecker
@ 2014-11-21 16:25                                                             ` Tejun Heo
  2014-11-21 17:01                                                               ` Steven Rostedt
  2014-11-21 21:44                                                               ` Frederic Weisbecker
  0 siblings, 2 replies; 486+ messages in thread
From: Tejun Heo @ 2014-11-21 16:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Andy Lutomirski, Arnaldo Carvalho de Melo

Hello, Frederic.

On Fri, Nov 21, 2014 at 03:13:35PM +0100, Frederic Weisbecker wrote:
...
> So when the issue arose 4 years ago, it was a problem only for NMIs.
> Like Linus says: "what happens in NMI stays in NMI". Ok no that's not quite
> what he says :-)  But NMIs happen to be a corner case for just about everything
> and it's sometimes better to fix things from NMI itself, or have an NMI
> special case rather than grow the whole infrastructure in complexity to
> support this very corner case.

I'm not familiar with the innards of fault handling, so can you please
help me understand what may actually break?  Here's what I currently
understand.

* Static percpu areas wouldn't trigger faults lazily.  Note that this
  is not necessarily because the first percpu chunk which contains the
  static area is embedded inside the kernel linear mapping.  Depending
  on the memory layout and boot params, the percpu allocator may choose
  to map the first chunk in vmalloc space too; however, this still works
  out fine because at that point there are no other page tables and
  the PUD entries covering the first chunk are faulted in before other
  page tables are copied from the kernel one.  (See the sketch after
  this list.)

* NMI used to be a problem because vmalloc fault handler couldn't
  safely nest inside NMI handler but this has been fixed since and it
  should work fine from NMI handlers now.

* Function tracers are problematic because they may end up nesting
  inside themselves through triggering a vmalloc fault while accessing
  dynamic percpu memory area.  This may lead to recursive locking and
  other surprises.
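
Here's the sketch referenced in the first point above (purely
illustrative; the names are made up):

  #include <linux/percpu.h>

  /* Static: lives in the first chunk, which is mapped before any pgd
   * is cloned, so access never goes through a lazy vmalloc fault. */
  static DEFINE_PER_CPU(unsigned long, static_counter);

  void percpu_example(void)
  {
          /* Dynamic: the chunk may sit in vmalloc space, so the first
           * touch through a given mm's page tables can fault lazily. */
          unsigned long __percpu *dyn = alloc_percpu(unsigned long);

          if (!dyn)
                  return;

          this_cpu_inc(static_counter);   /* never faults lazily */
          this_cpu_inc(*dyn);             /* may take a vmalloc fault */

          free_percpu(dyn);
  }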

Are there other cases where the lazy vmalloc faults can break things?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 23:55                                                             ` Andy Lutomirski
@ 2014-11-21 16:27                                                               ` Tejun Heo
  2014-11-21 16:38                                                                 ` Andy Lutomirski
  0 siblings, 1 reply; 486+ messages in thread
From: Tejun Heo @ 2014-11-21 16:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Frederic Weisbecker, Linus Torvalds, Dave Jones,
	Don Zickus, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra, Arnaldo Carvalho de Melo

Hello, Andy.

On Thu, Nov 20, 2014 at 03:55:09PM -0800, Andy Lutomirski wrote:
> That doesn't appear to have anything to do with nmi though, right?

I thought that was the main offender but, apparently, not any more.

> Wouldn't this issue be fixed by moving the vmalloc_fault check into
> do_page_fault before exception_enter?

Can you please elaborate why that'd fix the issue?  I'm not
intimately familiar with the fault handling so it'd be great if you
can give me some pointers in terms of where to look.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 16:27                                                               ` Tejun Heo
@ 2014-11-21 16:38                                                                 ` Andy Lutomirski
  2014-11-21 16:48                                                                   ` Linus Torvalds
  2014-11-21 22:10                                                                   ` Frederic Weisbecker
  0 siblings, 2 replies; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-21 16:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Thomas Gleixner, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Linus Torvalds, Frederic Weisbecker, Don Zickus,
	Dave Jones, the arch/x86 maintainers

On Nov 21, 2014 8:27 AM, "Tejun Heo" <tj@kernel.org> wrote:
>
> Hello, Andy.
>
> On Thu, Nov 20, 2014 at 03:55:09PM -0800, Andy Lutomirski wrote:
> > That doesn't appear to have anything to do with nmi though, right?
>
> I thought that was the main offender but, apparently, not any more.
>
> > Wouldn't this issue be fixed by moving the vmalloc_fault check into
> > do_page_fault before exception_enter?
>
> Can you please elaborate why that'd fix the issue?  I'm not
> intimately familiar with the fault handling so it'd be great if you
> can give me some pointers in terms of where to look.

do_page_fault is called directly from asm.  It does:

    prev_state = exception_enter();
    __do_page_fault(regs, error_code, address);
    exception_exit(prev_state);

The vmalloc fixup is in __do_page_fault.

exception_enter does various accounting and tracing things, and I
think that the recursion in the stack trace I saw was in exception_enter.

If you move the vmalloc fixup before exception_enter() and return if
the fault was from vmalloc, then you can't recurse.  You need to be
careful not to touch anything that uses RCU before exception_enter,
though.
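
Concretely, something like this (untested sketch; the extra error-code
checks the real fixup path does are omitted for brevity):

  dotraplinkage void notrace
  do_page_fault(struct pt_regs *regs, unsigned long error_code)
  {
          unsigned long address = read_cr2();
          enum ctx_state prev_state;

          /* Fix up kernel vmalloc faults before exception_enter(), so
           * the fixup can't recurse through the context-tracking and
           * tracing hooks.  Nothing here may use RCU. */
          if (unlikely(fault_in_kernel_space(address)) &&
              vmalloc_fault(address) >= 0)
                  return;

          prev_state = exception_enter();
          __do_page_fault(regs, error_code, address);
          exception_exit(prev_state);
  }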

--Andy

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 16:38                                                                 ` Andy Lutomirski
@ 2014-11-21 16:48                                                                   ` Linus Torvalds
  2014-11-21 17:08                                                                     ` Steven Rostedt
  2014-11-21 22:10                                                                   ` Frederic Weisbecker
  1 sibling, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-21 16:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 8:38 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> If you move the vmalloc fixup before exception_enter() and return if
> the fault was from vmalloc, then you can't recurse.  You need to be
> careful not to touch anything that uses RCU before exception_enter,
> though.

This is probably the right thing to do anyway.

The vmalloc fixup is purely about filling in hardware structures, so
there really shouldn't be any need for RCU or anything else. It should
probably be done first, before *anything* else (like the whole
kmemcheck/kmmio fault etc handling).

That said, the whole vmalloc_fault fixup routine does some odd things,
over and beyond just filling in the page tables. So I'm not 100% sure
that is safe as-is. The 32-bit version looks fine, but the x86-64
version is very very dubious.

The x86-64 version does crazy things like:

 - uses "current->active_mm", which is very dubious
 - flush lazy mmu mode
 - walk down further in the page tables

and those are just bugs, imnsho. Get rid of that crap. The 32-bit code
does it right.

(The 64-bit mode also has a "WARN_ON_ONCE(in_nmi())", which I guess is
good - but it's good because the 64-bit version is written the way it
is).

                    Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 16:25                                                             ` Tejun Heo
@ 2014-11-21 17:01                                                               ` Steven Rostedt
  2014-11-21 17:11                                                                 ` Steven Rostedt
  2014-11-21 21:32                                                                 ` Frederic Weisbecker
  2014-11-21 21:44                                                               ` Frederic Weisbecker
  1 sibling, 2 replies; 486+ messages in thread
From: Steven Rostedt @ 2014-11-21 17:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Thomas Gleixner, Linus Torvalds, Dave Jones,
	Don Zickus, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra, Andy Lutomirski, Arnaldo Carvalho de Melo

On Fri, Nov 21, 2014 at 11:25:06AM -0500, Tejun Heo wrote:
> 
> * Static percpu areas wouldn't trigger faults lazily.  Note that this
>   is not necessarily because the first percpu chunk which contains the
>   static area is embedded inside the kernel linear mapping.  Depending
>   on the memory layout and boot params, the percpu allocator may choose
>   to map the first chunk in vmalloc space too; however, this still works
>   out fine because at that point there are no other page tables and
>   the PUD entries covering the first chunk are faulted in before other
>   page tables are copied from the kernel one.

That sounds correct.

> 
> * NMI used to be a problem because vmalloc fault handler couldn't
>   safely nest inside NMI handler but this has been fixed since and it
>   should work fine from NMI handlers now.

Right. Of course "should work fine" does not exactly mean "will work fine".


> 
> * Function tracers are problematic because they may end up nesting
>   inside themselves through triggering a vmalloc fault while accessing
>   dynamic percpu memory area.  This may lead to recursive locking and
>   other surprises.

The function tracer infrastructure now has a recursion check that happens
rather early in the call. Unless the registered OPS specifically states
it handles recursion (FTRACE_OPS_FL_RECURSION_SAFE), ftrace will add the
necessary recursion checks. If a registered OPS lies about being recursion
safe, well, we can't stop suicide.

Looking at kernel/trace/trace_functions.c: function_trace_call() which is
registered with RECURSION_SAFE, I see that the recursion check is done
before the per_cpu_ptr() call to the dynamically allocated per_cpu data.
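
The pattern is roughly this (hand-wavy sketch, not the actual
trace_functions.c code; record_trace_entry() is made up):

  static DEFINE_PER_CPU(int, in_tracer);  /* static percpu: fault-free */

  static void sketch_trace_call(unsigned long ip)
  {
          /* Bail if we re-entered via a fault taken below. */
          if (this_cpu_inc_return(in_tracer) != 1)
                  goto out;

          /* Only now touch the dynamically allocated percpu data,
           * which may vmalloc-fault and re-enter the tracer. */
          record_trace_entry(ip);
  out:
          this_cpu_dec(in_tracer);
  }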

It looks OK, but...

Oh! But if we trace the page fault handler, and we fault here too,
we just nuked the cr2 register. Not good.

-- Steve


> 
> Are there other cases where the lazy vmalloc faults can break things?

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 16:48                                                                   ` Linus Torvalds
@ 2014-11-21 17:08                                                                     ` Steven Rostedt
  2014-11-21 17:19                                                                       ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Steven Rostedt @ 2014-11-21 17:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 08:48:58AM -0800, Linus Torvalds wrote:
> 
> (The 64-bit mode also has a "WARN_ON_ONCE(in_nmi())", which I guess is
> good - but it's good because the 64-bit version is written the way it
> is).

Actually, in_nmi() is now safe for vmalloc faults. In fact, it handles the
clobbering of the cr2 register just fine. I wrote tests to test this, and
submitted patches to get rid of that WARN_ON. But that never went through.

https://lkml.org/lkml/2013/10/15/894

-- Steve


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 17:01                                                               ` Steven Rostedt
@ 2014-11-21 17:11                                                                 ` Steven Rostedt
  2014-11-21 21:32                                                                 ` Frederic Weisbecker
  1 sibling, 0 replies; 486+ messages in thread
From: Steven Rostedt @ 2014-11-21 17:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Thomas Gleixner, Linus Torvalds, Dave Jones,
	Don Zickus, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra, Andy Lutomirski, Arnaldo Carvalho de Melo

On Fri, 21 Nov 2014 12:01:51 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
 
> Looking at kernel/trace/trace_functions.c: function_trace_call() which is
> registered with RECURSION_SAFE, I see that the recursion check is done
> before the per_cpu_ptr() call to the dynamically allocated per_cpu data.
> 
> It looks OK, but...
> 
> Oh! but if we trace the page fault handler, and we fault here too
> we just nuked the cr2 register. Not good.

Ah! Looking at the code, I see that do_page_fault (called from
assembly) is marked notrace. And the first thing it does is:

	unsigned long address = read_cr2();

And uses that. Thus if the function tracer were to fault in
exception_enter() or __do_page_fault(), the address won't be
clobbered.

-- Steve

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 17:08                                                                     ` Steven Rostedt
@ 2014-11-21 17:19                                                                       ` Linus Torvalds
  2014-11-21 17:22                                                                         ` Andy Lutomirski
  2014-11-21 17:34                                                                         ` Steven Rostedt
  0 siblings, 2 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-21 17:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 9:08 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Actually, in_nmi() is now safe for vmalloc faults. In fact, it handles the
> clobbering of the cr2 register just fine.

That's not what I object to and find incorrect wrt NMI.

Compare the simple and correct 32-bit code to the complex and
incorrect 64-bit code.

In particular, look at how the 32-bit code relies *entirely* on hardware state.

Then look at where the 64-bit code does not.

                      Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 17:19                                                                       ` Linus Torvalds
@ 2014-11-21 17:22                                                                         ` Andy Lutomirski
  2014-11-21 18:22                                                                           ` Linus Torvalds
  2014-11-21 17:34                                                                         ` Steven Rostedt
  1 sibling, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-21 17:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 9:19 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Nov 21, 2014 at 9:08 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> Actually, in_nmi() is now safe for vmalloc faults. In fact, it handles the
>> clobbering of the cr2 register just fine.
>
> That's not what I object to and find incorrect wrt NMI.
>
> Compare the simple and correct 32-bit code to the complex and
> incorrect 64-bit code.
>
> In particular, look at how the 32-bit code relies *entirely* on hardware state.
>
> Then look at where the 64-bit code does not.

Both mystify me.  Why does the 32-bit version walk down the hierarchy
at all instead of just touching the top level?

And why does the 64-bit version assert that the leaves of the tables
match?  It's already asserted that it's walking down pgd pointers that
are *exactly the same pointers*, so of course the stuff they point to
is the same.

--Andy

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 17:19                                                                       ` Linus Torvalds
  2014-11-21 17:22                                                                         ` Andy Lutomirski
@ 2014-11-21 17:34                                                                         ` Steven Rostedt
  2014-11-21 18:24                                                                           ` Linus Torvalds
  1 sibling, 1 reply; 486+ messages in thread
From: Steven Rostedt @ 2014-11-21 17:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, 21 Nov 2014 09:19:02 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Nov 21, 2014 at 9:08 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Actually, in_nmi() is now safe for vmalloc faults. In fact, it handles the
> > clobbering of the cr2 register just fine.
> 
> That's not what I object to and find incorrect wrt NMI.

I was commenting about the WARN_ON() itself.

> 
> Compare the simple and correct 32-bit code to the complex and
> incorrect 64-bit code.
> 
> In particular, look at how the 32-bit code relies *entirely* on hardware state.
> 
> Then look at where the 64-bit code does not.

I see. You have issues with the use of current->active_mm instead of
just doing a read_cr3() (and I'm sure other things).

Doing a series of git blames, 64-bit has been like that since 2005 (the
start of git).

Looks to me like we have more work to do on merging the 64-bit and
32-bit code. Perhaps 64-bit can become more like 32-bit.

-- Steve

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 17:22                                                                         ` Andy Lutomirski
@ 2014-11-21 18:22                                                                           ` Linus Torvalds
  2014-11-21 18:28                                                                             ` Andy Lutomirski
  2014-11-21 19:06                                                                             ` Linus Torvalds
  0 siblings, 2 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-21 18:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 9:22 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Both mystify me.  Why does the 32-bit version walk down the hierarchy
> at all instead of just touching the top level?

Quite frankly, I think it's just due to historical reasons, and should
be removed.

But the historical reasons are that with the aliasing of the PUD and
PMD entries in the PGD, it's all fairly confusing. So I think we only
used to do the top level, but then when we expanded from two levels to
three, that "top level" became the pmd, and then when we expanded from
three to four, the pmd was actually two levels down. So it's all
basically mindless work.

So I do think we could simplify and unify things.

In 32-bit mode, we actually have two different cases:

 - in PAE, there's the magic top-level 4-entry PGD that always *has*
to be present (the P bit isn't actually checked by hardware)

    As a result, in PAE mode, the top PGD entries always exist, and
are always prepopulated, and for the kernel area (including obviously
the vmalloc space) always points to the init_pgd[] entry.

    Ergo, in PAE mode, I don't think we should ever hit this case in
the first place.

 - in non-PAE mode, we should just copy the top-level entry, and return.

And in 64-bit mode, we only have the "copy the top-level entry" case.

So I think we should

 (a) remove the 32-bit vs 64-bit difference, because that's not actually valid

 (b) make it a PAE vs non-PAE difference

 (c) the PAE case is a no-op

 (d) the non-PAE case would look something like this:

    static noinline int vmalloc_fault(unsigned long address)
    {
        unsigned index;
        pgd_t *pgd_dst, pgd_entry;

        /* Make sure we are in vmalloc area: */
        if (!(address >= VMALLOC_START && address < VMALLOC_END))
                return -1;

        index = pgd_index(address);
        pgd_entry = init_mm.pgd[index];
        if (!pgd_present(pgd_entry))
                return -1;

        pgd_dst = __va(PAGE_MASK & read_cr3());
        if (pgd_present(pgd_dst[index]))
                return -1;

        ACCESS_ONCE(pgd_dst[index]) = pgd_entry;
        return 0;
    }
    NOKPROBE_SYMBOL(vmalloc_fault);

and it's done.

Would anybody be willing to actually *test* something like the above?
The above may compile, but that's all the "testing" it got.

                    Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 17:34                                                                         ` Steven Rostedt
@ 2014-11-21 18:24                                                                           ` Linus Torvalds
  0 siblings, 0 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-21 18:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 9:34 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I see. You have issues with the use of current->active_mm instead of
> just doing a read_cr3() (and I'm sure other things).

Yes. And I have this memory of it actually mattering, where we'd get
the page fault, but see that the (wrong) page table is already
populated, and say "it wasn't a vmalloc fault", and then go down the
oops path.

Of course, the context switch itself has changed completely over the
years, but I think it would still be true with NMI. "active_mm" may
point to a different page table than the one the CPU is actually
using, and then the whole thing is bogus.

               Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 18:22                                                                           ` Linus Torvalds
@ 2014-11-21 18:28                                                                             ` Andy Lutomirski
  2014-11-21 19:06                                                                             ` Linus Torvalds
  1 sibling, 0 replies; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-21 18:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 10:22 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Nov 21, 2014 at 9:22 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Both mystify me.  Why does the 32-bit version walk down the hierarchy
>> at all instead of just touching the top level?
>
> Quite frankly, I think it's just due to historical reasons, and should
> be removed.
>
> But the historical reasons are that with the aliasing of the PUD and
> PMD entries in the PGD, it's all fairly confusing. So I think we only
> used to do the top level, but then when we expanded from two levels to
> three, that "top level" became the pmd, and then when we expanded from
> three to four, the pmd was actually two levels down. So it's all
> basically mindless work.
>
> So I do think we could simplify and unify things.
>
> In 32-bit mode, we actually have two different cases:
>
>  - in PAE, there's the magic top-level 4-entry PGD that always *has*
> to be present (the P bit isn't actually checked by hardware)
>
>     As a result, in PAE mode, the top PGD entries always exist, and
> are always prepopulated, and for the kernel area (including obviously
> the vmalloc space) always points to the init_pgd[] entry.
>
>     Ergo, in PAE mode, I don't think we should ever hit this case in
> the first place.
>
>  - in non-PAE mode, we should just copy the top-level entry, and return.
>
> And in 64-bit mode, we only have the "copy the top-level entry" case.
>
> So I think we should
>
>  (a) remove the 32-bit vs 64-bit difference, because that's not actually valid
>
>  (b) make it a PAE vs non-PAE difference
>
>  (c) the PAE case is a no-op
>
>  (d) the non-PAE case would look something like this:
>
>     static noinline int vmalloc_fault(unsigned long address)
>     {
>         unsigned index;
>         pgd_t *pgd_dst, pgd_entry;
>
>         /* Make sure we are in vmalloc area: */
>         if (!(address >= VMALLOC_START && address < VMALLOC_END))
>                 return -1;
>
>         index = pgd_index(address);
>         pgd_entry = init_mm.pgd[index];
>         if (!pgd_present(pgd_entry))
>                 return -1;
>
>         pgd_dst = __va(PAGE_MASK & read_cr3());
>         if (pgd_present(pgd_dst[index]))
>                 return -1;
>
>         ACCESS_ONCE(pgd_dst[index]) = pgd_entry;
>         return 0;
>     }
>     NOKPROBE_SYMBOL(vmalloc_fault);
>
> and it's done.
>
> Would anybody be willing to actually *test* something like the above?
> The above may compile, but that's all the "testing" it got.
>

I'd be happy to test it (i.e. boot it and try to use my computer), but
I have nowhere near enough RAM to do it right.

Is there any easy way to get the vmalloc code to randomize enough bits
to exercise this?

--Andy

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 18:22                                                                           ` Linus Torvalds
  2014-11-21 18:28                                                                             ` Andy Lutomirski
@ 2014-11-21 19:06                                                                             ` Linus Torvalds
  2014-11-21 19:23                                                                               ` Steven Rostedt
  2014-11-21 19:51                                                                               ` Thomas Gleixner
  1 sibling, 2 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-21 19:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 10:22 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>  (d) the non-PAE case would look something like this:
>
>     static noinline int vmalloc_fault(unsigned long address)
>     {
>         unsigned index;
>         pgd_t *pgd_dst, pgd_entry;
>
>         /* Make sure we are in vmalloc area: */
>         if (!(address >= VMALLOC_START && address < VMALLOC_END))
>                 return -1;

Side note: I think this is just unnecessary confusion, and generates
big constants for no good reason.

The thing is, the kernel PGD's should always be in sync. In fact, at
PGD allocation time, we just do

     clone_pgd_range(.. KERNEL_PGD_BOUNDARY, KERNEL_PGD_PTRS);

and it might actually be better to structure this to be that exact same thing.

So instead of checking the address, we could just do

        index = pgd_index(address);
        if (index < KERNEL_PGD_BOUNDARY)
                return -1;

which actually matches our initialization sequence much better anyway.
And avoids those random big constants.

Also, it turns out that this:

        if (pgd_present(pgd_dst[index]))

generates a crazy big constant because of bad compiler issues (the
"pgd_present()" thing only checks the low bit, but it does so on
pgd_flags(), which does "native_pgd_val(pgd) & PTE_FLAGS_MASK", so you
have an insane extra "and" with the constant 0xffffc00000000fff, just
to then "and" it again with "1").  It doesn't do that with the first
pgd_present() check, oddly enough.
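
For reference, the helpers in question look roughly like this in the
x86 headers (from memory, so the details may be slightly off):

  static inline pgdval_t pgd_flags(pgd_t pgd)
  {
          return native_pgd_val(pgd) & PTE_FLAGS_MASK;
  }

  static inline int pgd_present(pgd_t pgd)
  {
          return pgd_flags(pgd) & _PAGE_PRESENT;
  }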

WTF, gcc?

Anyway, even more importantly, because of the whole issue with nesting
page tables, it's probably best to actually avoid all the
"pgd_present()" etc helpers, because those might be hardcoded to 1
etc. So avoid the whole issue by just accessing the raw data.

Simplify, simplify, simplify. The actual code generation for this all
should be maybe 20 instructions.

Here's the simplified end result. Again, this is TOTALLY UNTESTED. I
compiled it and verified that the code generation looks like what I'd
have expected, but that's literally it.

  static noinline int vmalloc_fault(unsigned long address)
  {
        pgd_t *pgd_dst;
        pgdval_t pgd_entry;
        unsigned index = pgd_index(address);

        if (index < KERNEL_PGD_BOUNDARY)
                return -1;

        pgd_entry = init_mm.pgd[index].pgd;
        if (!pgd_entry)
                return -1;

        pgd_dst = __va(PAGE_MASK & read_cr3());
        pgd_dst += index;

        if (pgd_dst->pgd)
                return -1;

        ACCESS_ONCE(pgd_dst->pgd) = pgd_entry;
        return 0;
  }
  NOKPROBE_SYMBOL(vmalloc_fault);

Hmm? Does anybody see anything fundamentally wrong with this?

                     Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 19:06                                                                             ` Linus Torvalds
@ 2014-11-21 19:23                                                                               ` Steven Rostedt
  2014-11-21 19:34                                                                                 ` Linus Torvalds
  2014-11-21 19:51                                                                               ` Thomas Gleixner
  1 sibling, 1 reply; 486+ messages in thread
From: Steven Rostedt @ 2014-11-21 19:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, 21 Nov 2014 11:06:41 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
 
>   static noinline int vmalloc_fault(unsigned long address)
>   {
>         pgd_t *pgd_dst;
>         pgdval_t pgd_entry;
>         unsigned index = pgd_index(address);
> 
>         if (index < KERNEL_PGD_BOUNDARY)
>                 return -1;
> 
>         pgd_entry = init_mm.pgd[index].pgd;
>         if (!pgd_entry)
>                 return -1;

Should we at least check to see if it is present?

	if (!(pgd_entry & 1))
		return -1;

?

-- Steve

> 
>         pgd_dst = __va(PAGE_MASK & read_cr3());
>         pgd_dst += index;
> 
>         if (pgd_dst->pgd)
>                 return -1;
> 
>         ACCESS_ONCE(pgd_dst->pgd) = pgd_entry;
>         return 0;
>   }
>   NOKPROBE_SYMBOL(vmalloc_fault);
> 
> Hmm? Does anybody see anything fundamentally wrong with this?
> 
>                      Linus


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 19:23                                                                               ` Steven Rostedt
@ 2014-11-21 19:34                                                                                 ` Linus Torvalds
  2014-11-21 19:46                                                                                   ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-21 19:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 11:23 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Should we at least check to see if it is present?
>
>         if (!(pgd_entry & 1))
>                 return -1;

Maybe. But what other entry could there be?

But yes, returning -1 is "safe", since it basically says "I'm not
doing a vmalloc thing, oops if this is a bad access". So that kind of
argues for being as aggressive as possible in returning -1.

So for the first one (!pgd_entry), instead of returning -1 only for a
completely empty entry, returning it for any non-present case is
probably right.

And for the second one (where we check whether there is anything at
all in the destination), returning -1 for "anything but zero" is
probably the right thing to do.
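
In the sketch above, that first check would become something like
(again untested):

        pgd_entry = init_mm.pgd[index].pgd;
        if (!(pgd_entry & _PAGE_PRESENT))       /* any non-present entry */
                return -1;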

But in the end, if you have a corrupted top-level kernel page table,
it sounds to me like you're just royally screwed anyway. So I don't
think it matters *that* much.

So I kind of agree, but it wouldn't be my primary worry. My primary
worry is actually paravirt doing something insane.

                    Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 19:34                                                                                 ` Linus Torvalds
@ 2014-11-21 19:46                                                                                   ` Linus Torvalds
  2014-11-21 19:52                                                                                     ` Andy Lutomirski
  2014-11-21 20:00                                                                                     ` Dave Jones
  0 siblings, 2 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-21 19:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 11:34 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I kind of agree, but it wouldn't be my primary worry. My primary
> worry is actually paravirt doing something insane.

Btw, on that tangent, does anybody actually care about paravirt any more?

I'd love to start moving away from it. It makes a lot of the low-level
code completely impossible to follow due to the random indirection
through "native" vs "paravirt op table". Not just the page table
handling, it's all over.

Anybody who seriously does virtualization uses hw virtualization that
is much better than it used to be. And the non-serious users aren't
that performance-sensitive by definition.

I note that the Fedora kernel config seems to include paravirt by
default, so you get a lot of the crazy overheads..

                   Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 19:06                                                                             ` Linus Torvalds
  2014-11-21 19:23                                                                               ` Steven Rostedt
@ 2014-11-21 19:51                                                                               ` Thomas Gleixner
  2014-11-21 20:00                                                                                 ` Linus Torvalds
  2014-11-21 22:33                                                                                 ` Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-21 19:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Steven Rostedt, Tejun Heo, linux-kernel,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, 21 Nov 2014, Linus Torvalds wrote:
> Here's the simplified end result. Again, this is TOTALLY UNTESTED. I
> compiled it and verified that the code generation looks like what I'd
> have expected, but that's literally it.
> 
>   static noinline int vmalloc_fault(unsigned long address)
>   {
>         pgd_t *pgd_dst;
>         pgdval_t pgd_entry;
>         unsigned index = pgd_index(address);
> 
>         if (index < KERNEL_PGD_BOUNDARY)
>                 return -1;
> 
>         pgd_entry = init_mm.pgd[index].pgd;
>         if (!pgd_entry)
>                 return -1;
> 
>         pgd_dst = __va(PAGE_MASK & read_cr3());
>         pgd_dst += index;
> 
>         if (pgd_dst->pgd)
>                 return -1;
> 
>         ACCESS_ONCE(pgd_dst->pgd) = pgd_entry;

This will break paravirt. set_pgd/set_pmd are paravirt functions.
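
The paravirt-clean spelling of that last store would be something like
(untested):

        set_pgd(pgd_dst, __pgd(pgd_entry));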

But I'm fine with breaking it, then you just need to change
CONFIG_PARAVIRT to 'def_bool n'

Thanks,

	tglx




^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 19:46                                                                                   ` Linus Torvalds
@ 2014-11-21 19:52                                                                                     ` Andy Lutomirski
  2014-11-21 20:14                                                                                       ` Josh Boyer
  2014-11-21 20:00                                                                                     ` Dave Jones
  1 sibling, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-21 19:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Tejun Heo, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones,
	the arch/x86 maintainers

On Fri, Nov 21, 2014 at 11:46 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Nov 21, 2014 at 11:34 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> So I kind of agree, but it wouldn't be my primary worry. My primary
>> worry is actually paravirt doing something insane.
>
> Btw, on that tangent, does anybody actually care about paravirt any more?
>

Amazon, for better or for worse.

> I'd love to start moving away from it. It makes a lot of the low-level
> code completely impossible to follow due to the random indirection
> through "native" vs "paravirt op table". Not just the page table
> handling, it's all over.
>
> Anybody who seriously does virtualization uses hw virtualization that
> is much better than it used to be. And the non-serious users aren't
> that performance-sensitive by definition.
>
> I note that the Fedora kernel config seems to include paravirt by
> default, so you get a lot of the crazy overheads..

I think that there is a move toward deprecating Xen PV in favor of
PVH, but we're not there yet.

--Andy

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 19:51                                                                               ` Thomas Gleixner
@ 2014-11-21 20:00                                                                                 ` Linus Torvalds
  2014-11-21 20:16                                                                                   ` Thomas Gleixner
  2014-11-21 22:33                                                                                 ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-21 20:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Steven Rostedt, Tejun Heo, linux-kernel,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 11:51 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> This will break paravirt. set_pgd/set_pmd are paravirt functions.

I suspect we could use "set_pgd()" here instead of the direct access.
I didn't want to walk through all the levels to see exactly which
random op I needed to use.

                Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 19:46                                                                                   ` Linus Torvalds
  2014-11-21 19:52                                                                                     ` Andy Lutomirski
@ 2014-11-21 20:00                                                                                     ` Dave Jones
  2014-11-21 20:02                                                                                       ` Andy Lutomirski
  1 sibling, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-21 20:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Andy Lutomirski, Tejun Heo, linux-kernel,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Frederic Weisbecker, Don Zickus, the arch/x86 maintainers,
	Josh Boyer, Justin Forbes

On Fri, Nov 21, 2014 at 11:46:57AM -0800, Linus Torvalds wrote:
 
 > Anybody who seriously does virtualization uses hw virtualization that
 > is much better than it used to be. And the non-serious users aren't
 > that performance-sensitive by definition.
 > 
 > I note that the Fedora kernel config seems to include paravirt by
 > default, so you get a lot of the crazy overheads..

I'm not sure how many people actually use paravirt these days,
but the reason Fedora, at least, still has it enabled is probably
because..

config KVM_GUEST
         bool "KVM Guest support (including kvmclock)"
         depends on PARAVIRT

But tbh I've not looked at this stuff since it first got merged.
Will a full-virt kvm system boot a guest without KVM_GUEST enabled?
(ie, is this just an optimisation for the paravirt case?)

I'm not a heavy virt user, so I don't even remember how a lot of
this stuff is supposed to work.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 20:00                                                                                     ` Dave Jones
@ 2014-11-21 20:02                                                                                       ` Andy Lutomirski
  0 siblings, 0 replies; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-21 20:02 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Steven Rostedt, Andy Lutomirski,
	Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, the arch/x86 maintainers, Josh Boyer, Justin Forbes

On Fri, Nov 21, 2014 at 12:00 PM, Dave Jones <davej@redhat.com> wrote:
> On Fri, Nov 21, 2014 at 11:46:57AM -0800, Linus Torvalds wrote:
>
>  > Anybody who seriously does virtualization uses hw virtualization that
>  > is much better than it used to be. And the non-serious users aren't
>  > that performance-sensitive by definition.
>  >
>  > I note that the Fedora kernel config seems to include paravirt by
>  > default, so you get a lot of the crazy overheads..
>
> I'm not sure how many people actually use paravirt these days,
> but the reason Fedora has it enabled still at least is probably
> because..
>
> config KVM_GUEST
>          bool "KVM Guest support (including kvmclock)"
>          depends on PARAVIRT
>
> But tbh I've not looked at this stuff since it first got merged.
> Will a full-virt system kvm boot a guest without KVM_GUEST enabled ?
> (ie, is this just an optimisation for the paravirt case?)
>

It will boot just fine, although there may be some timing glitches.

I think we should have a PARAVIRT_LITE that's just enough for KVM.  That
probably involves some apic changes and nothing else.

--Andy

> I'm not a heavy virt user, so I don't even remember how a lot of
> this stuff is supposed to work.
>
>         Dave
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 19:52                                                                                     ` Andy Lutomirski
@ 2014-11-21 20:14                                                                                       ` Josh Boyer
  2014-11-21 20:16                                                                                         ` Andy Lutomirski
  0 siblings, 1 reply; 486+ messages in thread
From: Josh Boyer @ 2014-11-21 20:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Steven Rostedt, Tejun Heo, linux-kernel,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker, Don Zickus,
	Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 2:52 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Nov 21, 2014 at 11:46 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Fri, Nov 21, 2014 at 11:34 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> So I kind of agree, but it wouldn't be my primary worry. My primary
>>> worry is actually paravirt doing something insane.
>>
>> Btw, on that tangent, does anybody actually care about paravirt any more?
>>
>
> Amazon, for better or for worse.
>
>> I'd love to start moving away from it. It makes a lot of the low-level
>> code completely impossible to follow due to the random indirection
>> through "native" vs "paravirt op table". Not just the page table
>> handling, it's all over.
>>
>> Anybody who seriously does virtualization uses hw virtualization that
>> is much better than it used to be. And the non-serious users aren't
>> that performance-sensitive by definition.
>>
>> I note that the Fedora kernel config seems to include paravirt by
>> default, so you get a lot of the crazy overheads..
>
> I think that there is a move toward deprecating Xen PV in favor of
> PVH, but we're not there yet.

A move where?  The Xen stuff in Fedora is ... not paid attention to
very much.  If there's something we should be looking at turning off
(or on), we're happy to take suggestions.

josh

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 20:14                                                                                       ` Josh Boyer
@ 2014-11-21 20:16                                                                                         ` Andy Lutomirski
  2014-11-21 20:23                                                                                           ` Josh Boyer
  0 siblings, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-21 20:16 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Linus Torvalds, Steven Rostedt, Tejun Heo, linux-kernel,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker, Don Zickus,
	Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 12:14 PM, Josh Boyer <jwboyer@fedoraproject.org> wrote:
> On Fri, Nov 21, 2014 at 2:52 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, Nov 21, 2014 at 11:46 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Fri, Nov 21, 2014 at 11:34 AM, Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>>
>>>> So I kind of agree, but it wouldn't be my primary worry. My primary
>>>> worry is actually paravirt doing something insane.
>>>
>>> Btw, on that tangent, does anybody actually care about paravirt any more?
>>>
>>
>> Amazon, for better or for worse.
>>
>>> I'd love to start moving away from it. It makes a lot of the low-level
>>> code completely impossible to follow due to the random indirection
>>> through "native" vs "paravirt op table". Not just the page table
>>> handling, it's all over.
>>>
>>> Anybody who seriously does virtualization uses hw virtualization that
>>> is much better than it used to be. And the non-serious users aren't
>>> that performance-sensitive by definition.
>>>
>>> I note that the Fedora kernel config seems to include paravirt by
>>> default, so you get a lot of the crazy overheads..
>>
>> I think that there is a move toward deprecating Xen PV in favor of
>> PVH, but we're not there yet.
>
> A move where?  The Xen stuff in Fedora is ... not paid attention to
> very much.  If there's something we should be looking at turning off
> (or on), we're happy to take suggestions.

A move in the Xen project.  As I understand it, Xen wants to deprecate
PV in favor of PVH, but PVH is still experimental.

I think that dropping PARAVIRT in Fedora might be a bad idea for
several more releases, since that's likely to break the EC2 images.

--Andy

>
> josh



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 20:00                                                                                 ` Linus Torvalds
@ 2014-11-21 20:16                                                                                   ` Thomas Gleixner
  2014-11-21 20:41                                                                                     ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-21 20:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Steven Rostedt, Tejun Heo, linux-kernel,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, 21 Nov 2014, Linus Torvalds wrote:

> On Fri, Nov 21, 2014 at 11:51 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > This will break paravirt. set_pgd/set_pmd are paravirt functions.
> 
> I suspect we could use "set_pgd()" here instead of the direct access.
> I didn't want to walk through all the levels to see exactly which
> random op I needed to use.

I don't think that works on 32bit. See the magic in
vmalloc_sync_one().

Thanks,

	tglx




^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 20:16                                                                                         ` Andy Lutomirski
@ 2014-11-21 20:23                                                                                           ` Josh Boyer
  2014-11-24 18:48                                                                                             ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 486+ messages in thread
From: Josh Boyer @ 2014-11-21 20:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Steven Rostedt, Tejun Heo, linux-kernel,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker, Don Zickus,
	Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 3:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Nov 21, 2014 at 12:14 PM, Josh Boyer <jwboyer@fedoraproject.org> wrote:
>> On Fri, Nov 21, 2014 at 2:52 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Fri, Nov 21, 2014 at 11:46 AM, Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>> On Fri, Nov 21, 2014 at 11:34 AM, Linus Torvalds
>>>> <torvalds@linux-foundation.org> wrote:
>>>>>
>>>>> So I kind of agree, but it wouldn't be my primary worry. My primary
>>>>> worry is actually paravirt doing something insane.
>>>>
>>>> Btw, on that tangent, does anybody actually care about paravirt any more?
>>>>
>>>
>>> Amazon, for better or for worse.
>>>
>>>> I'd love to start moving away from it. It makes a lot of the low-level
>>>> code completely impossible to follow due to the random indirection
>>>> through "native" vs "paravirt op table". Not just the page table
>>>> handling, it's all over.
>>>>
>>>> Anybody who seriously does virtualization uses hw virtualization that
>>>> is much better than it used to be. And the non-serious users aren't
>>>> that performance-sensitive by definition.
>>>>
>>>> I note that the Fedora kernel config seems to include paravirt by
>>>> default, so you get a lot of the crazy overheads..
>>>
>>> I think that there is a move toward deprecating Xen PV in favor of
>>> PVH, but we're not there yet.
>>
>> A move where?  The Xen stuff in Fedora is ... not paid attention to
>> very much.  If there's something we should be looking at turning off
>> (or on), we're happy to take suggestions.
>
> A move in the Xen project.  As I understand it, Xen wants to deprecate
> PV in favor of PVH, but PVH is still experimental.

OK.

> I think that dropping PARAVIRT in Fedora might be a bad idea for
> several more releases, since that's likely to break the EC2 images.

Yes, that's essentially the only reason we haven't looked at disabling
Xen completely for a while now, so <sad trombone>.

josh

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 20:16                                                                                   ` Thomas Gleixner
@ 2014-11-21 20:41                                                                                     ` Linus Torvalds
  2014-11-21 21:11                                                                                       ` Thomas Gleixner
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-21 20:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Steven Rostedt, Tejun Heo, linux-kernel,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 12:16 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> I don't think that works on 32bit. See the magic in
> vmalloc_sync_one().

Heh. I guess we could just add a wrapper around this crap, and make it
very clear that the paravirt case is a horrible horrible hack.

Something like

   #define set_one_pgd_entry(entry,pgdp) (pgdp)->pgd = (entry)

for the regular case, and then for paravirt we do something very
explicitly horrid, like

   #ifdef CONFIG_PARAVIRT
   #ifdef CONFIG_X86_32
   // The pmd is the top-level page directory on non-PAE x86, nested inside pgd/pud
   #define set_one_pgd_entry(entry, pgdp) set_pmd((pmd_t *)(pgdp), (pmd_t) { entry })
   #else
   #define set_one_pgd_entry(entry, pgdp) \
       do { set_pgd(pgdp, (pgd_t) { entry }); arch_flush_lazy_mmu_mode(); } while (0)
   #endif
   #endif

because on x86-64, there seems to be that whole lazy_mode pv_ops
craziness (which I'm not at all convinced is needed here, but that's
what the current code does).

                Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 20:41                                                                                     ` Linus Torvalds
@ 2014-11-21 21:11                                                                                       ` Thomas Gleixner
  2014-11-21 22:55                                                                                         ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-21 21:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Steven Rostedt, Tejun Heo, linux-kernel,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, 21 Nov 2014, Linus Torvalds wrote:
> On Fri, Nov 21, 2014 at 12:16 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > I don't think that works on 32bit. See the magic in
> > vmalloc_sync_one().
> 
> Heh. I guess we could just add a wrapper around this crap, and make it
> very clear that the paravirt case is a horrible horrible hack.
> 
> Something like
> 
>    #define set_one_pgd_entry(entry,pgdp) (pgdp)->pgd = (entry)
> 
> for the regular case, and then for paravirt we do something very
> explicitly horrid, like
> 
>    #ifdef CONFIG_PARAVIRT
>    #ifdef CONFIG_X86_32
>    // The pmd is the top-level page directory on non-PAE x86, nested inside pgd/pud
>    #define set_one_pgd_entry(entry, pgdp) set_pmd((pmd_t *)(pgdp), (pmd_t) { entry })
>    #else
>    #define set_one_pgd_entry(entry, pgdp) \
>        do { set_pgd(pgdp, (pgd_t) { entry }); arch_flush_lazy_mmu_mode(); } while (0)
>    #endif
>    #endif
> 
> because on x86-64, there seems to be that whole lazy_mode pv_ops
> craziness (which I'm not at all convinced is needed here, but that's
> what the current code does).

I'm fine with that. I just think it's not horrid enough, but that can
be fixed easily :)

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 17:01                                                               ` Steven Rostedt
  2014-11-21 17:11                                                                 ` Steven Rostedt
@ 2014-11-21 21:32                                                                 ` Frederic Weisbecker
  2014-11-21 21:34                                                                   ` Andy Lutomirski
  1 sibling, 1 reply; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-21 21:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tejun Heo, Thomas Gleixner, Linus Torvalds, Dave Jones,
	Don Zickus, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra, Andy Lutomirski, Arnaldo Carvalho de Melo

On Fri, Nov 21, 2014 at 12:01:51PM -0500, Steven Rostedt wrote:
> On Fri, Nov 21, 2014 at 11:25:06AM -0500, Tejun Heo wrote:
> > 
> > * Static percpu areas wouldn't trigger fault lazily.  Note that this
> >   is not necessarily because the first percpu chunk which contains the
> >   static area is embedded inside the kernel linear mapping.  Depending
> >   on the memory layout and boot param, percpu allocator may choose to
> >   map the first chunk in vmalloc space too; however, this still works
> >   out fine because at that point there are no other page tables and
> >   the PUD entries covering the first chunk are faulted in before other
> >   page tables are copied from the kernel one.
> 
> That sounds correct.
> 
> > 
> > * NMI used to be a problem because vmalloc fault handler couldn't
> >   safely nest inside NMI handler but this has been fixed since and it
> >   should work fine from NMI handlers now.
> 
> Right. Of course "should work fine" does not exactly mean "will work fine".
> 
> 
> > 
> > * Function tracers are problematic because they may end up nesting
> >   inside themselves through triggering a vmalloc fault while accessing
> >   dynamic percpu memory area.  This may lead to recursive locking and
> >   other surprises.
> 
> The function tracer infrastructure now has a recursion check that happens
> rather early in the call. Unless the registered OPS specifically states
> it handles recursions (FTRACE_OPS_FL_RECURSION_SAFE), ftrace will add the
> necessary recursion checks. If a registered OPS lies about being recursion
> safe, well we can't stop suicide.

Same if the recursion state is based on per cpu memory.

> 
> Looking at kernel/trace/trace_functions.c: function_trace_call() which is
> registered with RECURSION_SAFE, I see that the recursion check is done
> before the per_cpu_ptr() call to the dynamically allocated per_cpu data.
> 
> It looks OK, but...
> 
> Oh! but if we trace the page fault handler, and we fault here too
> we just nuked the cr2 register. Not good.

If we fault in the page fault handler, we double fault and apparently
recovering from that isn't quite expected anyway.

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 21:32                                                                 ` Frederic Weisbecker
@ 2014-11-21 21:34                                                                   ` Andy Lutomirski
  2014-11-21 21:50                                                                     ` Frederic Weisbecker
  0 siblings, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-21 21:34 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Steven Rostedt, Tejun Heo, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra, Arnaldo Carvalho de Melo

On Fri, Nov 21, 2014 at 1:32 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Fri, Nov 21, 2014 at 12:01:51PM -0500, Steven Rostedt wrote:
>> On Fri, Nov 21, 2014 at 11:25:06AM -0500, Tejun Heo wrote:
>> >
>> > * Static percpu areas wouldn't trigger fault lazily.  Note that this
>> >   is not necessarily because the first percpu chunk which contains the
>> >   static area is embedded inside the kernel linear mapping.  Depending
>> >   on the memory layout and boot param, percpu allocator may choose to
>> >   map the first chunk in vmalloc space too; however, this still works
>> >   out fine because at that point there are no other page tables and
>> >   the PUD entries covering the first chunk are faulted in before other
>> >   page tables are copied from the kernel one.
>>
>> That sounds correct.
>>
>> >
>> > * NMI used to be a problem because vmalloc fault handler couldn't
>> >   safely nest inside NMI handler but this has been fixed since and it
>> >   should work fine from NMI handlers now.
>>
>> Right. Of course "should work fine" does not exactly mean "will work fine".
>>
>>
>> >
>> > * Function tracers are problematic because they may end up nesting
>> >   inside themselves through triggering a vmalloc fault while accessing
>> >   dynamic percpu memory area.  This may lead to recursive locking and
>> >   other surprises.
>>
>> The function tracer infrastructure now has a recursion check that happens
>> rather early in the call. Unless the registered OPS specifically states
>> it handles recursions (FTRACE_OPS_FL_RECURSION_SAFE), ftrace will add the
>> necessary recursion checks. If a registered OPS lies about being recursion
>> safe, well we can't stop suicide.
>
> Same if the recursion state is based on per cpu memory.
>
>>
>> Looking at kernel/trace/trace_functions.c: function_trace_call() which is
>> registered with RECURSION_SAFE, I see that the recursion check is done
>> before the per_cpu_ptr() call to the dynamically allocated per_cpu data.
>>
>> It looks OK, but...
>>
>> Oh! but if we trace the page fault handler, and we fault here too
>> we just nuked the cr2 register. Not good.
>
> If we fault in the page fault handler, we double fault and apparently
> recovering from that isn't quite expected anyway.

Not quite.  We only double fault if we fault while pushing the
hardware part of the state onto the stack.  That happens even before
the entry asm gets run.

Otherwise if we have a page fault inside do_page_fault, it's just a
nested page fault.

--Andy


-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 16:25                                                             ` Tejun Heo
  2014-11-21 17:01                                                               ` Steven Rostedt
@ 2014-11-21 21:44                                                               ` Frederic Weisbecker
  2014-11-22  0:11                                                                 ` Tejun Heo
  1 sibling, 1 reply; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-21 21:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Andy Lutomirski, Arnaldo Carvalho de Melo

On Fri, Nov 21, 2014 at 11:25:06AM -0500, Tejun Heo wrote:
> Hello, Frederic.
> 
> On Fri, Nov 21, 2014 at 03:13:35PM +0100, Frederic Weisbecker wrote:
> ...
> > So when the issue arose 4 years ago, it was a problem only for NMIs.
> > Like Linus says: "what happens in NMI stays in NMI". Ok no that's not quite
> > what he says :-)  But NMIs happen to be a corner case for just about everything
> > and it's sometimes better to fix things from NMI itself, or have an NMI
> > special case rather than grow the whole infrastructure in complexity to
> > support this very corner case.
> 
> I'm not familiar with the innards of fault handling, so can you please
> help me understand what may actually break?  Here is what I currently
> understand.
> 
> * Static percpu areas wouldn't trigger fault lazily.  Note that this
>   is not necessarily because the first percpu chunk which contains the
>   static area is embedded inside the kernel linear mapping.  Depending
>   on the memory layout and boot param, percpu allocator may choose to
>   map the first chunk in vmalloc space too; however, this still works
>   out fine because at that point there are no other page tables and
>   the PUD entries covering the first chunk are faulted in before other
>   page tables are copied from the kernel one.
> 
> * NMI used to be a problem because vmalloc fault handler couldn't
>   safely nest inside NMI handler but this has been fixed since and it
>   should work fine from NMI handlers now.
> 
> * Function tracers are problematic because they may end up nesting
>   inside themselves through triggering a vmalloc fault while accessing
>   dynamic percpu memory area.  This may lead to recursive locking and
>   other surprises.
> 
> Are there other cases where the lazy vmalloc faults can break things?

I fear that enumerating and fixing the existing issues won't be enough.
We can't find all the code sites out there which rely on not
faulting.

The best would be to fix that from the percpu allocator itself, or vmalloc.
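One loose sketch of that direction, modeled on the pgd_list walk that
vmalloc_sync_all() does (see the patch later in this thread); the locking
subtleties and the 32-bit/Xen cases are glossed over, and the helper name
is made up:

/* Hypothetical helper, not an actual kernel function: eagerly push a
 * freshly installed init_mm PGD entry into every pgd in the system,
 * so vmalloc/percpu accesses never need the lazy fault path at all. */
static void sync_new_kernel_pgd(unsigned long address)
{
	unsigned int index = pgd_index(address);
	struct page *page;

	spin_lock(&pgd_lock);
	list_for_each_entry(page, &pgd_list, lru) {
		pgd_t *pgd = (pgd_t *)page_address(page) + index;

		if (pgd_none(*pgd))
			set_pgd(pgd, init_mm.pgd[index]);
	}
	spin_unlock(&pgd_lock);
}

The cost is walking every pgd whenever a new top-level entry appears, but
top-level entries are installed rarely, which is what makes the eager-sync
direction plausible.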

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 21:34                                                                   ` Andy Lutomirski
@ 2014-11-21 21:50                                                                     ` Frederic Weisbecker
  2014-11-21 22:45                                                                       ` Steven Rostedt
  0 siblings, 1 reply; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-21 21:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, Tejun Heo, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra, Arnaldo Carvalho de Melo

On Fri, Nov 21, 2014 at 01:34:08PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 1:32 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > On Fri, Nov 21, 2014 at 12:01:51PM -0500, Steven Rostedt wrote:
> >> On Fri, Nov 21, 2014 at 11:25:06AM -0500, Tejun Heo wrote:
> >> >
> >> > * Static percpu areas wouldn't trigger fault lazily.  Note that this
> >> >   is not necessarily because the first percpu chunk which contains the
> >> >   static area is embedded inside the kernel linear mapping.  Depending
> >> >   on the memory layout and boot param, percpu allocator may choose to
> >> >   map the first chunk in vmalloc space too; however, this still works
> >> >   out fine because at that point there are no other page tables and
> >> >   the PUD entries covering the first chunk are faulted in before other
> >> >   page tables are copied from the kernel one.
> >>
> >> That sounds correct.
> >>
> >> >
> >> > * NMI used to be a problem because vmalloc fault handler couldn't
> >> >   safely nest inside NMI handler but this has been fixed since and it
> >> >   should work fine from NMI handlers now.
> >>
> >> Right. Of course "should work fine" does not exactly mean "will work fine".
> >>
> >>
> >> >
> >> > * Function tracers are problematic because they may end up nesting
> >> >   inside themselves through triggering a vmalloc fault while accessing
> >> >   dynamic percpu memory area.  This may lead to recursive locking and
> >> >   other surprises.
> >>
> >> The function tracer infrastructure now has a recursion check that happens
> >> rather early in the call. Unless the registered OPS specifically states
> >> it handles recursions (FTRACE_OPS_FL_RECURSION_SAFE), ftrace will add the
> >> necessary recursion checks. If a registered OPS lies about being recursion
> >> safe, well we can't stop suicide.
> >
> > Same if the recursion state is based on per cpu memory.
> >
> >>
> >> Looking at kernel/trace/trace_functions.c: function_trace_call() which is
> >> registered with RECURSION_SAFE, I see that the recursion check is done
> >> before the per_cpu_ptr() call to the dynamically allocated per_cpu data.
> >>
> >> It looks OK, but...
> >>
> >> Oh! but if we trace the page fault handler, and we fault here too
> >> we just nuked the cr2 register. Not good.
> >
> > If we fault in the page fault handler, we double fault and apparently
> > recovering from that isn't quite expected anyway.
> 
> Not quite.  We only double fault if we fault while pushing the
> hardware part of the state onto the stack.  That happens even before
> the entry asm gets run.
> 
> Otherwise if we have a page fault inside do_page_fault, it's just a
> nested page fault.

Oh ok!

But we still have the cr2 issue that Steve talked about.

> 
> --Andy
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 16:38                                                                 ` Andy Lutomirski
  2014-11-21 16:48                                                                   ` Linus Torvalds
@ 2014-11-21 22:10                                                                   ` Frederic Weisbecker
  1 sibling, 0 replies; 486+ messages in thread
From: Frederic Weisbecker @ 2014-11-21 22:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tejun Heo, linux-kernel, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Linus Torvalds,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 08:38:07AM -0800, Andy Lutomirski wrote:
> On Nov 21, 2014 8:27 AM, "Tejun Heo" <tj@kernel.org> wrote:
> >
> > Hello, Andy.
> >
> > On Thu, Nov 20, 2014 at 03:55:09PM -0800, Andy Lutomirski wrote:
> > > That doesn't appear to have anything to with nmi though, right?
> >
> > I thought that was the main offender but, apparently, not any more.
> >
> > > Wouldn't this issue be fixed by moving the vmalloc_fault check into
> > > do_page_fault before exception_enter?
> >
> > Can you please elaborate why that'd fix the issue?  I'm not
> > intimiately familiar with the fault handling so it'd be great if you
> > can give me some pointers in terms of where to look at.
> 
> do_page_fault is called directly from asm.  It does:
> 
>     prev_state = exception_enter();
>     __do_page_fault(regs, error_code, address);
>     exception_exit(prev_state);
> 
> The vmalloc fixup is in __do_page_fault.
> 
> exception_enter does various accounting and tracing things, and I
> think that the recursion in stack trace I saw was in exception_enter.
> 
> If you move the vmalloc fixup before exception_enter() and return if
> the fault was from vmalloc, then you can't recurse.  You need to be
> careful not to touch anything that uses RCU before exception_enter,
> though.

That fixes the exception_enter() recursion but surely more issues with
per cpu memory faults are lurking somewhere now or in the future.

I'm going to add recursion protection to user_exit()/user_enter() anyway.
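A minimal sketch of what such a guard might look like (not the actual
patch); the key point, tying back to Tejun's first bullet, is that the
flag lives in static percpu storage, which never vmalloc-faults, so
testing it cannot itself recurse:

/* Sketch of a recursion guard for the context tracking hooks. */
static DEFINE_PER_CPU(int, context_tracking_recursion);

void user_enter(void)
{
	if (__this_cpu_read(context_tracking_recursion))
		return;			/* re-entered via a fault: bail */

	__this_cpu_write(context_tracking_recursion, 1);
	context_tracking_user_enter();	/* the real work */
	__this_cpu_write(context_tracking_recursion, 0);
}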

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 19:51                                                                               ` Thomas Gleixner
  2014-11-21 20:00                                                                                 ` Linus Torvalds
@ 2014-11-21 22:33                                                                                 ` Konrad Rzeszutek Wilk
  2014-11-22  1:17                                                                                   ` Thomas Gleixner
  1 sibling, 1 reply; 486+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-21 22:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Andy Lutomirski, Steven Rostedt, Tejun Heo,
	linux-kernel, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Frederic Weisbecker, Don Zickus, Dave Jones,
	the arch/x86 maintainers, xen-devel

On Fri, Nov 21, 2014 at 08:51:43PM +0100, Thomas Gleixner wrote:
> On Fri, 21 Nov 2014, Linus Torvalds wrote:
> > Here's the simplified end result. Again, this is TOTALLY UNTESTED. I
> > compiled it and verified that the code generation looks like what I'd
> > have expected, but that's literally it.
> > 
> >   static noinline int vmalloc_fault(unsigned long address)
> >   {
> >         pgd_t *pgd_dst;
> >         pgdval_t pgd_entry;
> >         unsigned index = pgd_index(address);
> > 
> >         if (index < KERNEL_PGD_BOUNDARY)
> >                 return -1;
> > 
> >         pgd_entry = init_mm.pgd[index].pgd;
> >         if (!pgd_entry)
> >                 return -1;
> > 
> >         pgd_dst = __va(PAGE_MASK & read_cr3());
> >         pgd_dst += index;
> > 
> >         if (pgd_dst->pgd)
> >                 return -1;
> > 
> >         ACCESS_ONCE(pgd_dst->pgd) = pgd_entry;
> 
> This will break paravirt. set_pgd/set_pmd are paravirt functions.
> 
> But I'm fine with breaking it, then you just need to change
> CONFIG_PARAVIRT to 'def_bool n'

That is not very nice.

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 21:50                                                                     ` Frederic Weisbecker
@ 2014-11-21 22:45                                                                       ` Steven Rostedt
  0 siblings, 0 replies; 486+ messages in thread
From: Steven Rostedt @ 2014-11-21 22:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andy Lutomirski, Tejun Heo, Thomas Gleixner, Linus Torvalds,
	Dave Jones, Don Zickus, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra, Arnaldo Carvalho de Melo

On Fri, 21 Nov 2014 22:50:41 +0100
Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > Otherwise if we have a page fault inside do_page_fault, it's just a
> > nested page fault.
> 
> Oh ok!
> 
> But we still have the cr2 issue that Steve talked about.
>

Nope. As I looked at the code, I noticed that do_page_fault(), the
wrapper for __do_page_fault(), isn't traced while __do_page_fault() is.
And do_page_fault() saves off the cr2 before calling anything else.

So we are ok in this respect as well.
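For reference, this is roughly the shape of the 3.18-era code (a
simplified sketch of arch/x86/mm/fault.c, not a verbatim copy); it also
marks where the vmalloc fixup suggested earlier in the thread would slot
in:

/* The untraced outer wrapper latches CR2 before any traceable, and
 * thus potentially faulting, code can run and clobber it. */
dotraplinkage void notrace
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
	unsigned long address = read_cr2();	/* grab CR2 first */
	enum ctx_state prev_state;

	/* (Andy's suggested vmalloc fixup would go here, before
	 * exception_enter(), so it could not recurse through the
	 * context tracking code.) */

	prev_state = exception_enter();
	__do_page_fault(regs, error_code, address);	/* this one is traced */
	exception_exit(prev_state);
}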

-- Steve

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 21:11                                                                                       ` Thomas Gleixner
@ 2014-11-21 22:55                                                                                         ` Linus Torvalds
  2014-11-21 23:03                                                                                           ` Andy Lutomirski
  2014-12-16 19:28                                                                                           ` Peter Zijlstra
  0 siblings, 2 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-21 22:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Steven Rostedt, Tejun Heo, linux-kernel,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

[-- Attachment #1: Type: text/plain, Size: 1244 bytes --]

On Fri, Nov 21, 2014 at 1:11 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> I'm fine with that. I just think it's not horrid enough, but that can
> be fixed easily :)

Oh, I think it's plenty horrid.

Anyway, here's an actual patch. As usual, it has seen absolutely no
actual testing, but I did try to make sure it compiles and seems to do
the right thing on:
 - x86-32 no-PAE
 - x86-32 no-PAE with PARAVIRT
 - x86-32 PAE
 - x86-64

also, I just removed the noise that is "vmalloc_sync_all()", since
it's just all garbage and nothing actually uses it. Yeah, it's used by
"register_die_notifier()", which makes no sense what-so-ever.
Whatever. It's gone.

Can somebody actually *test* this? In particular, in any kind of real
paravirt environment? Or, any comments even without testing?

I *really* am not proud of the mess wrt the whole

  #ifdef CONFIG_PARAVIRT
  #ifdef CONFIG_X86_32
    ...

but I think that from a long-term perspective, we're actually better
off with this kind of really ugly - but very explicit - hack that very
clearly shows what is going on.

The old code that actually "walked" the page tables was more
"portable", but was somewhat misleading about what was actually going
on.

Comments?

                   Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 6316 bytes --]

 arch/x86/mm/fault.c | 243 +++++++++++++---------------------------------------
 1 file changed, 58 insertions(+), 185 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index d973e61e450d..4b0a1b9404b1 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -42,6 +42,64 @@ enum x86_pf_error_code {
 };
 
 /*
+ * Handle a possible vmalloc fault. We just copy the
+ * top-level page table entry if necessary.
+ *
+ * With PAE, the top-most pgd entry is always shared,
+ * and that's where the vmalloc area is.  So PAE had
+ * better never have any vmalloc faults.
+ *
+ * NOTE! This on purpose does *NOT* use pgd_present()
+ * and such generic accessor functions, because
+ * the pgd may contain a folded pud/pmd, and is thus
+ * always "present". We access the actual hardware
+ * state directly, except for the final "set_pgd()"
+ * that may go through a paravirtualization layer.
+ *
+ * Also note the disgusting hackery for the whole
+ * paravirtualization case. Since PAE isn't an issue,
+ * we know that the pmd is the top level, and we just
+ * short-circuit it all.
+ *
+ * We *seriously* need to get rid of the crazy
+ * paravirtualization crud.
+ */
+static nokprobe_inline int vmalloc_fault(unsigned long address)
+{
+#ifdef CONFIG_X86_PAE
+	return -1;
+#else
+	pgd_t *pgd_dst, pgd_entry;
+	unsigned index = pgd_index(address);
+
+	if (index < KERNEL_PGD_BOUNDARY)
+		 return -1;
+
+	pgd_entry = init_mm.pgd[index];
+	if (!(pgd_entry.pgd & 1))
+		return -1;
+
+	pgd_dst = __va(PAGE_MASK & read_cr3());
+	pgd_dst += index;
+
+	if (pgd_dst->pgd)
+		return -1;
+
+#ifdef CONFIG_PARAVIRT
+#ifdef CONFIG_X86_32
+	set_pmd((pmd_t *)pgd_dst, (pmd_t){(pud_t){pgd_entry}});
+#else
+	set_pgd(pgd_dst, pgd_entry);
+	arch_flush_lazy_mmu_mode(); // WTF?
+#endif
+#else
+	*pgd_dst = pgd_entry;
+#endif
+	return 0;
+#endif
+}
+
+/*
  * Returns 0 if mmiotrace is disabled, or if the fault is not
  * handled by mmiotrace:
  */
@@ -189,110 +247,6 @@ DEFINE_SPINLOCK(pgd_lock);
 LIST_HEAD(pgd_list);
 
 #ifdef CONFIG_X86_32
-static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
-{
-	unsigned index = pgd_index(address);
-	pgd_t *pgd_k;
-	pud_t *pud, *pud_k;
-	pmd_t *pmd, *pmd_k;
-
-	pgd += index;
-	pgd_k = init_mm.pgd + index;
-
-	if (!pgd_present(*pgd_k))
-		return NULL;
-
-	/*
-	 * set_pgd(pgd, *pgd_k); here would be useless on PAE
-	 * and redundant with the set_pmd() on non-PAE. As would
-	 * set_pud.
-	 */
-	pud = pud_offset(pgd, address);
-	pud_k = pud_offset(pgd_k, address);
-	if (!pud_present(*pud_k))
-		return NULL;
-
-	pmd = pmd_offset(pud, address);
-	pmd_k = pmd_offset(pud_k, address);
-	if (!pmd_present(*pmd_k))
-		return NULL;
-
-	if (!pmd_present(*pmd))
-		set_pmd(pmd, *pmd_k);
-	else
-		BUG_ON(pmd_page(*pmd) != pmd_page(*pmd_k));
-
-	return pmd_k;
-}
-
-void vmalloc_sync_all(void)
-{
-	unsigned long address;
-
-	if (SHARED_KERNEL_PMD)
-		return;
-
-	for (address = VMALLOC_START & PMD_MASK;
-	     address >= TASK_SIZE && address < FIXADDR_TOP;
-	     address += PMD_SIZE) {
-		struct page *page;
-
-		spin_lock(&pgd_lock);
-		list_for_each_entry(page, &pgd_list, lru) {
-			spinlock_t *pgt_lock;
-			pmd_t *ret;
-
-			/* the pgt_lock only for Xen */
-			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
-
-			spin_lock(pgt_lock);
-			ret = vmalloc_sync_one(page_address(page), address);
-			spin_unlock(pgt_lock);
-
-			if (!ret)
-				break;
-		}
-		spin_unlock(&pgd_lock);
-	}
-}
-
-/*
- * 32-bit:
- *
- *   Handle a fault on the vmalloc or module mapping area
- */
-static noinline int vmalloc_fault(unsigned long address)
-{
-	unsigned long pgd_paddr;
-	pmd_t *pmd_k;
-	pte_t *pte_k;
-
-	/* Make sure we are in vmalloc area: */
-	if (!(address >= VMALLOC_START && address < VMALLOC_END))
-		return -1;
-
-	WARN_ON_ONCE(in_nmi());
-
-	/*
-	 * Synchronize this task's top level page-table
-	 * with the 'reference' page table.
-	 *
-	 * Do _not_ use "current" here. We might be inside
-	 * an interrupt in the middle of a task switch..
-	 */
-	pgd_paddr = read_cr3();
-	pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);
-	if (!pmd_k)
-		return -1;
-
-	pte_k = pte_offset_kernel(pmd_k, address);
-	if (!pte_present(*pte_k))
-		return -1;
-
-	return 0;
-}
-NOKPROBE_SYMBOL(vmalloc_fault);
-
 /*
  * Did it hit the DOS screen memory VA from vm86 mode?
  */
@@ -347,87 +301,6 @@ out:
 
 #else /* CONFIG_X86_64: */
 
-void vmalloc_sync_all(void)
-{
-	sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END, 0);
-}
-
-/*
- * 64-bit:
- *
- *   Handle a fault on the vmalloc area
- *
- * This assumes no large pages in there.
- */
-static noinline int vmalloc_fault(unsigned long address)
-{
-	pgd_t *pgd, *pgd_ref;
-	pud_t *pud, *pud_ref;
-	pmd_t *pmd, *pmd_ref;
-	pte_t *pte, *pte_ref;
-
-	/* Make sure we are in vmalloc area: */
-	if (!(address >= VMALLOC_START && address < VMALLOC_END))
-		return -1;
-
-	WARN_ON_ONCE(in_nmi());
-
-	/*
-	 * Copy kernel mappings over when needed. This can also
-	 * happen within a race in page table update. In the later
-	 * case just flush:
-	 */
-	pgd = pgd_offset(current->active_mm, address);
-	pgd_ref = pgd_offset_k(address);
-	if (pgd_none(*pgd_ref))
-		return -1;
-
-	if (pgd_none(*pgd)) {
-		set_pgd(pgd, *pgd_ref);
-		arch_flush_lazy_mmu_mode();
-	} else {
-		BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
-	}
-
-	/*
-	 * Below here mismatches are bugs because these lower tables
-	 * are shared:
-	 */
-
-	pud = pud_offset(pgd, address);
-	pud_ref = pud_offset(pgd_ref, address);
-	if (pud_none(*pud_ref))
-		return -1;
-
-	if (pud_none(*pud) || pud_page_vaddr(*pud) != pud_page_vaddr(*pud_ref))
-		BUG();
-
-	pmd = pmd_offset(pud, address);
-	pmd_ref = pmd_offset(pud_ref, address);
-	if (pmd_none(*pmd_ref))
-		return -1;
-
-	if (pmd_none(*pmd) || pmd_page(*pmd) != pmd_page(*pmd_ref))
-		BUG();
-
-	pte_ref = pte_offset_kernel(pmd_ref, address);
-	if (!pte_present(*pte_ref))
-		return -1;
-
-	pte = pte_offset_kernel(pmd, address);
-
-	/*
-	 * Don't use pte_page here, because the mappings can point
-	 * outside mem_map, and the NUMA hash lookup cannot handle
-	 * that:
-	 */
-	if (!pte_present(*pte) || pte_pfn(*pte) != pte_pfn(*pte_ref))
-		BUG();
-
-	return 0;
-}
-NOKPROBE_SYMBOL(vmalloc_fault);
-
 #ifdef CONFIG_CPU_SUP_AMD
 static const char errata93_warning[] =
 KERN_ERR 

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 22:55                                                                                         ` Linus Torvalds
@ 2014-11-21 23:03                                                                                           ` Andy Lutomirski
  2014-11-21 23:33                                                                                             ` Linus Torvalds
  2014-12-16 19:28                                                                                           ` Peter Zijlstra
  1 sibling, 1 reply; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-21 23:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Steven Rostedt, Tejun Heo, linux-kernel,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 2:55 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Nov 21, 2014 at 1:11 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> I'm fine with that. I just think it's not horrid enough, but that can
>> be fixed easily :)
>
> Oh, I think it's plenty horrid.
>
> Anyway, here's an actual patch. As usual, it has seen absolutely no
> actual testing, but I did try to make sure it compiles and seems to do
> the right thing on:
>  - x86-32 no-PAE
>  - x86-32 no-PAE with PARAVIRT
>  - x86-32 PAE
>  - x86-64
>
> also, I just removed the noise that is "vmalloc_sync_all()", since
> it's just all garbage and nothing actually uses it. Yeah, it's used by
> "register_die_notifier()", which makes no sense what-so-ever.
> Whatever. It's gone.
>
> Can somebody actually *test* this? In particular, in any kind of real
> paravirt environment? Or, any comments even without testing?
>
> I *really* am not proud of the mess wrt the whole
>
>   #ifdef CONFIG_PARAVIRT
>   #ifdef CONFIG_X86_32
>     ...
>
> but I think that from a long-term perspective, we're actually better
>> off with this kind of really ugly - but very explicit - hack that very
> clearly shows what is going on.
>
> The old code that actually "walked" the page tables was more
> "portable", but was somewhat misleading about what was actually going
> on.

At the risk of going deeper down the rabbit hole, I grepped for
pgd_list.  I found:

__set_pmd_pte in pageattr.c.  It appears to be completely incorrect.
Unless I've misunderstood, other than the very first line, it will
either do nothing at all or crash when it falls off the end of the
page tables that it's pointlessly trying to update.

sync_global_pgds: OK, I guess -- this is for hot-add of memory, right?
 But if we teach the context switch code to check that the kernel
stack is okay, that can be removed, I think.  (We absolutely MUST keep
the static per-cpu stuff populated everywhere before running user
code, but that's never in hot-added memory.)

xen_mm_pin_all and xen_mm_unpin_all: I have no clue.  I wonder how
that works with SHARED_KERNEL_PMD.

Anyone want to attack these?  It would be kind of nice to remove
pgd_list entirely.  (I realize that doing so precludes the use of
bloody enormous 512GB kernel pages, but any attempt to use *those* is
so completely screwed without a major reworking of all of this (or
perhaps stop_machine) that keeping pgd_list around just for that is
probably a mistake.)

--Andy

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 23:03                                                                                           ` Andy Lutomirski
@ 2014-11-21 23:33                                                                                             ` Linus Torvalds
  0 siblings, 0 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-21 23:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Steven Rostedt, Tejun Heo, linux-kernel,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker,
	Don Zickus, Dave Jones, the arch/x86 maintainers

On Fri, Nov 21, 2014 at 3:03 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Nov 21, 2014 at 2:55 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Anyway, here's an actual patch. As usual, it has seen absolutely no
>> actual testing,

.. ok, it boots and works fine as far as I can tell on x86-64 with no
paravirt anywhere.

> At the risk of going deeper down the rabbit hole, I grepped for
> pgd_list.  I found:

Ugh.

> __set_pmd_pte in pageattr.c.  It appears to be completely incorrect.
> Unless I've misunderstood, other than the very first line, it will
> either do nothing at all or crash when it falls off the end of the
> page tables that it's pointlessly trying to update.

I think you found a rats nest.

I can't make heads nor tails of the logic. The !SHARED_KERNEL_PMD test
doesn't seem very sensible, since that's also the conditional for
adding anything to the list in the first place.

So I agree that the code doesn't make much sense. Although maybe it's
there just because that way the loop goes away at compile-time under
most circumstances. So maybe even that part does make sense.

And the "walk down to the pmd level" part actually looks ok. Remember:
this is on x86-32 only, and you have two cases: non-PAE where the
pmd/pud offset thing does nothing at all, and it just ends up
converting a "pgd_t *" to a "pmd_t *".  And for PAE, the top pud level
always exists, and the pmd is folded, so despite what looks like
walking two levels, it really just walks the one level - the
force-allocated PGD entries.

So it won't "fall off the end of the page tables" like you imply. It
will just walk to the pmd level. And there it will populate all the
page tables with the same pmd.

So I think it works.

                   Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 21:44                                                               ` Frederic Weisbecker
@ 2014-11-22  0:11                                                                 ` Tejun Heo
  2014-11-22  0:18                                                                   ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Tejun Heo @ 2014-11-22  0:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Andy Lutomirski, Arnaldo Carvalho de Melo

Hello, Frederic.

On Fri, Nov 21, 2014 at 10:44:46PM +0100, Frederic Weisbecker wrote:
> I fear that enumerating and fixing the existing issues won't be enough.
> We can't find all the code sites out there which rely on not
> faulting.

Oh, sure, but that can take some time, so adding documentation in the
meantime probably isn't a bad idea.

> The best would be to fix that from the percpu allocator itself, or
> vmalloc.

I don't think there's much the percpu allocator itself can do.  The
ability to grow dynamically comes from being able to allocate a
relatively consistent layout among the areas for different CPUs, which
pretty much requires the vmalloc area; and it'd generally be a good
idea to take out the vmalloc fault anyway.
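Concretely, the layout property amounts to something like this toy
sketch (not the allocator's real code; the unit offsets are invented):

#include <stddef.h>
#include <stdio.h>

#define NR_CPUS 4

/* Hypothetical per-CPU unit offsets; the real allocator computes
 * these at boot and keeps them identical for every chunk it later
 * grows, which is what makes the scheme work. */
static const ptrdiff_t unit_off[NR_CPUS] = {
	0x000000, 0x100000, 0x200000, 0x300000,
};

/* CPU N's copy of any allocation sits at the same offset inside CPU
 * N's unit, so per_cpu_ptr() reduces to a single addition.  That only
 * holds if all units share one consistent layout, hence the reliance
 * on a large contiguous vmalloc area. */
static void *toy_per_cpu_ptr(void *cpu0_addr, int cpu)
{
	return (char *)cpu0_addr + unit_off[cpu];
}

int main(void)
{
	static char cpu0_obj[16];	/* stand-in for a percpu object */
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu%d copy at %p\n", cpu, toy_per_cpu_ptr(cpu0_obj, cpu));
	return 0;
}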

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-22  0:11                                                                 ` Tejun Heo
@ 2014-11-22  0:18                                                                   ` Linus Torvalds
  2014-11-22  0:41                                                                     ` Andy Lutomirski
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-22  0:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Thomas Gleixner, Dave Jones, Don Zickus,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra,
	Andy Lutomirski, Arnaldo Carvalho de Melo

On Fri, Nov 21, 2014 at 4:11 PM, Tejun Heo <tj@kernel.org> wrote:
>
> I don't think there's much percpu allocator itself can do.  The
> ability to grow dynamically comes from being able to allocate
> relatively consistent layout among areas for different CPUs and pretty
> much requires vmalloc area and it'd generally be a good idea to take
> out the vmalloc fault anyway.

Why do you guys worry so much about the vmalloc fault?

This started because of a very different issue: putting the actual
stack in vmalloc space. Then it can cause nasty triple faults etc.

But the normal vmalloc fault? Who cares, really? If that causes
problems, they are bugs. Fix them.

                    Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-22  0:18                                                                   ` Linus Torvalds
@ 2014-11-22  0:41                                                                     ` Andy Lutomirski
  0 siblings, 0 replies; 486+ messages in thread
From: Andy Lutomirski @ 2014-11-22  0:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Dave Jones,
	Don Zickus, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra, Arnaldo Carvalho de Melo

On Fri, Nov 21, 2014 at 4:18 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Nov 21, 2014 at 4:11 PM, Tejun Heo <tj@kernel.org> wrote:
>>
>> I don't think there's much percpu allocator itself can do.  The
>> ability to grow dynamically comes from being able to allocate
>> relatively consistent layout among areas for different CPUs and pretty
>> much requires vmalloc area and it'd generally be a good idea to take
>> out the vmalloc fault anyway.
>
> Why do you guys worry so much about the vmalloc fault?
>
> This started because of a very different issue: putting the actual
> stack in vmalloc space. Then it can cause nasty triple faults etc.
>
> But the normal vmalloc fault? Who cares, really? If that causes
> problems, they are bugs. Fix them.

Because of this in system_call_after_swapgs:

    movq    %rsp,PER_CPU_VAR(old_rsp)
    movq    PER_CPU_VAR(kernel_stack),%rsp

It occurs to me that, if we really want to change that, we could have
an array of syscall trampolines, one per CPU, that have the CPU number
hardcoded.  But I really don't think that's worth it.

Other than that, with your fix, vmalloc faults are no big deal :)

--Andy

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 22:33                                                                                 ` Konrad Rzeszutek Wilk
@ 2014-11-22  1:17                                                                                   ` Thomas Gleixner
  0 siblings, 0 replies; 486+ messages in thread
From: Thomas Gleixner @ 2014-11-22  1:17 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Linus Torvalds, Andy Lutomirski, Steven Rostedt, Tejun Heo,
	linux-kernel, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Frederic Weisbecker, Don Zickus, Dave Jones,
	the arch/x86 maintainers, xen-devel

On Fri, 21 Nov 2014, Konrad Rzeszutek Wilk wrote:
> On Fri, Nov 21, 2014 at 08:51:43PM +0100, Thomas Gleixner wrote:
> > On Fri, 21 Nov 2014, Linus Torvalds wrote:
> > > Here's the simplified end result. Again, this is TOTALLY UNTESTED. I
> > > compiled it and verified that the code generation looks like what I'd
> > > have expected, but that's literally it.
> > > 
> > >   static noinline int vmalloc_fault(unsigned long address)
> > >   {
> > >         pgd_t *pgd_dst;
> > >         pgdval_t pgd_entry;
> > >         unsigned index = pgd_index(address);
> > > 
> > >         if (index < KERNEL_PGD_BOUNDARY)
> > >                 return -1;
> > > 
> > >         pgd_entry = init_mm.pgd[index].pgd;
> > >         if (!pgd_entry)
> > >                 return -1;
> > > 
> > >         pgd_dst = __va(PAGE_MASK & read_cr3());
> > >         pgd_dst += index;
> > > 
> > >         if (pgd_dst->pgd)
> > >                 return -1;
> > > 
> > >         ACCESS_ONCE(pgd_dst->pgd) = pgd_entry;
> > 
> > This will break paravirt. set_pgd/set_pmd are paravirt functions.
> > 
> > But I'm fine with breaking it, then you just need to change
> > CONFIG_PARAVIRT to 'def_bool n'
> 
> That is not very nice.

Maybe not nice, but sensible.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-21 20:23                                                                                           ` Josh Boyer
@ 2014-11-24 18:48                                                                                             ` Konrad Rzeszutek Wilk
  2014-11-24 19:07                                                                                               ` Josh Boyer
  2014-11-25  5:36                                                                                               ` Jürgen Groß
  0 siblings, 2 replies; 486+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-24 18:48 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Tejun Heo,
	linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Don Zickus, Dave Jones,
	the arch/x86 maintainers

On Fri, Nov 21, 2014 at 03:23:13PM -0500, Josh Boyer wrote:
> On Fri, Nov 21, 2014 at 3:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Fri, Nov 21, 2014 at 12:14 PM, Josh Boyer <jwboyer@fedoraproject.org> wrote:
> >> On Fri, Nov 21, 2014 at 2:52 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>> On Fri, Nov 21, 2014 at 11:46 AM, Linus Torvalds
> >>> <torvalds@linux-foundation.org> wrote:
> >>>> On Fri, Nov 21, 2014 at 11:34 AM, Linus Torvalds
> >>>> <torvalds@linux-foundation.org> wrote:
> >>>>>
> >>>>> So I kind of agree, but it wouldn't be my primary worry. My primary
> >>>>> worry is actually paravirt doing something insane.
> >>>>
> >>>> Btw, on that tangent, does anybody actually care about paravirt any more?
> >>>>
> >>>
> >>> Amazon, for better or for worse.

And distros: Oracle and Novell.

> >>>
> >>>> I'd love to start moving away from it. It makes a lot of the low-level
> >>>> code completely impossible to follow due to the random indirection
> >>>> through "native" vs "paravirt op table". Not just the page table
> >>>> handling, it's all over.
> >>>>
> >>>> Anybody who seriously does virtualization uses hw virtualization that
> >>>> is much better than it used to be. And the non-serious users aren't
> >>>> that performance-sensitive by definition.

I would point out that the paravirt spinlock gives a huge boost
for virtualization guests (this is true for both KVM and Xen).
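The idea, very roughly (a conceptual sketch, not the kernel's actual
ticketlock code; pv_halt_until_kicked() is a made-up stand-in for the
real hypercall-backed op):

#define SPIN_THRESHOLD	(1 << 15)

/* A vCPU that cannot take the lock quickly stops burning its
 * timeslice spinning against a possibly-preempted lock holder and
 * instead asks the hypervisor to halt it until it is kicked. */
static void pv_ticket_wait(arch_spinlock_t *lock, __ticket_t want)
{
	int loops;

	for (loops = 0; loops < SPIN_THRESHOLD; loops++) {
		if (ACCESS_ONCE(lock->tickets.head) == want)
			return;		/* our turn arrived while spinning */
		cpu_relax();
	}

	/* Still contended: block in the hypervisor rather than spin. */
	pv_halt_until_kicked(lock, want);	/* hypothetical helper */
}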
> >>>>
> >>>> I note that the Fedora kernel config seems to include paravirt by
> >>>> default, so you get a lot of the crazy overheads..

Not that much. We ran benchmarks and the cost was mostly i-cache
overhead - the numbers came out sub-1%.
> >>>
> >>> I think that there is a move toward deprecating Xen PV in favor of
> >>> PVH, but we're not there yet.
> >>
> >> A move where?  The Xen stuff in Fedora is ... not paid attention to
> >> very much.  If there's something we should be looking at turning off
> >> (or on), we're happy to take suggestions.
> >
> > A move in the Xen project.  As I understand it, Xen wants to deprecate
> > PV in favor of PVH, but PVH is still experimental.
> 
> OK.
> 
> > I think that dropping PARAVIRT in Fedora might be a bad idea for
> > several more releases, since that's likely to break the EC2 images.
> 
> Yes, that's essentially the only reason we haven't looked at disabling
> Xen completely for a while now, so <sad trombone>.

Heh. Didn't know you could play on a trombone!

As I had mentioned in the past - if there are Xen related bugs on
Fedora please CC me on them. Or perhaps CC xen-devel@lists.xenproject.org
if that is possible.

And as Andy has mentioned - we are moving towards using PVH as a way
to not use the PV MMU ops. But that is still a way off (<sad trombone
played from YouTube>).

> 
> josh
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-24 18:48                                                                                             ` Konrad Rzeszutek Wilk
@ 2014-11-24 19:07                                                                                               ` Josh Boyer
  2014-11-25  5:36                                                                                               ` Jürgen Groß
  1 sibling, 0 replies; 486+ messages in thread
From: Josh Boyer @ 2014-11-24 19:07 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Tejun Heo,
	linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Don Zickus, Dave Jones,
	the arch/x86 maintainers

On Mon, Nov 24, 2014 at 1:48 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Fri, Nov 21, 2014 at 03:23:13PM -0500, Josh Boyer wrote:
>> On Fri, Nov 21, 2014 at 3:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > On Fri, Nov 21, 2014 at 12:14 PM, Josh Boyer <jwboyer@fedoraproject.org> wrote:
>> >> On Fri, Nov 21, 2014 at 2:52 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> >>> On Fri, Nov 21, 2014 at 11:46 AM, Linus Torvalds
>> >>> <torvalds@linux-foundation.org> wrote:
>> >>>> On Fri, Nov 21, 2014 at 11:34 AM, Linus Torvalds
>> >>>> <torvalds@linux-foundation.org> wrote:
>> >>>>>
>> >>>>> So I kind of agree, but it wouldn't be my primary worry. My primary
>> >>>>> worry is actually paravirt doing something insane.
>> >>>>
>> >>>> Btw, on that tangent, does anybody actually care about paravirt any more?
>> >>>>
>> >>>
>> >>> Amazon, for better or for worse.
>
> And distros: Oracle and Novell.
>
>> >>>
>> >>>> I'd love to start moving away from it. It makes a lot of the low-level
>> >>>> code completely impossible to follow due to the random indirection
>> >>>> through "native" vs "paravirt op table". Not just the page table
>> >>>> handling, it's all over.
>> >>>>
>> >>>> Anybody who seriously does virtualization uses hw virtualization that
>> >>>> is much better than it used to be. And the non-serious users aren't
>> >>>> that performance-sensitive by definition.
>
> I would point out that the paravirt spinlock gives a huge boost
> for virtualization guests (this is true for both KVM and Xen).
>> >>>>
>> >>>> I note that the Fedora kernel config seems to include paravirt by
>> >>>> default, so you get a lot of the crazy overheads..
>
> Not that much. We ran benchmarks and the cost was mostly i-cache
> overhead - the numbers came out sub-1%.
>> >>>
>> >>> I think that there is a move toward deprecating Xen PV in favor of
>> >>> PVH, but we're not there yet.
>> >>
>> >> A move where?  The Xen stuff in Fedora is ... not paid attention to
>> >> very much.  If there's something we should be looking at turning off
>> >> (or on), we're happy to take suggestions.
>> >
>> > A move in the Xen project.  As I understand it, Xen wants to deprecate
>> > PV in favor of PVH, but PVH is still experimental.
>>
>> OK.
>>
>> > I think that dropping PARAVIRT in Fedora might be a bad idea for
>> > several more releases, since that's likely to break the EC2 images.
>>
>> Yes, that's essentially the only reason we haven't looked at disabling
>> Xen completely for a while now, so <sad trombone>.
>
> Heh. Didn't know you could play the trombone!

It's sad because I can't really play the trombone and it sounds horrible.

> As I had mentioned in the past - if there are Xen related bugs on
> Fedora please CC me on them. Or perhaps CC xen-devel@lists.xenproject.org
> if that is possible.

Indeed, you have been massively helpful.  My comment about it not being
paid much attention was a reflection on the distro maintainers, not
you.  You've been great once we notice a Xen issue, but noticing takes
a while on our part and it isn't the best of user experiences :\.

> And as Andy has mentioned - we are moving towards using PVH as a way
> to not use the PV MMU ops. But that is still off (<sad trombone played
> from YouTube>).

OK.  I'll try and do better at keeping up with things.

josh

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-24 18:48                                                                                             ` Konrad Rzeszutek Wilk
  2014-11-24 19:07                                                                                               ` Josh Boyer
@ 2014-11-25  5:36                                                                                               ` Jürgen Groß
  2014-11-25 17:22                                                                                                 ` Linus Torvalds
  1 sibling, 1 reply; 486+ messages in thread
From: Jürgen Groß @ 2014-11-25  5:36 UTC (permalink / raw)
  To: torvalds
  Cc: Konrad Rzeszutek Wilk, Josh Boyer, Andy Lutomirski,
	Linus Torvalds, Steven Rostedt, Tejun Heo, linux-kernel,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker, Don Zickus,
	Dave Jones, the arch/x86 maintainers

On 11/24/2014 07:48 PM, Konrad Rzeszutek Wilk wrote:
> On Fri, Nov 21, 2014 at 03:23:13PM -0500, Josh Boyer wrote:
>> On Fri, Nov 21, 2014 at 3:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Fri, Nov 21, 2014 at 12:14 PM, Josh Boyer <jwboyer@fedoraproject.org> wrote:
>>>> On Fri, Nov 21, 2014 at 2:52 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Fri, Nov 21, 2014 at 11:46 AM, Linus Torvalds
>>>>> <torvalds@linux-foundation.org> wrote:
>>>>>> On Fri, Nov 21, 2014 at 11:34 AM, Linus Torvalds
>>>>>> <torvalds@linux-foundation.org> wrote:
>>>>>>>
>>>>>>> So I kind of agree, but it wouldn't be my primary worry. My primary
>>>>>>> worry is actually paravirt doing something insane.
>>>>>>
>>>>>> Btw, on that tangent, does anybody actually care about paravirt any more?
>>>>>>

Funny, while testing some Xen-related patches I hit the lockup issue.
It looked a little bit different, but a variation of your patch solved
my problem. The difference from the original report might be due to the
rather low system load during my test: the system was still responsive
when the first lockup messages appeared. I could see that the hanging
cpus were spinning in pmd_lock() called during __handle_mm_fault().

I could reproduce the issue within a few minutes reliably without the
patch below. With it the machine survived 12 hours and is still running.

Why my test would trigger the problem so fast I have no idea. That I
saw it only on a rather huge machine (128GB memory, 120 cpus) is quite
understandable. My test remapped some pages via the hypervisor and then
removed those mappings again. Perhaps the TLB flushing involved in these
operations is triggering the problem.


diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index d973e61..b847ff7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -377,7 +377,7 @@ static noinline int vmalloc_fault(unsigned long address)
          * happen within a race in page table update. In the later
          * case just flush:
          */
-       pgd = pgd_offset(current->active_mm, address);
+       pgd = (pgd_t *)__va(read_cr3()) + pgd_index(address);
         pgd_ref = pgd_offset_k(address);
         if (pgd_none(*pgd_ref))
                 return -1;
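
(A hedged sketch of the two lookups the hunk above chooses between;
kernel context assumed, using the x86-64 helpers of that era:)

/* Software view: trusts the kernel's notion of the active mm. */
static pgd_t *pgd_software_view(unsigned long address)
{
	return pgd_offset(current->active_mm, address);
}

/* Hardware view: trusts whatever the CPU is actually using -- CR3
 * holds the physical address of the page-global directory, and
 * __va() maps it back to a kernel virtual pointer. */
static pgd_t *pgd_hardware_view(unsigned long address)
{
	return (pgd_t *)__va(read_cr3()) + pgd_index(address);
}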



Juergen

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-20 15:25                                   ` frequent lockups in 3.18rc4 Dave Jones
  2014-11-20 19:43                                     ` Linus Torvalds
@ 2014-11-25 12:22                                     ` Will Deacon
  2014-12-01 11:48                                       ` Will Deacon
  1 sibling, 1 reply; 486+ messages in thread
From: Will Deacon @ 2014-11-25 12:22 UTC (permalink / raw)
  To: Dave Jones, Andy Lutomirski, Linus Torvalds, Don Zickus,
	Thomas Gleixner, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra

Hi Dave,

On Thu, Nov 20, 2014 at 10:25:09AM -0500, Dave Jones wrote:
> On Wed, Nov 19, 2014 at 01:01:36PM -0800, Andy Lutomirski wrote:
>  
>  > TIF_NOHZ is not the same thing as NOHZ.  Can you try a kernel with
>  > CONFIG_CONTEXT_TRACKING=n?  Doing that may involve fiddling with RCU
>  > settings a bit.  The normal no HZ idle stuff has nothing to do with
>  > TIF_NOHZ, and you either have TIF_NOHZ set or you have some kind of
>  > thread_info corruption going on here.
> 
> Disabling CONTEXT_TRACKING didn't change the problem.
> Unfortunatly the full trace didn't make it over usb-serial this time. Grr.
> 
> Here's what came over serial..
> 
> NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [trinity-c35:11634]
> CPU: 2 PID: 11634 Comm: trinity-c35 Not tainted 3.18.0-rc5+ #94 [loadavg: 164.79 157.30 155.90 37/409 11893]
> task: ffff88014e0d96f0 ti: ffff880220eb4000 task.ti: ffff880220eb4000
> RIP: 0010:[<ffffffff88379605>]  [<ffffffff88379605>] copy_user_enhanced_fast_string+0x5/0x10
> RSP: 0018:ffff880220eb7ef0  EFLAGS: 00010283
> RAX: ffff880220eb4000 RBX: ffffffff887dac64 RCX: 0000000000006a18
> RDX: 000000000000e02f RSI: 00007f766f466620 RDI: ffff88016f6a7617
> RBP: ffff880220eb7f78 R08: 8000000000000063 R09: 0000000000000004
> R10: 0000000000000010 R11: 0000000000000000 R12: ffffffff880bf50d
> R13: 0000000000000001 R14: ffff880220eb4000 R15: 0000000000000001
> FS:  00007f766f459740(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f766f461000 CR3: 000000018b00e000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
>  ffffffff882f4225 ffff880183db5a00 0000000001743440 00007f766f0fb000
>  fffffffffffffeff 0000000000000000 0000000000008d79 00007f766f45f000
>  ffffffff8837adae 00ff880220eb7f38 000000003203f1ac 0000000000000001
> Call Trace:
>  [<ffffffff882f4225>] ? SyS_add_key+0xd5/0x240
>  [<ffffffff8837adae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>  [<ffffffff887da092>] system_call_fastpath+0x12/0x17
> Code: 48 ff c6 48 ff c7 ff c9 75 f2 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 1f 00 c3 0f 1f 80 00 00 00 00 0f 1f 00 89 d1 <f3> a4 31 c0 0f 1f 00 c3 90 90 90 0f 1f 00 83 fa 08 0f 82 95 00 
> sending NMI to other CPUs:
> 
> 
> Here's a crappy phonecam pic of the screen. 
> http://codemonkey.org.uk/junk/IMG_4311.jpg
> There's a bit of trace missing between the above and what was on
> the screen, so we missed some CPUs.

I'm not sure if this is useful, but I've been seeing trinity lockups
on arm64 as well. Sometimes they happen a few times a day, sometimes it
takes a few days (I just saw my first one on -rc6, for example).

However, I have a little bit more trace than you do and *every single time*
the lockup has involved an execve to a virtual file system.

E.g.:

[child1:10700] [212] execve(name="/sys/fs/ext4/features/batched_discard", argv=0x91796a0, envp=0x911a9c0)

(I've seen cases with /proc too)

The child doing the execve then doesn't return an error from the syscall,
and instead seems to disappear from the face of the planet, sometimes with
the tasklist_lock held for write, which causes a lockup shortly afterwards.

I'm running under KVM with two virtual CPUs. When the machine is wedged,
one CPU is sitting in idle and the other seems to be kicking around do_wait
and pid_vnr, but it's difficult to really see what's going on.

I tried increasing the likelihood of execve syscalls in trinity, but it
didn't seem to help with reproducing this issue.

Will

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-25  5:36                                                                                               ` Jürgen Groß
@ 2014-11-25 17:22                                                                                                 ` Linus Torvalds
  0 siblings, 0 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-25 17:22 UTC (permalink / raw)
  To: Jürgen Groß
  Cc: Konrad Rzeszutek Wilk, Josh Boyer, Andy Lutomirski,
	Steven Rostedt, Tejun Heo, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones,
	the arch/x86 maintainers

On Mon, Nov 24, 2014 at 9:36 PM, Jürgen Groß <jgross@suse.com> wrote:
>
> Funny, during testing some patches related to Xen I hit the lockup
> issue. It looked a little bit different, but a variation of your patch
> solved my problem.
>
> I could reproduce the issue within a few minutes reliably without the
> patch below. With it the machine survived 12 hours and is still running.

Do you have a backtrace for the failure case? I have no problem
applying this part of the patch (I really don't understand why x86-64
hadn't gotten the proper code from 32-bit), but I'd like to see (and
document) where the fault happens for this.

Since you can apparently reproduce this fairly easily with a broken
kernel, getting a backtrace shouldn't be too hard?

                     Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-16  6:33       ` Linus Torvalds
  2014-11-16 10:06         ` Markus Trippelsdorf
  2014-11-17 17:03         ` Dave Jones
@ 2014-11-26  0:25         ` Dave Jones
  2014-11-26  1:48           ` Linus Torvalds
  2 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-26  0:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel, the arch/x86 maintainers

On Sat, Nov 15, 2014 at 10:33:19PM -0800, Linus Torvalds wrote:

 > I have no ideas left. I'd go for a bisection - rather than try random
 > things, at least bisection will get us a smaller set of suspects if
 > you can go through a few cycles of it. Even if you decide that you
 > want to run for most of a day before you are convinced it's all good,
 > a couple of days should get you a handful of bisection points (that's
 > assuming you hit a couple of bad ones too that turn bad in a shorter
 > while). And four or five bisections should get us from 11k commits down
 > to the ~600 commit range. That would be a huge improvement.

There are 8 bisection steps remaining. The log so far:

git bisect start
# good: [bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9] Linux 3.17
git bisect good bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9
# bad: [f114040e3ea6e07372334ade75d1ee0775c355e1] Linux 3.18-rc1
git bisect bad f114040e3ea6e07372334ade75d1ee0775c355e1
# bad: [f114040e3ea6e07372334ade75d1ee0775c355e1] Linux 3.18-rc1
git bisect bad f114040e3ea6e07372334ade75d1ee0775c355e1
# bad: [35a9ad8af0bb0fa3525e6d0d20e32551d226f38e] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad 35a9ad8af0bb0fa3525e6d0d20e32551d226f38e
# bad: [35a9ad8af0bb0fa3525e6d0d20e32551d226f38e] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad 35a9ad8af0bb0fa3525e6d0d20e32551d226f38e
# bad: [683a52a10148e929fb4844f9237f059a47c0b01b] Merge tag 'tty-3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect bad 683a52a10148e929fb4844f9237f059a47c0b01b
# bad: [683a52a10148e929fb4844f9237f059a47c0b01b] Merge tag 'tty-3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect bad 683a52a10148e929fb4844f9237f059a47c0b01b
# bad: [76272ab3f348d303eb31a5a061601ca8e0f9c5ce] staging: rtl8821ae: remove driver
git bisect bad 76272ab3f348d303eb31a5a061601ca8e0f9c5ce
# bad: [e988e1f3f975a9d6013c6356c5b9369540c091f9] staging: comedi: ni_at_a2150: range check board index
git bisect bad e988e1f3f975a9d6013c6356c5b9369540c091f9
# bad: [bd8107b2b2dc9fb1113bfe1a9cf2533ee19c57ee] Staging: bcm: Bcmchar.c: Renamed variable: "RxCntrlMsgBitMask" -> "rx_cntrl_msg_bit_mask"
git bisect bad bd8107b2b2dc9fb1113bfe1a9cf2533ee19c57ee
# bad: [91ed283ab563727932d6cf92b74dd15226635870] staging: rtl8188eu: Remove unused function rtw_IOL_append_WD_cmd()
git bisect bad 91ed283ab563727932d6cf92b74dd15226635870


The reason I'm checking in at this point is that I'm starting to see
different bugs, so I don't know if I can call this good or bad, unless
someone has a fix for what I'm seeing now.

Reminiscent of a bug a couple releases ago. Processes about to exit, but stuck
in the kernel continuously faulting..
http://codemonkey.org.uk/junk/weird-hang.txt
The one I'm thinking of got fixed way before 3.17 though.

Does that trace ring a bell, suggesting something else I could try on
top of each bisection point?

I rebooted and restarted my test at the current bisection point,
hopefully it'll show up as 'bad' before the bug above happens again.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-26  0:25         ` Dave Jones
@ 2014-11-26  1:48           ` Linus Torvalds
  2014-11-26  2:40             ` Dave Jones
  2014-11-26  4:39             ` Jürgen Groß
  0 siblings, 2 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-26  1:48 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Tue, Nov 25, 2014 at 4:25 PM, Dave Jones <davej@redhat.com> wrote:
>
> The reason I'm checking in at this point is that I'm starting to see
> different bugs, so I don't know if I can call this good or bad, unless
> someone has a fix for what I'm seeing now.

Hmm. The last three "bad" bisects are all just 3.17-rc1 plus staging fixes.

> Reminiscent of a bug a couple releases ago. Processes about to exit, but stuck
> in the kernel continuously faulting..
> http://codemonkey.org.uk/junk/weird-hang.txt
> The one I'm thinking of got fixed way before 3.17 though.

Well, the staging tree was based on that 3.17-rc1 tree, so it may well
have the bug without the fix.

You have also marked 3.18-rc1 bad *twice*, along with the network
merge, and the tty merge. That's just odd. But it doesn't make the
bisect wrong, it just means that you fat-fingered things and marked the
same thing bad a couple of times.

Nothing to worry about, unless it's a sign of early Parkinson's...

 > > Does that trace ring a bell, suggesting something else I could try
 > > on top of each bisection point?

Hmm.

Smells somewhat like the "pipe/page fault oddness" bug you reported.

That one caused endless page faults on fault_in_pages_writeable()
because of a page table entry that the VM thought was present, but the
CPU thought was missing.
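
(A toy model of that failure mode -- purely illustrative userspace C,
not kernel code: when the software's notion of "present" and the CPU's
disagree, retrying the faulting write never makes progress.)

#include <stdbool.h>
#include <stdio.h>

static bool vm_pte_present = true;   /* software page-table state  */
static bool cpu_pte_present = false; /* what the hardware walk sees */

static bool try_write(void)
{
	return cpu_pte_present;	/* succeeds only if the CPU agrees */
}

static void handle_fault(void)
{
	if (vm_pte_present)
		return;		/* "nothing to fix": state never reconciled */
	vm_pte_present = cpu_pte_present = true;
}

int main(void)
{
	int tries;

	for (tries = 0; tries < 5; tries++) {	/* the kernel loops forever */
		if (try_write()) {
			puts("write succeeded");
			return 0;
		}
		handle_fault();
	}
	puts("still faulting - livelock");
	return 1;
}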

That caused the whole "pte_protnone()" thing, and trying to get rid of
the PTE_NUMA bit, but those patches have *not* been merged. And you
were never able to reproduce it, so we left it as pending.

But if you actually really think that the bisect log you posted is
real and true and actually is the bug you're chasing, I have bad news
for you: do a "gitk --bisect", and you'll see that all the remaining
commits are just to staging drivers.

So that would either imply you have some staging driver (unlikely), or
more likely that 3.17 really already has the problem, it's just that
it needs some particular code alignment or phase of the moon or
something to trigger.

                 Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-26  1:48           ` Linus Torvalds
@ 2014-11-26  2:40             ` Dave Jones
  2014-11-26 22:57               ` Dave Jones
  2014-11-26  4:39             ` Jürgen Groß
  1 sibling, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-26  2:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel, the arch/x86 maintainers

On Tue, Nov 25, 2014 at 05:48:15PM -0800, Linus Torvalds wrote:

 > You have also marked 3.18-rc1 bad *twice*, along with the network
 > merge, and the tty merge. That's just odd. But it doesn't make the
 > bisect wrong, it just means that you fat-fingered things and marked the
 > same thing bad a couple of times.
 > 
 > Nothing to worry about, unless it's a sign of early Parkinson's...

Intentional on my part, though I didn't realize the first one was
recorded. The first time, it printed the usual bisect text, but then
complained my tree was dirty (which it was). I unapplied the stuff I
had, and ran the bisect command a second time.

 > > Does that trace ring a bell, suggesting something else I could try
 > > on top of each bisection point?
 > 
 > Hmm.
 > 
 > Smells somewhat like the "pipe/page fault oddness" bug you reported.
 > 
 > That one caused endless page faults on fault_in_pages_writeable()
 > because of a page table entry that the VM thought was present, but the
 > CPU thought was missing.
 > 
 > That caused the whole "pte_protnone()" thing, and trying to get rid of
 > the PTE_NUMA bit, but those patches have *not* been merged. And you
 >  > were never able to reproduce it, so we left it as pending.

ah, yeah, now it comes back to me.

 > But if you actually really think that the bisect log you posted is
 > real and true and actually is the bug you're chasing, I have bad news
 > for you: do a "gitk --bisect", and you'll see that all the remaining
 > commits are just to staging drivers.
 > 
 > So that would either imply you have some staging driver (unlikely), or
 > more likely that 3.17 really already has the problem, it's just that
 > it needs some particular code alignment or phase of the moon or
 > something to trigger.

Maybe I'll try 3.17 + perf fix for an even longer runtime.
Like over Thanksgiving or something.

If some of the bisection points so far had been 'good', I would
go back and re-check, but every step of the way I've been able
to reproduce it.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-26  1:48           ` Linus Torvalds
  2014-11-26  2:40             ` Dave Jones
@ 2014-11-26  4:39             ` Jürgen Groß
       [not found]               ` <CA+55aFx1SiFBzmA=k9jHxi3cZE3Ei_+2NHepujgf86KEvkz8eQ@mail.gmail.com>
  1 sibling, 1 reply; 486+ messages in thread
From: Jürgen Groß @ 2014-11-26  4:39 UTC (permalink / raw)
  To: Linus Torvalds, Dave Jones, Linux Kernel, the arch/x86 maintainers

On 11/26/2014 02:48 AM, Linus Torvalds wrote:
> On Tue, Nov 25, 2014 at 4:25 PM, Dave Jones <davej@redhat.com> wrote:
>>
>> The reason I'm checking in at this point is that I'm starting to see
>> different bugs, so I don't know if I can call this good or bad, unless
>> someone has a fix for what I'm seeing now.
>
> Hmm. The last three "bad" bisects are all just 3.17-rc1 plus staging fixes.
>
>> Reminiscent of a bug a couple releases ago. Processes about to exit, but stuck
>> in the kernel continuously faulting..
>> http://codemonkey.org.uk/junk/weird-hang.txt
>> The one I'm thinking of got fixed way before 3.17 though.
>
> Well, the staging tree was based on that 3.17-rc1 tree, so it may well
> have the bug without the fix.
>
> You have also marked 3.18-rc1 bad *twice*, along with the network
> merge, and the tty merge. That's just odd. But it doesn't make the
> bisect wrong, it just means that you fat-fingered things and marked the
> same thing bad a couple of times.
>
> Nothing to worry about, unless it's a sign of early Parkinson's...
>
>> Does that trace ring a bell, suggesting something else I could try on
>> top of each bisection point?
>
> Hmm.
>
> Smells somewhat like the "pipe/page fault oddness" bug you reported.
>
> That one caused endless page faults on fault_in_pages_writeable()
> because of a page table entry that the VM thought was present, but the
> CPU thought was missing.
>
> That caused the whole "pte_protnone()" thing, and trying to get rid of
> the PTE_NUMA bit, but those patches have *not* been merged. And you
> were never able to reproduce it, so we left it as pending.
>
> But if you actually really think that the bisect log you posted is
> real and true and actually is the bug you're chasing, I have bad news
> for you: do a "gitk --bisect", and you'll see that all the remaining
> commits are just to staging drivers.
>
> So that would either imply you have some staging driver (unlikely), or
> more likely that 3.17 really already has the problem, it's just that
> it needs some particular code alignment or phase of the moon or
> something to trigger.

I COULD trigger it with 3.17. Took much longer, but I've seen it once.
And from Xen hypervisor data it was clear it was the same bug (cpu
spinning in pmd_lock()).


Juergen


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
       [not found]               ` <CA+55aFx1SiFBzmA=k9jHxi3cZE3Ei_+2NHepujgf86KEvkz8eQ@mail.gmail.com>
@ 2014-11-26  5:11                 ` Dave Jones
  2014-11-26  5:24                 ` Juergen Gross
  1 sibling, 0 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-26  5:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jürgen Groß, the arch/x86 maintainers, Kernel Mailing List

On Tue, Nov 25, 2014 at 09:09:45PM -0800, Linus Torvalds wrote:
 > On Nov 25, 2014 8:39 PM, "Jürgen Groß" <jgross@suse.com> wrote:
 > >
 > > I COULD trigger it with 3.17. Took much longer, but I've seen it once.
 > > And from Xen hypervisor data it was clear it was the same bug (cpu
 > > spinning in pmd_lock()).
 > 
 > I'm still hoping you can give a back trace. I'd like to know what access it
 > is that can trigger this, and preferably what the call chain to it was...
 > 
 > I do believe it happened in 3.17, I just want to understand the bug more -
 > not just apply the fix..
 > 
 > Most of Dave's lockup back traces did not have the whole page fault in
 > them, so while Dave has seen this too, there might be different symptoms...

Before giving 3.17 a multi-day workout, I'll try rc6 with Jürgen's patch
to see if that makes any difference at all for me.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
       [not found]               ` <CA+55aFx1SiFBzmA=k9jHxi3cZE3Ei_+2NHepujgf86KEvkz8eQ@mail.gmail.com>
  2014-11-26  5:11                 ` Dave Jones
@ 2014-11-26  5:24                 ` Juergen Gross
  2014-11-26  5:52                   ` Linus Torvalds
  1 sibling, 1 reply; 486+ messages in thread
From: Juergen Gross @ 2014-11-26  5:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: the arch/x86 maintainers, Kernel Mailing List, Dave Jones

On 11/26/2014 06:09 AM, Linus Torvalds wrote:
>
> On Nov 25, 2014 8:39 PM, "Jürgen Groß" <jgross@suse.com
> <mailto:jgross@suse.com>> wrote:
>  >
>  > I COULD trigger it with 3.17. Took much longer, but I've seen it once.
>  > And from Xen hypervisor data it was clear it was the same bug (cpu
>  > spinning in pmd_lock()).
>
> I'm still hoping you can give a back trace. I'd like to know what access
> it is that can trigger this, and preferably what the call chain to it was...

Working on it. Triggering it via sysrq(l) isn't working: the machine
hung up. I'll try a dump, but this might take some time due to the
machine size...

If this isn't working I can always modify the hypervisor to show me
more of the kernel stack in that situation. This will be a pure dump,
but it should be possible to extract the back trace from that.

>
> I do believe it happened in 3.17, I just want to understand the bug more
> - not just apply the fix..

Sure.

>
> Most of Dave's lockup back traces did not have the whole page fault in
> them, so while Dave has seen this too, there might be different symptoms...

Stay tuned... :-)


Juergen


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-26  5:24                 ` Juergen Gross
@ 2014-11-26  5:52                   ` Linus Torvalds
  2014-11-26  6:21                     ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-26  5:52 UTC (permalink / raw)
  To: Juergen Gross; +Cc: the arch/x86 maintainers, Kernel Mailing List, Dave Jones

On Tue, Nov 25, 2014 at 9:24 PM, Juergen Gross <jgross@suse.com> wrote:
>
> Working on it. Triggering it via sysrq(l) isn't working: machine hung
> up. I'll try a dump, but this might take some time due to the machine
> size...

Actually, in that patch that did this:

-       pgd = pgd_offset(current->active_mm, address);
+       pgd = (pgd_t *)__va(read_cr3()) + pgd_index(address);

make the code do:

        pgd = (pgd_t *)__va(read_cr3()) + pgd_index(address);
        WARN_ON(pdg != pgd_offset(current->active_mm, address));

and now you should get a nice backtrace for exactly when it happens,
but it's on a working kernel, so nothing will lock up.

Hmm?

And leave it running for a while, and see if the trace is always the
same, or if there are variations on it...

Thanks,

                      Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-26  5:52                   ` Linus Torvalds
@ 2014-11-26  6:21                     ` Linus Torvalds
  2014-11-26  6:52                       ` Juergen Gross
                                         ` (2 more replies)
  0 siblings, 3 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-26  6:21 UTC (permalink / raw)
  To: Juergen Gross; +Cc: the arch/x86 maintainers, Kernel Mailing List, Dave Jones

On Tue, Nov 25, 2014 at 9:52 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And leave it running for a while, and see if the trace is always the
> same, or if there are variations on it...

Amusing.

Lookie here:

   http://lists.xenproject.org/archives/html/xen-changelog/2005-08/msg00310.html

That's from 2005.

Anyway, I don't see why the cr3 issue matters, *unless* there is some
situation where the scheduler can run with interrupts enabled. And why
this is Xen-related, I have no idea.

The Xen patches seem to have lost that

 /* On Xen the line below does not always work. Needs investigating! */

line when backporting the 2.6.29 patches to Xen. And clearly nobody
investigated.

So please do get me back-traces, and we'll investigate. Better late
than never. But it does sound Xen-specific - although it's possible
that Xen just triggers some timing (and has apparently been able to
trigger it since 2005) that DaveJ now triggers on his one machine.

So DaveJ, even though this does appear Xen-centric (Xentric?) and
you're running on bare hardware, maybe you could do the same thing in
that x86-64 vmalloc_fault(). The timing with Jürgen is kind of
intriguing - if 3.18-rc made it happen much more often for him, maybe
it really is very timing-sensitive, and you actually are seeing a
non-Xen version of the same thing...

                           Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-26  6:21                     ` Linus Torvalds
@ 2014-11-26  6:52                       ` Juergen Gross
  2014-11-26  9:44                       ` Juergen Gross
  2014-11-26 14:34                       ` Dave Jones
  2 siblings, 0 replies; 486+ messages in thread
From: Juergen Gross @ 2014-11-26  6:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: the arch/x86 maintainers, Kernel Mailing List, Dave Jones

On 11/26/2014 07:21 AM, Linus Torvalds wrote:
> On Tue, Nov 25, 2014 at 9:52 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> And leave it running for a while, and see if the trace is always the
>> same, or if there are variations on it...
>
> Amusing.
>
> Lookie here:
>
>     http://lists.xenproject.org/archives/html/xen-changelog/2005-08/msg00310.html
>
> That's from 2005.

:-)

>
> Anyway, I don't see why the cr3 issue matters, *unless* there is some
> situation where the scheduler can run with interrupts enabled. And why
> this is Xen-related, I have no idea.
>
> The Xen patches seem to have lost that
>
>   /* On Xen the line below does not always work. Needs investigating! */
>
> line when backporting the 2.6.29 patches to Xen. And clearly nobody
> investigated.
>
> So please do get me back-traces, and we'll investigate. Better late
> than never. But it does sound Xen-specific - although it's possible
> that Xen just triggers some timing (and has apparently been able to
> trigger it since 2005) that DaveJ now triggers on his one machine.

Yeah, this sounds plausible.

I'm working on the back traces right now, hope to have them soon.


Juergen

>
> So DaveJ, even though this does appear Xen-centric (Xentric?) and
> you're running on bare hardware, maybe you could do the same thing in
> that x86-64 vmalloc_fault(). The timing with Jürgen is kind of
> intriguing - if 3.18-rc made it happen much more often for him, maybe
> it really is very timing-sensitive, and you actually are seeing a
> non-Xen version of the same thing...
>
>                             Linus
>


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-26  6:21                     ` Linus Torvalds
  2014-11-26  6:52                       ` Juergen Gross
@ 2014-11-26  9:44                       ` Juergen Gross
  2014-11-26 14:34                       ` Dave Jones
  2 siblings, 0 replies; 486+ messages in thread
From: Juergen Gross @ 2014-11-26  9:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: the arch/x86 maintainers, Kernel Mailing List, Dave Jones,
	Konrad Rzeszutek Wilk, David Vrabel, xen-devel

On 11/26/2014 07:21 AM, Linus Torvalds wrote:
> On Tue, Nov 25, 2014 at 9:52 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> And leave it running for a while, and see if the trace is always the
>> same, or if there are variations on it...
>
> Amusing.
>
> Lookie here:
>
>     http://lists.xenproject.org/archives/html/xen-changelog/2005-08/msg00310.html
>
> That's from 2005.
>
> Anyway, I don't see why the cr3 issue matters, *unless* there is some
> situation where the scheduler can run with interrupts enabled. And why
> this is Xen-related, I have no idea.
>
> The Xen patches seem to have lost that
>
>   /* On Xen the line below does not always work. Needs investigating! */
>
> line when backporting the 2.6.29 patches to Xen. And clearly nobody
> investigated.
>
> So please do get me back-traces, and we'll investigate. Better late
> than never. But it does sound Xen-specific - although it's possible
> that Xen just triggers some timing (and has apparently been able to
> trigger it since 2005) that DaveJ now triggers on his one machine.
>
> So DaveJ, even though this does appear Xen-centric (Xentric?) and
> you're running on bare hardware, maybe you could do the same thing in
> that x86-64 vmalloc_fault(). The timing with Jürgen is kind of
> intriguing - if 3.18-rc made it happen much more often for him, maybe
> it really is very timing-sensitive, and you actually are seeing a
> non-Xen version of the same thing...

Very interesting: yesterday, after I'd gotten rid of the lockups, I
updated my test machine to the newest Xen version to avoid another
problem I was seeing. With this version I don't get the lockups any
more, even with the unmodified 3.18-rc kernel.

Digging deeper, I found something making me believe I've seen a
different issue from Dave's, one which just looked similar on the
surface. :-(

My Xen problem was related to an error in freeing grant pages (pages
mapped in from another domain). One detail in the handling of such
mappings is interesting: the "private" member of the page structure
is used to hold the machine frame number of the mapped memory page.
Another usage of this "private" member is in the pgd handling of Xen
(see xen_pgd_alloc() and xen_get_user_pgd()) to hold the pgd of the
user address space (kernel and user are in separate address spaces on
Xen). So with an error in the grant page handling I could imagine a
pgd's private member could be clobbered leading to effects like the one
I've observed. And this could have been the problem in 2005, too.
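
(A toy illustration of that clash -- standalone C with hypothetical
values, not the actual Xen code: struct page's "private" field has a
single owner at a time, so a second user silently corrupts the first.)

#include <stdio.h>

/* Stand-in for struct page's multi-purpose 'private' field. */
struct page_sketch { unsigned long private; };

int main(void)
{
	struct page_sketch pgd_page = { 0 };
	unsigned long user_pgd = 0xffff881f11111000UL;	/* hypothetical */
	unsigned long mfn = 0x1b688;			/* hypothetical */

	/* Xen pgd handling stashes the user address space's pgd here... */
	pgd_page.private = user_pgd;

	/* ...and a buggy grant-table path reuses the same field for a
	 * machine frame number, silently clobbering the user pgd. */
	pgd_page.private = mfn;

	/* xen_get_user_pgd() would now hand back garbage: */
	printf("user pgd now %#lx, expected %#lx\n",
	       pgd_page.private, user_pgd);
	return 0;
}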

And why is my patch working? I think it's just because cr3 is always
written with a page-aligned value, while the clobbered "private" member
of the Xen pgd is not page aligned, resulting in a different pointer.
I'm still using the wrong page for the user's pgd, but this seems not
to lead to fatal errors when nearly nothing is running on the machine.
I've seen Xen messages occasionally indicating there was something
wrong with the page table handling of the kernel (pages used as page
tables not known to Xen as such).

I hope this all makes sense.

And just for the records: with the actual Xen version (tweaked to
show the grant page error again) I see different lockups with the
following backtrace:

[ 1122.256305] NMI watchdog: BUG: soft lockup - CPU#94 stuck for 23s! [systemd-udevd:1179]
[ 1122.303427] Modules linked in: xen_blkfront msr bridge stp llc iscsi_ibft ipmi_devintf nls_utf8 x86_pkg_temp_thermal intel_powerclamp nls_cp437 coretemp crct10dif_pclmul vfat crc32_pclmul fat crc32c_intel ghash_clmulni_intel snd_pcm aesni_intel aes_x86_64 snd_timer lrw be2iscsi be2net gf128mul libiscsi snd glue_helper joydev vxlan soundcore scsi_transport_iscsi ablk_helper iTCO_wdt ixgbe igb mdio ip6_udp_tunnel iTCO_vendor_support efivars evdev iscsi_boot_sysfs udp_tunnel cryptd dca pcspkr sb_edac e1000e edac_core lpc_ich i2c_i801 ptp mfd_core pps_core shpchp tpm_infineon ipmi_si tpm_tis ipmi_msghandler tpm button xenfs xen_privcmd xen_acpi_processor processor thermal_sys xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn dm_mod efivarfs crc32c_generic btrfs xor raid6_pq hid_generic
[ 1122.303450]  usbhid hid sd_mod mgag200 ehci_pci i2c_algo_bit ehci_hcd drm_kms_helper ttm usbcore drm megaraid_sas usb_common sg scsi_mod autofs4
[ 1122.303456] CPU: 94 PID: 1179 Comm: systemd-udevd Tainted: G      L 3.18.0-rc5+ #304
[ 1122.303458] Hardware name: FUJITSU PRIMEQUEST 2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.59 07/24/2014
[ 1122.303459] task: ffff881f17b56ce0 ti: ffff881f0fff0000 task.ti: ffff881f0fff0000
[ 1122.303460] RIP: e030:[<ffffffff814fcf5e>]  [<ffffffff814fcf5e>] _raw_spin_lock+0x1e/0x30
[ 1122.303462] RSP: e02b:ffff881f0fff3ce8  EFLAGS: 00000282
[ 1122.303463] RAX: 000000000000ba43 RBX: 00003ffffffff000 RCX: 0000000000000190
[ 1122.303464] RDX: 0000000000000190 RSI: 000000190ba43067 RDI: ffffea000157c350
[ 1122.303465] RBP: ffff880000000c70 R08: 0000000000000000 R09: 0000000000000000
[ 1122.303466] R10: 000000000001b688 R11: ffff881fdf24ad80 R12: ffffea0000000000
[ 1122.303466] R13: ffff88006237cc70 R14: 0000000000000000 R15: 00007f70f438e000
[ 1122.303470] FS:  00007f70f5c49880(0000) GS:ffff881f4c5c0000(0000) knlGS:ffff881f4c5c0000
[ 1122.303471] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1122.303472] CR2: 00007f70f5c68000 CR3: 0000001f111b7000 CR4: 0000000000042660
[ 1122.303473] Stack:
[ 1122.303474]  ffffffff81155850 ffff881fdf24ad80 00007f70f438f000 ffff881f138ae5d8
[ 1122.303476]  ffff881f08ead400 ffff881f0fff3fd8 0000000000000000 ffff881eff0cbd08
[ 1122.303477]  ffff881f18b57d08 ffffea000157c320 ffffea006ccc5ec8 ffff881f0fc00800
[ 1122.303479] Call Trace:
[ 1122.303481]  [<ffffffff81155850>] ? copy_page_range+0x460/0xa10
[ 1122.303484]  [<ffffffff8105d727>] ? copy_process.part.27+0x13e7/0x1b10
[ 1122.303486]  [<ffffffff81435f41>] ? netlink_insert+0x91/0xb0
[ 1122.303488]  [<ffffffff813f85c9>] ? release_sock+0x19/0x160
[ 1122.303490]  [<ffffffff8105dff8>] ? do_fork+0xc8/0x320
[ 1122.303492]  [<ffffffff814fd779>] ? stub_clone+0x69/0x90
[ 1122.303493]  [<ffffffff814fd42d>] ? system_call_fastpath+0x16/0x1b
[ 1122.303494] Code: 90 0f b7 17 66 39 d0 75 f6 eb e8 66 90 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 89 d1 75 01 c3 0f b7 07 66 39 d0 74 f7 <f3> 90 0f b7 07 66 39 c8 75 f6 c3 0f 1f 80 00 00 00 00 65 81 04

But if my assumptions above are correct this is meaningless, as using
an arbitrary memory page as pgd might result in anything...


Juergen

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-26  6:21                     ` Linus Torvalds
  2014-11-26  6:52                       ` Juergen Gross
  2014-11-26  9:44                       ` Juergen Gross
@ 2014-11-26 14:34                       ` Dave Jones
  2014-11-26 17:37                         ` Linus Torvalds
  2 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-11-26 14:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Juergen Gross, the arch/x86 maintainers, Kernel Mailing List

On Tue, Nov 25, 2014 at 10:21:46PM -0800, Linus Torvalds wrote:
 
 > So DaveJ, even though this does appear Xen-centric (Xentric?) and
 > you're running on bare hardware, maybe you could do the same thing in
 > that x86-64 vmalloc_fault(). The timing with Jürgen is kind of
 > intriguing - if 3.18-rc made it happen much more often for him, maybe
 > it really is very timing-sensitive, and you actually are seeing a
 > non-Xen version of the same thing...

I did try your WARN variant (after fixing the typo)
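
That is, with the obvious s/pdg/pgd/ fix applied:

        pgd = (pgd_t *)__va(read_cr3()) + pgd_index(address);
        WARN_ON(pgd != pgd_offset(current->active_mm, address));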

Woke up to the below trace. Looks like a different issue.

Nnngh.

	Dave

NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [trinity-c149:24766]
CPU: 2 PID: 24766 Comm: trinity-c149 Not tainted 3.18.0-rc6+ #98 [loadavg: 156.09 150.24 148.56 21/402 26750]
task: ffff8802285b96f0 ti: ffff8802260e0000 task.ti: ffff8802260e0000
RIP: 0010:[<ffffffff8104658c>]  [<ffffffff8104658c>] kernel_map_pages+0xbc/0x120
RSP: 0018:ffff8802260e3768  EFLAGS: 00000202
RAX: 00000000001407e0 RBX: ffffffff817e0c24 RCX: 0000000000140760
RDX: 0000000000000202 RSI: ffff8800000006b0 RDI: 0000000000000001
RBP: ffff8802260e37c8 R08: 8000000000000063 R09: ffff880000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880200000001
R13: 0000000000010000 R14: 0000000001b60000 R15: 0000000000000000
FS:  00007fb8ef71d740(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000018b87f0 CR3: 00000002277fe000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 00007f60b71b4000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 ffff880123cd7000 ffff8802260e3768 0000000000000000 0000000000000003
 0000000000000000 0000000100000001 0000000000123cd6 0000000000000000
 0000000000000000 00000000ade00558 ffff8802445d7638 0000000000000001
Call Trace:
 [<ffffffff81185ebf>] get_page_from_freelist+0x49f/0xaa0
 [<ffffffff810a7431>] ? get_parent_ip+0x11/0x50
 [<ffffffff811866ee>] __alloc_pages_nodemask+0x22e/0xb60
 [<ffffffff810ad5c5>] ? local_clock+0x25/0x30
 [<ffffffff810c6e7c>] ? __lock_acquire.isra.31+0x22c/0x9f0
 [<ffffffff813775e0>] ? __radix_tree_preload+0x60/0xf0
 [<ffffffff810a7431>] ? get_parent_ip+0x11/0x50
 [<ffffffff810c546d>] ? lock_release_holdtime.part.24+0x9d/0x160
 [<ffffffff811d093e>] alloc_pages_vma+0xee/0x1b0
 [<ffffffff81194f0e>] ? shmem_alloc_page+0x6e/0xc0
 [<ffffffff810c6e7c>] ? __lock_acquire.isra.31+0x22c/0x9f0
 [<ffffffff81194f0e>] shmem_alloc_page+0x6e/0xc0
 [<ffffffff810a7431>] ? get_parent_ip+0x11/0x50
 [<ffffffff810a75ab>] ? preempt_count_sub+0x7b/0x100
 [<ffffffff8139ac66>] ? __percpu_counter_add+0x86/0xb0
 [<ffffffff811b2396>] ? __vm_enough_memory+0x66/0x1c0
 [<ffffffff8117cac5>] ? find_get_entry+0x5/0x120
 [<ffffffff81300937>] ? cap_vm_enough_memory+0x47/0x50
 [<ffffffff81197880>] shmem_getpage_gfp+0x4d0/0x7e0
 [<ffffffff81197bd2>] shmem_write_begin+0x42/0x70
 [<ffffffff8117c2d4>] generic_perform_write+0xd4/0x1f0
 [<ffffffff8117eac2>] __generic_file_write_iter+0x162/0x350
 [<ffffffff811f0070>] ? new_sync_read+0xd0/0xd0
 [<ffffffff8117ecef>] generic_file_write_iter+0x3f/0xb0
 [<ffffffff8117ecb0>] ? __generic_file_write_iter+0x350/0x350
 [<ffffffff811f01b8>] do_iter_readv_writev+0x78/0xc0
 [<ffffffff811f19e8>] do_readv_writev+0xd8/0x2a0
 [<ffffffff8117ecb0>] ? __generic_file_write_iter+0x350/0x350
 [<ffffffff8117ecb0>] ? __generic_file_write_iter+0x350/0x350
 [<ffffffff810c54b6>] ? lock_release_holdtime.part.24+0xe6/0x160
 [<ffffffff810a7431>] ? get_parent_ip+0x11/0x50
 [<ffffffff810a75ab>] ? preempt_count_sub+0x7b/0x100
 [<ffffffff817df36b>] ? _raw_spin_unlock_irq+0x3b/0x60
 [<ffffffff811f1c3c>] vfs_writev+0x3c/0x50
 [<ffffffff811f1dac>] SyS_writev+0x5c/0x100
 [<ffffffff817e0249>] tracesys_phase2+0xd4/0xd9
Code: 65 48 33 04 25 28 00 00 00 75 75 48 83 c4 50 5b 41 5c 5d c3 0f 1f 00 9c 5a fa 0f 20 e0 48 89 c1 80 e1 7f 0f 22 e1 0f 22 e0 52 9d <eb> cf 66 90 49 bc 00 00 00 00 00 88 ff ff 48 63 f6 49 01 fc 48 
sending NMI to other CPUs:


<nothing further on console, accidentally had panic=1 set>

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-26 14:34                       ` Dave Jones
@ 2014-11-26 17:37                         ` Linus Torvalds
  0 siblings, 0 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-26 17:37 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Juergen Gross,
	the arch/x86 maintainers, Kernel Mailing List

On Wed, Nov 26, 2014 at 6:34 AM, Dave Jones <davej@redhat.com> wrote:
>
> Woke up to the below trace. Looks like a different issue.

Yeah, apparently the Xen issue was really just a Xen bug.

> NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [trinity-c149:24766]
> RIP: 0010:[<ffffffff8104658c>]  [<ffffffff8104658c>] kernel_map_pages+0xbc/0x120

Well, this one at least makes some amount of sense. The "Code:" line says it's

  1b: 9c                   pushfq
  1c: 5a                   pop    %rdx
  1d: fa                   cli
  1e: 0f 20 e0             mov    %cr4,%rax
  21: 48 89 c1             mov    %rax,%rcx
  24: 80 e1 7f             and    $0x7f,%cl
  27: 0f 22 e1             mov    %rcx,%cr4
  2a: 0f 22 e0             mov    %rax,%cr4
  2d: 52                   push   %rdx
  2e: 9d                   popfq
  2f:* eb cf                 jmp    back <-- trapping instruction

and %rdx is 0x0202 which is actually a valid flags value.

That looks like the code for __native_flush_tlb_global().

Not that interrupts should have been disabled for very long.
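
(A hedged reconstruction of what those instructions do -- it mirrors
the decode above rather than the kernel source, and CR4 access is
privileged, so this won't run in userspace. Clearing and restoring
CR4.PGE (bit 7) flushes all TLB entries, including global ones, with
interrupts off around the window:)

static void flush_tlb_global_sketch(void)
{
	unsigned long flags, cr4;

	/* pushfq ; pop %rdx ; cli: save flags, disable interrupts */
	asm volatile("pushfq; popq %0; cli" : "=r" (flags) :: "memory");
	/* mov %cr4,%rax */
	asm volatile("mov %%cr4, %0" : "=r" (cr4));
	/* and $0x7f,%cl ; mov %rcx,%cr4: clear PGE -> global flush */
	asm volatile("mov %0, %%cr4" :: "r" (cr4 & ~(1UL << 7)));
	/* mov %rax,%cr4: restore PGE */
	asm volatile("mov %0, %%cr4" :: "r" (cr4));
	/* push %rdx ; popfq: restore the interrupt flag */
	asm volatile("pushq %0; popfq" :: "r" (flags) : "memory", "cc");
}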

> Call Trace:
>  [<ffffffff81185ebf>] get_page_from_freelist+0x49f/0xaa0
>  [<ffffffff811866ee>] __alloc_pages_nodemask+0x22e/0xb60
>  [<ffffffff811d093e>] alloc_pages_vma+0xee/0x1b0
>  [<ffffffff81194f0e>] shmem_alloc_page+0x6e/0xc0
>  [<ffffffff81197880>] shmem_getpage_gfp+0x4d0/0x7e0
>  [<ffffffff81197bd2>] shmem_write_begin+0x42/0x70
>  [<ffffffff8117c2d4>] generic_perform_write+0xd4/0x1f0
>  [<ffffffff8117eac2>] __generic_file_write_iter+0x162/0x350
>  [<ffffffff8117ecef>] generic_file_write_iter+0x3f/0xb0
>  [<ffffffff811f01b8>] do_iter_readv_writev+0x78/0xc0
>  [<ffffffff811f19e8>] do_readv_writev+0xd8/0x2a0
>  [<ffffffff811f1c3c>] vfs_writev+0x3c/0x50
>  [<ffffffff811f1dac>] SyS_writev+0x5c/0x100
>  [<ffffffff817e0249>] tracesys_phase2+0xd4/0xd9

Hmm. Maybe some OOM issue, where we spent a long time before this
trying to free pages?

                  Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-26  2:40             ` Dave Jones
@ 2014-11-26 22:57               ` Dave Jones
  2014-11-27  0:46                 ` Linus Torvalds
  2014-11-27 19:17                 ` Linus Torvalds
  0 siblings, 2 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-26 22:57 UTC (permalink / raw)
  To: Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Tue, Nov 25, 2014 at 09:40:32PM -0500, Dave Jones wrote:
 > On Tue, Nov 25, 2014 at 05:48:15PM -0800, Linus Torvalds wrote:
 > 
 >  > So that would either imply you have some staging driver (unlikely), or
 >  > more likely that 3.17 really already has the problem, it's just that
 >  > it needs some particular code alignment or phase of the moon or
 >  > something to trigger.
 > 
 > Maybe I'll try 3.17 + perf fix for an even longer runtime.
 > Like over Thanksgiving or something.

Dammit, dammit, dammit.

I didn't even have to wait that long.

[19861.135201] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [trinity-c132:26979]
[19861.135652] Modules linked in: snd_seq_dummy 8021q garp stp fuse tun hidp bnep rfcomm af_key llc2 scsi_transport_iscsi nfnetlink can_bcm nfc caif_socket caif af_802154 ieee802154 phonet af_rxrpc bluetooth can_raw can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose sctp libcrc32c x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi crct10dif_pclmul crc32c_intel snd_hda_intel snd_hda_controller snd_hda_codec ghash_clmulni_intel snd_hwdep pcspkr snd_seq snd_seq_device serio_raw usb_debug snd_pcm e1000e snd_timer microcode ptp snd pps_core shpchp soundcore nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc
[19861.138604] CPU: 1 PID: 26979 Comm: trinity-c132 Not tainted 3.17.0+ #2
[19861.139229] Hardware name: Intel Corporation Shark Bay Client platform/Flathead Creek Crb, BIOS HSWLPTU1.86C.0109.R03.1301282055 01/28/2013
[19861.139897] task: ffff8801ec6716f0 ti: ffff8801b5bf8000 task.ti: ffff8801b5bf8000
[19861.140564] RIP: 0010:[<ffffffff81369585>]  [<ffffffff81369585>] copy_user_enhanced_fast_string+0x5/0x10
[19861.141263] RSP: 0018:ffff8801b5bfbcf0  EFLAGS: 00010206
[19861.141974] RAX: ffff8801b5bfbe48 RBX: 0000000000000003 RCX: 0000000000000a1d
[19861.142688] RDX: 0000000000001000 RSI: 00007f6f89ef85e3 RDI: ffff8801750445e3
[19861.143416] RBP: ffff8801b5bfbd30 R08: 0000000000000000 R09: 0000000000000001
[19861.144164] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801b5bfbc78
[19861.144909] R13: ffff8801d702ed70 R14: ffffffff810a3d2b R15: ffff8801b5bfbc60
[19861.145668] FS:  00007f6f89eeb740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
[19861.146440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19861.147218] CR2: 00007f6f89ef3000 CR3: 00000001cddb5000 CR4: 00000000001407e0
[19861.148014] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19861.148828] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[19861.149640] Stack:
[19861.150443]  ffffffff8119a4f8 160000007c331000 0000000000001000 160000007c331000
[19861.151283]  0000000000001000 ffff8801b5bfbe58 0000000000000000 ffff8801d702f0a0
[19861.152133]  ffff8801b5bfbdc0 ffffffff81170474 ffff8801b5bfbd88 0000000000001000
[19861.152995] Call Trace:
[19861.153851]  [<ffffffff8119a4f8>] ? iov_iter_copy_from_user_atomic+0x78/0x1c0
[19861.154738]  [<ffffffff81170474>] generic_perform_write+0xf4/0x1e0
[19861.155636]  [<ffffffff811ff1da>] ? file_update_time+0xaa/0xf0
[19861.156536]  [<ffffffff81172ba2>] __generic_file_write_iter+0x162/0x350
[19861.157447]  [<ffffffff81172dcf>] generic_file_write_iter+0x3f/0xb0
[19861.158365]  [<ffffffff811e17ae>] new_sync_write+0x8e/0xd0
[19861.159287]  [<ffffffff811e202a>] vfs_write+0xba/0x1f0
[19861.160214]  [<ffffffff811e2e42>] SyS_pwrite64+0x92/0xc0
[19861.161152]  [<ffffffff817b62a4>] tracesys+0xdd/0xe2
[19861.162091] Code: 48 ff c6 48 ff c7 ff c9 75 f2 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 1f 00 c3 0f 1f 80 00 00 00 00 0f 1f 00 89 d1 <f3> a4 31 c0 0f 1f 00 c3 90 90 90 0f 1f 00 83 fa 08 0f 82 95 00 
[19861.164217] sending NMI to other CPUs:
[19861.165221] NMI backtrace for cpu 2
[19861.166099] CPU: 2 PID: 28083 Comm: trinity-c151 Not tainted 3.17.0+ #2
[19861.167084] Hardware name: Intel Corporation Shark Bay Client platform/Flathead Creek Crb, BIOS HSWLPTU1.86C.0109.R03.1301282055 01/28/2013
[19861.168113] task: ffff8800746116f0 ti: ffff8801c6894000 task.ti: ffff8801c6894000
[19861.169152] RIP: 0010:[<ffffffff810fb326>]  [<ffffffff810fb326>] smp_call_function_many+0x276/0x320
[19861.170223] RSP: 0000:ffff8801c6897b00  EFLAGS: 00000202
[19861.171295] RAX: 0000000000000001 RBX: ffff8802445d4c40 RCX: ffff8802443da408
[19861.172384] RDX: 0000000000000001 RSI: 0000000000000008 RDI: 0000000000000000
[19861.173483] RBP: ffff8801c6897b40 R08: ffff880242469ce0 R09: 0000000100180011
[19861.174590] R10: ffff880243c04240 R11: 0000000000000000 R12: 0000000000000001
[19861.175703] R13: 0000000000000000 R14: 0000000000000008 R15: 0000000000000008
[19861.176822] FS:  00007f6f89eeb740(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
[19861.177956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19861.179103] CR2: 0000000002400000 CR3: 0000000231685000 CR4: 00000000001407e0
[19861.180264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19861.181428] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[19861.182595] Stack:
[19861.183764]  ffff88024d64dd00 000000014d64dd00 00000000001d4c00 ffffffff82be41a0
[19861.184969]  0000000000000002 ffffffff8117a700 0000000000000000 0000000000000001
[19861.186167]  ffff8801c6897b78 ffffffff810fb542 0000000000000003 0000000000000008
[19861.187355] Call Trace:
[19861.188538]  [<ffffffff8117a700>] ? drain_pages+0xc0/0xc0
[19861.189709]  [<ffffffff810fb542>] on_each_cpu_mask+0x42/0xc0
[19861.190853]  [<ffffffff811768b1>] drain_all_pages+0x101/0x120
[19861.191989]  [<ffffffff8117af40>] __alloc_pages_nodemask+0x7d0/0xb20
[19861.193130]  [<ffffffff811c2b11>] alloc_pages_vma+0xf1/0x1b0
[19861.194258]  [<ffffffff811d705c>] ? do_huge_pmd_anonymous_page+0x10c/0x3e0
[19861.195367]  [<ffffffff811d705c>] do_huge_pmd_anonymous_page+0x10c/0x3e0
[19861.196450]  [<ffffffff811a10dc>] handle_mm_fault+0x14c/0xe90
[19861.197509]  [<ffffffff81041940>] ? __do_page_fault+0x140/0x600
[19861.198540]  [<ffffffff810419a4>] __do_page_fault+0x1a4/0x600
[19861.199550]  [<ffffffff810a3bcd>] ? get_parent_ip+0xd/0x50
[19861.200539]  [<ffffffff810a3d2b>] ? preempt_count_sub+0x6b/0xf0
[19861.201514]  [<ffffffff810c0b6e>] ? put_lock_stats.isra.23+0xe/0x30
[19861.202467]  [<ffffffff8136ad3d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[19861.203407]  [<ffffffff81041e0c>] do_page_fault+0xc/0x10
[19861.204331]  [<ffffffff817b7d72>] page_fault+0x22/0x30
[19861.205249] Code: 00 41 89 c4 39 f0 0f 8d 25 fe ff ff 48 63 d0 48 8b 0b 48 03 0c d5 a0 b9 d1 81 f6 41 18 01 74 14 0f 1f 44 00 00 f3 90 f6 41 18 01 <75> f8 48 63 35 45 3b c2 00 83 f8 ff 48 8b 7b 08 74 b0 39 c6 77 
[19861.207272] NMI backtrace for cpu 0
[19861.207376] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 42.050 msecs
[19861.209220] CPU: 0 PID: 28128 Comm: trinity-c242 Not tainted 3.17.0+ #2
[19861.210200] Hardware name: Intel Corporation Shark Bay Client platform/Flathead Creek Crb, BIOS HSWLPTU1.86C.0109.R03.1301282055 01/28/2013
[19861.211210] task: ffff8802168716f0 ti: ffff88007467c000 task.ti: ffff88007467c000
[19861.212215] RIP: 0010:[<ffffffff810c26ea>]  [<ffffffff810c26ea>] __lock_acquire.isra.31+0xfa/0x9f0
[19861.213248] RSP: 0000:ffff880244003cb0  EFLAGS: 00000046
[19861.214281] RAX: 0000000000000046 RBX: ffff8802168716f0 RCX: 0000000000000000
[19861.215318] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88023fcbbc40
[19861.216346] RBP: ffff880244003d18 R08: 0000000000000001 R09: 0000000000000000
[19861.217378] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[19861.218399] R13: 0000000000000000 R14: ffff88023fcbbc40 R15: 0000000000000000
[19861.219410] FS:  00007f6f89eeb740(0000) GS:ffff880244000000(0000) knlGS:0000000000000000
[19861.220425] CS:  0010 DS: 0000 ES: 0[19861.273167] NMI backtrace for cpu 3
[19861.273315] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 107.946 msecs
[19861.274821] CPU: 3 PID: 27913 Comm: trinity-c37 Not tainted 3.17.0+ #2
[19861.275672] Hardware name: Intel Corporation Shark Bay Client platform/Flathead Creek Crb, BIOS HSWLPTU1.86C.0109.R03.1301282055 01/28/2013
[19861.276543] task: ffff88009a735bc0 ti: ffff8801eda8c000 task.ti: ffff8801eda8c000
[19861.277422] RIP: 0010:[<ffffffff810fb322>]  [<ffffffff810fb322>] smp_call_function_many+0x272/0x320
[19861.278339] RSP: 0000:ffff8801eda8fb00  EFLAGS: 00000202
[19861.279220] RAX: 0000000000000001 RBX: ffff8802447d4c40 RCX: ffff8802443da428
[19861.280115] RDX: 0000000000000001 RSI: 0000000000000008 RDI: 0000000000000000
[19861.281021] RBP: ffff8801eda8fb40 R08: ffff880242469a40 R09: 0000000100180011
[19861.281917] R10: ffff880243c04240 R11: 0000000000000000 R12: 0000000000000001
[19861.282813] R13: 0000000000000000 R14: 0000000000000008 R15: 0000000000000008
[19861.283690] FS:  00007f6f89eeb740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
[19861.284570] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19861.285466] CR2: 0000000002400000 CR3: 00000001cdd92000 CR4: 00000000001407e0
[19861.286363] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19861.287255] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[19861.288142] Stack:
[19861.289024]  ffff88024d64dd00 000000014d64dd00 00000000001d4c00 ffffffff82be41a0
[19861.289934]  0000000000000003 ffffffff8117a700 0000000000000000 0000000000000001
[19861.290845]  ffff8801eda8fb78 ffffffff810fb542 0000000000000003 0000000000000008
[19861.291763] Call Trace:
[19861.292664]  [<ffffffff8117a700>] ? drain_pages+0xc0/0xc0
[19861.293582]  [<ffffffff810fb542>] on_each_cpu_mask+0x42/0xc0
[19861.294501]  [<ffffffff811768b1>] drain_all_pages+0x101/0x120
[19861.295439]  [<ffffffff8117af40>] __alloc_pages_nodemask+0x7d0/0xb20
[19861.296369]  [<ffffffff811c2b11>] alloc_pages_vma+0xf1/0x1b0
[19861.297292]  [<ffffffff811d705c>] ? do_huge_pmd_anonymous_page+0x10c/0x3e0
[19861.298218]  [<ffffffff811d705c>] do_huge_pmd_anonymous_page+0x10c/0x3e0
[19861.299146]  [<ffffffff811a10dc>] handle_mm_fault+0x14c/0xe90
[19861.300078]  [<ffffffff81041940>] ? __do_page_fault+0x140/0x600
[19861.301011]  [<ffffffff810419a4>] __do_page_fault+0x1a4/0x600
[19861.301946]  [<ffffffff810a3bcd>] ? get_parent_ip+0xd/0x50
[19861.302874]  [<ffffffff810a3d2b>] ? preempt_count_sub+0x6b/0xf0
[19861.303805]  [<ffffffff810c0b6e>] ? put_lock_stats.isra.23+0xe/0x30
[19861.304736]  [<ffffffff8136ad3d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[19861.305665]  [<ffffffff81041e0c>] do_page_fault+0xc/0x10
[19861.306590]  [<ffffffff817b7d72>] page_fault+0x22/0x30
[19861.307527] Code: 35 78 3b c2 00 41 89 c4 39 f0 0f 8d 25 fe ff ff 48 63 d0 48 8b 0b 48 03 0c d5 a0 b9 d1 81 f6 41 18 01 74 14 0f 1f 44 00 00 f3 90 <f6> 41 18 01 75 f8 48 63 35 45 3b c2 00 83 f8 ff 48 8b 7b 08 74 
[19861.309600] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 144.376 msecs


So 3.17 also has this problem.
Good news I guess in that it's not a regression, but damn I really didn't
want to have to go digging through the mists of time to find the last 'good' point.
At least it shouldn't hold up 3.18

I'll do a couple builds to run over the holidays, but next week
I think I'm going to need to approach this differently to add
more debugging somewhere/somehow.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-26 22:57               ` Dave Jones
@ 2014-11-27  0:46                 ` Linus Torvalds
  2014-11-27 19:17                 ` Linus Torvalds
  1 sibling, 0 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-11-27  0:46 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Wed, Nov 26, 2014 at 2:57 PM, Dave Jones <davej@redhat.com> wrote:
>
> So 3.17 also has this problem.
> Good news I guess in that it's not a regression, but damn I really didn't
> want to have to go digging through the mists of time to find the last 'good' point.
> At least it shouldn't hold up 3.18

Ugh. That still doesn't make me very happy.

I'll try to think about this more.

                  Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-26 22:57               ` Dave Jones
  2014-11-27  0:46                 ` Linus Torvalds
@ 2014-11-27 19:17                 ` Linus Torvalds
  2014-11-27 22:56                   ` Dave Jones
  1 sibling, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-27 19:17 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel,
	the arch/x86 maintainers, Don Zickus

On Wed, Nov 26, 2014 at 2:57 PM, Dave Jones <davej@redhat.com> wrote:
>
> So 3.17 also has this problem.
> Good news I guess in that it's not a regression, but damn I really didn't
> want to have to go digging through the mists of time to find the last 'good' point.

So I'm looking at the watchdog code, and it seems racy wrt parking and startup.

In particular, it sets the high priority *after* starting the hrtimer,
and it goes back to SCHED_NORMAL *before* canceling the timer.

Which seems completely ass-backwards. And the smp_hotplug_thread stuff
explicitly enables preemption around the setup/cleanup/park/unpark
operations.
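
For reference, the enable/disable paths look roughly like this
(condensed from kernel/watchdog.c of this era; approximate, details
trimmed):

	static void watchdog_enable(unsigned int cpu)
	{
		struct hrtimer *hrtimer = raw_cpu_ptr(&watchdog_hrtimer);

		hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
		hrtimer->function = watchdog_timer_fn;
		watchdog_nmi_enable(cpu);

		/* the timer starts firing here... */
		hrtimer_start(hrtimer, ns_to_ktime(sample_period),
			      HRTIMER_MODE_REL_PINNED);

		/* ...and only then do we become SCHED_FIFO */
		watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1);
		__touch_watchdog();
	}

	static void watchdog_disable(unsigned int cpu)
	{
		struct hrtimer *hrtimer = raw_cpu_ptr(&watchdog_hrtimer);

		/* back to SCHED_NORMAL while the timer is still running */
		watchdog_set_prio(SCHED_NORMAL, 0);
		hrtimer_cancel(hrtimer);
		watchdog_nmi_disable(cpu);
	}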

However, that would be an issue only if trinity might be doing things
that enable and disable the watchdog. And doing so under insane loads.
Even then it seems unlikely.

The insane loads you have. But even then, could a load average of 169
possibly delay running a non-RT process for 22 seconds? Doubtful.

But just in case: do you do cpu hotplug events (that will disable and
re-enable the watchdog process)?  Anything else that will park/unpark
the hotplug thread?

Quite frankly, I'm just grasping for straws here, but a lot of the
watchdog traces really have seemed spurious...

                   Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-27 19:17                 ` Linus Torvalds
@ 2014-11-27 22:56                   ` Dave Jones
  2014-11-29 20:38                     ` Dâniel Fraga
  2014-12-01 16:56                     ` Don Zickus
  0 siblings, 2 replies; 486+ messages in thread
From: Dave Jones @ 2014-11-27 22:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel, the arch/x86 maintainers, Don Zickus

On Thu, Nov 27, 2014 at 11:17:16AM -0800, Linus Torvalds wrote:
 > On Wed, Nov 26, 2014 at 2:57 PM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > So 3.17 also has this problem.
 > > Good news I guess in that it's not a regression, but damn I really didn't
 > > want to have to go digging through the mists of time to find the last 'good' point.
 > 
 > So I'm looking at the watchdog code, and it seems racy wrt parking and startup.
 > 
 > In particular, it sets the high priority *after* starting the hrtimer,
 > and it goes back to SCHED_NORMAL *before* canceling the timer.
 > 
 > Which seems completely ass-backwards. And the smp_hotplug_thread stuff
 > explicitly enables preemption around the setup/cleanup/park/unpark
 > operations.
 > 
 > However, that would be an issue only if trinity might be doing things
 > that enable and disable the watchdog. And doing so under insane loads.
 > Even then it seems unlikely.
 > 
 > The insane loads you have. But even then, could a load average of 169
 > possibly delay running a non-RT process for 22 seconds? Doubtful.
 > 
 > But just in case: do you do cpu hotplug events (that will disable and
 > re-enable the watchdog process)?  Anything else that will park/unpark
 > the hotplug thread?

That's root-only iirc, and I'm not running trinity as root, so that
shouldn't be happening. There's also no sign of such behaviour in dmesg
when the problem occurs.

 > Quite frankly, I'm just grasping for straws here, but a lot of the
 > watchdog traces really have seemed spurious...

Agreed.

Currently leaving 3.16 running. 21hrs so far.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-27 22:56                   ` Dave Jones
@ 2014-11-29 20:38                     ` Dâniel Fraga
  2014-11-30 20:45                       ` Linus Torvalds
  2014-12-01 16:56                     ` Don Zickus
  1 sibling, 1 reply; 486+ messages in thread
From: Dâniel Fraga @ 2014-11-29 20:38 UTC (permalink / raw)
  To: linux-kernel

On Thu, 27 Nov 2014 17:56:37 -0500
Dave Jones <davej@redhat.com> wrote:

> Agreed.
> 
> Currently leaving 3.16 running. 21hrs so far.

	Dave, I think I reported this bug here:

https://bugzilla.kernel.org/show_bug.cgi?id=85941

	Just posting in case the call trace helps...

	In my case it happens when I watch a video on YouTube or play an
audio file...

-- 
Linux 3.16.0-00115-g19583ca: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL



^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-29 20:38                     ` Dâniel Fraga
@ 2014-11-30 20:45                       ` Linus Torvalds
  2014-11-30 21:21                         ` Dâniel Fraga
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-11-30 20:45 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Linux Kernel Mailing List

On Sat, Nov 29, 2014 at 12:38 PM, Dâniel Fraga <fragabr@gmail.com> wrote:
>
>         Dave, I think I reported this bug in this bug report:

Yours looks very different. Dave (and Sasha Levin) have reported
rcu_preempt stalls too, but it's not clear it's the same issue.

In case yours is repeatable (you seem to say it is), can you try it
without TREE_PREEMPT_RCU?

                     Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-30 20:45                       ` Linus Torvalds
@ 2014-11-30 21:21                         ` Dâniel Fraga
  2014-12-01  0:21                           ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Dâniel Fraga @ 2014-11-30 21:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List

On Sun, 30 Nov 2014 12:45:31 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Yours looks very different. Dave (and Sasha Levin) have reported
> rcu_preempt stalls too, but it's not clear it's the same issue.
> 
> In case yours is repeatable (you seem to say it is), can you try it
> without TREE_PREEMPT_RCU?

	Yes, but "menuconfig" doesn't allow me to disable it (it's
always checked). Newbie question: does TREE_PREEMPT_RCU depend on any
other option? Thanks.

-- 
Linux 3.16.0-00115-g19583ca: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-30 21:21                         ` Dâniel Fraga
@ 2014-12-01  0:21                           ` Linus Torvalds
  2014-12-01  1:02                             ` Dâniel Fraga
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-12-01  0:21 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Linux Kernel Mailing List

On Sun, Nov 30, 2014 at 1:21 PM, Dâniel Fraga <fragabr@gmail.com> wrote:
>
>         Yes, but "menuconfig" doesn't allow me to disable it (it's
> always checked). Newbie question: does TREE_PREEMPT_RCU depend on any
> other option? Thanks.

Maybe you'll have to turn off RCU_CPU_STALL_VERBOSE first.

Although I think you should be able to just edit the .config file,
delete the line that says

    CONFIG_TREE_PREEMPT_RCU=y

and then just do a "make oldconfig", and then verify that
TREE_PREEMPT_RCU hasn't been re-enabled by some dependency. But it
shouldn't have, and that "make oldconfig" should get rid of anything
that depends on TREE_PREEMPT_RCU.

                  Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01  0:21                           ` Linus Torvalds
@ 2014-12-01  1:02                             ` Dâniel Fraga
  2014-12-01 19:14                               ` Paul E. McKenney
  0 siblings, 1 reply; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-01  1:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List

On Sun, 30 Nov 2014 16:21:19 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Maybe you'll have to turn off RCU_CPU_STALL_VERBOSE first.
> 
> Although I think you should be able to just edit the .config file,
> delete the line that says
> 
>     CONFIG_TREE_PREEMPT_RCU=y
> 
> and then just do a "make oldconfig", and then verify that
> TREE_PREEMPT_RCU hasn't been re-enabled by some dependency. But it
> shouldn't have, and that "make oldconfig" should get rid of anything
> that depends on TREE_PREEMPT_RCU.
	
	Ok, I did exactly that, but CONFIG_TREE_PREEMPT_RCU is
re-enabled. I talked with Pranith Kumar and he suggested I could just
disable preemption (No Forced Preemption (Server)) and that's the only
way to disable CONFIG_TREE_PREEMPT_RCU.

	Now I'll try to make the system freeze, then I'll send
you the Call trace.

	Thanks.

-- 
Linux 3.17.0: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-25 12:22                                     ` Will Deacon
@ 2014-12-01 11:48                                       ` Will Deacon
  2014-12-01 17:05                                         ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Will Deacon @ 2014-12-01 11:48 UTC (permalink / raw)
  To: Dave Jones, Andy Lutomirski, Linus Torvalds, Don Zickus,
	Thomas Gleixner, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra

On Tue, Nov 25, 2014 at 12:22:17PM +0000, Will Deacon wrote:
> I'm not sure if this is useful, but I've been seeing trinity lockups
> on arm64 as well. Sometimes they happen a few times a day, sometimes it
> takes a few days (I just saw my first one on -rc6, for example).
> 
> However, I have a little bit more trace than you do and *every single time*
> the lockup has involved an execve to a virtual file system.

Ok, just hit another one of these and I have a little bit more info this
time. The trinity log is:

[child1:27912] [438] execve(name="/proc/602/task/602/oom_score", argv=0x3a8426c0, envp=0x3a7a3bb0) = 0    # wtf
[child0:27837] [1081] setfsuid(uid=-128) = 0x3fffb000
[child0:27837] [1082] shmdt(shmaddr=0x7f92269000) = -1 (Invalid argument)
[child0:27837] [1083] fchmod(fd=676, mode=5237) = -1 (Operation not permitted)
[child0:27837] [1084] setuid(uid=0xffffe000) = -1 (Operation not permitted)
[child0:27837] [1085] newfstatat(dfd=676, filename="/proc/612/fdinfo/390", statbuf=0x7f935d8000, flag=0x0) = -1 (Permission denied)
[child0:27837] [1086] process_vm_readv(pid=0, lvec=0x3a7a3d70, liovcnt=47, rvec=0x3a7a4070, riovcnt=74, flags=0x0) = -1 (No such process)
[child0:27837] [1087] clock_gettime(which_clock=0x890000000000003f, tp=0x0) = -1 (Invalid argument)
[child0:27837] [1088] accept(fd=676, upeer_sockaddr=0x3a842360, upeer_addrlen=16) = -1 (Socket operation on non-socket)
[child0:27837] [1089] getpid() = 0x6cbd
[child0:27837] [1090] getpeername(fd=496, usockaddr=0x3a842310, usockaddr_len=16) = -1 (Socket operation on non-socket)
[child0:27837] [1091] timer_getoverrun(timer_id=0x4ff8e1) = -1 (Invalid argument)
[child0:27837] [1092] sigaltstack(uss=0x7f93069000, uoss=0x0, regs=0x0) = -1 (Invalid argument)
[child0:27837] [1093] io_cancel(ctx_id=-3, iocb=0x0, result=0xffffffc000080000) = -1 (Bad address)
[child0:27837] [1094] mknodat(dfd=496, filename="/proc/irq/84/affinity_hint", mode=0xa2c03013110804a0, dev=0xfbac6adf1379fada) = -1 (File exists)
[child0:27837] [1095] clock_nanosleep(which_clock=0x2, flags=0x1, rqtp=0x0, rmtp=0xffffffc000080000) = -1 (Bad address)
[child0:27837] [1096] reboot(magic1=-52, magic2=0xffffff1edbdf7fff, cmd=0xffb5179bfafbfbff, arg=0x0) = -1 (Operation not permitted)
[child0:27837] [1097] sched_yield() = 0
[child0:27837] [1098] getpid() = 0x6cbd
[child0:27837] [1099] newuname(name=0x400008200405) = -1 (Bad address)
[child0:27837] [1100] vmsplice(fd=384, iov=0x3a88fc20, nr_segs=687, flags=0x2) = -1 (Resource temporarily unavailable)
[child0:27837] [1101] timerfd_gettime(ufd=496, otmr=0x1) = -1 (Invalid argument)
[child0:27837] [1102] getcwd(buf=0x0, size=111) = -1 (Bad address)
[child0:27837] [1103] setdomainname(name=0x0, len=0) = -1 (Operation not permitted)
[child0:27837] [1104] sched_getparam(pid=0, param=0xbaedc7bf7ffaf2fe) = -1 (Bad address)
[child0:27837] [1105] readlinkat(dfd=496, pathname="/proc/4/task/4/net/netstat", buf=0x7f935d4000, bufsiz=0) = -1 (Invalid argument)
[child0:27837] [1106] shmctl(shmid=0xa1000000000000ff, cmd=0x7dad54836e49ff1d, buf=0x900000000000002c) = -1 (Invalid argument)
[child0:27837] [1107] getpgid(pid=0) = 0x6cbd
[child0:27837] [1108] flistxattr(fd=496, list=0xffffffffffffffdf, size=0xe7ff) = 0
[child0:27837] [1109] remap_file_pages(start=0x7f9324b000, size=0xfffffffffffaaead, prot=0, pgoff=0, flags=0x0) = -1 (Invalid argument)
[child0:27837] [1110] io_submit(ctx_id=0xffbf, nr=0xffbef, iocbpp=0x8) = -1 (Invalid argument)
[child0:27837] [1111] flistxattr(fd=384, list=0x0, size=0) = 0
[child0:27837] [1112] semtimedop(semid=0xffffffffefffffff, tsops=0x0, nsops=0xfffffffff71a7113, timeout=0xffffffa9) = -1 (Invalid argument)
[child0:27837] [1113] ioctl(fd=384, cmd=0x5100000080000000, arg=362) = -1 (Inappropriate ioctl for device)
[child0:27837] [1114] futex(uaddr=0x0, op=0xb, val=0x80000000000000de, utime=0x8, uaddr2=0x0, val3=0xffffffff00000fff) = -1 (Bad address)
[child0:27837] [1115] listxattr(pathname="/proc/219/net/softnet_stat", list=0x0, size=152) = 0
[child0:27837] [1116] getrusage(who=0xffffffffff080808, ru=0xffffffc000080000) = -1 (Invalid argument)
[child0:27837] [1117] clock_settime(which_clock=0xffffffff7fffffff, tp=0x0) = -1 (Invalid argument)
[child0:27837] [1118] mremap(addr=0x6680000000, old_len=0, new_len=8192, flags=0x2, new_addr=0x5080400000) = -1 (Invalid argument)
[child0:27837] [1119] waitid(which=0x80000702c966254, upid=0, infop=0x7f90069000, options=-166, ru=0x7f90069004) = -1 (Invalid argument)
[child0:27837] [1120] sigaltstack(uss=0x40000000bd5fff6f, uoss=0x8000000000000000, regs=0x0) = -1 (Bad address)
[child0:27837] [1121] timer_delete(timer_id=0x4300d68e28803329) = -1 (Invalid argument)
[child0:27837] [1122] preadv(fd=384, vec=0x3a88fc20, vlen=173, pos_l=0x82000000ff804000, pos_h=96) = -1 (Invalid argument)
[child0:27837] [1123] getdents64(fd=384, dirent=0x7f90a69000, count=0x2ab672e3) = -1 (Not a directory)
[child0:27837] [1124] mlock(addr=0x7f92e69000, len=0x1e0000) 

so for some bizarre reason, child1 (27912) managed to execve oom_score
from /proc. mlock then hangs waiting for a completion in flush_work,
although I'm not sure how the execve is responsible for that.
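
(For context: mlock gets to that flush_work() via lru_add_drain_all(),
which -- roughly, going by this era's mm/swap.c -- queues a drain work
item on every cpu that has pending pagevecs and then waits for each of
them in turn:

	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		if (/* cpu has pages sitting in its pagevecs */) {
			INIT_WORK(work, lru_add_drain_per_cpu);
			schedule_work_on(cpu, work);
			cpumask_set_cpu(cpu, &has_work);
		}
	}

	for_each_cpu(cpu, &has_work)
		flush_work(&per_cpu(lru_add_drain_work, cpu));

so trinity-c0 stays in D state until every cpu's workqueue gets around
to running its drain work.)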

Looking at the task trace:


SysRq : Show State
  task                        PC stack   pid father

[...]

deferwq         S ffffffc0000855b0     0   599      2 0x00000000
Call trace:
[<ffffffc0000855b0>] __switch_to+0x74/0x8c
[<ffffffc000534214>] __schedule+0x214/0x680
[<ffffffc0005346a4>] schedule+0x24/0x74
[<ffffffc0000c5780>] rescuer_thread+0x200/0x29c
[<ffffffc0000ca404>] kthread+0xd8/0xf0
sh              S ffffffc0000855b0     0   602      1 0x00000000
Call trace:
[<ffffffc0000855b0>] __switch_to+0x74/0x8c
[<ffffffc000534214>] __schedule+0x214/0x680
[<ffffffc0005346a4>] schedule+0x24/0x74
[<ffffffc0000b1f94>] do_wait+0x1c4/0x1fc
[<ffffffc0000b306c>] SyS_wait4+0x74/0xf0
trinity         S ffffffc0000855b0     0   610    602 0x00000000
Call trace:
[<ffffffc0000855b0>] __switch_to+0x74/0x8c
[<ffffffc000534214>] __schedule+0x214/0x680
[<ffffffc0005346a4>] schedule+0x24/0x74
[<ffffffc0000b1f94>] do_wait+0x1c4/0x1fc
[<ffffffc0000b306c>] SyS_wait4+0x74/0xf0
trinity-watchdo R  running task        0   611    610 0x00000000
Call trace:
[<ffffffc0000855b0>] __switch_to+0x74/0x8c
[<ffffffc000534214>] __schedule+0x214/0x680
[<ffffffc0005346a4>] schedule+0x24/0x74
[<ffffffc0005373a0>] do_nanosleep+0xcc/0x134
[<ffffffc0000f9da4>] hrtimer_nanosleep+0x88/0x108
[<ffffffc0000f9eb0>] SyS_nanosleep+0x8c/0xa4
trinity-main    S ffffffc0000855b0     0   612    610 0x00000000
Call trace:
[<ffffffc0000855b0>] __switch_to+0x74/0x8c
[<ffffffc000534214>] __schedule+0x214/0x680
[<ffffffc0005346a4>] schedule+0x24/0x74
[<ffffffc0000b1f94>] do_wait+0x1c4/0x1fc
[<ffffffc0000b306c>] SyS_wait4+0x74/0xf0
trinity-c0      D ffffffc0000855b0     0 27837    612 0x00000000
Call trace:
[<ffffffc0000855b0>] __switch_to+0x74/0x8c
[<ffffffc000534214>] __schedule+0x214/0x680
[<ffffffc0005346a4>] schedule+0x24/0x74
[<ffffffc000537204>] schedule_timeout+0x134/0x18c
[<ffffffc000535364>] wait_for_common+0x9c/0x144
[<ffffffc00053541c>] wait_for_completion+0x10/0x1c
[<ffffffc0000c4cdc>] flush_work+0xbc/0x168
[<ffffffc00013f608>] lru_add_drain_all+0x12c/0x180
[<ffffffc00015cb78>] SyS_mlock+0x20/0x118
trinity-c1      R  running task        0 27912    612 0x00000000
Call trace:
[<ffffffc0000855b0>] __switch_to+0x74/0x8c
trinity-c1      R  running task        0 27921  27912 0x00000000
Call trace:


We can see the child that did the execve has somehow gained its own
child process (27921) that we're unable to backtrace. I can't see any
clone/fork syscalls in the log for 27912.

At this point, both of the CPUs are sitting in idle, so there's nothing
interesting in their register dumps.

Still confused.

Will

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-11-27 22:56                   ` Dave Jones
  2014-11-29 20:38                     ` Dâniel Fraga
@ 2014-12-01 16:56                     ` Don Zickus
  1 sibling, 0 replies; 486+ messages in thread
From: Don Zickus @ 2014-12-01 16:56 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel, the arch/x86 maintainers

On Thu, Nov 27, 2014 at 05:56:37PM -0500, Dave Jones wrote:
> On Thu, Nov 27, 2014 at 11:17:16AM -0800, Linus Torvalds wrote:
>  > On Wed, Nov 26, 2014 at 2:57 PM, Dave Jones <davej@redhat.com> wrote:
>  > >
>  > > So 3.17 also has this problem.
>  > > Good news I guess in that it's not a regression, but damn I really didn't
>  > > want to have to go digging through the mists of time to find the last 'good' point.
>  > 
>  > So I'm looking at the watchdog code, and it seems racy wrt parking and startup.
>  > 
>  > In particular, it sets the high priority *after* starting the hrtimer,
>  > and it goes back to SCHED_NORMAL *before* canceling the timer.
>  > 
>  > Which seems completely ass-backwards. And the smp_hotplug_thread stuff
>  > explicitly enables preemption around the setup/cleanup/park/unpark
>  > operations.
>  > 
>  > However, that would be an issue only if trinity might be doing things
>  > that enable and disable the watchdog. And doing so under insane loads.
>  > Even then it seems unlikely.
>  > 
>  > The insane loads you have. But even then, could a load average of 169
>  > possibly delay running a non-RT process for 22 seconds? Doubtful.
>  > 
>  > But just in case: do you do cpu hotplug events (that will disable and
>  > re-enable the watchdog process)?  Anything else that will park/unpark
>  > the hotplug thread?
> 
> That's root-only iirc, and I'm not running trinity as root, so that
> shouldn't be happening. There's also no sign of such behaviour in dmesg
> when the problem occurs.

Yeah, the watchdog code is very chatty during thread 'unparking'.  If
Dave's dmesg log isn't seeing any:

"enabled on all CPUs, permanently consumes one hw-PMU counter"

except on boot, then I believe the park/unpark race you see shouldn't
be occurring in this scenario.


> 
>  > Quite frankly, I'm just grasping for straws here, but a lot of the
>  > watchdog traces really have seemed spurious...
> 
> Agreed.

Well we can explore this route..

I added a patch below that just logs the watchdog timer function and
kernel thread for each cpu.  It's a little chatty but every 4 seconds you
will see something like this in the logs:

[ 2507.580184] 1: watchdog process kicked (reset)
[ 2507.581154] 0: watchdog process kicked (reset)
[ 2507.581172] 0: watchdog run
[ 2507.593469] 1: watchdog run
[ 2507.595106] 2: watchdog process kicked (reset)
[ 2507.595120] 2: watchdog run
[ 2507.608136] 3: watchdog process kicked (reset)
[ 2507.613204] 3: watchdog run

With the printk timestamps it would be interesting to see what the
watchdog was doing in its final moments and if the timestamps verify the
exceeded duration or if the watchdog screws up the calculation and falsely
reports a lockup.

Cheers,
Don


diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 70bf118..b1ea06c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -324,6 +324,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 	hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
 
 	if (touch_ts == 0) {
+		printk("%d: watchdog process kicked (reset)\n", smp_processor_id());
 		if (unlikely(__this_cpu_read(softlockup_touch_sync))) {
 			/*
 			 * If the time stamp was touched atomically
@@ -346,6 +347,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 	 * this is a good indication some task is hogging the cpu
 	 */
 	duration = is_softlockup(touch_ts);
+	printk("%d: watchdog process kicked (%d seconds since last)\n", smp_processor_id(), duration);
 	if (unlikely(duration)) {
 		/*
 		 * If a virtual machine is stopped by the host it can look to
@@ -477,6 +479,7 @@ static void watchdog(unsigned int cpu)
 	__this_cpu_write(soft_lockup_hrtimer_cnt,
 			 __this_cpu_read(hrtimer_interrupts));
 	__touch_watchdog();
+	printk("%d: watchdog run\n", cpu);
 }
 
 #ifdef CONFIG_HARDLOCKUP_DETECTOR

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 11:48                                       ` Will Deacon
@ 2014-12-01 17:05                                         ` Linus Torvalds
  2014-12-01 17:10                                           ` Will Deacon
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-12-01 17:05 UTC (permalink / raw)
  To: Will Deacon
  Cc: Dave Jones, Andy Lutomirski, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra

On Mon, Dec 1, 2014 at 3:48 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> so for some bizarre reason, child1 (27912) managed to execve oom_score
> from /proc.

That sounds like you have a binfmt that accepts crap. Possibly
ARM-specific, although more likely it's just a misc script.

> We can see the child that did the execve has somehow gained its own
> child process (27921) that we're unable to backtrace. I can't see any
> clone/fork syscalls in the log for 27912.

Well, it wouldn't be trinity any more, it would likely be some execve
script (think "/bin/sh", except likely through binfmt_misc).

Do you have anything in /proc/sys/fs/binfmt_misc? I don't see anything
else that would trigger it.

This doesn't really look anything like DaveJ's issue, but who knows..

                       Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 17:05                                         ` Linus Torvalds
@ 2014-12-01 17:10                                           ` Will Deacon
  2014-12-01 17:53                                             ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Will Deacon @ 2014-12-01 17:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Andy Lutomirski, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra

On Mon, Dec 01, 2014 at 05:05:06PM +0000, Linus Torvalds wrote:
> On Mon, Dec 1, 2014 at 3:48 AM, Will Deacon <will.deacon@arm.com> wrote:
> > so for some bizarre reason, child1 (27912) managed to execve oom_score
> > from /proc.
> 
> That sounds like you have a binfmt that accepts crap. Possibly
> ARM-specific, although more likely it's just a misc script.
> 
> > We can see the child that did the execve has somehow gained its own
> > child process (27921) that we're unable to backtrace. I can't see any
> > clone/fork syscalls in the log for 27912.
> 
> Well, it wouldn't be trinity any more, it would likely be some execve
> script (think "/bin/sh", except likely through binfmt_misc).
> 
> Do you have anything in /proc/sys/fs/binfmt_misc? I don't see anything
> else that would trigger it.

So I don't even have binfmt-misc compiled in. The two handlers I have are
BINFMT_ELF and BINFMT_SCRIPT, but they both check for headers that we won't
get back from oom_score afaict.
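
Roughly what those two check (condensed from fs/binfmt_script.c and
fs/binfmt_elf.c, so approximate):

	/* binfmt_script: wants a "#!" shebang in the first two bytes */
	if ((bprm->buf[0] != '#') || (bprm->buf[1] != '!'))
		return -ENOEXEC;

	/* binfmt_elf: wants the "\177ELF" magic */
	if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
		goto out;

and reading oom_score should give back neither.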

> This doesn't really look anything like DaveJ's issue, but who knows..

It's the only lockup I'm seeing on arm64 with trinity, but I agree that it's
not very helpful.

Will

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 17:10                                           ` Will Deacon
@ 2014-12-01 17:53                                             ` Linus Torvalds
  2014-12-01 18:25                                               ` Kirill A. Shutemov
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-12-01 17:53 UTC (permalink / raw)
  To: Will Deacon, Tejun Heo
  Cc: Dave Jones, Andy Lutomirski, Don Zickus, Thomas Gleixner,
	Linux Kernel, the arch/x86 maintainers, Peter Zijlstra

On Mon, Dec 1, 2014 at 9:10 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> So I don't even have binfmt-misc compiled in. The two handlers I have are
> BINFMT_ELF and BINFMT_SCRIPT, but they both check for headers that we won't
> get back from oom_score afaict.

Hmm. So I can't even get that "oom_score" file to be executable in the
first place, which should mean that execve() should terminate very
quickly with an EACCES error.

The fact that you have a "flush_work()" that is waiting for completion
is interesting. Maybe the odd new thread is a worker thread for some
modprobe or similar, and we end up waiting for it.  There's that whole

   request_module("binfmt-%04x", *(ushort *)(bprm->buf + 2))

which ends up creating a new work item. Maybe the flush_work() is waiting
for that whole mess. Adding Tejun to the cc, since there *were*
changes to workqueues etc since 3.16..

Tejun, full thread on lkml, I'm assuming you can find it in your mail archives..

                      Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 17:53                                             ` Linus Torvalds
@ 2014-12-01 18:25                                               ` Kirill A. Shutemov
  2014-12-01 18:36                                                 ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Kirill A. Shutemov @ 2014-12-01 18:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Tejun Heo, Dave Jones, Andy Lutomirski, Don Zickus,
	Thomas Gleixner, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra

On Mon, Dec 01, 2014 at 09:53:16AM -0800, Linus Torvalds wrote:
> On Mon, Dec 1, 2014 at 9:10 AM, Will Deacon <will.deacon@arm.com> wrote:
> >
> > So I don't even have binfmt-misc compiled in. The two handlers I have are
> > BINFMT_ELF and BINFMT_SCRIPT, but they both check for headers that we won't
> > get back from oom_score afaict.
> 
> Hmm. So I can't even get that "oom_score" file to be executable in the
> first place, which should mean that execve() should terminate very
> quickly with an EACCES error.

No idea about oom_score, but the kernel happily accepts chmod on any file
under /proc/PID/net/. It caused issues before[1].

Why do we allow this?

I've asked before, but no answer so far.


[1] https://lkml.org/lkml/2014/8/2/103

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 18:25                                               ` Kirill A. Shutemov
@ 2014-12-01 18:36                                                 ` Linus Torvalds
  2014-12-04 10:51                                                   ` Will Deacon
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-12-01 18:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Will Deacon, Tejun Heo, Dave Jones, Andy Lutomirski, Don Zickus,
	Thomas Gleixner, Linux Kernel, the arch/x86 maintainers,
	Peter Zijlstra

On Mon, Dec 1, 2014 at 10:25 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> No idea about oom_score, but the kernel happily accepts chmod on any file
> under /proc/PID/net/.

/proc used to accept that fairly widely, but no, we tightened things
down, and core /proc files end up not accepting chmod. See
'proc_setattr()':

        if (attr->ia_valid & ATTR_MODE)
                return -EPERM;

although particular /proc files could choose to not use 'proc_setattr'
if they want to.

The '/proc/pid/net' subtree is obviously not doing that. No idea why,
and probably for no good reason.

                     Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01  1:02                             ` Dâniel Fraga
@ 2014-12-01 19:14                               ` Paul E. McKenney
  2014-12-01 20:28                                 ` Dâniel Fraga
  2014-12-02  8:40                                 ` Lai Jiangshan
  0 siblings, 2 replies; 486+ messages in thread
From: Paul E. McKenney @ 2014-12-01 19:14 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Sun, Nov 30, 2014 at 11:02:43PM -0200, Dâniel Fraga wrote:
> On Sun, 30 Nov 2014 16:21:19 -0800
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > Maybe you'll have to turn off RCU_CPU_STALL_VERBOSE first.
> > 
> > Although I think you should be able to just edit the .config file,
> > delete the line that says
> > 
> >     CONFIG_TREE_PREEMPT_RCU=y
> > 
> > and then just do a "make oldconfig", and then verify that
> > TREE_PREEMPT_RCU hasn't been re-enabled by some dependency. But it
> > shouldn't have, and that "make oldconfig" should get rid of anything
> > that depends on TREE_PREEMPT_RCU.
> 	
> 	Ok, I did exactly that, but CONFIG_TREE_PREEMPT_RCU is
> re-enabled. I talked with Pranith Kumar and he suggested I could just
> disable preemption (No Forced Preemption (Server)) and that's the only
> way to disable CONFIG_TREE_PREEMPT_RCU.

If it would help to have !CONFIG_TREE_PREEMPT_RCU with CONFIG_PREEMPT=y,
please let me know and I will create a patch that forces this.
(Not mainline material, but if it helps with debug...)

							Thanx, Paul

> 	Now I'll try to make the system freeze, then I'll send
> you the Call trace.
> 
> 	Thanks.
> 
> -- 
> Linux 3.17.0: Shuffling Zombie Juror
> http://www.youtube.com/DanielFragaBR
> http://exchangewar.info
> Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 19:14                               ` Paul E. McKenney
@ 2014-12-01 20:28                                 ` Dâniel Fraga
  2014-12-01 20:36                                   ` Linus Torvalds
  2014-12-01 23:08                                   ` Paul E. McKenney
  2014-12-02  8:40                                 ` Lai Jiangshan
  1 sibling, 2 replies; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-01 20:28 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Mon, 1 Dec 2014 11:14:31 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> If it would help to have !CONFIG_TREE_PREEMPT_RCU with CONFIG_PREEMPT=y,
> please let me know and I will create a patch that forces this.
> (Not mainline material, but if it helps with debug...)

	Hi Paul. Please, I'd like the patch, because without
preemption, I'm unable to trigger this bug.

	Thanks.

-- 
Linux 3.17.0: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 20:28                                 ` Dâniel Fraga
@ 2014-12-01 20:36                                   ` Linus Torvalds
  2014-12-01 23:08                                     ` Chris Mason
                                                       ` (2 more replies)
  2014-12-01 23:08                                   ` Paul E. McKenney
  1 sibling, 3 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-12-01 20:36 UTC (permalink / raw)
  To: Dâniel Fraga, Dave Jones, Sasha Levin
  Cc: Paul E. McKenney, Linux Kernel Mailing List

On Mon, Dec 1, 2014 at 12:28 PM, Dâniel Fraga <fragabr@gmail.com> wrote:
>
>         Hi Paul. Please, I'd like the patch, because without
> preemption, I'm unable to trigger this bug.

Ok, that's already interesting information. And yes, it would probably
be interesting to see if CONFIG_PREEMPT=y but !CONFIG_TREE_PREEMPT_RCU
then solves it too, to narrow it down to one but not the other..

DaveJ - what about your situation? The standard Fedora kernels use
CONFIG_PREEMPT_VOLUNTARY, do you have CONFIG_PREEMPT and
CONFIG_TREE_PREEMPT_RCU enabled? I think you and Sasha both saw some
RCU oddities too, no?

                    Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 20:28                                 ` Dâniel Fraga
  2014-12-01 20:36                                   ` Linus Torvalds
@ 2014-12-01 23:08                                   ` Paul E. McKenney
  2014-12-02 16:43                                     ` Dâniel Fraga
  1 sibling, 1 reply; 486+ messages in thread
From: Paul E. McKenney @ 2014-12-01 23:08 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Mon, Dec 01, 2014 at 06:28:31PM -0200, Dâniel Fraga wrote:
> On Mon, 1 Dec 2014 11:14:31 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > If it would help to have !CONFIG_TREE_PREEMPT_RCU with CONFIG_PREEMPT=y,
> > please let me know and I will create a patch that forces this.
> > (Not mainline material, but if it helps with debug...)
> 
> 	Hi Paul. Please, I'd like the patch, because without
> preemption, I'm unable to trigger this bug.

Well, this turned out to be way simpler than I expected.  Passes
light rcutorture testing.  Sometimes you get lucky...

							Thanx, Paul


diff --git a/init/Kconfig b/init/Kconfig
index 903505e66d1d..2cf71fcd514f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -469,7 +469,7 @@ choice
 
 config TREE_RCU
 	bool "Tree-based hierarchical RCU"
-	depends on !PREEMPT && SMP
+	depends on SMP
 	select IRQ_WORK
 	help
 	  This option selects the RCU implementation that is


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 20:36                                   ` Linus Torvalds
@ 2014-12-01 23:08                                     ` Chris Mason
  2014-12-01 23:25                                       ` Linus Torvalds
                                                         ` (2 more replies)
  2014-12-02 19:31                                     ` Dave Jones
  2014-12-02 20:30                                     ` Dave Jones
  2 siblings, 3 replies; 486+ messages in thread
From: Chris Mason @ 2014-12-01 23:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dâniel Fraga, Dave Jones, Sasha Levin, Paul E. McKenney,
	Linux Kernel Mailing List

I'm not sure if this is related, but running trinity here, I noticed it
was stuck at 100% system time on every CPU.  perf report tells me we are
spending all of our time in spin_lock under the sync system call.

I think it's coming from contention in the bdi_queue_work() call from
inside sync_inodes_sb, which is spin_lock_bh(). 

I wonder if we're just spinning so hard on this one bh lock that we're
starving the watchdog?

Dave, do you have spinlock debugging on?  

-chris

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 23:08                                     ` Chris Mason
@ 2014-12-01 23:25                                       ` Linus Torvalds
  2014-12-01 23:44                                         ` Chris Mason
  2014-12-02 14:13                                       ` Mike Galbraith
  2014-12-02 19:32                                       ` Dave Jones
  2 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-12-01 23:25 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Dâniel Fraga, Dave Jones,
	Sasha Levin, Paul E. McKenney, Linux Kernel Mailing List

On Mon, Dec 1, 2014 at 3:08 PM, Chris Mason <clm@fb.com> wrote:
> I'm not sure if this is related, but running trinity here, I noticed it
> was stuck at 100% system time on every CPU.  perf report tells me we are
> spending all of our time in spin_lock under the sync system call.
>
> I think it's coming from contention in the bdi_queue_work() call from
> inside sync_inodes_sb, which is spin_lock_bh().

Please do a perf run with -g to get the call chain to make sure..

> I wonder if we're just spinning so hard on this one bh lock that we're
> starving the watchdog?

If it was that simple, we should see it in the actual soft-lockup stack trace.

That said, looking at the bdi_queue_work() function, I don't think you
should see any real contention there, although:

 - spin-lock debugging can make any bad situation about 10x worse by
making the spinlocks just that much more horrible from a performance
standpoint

 - the whole "complete(work->done)" thing seems to be pointlessly done
inside the spinlock, and that just seems horrible. Do you have a ton
of BDI's that might fail that BDI_registered thing?

 - even the "mod_delayed_work()" is dubious wrt the wb_lock. From what
I can tell, the spinlock is supposed to just protect the list.

So I think that bdi_queue_work() quite possibly is horribly broken
crap and *if* it really is contention on wb_lock, we could rewrite it
to not be so bad locking-wise.
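
Something like the following, maybe -- a sketch of the idea only, not a
tested patch -- with wb_lock covering just the list manipulation:

	static void bdi_queue_work(struct backing_dev_info *bdi,
				   struct wb_writeback_work *work)
	{
		bool registered;

		/* hold wb_lock only long enough to link the work */
		spin_lock_bh(&bdi->wb_lock);
		registered = test_bit(BDI_registered, &bdi->state);
		if (registered)
			list_add_tail(&work->list, &bdi->work_list);
		spin_unlock_bh(&bdi->wb_lock);

		if (!registered) {
			/* bdi shutting down: complete outside the lock */
			if (work->done)
				complete(work->done);
			return;
		}

		mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
	}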

That said, contention that happens with spinlock debugging enabled
really tends to fall under the heading of "that's your own fault".

                     Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 23:25                                       ` Linus Torvalds
@ 2014-12-01 23:44                                         ` Chris Mason
  2014-12-02  0:39                                           ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Chris Mason @ 2014-12-01 23:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linus Torvalds, Dâniel Fraga, Dave Jones, Sasha Levin,
	Paul E. McKenney, Linux Kernel Mailing List



On Mon, Dec 1, 2014 at 6:25 PM, Linus Torvalds 
<torvalds@linux-foundation.org> wrote:
> On Mon, Dec 1, 2014 at 3:08 PM, Chris Mason <clm@fb.com> wrote:
>>  I'm not sure if this is related, but running trinity here, I 
>> noticed it
>>  was stuck at 100% system time on every CPU.  perf report tells me 
>> we are
>>  spending all of our time in spin_lock under the sync system call.
>> 
>>  I think it's coming from contention in the bdi_queue_work() call 
>> from
>>  inside sync_inodes_sb, which is spin_lock_bh().
> 
> Please do a perf run with -g to get the call chain to make sure..

The call chain goes something like this:

               --- _raw_spin_lock
                   |
                   |--99.72%-- sync_inodes_sb
                   |          sync_inodes_one_sb
                   |          iterate_supers
                   |          sys_sync
                   |          |
                   |          |--79.66%-- system_call_fastpath
                   |          |          syscall
                   |          |
                   |           --20.34%-- ia32_sysret
                   |                     __do_syscall
                    --0.28%-- [...]

(the 64bit call variation is similar)  Adding -v doesn't really help, 
because it isn't giving me the address inside sync_inodes_sb()

I first read this and guessed it must be leaving out the call to 
bdi_queue_work, hoping the spin_lock_bh and lock debugging were teaming 
up to stall the box.

But looking harder it's probably inside wait_sb_inodes:

        spin_lock(&inode_sb_list_lock);

Which is a little harder to blame.  Maaaaaybe with lock debugging, but 
it's enough of a stretch that I wouldn't have emailed at all if I hadn't 
fixated on the bdi code.
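
If I'm reading it right, the rough shape of wait_sb_inodes() is a walk 
of the whole per-sb inode list under that one global lock, dropping and 
retaking it around the actual waiting (sketch, details trimmed):

	spin_lock(&inode_sb_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		...
		__iget(inode);
		spin_unlock(&inode_sb_list_lock);

		filemap_fdatawait(inode->i_mapping);

		spin_lock(&inode_sb_list_lock);
	}
	spin_unlock(&inode_sb_list_lock);

so N parallel sync() callers all end up hammering the same global lock.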

-chris




^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 23:44                                         ` Chris Mason
@ 2014-12-02  0:39                                           ` Linus Torvalds
  0 siblings, 0 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-12-02  0:39 UTC (permalink / raw)
  To: Chris Mason
  Cc: Dâniel Fraga, Dave Jones, Sasha Levin, Paul E. McKenney,
	Linux Kernel Mailing List

On Mon, Dec 1, 2014 at 3:44 PM, Chris Mason <clm@fb.com> wrote:
>
> But looking harder it's probably inside wait_sb_inodes:
>
>        spin_lock(&inode_sb_list_lock);

Yeah, that's a known pain-point for sync(), although nobody has really
cared enough, since performance of parallel sync() calls is usually
not very high on anybody's list of things to care about except when it
occasionally shows up on some old Unix benchmark (maybe AIM, I
forget).

Anyway, lock debugging will make what is usually not noticeable into a
"whee, that's horrible", because the lock debugging overhead is often
many *many* times higher than the cost of the code inside the lock..

                  Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 19:14                               ` Paul E. McKenney
  2014-12-01 20:28                                 ` Dâniel Fraga
@ 2014-12-02  8:40                                 ` Lai Jiangshan
  2014-12-02 16:58                                   ` Paul E. McKenney
  2014-12-02 16:58                                   ` Dâniel Fraga
  1 sibling, 2 replies; 486+ messages in thread
From: Lai Jiangshan @ 2014-12-02  8:40 UTC (permalink / raw)
  To: paulmck; +Cc: Dâniel Fraga, Linus Torvalds, Linux Kernel Mailing List

On 12/02/2014 03:14 AM, Paul E. McKenney wrote:
> On Sun, Nov 30, 2014 at 11:02:43PM -0200, Dâniel Fraga wrote:
>> On Sun, 30 Nov 2014 16:21:19 -0800
>> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>>> Maybe you'll have to turn off RCU_CPU_STALL_VERBOSE first.
>>>
>>> Although I think you should be able to just edit the .config file,
>>> delete the line that says
>>>
>>>     CONFIG_TREE_PREEMPT_RCU=y
>>>
>>> and then just do a "make oldconfig", and then verify that
>>> TREE_PREEMPT_RCU hasn't been re-enabled by some dependency. But it
>>> shouldn't have, and that "make oldconfig" should get rid of anything
>>> that depends on TREE_PREEMPT_RCU.
>> 	
>> 	Ok, I did exactly that, but CONFIG_TREE_PREEMPT_RCU is
>> re-enabled. I talked with Pranith Kumar and he suggested I could just
>> disable preemption (No Forced Preemption (Server)) and that's the only
>> way to disable CONFIG_TREE_PREEMPT_RCU.
> 
> If it would help to have !CONFIG_TREE_PREEMPT_RCU with CONFIG_PREEMPT=y,

It is needed at least for testing.

CONFIG_TREE_PREEMPT_RCU=y with CONFIG_PREEMPT=n is needed for testing too.

Please enable them (or enable them under CONFIG_RCU_TRACE=y).

> please let me know and I will create a patch that forces this.
> (Not mainline material, but if it helps with debug...)
> 
> 							Thanx, Paul
> 
>> 	Now I'll try to make the system freeze, then I'll send
>> you the Call trace.
>>
>> 	Thanks.
>>
>> -- 
>> Linux 3.17.0: Shuffling Zombie Juror
>> http://www.youtube.com/DanielFragaBR
>> http://exchangewar.info
>> Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 23:08                                     ` Chris Mason
  2014-12-01 23:25                                       ` Linus Torvalds
@ 2014-12-02 14:13                                       ` Mike Galbraith
  2014-12-02 16:33                                         ` Linus Torvalds
  2014-12-02 19:32                                       ` Dave Jones
  2 siblings, 1 reply; 486+ messages in thread
From: Mike Galbraith @ 2014-12-02 14:13 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Dâniel Fraga, Dave Jones, Sasha Levin,
	Paul E. McKenney, Linux Kernel Mailing List

On Mon, 2014-12-01 at 18:08 -0500, Chris Mason wrote:
> I'm not sure if this is related, but running trinity here, I noticed it
> was stuck at 100% system time on every CPU.  perf report tells me we are
> spending all of our time in spin_lock under the sync system call.
> 
> I think it's coming from contention in the bdi_queue_work() call from
> inside sync_inodes_sb, which is spin_lock_bh(). 
> 
> I wonder if we're just spinning so hard on this one bh lock that we're
> starving the watchdog?

The bean counting problem below can contribute.

https://lkml.org/lkml/2014/3/30/7

	-Mike


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 14:13                                       ` Mike Galbraith
@ 2014-12-02 16:33                                         ` Linus Torvalds
  2014-12-02 17:14                                           ` Chris Mason
                                                             ` (2 more replies)
  0 siblings, 3 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-12-02 16:33 UTC (permalink / raw)
  To: Mike Galbraith, Ingo Molnar, Peter Zijlstra
  Cc: Chris Mason, Dâniel Fraga, Dave Jones, Sasha Levin,
	Paul E. McKenney, Linux Kernel Mailing List

On Tue, Dec 2, 2014 at 6:13 AM, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
>
> The bean counting problem below can contribute.
>
> https://lkml.org/lkml/2014/3/30/7

Hmm. That never got applied. I didn't apply it originally because of
timing and wanting clarifications, but apparently it never made it
into the -tip tree either.

Ingo, PeterZ - comments?

Looking again at that patch (the commit message still doesn't strike
me as wonderfully explanatory :^) makes me worry, though.

Is that

        if (rq->skip_clock_update-- > 0)
                return;

really right? If skip_clock_update was zero (normal), it now gets set
to -1, which has its own specific meaning (see "force clock update"
comment in kernel/sched/rt.c). Is that intentional? That seems insane.

Or should it be

        if (rq->skip_clock_update > 0) {
                rq->skip_clock_update = 0;
                return;
        }

or what? Maybe there was a reason the patch never got applied even to -tip.

At the same time, the whole "incapacitated by the rt throttle long
enough for the hard lockup detector to trigger" commentary about that
skip_clock_update issue does make me go "Hmmm..". It would certainly
explain Dave's incomprehensible watchdog messages..

               Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 23:08                                   ` Paul E. McKenney
@ 2014-12-02 16:43                                     ` Dâniel Fraga
  2014-12-02 17:04                                       ` Paul E. McKenney
  2014-12-02 17:08                                       ` Linus Torvalds
  0 siblings, 2 replies; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-02 16:43 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Mon, 1 Dec 2014 15:08:13 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> Well, this turned out to be way simpler than I expected.  Passes
> light rcutorture testing.  Sometimes you get lucky...

	Linus, Paul and others, I finally got a call trace with
only CONFIG_TREE_PREEMPT_RCU *disabled* using Paul's patch (to trigger 
it I compiled PHP with make -j8).

Dec  2 14:24:39 tux kernel: [ 8475.941616] conftest[9730]: segfault at 0 ip 0000000000400640 sp 00007fffa67ab300 error 4 in conftest[400000+1000]
Dec  2 14:24:40 tux kernel: [ 8476.104725] conftest[9753]: segfault at 0 ip 00007f6863024906 sp 00007fff0e31cc48 error 4 in libc-2.19.so[7f6862efe000+1a1000]
Dec  2 14:25:54 tux kernel: [ 8550.791697] INFO: rcu_sched detected stalls on CPUs/tasks: { 4} (detected by 0, t=60002 jiffies, g=112854, c=112853, q=0)
Dec  2 14:25:54 tux kernel: [ 8550.791702] Task dump for CPU 4:
Dec  2 14:25:54 tux kernel: [ 8550.791703] cc1             R  running task        0 14344  14340 0x00080008
Dec  2 14:25:54 tux kernel: [ 8550.791706]  000000001bcebcd8 ffff880100000003 ffffffff810cb7f1 ffff88021f5f5c00
Dec  2 14:25:54 tux kernel: [ 8550.791708]  ffff88011bcebfd8 ffff88011bcebce8 ffffffff811fb970 ffff8802149a2a00
Dec  2 14:25:54 tux kernel: [ 8550.791710]  ffff8802149a2cc8 ffff88011bcebd28 ffffffff8103e979 ffff88020ed01398
Dec  2 14:25:54 tux kernel: [ 8550.791712] Call Trace:
Dec  2 14:25:54 tux kernel: [ 8550.791718]  [<ffffffff810cb7f1>] ? release_pages+0xa1/0x1e0
Dec  2 14:25:54 tux kernel: [ 8550.791722]  [<ffffffff811fb970>] ? cpumask_any_but+0x30/0x40
Dec  2 14:25:54 tux kernel: [ 8550.791725]  [<ffffffff8103e979>] ? flush_tlb_page+0x49/0xf0
Dec  2 14:25:54 tux kernel: [ 8550.791727]  [<ffffffff810cbe72>] ? lru_cache_add_active_or_unevictable+0x22/0x90
Dec  2 14:25:54 tux kernel: [ 8550.791731]  [<ffffffff810fc4c2>] ? alloc_pages_vma+0x72/0x130
Dec  2 14:25:54 tux kernel: [ 8550.791733]  [<ffffffff810cbe72>] ? lru_cache_add_active_or_unevictable+0x22/0x90
Dec  2 14:25:54 tux kernel: [ 8550.791735]  [<ffffffff810e5220>] ? handle_mm_fault+0x3a0/0xaf0
Dec  2 14:25:54 tux kernel: [ 8550.791737]  [<ffffffff81039074>] ? __do_page_fault+0x224/0x4c0
Dec  2 14:25:54 tux kernel: [ 8550.791740]  [<ffffffff8110d54c>] ? new_sync_write+0x7c/0xb0
Dec  2 14:25:55 tux kernel: [ 8550.791743]  [<ffffffff8114765c>] ? fsnotify+0x27c/0x350
Dec  2 14:25:55 tux kernel: [ 8550.791746]  [<ffffffff81087233>] ? rcu_eqs_enter+0x93/0xa0
Dec  2 14:25:55 tux kernel: [ 8550.791748]  [<ffffffff81087a5e>] ? rcu_user_enter+0xe/0x10
Dec  2 14:25:55 tux kernel: [ 8550.791749]  [<ffffffff8103938a>] ? do_page_fault+0x5a/0x70
Dec  2 14:25:55 tux kernel: [ 8550.791752]  [<ffffffff8139d9d2>] ? page_fault+0x22/0x30

	If you need more info/testing, just ask.

-- 
Linux 3.17.0-dirty: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02  8:40                                 ` Lai Jiangshan
@ 2014-12-02 16:58                                   ` Paul E. McKenney
  2014-12-02 16:58                                   ` Dâniel Fraga
  1 sibling, 0 replies; 486+ messages in thread
From: Paul E. McKenney @ 2014-12-02 16:58 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Dâniel Fraga, Linus Torvalds, Linux Kernel Mailing List

On Tue, Dec 02, 2014 at 04:40:37PM +0800, Lai Jiangshan wrote:
> On 12/02/2014 03:14 AM, Paul E. McKenney wrote:
> > On Sun, Nov 30, 2014 at 11:02:43PM -0200, Dâniel Fraga wrote:
> >> On Sun, 30 Nov 2014 16:21:19 -0800
> >> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >>
> >>> Maybe you'll have to turn off RCU_CPU_STALL_VERBOSE first.
> >>>
> >>> Although I think you should be able to just edit the .config file,
> >>> delete the line that says
> >>>
> >>>     CONFIG_TREE_PREEMPT_RCU=y
> >>>
> >>> and then just do a "make oldconfig", and then verify that
> >>> TREE_PREEMPT_RCU hasn't been re-enabled by some dependency. But it
> >>> shouldn't have, and that "make oldconfig" should get rid of anything
> >>> that depends on TREE_PREEMPT_RCU.
> >> 	
> >> 	Ok, I did exactly that, but CONFIG_TREE_PREEMPT_RCU is
> >> re-enabled. I talked with Pranith Kumar and he suggested I could just
> >> disable preemption (No Forced Preemption (Server)) and that's the only
> >> way to disable CONFIG_TREE_PREEMPT_RCU.
> > 
> > If it would help to have !CONFIG_TREE_PREEMPT_RCU with CONFIG_PREEMPT=y,
> 
> It is needed at lest for testing.
> 
> CONFIG_TREE_PREEMPT_RCU=y with CONFIG_PREEMPT=n is needed for testing too.
> 
> Please enable them (or enable them under CONFIG_RCU_TRACE=y)

It is a really easy edit to Kconfig, but I don't want people using it
in production because I really don't need the extra test scenarios.
So I am happy to provide the patch below as needed, but not willing to
submit it to mainline without a lot more justification.  Because if it
appears in mainline, people will start using it in production, whether
I am doing proper testing of it or not.  :-/

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/init/Kconfig b/init/Kconfig
index 903505e66d1d..2cf71fcd514f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -469,7 +469,7 @@ choice
 
 config TREE_RCU
 	bool "Tree-based hierarchical RCU"
-	depends on !PREEMPT && SMP
+	depends on SMP
 	select IRQ_WORK
 	help
 	  This option selects the RCU implementation that is


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02  8:40                                 ` Lai Jiangshan
  2014-12-02 16:58                                   ` Paul E. McKenney
@ 2014-12-02 16:58                                   ` Dâniel Fraga
  2014-12-02 17:17                                     ` Paul E. McKenney
  2014-12-03  2:03                                     ` Lai Jiangshan
  1 sibling, 2 replies; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-02 16:58 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: paulmck, Linus Torvalds, Linux Kernel Mailing List

On Tue, 2 Dec 2014 16:40:37 +0800
Lai Jiangshan <laijs@cn.fujitsu.com> wrote:

> It is needed at lest for testing.
> 
> CONFIG_TREE_PREEMPT_RCU=y with CONFIG_PREEMPT=n is needed for testing too.
> 
> Please enable them (or enable them under CONFIG_RCU_TRACE=y)

	Lai, sorry but I didn't understand. Do you mean both of them
enabled? Because how can CONFIG_TREE_PREEMPT_RCU be enabled without
CONFIG_PREEMPT?

	If you mean both enabled, I already reported a call trace with
both enabled:

https://bugzilla.kernel.org/show_bug.cgi?id=85941

	Please see my previous answer to Linus and Paul too.

	Regarding CONFIG_RCU_TRACE, do you mean
"CONFIG_TREE_RCU_TRACE"? I couldn't find CONFIG_RCU_TRACE.

	Thanks.

-- 
Linux 3.17.0-dirty: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 16:43                                     ` Dâniel Fraga
@ 2014-12-02 17:04                                       ` Paul E. McKenney
  2014-12-02 17:14                                         ` Dâniel Fraga
  2014-12-02 18:09                                         ` Paul E. McKenney
  2014-12-02 17:08                                       ` Linus Torvalds
  1 sibling, 2 replies; 486+ messages in thread
From: Paul E. McKenney @ 2014-12-02 17:04 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, Dec 02, 2014 at 02:43:17PM -0200, Dâniel Fraga wrote:
> On Mon, 1 Dec 2014 15:08:13 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > Well, this turned out to be way simpler than I expected.  Passes
> > light rcutorture testing.  Sometimes you get lucky...
> 
> 	Linus, Paul and others, I finally got a call trace with
> only CONFIG_TREE_PREEMPT_RCU *disabled* using Paul's patch (to trigger 
> it I compiled PHP with make -j8).

Is it harder to reproduce with CONFIG_PREEMPT=y and CONFIG_TREE_PREEMPT_RCU=n?

If it is a -lot- harder to reproduce, it might be worth bisecting among
the RCU read-side critical sections.  If making a few of them be
non-preemptible greatly reduces the probability of the bug occurring,
that might provide a clue about root cause.

On the other hand, if it is just a little harder to reproduce, this
RCU read-side bisection would likely be an exercise in futility.
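
(By bisecting among them I mean a debug-only hack along these lines at
each candidate site, just to see whether the stall probability drops:

	/* debug only: keep this reader non-preemptible, so that
	 * TREE_PREEMPT_RCU behaves much like TREE_RCU here */
	preempt_disable();
	rcu_read_lock();

	/* ... existing read-side critical section ... */

	rcu_read_unlock();
	preempt_enable();

Not mainline material either, of course.)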

							Thanx, Paul

> Dec  2 14:24:39 tux kernel: [ 8475.941616] conftest[9730]: segfault at 0 ip 0000000000400640 sp 00007fffa67ab300 error 4 in conftest[400000+1000]
> Dec  2 14:24:40 tux kernel: [ 8476.104725] conftest[9753]: segfault at 0 ip 00007f6863024906 sp 00007fff0e31cc48 error 4 in libc-2.19.so[7f6862efe000+1a1000]
> Dec  2 14:25:54 tux kernel: [ 8550.791697] INFO: rcu_sched detected stalls on CPUs/tasks: { 4} (detected by 0, t=60002 jiffies, g=112854, c=112853, q=0)
> Dec  2 14:25:54 tux kernel: [ 8550.791702] Task dump for CPU 4:
> Dec  2 14:25:54 tux kernel: [ 8550.791703] cc1             R  running task        0 14344  14340 0x00080008
> Dec  2 14:25:54 tux kernel: [ 8550.791706]  000000001bcebcd8 ffff880100000003 ffffffff810cb7f1 ffff88021f5f5c00
> Dec  2 14:25:54 tux kernel: [ 8550.791708]  ffff88011bcebfd8 ffff88011bcebce8 ffffffff811fb970 ffff8802149a2a00
> Dec  2 14:25:54 tux kernel: [ 8550.791710]  ffff8802149a2cc8 ffff88011bcebd28 ffffffff8103e979 ffff88020ed01398
> Dec  2 14:25:54 tux kernel: [ 8550.791712] Call Trace:
> Dec  2 14:25:54 tux kernel: [ 8550.791718]  [<ffffffff810cb7f1>] ? release_pages+0xa1/0x1e0
> Dec  2 14:25:54 tux kernel: [ 8550.791722]  [<ffffffff811fb970>] ? cpumask_any_but+0x30/0x40
> Dec  2 14:25:54 tux kernel: [ 8550.791725]  [<ffffffff8103e979>] ? flush_tlb_page+0x49/0xf0
> Dec  2 14:25:54 tux kernel: [ 8550.791727]  [<ffffffff810cbe72>] ? lru_cache_add_active_or_unevictable+0x22/0x90
> Dec  2 14:25:54 tux kernel: [ 8550.791731]  [<ffffffff810fc4c2>] ? alloc_pages_vma+0x72/0x130
> Dec  2 14:25:54 tux kernel: [ 8550.791733]  [<ffffffff810cbe72>] ? lru_cache_add_active_or_unevictable+0x22/0x90
> Dec  2 14:25:54 tux kernel: [ 8550.791735]  [<ffffffff810e5220>] ? handle_mm_fault+0x3a0/0xaf0
> Dec  2 14:25:54 tux kernel: [ 8550.791737]  [<ffffffff81039074>] ? __do_page_fault+0x224/0x4c0
> Dec  2 14:25:54 tux kernel: [ 8550.791740]  [<ffffffff8110d54c>] ? new_sync_write+0x7c/0xb0
> Dec  2 14:25:55 tux kernel: [ 8550.791743]  [<ffffffff8114765c>] ? fsnotify+0x27c/0x350
> Dec  2 14:25:55 tux kernel: [ 8550.791746]  [<ffffffff81087233>] ? rcu_eqs_enter+0x93/0xa0
> Dec  2 14:25:55 tux kernel: [ 8550.791748]  [<ffffffff81087a5e>] ? rcu_user_enter+0xe/0x10
> Dec  2 14:25:55 tux kernel: [ 8550.791749]  [<ffffffff8103938a>] ? do_page_fault+0x5a/0x70
> Dec  2 14:25:55 tux kernel: [ 8550.791752]  [<ffffffff8139d9d2>] ? page_fault+0x22/0x30
> 
> 	If you need more info/testing, just ask.
> 
> -- 
> Linux 3.17.0-dirty: Shuffling Zombie Juror
> http://www.youtube.com/DanielFragaBR
> http://exchangewar.info
> Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL
> 


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 16:43                                     ` Dâniel Fraga
  2014-12-02 17:04                                       ` Paul E. McKenney
@ 2014-12-02 17:08                                       ` Linus Torvalds
  2014-12-02 17:16                                         ` Dâniel Fraga
  1 sibling, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-12-02 17:08 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Paul E. McKenney, Linux Kernel Mailing List

On Tue, Dec 2, 2014 at 8:43 AM, Dâniel Fraga <fragabr@gmail.com> wrote:
>
>         Linus, Paul and others, I finally got a call trace with
> only CONFIG_TREE_PREEMPT_RCU *disabled* using Paul's patch (to trigger
> it I compiled PHP with make -j8).

So just to verify:

Without CONFIG_PREEMPT, things work well for you?

But with CONFIG_PREEMPT, you are able to create the rcu_sched stalls
both with and without CONFIG_TREE_PREEMPT_RCU?

Correct?

> Dec  2 14:25:54 tux kernel: [ 8550.791697] INFO: rcu_sched detected stalls on CPUs/tasks: { 4} (detected by 0, t=60002 jiffies, g=112854, c=112853, q=0)

Paul?

                    Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 17:04                                       ` Paul E. McKenney
@ 2014-12-02 17:14                                         ` Dâniel Fraga
  2014-12-02 18:42                                           ` Paul E. McKenney
  2014-12-02 18:09                                         ` Paul E. McKenney
  1 sibling, 1 reply; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-02 17:14 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, 2 Dec 2014 09:04:07 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> Is it harder to reproduce with CONFIG_PREEMPT=y and CONFIG_TREE_PREEMPT_RCU=n?

	Yes, it's much harder! :)

> If it is a -lot- harder to reproduce, it might be worth bisecting among
> the RCU read-side critical sections.  If making a few of them be
> non-preemptible greatly reduces the probability of the bug occurring,
> that might provide a clue about root cause.
> 
> On the other hand, if it is just a little harder to reproduce, this
> RCU read-side bisection would likely be an exercise in futility.

	Ok, I want to bisect it. Since it could be painful to bisect,
could you suggest 2 commits between 3.16.0 and 3.17.0 so we can narrow
the bisect? I could just bisect between 3.16.0 and 3.17.0 but it would
take many days :).

	Ps: if you prefer I bisect between 3.16.0 and 3.17.0, no
problem, but you'll have to be patient ;).

-- 
Linux 3.17.0-dirty: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 16:33                                         ` Linus Torvalds
@ 2014-12-02 17:14                                           ` Chris Mason
  2014-12-03 18:41                                             ` Dave Jones
  2014-12-02 17:47                                           ` Mike Galbraith
  2014-12-17 11:13                                           ` Peter Zijlstra
  2 siblings, 1 reply; 486+ messages in thread
From: Chris Mason @ 2014-12-02 17:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Galbraith, Ingo Molnar, Peter Zijlstra, Dâniel Fraga,
	Dave Jones, Sasha Levin, Paul E. McKenney,
	Linux Kernel Mailing List

On Tue, Dec 2, 2014 at 11:33 AM, Linus Torvalds 
<torvalds@linux-foundation.org> wrote:
> On Tue, Dec 2, 2014 at 6:13 AM, Mike Galbraith 
> <umgwanakikbuti@gmail.com> wrote:
> 
> At the same time, the whole "incapacitated by the rt throttle long
> enough for the hard lockup detector to trigger" commentary about that
> skip_clock_update issue does make me go "Hmmm..". It would certainly
> explain Dave's incomprehensible watchdog messages..

Dave's first email mentioned that he had panic on softlockup enabled, 
but even with that off the box wasn't recovering.

In my trinity runs here, I've gotten softlockup warnings where the box 
eventually recovered.  I'm wondering if some of the "bad" commits in 
the bisection are really false positives where the box would have been 
able to recover if we'd killed off all the trinity procs and given it 
time to breathe.

-chris




^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 17:08                                       ` Linus Torvalds
@ 2014-12-02 17:16                                         ` Dâniel Fraga
  0 siblings, 0 replies; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-02 17:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Paul E. McKenney, Linux Kernel Mailing List

On Tue, 2 Dec 2014 09:08:53 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So just to verify:
> 
> Without CONFIG_PREEMPT, things work well for you?

	Yes.

> But with CONFIG_PREEMPT, you are able to create the rcu_sched stalls
> both with and without CONFIG_TREE_PREEMPT_RCU?
> 
> Correct?

	Yes, correct. And without CONFIG_TREE_PREEMPT_RCU it's
much harder to trigger the bug.

-- 
Linux 3.17.0-dirty: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 16:58                                   ` Dâniel Fraga
@ 2014-12-02 17:17                                     ` Paul E. McKenney
  2014-12-03  2:03                                     ` Lai Jiangshan
  1 sibling, 0 replies; 486+ messages in thread
From: Paul E. McKenney @ 2014-12-02 17:17 UTC (permalink / raw)
  To: Dâniel Fraga
  Cc: Lai Jiangshan, Linus Torvalds, Linux Kernel Mailing List

On Tue, Dec 02, 2014 at 02:58:38PM -0200, Dâniel Fraga wrote:
> On Tue, 2 Dec 2014 16:40:37 +0800
> Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> 
> > It is needed at least for testing.
> > 
> > CONFIG_TREE_PREEMPT_RCU=y with CONFIG_PREEMPT=n is needed for testing too.
> > 
> > Please enable them (or enable them under CONFIG_RCU_TRACE=y)
> 
> 	Lai, sorry, but I didn't understand. Do you mean both of them
> enabled? Because how can CONFIG_TREE_PREEMPT_RCU be enabled without
> CONFIG_PREEMPT?

Hmmm...  I did misread that in my reply.  A similar Kconfig edit
(relaxing TREE_PREEMPT_RCU's dependency on PREEMPT) will enable that,
but I am even less happy about the thought of pushing that to
mainline!  ;-)

							Thanx, Paul

> 	If you mean both enabled, I already reported a call trace with
> both enabled:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=85941
> 
> 	Please see my previous answer to Linus and Paul too.
> 
> 	Regarding CONFIG_RCU_TRACE, do you mean
> "CONFIG_TREE_RCU_TRACE"? I couldn't find CONFIG_RCU_TRACE.
> 
> 	Thanks.
> 
> -- 
> Linux 3.17.0-dirty: Shuffling Zombie Juror
> http://www.youtube.com/DanielFragaBR
> http://exchangewar.info
> Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL
> 


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 16:33                                         ` Linus Torvalds
  2014-12-02 17:14                                           ` Chris Mason
@ 2014-12-02 17:47                                           ` Mike Galbraith
  2014-12-13  8:11                                             ` Ingo Molnar
  2014-12-17 11:13                                           ` Peter Zijlstra
  2 siblings, 1 reply; 486+ messages in thread
From: Mike Galbraith @ 2014-12-02 17:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Peter Zijlstra, Chris Mason, Dâniel Fraga,
	Dave Jones, Sasha Levin, Paul E. McKenney,
	Linux Kernel Mailing List

On Tue, 2014-12-02 at 08:33 -0800, Linus Torvalds wrote:

> Looking again at that patch (the commit message still doesn't strike
> me as wonderfully explanatory :^) makes me worry, though.
> 
> Is that
> 
>         if (rq->skip_clock_update-- > 0)
>                 return;
> 
> really right? If skip_clock_update was zero (normal), it now gets set
> to -1, which has its own specific meaning (see "force clock update"
> comment in kernel/sched/rt.c). Is that intentional? That seems insane.

Yeah, it was intentional.  Least lines.

> Or should it be
> 
>         if (rq->skip_clock_update > 0) {
>                 rq->skip_clock_update = 0;
>                 return;
>         }
> 
> or what? Maybe there was a reason the patch never got applied even to -tip.

Peterz was looking at corner-case proofing the thing.  Saving those
cycles has been entirely too annoying.

https://lkml.org/lkml/2014/4/8/295
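
For reference, a self-contained toy of the two behaviors being
compared (illustration only, plain user-space C, not the scheduler
code):

#include <stdio.h>

/*
 * skip == 0 means "update the clock normally"; in the scheduler a
 * negative value means "force clock update" (kernel/sched/rt.c).
 */
static int skip_decrement(int *skip)
{
	return (*skip)-- > 0;	/* a skip of 0 is left at -1 */
}

static int skip_reset(int *skip)
{
	if (*skip > 0) {
		*skip = 0;	/* never drops below zero */
		return 1;
	}
	return 0;
}

int main(void)
{
	int a = 0, b = 0;

	skip_decrement(&a);
	skip_reset(&b);
	printf("after decrement: %d, after reset: %d\n", a, b);
	return 0;
}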

	-Mike


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 17:04                                       ` Paul E. McKenney
  2014-12-02 17:14                                         ` Dâniel Fraga
@ 2014-12-02 18:09                                         ` Paul E. McKenney
  2014-12-02 18:41                                           ` Dâniel Fraga
  1 sibling, 1 reply; 486+ messages in thread
From: Paul E. McKenney @ 2014-12-02 18:09 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, Dec 02, 2014 at 09:04:07AM -0800, Paul E. McKenney wrote:
> On Tue, Dec 02, 2014 at 02:43:17PM -0200, Dâniel Fraga wrote:
> > On Mon, 1 Dec 2014 15:08:13 -0800
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > > Well, this turned out to be way simpler than I expected.  Passes
> > > light rcutorture testing.  Sometimes you get lucky...
> > 
> > 	Linus, Paul and others, I finally got a call trace with
> > only CONFIG_TREE_PREEMPT_RCU *disabled* using Paul's patch (to trigger 
> > it I compiled PHP with make -j8).
> 
> Is it harder to reproduce with CONFIG_PREEMPT=y and CONFIG_TREE_PREEMPT_RCU=n?
> 
> If it is a -lot- harder to reproduce, it might be worth bisecting among
> the RCU read-side critical sections.  If making a few of them be
> non-preemptible greatly reduces the probability of the bug occuring,
> that might provide a clue about root cause.
> 
> On the other hand, if it is just a little harder to reproduce, this
> RCU read-side bisection would likely be an exercise in futility.

To Linus's point, I guess I could look at the RCU CPU stall warning.  ;-)

Summary:  Not seeing anything that would loop for 21 seconds.
Dâniel, if you let this run, does it hit a second RCU CPU stall
warning, or does it just lock up?

Details:

First, what would we be looking for?  We know that with CONFIG_PREEMPT=n,
things work, or at least that the failure rate is quite low.
With CONFIG_PREEMPT=y, with or without CONFIG_TREE_PREEMPT_RCU=y,
things can break.  This is backwards of the usual behavior: Normally
CONFIG_PREEMPT=y kernels are a bit less prone to RCU CPU stall warnings,
at least assuming that the kernel spends a relatively small fraction of
its time in RCU read-side critical sections.

So, how could this be?

1.	Someone forgot an rcu_read_unlock() on one of the exit paths from
	some RCU read-side critical section somewhere.  This seems unlikely,
	but either CONFIG_PROVE_RCU=y or CONFIG_DEBUG_PREEMPT=y should
	catch it (sketched below).

2.	Someone forgot a preempt_enable() on one of the exit paths from
	some preempt-disable region somewhere.  This also seems a bit
	unlikely, but CONFIG_DEBUG_PREEMPT=y should catch it.

3.	Preemption exposes a race condition that is extremely unlikely
	for CONFIG_PREEMPT=n.
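
As a concrete illustration of #1 (hypothetical code, not a real call
site), this is the kind of pattern those debug options flag; the
unbalanced preempt_disable() of #2 looks analogous:

#include <linux/rcupdate.h>

extern int do_lookup(int key);	/* hypothetical stand-in helper */

static int buggy_lookup(int key)
{
	int ret;

	rcu_read_lock();
	ret = do_lookup(key);
	if (ret < 0)
		return ret;	/* BUG: exits without rcu_read_unlock() */
	rcu_read_unlock();
	return ret;
}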

Of course, it wouldn't hurt for someone who knows mm better than I to
check my work.

> > Dec  2 14:24:39 tux kernel: [ 8475.941616] conftest[9730]: segfault at 0 ip 0000000000400640 sp 00007fffa67ab300 error 4 in conftest[400000+1000]
> > Dec  2 14:24:40 tux kernel: [ 8476.104725] conftest[9753]: segfault at 0 ip 00007f6863024906 sp 00007fff0e31cc48 error 4 in libc-2.19.so[7f6862efe000+1a1000]
> > Dec  2 14:25:54 tux kernel: [ 8550.791697] INFO: rcu_sched detected stalls on CPUs/tasks: { 4} (detected by 0, t=60002 jiffies, g=112854, c=112853, q=0)

Note that the patch I gave to Dâniel provides only rcu_sched, as opposed
to the usual CONFIG_PREEMPT=y rcu_preempt and rcu_sched.  This is expected
behavior for CONFIG_TREE_RCU=y/CONFIG_TREE_PREEMPT_RCU=n.

> > Dec  2 14:25:54 tux kernel: [ 8550.791702] Task dump for CPU 4:
> > Dec  2 14:25:54 tux kernel: [ 8550.791703] cc1             R  running task        0 14344  14340 0x00080008
> > Dec  2 14:25:54 tux kernel: [ 8550.791706]  000000001bcebcd8 ffff880100000003 ffffffff810cb7f1 ffff88021f5f5c00
> > Dec  2 14:25:54 tux kernel: [ 8550.791708]  ffff88011bcebfd8 ffff88011bcebce8 ffffffff811fb970 ffff8802149a2a00
> > Dec  2 14:25:54 tux kernel: [ 8550.791710]  ffff8802149a2cc8 ffff88011bcebd28 ffffffff8103e979 ffff88020ed01398
> > Dec  2 14:25:54 tux kernel: [ 8550.791712] Call Trace:
> > Dec  2 14:25:54 tux kernel: [ 8550.791718]  [<ffffffff810cb7f1>] ? release_pages+0xa1/0x1e0

This does have a loop whose length is controlled by the "nr" argument.

> > Dec  2 14:25:54 tux kernel: [ 8550.791722]  [<ffffffff811fb970>] ? cpumask_any_but+0x30/0x40

This one is inconsistent with release_pages() as its callee.
Besides, its runtime is limited by the number of CPUs, so it shouldn't
go on forever.

> > Dec  2 14:25:54 tux kernel: [ 8550.791725]  [<ffffffff8103e979>] ? flush_tlb_page+0x49/0xf0

This one should also have a sharply limited runtime.

> > Dec  2 14:25:54 tux kernel: [ 8550.791727]  [<ffffffff810cbe72>] ? lru_cache_add_active_or_unevictable+0x22/0x90

This one does acquire locks, so could in theory run for a long time.
Would require high contention on ->lru_lock, though.  A pvec can only
contain 14 pages, so the move loop should have limited runtime.
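
(For reference, the pagevec definition as best I recall it from
include/linux/pagevec.h circa 3.17 -- quoted from memory, so treat the
exact fields as an assumption; the 14 is PAGEVEC_SIZE:)

#define PAGEVEC_SIZE	14

struct pagevec {
	unsigned long nr;
	unsigned long cold;
	struct page *pages[PAGEVEC_SIZE];
};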

> > Dec  2 14:25:54 tux kernel: [ 8550.791731]  [<ffffffff810fc4c2>] ? alloc_pages_vma+0x72/0x130

This one contains a retry loop, at least if CONFIG_NUMA=y.  But I don't
see anything here that would block an RCU grace period.

> > Dec  2 14:25:54 tux kernel: [ 8550.791733]  [<ffffffff810cbe72>] ? lru_cache_add_active_or_unevictable+0x22/0x90

Duplicate address above, presumably one or both are due to stack-trace
confusion.

> > Dec  2 14:25:54 tux kernel: [ 8550.791735]  [<ffffffff810e5220>] ? handle_mm_fault+0x3a0/0xaf0

If this one had a problem, I would expect to see it in some of its called
functions.

> > Dec  2 14:25:54 tux kernel: [ 8550.791737]  [<ffffffff81039074>] ? __do_page_fault+0x224/0x4c0

Ditto.

> > Dec  2 14:25:54 tux kernel: [ 8550.791740]  [<ffffffff8110d54c>] ? new_sync_write+0x7c/0xb0

Ditto.

> > Dec  2 14:25:55 tux kernel: [ 8550.791743]  [<ffffffff8114765c>] ? fsnotify+0x27c/0x350

This one uses SRCU, not RCU.

> > Dec  2 14:25:55 tux kernel: [ 8550.791746]  [<ffffffff81087233>] ? rcu_eqs_enter+0x93/0xa0
> > Dec  2 14:25:55 tux kernel: [ 8550.791748]  [<ffffffff81087a5e>] ? rcu_user_enter+0xe/0x10

These two don't call fsnotify(), so I am assuming that the stack trace is
confused here.  Any chance of enabling frame pointers or some such to get
an accurate stack trace?  (And yes, this is one CPU tracing another live
CPU's stack, so some confusion is inherent, but probably not this far up
the stack.)

> > Dec  2 14:25:55 tux kernel: [ 8550.791749]  [<ffffffff8103938a>] ? do_page_fault+0x5a/0x70

Wrapper for __do_page_fault().  Yay!  Functions that actually call each
other in this stack trace!  ;-)

> > Dec  2 14:25:55 tux kernel: [ 8550.791752]  [<ffffffff8139d9d2>] ? page_fault+0x22/0x30

Not seeing much in the way of loops here.

							Thanx, Paul

> > 
> > 	If you need more info/testing, just ask.
> > 
> > -- 
> > Linux 3.17.0-dirty: Shuffling Zombie Juror
> > http://www.youtube.com/DanielFragaBR
> > http://exchangewar.info
> > Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL
> > 


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 18:09                                         ` Paul E. McKenney
@ 2014-12-02 18:41                                           ` Dâniel Fraga
  0 siblings, 0 replies; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-02 18:41 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, 2 Dec 2014 10:09:47 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> To Linus's point, I guess I could look at the RCU CPU stall warning.  ;-)
> 
> Summary:  Not seeing anything that would loop for 21 seconds.
> Dâniel, if you let this run, does it hit a second RCU CPU stall
> warning, or does it just lock up?

	It just locks up. I can't even use the keyboard or mouse, so I
have to hard-reset the system.

	I'm trying the bisect you asked for... even if it takes longer,
maybe I can find something for you.

-- 
Linux 3.17.0-rc6-00235-gb94d525: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 17:14                                         ` Dâniel Fraga
@ 2014-12-02 18:42                                           ` Paul E. McKenney
  2014-12-02 18:47                                             ` Dâniel Fraga
  0 siblings, 1 reply; 486+ messages in thread
From: Paul E. McKenney @ 2014-12-02 18:42 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, Dec 02, 2014 at 03:14:08PM -0200, Dâniel Fraga wrote:
> On Tue, 2 Dec 2014 09:04:07 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > Is it harder to reproduce with CONFIG_PREEMPT=y and CONFIG_TREE_PREEMPT_RCU=n?
> 
> 	Yes, it's much harder! :)
> 
> > If it is a -lot- harder to reproduce, it might be worth bisecting among
> > the RCU read-side critical sections.  If making a few of them be
> > non-preemptible greatly reduces the probability of the bug occurring,
> > that might provide a clue about root cause.
> > 
> > On the other hand, if it is just a little harder to reproduce, this
> > RCU read-side bisection would likely be an exercise in futility.
> 
> 	Ok, I want to bisect it. Since it could be painful to bisect,
> could you suggest 2 commits between 3.16.0 and 3.17.0 so we can narrow
> the bisect? I could just bisect between 3.16.0 and 3.17.0 but it would
> take many days :).
> 
> 	Ps: if you prefer I bisect between 3.16.0 and 3.17.0, no
> problem, but you'll have to be patient ;).

I was actually suggesting something a bit different.  Instead of bisecting
by release, bisect by code.  The procedure is as follows:

1.	I figure out some reliable way of making RCU allow preemption to
	be disabled for some RCU read-side critical sections, but not for
	others.  I send you the patch, which has rcu_read_lock_test()
	as well as rcu_read_lock().

2.	You build a kernel without my Kconfig hack, with my patch from
	#1 above, and build a kernel with CONFIG_PREEMPT=y (which of
	course implies CONFIG_TREE_PREEMPT_RCU=y, given that you are
	building without my Kconfig hack).

3.	You make a list of all the rcu_read_lock() uses in the kernel
	(or ask me to provide it).  You change the rcu_read_lock()
	calls in the first half of this list to rcu_read_lock_test().

	If the kernel locks up as easily with this change as it did
	in a stock CONFIG_PREEMPT=y CONFIG_TREE_PREEMPT_RCU=y kernel,
	change half of the remaining rcu_read_lock() calls to
	rcu_read_lock_test().  If the kernel is much more resistant
	to lockup, change half of the rcu_read_lock_test() calls
	back to rcu_read_lock().

4.	It is quite possible that several of the RCU read-side critical
	sections contribute to the unreliability, in which case the
	bisection will get a bit more complicated.

Other thoughts on how to attack this?
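
For concreteness, a minimal sketch of what step 1's
rcu_read_lock_test() could look like, assuming the test variant simply
pins its critical section non-preemptible (an assumption, not the
actual patch):

#include <linux/preempt.h>
#include <linux/rcupdate.h>

/* Sketch only: marked read-side critical sections become
 * non-preemptible, while plain rcu_read_lock() sections keep the
 * usual preemptible-RCU semantics. */
static inline void rcu_read_lock_test(void)
{
	preempt_disable();
	rcu_read_lock();
}

static inline void rcu_read_unlock_test(void)
{
	rcu_read_unlock();
	preempt_enable();
}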

							Thanx, Paul


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 18:42                                           ` Paul E. McKenney
@ 2014-12-02 18:47                                             ` Dâniel Fraga
  2014-12-02 19:11                                               ` Paul E. McKenney
  0 siblings, 1 reply; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-02 18:47 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, 2 Dec 2014 10:42:02 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> I was actually suggesting something a bit different.  Instead of bisecting
> by release, bisect by code.  The procedure is as follows:
> 
> 1.	I figure out some reliable way of making RCU allow preemption to
> 	be disabled for some RCU read-side critical sections, but not for
> 	others.  I send you the patch, which has rcu_read_lock_test()
> 	as well as rcu_read_lock().
> 
> 2.	You build a kernel without my Kconfig hack, with my patch from
> 	#1 above, and build a kernel with CONFIG_PREEMPT=y (which of
> 	course implies CONFIG_TREE_PREEMPT_RCU=y, given that you are
> 	building without my Kconfig hack).
> 
> 3.	You make a list of all the rcu_read_lock() uses in the kernel
> 	(or ask me to provide it).  You change the rcu_read_lock()
> 	calls in the first half of this list to rcu_read_lock_test().
> 
> 	If the kernel locks up as easily with this change as it did
> 	in a stock CONFIG_PREEMPT=y CONFIG_TREE_PREEMPT_RCU=y kernel,
> 	change half of the remaining rcu_read_lock() calls to
> 	rcu_read_lock_test().  If the kernel is much more resistant
> 	to lockup, change half of the rcu_read_lock_test() calls
> 	back to rcu_read_lock().

	Ok Paul, I want to do everything I can to help you debug this.

	So can you provide me the list you mentioned at point 3 (or
tell me how I can get it)? If you guide me through this, I can do
whatever you need. Thanks!

-- 
Linux 3.17.0-rc6-00235-gb94d525: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 18:47                                             ` Dâniel Fraga
@ 2014-12-02 19:11                                               ` Paul E. McKenney
  2014-12-02 19:24                                                 ` Dâniel Fraga
  0 siblings, 1 reply; 486+ messages in thread
From: Paul E. McKenney @ 2014-12-02 19:11 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, Dec 02, 2014 at 04:47:31PM -0200, Dâniel Fraga wrote:
> On Tue, 2 Dec 2014 10:42:02 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > I was actually suggesting something a bit different.  Instead of bisecting
> > by release, bisect by code.  The procedure is as follows:
> > 
> > 1.	I figure out some reliable way of making RCU allow preemption to
> > 	be disabled for some RCU read-side critical sections, but not for
> > 	others.  I send you the patch, which has rcu_read_lock_test()
> > 	as well as rcu_read_lock().
> > 
> > 2.	You build a kernel without my Kconfig hack, with my patch from
> > 	#1 above, and build a kernel with CONFIG_PREEMPT=y (which of
> > 	course implies CONFIG_TREE_PREEMPT_RCU=y, given that you are
> > 	building without my Kconfig hack).
> > 
> > 3.	You make a list of all the rcu_read_lock() uses in the kernel
> > 	(or ask me to provide it).  You change the rcu_read_lock()
> > 	calls in the first half of this list to rcu_read_lock_test().
> > 
> > 	If the kernel locks up as easily with this change as it did
> > 	in a stock CONFIG_PREEMPT=y CONFIG_TREE_PREEMPT_RCU=y kernel,
> > 	change half of the remaining rcu_read_lock() calls to
> > 	rcu_read_lock_test().  If the kernel is much more resistant
> > 	to lockup, change half of the rcu_read_lock_test() calls
> > 	back to rcu_read_lock().
> 
> 	Ok Paul, I want to do everything I can to help you debug this.
> 
> 	So can you provide me the list you mentioned at point 3 (or
> tell me how I can get it)? If you guide me through this, I can do
> whatever you need. Thanks!

OK.  I need to know exactly what version of the Linux kernel you are
using.  3.18-rc7?  (I am not too worried about exactly which version
you are using as long as I know which version it is.)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 19:11                                               ` Paul E. McKenney
@ 2014-12-02 19:24                                                 ` Dâniel Fraga
  2014-12-02 20:56                                                   ` Paul E. McKenney
  0 siblings, 1 reply; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-02 19:24 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, 2 Dec 2014 11:11:43 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> OK.  I need to know exactly what version of the Linux kernel you are
> using.  3.18-rc7?  (I am not too worried about exactly which version
> you are using as long as I know which version it is.)

	Ok, I stopped bisecting and went back to the stock 3.17.0
kernel. I'm testing with 3.17.0 because it's the first version to show
problems. If you want me to go to 3.18-rc7, just ask and I can check
it out through git.

	Ps: my signature will reflect the kernel I'm using now ;)

-- 
Linux 3.17.0: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 20:36                                   ` Linus Torvalds
  2014-12-01 23:08                                     ` Chris Mason
@ 2014-12-02 19:31                                     ` Dave Jones
  2014-12-02 21:17                                       ` Linus Torvalds
  2014-12-02 20:30                                     ` Dave Jones
  2 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-12-02 19:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dâniel Fraga, Sasha Levin, Paul E. McKenney,
	Linux Kernel Mailing List

On Mon, Dec 01, 2014 at 12:36:34PM -0800, Linus Torvalds wrote:
 > On Mon, Dec 1, 2014 at 12:28 PM, Dâniel Fraga <fragabr@gmail.com> wrote:
 > >
 > >         Hi Paul. Please, I'd like the patch, because without
 > > preemption, I'm unable to trigger this bug.
 > 
 > Ok, that's already interesting information. And yes, it would probably
 > be interesting to see if CONFIG_PREEMPT=y but !CONFIG_TREE_PREEMPT_RCU
 > then solves it too, to narrow it down to one but not the other..
 > 
 > DaveJ - what about your situation? The standard Fedora kernels use
 > CONFIG_PREEMPT_VOLUNTARY, do you have CONFIG_PREEMPT and
 > CONFIG_TREE_PREEMPT_RCU enabled?

I periodically switch PREEMPT options, just to see if anything new falls
out, though I've been on CONFIG_PREEMPT for quite a while.
So right now I'm testing the TREE_PREEMPT_RCU case.

I'm in the process of bisecting 3.16 -> 3.17 (currently on -rc1).
Thanksgiving kind of screwed up my flow, but 3.16 got a real
pounding for over 3 days with no problems.
I can give a !PREEMPT_RCU build a try, in case that turns out to
point to something quicker than a bisect is going to.

 > I think you and Sasha both saw some RCU oddities too, no?

I don't recall a kernel where I didn't see RCU oddities of some
description, but in recent times, not so much, though a few releases
back I did change some of the RCU related CONFIG options while
Paul & co were chasing down some bugs.

These days I've been running with..

# RCU Subsystem
CONFIG_TREE_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
# CONFIG_RCU_USER_QS is not set
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
CONFIG_TREE_RCU_TRACE=y
CONFIG_RCU_BOOST=y
CONFIG_RCU_BOOST_PRIO=1
CONFIG_RCU_BOOST_DELAY=500
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_NONE is not set
# CONFIG_RCU_NOCB_CPU_ZERO is not set
CONFIG_RCU_NOCB_CPU_ALL=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
# RCU Debugging
CONFIG_SPARSE_RCU_POINTER=y
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
CONFIG_RCU_CPU_STALL_VERBOSE=y
CONFIG_RCU_CPU_STALL_INFO=y
CONFIG_RCU_TRACE=y

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 23:08                                     ` Chris Mason
  2014-12-01 23:25                                       ` Linus Torvalds
  2014-12-02 14:13                                       ` Mike Galbraith
@ 2014-12-02 19:32                                       ` Dave Jones
  2014-12-02 23:32                                         ` Sasha Levin
  2 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-12-02 19:32 UTC (permalink / raw)
  To: Chris Mason, Linus Torvalds, Dâniel Fraga, Sasha Levin,
	Paul E. McKenney, Linux Kernel Mailing List

On Mon, Dec 01, 2014 at 06:08:38PM -0500, Chris Mason wrote:
 > I'm not sure if this is related, but running trinity here, I noticed it
 > was stuck at 100% system time on every CPU.  perf report tells me we are
 > spending all of our time in spin_lock under the sync system call.
 > 
 > I think it's coming from contention in the bdi_queue_work() call from
 > inside sync_inodes_sb, which is spin_lock_bh(). 
 > 
 > I wonder if we're just spinning so hard on this one bh lock that we're
 > starving the watchdog?
 > 
 > Dave, do you have spinlock debugging on?  

That has been a constant, yes. I can try with that disabled some time.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-01 20:36                                   ` Linus Torvalds
  2014-12-01 23:08                                     ` Chris Mason
  2014-12-02 19:31                                     ` Dave Jones
@ 2014-12-02 20:30                                     ` Dave Jones
  2014-12-02 20:48                                       ` Paul E. McKenney
  2 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-12-02 20:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dâniel Fraga, Sasha Levin, Paul E. McKenney,
	Linux Kernel Mailing List

On Mon, Dec 01, 2014 at 12:36:34PM -0800, Linus Torvalds wrote:
 > On Mon, Dec 1, 2014 at 12:28 PM, Dâniel Fraga <fragabr@gmail.com> wrote:
 > >
 > >         Hi Paul. Please, I'd like the patch, because without
 > > preemption, I'm unable to trigger this bug.
 > 
 > Ok, that's already interesting information. And yes, it would probably
 > be interesting to see if CONFIG_PREEMPT=y but !CONFIG_TREE_PREEMPT_RCU
 > then solves it too, to narrow it down to one but not the other..

That combination doesn't seem possible. TREE_PREEMPT_RCU is the only
possible choice if PREEMPT=y

	Dave

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 20:30                                     ` Dave Jones
@ 2014-12-02 20:48                                       ` Paul E. McKenney
  0 siblings, 0 replies; 486+ messages in thread
From: Paul E. McKenney @ 2014-12-02 20:48 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Dâniel Fraga, Sasha Levin,
	Linux Kernel Mailing List

On Tue, Dec 02, 2014 at 03:30:44PM -0500, Dave Jones wrote:
> On Mon, Dec 01, 2014 at 12:36:34PM -0800, Linus Torvalds wrote:
>  > On Mon, Dec 1, 2014 at 12:28 PM, Dâniel Fraga <fragabr@gmail.com> wrote:
>  > >
>  > >         Hi Paul. Please, I'd like the patch, because without
>  > > preemption, I'm unable to trigger this bug.
>  > 
>  > Ok, that's already interesting information. And yes, it would probably
>  > be interesting to see if CONFIG_PREEMPT=y but !CONFIG_TREE_PREEMPT_RCU
>  > then solves it too, to narrow it down to one but not the other..
> 
> That combination doesn't seem possible. TREE_PREEMPT_RCU is the only
> possible choice if PREEMPT=y

Indeed, getting that combination requires a Kconfig patch, which I
supplied below.  Not for mainline, debugging only.

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/init/Kconfig b/init/Kconfig
index 903505e66d1d..2cf71fcd514f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -469,7 +469,7 @@ choice
 
 config TREE_RCU
 	bool "Tree-based hierarchical RCU"
-	depends on !PREEMPT && SMP
+	depends on SMP
 	select IRQ_WORK
 	help
 	  This option selects the RCU implementation that is


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 19:24                                                 ` Dâniel Fraga
@ 2014-12-02 20:56                                                   ` Paul E. McKenney
  2014-12-02 22:01                                                     ` Dâniel Fraga
  0 siblings, 1 reply; 486+ messages in thread
From: Paul E. McKenney @ 2014-12-02 20:56 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, Dec 02, 2014 at 05:24:39PM -0200, Dâniel Fraga wrote:
> On Tue, 2 Dec 2014 11:11:43 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > OK.  I need to know exactly what version of the Linux kernel you are
> > using.  3.18-rc7?  (I am not too worried about exactly which version
> > you are using as long as I know which version it is.)
> 
> 	Ok, I stopped bisecting and went back to the stock 3.17.0
> kernel. I'm testing with 3.17.0 because it's the first version to show
> problems. If you want me to go to 3.18-rc7, just ask and I can check
> it out through git.
> 
> 	Ps: my signature will reflect the kernel I'm using now ;)

And I left out a step.  Let's make sure that my preempt_disable() hack
to CONFIG_TREE_PREEMPT_RCU=y has the same effect as the Kconfig hack
that allowed CONFIG_PREEMPT=y and CONFIG_TREE_PREEMPT_RCU=n.  Could you
please try out the following patch configured with CONFIG_PREEMPT=y
and CONFIG_TREE_PREEMPT_RCU=y?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index e0d31a345ee6..fff605a9e87f 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -71,7 +71,11 @@ module_param(rcu_expedited, int, 0);
  */
 void __rcu_read_lock(void)
 {
-	current->rcu_read_lock_nesting++;
+	struct task_struct *t = current;
+
+	if (!t->rcu_read_lock_nesting)
+		preempt_disable();
+	t->rcu_read_lock_nesting++;
 	barrier();  /* critical section after entry code. */
 }
 EXPORT_SYMBOL_GPL(__rcu_read_lock);
@@ -92,6 +96,7 @@ void __rcu_read_unlock(void)
 	} else {
 		barrier();  /* critical section before exit code. */
 		t->rcu_read_lock_nesting = INT_MIN;
+		preempt_enable();
 		barrier();  /* assign before ->rcu_read_unlock_special load */
 		if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special.s)))
 			rcu_read_unlock_special(t);


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 19:31                                     ` Dave Jones
@ 2014-12-02 21:17                                       ` Linus Torvalds
  0 siblings, 0 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-12-02 21:17 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Dâniel Fraga, Sasha Levin,
	Paul E. McKenney, Linux Kernel Mailing List

On Tue, Dec 2, 2014 at 11:31 AM, Dave Jones <davej@redhat.com> wrote:
>
> I'm in the process of bisecting 3.16 -> 3.17 (currently on -rc1).
> Thanksgiving kind of screwed up my flow, but 3.16 got a real
> pounding for over 3 days with no problems.
> I can give a !PREEMPT_RCU build a try, in case that turns out to
> point to something quicker than a bisect is going to.

No, go on with the bisect. I think at this point we'll all be happier
narrowing down the range of commits than anything else. The behavior
Dâniel sees may be entirely unrelated anyway.

                   Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 20:56                                                   ` Paul E. McKenney
@ 2014-12-02 22:01                                                     ` Dâniel Fraga
  2014-12-02 22:10                                                       ` Paul E. McKenney
  2014-12-02 22:10                                                       ` Linus Torvalds
  0 siblings, 2 replies; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-02 22:01 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, 2 Dec 2014 12:56:36 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> And I left out a step.  Let's make sure that my preempt_disable() hack
> to CONFIG_TREE_PREEMPT_RCU=y has the same effect as the Kconfig hack
> that allowed CONFIG_PREEMPT=y and CONFIG_TREE_PREEMPT_RCU=n.  Could you
> please try out the following patch configured with CONFIG_PREEMPT=y
> and CONFIG_TREE_PREEMPT_RCU=y?

	Of course! I applied your patch to 3.17 stock kernel and after
stressing it (compiling with -j8 and watching videos on Youtube) to
trigger the bug I got the following:

Dec  2 19:47:25 tux kernel: [  927.973547] INFO: rcu_preempt detected stalls on CPUs/tasks: { 5} (detected by 1, t=60002 jiffies, g=71142, c=71141, q=0)
Dec  2 19:47:26 tux kernel: [  927.973553] Task dump for CPU 5:
Dec  2 19:47:26 tux kernel: [  927.973555] cc1             R  running task        0 30691  30680 0x00080008
Dec  2 19:47:26 tux kernel: [  927.973558]  ffff88021f351bc0 ffff8801d5743f00 ffffffff8107062a ffff88021f351c38
Dec  2 19:47:26 tux kernel: [  927.973560]  ffff8801d5743ea8 ffff8800cd3b0000 ffff8800cd3b041c 0000000000000000
Dec  2 19:47:26 tux kernel: [  927.973562]  00007f89b4d7d8e8 00007f89b5939a60 ffff88021f34d2c0 00007f89b4d7f000
Dec  2 19:47:26 tux kernel: [  927.973564] Call Trace:
Dec  2 19:47:26 tux kernel: [  927.973573]  [<ffffffff8107062a>] ? pick_next_task_fair+0x6aa/0x890
Dec  2 19:47:26 tux kernel: [  927.973577]  [<ffffffff81087483>] ? rcu_eqs_enter+0x93/0xa0
Dec  2 19:47:26 tux kernel: [  927.973579]  [<ffffffff81087f2e>] ? rcu_user_enter+0xe/0x10
Dec  2 19:47:26 tux kernel: [  927.973582]  [<ffffffff8103935a>] ? do_page_fault+0x5a/0x70
Dec  2 19:47:26 tux kernel: [  927.973585]  [<ffffffff8139bed2>] ? page_fault+0x22/0x30
Dec  2 19:47:30 tux kernel: [  932.471964] CPU1: Core temperature above threshold, cpu clock throttled (total events = 820)
Dec  2 19:47:30 tux kernel: [  932.471966] CPU6: Package temperature above threshold, cpu clock throttled (total events = 2624)
Dec  2 19:47:30 tux kernel: [  932.471967] CPU3: Package temperature above threshold, cpu clock throttled (total events = 2624)
Dec  2 19:47:30 tux kernel: [  932.471968] CPU7: Package temperature above threshold, cpu clock throttled (total events = 2624)
Dec  2 19:47:30 tux kernel: [  932.471969] CPU0: Package temperature above threshold, cpu clock throttled (total events = 2624)
Dec  2 19:47:30 tux kernel: [  932.471970] CPU2: Package temperature above threshold, cpu clock throttled (total events = 2624)
Dec  2 19:47:30 tux kernel: [  932.471978] CPU1: Package temperature above threshold, cpu clock throttled (total events = 2624)
Dec  2 19:47:30 tux kernel: [  932.472922] CPU1: Core temperature/speed normal
Dec  2 19:47:30 tux kernel: [  932.472923] CPU6: Package temperature/speed normal
Dec  2 19:47:31 tux kernel: [  932.472923] CPU2: Package temperature/speed normal
Dec  2 19:47:31 tux kernel: [  932.472924] CPU3: Package temperature/speed normal
Dec  2 19:47:31 tux kernel: [  932.472925] CPU0: Package temperature/speed normal

	Waiting for your next instructions.

-- 
Linux 3.17.0-dirty: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 22:01                                                     ` Dâniel Fraga
@ 2014-12-02 22:10                                                       ` Paul E. McKenney
  2014-12-02 22:18                                                         ` Dâniel Fraga
  2014-12-02 22:10                                                       ` Linus Torvalds
  1 sibling, 1 reply; 486+ messages in thread
From: Paul E. McKenney @ 2014-12-02 22:10 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, Dec 02, 2014 at 08:01:49PM -0200, Dâniel Fraga wrote:
> On Tue, 2 Dec 2014 12:56:36 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > And I left out a step.  Let's make sure that my preempt_disable() hack
> > to CONFIG_TREE_PREEMPT_RCU=y has the same effect as the Kconfig hack
> > that allowed CONFIG_PREEMPT=y and CONFIG_TREE_PREEMPT_RCU=n.  Could you
> > please try out the following patch configured with CONFIG_PREEMPT=y
> > and CONFIG_TREE_PREEMPT_RCU=y?
> 
> 	Of course! I applied your patch to 3.17 stock kernel and after
> stressing it (compiling with -j8 and watching videos on Youtube) to
> trigger the bug I got the following:

Thank you!!!

Was this as difficult to trigger as the version with the Kconfig hack
that used CONFIG_PREEMPT=y and CONFIG_TREE_PREEMPT_RCU=n?

							Thanx, Paul

> Dec  2 19:47:25 tux kernel: [  927.973547] INFO: rcu_preempt detected stalls on CPUs/tasks: { 5} (detected by 1, t=60002 jiffies, g=71142, c=71141, q=0)
> Dec  2 19:47:26 tux kernel: [  927.973553] Task dump for CPU 5:
> Dec  2 19:47:26 tux kernel: [  927.973555] cc1             R  running task        0 30691  30680 0x00080008
> Dec  2 19:47:26 tux kernel: [  927.973558]  ffff88021f351bc0 ffff8801d5743f00 ffffffff8107062a ffff88021f351c38
> Dec  2 19:47:26 tux kernel: [  927.973560]  ffff8801d5743ea8 ffff8800cd3b0000 ffff8800cd3b041c 0000000000000000
> Dec  2 19:47:26 tux kernel: [  927.973562]  00007f89b4d7d8e8 00007f89b5939a60 ffff88021f34d2c0 00007f89b4d7f000
> Dec  2 19:47:26 tux kernel: [  927.973564] Call Trace:
> Dec  2 19:47:26 tux kernel: [  927.973573]  [<ffffffff8107062a>] ? pick_next_task_fair+0x6aa/0x890
> Dec  2 19:47:26 tux kernel: [  927.973577]  [<ffffffff81087483>] ? rcu_eqs_enter+0x93/0xa0
> Dec  2 19:47:26 tux kernel: [  927.973579]  [<ffffffff81087f2e>] ? rcu_user_enter+0xe/0x10
> Dec  2 19:47:26 tux kernel: [  927.973582]  [<ffffffff8103935a>] ? do_page_fault+0x5a/0x70
> Dec  2 19:47:26 tux kernel: [  927.973585]  [<ffffffff8139bed2>] ? page_fault+0x22/0x30
> Dec  2 19:47:30 tux kernel: [  932.471964] CPU1: Core temperature above threshold, cpu clock throttled (total events = 820)
> Dec  2 19:47:30 tux kernel: [  932.471966] CPU6: Package temperature above threshold, cpu clock throttled (total events = 2624)
> Dec  2 19:47:30 tux kernel: [  932.471967] CPU3: Package temperature above threshold, cpu clock throttled (total events = 2624)
> Dec  2 19:47:30 tux kernel: [  932.471968] CPU7: Package temperature above threshold, cpu clock throttled (total events = 2624)
> Dec  2 19:47:30 tux kernel: [  932.471969] CPU0: Package temperature above threshold, cpu clock throttled (total events = 2624)
> Dec  2 19:47:30 tux kernel: [  932.471970] CPU2: Package temperature above threshold, cpu clock throttled (total events = 2624)
> Dec  2 19:47:30 tux kernel: [  932.471978] CPU1: Package temperature above threshold, cpu clock throttled (total events = 2624)
> Dec  2 19:47:30 tux kernel: [  932.472922] CPU1: Core temperature/speed normal
> Dec  2 19:47:30 tux kernel: [  932.472923] CPU6: Package temperature/speed normal
> Dec  2 19:47:31 tux kernel: [  932.472923] CPU2: Package temperature/speed normal
> Dec  2 19:47:31 tux kernel: [  932.472924] CPU3: Package temperature/speed normal
> Dec  2 19:47:31 tux kernel: [  932.472925] CPU0: Package temperature/speed normal
> 
> 	Waiting for your next instructions.
> 
> -- 
> Linux 3.17.0-dirty: Shuffling Zombie Juror
> http://www.youtube.com/DanielFragaBR
> http://exchangewar.info
> Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL
> 


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 22:01                                                     ` Dâniel Fraga
  2014-12-02 22:10                                                       ` Paul E. McKenney
@ 2014-12-02 22:10                                                       ` Linus Torvalds
  2014-12-02 22:16                                                         ` Dâniel Fraga
  2014-12-03  3:21                                                         ` Dâniel Fraga
  1 sibling, 2 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-12-02 22:10 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Paul E. McKenney, Linux Kernel Mailing List

On Tue, Dec 2, 2014 at 2:01 PM, Dâniel Fraga <fragabr@gmail.com> wrote:
>
>         Of course! I applied your patch to 3.17 stock kernel and after
> stressing it (compiling with -j8 and watching videos on Youtube) to
> trigger the bug I got the following:

So it appears that you can recreate this much more quickly than DaveJ
can recreate his issue.

The two issues may be entirely unrelated, but it is certainly
quite possible that they have some relation to each other, and the
timing is intriguing, in that 3.17 seems to be the first kernel
release this happened in.

So at this point I think I'd ask you to just go back to your bisection
that you apparently already started earlier. I take it 3.16 worked
fine, and that's what you used as the good base for your bisect?

Even if it's something else than what DaveJ sees (or perhaps
*particularly* if it's something else), bisecting when it started
would be very worthwhile.

There's 13k+ commits in between 3.16 and 3.17, so a full bisect should
be around 15 test-points. But judging by the timing of your emails,
you can generally reproduce this relatively quickly..

                   Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 22:10                                                       ` Linus Torvalds
@ 2014-12-02 22:16                                                         ` Dâniel Fraga
  2014-12-03  3:21                                                         ` Dâniel Fraga
  1 sibling, 0 replies; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-02 22:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Paul E. McKenney, Linux Kernel Mailing List

On Tue, 2 Dec 2014 14:10:33 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So it appears that you can recreate this much more quickly than DaveJ
> can recreate his issue.
> 
> The two issues may be entirely unrelated, but it is certainly
> quite possible that they have some relation to each other, and the
> timing is intriguing, in that 3.17 seems to be the first kernel
> release this happened in.
> 
> So at this point I think I'd ask you to just go back to your bisection
> that you apparently already started earlier. I take it 3.16 worked
> fine, and that's what you used as the good base for your bisect?
> 
> Even if it's something else than what DaveJ sees (or perhaps
> *particularly* if it's something else), bisecting when it started
> would be very worthwhile.
> 
> There's 13k+ commits in between 3.16 and 3.17, so a full bisect should
> be around 15 test-points. But judging by the timing of your emails,
> you can generally reproduce this relatively quickly..

	No problem, Linus. I'll try the full bisect and post here
later with the final result.

-- 
Linux 3.17.0-dirty: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 22:10                                                       ` Paul E. McKenney
@ 2014-12-02 22:18                                                         ` Dâniel Fraga
  2014-12-02 22:35                                                           ` Paul E. McKenney
  0 siblings, 1 reply; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-02 22:18 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, 2 Dec 2014 14:10:31 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> Thank you!!!

	;)

> Was this as difficult to trigger as the version with the Kconfig hack
> that used CONFIG_PREEMPT=y and CONFIG_TREE_PREEMPT_RCU=n?

	Yes. I had to try many times until I got the call trace.

	I'll try the bisect as Linus suggested, but if you have any
other suggestions, just ask ;). Thanks Paul.

-- 
Linux 3.17.0-dirty: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 22:18                                                         ` Dâniel Fraga
@ 2014-12-02 22:35                                                           ` Paul E. McKenney
  0 siblings, 0 replies; 486+ messages in thread
From: Paul E. McKenney @ 2014-12-02 22:35 UTC (permalink / raw)
  To: Dâniel Fraga; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Tue, Dec 02, 2014 at 08:18:46PM -0200, Dâniel Fraga wrote:
> On Tue, 2 Dec 2014 14:10:31 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > Thank you!!!
> 
> 	;)
> 
> > Was this as difficult to trigger as the version with the Kconfig hack
> > that used CONFIG_PREEMPT=y and CONFIG_TREE_PREEMPT_RCU=n?
> 
> 	Yes. I had to try many times until I got the call trace.
> 
> 	I'll try the bisect as Linus suggested, but if you have any
> other suggestions, just ask ;). Thanks Paul.

Sounds good to me -- getting a single commit somewhere between
3.16 and 3.17 is going to be a lot better than reasoning indirectly
from some set of RCU read-side critical sections.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 19:32                                       ` Dave Jones
@ 2014-12-02 23:32                                         ` Sasha Levin
  2014-12-03  0:09                                           ` Linus Torvalds
  2014-12-05  5:00                                           ` Sasha Levin
  0 siblings, 2 replies; 486+ messages in thread
From: Sasha Levin @ 2014-12-02 23:32 UTC (permalink / raw)
  To: Dave Jones, Chris Mason, Linus Torvalds, Dâniel Fraga,
	Paul E. McKenney, Linux Kernel Mailing List

On 12/02/2014 02:32 PM, Dave Jones wrote:
> On Mon, Dec 01, 2014 at 06:08:38PM -0500, Chris Mason wrote:
>  > I'm not sure if this is related, but running trinity here, I noticed it
>  > was stuck at 100% system time on every CPU.  perf report tells me we are
>  > spending all of our time in spin_lock under the sync system call.
>  > 
>  > I think it's coming from contention in the bdi_queue_work() call from
>  > inside sync_inodes_sb, which is spin_lock_bh(). 
>  > 
>  > I wonder if we're just spinning so hard on this one bh lock that we're
>  > starving the watchdog?
>  > 
>  > Dave, do you have spinlock debugging on?  
> 
> That has been a constant, yes. I can try with that disabled some time.

Here's my side of the story: I was observing RCU lockups which went away when
I disabled verbose printing for fault injections. It seems that printing one
line ~10 times a second can cause that...

I've disabled lock debugging to see if anything new will show up, and hit
something that may be related:

[  787.894288] ================================================================================
[  787.897074] UBSan: Undefined behaviour in kernel/sched/fair.c:4541:17
[  787.898981] signed integer overflow:
[  787.900066] 361516561629678 * 101500 cannot be represented in type 'long long int'
[  787.900066] CPU: 18 PID: 12958 Comm: trinity-c103 Not tainted 3.18.0-rc6-next-20141201-sasha-00070-g028060a-dirty #1528
[  787.900066]  0000000000000000 0000000000000000 ffffffff93b0f890 ffff8806e3eff918
[  787.900066]  ffffffff91f1cf26 1ffffffff3c2de73 ffffffff93b0f8a8 ffff8806e3eff938
[  787.900066]  ffffffff91f1fb90 1ffffffff3c2de73 ffffffff93b0f8a8 ffff8806e3eff9f8
[  787.900066] Call Trace:
[  787.900066] dump_stack (lib/dump_stack.c:52)
[  787.900066] ubsan_epilogue (lib/ubsan.c:159)
[  787.900066] handle_overflow (lib/ubsan.c:191)
[  787.900066] ? __do_page_fault (arch/x86/mm/fault.c:1220)
[  787.900066] ? local_clock (kernel/sched/clock.c:392)
[  787.900066] __ubsan_handle_mul_overflow (lib/ubsan.c:218)
[  787.900066] select_task_rq_fair (kernel/sched/fair.c:4541 kernel/sched/fair.c:4755)
[  787.900066] try_to_wake_up (kernel/sched/core.c:1415 kernel/sched/core.c:1724)
[  787.900066] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:33)
[  787.900066] default_wake_function (kernel/sched/core.c:2979)
[  787.900066] ? get_parent_ip (kernel/sched/core.c:2559)
[  787.900066] autoremove_wake_function (kernel/sched/wait.c:295)
[  787.900066] ? get_parent_ip (kernel/sched/core.c:2559)
[  787.900066] __wake_up_common (kernel/sched/wait.c:73)
[  787.900066] __wake_up_sync_key (include/linux/spinlock.h:364 kernel/sched/wait.c:146)
[  787.900066] pipe_write (fs/pipe.c:466)
[  787.900066] ? kasan_poison_shadow (mm/kasan/kasan.c:48)
[  787.900066] ? new_sync_read (fs/read_write.c:480)
[  787.900066] do_iter_readv_writev (fs/read_write.c:681)
[  787.900066] compat_do_readv_writev (fs/read_write.c:1029)
[  787.900066] ? wait_for_partner (fs/pipe.c:340)
[  787.900066] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[  787.900066] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[  787.900066] ? syscall_trace_enter_phase1 (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1486)
[  787.900066] compat_writev (fs/read_write.c:1145)
[  787.900066] compat_SyS_writev (fs/read_write.c:1163 fs/read_write.c:1151)
[  787.900066] ia32_do_call (arch/x86/ia32/ia32entry.S:446)
[  787.900066] ================================================================================

(For Linus asking himself "what the hell is this UBSan thing, I didn't merge that!" - it's an
undefined behaviour sanitizer that works with gcc5.)


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 23:32                                         ` Sasha Levin
@ 2014-12-03  0:09                                           ` Linus Torvalds
  2014-12-03  0:25                                             ` Sasha Levin
  2014-12-05  5:00                                           ` Sasha Levin
  1 sibling, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-12-03  0:09 UTC (permalink / raw)
  To: Sasha Levin, Peter Zijlstra, Ingo Molnar
  Cc: Dave Jones, Chris Mason, Dâniel Fraga, Paul E. McKenney,
	Linux Kernel Mailing List

On Tue, Dec 2, 2014 at 3:32 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>
> I've disabled lock debugging to see if anything new will show up, and hit
> something that may be related:

Very interesting. But your source code doesn't match mine - can you
say what that

    kernel/sched/fair.c:4541:17

line is?

There are at least five multiplications there (all inlined):

 - "imbalance*min_load" from find_idlest_group()

 - "factor * p->wakee_flips" in wake_wide()

 - at least three in wake_affine:

    "prev_eff_load *= capacity_of(this_cpu)"
    "this_eff_load *= this_load + effective_load(tg, this_cpu, weight, weight)"
    "prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight)"

(There are other multiplications too, but they are by constants afaik
and don't match yours).

None of those seem to have anything to do with the 3.16..3.17 changes,
but I might be missing something, and obviously this also might have
nothing to do with the problems anyway.

Adding Ingo/PeterZ to the participants again.

                 Linus


---
> [  787.894288] ================================================================================
> [  787.897074] UBSan: Undefined behaviour in kernel/sched/fair.c:4541:17
> [  787.898981] signed integer overflow:
> [  787.900066] 361516561629678 * 101500 cannot be represented in type 'long long int'
> [  787.900066] ubsan_epilogue (lib/ubsan.c:159)
> [  787.900066] handle_overflow (lib/ubsan.c:191)
> [  787.900066] ? __do_page_fault (arch/x86/mm/fault.c:1220)
> [  787.900066] ? local_clock (kernel/sched/clock.c:392)
> [  787.900066] __ubsan_handle_mul_overflow (lib/ubsan.c:218)
> [  787.900066] select_task_rq_fair (kernel/sched/fair.c:4541 kernel/sched/fair.c:4755)

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03  0:09                                           ` Linus Torvalds
@ 2014-12-03  0:25                                             ` Sasha Levin
  0 siblings, 0 replies; 486+ messages in thread
From: Sasha Levin @ 2014-12-03  0:25 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra, Ingo Molnar
  Cc: Dave Jones, Chris Mason, Dâniel Fraga, Paul E. McKenney,
	Linux Kernel Mailing List

On 12/02/2014 07:09 PM, Linus Torvalds wrote:
> On Tue, Dec 2, 2014 at 3:32 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>> >
>> > I've disabled lock debugging to see if anything new will show up, and hit
>> > something that may be related:
> Very interesting. But your source code doesn't match mine - can you
> say what that
> 
>     kernel/sched/fair.c:4541:17
> 
> line is?


Sorry about that, I'm testing on the -next kernel. The relevant code snippet is
in wake_affine:

        prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
        prev_eff_load *= capacity_of(this_cpu);

        if (this_load > 0) {
                this_eff_load *= this_load +
                        effective_load(tg, this_cpu, weight, weight);  <==== This one

                prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
        }
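
For scale: 361516561629678 * 101500 is roughly 3.67e19, while LLONG_MAX is
about 9.22e18, so the product overshoots the range of long long by a factor
of ~4. Presumably 101500 is this_eff_load after the capacity multiply and
the huge factor is the load term, though the report doesn't say which is
which. A quick userspace check of the reported values (illustrative only,
using gcc 5's __builtin_mul_overflow; not kernel code):

    #include <stdio.h>
    #include <limits.h>

    int main(void)
    {
            long long load = 361516561629678LL; /* from the UBSan report */
            long long eff  = 101500LL;          /* from the UBSan report */
            long long prod;

            /* checked multiply: returns nonzero if the result overflowed */
            if (__builtin_mul_overflow(load, eff, &prod))
                    printf("%lld * %lld overflows (LLONG_MAX = %lld)\n",
                           load, eff, LLONG_MAX);
            return 0;
    }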


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 16:58                                   ` Dâniel Fraga
  2014-12-02 17:17                                     ` Paul E. McKenney
@ 2014-12-03  2:03                                     ` Lai Jiangshan
  2014-12-03  5:22                                       ` Paul E. McKenney
  1 sibling, 1 reply; 486+ messages in thread
From: Lai Jiangshan @ 2014-12-03  2:03 UTC (permalink / raw)
  To: paulmck; +Cc: Dâniel Fraga, Linus Torvalds, Linux Kernel Mailing List

On 12/03/2014 12:58 AM, Dâniel Fraga wrote:
> On Tue, 2 Dec 2014 16:40:37 +0800
> Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> 
>> It is needed at least for testing.
>>
>> CONFIG_TREE_PREEMPT_RCU=y with CONFIG_PREEMPT=n is needed for testing too.
>>
>> Please enable them (or enable them under CONFIG_RCU_TRACE=y)
> 
> 	Lai, sorry but I didn't understand. Do you mean both of them
> enabled? Because how can CONFIG_TREE_PREEMPT_RCU be enabled without
> CONFIG_PREEMPT ?


Sorry, I was replying to Paul and my reply was off-topic; it has nothing
to do with your reports. Sorry again.

I think we need two combinations for testing (not in mainline, but I think
they should be enabled for test farms).

So I hope Paul enables both combinations.

combination1: CONFIG_TREE_PREEMPT_RCU=n & CONFIG_PREEMPT=y
combination2: CONFIG_TREE_PREEMPT_RCU=y & CONFIG_PREEMPT=n

The core code should work correctly in these combinations.
I agree with Paul that these combinations should not be enabled in production,
so my request is: enable these combinations under CONFIG_RCU_TRACE
or CONFIG_TREE_RCU_TRACE.

For myself, I always edit the Kconfig directly, thus it is not a problem
for me.  But there is no way for test farms to test these combinations.
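
The hand-edit amounts to relaxing the dependencies, roughly like this
sketch (the real entries live in a choice block in init/Kconfig, so the
exact form differs):

    # sketch: let RCU_TRACE unlock the "mismatched" flavours for testing
    config TREE_RCU
            bool "Tree-based hierarchical RCU"
            depends on !PREEMPT || RCU_TRACE

    config TREE_PREEMPT_RCU
            bool "Preemptible tree-based hierarchical RCU"
            depends on PREEMPT || RCU_TRACE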

Thanks,
Lai

> 
> 	If you mean both enabled, I already reported a call trace with
> both enabled:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=85941
> 
> 	Please see my previous answer to Linus and Paul too.
> 
> 	Regarding CONFIG_RCU_TRACE, do you mean
> "CONFIG_TREE_RCU_TRACE"? I couldn't find CONFIG_RCU_TRACE.
> 
> 	Thanks.
> 


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 22:10                                                       ` Linus Torvalds
  2014-12-02 22:16                                                         ` Dâniel Fraga
@ 2014-12-03  3:21                                                         ` Dâniel Fraga
  2014-12-03  4:14                                                           ` Linus Torvalds
  1 sibling, 1 reply; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-03  3:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Paul E. McKenney, Linux Kernel Mailing List

On Tue, 2 Dec 2014 14:10:33 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> There's 13k+ commits in between 3.16 and 3.17, so a full bisect should
> be around 15 test-points. But judging by the timing of your emails,
> you can generally reproduce this relatively quickly..

	Ok Linus and Paul, it took me almost 5 hours to bisect it and
the result is:

c9b88e9581828bb8bba06c5e7ee8ed1761172b6e is the first bad commit

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c9b88e9581828bb8bba06c5e7ee8ed1761172b6e

	I hope I didn't get any false positive/negative during 
bisect.

	And here's the complete bisect log (just in case):

git bisect start
# good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
# bad: [bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9] Linux 3.17
git bisect bad bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9
# bad: [f2d7e4d4398092d14fb039cb4d38e502d3f019ee] checkpatch: add fix_insert_line and fix_delete_line helpers
git bisect bad f2d7e4d4398092d14fb039cb4d38e502d3f019ee
# bad: [79eb238c76782a59d51adf8a3dd7f6444245b475] Merge tag 'tty-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect bad 79eb238c76782a59d51adf8a3dd7f6444245b475
# good: [3d582487beb83d650fbd25cb65688b0fbedc97f1] staging: vt6656: struct vnt_private pInterruptURB rename to interrupt_urb
git bisect good 3d582487beb83d650fbd25cb65688b0fbedc97f1
# bad: [e9c9eecabaa898ff3fedd98813ee4ac1a00d006a] Merge branch 'x86-cpufeature-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad e9c9eecabaa898ff3fedd98813ee4ac1a00d006a
# bad: [c9b88e9581828bb8bba06c5e7ee8ed1761172b6e] Merge tag 'trace-3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
git bisect bad c9b88e9581828bb8bba06c5e7ee8ed1761172b6e
# good: [47dfe4037e37b2843055ea3feccf1c335ea23a9c] Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
git bisect good 47dfe4037e37b2843055ea3feccf1c335ea23a9c
# good: [b11a6face1b6d5518319f797a74e22bb4309daa9] clk: Add missing of_clk_set_defaults export
git bisect good b11a6face1b6d5518319f797a74e22bb4309daa9
# good: [3a636388bae8390d23f31e061c0c6fdc14525786] tracing: Remove function_trace_stop and HAVE_FUNCTION_TRACE_MCOUNT_TEST
git bisect good 3a636388bae8390d23f31e061c0c6fdc14525786
# good: [e17acfdc83b877794c119fac4627e80510ea3c09] Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
git bisect good e17acfdc83b877794c119fac4627e80510ea3c09
# good: [c7ed326fa7cafb83ced5a8b02517a61672fe9e90] Merge tag 'ktest-v3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest
git bisect good c7ed326fa7cafb83ced5a8b02517a61672fe9e90
# good: [dc6f03f26f570104a2bb03f9d1deb588026d7c75] ftrace: Add warning if tramp hash does not match nr_trampolines
git bisect good dc6f03f26f570104a2bb03f9d1deb588026d7c75
# good: [ede392a75090aab49b01ecd6f7694bb9130ad461] tracing/uprobes: Kill the dead TRACE_EVENT_FL_USE_CALL_FILTER logic
git bisect good ede392a75090aab49b01ecd6f7694bb9130ad461
# good: [bb9ef1cb7d8668d6b0038b6f9f783c849135e40d] tracing: Change apply_subsystem_event_filter() paths to check file->system == dir
git bisect good bb9ef1cb7d8668d6b0038b6f9f783c849135e40d
# good: [6355d54438bfc3b636cb6453cd091f782fb9b4d7] tracing: Kill "filter_string" arg of replace_preds()
git bisect good 6355d54438bfc3b636cb6453cd091f782fb9b4d7
# good: [b8c0aa46b3e86083721b57ed2eec6bd2c29ebfba] Merge tag 'trace-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
git bisect good b8c0aa46b3e86083721b57ed2eec6bd2c29ebfba
# first bad commit: [c9b88e9581828bb8bba06c5e7ee8ed1761172b6e] Merge tag 'trace-3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

-- 
Linux 3.16.0-00409-gb8c0aa4: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03  3:21                                                         ` Dâniel Fraga
@ 2014-12-03  4:14                                                           ` Linus Torvalds
  2014-12-03  4:51                                                             ` Dâniel Fraga
                                                                               ` (2 more replies)
  0 siblings, 3 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-12-03  4:14 UTC (permalink / raw)
  To: Dâniel Fraga, Tejun Heo; +Cc: Paul E. McKenney, Linux Kernel Mailing List

On Tue, Dec 2, 2014 at 7:21 PM, Dâniel Fraga <fragabr@gmail.com> wrote:
>
>         Ok Linus and Paul, it took me almost 5 hours to bisect it and
> the result is:

Much faster than I expected. However:

> c9b88e9581828bb8bba06c5e7ee8ed1761172b6e is the first bad commit

Hgghnn.. A merge commit can certainly be the thing that introduces
bugs, but it *usually* isn't. Especially not one that is fairly small
and has no actual conflicts in it. Sure, there could be semantic
conflicts etc, but that's where "fairly small" comes in - that is just
not a complicated or subtle merge. And there are other reasons to
believe your bisection veered off into the weeds earlier. Read on.

So:

>         I hope I didn't get any false positive/negative during
> bisect.

Well, the "bad" ones should be pretty safe, since there is no question
at all about any case where things locked up. So unless you actually
mis-typed something or did something else silly, I'll trust the ones you marked
bad.

It's the ones marked "good" that are more questionable, and might be
wrong, because you didn't run for long enough, and didn't happen to
hit the right condition.

Your bisection log also kind of points to a mistake: it ends with a
long run of "all good". That usually means that you're not actually
getting closer to the bug: if you were, you'd - pretty much by
definition - also get closer to the "edge" of the bug, and you should
generally see a mix of good/bad as you narrow in on it. Of course,
it's all statistical, so I'm not saying that a run of "good"
bisections is a sure-fire sign of anything, but it's just another
sign: you may have marked something "good" that wasn't, and that
actually took you *away* from the bug, so now everything that followed
that false positive was good.

>         And here's the complete bisect log (just in case):

So this part I'll believe in:

> git bisect start
> # good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
> git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
> # bad: [bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9] Linux 3.17
> git bisect bad bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9
> # bad: [f2d7e4d4398092d14fb039cb4d38e502d3f019ee] checkpatch: add fix_insert_line and fix_delete_line helpers
> git bisect bad f2d7e4d4398092d14fb039cb4d38e502d3f019ee
> # bad: [79eb238c76782a59d51adf8a3dd7f6444245b475] Merge tag 'tty-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
> git bisect bad 79eb238c76782a59d51adf8a3dd7f6444245b475
> # good: [3d582487beb83d650fbd25cb65688b0fbedc97f1] staging: vt6656: struct vnt_private pInterruptURB rename to interrupt_urb
> git bisect good 3d582487beb83d650fbd25cb65688b0fbedc97f1
> # bad: [e9c9eecabaa898ff3fedd98813ee4ac1a00d006a] Merge branch 'x86-cpufeature-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect bad e9c9eecabaa898ff3fedd98813ee4ac1a00d006a
> # bad: [c9b88e9581828bb8bba06c5e7ee8ed1761172b6e] Merge tag 'trace-3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
> git bisect bad c9b88e9581828bb8bba06c5e7ee8ed1761172b6e

because anything marked "bad" clearly must be bad, and anything you
marked "good" before that was probably correct too - because you saw
"bad" cases after it, the good marking clearly hadn't made us ignore
the bug.

Put another way: "bad" is generally more trustworthy (because you
actively saw the bug), while a "good" _before_ a subsequent bad is
also trustworthy (because if the "good" kernel contained the bug and
you should have marked it bad, we'd then go on to test all the commits
that were *not* the bug, so we'd never see a "bad" kernel again).

Of course, the above rule-of-thumb is a simplification of reality. In
reality, there might be multiple bugs that come together and make the
whole good-vs-bad a much less black-and-white thing, but *generally* I
trust "git bisect bad" more than "git bisect good", and "git bisect
good" that is followed by "bad".

What is *really* suspicious is a series of "git bisect good" with no
"bad"s anywhere. Which is exactly what we see at the end of the
bisect.

So might I ask you to try starting from this point again (this is why
the bisect log is so useful - no need to retest the above part, you
can just mindlessly do that sequence by hand without testing), and
starting with this commit:

> # good: [47dfe4037e37b2843055ea3feccf1c335ea23a9c] Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
> git bisect good 47dfe4037e37b2843055ea3feccf1c335ea23a9c

Double-check whether that commit is really good. Run that "good"
kernel for a longer time, and under heavier load. Just to verify.

Because looking at the part of the bisect that seems trustworthy, and
looking at what remains (hint: do "gitk --bisect" while bisecting to
see what is going on), these are the merges in that set (in my
"mergelog" format):

    Bjorn Helgaas (1):
      PCI updates

    Borislav Petkov (1):
      EDAC changes

    Herbert Xu (1):
      crypto update

    Jeff Layton (1):
      file locking related changes

    Mike Turquette (1):
      clock framework updates

    Steven Rostedt (3):
      config-bisect changes
      tracing updates
      tracing filter cleanups

    Tejun Heo (4):
      workqueue updates
      percpu updates
      cgroup changes
      libata changes

and quite frankly, for some core bug like this, I'd suspect the
workqueue or percpu updates from Tejun (possibly cgroup), *not* the
tracing pull.

Of course, bugs can come in from anywhere, so it *could* be the
tracing one, and it *could* be the merge commit, but my gut just
screams that you probably missed one bad kernel, and marked it good.
And it's really that very first one (ie commit
47dfe4037e37b2843055ea3feccf1c335ea23a9c) that contains most of the
actually suspect code, so I'd really like you to re-test that one a
lot before you call it "good" again.

Humor me.

I added Tejun to the Cc, just because I wanted to give him a heads-up
that I am tentatively starting to blame him in my dark little mind..

                 Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03  4:14                                                           ` Linus Torvalds
@ 2014-12-03  4:51                                                             ` Dâniel Fraga
  2014-12-03  6:02                                                             ` Chris Rorvick
  2014-12-03 14:54                                                             ` Tejun Heo
  2 siblings, 0 replies; 486+ messages in thread
From: Dâniel Fraga @ 2014-12-03  4:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Tejun Heo, Paul E. McKenney, Linux Kernel Mailing List

On Tue, 2 Dec 2014 20:14:52 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> What is *really* suspicious is a series of "git bisect good" with no
> "bad"s anywhere. Which is exactly what we see at the end of the
> bisect.
> 
> So might I ask you to try starting from this point again (this is why
> the bisect log is so useful - no need to retest the above part, you
> can just mindlessly do that sequence by hand without testing), and
> starting with this commit:
> 
> > # good: [47dfe4037e37b2843055ea3feccf1c335ea23a9c] Merge branch
> > 'for-3.17' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup git bisect
> > good 47dfe4037e37b2843055ea3feccf1c335ea23a9c

> Of course, bugs can come in from anywhere, so it *could* be the
> tracing one, and it *could* be the merge commit, but my gut just
> screams that you probably missed one bad kernel, and marked it good.
> And it's really that very first one (ie commit
> 47dfe4037e37b2843055ea3feccf1c335ea23a9c) that contains most of the
> actually suspect code, so I'd really like you to re-test that one a
> lot before you call it "good" again.
> 
> Humor me.
> 
> I added Tejun to the Cc, just because I wanted to give him a heads-up
> that I am tentatively starting to blame him in my dark little mind..

	:)

	I understand, Linus. I'll test the 47dfe4037 commit you
suggested more thoroughly and report back tomorrow.

-- 
Linux 3.16.0-00409-gb8c0aa4: Shuffling Zombie Juror
http://www.youtube.com/DanielFragaBR
http://exchangewar.info
Bitcoin: 12H6661yoLDUZaYPdah6urZS5WiXwTAUgL

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03  2:03                                     ` Lai Jiangshan
@ 2014-12-03  5:22                                       ` Paul E. McKenney
  0 siblings, 0 replies; 486+ messages in thread
From: Paul E. McKenney @ 2014-12-03  5:22 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Dâniel Fraga, Linus Torvalds, Linux Kernel Mailing List

On Wed, Dec 03, 2014 at 10:03:56AM +0800, Lai Jiangshan wrote:
> On 12/03/2014 12:58 AM, Dâniel Fraga wrote:
> > On Tue, 2 Dec 2014 16:40:37 +0800
> > Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> > 
> >> It is needed at least for testing.
> >>
> >> CONFIG_TREE_PREEMPT_RCU=y with CONFIG_PREEMPT=n is needed for testing too.
> >>
> >> Please enable them (or enable them under CONFIG_RCU_TRACE=y)
> > 
> > 	Lai, sorry but I didn't understand. Do you mean both of them
> > enabled? Because how can CONFIG_TREE_PREEMPT_RCU be enabled without
> > CONFIG_PREEMPT ?
> 
> 
> Sorry, I was replying to Paul and my reply was off-topic; it has nothing
> to do with your reports. Sorry again.
> 
> I think we need two combinations for testing (not in mainline, but I think
> they should be enabled for test farms).
> 
> So I hope Paul enables both combinations.
> 
> combination1: CONFIG_TREE_PREEMPT_RCU=n & CONFIG_PREEMPT=y
> combination2: CONFIG_TREE_PREEMPT_RCU=y & CONFIG_PREEMPT=n
> 
> The core code should work correctly in these combinations.
> I agree with Paul that these combinations should not be enabled in production,
> so my request is: enable these combinations under CONFIG_RCU_TRACE
> or CONFIG_TREE_RCU_TRACE.
> 
> For myself, I always edit the Kconfig directly, thus it is not a problem
> for me.  But there is no way for test farms to test these combinations.

OK, I'll bite...

How have these two combinations helped you in your testing?

The reason I ask is that I am actually trying to -decrease- the RCU
configurations, not increase them.  Added configurations need to have
strong justification, for example, the kernel-tracing/patching need
for tasks_rcu.

							Thanx, Paul

> Thanks,
> Lai
> 
> > 
> > 	If you mean both enabled, I already reported a call trace with
> > both enabled:
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=85941
> > 
> > 	Please see my previous answer to Linus and Paul too.
> > 
> > 	Regarding CONFIG_RCU_TRACE, do you mean
> > "CONFIG_TREE_RCU_TRACE"? I couldn't find CONFIG_RCU_TRACE.
> > 
> > 	Thanks.
> > 
> 


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03  4:14                                                           ` Linus Torvalds
  2014-12-03  4:51                                                             ` Dâniel Fraga
@ 2014-12-03  6:02                                                             ` Chris Rorvick
  2014-12-03 15:22                                                               ` Linus Torvalds
  2014-12-03 14:54                                                             ` Tejun Heo
  2 siblings, 1 reply; 486+ messages in thread
From: Chris Rorvick @ 2014-12-03  6:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dâniel Fraga, Tejun Heo, Paul E. McKenney,
	Linux Kernel Mailing List

On Tue, Dec 2, 2014 at 10:14 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Put another way: "bad" is generally more trustworthy (because you
> actively saw the bug),

Makes sense, but ...

> while a "good" _before_ a subsequent bad is
> also trustworthy (because if the "good" kernel contained the bug and
> you should have marked it bad, we'd then go on to test all the commits
> that were *not* the bug, so we'd never see a "bad" kernel again).

wouldn't marking a bad commit "good" cause you to not see a *good*
kernel again?  Marking it "good" would seem to push the search away from
the bug toward the current "bad" commit.

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03  4:14                                                           ` Linus Torvalds
  2014-12-03  4:51                                                             ` Dâniel Fraga
  2014-12-03  6:02                                                             ` Chris Rorvick
@ 2014-12-03 14:54                                                             ` Tejun Heo
  2 siblings, 0 replies; 486+ messages in thread
From: Tejun Heo @ 2014-12-03 14:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dâniel Fraga, Paul E. McKenney, Linux Kernel Mailing List

Hello,

On Tue, Dec 02, 2014 at 08:14:52PM -0800, Linus Torvalds wrote:
> I added Tejun to the Cc, just because I wanted to give him a heads-up
> that I am tentatively starting to blame him in my dark little mind..

Yeap, keeping watch on the thread and working on a patch to dump more
workqueue info on sysrq (I don't know which sysrq letter to hang it
on yet; most likely it'd get appended to the tasks dump).  For all
three subsystems, the for-3.17 pulls contained quite a few changes.
I've skimmed through the commits but nothing rings a bell - I've
never seen this pattern of failures in any of the three subsystems.
Maybe a subtle percpu bug, or cpuset messing up scheduling somehow?
Anyways, let's see how 47dfe4037 does.
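
(For reference, the existing tasks dump is sysrq-t, i.e.:

    echo t > /proc/sysrq-trigger

so that is presumably where the extra workqueue info would end up.)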

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03  6:02                                                             ` Chris Rorvick
@ 2014-12-03 15:22                                                               ` Linus Torvalds
  2014-12-04  8:43                                                                 ` Dâniel Fraga
  0 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-12-03 15:22 UTC (permalink / raw)
  To: Chris Rorvick
  Cc: Dâniel Fraga, Tejun Heo, Paul E. McKenney,
	Linux Kernel Mailing List

On Tue, Dec 2, 2014 at 10:02 PM, Chris Rorvick <chris@rorvick.com> wrote:
>
>> while a "good" _before_ a subsequent bad is
>> also trustworthy (because if the "good" kernel contained the bug and
>> you should have marked it bad, we'd then go on to test all the commits
>> that were *not* the bug, so we'd never see a "bad" kernel again).
>
> wouldn't marking a bad commit "good" cause you to not see a *good*
> kernel again?  Marking it "good" would seem push the search away from
> the bug toward the current "bad" commit.

Yes, you're right.

The "long series of 'good'" at the end actually implies that the last
'bad' is questionable - just marking a bad kernel as being good should
push us further into 'bad' land, not the other way around. While
marking a 'good' kernel as 'bad' will push us into 'bug hasn't
happaned yet' land.
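
As a concrete (hypothetical) illustration with eight candidate commits and
the bug entering at c5:

    good .. c1 c2 c3 c4 [c5] c6 c7 c8 .. bad

    - falsely mark c5 "good": the remaining suspects are c6..c8, all of
      which contain the bug, so every honest test afterwards reports "bad"
      and bisect fingers the innocent c6.
    - falsely mark c4 "bad": the remaining suspects are c1..c3, none of
      which contain the bug, so every test afterwards reports "good" and
      bisect converges on c4 itself - a long tail of "good"s pointing back
      at a questionable last "bad".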

Which is somewhat odd, because the bad kernels should be easy to spot.
But it could happen if the test got screwed up (by not booting the right
kernel, for example).

Or - and this is the scary part, and one of the huge downsides of 'git
bisect' - it just ends up meaning that the bug comes and goes and is
not quite repeatable enough.

Anyway, Dâniel, if you restart the bisection today, start it one
kernel earlier: re-test the last 'bad' kernel too. So start with
reconfirming that the c9b88e958182 kernel was bad (that *might* be as
easy as just checking your old kernel boot logs, and verifying that
"yes, I really booted it, and yes, it clearly hung and I had to
hard-reboot into it")

                   Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-02 17:14                                           ` Chris Mason
@ 2014-12-03 18:41                                             ` Dave Jones
  2014-12-03 18:45                                               ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-12-03 18:41 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Mike Galbraith, Ingo Molnar, Peter Zijlstra,
	Dâniel Fraga, Sasha Levin, Paul E. McKenney,
	Linux Kernel Mailing List

On Tue, Dec 02, 2014 at 12:14:53PM -0500, Chris Mason wrote:
 > On Tue, Dec 2, 2014 at 11:33 AM, Linus Torvalds 
 > <torvalds@linux-foundation.org> wrote:
 > > On Tue, Dec 2, 2014 at 6:13 AM, Mike Galbraith 
 > > <umgwanakikbuti@gmail.com> wrote:
 > > 
 > > At the same time, the whole "incapacitated by the rt throttle long
 > > enough for the hard lockup detector to trigger" commentary about that
 > > skip_clock_update issue does make me go "Hmmm..". It would certainly
 > > explain Dave's incomprehensible watchdog messages..
 > 
 > Dave's first email mentioned that he had panic on softlockup enabled, 
 > but even with that off the box wasn't recovering.

Not sure if I mentioned it in an earlier post, but when I'm local to the machine
I've disabled reboot-on-lockup; but yes, the problem case is the
situation where it actually does lock up afterwards.
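
(The reboot-on-lockup setup being toggled here is presumably the standard
sysctl pair; values illustrative:

    kernel.softlockup_panic = 1   # panic when the soft-lockup detector fires
    kernel.panic = 10             # then auto-reboot 10 seconds after a panic

set at runtime with e.g. "sysctl -w kernel.softlockup_panic=1".)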

 > In my trinity runs here, I've gotten softlockup warnings where the box 
 > eventually recovered.  I'm wondering if some of the "bad" commits in 
 > the bisection are really false positives where the box would have been 
 > able to recover if we'd killed off all the trinity procs and given it 
 > time to breathe.

So I've done multiple runs against 3.17-rc1 during bisecting, and hit
the case you describe, where I get a dump like below, and then it
eventually recovers. (Trinity then exits because the taint flag
changes).

I've been stuck on this kernel for a few days now trying to prove it
good/bad one way or the other, and I'm leaning towards good, given
that it recovers, even though the traces look similar.

	Dave


[ 9862.915562] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c29:13237]
[ 9862.915684] Modules linked in: 8021q garp stp tun fuse bnep hidp llc2 af_key nfnetlink can_bcm scsi_transport_iscsi can_raw nfc caif_socket caif af_802154 ieee802154 phonet af_rxrpc rfcomm bluetooth can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep e1000e snd_seq coretemp hwmon x86_pkg_temp_thermal nfsd kvm_intel kvm snd_seq_device snd_pcm snd_timer ptp auth_rpcgss snd shpchp oid_registry crct10dif_pclmul crc32c_intel ghash_clmulni_intel usb_debug soundcore pps_core nfs_acl microcode serio_raw pcspkr lockd sunrpc
[ 9862.915987] CPU: 0 PID: 13237 Comm: trinity-c29 Not tainted 3.17.0-rc1+ #112
[ 9862.916046] task: ffff88022657dbc0 ti: ffff8800962b0000 task.ti: ffff8800962b0000
[ 9862.916071] RIP: 0010:[<ffffffff81042569>]  [<ffffffff81042569>] lookup_address_in_pgd+0x89/0xe0
[ 9862.916103] RSP: 0018:ffff8800962b36a8  EFLAGS: 00000202
[ 9862.917024] RAX: ffff88024da748d0 RBX: ffffffff81164c63 RCX: 0000000000000001
[ 9862.917956] RDX: ffff8800962b3740 RSI: ffff8801a3417000 RDI: 000000024da74000
[ 9862.918891] RBP: ffff8800962b36a8 R08: 00003ffffffff000 R09: ffff880000000000
[ 9862.919828] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff81375a47
[ 9862.920758] R13: ffff8800962b3618 R14: ffff8802441d81f0 R15: fffff70c86134602
[ 9862.921681] FS:  00007f06c569f740(0000) GS:ffff880244000000(0000) knlGS:0000000000000000
[ 9862.922603] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9862.923526] CR2: 0000000002378590 CR3: 00000000a1ba3000 CR4: 00000000001407f0
[ 9862.924459] DR0: 00007f40a579e000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9862.925386] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 9862.926342] Stack:
[ 9862.927255]  ffff8800962b36b8 ffffffff810425e8 ffff8800962b36c8 ffffffff810426ab
[ 9862.928188]  ffff8800962b37c0 ffffffff810427a0 ffffffff810bfc5e ffff8800962b36f0
[ 9862.929127]  ffffffff810a89f5 ffff8800962b3768 ffffffff810c19b4 0000000000000002
[ 9862.930072] Call Trace:
[ 9862.931008]  [<ffffffff810425e8>] lookup_address+0x28/0x30
[ 9862.931958]  [<ffffffff810426ab>] _lookup_address_cpa.isra.9+0x3b/0x40
[ 9862.932913]  [<ffffffff810427a0>] __change_page_attr_set_clr+0xf0/0xab0
[ 9862.933869]  [<ffffffff810bfc5e>] ? put_lock_stats.isra.23+0xe/0x30
[ 9862.934831]  [<ffffffff810a89f5>] ? local_clock+0x25/0x30
[ 9862.935827]  [<ffffffff810c19b4>] ? __lock_acquire.isra.31+0x264/0xa60
[ 9862.936798]  [<ffffffff8109bfed>] ? finish_task_switch+0x7d/0x120
[ 9862.937765]  [<ffffffff810bfc5e>] ? put_lock_stats.isra.23+0xe/0x30
[ 9862.938730]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[ 9862.939698]  [<ffffffff8104418b>] kernel_map_pages+0x7b/0x120
[ 9862.940653]  [<ffffffff81178517>] get_page_from_freelist+0x497/0xaa0
[ 9862.941597]  [<ffffffff81179498>] __alloc_pages_nodemask+0x228/0xb20
[ 9862.942539]  [<ffffffff810a89f5>] ? local_clock+0x25/0x30
[ 9862.943469]  [<ffffffff810c19b4>] ? __lock_acquire.isra.31+0x264/0xa60
[ 9862.944411]  [<ffffffff8135fe50>] ? __radix_tree_preload+0x60/0xf0
[ 9862.945357]  [<ffffffff8135fe50>] ? __radix_tree_preload+0x60/0xf0
[ 9862.946326]  [<ffffffff811c14c1>] alloc_pages_vma+0xf1/0x1b0
[ 9862.947263]  [<ffffffff8118766e>] ? shmem_alloc_page+0x6e/0xc0
[ 9862.948205]  [<ffffffff8118766e>] shmem_alloc_page+0x6e/0xc0
[ 9862.949148]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[ 9862.950090]  [<ffffffff810a2c5d>] ? get_parent_ip+0xd/0x50
[ 9862.951012]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[ 9862.951913]  [<ffffffff81382666>] ? __percpu_counter_add+0x86/0xb0
[ 9862.952811]  [<ffffffff811a4362>] ? __vm_enough_memory+0x62/0x1c0
[ 9862.953700]  [<ffffffff812eb0c7>] ? cap_vm_enough_memory+0x47/0x50
[ 9862.954591]  [<ffffffff81189f00>[ 9893.337880] [sched_delayed] sched: RT throttling activated
[ 9918.893057] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 24s! [swapper/1:0]
[ 9918.894352] Modules linked in: 8021q garp stp tun fuse bnep hidp llc2 af_key nfnetlink can_bcm scsi_transport_iscsi can_raw nfc caif_socket caif af_802154 ieee802154 phonet af_rxrpc rfcomm bluetooth can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep e1000e snd_seq coretemp hwmon x86_pkg_temp_thermal nfsd kvm_intel kvm snd_seq_device snd_pcm snd_timer ptp auth_rpcgss snd shpchp oid_registry crct10dif_pclmul crc32c_intel ghash_clmulni_intel usb_debug soundcore pps_core nfs_acl microcode serio_raw pcspkr lockd sunrpc
[ 9918.901158] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G             L 3.17.0-rc1+ #112
[ 9918.903863] task: ffff880242b716f0 ti: ffff88024240c000 task.ti: ffff88024240c000
[ 9918.905218] RIP: 0010:[<ffffffff81645849>]  [<ffffffff81645849>] cpuidle_enter_state+0x79/0x1c0
[ 9918.906591] RSP: 0000:ffff88024240fe60  EFLAGS: 00000246
[ 9918.907933] RAX: 0000000000000000 RBX: ffff880242b716f0 RCX: 0000000000000019
[ 9918.909264] RDX: 20c49ba5e353f7cf RSI: 000000000003cea4 RDI: 00536da522b45eb6
[ 9918.910594] RBP: ffff88024240fe98 R08: 000000008baf9f86 R09: 0000000000000000
[ 9918.911916] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88024240fdf0
[ 9918.913243] R13: ffffffff810bfc5e R14: ffff88024240fdd0 R15: 00000000000001e1
[ 9918.914523] FS:  0000000000000000(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
[ 9918.915815] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9918.917108] CR2: 000000000044bfa0 CR3: 0000000001c11000 CR4: 00000000001407e0
[ 9918.918414] DR0: 00007f40a579e000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9918.919710] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 9918.920998] Stack:
[ 9918.922273]  00000906c89bd5bb ffffffff81cae7f0 ffffffff81d1d290 ffffe8ffff004da8
[ 9918.923576]  ffff88024240c000 ffffffff81cae620 ffff88024240c000 ffff88024240fea8
[ 9918.924845]  ffffffff81645a47 ffff88024240ff10 ffffffff810b9fb4 ffff88024240ffd8
[ 9918.926108] Call Trace:
[ 9918.927351]  [<ffffffff81645a47>] cpuidle_enter+0x17/0x20
[ 9918.928602]  [<ffffffff810b9fb4>] cpu_startup_entry+0x384/0x410
[ 9918.929850]  [<ffffffff8102ff37>] start_secondary+0x237/0x340
[ 9918.931086] Code: d0 48 89 df ff 50 48 41 89 c5 e8 b3 5c aa ff 44 8b 63 04 49 89 c7 0f 1f 44 00 00 e8 a2 19 b0 ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 2b 7d c8 4c 89 f8 49 c1 ff 3f 48 f7 ea b8 ff ff ff 7f 48 c1 
[ 9918.933755] sending NMI to other CPUs:
[ 9918.935008] NMI backtrace for cpu 2
[ 9918.936208] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G             L 3.17.0-rc1+ #112
[ 9918.938592] task: ffff880242b744d0 ti: ffff880242414000 task.ti: ffff880242414000
[ 9918.939793] RIP: 0010:[<ffffffff813c9e65>]  [<ffffffff813c9e65>] intel_idle+0xd5/0x180
[ 9918.941003] RSP: 0018:ffff880242417e20  EFLAGS: 00000046
[ 9918.942191] RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
[ 9918.943369] RDX: 0000000000000000 RSI: ffff880242417fd8 RDI: 0000000000000002
[ 9918.944535] RBP: ffff880242417e50 R08: 000000008baf9f86 R09: 0000000000000000
[ 9918.945695] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
[ 9918.946856] R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242414000
[ 9918.948013] FS:  0000000000000000(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
[ 9918.949175] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9918.950339] CR2: 000000000068f760 CR3: 0000000001c11000 CR4: 00000000001407e0
[ 9918.951515] DR0: 00007f40a579e000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9918.952678] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 9918.953840] Stack:
[ 9918.954988]  0000000242414000 d7251be6f43cb9e7 ffffe8ffff204da8 0000000000000005
[ 9918.956164]  ffffffff81cae620 0000000000000002 ffff880242417e98 ffffffff81645825
[ 9918.957345]  00000906cb95d0a1 ffffffff81cae7f0 ffffffff81d1d290 ffffe8ffff204da8
[ 9918.958527] Call Trace:
[ 9918.959692]  [<ffffffff81645825>] cpuidle_enter_state+0x55/0x1c0
[ 9918.960866]  [<ffffffff81645a47>] cpuidle_enter+0x17/0x20
[ 9918.962029]  [<ffffffff810b9fb4>] cpu_startup_entry+0x384/0x410
[ 9918.963185]  [<ffffffff8102ff37>] start_secondary+0x237/0x340
[ 9918.964339] Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48 
[ 9918.966852] NMI backtrace for cpu 3
[ 9918.968059] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G             L 3.17.0-rc1+ #112
[ 9918.970450] task: ffff880242b72de0 ti: ffff880242418000 task.ti: ffff880242418000
[ 9918.971652] RIP: 0010:[<ffffffff813c9e65>]  [<ffffffff813c9e65>] intel_idle+0xd5/0x180
[ 9918.972862] RSP: 0018:ffff88024241be20  EFLAGS: 00000046
[ 9918.974045] RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
[ 9918.975212] RDX: 0000000000000000 RSI: ffff88024241bfd8 RDI: 0000000000000003
[ 9918.976349] RBP: ffff88024241be50 R08: 000000008baf9f86 R09: 0000000000000000
[ 9918.977463] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
[ 9918.978550] R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242418000
[ 9918.979628] FS:  0000000000000000(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
[ 9918.980695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9918.981739] CR2: 00007f37d766f050 CR3: 0000000001c11000 CR4: 00000000001407e0
[ 9918.982786] DR0: 00007f40a579e000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9918.983821] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 9918.984849] Stack:
[ 9918.985866]  0000000342418000 63d0570b22f343d2 ffffe8ffff404da8 0000000000000005
[ 9918.986911]  ffffffff81cae620 0000000000000003 ffff88024241be98 ffffffff81645825
[ 9918.987961]  00000906cb9636ab ffffffff81cae7f0 ffffffff81d1d290 ffffe8ffff404da8
[ 9918.989007] Call Trace:
[ 9918.990036]  [<ffffffff81645825>] cpuidle_enter_state+0x55/0x1c0
[ 9918.991072]  [<ffffffff81645a47>] cpuidle_enter+0x17/0x20
[ 9918.992097]  [<ffffffff810b9fb4>] cpu_startup_entry+0x384/0x410
[ 9918.993127]  [<ffffffff8102ff37>] start_secondary+0x237/0x340
[ 9918.994152] Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03 18:41                                             ` Dave Jones
@ 2014-12-03 18:45                                               ` Linus Torvalds
  2014-12-03 19:00                                                 ` Dave Jones
                                                                   ` (2 more replies)
  0 siblings, 3 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-12-03 18:45 UTC (permalink / raw)
  To: Dave Jones, Chris Mason, Linus Torvalds, Mike Galbraith,
	Ingo Molnar, Peter Zijlstra, Dâniel Fraga, Sasha Levin,
	Paul E. McKenney, Linux Kernel Mailing List

On Wed, Dec 3, 2014 at 10:41 AM, Dave Jones <davej@redhat.com> wrote:
>
> I've been stuck on this kernel for a few days now trying to prove it
> good/bad one way or the other, and I'm leaning towards good, given
> that it recovers, even though the traces look similar.

Ugh. But this does *not* happen with 3.16, right? Even the non-fatal case?

If so, I'd be inclined to call it "bad". But there might well be two
bugs: one that makes that NMI watchdog trigger, and another one that
then makes it be a hard lockup. I'd think it would be good to figure
out the "NMI watchdog starts triggering" one first, though.

                 Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03 18:45                                               ` Linus Torvalds
@ 2014-12-03 19:00                                                 ` Dave Jones
  2014-12-03 19:25                                                   ` Linus Torvalds
  2014-12-03 19:59                                                   ` Chris Mason
  2014-12-04  0:27                                                 ` Dave Jones
  2014-12-05 17:15                                                 ` Dave Jones
  2 siblings, 2 replies; 486+ messages in thread
From: Dave Jones @ 2014-12-03 19:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, Mike Galbraith, Ingo Molnar, Peter Zijlstra,
	Dâniel Fraga, Sasha Levin, Paul E. McKenney,
	Linux Kernel Mailing List

On Wed, Dec 03, 2014 at 10:45:57AM -0800, Linus Torvalds wrote:
 > On Wed, Dec 3, 2014 at 10:41 AM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > I've been stuck on this kernel for a few days now trying to prove it
 > > good/bad one way or the other, and I'm leaning towards good, given
 > > that it recovers, even though the traces look similar.
 > 
 > Ugh. But this does *not* happen with 3.16, right? Even the non-fatal case?

Correct. At least not in any of the runs I did to date.

 > If so, I'd be inclined to call it "bad". But there might well be two
 > bugs: one that makes that NMI watchdog trigger, and another one that
 > then makes it be a hard lockup. I'd think it would be good to figure
 > out the "NMI watchdog starts triggering" one first, though.

I think you're right.

So right after sending my last mail, I rebooted, and restarted the run
on the same kernel again.

As I was writing this mail, this happened.

[  524.420897] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c178:20182]

and that's all that made it over the console. I couldn't log in via ssh,
and thought "ah-ha, so it IS bad".  I walked over to reboot it, and
found I could actually log in on the console. Check out this dmesg:

[  503.683055] Clocksource tsc unstable (delta = -95946009388 ns)
[  503.692038] Switched to clocksource hpet
[  524.420897] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c178:20182]
[  524.420972] Modules linked in: fuse tun rfcomm llc2 af_key nfnetlink scsi_transport_iscsi can_bcm bnep can_raw nfc caif_socket caif af_802154 ieee802154 phonet af_rxrpc bluetooth can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul crc32c_intel ghash_clmulni_intel e1000e snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer microcode snd serio_raw pcspkr usb_debug ptp pps_core shpchp soundcore
[  524.421288] CPU: 0 PID: 20182 Comm: trinity-c178 Not tainted 3.17.0-rc1+ #112
[  524.421351] task: ffff8801cd63c4d0 ti: ffff8801d2138000 task.ti: ffff8801d2138000
[  524.421377] RIP: 0010:[<ffffffff8136968d>]  [<ffffffff8136968d>] copy_user_handle_tail+0x6d/0x90
[  524.421411] RSP: 0018:ffff8801d213bf00  EFLAGS: 00000202
[  524.421430] RAX: 000000000007a8d9 RBX: ffffffff817b2c64 RCX: 0000000000000000
[  524.421455] RDX: 0000000000056ddc RSI: ffff88023412baf5 RDI: ffff88023412baf4
[  524.421480] RBP: ffff8801d213bf00 R08: 0000000000000000 R09: 0000000000000000
[  524.421504] R10: 0000000000000100 R11: 0000000000000000 R12: ffffffff817b92d0
[  524.421528] R13: 00007f6a24fe0000 R14: ffff8801d2138000 R15: 0000000000000001
[  524.421552] FS:  00007f6a24fd0740(0000) GS:ffff880244000000(0000) knlGS:0000000000000000
[  524.421579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  524.421600] CR2: 00007f6a24fe0000 CR3: 00000002053c8000 CR4: 00000000001407f0
[  524.421624] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  524.421648] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  524.421672] Stack:
[  524.421683]  ffff8801d213bf78 ffffffff812e4a85 00007f6a2452f068 0000000000000000
[  524.421716]  ffffffff3fffffff 00000000000003e8 00007f6a2452f000 00000000000000f8
[  524.421748]  0000000000000000 00000000a417dc9d 00000000000000f8 00007f6a2452f000
[  524.421781] Call Trace:
[  524.422754]  [<ffffffff812e4a85>] SyS_add_key+0xd5/0x240
[  524.423736]  [<ffffffff817b2264>] tracesys+0xdd/0xe2
[  524.424713] Code: c0 74 d3 85 d2 89 d0 74 39 85 c9 74 35 45 31 c0 eb 0c 0f 1f 40 00 83 ea 01 74 17 48 89 f7 48 8d 77 01 44 89 c1 0f 1f 00 c6 07 00 <0f> 1f 00 85 c9 74 e4 0f 1f 00 5d c3 0f 1f 80 00 00 00 00 31 c0 
[  524.426861] sending NMI to other CPUs:
[  524.427867] NMI backtrace for cpu 3
[  524.428868] CPU: 3 PID: 20165 Comm: trinity-c161 Not tainted 3.17.0-rc1+ #112
[  524.430914] task: ffff8801fe67dbc0 ti: ffff8801fe70c000 task.ti: ffff8801fe70c000
[  524.431951] RIP: 0010:[<ffffffff810f99da>]  [<ffffffff810f99da>] generic_exec_single+0xea/0x1a0
[  524.433004] RSP: 0018:ffff8801fe70fc40  EFLAGS: 00000202
[  524.434051] RAX: 0000000000000000 RBX: ffff8801fe70fc40 RCX: 0000000000000038
[  524.435108] RDX: 00000000000000ff RSI: 0000000000000008 RDI: 0000000000000000
[  524.436165] RBP: ffff8801fe70fc90 R08: ffff880242bfa3f0 R09: 0000000000000000
[  524.437221] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  524.438278] R13: 0000000000000001 R14: ffff880238ef1290 R15: ffffffff8115fd30
[  524.439339] FS:  00007f6a24fd0740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
[  524.440415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  524.441494] CR2: 00007f6a23446001 CR3: 00000001cd627000 CR4: 00000000001407e0
[  524.442579] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  524.443665] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  524.444749] Stack:
[  524.445825]  0000000000000000 ffffffff8115fd30 ffff880238ef1290 0000000000000003
[  524.446926]  00000000af3b5f31 00000000ffffffff 0000000000000000 ffffffff8115fd30
[  524.448028]  ffff880238ef1290 0000000000000001 ffff8801fe70fcd0 ffffffff810f9b5a
[  524.449115] Call Trace:
[  524.450179]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  524.451259]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  524.452334]  [<ffffffff810f9b5a>] smp_call_function_single+0x6a/0xe0
[  524.453411]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  524.454483]  [<ffffffff8116036a>] perf_event_read+0xca/0xd0
[  524.455549]  [<ffffffff81160400>] perf_event_read_value+0x90/0xe0
[  524.456618]  [<ffffffff81161b1e>] perf_read+0x20e/0x360
[  524.457689]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  524.458761]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  524.459827]  [<ffffffff811dfdd3>] do_loop_readv_writev+0x63/0x90
[  524.460887]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  524.461944]  [<ffffffff811e1c97>] do_readv_writev+0x267/0x280
[  524.462996]  [<ffffffff81375a47>] ? debug_smp_processor_id+0x17/0x20
[  524.464029]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  524.465040]  [<ffffffff810a2c5d>] ? get_parent_ip+0xd/0x50
[  524.466027]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  524.466995]  [<ffffffff817b1377>] ? _raw_spin_unlock_irq+0x37/0x60
[  524.467940]  [<ffffffff810e672a>] ? do_setitimer+0x1ca/0x250
[  524.468874]  [<ffffffff811e1ce9>] vfs_readv+0x39/0x50
[  524.469791]  [<ffffffff811e1dac>] SyS_readv+0x5c/0x100
[  524.470689]  [<ffffffff817b2264>] tracesys+0xdd/0xe2
[  524.471568] Code: 00 4c 1d 00 48 89 de 48 03 14 c5 e0 af d1 81 48 89 df e8 5a 47 27 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 <f6> 43 18 01 75 f8 31 c0 48 8b 4d d0 65 48 33 0c 25 28 00 00 00 
[  524.473500] NMI backtrace for cpu 1
[  524.473584] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 45.633 msecs
[  524.475347] CPU: 1 PID: 20241 Comm: trinity-c237 Not tainted 3.17.0-rc1+ #112
[  524.477246] task: ffff8801e32c16f0 ti: ffff88008696c000 task.ti: ffff88008696c000
[  524.478212] RIP: 0010:[<ffffffff810f99d8>]  [<ffffffff810f99d8>] generic_exec_single+0xe8/0x1a0
[  524.479189] RSP: 0018:ffff88008696fc40  EFLAGS: 00000202
[  524.480152] RAX: ffff880062d6bc00 RBX: ffff88008696fc40 RCX: ffff880062d6bc40
[  524.481130] RDX: ffff8802441d4c00 RSI: ffff88008696fc40 RDI: ffff88008696fc40
[  524.482108] RBP: ffff88008696fc90 R08: 0000000000000001 R09: 0000000000000001
[  524.483084] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  524.484051] R13: 0000000000000001 R14: ffff880238ef1bd8 R15: ffffffff8115fd30
[  524.485022] FS:  00007f6a24fd0740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
[  524.486001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  524.486975] CR2: 0000000000000000 CR3: 00000000869c1000 CR4: 00000000001407e0
[  524.487952] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  524.488935] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  524.489891] Stack:
[  524.490817]  ffff880062d6bc40 ffffffff8115fd30 ffff880238ef1bd8 0000000000000003
[  524.491770]  00000000d88cff08 00000000ffffffff 0000000000000000 ffffffff8115fd30
[  524.492729]  ffff880238ef1bd8 0000000000000001 ffff88008696fcd0 ffffffff810f9b5a
[  524.493684] Call Trace:
[  524.494611]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  524.495530]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  524.496433]  [<ffffffff810f9b5a>] smp_call_function_single+0x6a/0xe0
[  524.497334]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  524.498227]  [<ffffffff8116036a>] perf_event_read+0xca/0xd0
[  524.499117]  [<ffffffff81160400>] perf_event_read_value+0x90/0xe0
[  524.500007]  [<ffffffff81161b1e>] perf_read+0x20e/0x360
[  524.500895]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  524.501786]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  524.502668]  [<ffffffff811dfdd3>] do_loop_readv_writev+0x63/0x90
[  524.503548]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  524.504419]  [<ffffffff811e1c97>] do_readv_writev+0x267/0x280
[  524.505295]  [<ffffffff81375a47>] ? debug_smp_processor_id+0x17/0x20
[  524.506172]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  524.507045]  [<ffffffff810a2c5d>] ? get_parent_ip+0xd/0x50
[  524.507911]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  524.508774]  [<ffffffff817b1377>] ? _raw_spin_unlock_irq+0x37/0x60
[  524.509634]  [<ffffffff810e672a>] ? do_setitimer+0x1ca/0x250
[  524.510488]  [<ffffffff811e1ce9>] vfs_readv+0x39/0x50
[  524.511330]  [<ffffffff811e1dac>] SyS_readv+0x5c/0x100
[  524.512180]  [<ffffffff817b2264>] tracesys+0xdd/0xe2
[  524.513026] Code: c7 c2 00 4c 1d 00 48 89 de 48 03 14 c5 e0 af d1 81 48 89 df e8 5a 47 27 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 <f3> 90 f6 43 18 01 75 f8 31 c0 48 8b 4d d0 65 48 33 0c 25 28 00 
[  524.514900] NMI backtrace for cpu 2
[  524.514903] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 87.033 msecs
[  524.516709] CPU: 2 PID: 20160 Comm: trinity-c156 Not tainted 3.17.0-rc1+ #112
[  524.518568] task: ffff88006b945bc0 ti: ffff880062d68000 task.ti: ffff880062d68000
[  524.519521] RIP: 0010:[<ffffffff810f99de>]  [<ffffffff810f99de>] generic_exec_single+0xee/0x1a0
[  524.520488] RSP: 0018:ffff880062d6bc40  EFLAGS: 00000202
[  524.521456] RAX: ffff8801fe70fc00 RBX: ffff880062d6bc40 RCX: ffff8801fe70fc40
[  524.522424] RDX: ffff8802441d4c00 RSI: ffff880062d6bc40 RDI: ffff880062d6bc40
[  524.523394] RBP: ffff880062d6bc90 R08: 0000000000000001 R09: 0000000000000001
[  524.524360] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  524.525325] R13: 0000000000000001 R14: ffff880238ef5cd0 R15: ffffffff8115fd30
[  524.526293] FS:  00007f6a24fd0740(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
[  524.527269] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  524.528248] CR2: 00000000019ca288 CR3: 000000007a7bb000 CR4: 00000000001407e0
[  524.529238] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  524.530230] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  524.531200] Stack:
[  524.532141]  ffff8801fe70fc40 ffffffff8115fd30 ffff880238ef5cd0 0000000000000003
[  524.533104]  000000008b5d668d 00000000ffffffff 0000000000000000 ffffffff8115fd30
[  524.534065]  ffff880238ef5cd0 0000000000000001 ffff880062d6bcd0 ffffffff810f9b5a
[  524.535014] Call Trace:
[  524.535938]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  524.536854]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  524.537756]  [<ffffffff810f9b5a>] smp_call_function_single+0x6a/0xe0
[  524.538648]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  524.539536]  [<ffffffff8116036a>] perf_event_read+0xca/0xd0
[  524.540425]  [<ffffffff81160400>] perf_event_read_value+0x90/0xe0
[  524.541317]  [<ffffffff81161b1e>] perf_read+0x20e/0x360
[  524.542202]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  524.543089]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  524.543967]  [<ffffffff811dfdd3>] do_loop_readv_writev+0x63/0x90
[  524.544840]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  524.545709]  [<ffffffff811e1c97>] do_readv_writev+0x267/0x280
[  524.546581]  [<ffffffff81375a47>] ? debug_smp_processor_id+0x17/0x20
[  524.547455]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  524.548327]  [<ffffffff810a2c5d>] ? get_parent_ip+0xd/0x50
[  524.549188]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  524.550046]  [<ffffffff817b1377>] ? _raw_spin_unlock_irq+0x37/0x60
[  524.550905]  [<ffffffff810e672a>] ? do_setitimer+0x1ca/0x250
[  524.551755]  [<ffffffff811e1ce9>] vfs_readv+0x39/0x50
[  524.552600]  [<ffffffff811e1dac>] SyS_readv+0x5c/0x100
[  524.553444]  [<ffffffff817b2264>] tracesys+0xdd/0xe2
[  524.554281] Code: 48 89 de 48 03 14 c5 e0 af d1 81 48 89 df e8 5a 47 27 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 f6 43 18 01 <75> f8 31 c0 48 8b 4d d0 65 48 33 0c 25 28 00 00 00 0f 85 8e 00 
[  524.556148] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 128.279 msecs
[  548.406844] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c178:20182]
[  548.407766] Modules linked in: fuse tun rfcomm llc2 af_key nfnetlink scsi_transport_iscsi can_bcm bnep can_raw nfc caif_socket caif af_802154 ieee802154 phonet af_rxrpc bluetooth can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul crc32c_intel ghash_clmulni_intel e1000e snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer microcode snd serio_raw pcspkr usb_debug ptp pps_core shpchp soundcore
[  548.412860] CPU: 0 PID: 20182 Comm: trinity-c178 Tainted: G             L 3.17.0-rc1+ #112
[  548.414955] task: ffff8801cd63c4d0 ti: ffff8801d2138000 task.ti: ffff8801d2138000
[  548.416020] RIP: 0010:[<ffffffff8136968d>]  [<ffffffff8136968d>] copy_user_handle_tail+0x6d/0x90
[  548.417110] RSP: 0018:ffff8801d213bf00  EFLAGS: 00000202
[  548.418188] RAX: 000000000007a8d9 RBX: ffffffff817b2c64 RCX: 0000000000000000
[  548.419278] RDX: 0000000000056ddc RSI: ffff88023412baf5 RDI: ffff88023412baf4
[  548.420372] RBP: ffff8801d213bf00 R08: 0000000000000000 R09: 0000000000000000
[  548.421473] R10: 0000000000000100 R11: 0000000000000000 R12: ffffffff817b92d0
[  548.422576] R13: 00007f6a24fe0000 R14: ffff8801d2138000 R15: 0000000000000001
[  548.423662] FS:  00007f6a24fd0740(0000) GS:ffff880244000000(0000) knlGS:0000000000000000
[  548.424731] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  548.425801] CR2: 00007f6a24fe0000 CR3: 00000002053c8000 CR4: 00000000001407f0
[  548.426870] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  548.427923] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  548.428956] Stack:
[  548.429968]  ffff8801d213bf78 ffffffff812e4a85 00007f6a2452f068 0000000000000000
[  548.430992]  ffffffff3fffffff 00000000000003e8 00007f6a2452f000 00000000000000f8
[  548.432009]  0000000000000000 00000000a417dc9d 00000000000000f8 00007f6a2452f000
[  548.433023] Call Trace:
[  548.434031]  [<ffffffff812e4a85>] SyS_add_key+0xd5/0x240
[  548.435046]  [<ffffffff817b2264>] tracesys+0xdd/0xe2
[  548.436057] Code: c0 74 d3 85 d2 89 d0 74 39 85 c9 74 35 45 31 c0 eb 0c 0f 1f 40 00 83 ea 01 74 17 48 89 f7 48 8d 77 01 44 89 c1 0f 1f 00 c6 07 00 <0f> 1f 00 85 c9 74 e4 0f 1f 00 5d c3 0f 1f 80 00 00 00 00 31 c0 
[  548.438271] sending NMI to other CPUs:
[  548.439311] NMI backtrace for cpu 3
[  548.440341] CPU: 3 PID: 20165 Comm: trinity-c161 Tainted: G             L 3.17.0-rc1+ #112
[  548.442472] task: ffff8801fe67dbc0 ti: ffff8801fe70c000 task.ti: ffff8801fe70c000
[  548.443553] RIP: 0010:[<ffffffff810f99de>]  [<ffffffff810f99de>] generic_exec_single+0xee/0x1a0
[  548.444639] RSP: 0018:ffff8801fe70fc40  EFLAGS: 00000202
[  548.445718] RAX: 0000000000000000 RBX: ffff8801fe70fc40 RCX: 0000000000000038
[  548.446803] RDX: 00000000000000ff RSI: 0000000000000008 RDI: 0000000000000000
[  548.447882] RBP: ffff8801fe70fc90 R08: ffff880242bfa3f0 R09: 0000000000000000
[  548.448957] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  548.450039] R13: 0000000000000001 R14: ffff880238ef1290 R15: ffffffff8115fd30
[  548.451120] FS:  00007f6a24fd0740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
[  548.452211] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  548.453305] CR2: 00007f6a23446001 CR3: 00000001cd627000 CR4: 00000000001407e0
[  548.454411] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  548.455515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  548.456617] Stack:
[  548.457711]  0000000000000000 ffffffff8115fd30 ffff880238ef1290 0000000000000003
[  548.458831]  00000000af3b5f31 00000000ffffffff 0000000000000000 ffffffff8115fd30
[  548.459954]  ffff880238ef1290 0000000000000001 ffff8801fe70fcd0 ffffffff810f9b5a
[  548.461078] Call Trace:
[  548.462187]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  548.463310]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  548.464425]  [<ffffffff810f9b5a>] smp_call_function_single+0x6a/0xe0
[  548.465541]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  548.466656]  [<ffffffff8116036a>] perf_event_read+0xca/0xd0
[  548.467772]  [<ffffffff81160400>] perf_event_read_value+0x90/0xe0
[  548.468887]  [<ffffffff81161b1e>] perf_read+0x20e/0x360
[  548.470003]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  548.471127]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  548.472230]  [<ffffffff811dfdd3>] do_loop_readv_writev+0x63/0x90
[  548.473310]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  548.474375]  [<ffffffff811e1c97>] do_readv_writev+0x267/0x280
[  548.475432]  [<ffffffff81375a47>] ? debug_smp_processor_id+0x17/0x20
[  548.476470]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  548.477486]  [<ffffffff810a2c5d>] ? get_parent_ip+0xd/0x50
[  548.478479]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  548.479452]  [<ffffffff817b1377>] ? _raw_spin_unlock_irq+0x37/0x60
[  548.480402]  [<ffffffff810e672a>] ? do_setitimer+0x1ca/0x250
[  548.481346]  [<ffffffff811e1ce9>] vfs_readv+0x39/0x50
[  548.482266]  [<ffffffff811e1dac>] SyS_readv+0x5c/0x100
[  548.483164]  [<ffffffff817b2264>] tracesys+0xdd/0xe2
[  548.484050] Code: 48 89 de 48 03 14 c5 e0 af d1 81 48 89 df e8 5a 47 27 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 f6 43 18 01 <75> f8 31 c0 48 8b 4d d0 65 48 33 0c 25 28 00 00 00 0f 85 8e 00 
[  548.485991] NMI backtrace for cpu 2
[  548.486903] CPU: 2 PID: 20160 Comm: trinity-c156 Tainted: G             L 3.17.0-rc1+ #112
[  548.488787] task: ffff88006b945bc0 ti: ffff880062d68000 task.ti: ffff880062d68000
[  548.489743] RIP: 0010:[<ffffffff810f99da>]  [<ffffffff810f99da>] generic_exec_single+0xea/0x1a0
[  548.490709] RSP: 0018:ffff880062d6bc40  EFLAGS: 00000202
[  548.491667] RAX: ffff8801fe70fc00 RBX: ffff880062d6bc40 RCX: ffff8801fe70fc40
[  548.492629] RDX: ffff8802441d4c00 RSI: ffff880062d6bc40 RDI: ffff880062d6bc40
[  548.493590] RBP: ffff880062d6bc90 R08: 0000000000000001 R09: 0000000000000001
[  548.494553] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  548.495511] R13: 0000000000000001 R14: ffff880238ef5cd0 R15: ffffffff8115fd30
[  548.496462] FS:  00007f6a24fd0740(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
[  548.497423] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  548.498382] CR2: 00000000019ca288 CR3: 000000007a7bb000 CR4: 00000000001407e0
[  548.499347] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  548.500305] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  548.501264] Stack:
[  548.502193]  ffff8801fe70fc40 ffffffff8115fd30 ffff880238ef5cd0 0000000000000003
[  548.503131]  000000008b5d668d 00000000ffffffff 0000000000000000 ffffffff8115fd30
[  548.504073]  ffff880238ef5cd0 0000000000000001 ffff880062d6bcd0 ffffffff810f9b5a
[  548.505015] Call Trace:
[  548.505940]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  548.506857]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  548.507753]  [<ffffffff810f9b5a>] smp_call_function_single+0x6a/0xe0
[  548.508639]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  548.509518]  [<ffffffff8116036a>] perf_event_read+0xca/0xd0
[  548.510391]  [<ffffffff81160400>] perf_event_read_value+0x90/0xe0
[  548.511261]  [<ffffffff81161b1e>] perf_read+0x20e/0x360
[  548.512130]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  548.513005]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  548.513877]  [<ffffffff811dfdd3>] do_loop_readv_writev+0x63/0x90
[  548.514740]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  548.515600]  [<ffffffff811e1c97>] do_readv_writev+0x267/0x280
[  548.516451]  [<ffffffff81375a47>] ? debug_smp_processor_id+0x17/0x20
[  548.517313]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  548.518177]  [<ffffffff810a2c5d>] ? get_parent_ip+0xd/0x50
[  548.519031]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  548.519878]  [<ffffffff817b1377>] ? _raw_spin_unlock_irq+0x37/0x60
[  548.520719]  [<ffffffff810e672a>] ? do_setitimer+0x1ca/0x250
[  548.521563]  [<ffffffff811e1ce9>] vfs_readv+0x39/0x50
[  548.522397]  [<ffffffff811e1dac>] SyS_readv+0x5c/0x100
[  548.523224]  [<ffffffff817b2264>] tracesys+0xdd/0xe2
[  548.524052] Code: 00 4c 1d 00 48 89 de 48 03 14 c5 e0 af d1 81 48 89 df e8 5a 47 27 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 <f6> 43 18 01 75 f8 31 c0 48 8b 4d d0 65 48 33 0c 25 28 00 00 00 
[  548.525897] NMI backtrace for cpu 1
[  548.526765] CPU: 1 PID: 20241 Comm: trinity-c237 Tainted: G             L 3.17.0-rc1+ #112
[  548.528575] task: ffff8801e32c16f0 ti: ffff88008696c000 task.ti: ffff88008696c000
[  548.529499] RIP: 0010:[<ffffffff810f99de>]  [<ffffffff810f99de>] generic_exec_single+0xee/0x1a0
[  548.530433] RSP: 0018:ffff88008696fc40  EFLAGS: 00000202
[  548.531363] RAX: ffff880062d6bc00 RBX: ffff88008696fc40 RCX: ffff880062d6bc40
[  548.532305] RDX: ffff8802441d4c00 RSI: ffff88008696fc40 RDI: ffff88008696fc40
[  548.533244] RBP: ffff88008696fc90 R08: 0000000000000001 R09: 0000000000000001
[  548.534181] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  548.535114] R13: 0000000000000001 R14: ffff880238ef1bd8 R15: ffffffff8115fd30
[  548.536048] FS:  00007f6a24fd0740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
[  548.536995] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  548.537936] CR2: 0000000000000000 CR3: 00000000869c1000 CR4: 00000000001407e0
[  548.538888] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  548.539842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  548.540795] Stack:
[  548.541745]  ffff880062d6bc40 ffffffff8115fd30 ffff880238ef1bd8 0000000000000003
[  548.542699]  00000000d88cff08 00000000ffffffff 0000000000000000 ffffffff8115fd30
[  548.543640]  ffff880238ef1bd8 0000000000000001 ffff88008696fcd0 ffffffff810f9b5a
[  548.544574] Call Trace:
[  548.545491]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  548.546404]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  548.547297]  [<ffffffff810f9b5a>] smp_call_function_single+0x6a/0xe0
[  548.548179]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  548.549053]  [<ffffffff8116036a>] perf_event_read+0xca/0xd0
[  548.549920]  [<ffffffff81160400>] perf_event_read_value+0x90/0xe0
[  548.550789]  [<ffffffff81161b1e>] perf_read+0x20e/0x360
[  548.551651]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  548.552520]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  548.553388]  [<ffffffff811dfdd3>] do_loop_readv_writev+0x63/0x90
[  548.554247]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  548.555103]  [<ffffffff811e1c97>] do_readv_writev+0x267/0x280
[  548.555952]  [<ffffffff81375a47>] ? debug_smp_processor_id+0x17/0x20
[  548.556809]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  548.557665]  [<ffffffff810a2c5d>] ? get_parent_ip+0xd/0x50
[  548.558511]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  548.559352]  [<ffffffff817b1377>] ? _raw_spin_unlock_irq+0x37/0x60
[  548.560189]  [<ffffffff810e672a>] ? do_setitimer+0x1ca/0x250
[  548.561021]  [<ffffffff811e1ce9>] vfs_readv+0x39/0x50
[  548.561847]  [<ffffffff811e1dac>] SyS_readv+0x5c/0x100
[  548.562664]  [<ffffffff817b2264>] tracesys+0xdd/0xe2
[  548.563483] Code: 48 89 de 48 03 14 c5 e0 af d1 81 48 89 df e8 5a 47 27 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 f6 43 18 01 <75> f8 31 c0 48 8b 4d d0 65 48 33 0c 25 28 00 00 00 0f 85 8e 00 
[  564.237567] INFO: rcu_preempt self-detected stall on CPU
[  564.238434] 	0: (23495594 ticks this GP) idle=ee5/140000000000001/0 softirq=31067/31067 
[  564.239318] 	 (t=6000 jiffies g=12425 c=12424 q=0)
[  564.240203] Task dump for CPU 0:
[  564.241078] trinity-c178    R  running task    13424 20182  19467 0x10000008
[  564.241975]  ffff8801cd63c4d0 00000000a417dc9d ffff880244003dc8 ffffffff810a4406
[  564.242880]  ffffffff810a4372 0000000000000000 ffffffff81c50240 0000000000000086
[  564.243789]  ffff880244003de0 ffffffff810a83e9 0000000000000001 ffff880244003e10
[  564.244697] Call Trace:
[  564.245597]  <IRQ>  [<ffffffff810a4406>] sched_show_task+0x116/0x180
[  564.246511]  [<ffffffff810a4372>] ? sched_show_task+0x82/0x180
[  564.247420]  [<ffffffff810a83e9>] dump_cpu_task+0x39/0x40
[  564.248337]  [<ffffffff810d6360>] rcu_dump_cpu_stacks+0xa0/0xe0
[  564.249247]  [<ffffffff810ddba3>] rcu_check_callbacks+0x503/0x810
[  564.250155]  [<ffffffff81375a63>] ? __this_cpu_preempt_check+0x13/0x20
[  564.251069]  [<ffffffff810e5c93>] ? hrtimer_run_queues+0x43/0x130
[  564.251985]  [<ffffffff810e43e7>] update_process_times+0x47/0x70
[  564.252902]  [<ffffffff810f4c8a>] tick_sched_timer+0x4a/0x1a0
[  564.253796]  [<ffffffff810e4a71>] ? __run_hrtimer+0x81/0x250
[  564.254671]  [<ffffffff810e4a71>] __run_hrtimer+0x81/0x250
[  564.255541]  [<ffffffff810f4c40>] ? tick_init_highres+0x20/0x20
[  564.256404]  [<ffffffff810e5697>] hrtimer_interrupt+0x107/0x260
[  564.257251]  [<ffffffff81031cc4>] local_apic_timer_interrupt+0x34/0x60
[  564.258086]  [<ffffffff817b4b8f>] smp_apic_timer_interrupt+0x3f/0x60
[  564.258907]  [<ffffffff817b2faf>] apic_timer_interrupt+0x6f/0x80
[  564.259721]  <EOI>  [<ffffffff817b2c64>] ? retint_restore_args+0xe/0xe
[  564.260541]  [<ffffffff8136968d>] ? copy_user_handle_tail+0x6d/0x90
[  564.261360]  [<ffffffff812e4a85>] SyS_add_key+0xd5/0x240
[  564.262176]  [<ffffffff817b2264>] tracesys+0xdd/0xe2
[  572.432766] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [trinity-c156:20160]
[  572.433593] Modules linked in: fuse tun rfcomm llc2 af_key nfnetlink scsi_transport_iscsi can_bcm bnep can_raw nfc caif_socket caif af_802154 ieee802154 phonet af_rxrpc bluetooth can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul crc32c_intel ghash_clmulni_intel e1000e snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer microcode snd serio_raw pcspkr usb_debug ptp pps_core shpchp soundcore
[  572.438194] CPU: 2 PID: 20160 Comm: trinity-c156 Tainted: G             L 3.17.0-rc1+ #112
[  572.440070] task: ffff88006b945bc0 ti: ffff880062d68000 task.ti: ffff880062d68000
[  572.441020] RIP: 0010:[<ffffffff810f99de>]  [<ffffffff810f99de>] generic_exec_single+0xee/0x1a0
[  572.441982] RSP: 0018:ffff880062d6bc40  EFLAGS: 00000202
[  572.442935] RAX: ffff8801fe70fc00 RBX: ffffffff817b2c64 RCX: ffff8801fe70fc40
[  572.443898] RDX: ffff8802441d4c00 RSI: ffff880062d6bc40 RDI: ffff880062d6bc40
[  572.444868] RBP: ffff880062d6bc90 R08: 0000000000000001 R09: 0000000000000001
[  572.445836] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880062d6bbb8
[  572.446808] R13: 0000000000406040 R14: ffff880062d68000 R15: ffff88006b945bc0
[  572.447777] FS:  00007f6a24fd0740(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
[  572.448759] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  572.449742] CR2: 00000000019ca288 CR3: 000000007a7bb000 CR4: 00000000001407e0
[  572.450733] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  572.451722] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  572.452710] Stack:
[  572.453692]  ffff8801fe70fc40 ffffffff8115fd30 ffff880238ef5cd0 0000000000000003
[  572.454698]  000000008b5d668d 00000000ffffffff 0000000000000000 ffffffff8115fd30
[  572.455712]  ffff880238ef5cd0 0000000000000001 ffff880062d6bcd0 ffffffff810f9b5a
[  572.456721] Call Trace:
[  572.457721]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  572.458732]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  572.459736]  [<ffffffff810f9b5a>] smp_call_function_single+0x6a/0xe0
[  572.460745]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  572.461748]  [<ffffffff8116036a>] perf_event_read+0xca/0xd0
[  572.462763]  [<ffffffff81160400>] perf_event_read_value+0x90/0xe0
[  572.463776]  [<ffffffff81161b1e>] perf_read+0x20e/0x360
[  572.464765]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  572.465738]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  572.466709]  [<ffffffff811dfdd3>] do_loop_readv_writev+0x63/0x90
[  572.467669]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  572.468627]  [<ffffffff811e1c97>] do_readv_writev+0x267/0x280
[  572.469586]  [<ffffffff81375a47>] ? debug_smp_processor_id+0x17/0x20
[  572.470543]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  572.471493]  [<ffffffff810a2c5d>] ? get_parent_ip+0xd/0x50
[  572.472442]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  572.473386]  [<ffffffff817b1377>] ? _raw_spin_unlock_irq+0x37/0x60
[  572.474322]  [<ffffffff810e672a>] ? do_setitimer+0x1ca/0x250
[  572.475260]  [<ffffffff811e1ce9>] vfs_readv+0x39/0x50
[  572.476186]  [<ffffffff811e1dac>] SyS_readv+0x5c/0x100
[  572.477112]  [<ffffffff817b2264>] tracesys+0xdd/0xe2
[  572.478038] Code: 48 89 de 48 03 14 c5 e0 af d1 81 48 89 df e8 5a 47 27 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 f6 43 18 01 <75> f8 31 c0 48 8b 4d d0 65 48 33 0c 25 28 00 00 00 0f 85 8e 00 
[  572.480077] sending NMI to other CPUs:
[  572.481045] NMI backtrace for cpu 1
[  572.482011] CPU: 1 PID: 20241 Comm: trinity-c237 Tainted: G             L 3.17.0-rc1+ #112
[  572.484001] task: ffff8801e32c16f0 ti: ffff88008696c000 task.ti: ffff88008696c000
[  572.485020] RIP: 0010:[<ffffffff810f99de>]  [<ffffffff810f99de>] generic_exec_single+0xee/0x1a0
[  572.486053] RSP: 0018:ffff88008696fc40  EFLAGS: 00000202
[  572.487084] RAX: ffff880062d6bc00 RBX: ffff88008696fc40 RCX: ffff880062d6bc40
[  572.488124] RDX: ffff8802441d4c00 RSI: ffff88008696fc40 RDI: ffff88008696fc40
[  572.489163] RBP: ffff88008696fc90 R08: 0000000000000001 R09: 0000000000000001
[  572.490199] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  572.491231] R13: 0000000000000001 R14: ffff880238ef1bd8 R15: ffffffff8115fd30
[  572.492262] FS:  00007f6a24fd0740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
[  572.493299] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  572.494343] CR2: 0000000000000000 CR3: 00000000869c1000 CR4: 00000000001407e0
[  572.495380] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  572.496396] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  572.497393] Stack:
[  572.498357]  ffff880062d6bc40 ffffffff8115fd30 ffff880238ef1bd8 0000000000000003
[  572.499325]  00000000d88cff08 00000000ffffffff 0000000000000000 ffffffff8115fd30
[  572.500276]  ffff880238ef1bd8 0000000000000001 ffff88008696fcd0 ffffffff810f9b5a
[  572.501223] Call Trace:
[  572.502140]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  572.503053]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  572.503947]  [<ffffffff810f9b5a>] smp_call_function_single+0x6a/0xe0
[  572.504837]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  572.505720]  [<ffffffff8116036a>] perf_event_read+0xca/0xd0
[  572.506600]  [<ffffffff81160400>] perf_event_read_value+0x90/0xe0
[  572.507477]  [<ffffffff81161b1e>] perf_read+0x20e/0x360
[  572.508350]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  572.509229]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  572.510102]  [<ffffffff811dfdd3>] do_loop_readv_writev+0x63/0x90
[  572.510968]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  572.511828]  [<ffffffff811e1c97>] do_readv_writev+0x267/0x280
[  572.512689]  [<ffffffff81375a47>] ? debug_smp_processor_id+0x17/0x20
[  572.513556]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  572.514413]  [<ffffffff810a2c5d>] ? get_parent_ip+0xd/0x50
[  572.515267]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  572.516114]  [<ffffffff817b1377>] ? _raw_spin_unlock_irq+0x37/0x60
[  572.516958]  [<ffffffff810e672a>] ? do_setitimer+0x1ca/0x250
[  572.517794]  [<ffffffff811e1ce9>] vfs_readv+0x39/0x50
[  572.518627]  [<ffffffff811e1dac>] SyS_readv+0x5c/0x100
[  572.519462]  [<ffffffff817b2264>] tracesys+0xdd/0xe2
[  572.520291] Code: 48 89 de 48 03 14 c5 e0 af d1 81 48 89 df e8 5a 47 27 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 f6 43 18 01 <75> f8 31 c0 48 8b 4d d0 65 48 33 0c 25 28 00 00 00 0f 85 8e 00 
[  572.522141] NMI backtrace for cpu 3
[  572.523018] CPU: 3 PID: 20165 Comm: trinity-c161 Tainted: G             L 3.17.0-rc1+ #112
[  572.524827] task: ffff8801fe67dbc0 ti: ffff8801fe70c000 task.ti: ffff8801fe70c000
[  572.525748] RIP: 0010:[<ffffffff810f99de>]  [<ffffffff810f99de>] generic_exec_single+0xee/0x1a0
[  572.526677] RSP: 0018:ffff8801fe70fc40  EFLAGS: 00000202
[  572.527605] RAX: 0000000000000000 RBX: ffff8801fe70fc40 RCX: 0000000000000038
[  572.528545] RDX: 00000000000000ff RSI: 0000000000000008 RDI: 0000000000000000
[  572.529480] RBP: ffff8801fe70fc90 R08: ffff880242bfa3f0 R09: 0000000000000000
[  572.530416] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  572.531350] R13: 0000000000000001 R14: ffff880238ef1290 R15: ffffffff8115fd30
[  572.532287] FS:  00007f6a24fd0740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
[  572.533230] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  572.534174] CR2: 00007f6a23446001 CR3: 00000001cd627000 CR4: 00000000001407e0
[  572.535133] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  572.536091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  572.537048] Stack:
[  572.537979]  0000000000000000 ffffffff8115fd30 ffff880238ef1290 0000000000000003
[  572.538915]  00000000af3b5f31 00000000ffffffff 0000000000000000 ffffffff8115fd30
[  572.539850]  ffff880238ef1290 0000000000000001 ffff8801fe70fcd0 ffffffff810f9b5a
[  572.540790] Call Trace:
[  572.541711]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  572.542629]  [<ffffffff8115fd30>] ? perf_duration_warn+0x70/0x70
[  572.543522]  [<ffffffff810f9b5a>] smp_call_function_single+0x6a/0xe0
[  572.544408]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  572.545285]  [<ffffffff8116036a>] perf_event_read+0xca/0xd0
[  572.546154]  [<ffffffff81160400>] perf_event_read_value+0x90/0xe0
[  572.547027]  [<ffffffff81161b1e>] perf_read+0x20e/0x360
[  572.547890]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  572.548765]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  572.549637]  [<ffffffff811dfdd3>] do_loop_readv_writev+0x63/0x90
[  572.550497]  [<ffffffff81161910>] ? cpu_clock_event_init+0x40/0x40
[  572.551353]  [<ffffffff811e1c97>] do_readv_writev+0x267/0x280
[  572.552210]  [<ffffffff81375a47>] ? debug_smp_processor_id+0x17/0x20
[  572.553066]  [<ffffffff810bffc6>] ? lock_release_holdtime.part.24+0xe6/0x160
[  572.553924]  [<ffffffff810a2c5d>] ? get_parent_ip+0xd/0x50
[  572.554775]  [<ffffffff810a2dbb>] ? preempt_count_sub+0x6b/0xf0
[  572.555618]  [<ffffffff817b1377>] ? _raw_spin_unlock_irq+0x37/0x60
[  572.556453]  [<ffffffff810e672a>] ? do_setitimer+0x1ca/0x250
[  572.557284]  [<ffffffff811e1ce9>] vfs_readv+0x39/0x50
[  572.558111]  [<ffffffff811e1dac>] SyS_readv+0x5c/0x100
[  572.558934]  [<ffffffff817b2264>] tracesys+0xdd/0xe2
[  572.559756] Code: 48 89 de 48 03 14 c5 e0 af d1 81 48 89 df e8 5a 47 27 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 f6 43 18 01 <75> f8 31 c0 48 8b 4d d0 65 48 33 0c 25 28 00 00 00 0f 85 8e 00 
[  572.561589] NMI backtrace for cpu 0
[  572.562451] CPU: 0 PID: 20182 Comm: trinity-c178 Tainted: G             L 3.17.0-rc1+ #112
[  572.564252] task: ffff8801cd63c4d0 ti: ffff8801d2138000 task.ti: ffff8801d2138000
[  572.565166] RIP: 0010:[<ffffffff8103cc16>]  [<ffffffff8103cc16>] read_hpet+0x16/0x20
[  572.566090] RSP: 0018:ffff880244003e70  EFLAGS: 00000046
[  572.567011] RAX: 00000000e8dd201c RBX: 000000000001cc86 RCX: ffff8802441d1118
[  572.567939] RDX: 0000000000010001 RSI: ffffffff81a86870 RDI: ffffffff81c28680
[  572.568869] RBP: ffff880244003e70 R08: 0000000000000000 R09: 0000000000000000
[  572.569802] R10: 0000000000000000 R11: 0000000000000000 R12: 000000854428057e
[  572.570732] R13: ffff8802441ce060 R14: ffff8802441cda80 R15: 0000008baf2377a5
[  572.571665] FS:  00007f6a24fd0740(0000) GS:ffff880244000000(0000) knlGS:0000000000000000
[  572.572608] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  572.573555] CR2: 00007f6a24fe0000 CR3: 00000002053c8000 CR4: 00000000001407f0
[  572.574508] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  572.575459] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  572.576414] Stack:
[  572.577362]  ffff880244003e98 ffffffff810eb574 ffffffff810f4c63 ffff8801d213be58
[  572.578323]  ffff880244003f40 ffff880244003ec8 ffffffff810f4c63 ffff8802441cda80
[  572.579255]  ffff8802441ce060 ffff880244003f40 ffff8802441cdb08 ffff880244003f08
[  572.580181] Call Trace:
[  572.581092]  <IRQ>
[  572.581988]  [<ffffffff810eb574>] ktime_get+0x94/0x120
[  572.582865]  [<ffffffff810f4c63>] ? tick_sched_timer+0x23/0x1a0
[  572.583735]  [<ffffffff810f4c63>] tick_sched_timer+0x23/0x1a0
[  572.584592]  [<ffffffff810e4a71>] __run_hrtimer+0x81/0x250
[  572.585447]  [<ffffffff810f4c40>] ? tick_init_highres+0x20/0x20
[  572.586297]  [<ffffffff810e5697>] hrtimer_interrupt+0x107/0x260
[  572.587148]  [<ffffffff81031cc4>] local_apic_timer_interrupt+0x34/0x60
[  572.588004]  [<ffffffff817b4b8f>] smp_apic_timer_interrupt+0x3f/0x60
[  572.588859]  [<ffffffff817b2faf>] apic_timer_interrupt+0x6f/0x80
[  572.589705]  <EOI>
[  572.590542]  [<ffffffff817b2c64>] ? retint_restore_args+0xe/0xe
[  572.591379]  [<ffffffff8136968d>] ? copy_user_handle_tail+0x6d/0x90
[  572.592225]  [<ffffffff812e4a85>] SyS_add_key+0xd5/0x240
[  572.593071]  [<ffffffff817b2264>] tracesys+0xdd/0xe2
[  572.593901] Code: 00 29 c7 ba 00 00 00 00 b8 c2 ff ff ff 83 ff 7f 5d 0f 4f c2 c3 0f 1f 44 00 00 55 48 8b 05 a3 0c 0b 01 48 89 e5 8b 80 f0 00 00 00 <89> c0 5d c3 66 0f 1f 44 00 00 0f 1f 44 00 00 8b 0d d9 0b 0b 01 
[  599.566886] [sched_delayed] sched: RT throttling activated
[  599.573324] end_request: I/O error, dev sda, sector 0
[  624.402393] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [swapper/2:0]
[  624.403521] Modules linked in: fuse tun rfcomm llc2 af_key nfnetlink scsi_transport_iscsi can_bcm bnep can_raw nfc caif_socket caif af_802154 ieee802154 phonet af_rxrpc bluetooth can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul crc32c_intel ghash_clmulni_intel e1000e snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer microcode snd serio_raw pcspkr usb_debug ptp pps_core shpchp soundcore
[  624.409477] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G             L 3.17.0-rc1+ #112
[  624.411931] task: ffff880242b744d0 ti: ffff880242414000 task.ti: ffff880242414000
[  624.413170] RIP: 0010:[<ffffffff81645849>]  [<ffffffff81645849>] cpuidle_enter_state+0x79/0x1c0
[  624.414396] RSP: 0018:ffff880242417e60  EFLAGS: 00000246
[  624.415614] RAX: 0000000000000000 RBX: ffff880242b744d0 RCX: 0000000000000019
[  624.416843] RDX: 20c49ba5e353f7cf RSI: 000000000003cd60 RDI: 002e512580cfca6e
[  624.418074] RBP: ffff880242417e98 R08: 000000008bafc4e0 R09: 0000000000000000
[  624.419397] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880242417df0
[  624.420626] R13: ffffffff810bfc5e R14: ffff880242417dd0 R15: 0000000000000210
[  624.421864] FS:  0000000000000000(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
[  624.423132] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  624.424380] CR2: 00007f15cfbaf000 CR3: 0000000001c11000 CR4: 00000000001407e0
[  624.425640] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  624.426904] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  624.428164] Stack:
[  624.429393]  0000009177337316 ffffffff81cae7f0 ffffffff81d1d290 ffffe8ffff204da8
[  624.430630]  ffff880242414000 ffffffff81cae620 ffff880242414000 ffff880242417ea8
[  624.431865]  ffffffff81645a47 ffff880242417f10 ffffffff810b9fb4 ffff880242417fd8
[  624.433085] Call Trace:
[  624.434242]  [<ffffffff81645a47>] cpuidle_enter+0x17/0x20
[  624.435413]  [<ffffffff810b9fb4>] cpu_startup_entry+0x384/0x410
[  624.436588]  [<ffffffff8102ff37>] start_secondary+0x237/0x340
[  624.437761] Code: d0 48 89 df ff 50 48 41 89 c5 e8 b3 5c aa ff 44 8b 63 04 49 89 c7 0f 1f 44 00 00 e8 a2 19 b0 ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 2b 7d c8 4c 89 f8 49 c1 ff 3f 48 f7 ea b8 ff ff ff 7f 48 c1 
[  624.440251] sending NMI to other CPUs:
[  624.441500] NMI backtrace for cpu 1
[  624.442624] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G             L 3.17.0-rc1+ #112
[  624.444912] task: ffff880242b716f0 ti: ffff88024240c000 task.ti: ffff88024240c000
[  624.446070] RIP: 0010:[<ffffffff813c9e65>]  [<ffffffff813c9e65>] intel_idle+0xd5/0x180
[  624.447236] RSP: 0018:ffff88024240fe20  EFLAGS: 00000046
[  624.448397] RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
[  624.449570] RDX: 0000000000000000 RSI: ffff88024240ffd8 RDI: 0000000000000001
[  624.450735] RBP: ffff88024240fe50 R08: 000000008bafc4e0 R09: 0000000000000000
[  624.451894] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
[  624.453053] R13: 0000000000000032 R14: 0000000000000004 R15: ffff88024240c000
[  624.454213] FS:  0000000000000000(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
[  624.455369] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  624.456531] CR2: 0000000000497120 CR3: 0000000001c11000 CR4: 00000000001407e0
[  624.457707] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  624.458881] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  624.460056] Stack:
[  624.461220]  000000014240c000 0d2996f8aba02290 ffffe8ffff004da8 0000000000000005
[  624.462418]  ffffffff81cae620 0000000000000001 ffff88024240fe98 ffffffff81645825
[  624.463622]  000000917994f078 ffffffff81cae7f0 ffffffff81d1d290 ffffe8ffff004da8
[  624.464826] Call Trace:
[  624.466015]  [<ffffffff81645825>] cpuidle_enter_state+0x55/0x1c0
[  624.467220]  [<ffffffff81645a47>] cpuidle_enter+0x17/0x20
[  624.468415]  [<ffffffff810b9fb4>] cpu_startup_entry+0x384/0x410
[  624.469607]  [<ffffffff8102ff37>] start_secondary+0x237/0x340
[  624.470800] Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48 
[  624.473394] NMI backtrace for cpu 0
[  624.474616] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G             L 3.17.0-rc1+ #112
[  624.477061] task: ffffffff81c164c0 ti: ffffffff81c00000 task.ti: ffffffff81c00000
[  624.478299] RIP: 0010:[<ffffffff813c9e65>]  [<ffffffff813c9e65>] intel_idle+0xd5/0x180
[  624.479544] RSP: 0018:ffffffff81c03e68  EFLAGS: 00000046
[  624.480766] RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
[  624.481964] RDX: 0000000000000000 RSI: ffffffff81c03fd8 RDI: 0000000000000000
[  624.483137] RBP: ffffffff81c03e98 R08: 000000008bafc4e0 R09: 0000000000000000
[  624.484284] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
[  624.485403] R13: 0000000000000032 R14: 0000000000000004 R15: ffffffff81c00000
[  624.486516] FS:  0000000000000000(0000) GS:ffff880244000000(0000) knlGS:0000000000000000
[  624.487613] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  624.488690] CR2: 00007fd405db1000 CR3: 0000000001c11000 CR4: 00000000001407f0
[  624.489768] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  624.490837] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  624.491894] Stack:
[  624.492941]  0000000081c00000 b6e98a804d03933a ffffe8fffee04da8 0000000000000005
[  624.494017]  ffffffff81cae620 0000000000000000 ffffffff81c03ee0 ffffffff81645825
[  624.495096]  000000917a0eca81 ffffffff81cae7f0 ffffffff81d1d290 ffffe8fffee04da8
[  624.496174] Call Trace:
[  624.497229]  [<ffffffff81645825>] cpuidle_enter_state+0x55/0x1c0
[  624.498294]  [<ffffffff81645a47>] cpuidle_enter+0x17/0x20
[  624.499348]  [<ffffffff810b9fb4>] cpu_startup_entry+0x384/0x410
[  624.500403]  [<ffffffff8179d7a0>] rest_init+0xc0/0xd0
[  624.501458]  [<ffffffff8179d6e5>] ? rest_init+0x5/0xd0
[  624.502507]  [<ffffffff81eff009>] start_kernel+0x475/0x496
[  624.503549]  [<ffffffff81efe98d>] ? set_init_arg+0x53/0x53
[  624.504585]  [<ffffffff81efe57b>] x86_64_start_reservations+0x2a/0x2c
[  624.505627]  [<ffffffff81efe66e>] x86_64_start_kernel+0xf1/0xf4
[  624.506655] Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48 
[  624.508927] NMI backtrace for cpu 3
[  624.510061] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G             L 3.17.0-rc1+ #112
[  624.512303] task: ffff880242b72de0 ti: ffff880242418000 task.ti: ffff880242418000
[  624.513284] RIP: 0010:[<ffffffff813c9e65>]  [<ffffffff813c9e65>] intel_idle+0xd5/0x180
[  624.514281] RSP: 0000:ffff88024241be20  EFLAGS: 00000046
[  624.515289] RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
[  624.516286] RDX: 0000000000000000 RSI: ffff88024241bfd8 RDI: 0000000000000003
[  624.517264] RBP: ffff88024241be50 R08: 000000008bafc4e0 R09: 0000000000000000
[  624.518235] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
[  624.519205] R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242418000
[  624.520169] FS:  0000000000000000(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
[  624.521142] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  624.522100] CR2: 00000000013a4738 CR3: 0000000001c11000 CR4: 00000000001407e0
[  624.523105] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  624.524063] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  624.525039] Stack:
[  624.526003]  0000000342418000 1e8c3d4850bfa337 ffffe8ffff404da8 0000000000000005
[  624.526991]  ffffffff81cae620 0000000000000003 ffff88024241be98 ffffffff81645825
[  624.527953]  000000917994f2a7 ffffffff81cae7f0 ffffffff81d1d290 ffffe8ffff404da8
[  624.528926] Call Trace:
[  624.529911]  [<ffffffff81645825>] cpuidle_enter_state+0x55/0x1c0
[  624.530911]  [<ffffffff81645a47>] cpuidle_enter+0x17/0x20
[  624.531883]  [<ffffffff810b9fb4>] cpu_startup_entry+0x384/0x410
[  624.532905]  [<ffffffff8102ff37>] start_secondary+0x237/0x340
[  624.533877] Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48 
[  652.386003] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/2:0]
[  652.388580] Modules linked in: fuse tun rfcomm llc2 af_key nfnetlink scsi_transport_iscsi can_bcm bnep can_raw nfc caif_socket caif af_802154 ieee802154 phonet af_rxrpc bluetooth can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul crc32c_intel ghash_clmulni_intel e1000e snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer microcode snd serio_raw pcspkr usb_debug ptp pps_core shpchp soundcore
[  652.394750] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G             L 3.17.0-rc1+ #112
[  652.397225] task: ffff880242b744d0 ti: ffff880242414000 task.ti: ffff880242414000
[  652.398457] RIP: 0010:[<ffffffff81645849>]  [<ffffffff81645849>] cpuidle_enter_state+0x79/0x1c0
[  652.399711] RSP: 0018:ffff880242417e60  EFLAGS: 00000246
[  652.400953] RAX: 0000000000000000 RBX: ffff880242b744d0 RCX: 0000000000000019
[  652.402203] RDX: 20c49ba5e353f7cf RSI: 0000000000039e2e RDI: 002e6b0e8bd4c66e
[  652.403446] RBP: ffff880242417e98 R08: 000000008bafc4e0 R09: 0000000000000000
[  652.404695] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880242417df0
[  652.405953] R13: ffffffff810bfc5e R14: ffff880242417dd0 R15: 00000000000001da
[  652.407176] FS:  0000000000000000(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
[  652.408441] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  652.409679] CR2: 00007f15cfbaf000 CR3: 0000000001c11000 CR4: 00000000001407e0
[  652.410923] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  652.412173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  652.413415] Stack:
[  652.414652]  00000097fc215a08 ffffffff81cae7f0 ffffffff81d1d290 ffffe8ffff204da8
[  652.415906]  ffff880242414000 ffffffff81cae620 ffff880242414000 ffff880242417ea8
[  652.417138]  ffffffff81645a47 ffff880242417f10 ffffffff810b9fb4 ffff880242417fd8
[  652.418379] Call Trace:
[  652.419637]  [<ffffffff81645a47>] cpuidle_enter+0x17/0x20
[  652.420917]  [<ffffffff810b9fb4>] cpu_startup_entry+0x384/0x410
[  652.422192]  [<ffffffff8102ff37>] start_secondary+0x237/0x340
[  652.423462] Code: d0 48 89 df ff 50 48 41 89 c5 e8 b3 5c aa ff 44 8b 63 04 49 89 c7 0f 1f 44 00 00 e8 a2 19 b0 ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 2b 7d c8 4c 89 f8 49 c1 ff 3f 48 f7 ea b8 ff ff ff 7f 48 c1 
[  652.426104] sending NMI to other CPUs:
[  652.427302] NMI backtrace for cpu 0
[  652.428449] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G             L 3.17.0-rc1+ #112
[  652.430736] task: ffffffff81c164c0 ti: ffffffff81c00000 task.ti: ffffffff81c00000
[  652.431882] RIP: 0010:[<ffffffff813c9e65>]  [<ffffffff813c9e65>] intel_idle+0xd5/0x180
[  652.433036] RSP: 0018:ffffffff81c03e68  EFLAGS: 00000046
[  652.434181] RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
[  652.435341] RDX: 0000000000000000 RSI: ffffffff81c03fd8 RDI: 0000000000000000
[  652.436491] RBP: ffffffff81c03e98 R08: 000000008bafc4e0 R09: 0000000000000000
[  652.437642] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
[  652.438797] R13: 0000000000000032 R14: 0000000000000004 R15: ffffffff81c00000
[  652.439945] FS:  0000000000000000(0000) GS:ffff880244000000(0000) knlGS:0000000000000000
[  652.441093] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  652.442234] CR2: 00007f00bcc2c000 CR3: 0000000001c11000 CR4: 00000000001407f0
[  652.443388] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  652.444540] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  652.445680] Stack:
[  652.446808]  0000000081c00000 b6e98a804d03933a ffffe8fffee04da8 0000000000000005
[  652.447959]  ffffffff81cae620 0000000000000000 ffffffff81c03ee0 ffffffff81645825
[  652.449111]  00000097ff1f0b6f ffffffff81cae7f0 ffffffff81d1d290 ffffe8fffee04da8
[  652.450261] Call Trace:
[  652.451392]  [<ffffffff81645825>] cpuidle_enter_state+0x55/0x1c0
[  652.452539]  [<ffffffff81645a47>] cpuidle_enter+0x17/0x20
[  652.453682]  [<ffffffff810b9fb4>] cpu_startup_entry+0x384/0x410
[  652.454825]  [<ffffffff8179d7a0>] rest_init+0xc0/0xd0
[  652.455942]  [<ffffffff8179d6e5>] ? rest_init+0x5/0xd0
[  652.457036]  [<ffffffff81eff009>] start_kernel+0x475/0x496
[  652.458123]  [<ffffffff81efe98d>] ? set_init_arg+0x53/0x53
[  652.459218]  [<ffffffff81efe57b>] x86_64_start_reservations+0x2a/0x2c
[  652.460320]  [<ffffffff81efe66e>] x86_64_start_kernel+0xf1/0xf4
[  652.461419] Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48 
[  652.463812] NMI backtrace for cpu 3
[  652.464973] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G             L 3.17.0-rc1+ #112
[  652.467190] task: ffff880242b72de0 ti: ffff880242418000 task.ti: ffff880242418000
[  652.468248] RIP: 0010:[<ffffffff813c9e65>]  [<ffffffff813c9e65>] intel_idle+0xd5/0x180
[  652.469309] RSP: 0018:ffff88024241be20  EFLAGS: 00000046
[  652.470336] RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
[  652.471368] RDX: 0000000000000000 RSI: ffff88024241bfd8 RDI: 0000000000000003
[  652.472378] RBP: ffff88024241be50 R08: 000000008bafc4e0 R09: 0000000000000000
[  652.473383] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
[  652.474378] R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242418000
[  652.475366] FS:  0000000000000000(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
[  652.476385] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  652.477389] CR2: 00000000013a4738 CR3: 0000000001c11000 CR4: 00000000001407e0
[  652.478379] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  652.479371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  652.480366] Stack:
[  652.481327]  0000000342418000 1e8c3d4850bfa337 ffffe8ffff404da8 0000000000000005
[  652.482326]  ffffffff81cae620 0000000000000003 ffff88024241be98 ffffffff81645825
[  652.483326]  00000097ff1b9892 ffffffff81cae7f0 ffffffff81d1d290 ffffe8ffff404da8
[  652.484332] Call Trace:
[  652.485323]  [<ffffffff81645825>] cpuidle_enter_state+0x55/0x1c0
[  652.486345]  [<ffffffff81645a47>] cpuidle_enter+0x17/0x20
[  652.487338]  [<ffffffff810b9fb4>] cpu_startup_entry+0x384/0x410
[  652.488335]  [<ffffffff8102ff37>] start_secondary+0x237/0x340
[  652.489322] Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48 
[  652.491493] NMI backtrace for cpu 1
[  652.492553] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G             L 3.17.0-rc1+ #112
[  652.494553] task: ffff880242b716f0 ti: ffff88024240c000 task.ti: ffff88024240c000
[  652.495559] RIP: 0010:[<ffffffff813c9e65>]  [<ffffffff813c9e65>] intel_idle+0xd5/0x180
[  652.496585] RSP: 0018:ffff88024240fe20  EFLAGS: 00000046
[  652.497569] RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
[  652.498545] RDX: 0000000000000000 RSI: ffff88024240ffd8 RDI: 0000000000000001
[  652.499502] RBP: ffff88024240fe50 R08: 000000008bafc4e0 R09: 0000000000000000
[  652.500452] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
[  652.501398] R13: 0000000000000032 R14: 0000000000000004 R15: ffff88024240c000
[  652.502342] FS:  0000000000000000(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
[  652.503290] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  652.504236] CR2: 00007fecce638ab8 CR3: 0000000001c11000 CR4: 00000000001407e0
[  652.505195] DR0: 00007f6670c66000 DR1: 0000000000000000 DR2: 0000000000000000
[  652.506202] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[  652.507177] Stack:
[  652.508136]  000000014240c000 0d2996f8aba02290 ffffe8ffff004da8 0000000000000005
[  652.509122]  ffffffff81cae620 0000000000000001 ffff88024240fe98 ffffffff81645825
[  652.510113]  00000097ff1b96a9 ffffffff81cae7f0 ffffffff81d1d290 ffffe8ffff004da8
[  652.511130] Call Trace:
[  652.512121]  [<ffffffff81645825>] cpuidle_enter_state+0x55/0x1c0
[  652.513102]  [<ffffffff81645a47>] cpuidle_enter+0x17/0x20
[  652.514086]  [<ffffffff810b9fb4>] cpu_startup_entry+0x384/0x410
[  652.515078]  [<ffffffff8102ff37>] start_secondary+0x237/0x340
[  652.516085] Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48 



It kept spewing lockups over and over.
Something weird that jumped out at me was this:

[  599.573324] end_request: I/O error, dev sda, sector 0

The user trinity was running as didn't have permission
to read the block device directly, so that's just... creepy.
Hopefully not a sign of impending disk death.

	Dave


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03 19:00                                                 ` Dave Jones
@ 2014-12-03 19:25                                                   ` Linus Torvalds
  2014-12-03 19:30                                                     ` Dave Jones
                                                                       ` (2 more replies)
  2014-12-03 19:59                                                   ` Chris Mason
  1 sibling, 3 replies; 486+ messages in thread
From: Linus Torvalds @ 2014-12-03 19:25 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Chris Mason, Mike Galbraith,
	Ingo Molnar, Peter Zijlstra, Dâniel Fraga, Sasha Levin,
	Paul E. McKenney, Linux Kernel Mailing List, Thomas Gleixner,
	John Stultz

On Wed, Dec 3, 2014 at 11:00 AM, Dave Jones <davej@redhat.com> wrote:
>
> So right after sending my last mail, I rebooted, and restarted the run
> on the same kernel again.
>
> As I was writing this mail, this happened.
>
> [  524.420897] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c178:20182]
>
> and that's all that made it over the console. I couldn't log in via ssh,
> and thought "ah-ha, so it IS bad".  I walked over to reboot it, and
> found I could actually log in on the console. check out this dmesg..
>
> [  503.683055] Clocksource tsc unstable (delta = -95946009388 ns)
> [  503.692038] Switched to clocksource hpet
> [  524.420897] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c178:20182]

Interesting. That whole NMI watchdog thing happens pretty much 22s
after the "TSC unstable" message.

Have you ever seen that TSC issue before? The watchdog relies on
comparing get_timestamp() differences, so if the timestamp was
incorrect...
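
For context, the soft lockup detector takes its timestamps from the
scheduler clock, not from the clocksource being watchdogged. A rough
sketch of the check, paraphrased from kernel/watchdog.c of that era
(not verbatim):

        /* local_clock() is sched_clock based (TSC-backed on x86) */
        static unsigned long get_timestamp(void)
        {
                return local_clock() >> 30LL;   /* ns -> ~1s units */
        }

        static int is_softlockup(unsigned long touch_ts)
        {
                unsigned long now = get_timestamp();

                /* threshold defaults to 2 * watchdog_thresh = 20s */
                if (time_after(now, touch_ts + get_softlockup_thresh()))
                        return now - touch_ts;  /* the "stuck for 22s" value */

                return 0;
        }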

Maybe that whole "clocksource_watchdog()" is bogus. That delta is
about 96 seconds, sounds very odd. I'm not seeing how the TSC could
actually screw up that badly, so I'd almost be more likely to blame the
"watchdog" clock.

I don't know. This piece of code:

        delta = clocksource_delta(wdnow, cs->wd_last, watchdog->mask);

makes no sense to me. Shouldn't it be

        delta = clocksource_delta(wdnow, watchdog->wd_last, watchdog->mask);

Thomas? John?
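
For reference, clocksource_delta() (kernel/time/timekeeping_internal.h,
new in 3.17) is just a masked subtraction; paraphrased here with the
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE variant, which clamps an
apparently negative delta to zero instead of letting it wrap huge:

        static inline cycle_t
        clocksource_delta(cycle_t now, cycle_t last, cycle_t mask)
        {
                cycle_t ret = (now - last) & mask;

                /* clamp apparent negative deltas (now < last) to 0 */
                return (s64) ret > 0 ? ret : 0;
        }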

                  Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03 19:25                                                   ` Linus Torvalds
@ 2014-12-03 19:30                                                     ` Dave Jones
  2014-12-03 19:48                                                     ` Linus Torvalds
  2014-12-03 19:56                                                     ` John Stultz
  2 siblings, 0 replies; 486+ messages in thread
From: Dave Jones @ 2014-12-03 19:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, Mike Galbraith, Ingo Molnar, Peter Zijlstra,
	Dâniel Fraga, Sasha Levin, Paul E. McKenney,
	Linux Kernel Mailing List, Thomas Gleixner, John Stultz

On Wed, Dec 03, 2014 at 11:25:29AM -0800, Linus Torvalds wrote:
 > On Wed, Dec 3, 2014 at 11:00 AM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > So right after sending my last mail, I rebooted, and restarted the run
 > > on the same kernel again.
 > >
 > > As I was writing this mail, this happened.
 > >
 > > [  524.420897] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c178:20182]
 > >
 > > and that's all that made it over the console. I couldn't log in via ssh,
 > > and thought "ah-ha, so it IS bad".  I walked over to reboot it, and
 > > found I could actually log in on the console. check out this dmesg..
 > >
 > > [  503.683055] Clocksource tsc unstable (delta = -95946009388 ns)
 > > [  503.692038] Switched to clocksource hpet
 > > [  524.420897] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c178:20182]
 > 
 > Interesting. That whole NMI watchdog thing happens pretty much 22s
 > after the "TSC unstable" message.
 > 
 > Have you ever seen that TSC issue before? The watchdog relies on
 > comparing get_timestamp() differences, so if the timestamp was
 > incorrect...
 
yeah, quite a lot.

# grep tsc\ unstable /var/log/messages* | wc -l
71

Usually happens pretty soon after boot, once I start the fuzzing run.
It sometimes occurs quite some time before the NMI issue though.

eg:

Dec  3 11:50:24 binary kernel: [ 4253.432642] Clocksource tsc unstable (delta = -243666538341 ns)
...
Dec  3 13:24:28 binary kernel: [ 9862.915562] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c29:13237]


	Dave

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03 19:25                                                   ` Linus Torvalds
  2014-12-03 19:30                                                     ` Dave Jones
@ 2014-12-03 19:48                                                     ` Linus Torvalds
  2014-12-03 20:09                                                       ` Dave Jones
  2014-12-03 19:56                                                     ` John Stultz
  2 siblings, 1 reply; 486+ messages in thread
From: Linus Torvalds @ 2014-12-03 19:48 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Chris Mason, Mike Galbraith,
	Ingo Molnar, Peter Zijlstra, Dâniel Fraga, Sasha Levin,
	Paul E. McKenney, Linux Kernel Mailing List, Thomas Gleixner,
	John Stultz

On Wed, Dec 3, 2014 at 11:25 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I don't know. This piece of code:
>
>         delta = clocksource_delta(wdnow, cs->wd_last, watchdog->mask);
>
> makes no sense to me.

Yeah, no, I see what's up. I missed that whole wd_last vs cs_last
pairing. I guess that part is all good. There are other crazy issues
in there, though, like the double test of 'watchdog_reset_pending'.
I still wonder about that odd 96-second delta, though: it's just insane
and makes no sense from a TSC standpoint (it's closer to a 32-bit
overflow of an hpet counter, but that sounds off too).
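
For scale, assuming the usual 14.3181 MHz HPET: a 32-bit counter wraps
every 2^32 / 14318180 Hz ~= 300 seconds, so 96 seconds is only about a
third of a full wrap.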

                 Linus

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03 19:25                                                   ` Linus Torvalds
  2014-12-03 19:30                                                     ` Dave Jones
  2014-12-03 19:48                                                     ` Linus Torvalds
@ 2014-12-03 19:56                                                     ` John Stultz
  2014-12-03 20:37                                                       ` Thomas Gleixner
  2014-12-03 20:39                                                       ` Thomas Gleixner
  2 siblings, 2 replies; 486+ messages in thread
From: John Stultz @ 2014-12-03 19:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Chris Mason, Mike Galbraith, Ingo Molnar,
	Peter Zijlstra, Dâniel Fraga, Sasha Levin, Paul E. McKenney,
	Linux Kernel Mailing List, Thomas Gleixner

On Wed, Dec 3, 2014 at 11:25 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Dec 3, 2014 at 11:00 AM, Dave Jones <davej@redhat.com> wrote:
>>
>> So right after sending my last mail, I rebooted, and restarted the run
>> on the same kernel again.
>>
>> As I was writing this mail, this happened.
>>
>> [  524.420897] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c178:20182]
>>
>> and that's all that made it over the console. I couldn't log in via ssh,
>> and thought "ah-ha, so it IS bad".  I walked over to reboot it, and
>> found I could actually log in on the console. check out this dmesg..
>>
>> [  503.683055] Clocksource tsc unstable (delta = -95946009388 ns)
>> [  503.692038] Switched to clocksource hpet
>> [  524.420897] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trinity-c178:20182]
>
> Interesting. That whole NMI watchdog thing happens pretty much 22s
> after the "TSC unstable" message.
>
> Have you ever seen that TSC issue before? The watchdog relies on
> comparing get_timestamp() differences, so if the timestamp was
> incorrect...
>
> Maybe that whole "clocksource_watchdog()" is bogus. That delta is
> about 96 seconds, sounds very odd. I'm not seeing how the TSC could
> actually screw up that badly, so I'd almost be more likely to blame the
> "watchdog" clock.
>
> I don't know. This piece of code:
>
>         delta = clocksource_delta(wdnow, cs->wd_last, watchdog->mask);
>
> makes no sense to me. Shouldn't it be
>
>         delta = clocksource_delta(wdnow, watchdog->wd_last, watchdog->mask);

So we store the wdnow value in cs->wd_last a few lines below, so I
don't think that's problematic.
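
Trimmed and paraphrased from clocksource_watchdog() in
kernel/time/clocksource.c of that era, to show the pairing:

        csnow = cs->read(cs);
        wdnow = watchdog->read(watchdog);
        ...
        /* watchdog clock advance since we last checked *this* cs */
        delta = clocksource_delta(wdnow, cs->wd_last, watchdog->mask);
        wd_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);

        /* clocksource-under-test advance over the same interval */
        delta = clocksource_delta(csnow, cs->cs_last, cs->mask);
        cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);

        cs->cs_last = csnow;
        cs->wd_last = wdnow;

        /* cs_nsec vs wd_nsec disagreement beyond the threshold marks
           the clocksource unstable and prints the delta seen above */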

I do recall seeing problematic watchdog behavior back in the day w/
PREEMPT_RT when a high priority task really starved the watchdog for a
long time. When we came back the hpet had wrapped, making the wd_delta
look quite small relative to the TSC delta, causing improper
disqualification of the TSC.
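
A made-up illustration of that failure mode: with the ~300s HPET wrap
period, a 350s stall leaves a masked HPET delta of ~50s against a TSC
delta of ~350s, so the comparison sees a ~300s disagreement and
disqualifies the TSC even though it was fine.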

But in that case the watchdog would disqualify the TSC after the
stall, and here the stall is happening right afterwards. So I'm not
sure.

I'll look around for some other suspects though. The nohz ntp
improvements might be high on my list there, since they were a 3.17
item. Will dig.

thanks
-john

^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03 19:00                                                 ` Dave Jones
  2014-12-03 19:25                                                   ` Linus Torvalds
@ 2014-12-03 19:59                                                   ` Chris Mason
  2014-12-03 20:11                                                     ` Dave Jones
  1 sibling, 1 reply; 486+ messages in thread
From: Chris Mason @ 2014-12-03 19:59 UTC (permalink / raw)
  To: Dave Jones
  Cc: Linus Torvalds, Mike Galbraith, Ingo Molnar, Peter Zijlstra,
	Dâniel Fraga, Sasha Levin, Paul E. McKenney,
	Linux Kernel Mailing List



On Wed, Dec 3, 2014 at 2:00 PM, Dave Jones <davej@redhat.com> wrote:
> On Wed, Dec 03, 2014 at 10:45:57AM -0800, Linus Torvalds wrote:
>  > On Wed, Dec 3, 2014 at 10:41 AM, Dave Jones <davej@redhat.com> 
> wrote:
>  > >
>  > > I've been stuck on this kernel for a few days now trying to 
> prove it
>  > > good/bad one way or the other, and I'm leaning towards good, 
> given
>  > > that it recovers, even though the traces look similar.
>  >
>  > Ugh. But this does *not* happen with 3.16, right? Even the 
> non-fatal case?
> 
> correct. at least not in any of the runs that I did to date.
> 
>  > If so, I'd be inclined to call it "bad". But there might well be 
> two
>  > bugs: one that makes that NMI watchdog trigger, and another one 
> that
>  > then makes it be a hard lockup. I'd think it would be good to 
> figure
>  > out the "NMI watchdog starts triggering" one first, though.
> 
> I think you're right.
> 
> So right after sending my last mail, I rebooted, and restarted the run
> on the same kernel again.
> 
> As I was writing this mail, this happened.
> 
> [  524.420897] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
> [trinity-c178:20182]
> 
> and that's all that made it over the console. I couldn't log in via 
> ssh,
> and thought "ah-ha, so it IS bad".  I walked over to reboot it, and
> found I could actually log in on the console. check out this dmesg..
> 
> [  503.683055] Clocksource tsc unstable (delta = -95946009388 ns)
> [  503.692038] Switched to clocksource hpet
> [  524.420897] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
> [trinity-c178:20182]

Neat.  We often see switching to hpet on boxes as they are diving into 
softlockup pain, but it's not usually before the softlockups.

Are you configured for CONFIG_NOHZ_FULL?

I'd love to blame the only commit to kernel/smp.c between 3.16 and 3.17

commit 478850160636c4f0b2558451df0e42f8c5a10939
Author: Frederic Weisbecker <fweisbec@gmail.com>
Date:   Thu May 8 01:37:48 2014 +0200

    irq_work: Implement remote queueing

You've also mentioned a few times that messages stopped hitting the 
console?


commit 5874af2003b1aaaa053128d655710140e3187226
Author: Jan Kara <jack@suse.cz>
Date:   Wed Aug 6 16:09:10 2014 -0700

    printk: enable interrupts before calling 
console_trylock_for_printk()

-chris


^ permalink raw reply	[flat|nested] 486+ messages in thread

* Re: frequent lockups in 3.18rc4
  2014-12-03 19:48                                                     ` Linus Torvalds
@ 2014-12-03 20:09                                                       ` Dave Jones
  2014-12-03 20:37                                                         ` Linus Torvalds
  0 siblings, 1 reply; 486+ messages in thread
From: Dave Jones @ 2014-12-03 20:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Mason, Mike Galbraith, Ingo Molnar, Peter Zijlstra,
	Dâniel Fraga, Sasha Levin, Paul E. McKenney,
	Linux Kernel Mailing List, Thomas Gleixner, John Stultz

On Wed, Dec 03, 2014 at 11:48:55AM -0800, Linus Torvalds wrote:
 > On Wed, Dec 3, 2014 at 11:25 AM, Linus Torvalds
 > <torvalds@linux-foundation.org> wrote:
 > >
 > > I don't know. This piece of code:
 > >
 > >         delta = clocksource_delta(wdnow, cs->wd_last, watchdog->mask);
 > >
 > > makes no sense to me.
 > 
 > Yeah, no, I see what's up. I missed that whole wd_last vs cs_last
 > pairing. I guess that part is all good. There are other crazy issues
 > in there, though, like the double test of 'watchdog_reset_pending'.
 > I still wonder about that odd 96-second delta, though: it's just insane
 > and makes no sense from a TSC standpoint (it's closer to a 32-bit
 > overflow of an hpet counter, but that sounds off too).

fwiw, there's quite a bit of variance in the delta that seems to show up.

Clocksource tsc unstable (delta = -1010986453 ns) 
Clocksource tsc unstable (delta = -112130224777 ns) 
Clocksource tsc unstable (delta = -154880389323 ns) 
Clocksource tsc unstable (delta = -165033940543 ns) 
Clocksource tsc unstable (delta = -16610147135 ns) 
Clocksource tsc unstable (delta = -169783264218 ns) 
Clocksource tsc unstable (delta = -183044061613 ns) 
Clocksource tsc unstable (delta = -188697049603 ns) 
Clo