LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
@ 2007-02-20 18:54 Michal Piotrowski
  2007-02-20 21:56 ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Piotrowski @ 2007-02-20 18:54 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, Ingo Molnar

Hi,

I have a problem with timers.

NOHZ: local_softirq_pending 20
NOHZ: local_softirq_pending 20
BUG: soft lockup detected on CPU#0!
 [<c01050f9>] show_trace_log_lvl+0x1a/0x2f
 [<c01057e0>] show_trace+0x12/0x14
 [<c0105892>] dump_stack+0x16/0x18
 [<c0151214>] softlockup_tick+0xa7/0xb9
 [<c012a18f>] run_local_timers+0x12/0x14
 [<c012a54b>] update_process_times+0x3e/0x63
 [<c0139e34>] tick_sched_timer+0x66/0xaa
 [<c01366d2>] hrtimer_interrupt+0x113/0x19d
 [<c0114c45>] smp_apic_timer_interrupt+0x77/0x89
 [<c0104beb>] apic_timer_interrupt+0x33/0x38
 [<c026b945>] scsi_dispatch_cmd+0x1a8/0x1e5
 [<c026ff86>] scsi_request_fn+0x224/0x2ec
 [<c01e1779>] __generic_unplug_device+0x20/0x23
 [<c01e1b49>] generic_unplug_device+0x18/0x26
 [<c01e2c48>] blk_backing_dev_unplug+0x76/0x7e
 [<c018e289>] sync_buffer+0x2e/0x37
 [<c0311041>] __wait_on_bit+0x36/0x5d
 [<c03110c3>] out_of_line_wait_on_bit+0x5b/0x63
 [<c018e1f9>] __wait_on_buffer+0x27/0x2b
 [<c018ec74>] sync_dirty_buffer+0xa0/0xd6
 [<c01bc5c1>] journal_commit_transaction+0xca4/0x11d3
  [<c01bff86>] kjournald+0xab/0x1e8
  [<c0133639>] kthread+0xb5/0xe0
 [<c0104d8f>] kernel_thread_helper+0x7/0x10
 =======================
Clocksource tsc unstable (delta = 18743203260 ns)
Time: acpi_pm clocksource has been installed.

http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.20-git15/git-dmesg
http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.20-git15/git-config

Might it be 6ba9b346e1e0eca65ec589d32de3a9fe32dc5de6 commit?

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
  2007-02-20 18:54 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers? Michal Piotrowski
@ 2007-02-20 21:56 ` Thomas Gleixner
  2007-02-20 22:37   ` Michal Piotrowski
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2007-02-20 21:56 UTC (permalink / raw)
  To: Michal Piotrowski; +Cc: LKML, Ingo Molnar

On Tue, 2007-02-20 at 19:54 +0100, Michal Piotrowski wrote:
>
> Might it be 6ba9b346e1e0eca65ec589d32de3a9fe32dc5de6 commit?

I doubt that it is, but can you revert it ?

	tglx



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
  2007-02-20 21:56 ` Thomas Gleixner
@ 2007-02-20 22:37   ` Michal Piotrowski
  2007-02-21 15:33     ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Piotrowski @ 2007-02-20 22:37 UTC (permalink / raw)
  To: tglx; +Cc: LKML, Ingo Molnar

On 20/02/07, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 2007-02-20 at 19:54 +0100, Michal Piotrowski wrote:
> >
> > Might it be 6ba9b346e1e0eca65ec589d32de3a9fe32dc5de6 commit?
>
> I doubt that it is, but can you revert it ?

I'm using the latest kernel without this patch since 3 hours.

So far so good.

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
  2007-02-20 22:37   ` Michal Piotrowski
@ 2007-02-21 15:33     ` Thomas Gleixner
  2007-02-21 15:38       ` Michal Piotrowski
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2007-02-21 15:33 UTC (permalink / raw)
  To: Michal Piotrowski; +Cc: LKML, Ingo Molnar

On Tue, 2007-02-20 at 23:37 +0100, Michal Piotrowski wrote:
> On 20/02/07, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Tue, 2007-02-20 at 19:54 +0100, Michal Piotrowski wrote:
> > >
> > > Might it be 6ba9b346e1e0eca65ec589d32de3a9fe32dc5de6 commit?
> >
> > I doubt that it is, but can you revert it ?
> 
> I'm using the latest kernel without this patch since 3 hours.
> 
> So far so good.

But you still have those softirq pending messages, right ? I think those
are pointing to the root cause of this. Still no idea how to get hold of
them. All my systems refuse to produce that. Hrmpf.

	tglx



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
  2007-02-21 15:33     ` Thomas Gleixner
@ 2007-02-21 15:38       ` Michal Piotrowski
  2007-02-21 20:00         ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Piotrowski @ 2007-02-21 15:38 UTC (permalink / raw)
  To: tglx; +Cc: LKML, Ingo Molnar

On 21/02/07, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 2007-02-20 at 23:37 +0100, Michal Piotrowski wrote:
> > On 20/02/07, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > On Tue, 2007-02-20 at 19:54 +0100, Michal Piotrowski wrote:
> > > >
> > > > Might it be 6ba9b346e1e0eca65ec589d32de3a9fe32dc5de6 commit?
> > >
> > > I doubt that it is, but can you revert it ?
> >
> > I'm using the latest kernel without this patch since 3 hours.
> >
> > So far so good.
>
> But you still have those softirq pending messages, right ?

Yes

(+ new NOHZ: local_softirq_pending 02)
http://www.ussg.iu.edu/hypermail/linux/kernel/0702.2/1944.html

> I think those
> are pointing to the root cause of this. Still no idea how to get hold of
> them. All my systems refuse to produce that. Hrmpf.
>
>         tglx
>

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
  2007-02-21 15:38       ` Michal Piotrowski
@ 2007-02-21 20:00         ` Thomas Gleixner
  2007-02-22  1:47           ` Michal Piotrowski
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2007-02-21 20:00 UTC (permalink / raw)
  To: Michal Piotrowski; +Cc: LKML, Ingo Molnar

Michal,

On Wed, 2007-02-21 at 16:38 +0100, Michal Piotrowski wrote:
> > But you still have those softirq pending messages, right ?
> 
> Yes
> 
> (+ new NOHZ: local_softirq_pending 02)

Yike, that's the timer softirq.

Can you add the patch below, maybe it gives us some useful info. Please
enable lockdep (your last config had it already)

Thanks,

	tglx

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 512a4a9..cc705c7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -165,9 +165,11 @@ void tick_nohz_stop_sched_tick(void)
 		goto end;
 
 	cpu = smp_processor_id();
-	if (unlikely(local_softirq_pending()))
-		printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
-		       local_softirq_pending());
+	if (unlikely(local_softirq_pending())) {
+		print_irqtrace_events(current);
+		printk(KERN_ERR "NOHZ: local_softirq_pending %02x, %08x\n",
+		       local_softirq_pending(), preempt_count());
+	}
 
 	now = ktime_get();
 	/*



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
  2007-02-21 20:00         ` Thomas Gleixner
@ 2007-02-22  1:47           ` Michal Piotrowski
  2007-02-22 10:50             ` Michal Piotrowski
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Piotrowski @ 2007-02-22  1:47 UTC (permalink / raw)
  To: tglx; +Cc: Michal Piotrowski, LKML, Ingo Molnar

Thomas Gleixner napisał(a):
> Michal,
> 
> On Wed, 2007-02-21 at 16:38 +0100, Michal Piotrowski wrote:
>>> But you still have those softirq pending messages, right ?
>> Yes
>>
>> (+ new NOHZ: local_softirq_pending 02)
> 
> Yike, that's the timer softirq.
> 
> Can you add the patch below, maybe it gives us some useful info. Please
> enable lockdep (your last config had it already)
> 

I hope this helps.

irq event stamp: 103630856
hardirqs last  enabled at (103630855): [<c031356a>] _spin_unlock_irq+0x22/0x43
hardirqs last disabled at (103630856): [<c013a00e>] tick_nohz_stop_sched_tick+0x13/0x1fd
softirqs last  enabled at (103630824): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (103630819): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001

(gdb) l *0xc031356a
0xc031356a is in _spin_unlock_irq (include2/asm/irqflags.h:48).
43              __asm__ __volatile__("cli" : : : "memory");
44      }
45
46      static inline void raw_local_irq_enable(void)
47      {
48              __asm__ __volatile__("sti" : : : "memory");
49      }
50
51      /*
52       * Used in the idle loop; sti takes one instruction cycle
(gdb) l *0xc013a00e
0xc013a00e is in tick_nohz_stop_sched_tick (/mnt/md0/devel/linux-git/kernel/time/tick-sched.c:158).
153             ktime_t last_update, expires, now, delta;
154             int cpu;
155
156             local_irq_save(flags);
157
158             cpu = smp_processor_id();
159             ts = &per_cpu(tick_cpu_sched, cpu);
160
161             if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
162                     goto end;
(gdb) l *0xc01265df
0xc01265df is in __do_softirq (/mnt/md0/devel/linux-git/kernel/softirq.c:251).
246
247             trace_softirq_exit();
248
249             account_system_vtime(current);
250             _local_bh_enable();
251     }
252
253     #ifndef __ARCH_HAS_DO_SOFTIRQ
254
255     asmlinkage void do_softirq(void)
(gdb) l *0xc0106a75
0xc0106a75 is in do_softirq (/mnt/md0/devel/linux-git/arch/i386/kernel/irq.c:222).
217                     irqctx->tinfo.previous_esp = current_stack_pointer;
218
219                     /* build the stack frame on the softirq stack */
220                     isp = (u32*) ((char*)irqctx + sizeof(*irqctx));
221
222                     asm volatile(
223                             "       xchgl   %%ebx,%%esp     \n"
224                             "       call    __do_softirq    \n"
225                             "       movl    %%ebx,%%esp     \n"
226                             : "=b"(isp)

http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-config
http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-dmesg

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
  2007-02-22  1:47           ` Michal Piotrowski
@ 2007-02-22 10:50             ` Michal Piotrowski
  2007-02-23  6:08               ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Piotrowski @ 2007-02-22 10:50 UTC (permalink / raw)
  To: Michal Piotrowski; +Cc: tglx, Michal Piotrowski, LKML, Ingo Molnar

Michal Piotrowski napisał(a):
> Thomas Gleixner napisał(a):
>> Michal,
>>
>> On Wed, 2007-02-21 at 16:38 +0100, Michal Piotrowski wrote:
>>>> But you still have those softirq pending messages, right ?
>>> Yes
>>>
>>> (+ new NOHZ: local_softirq_pending 02)
>> Yike, that's the timer softirq.
>>
>> Can you add the patch below, maybe it gives us some useful info. Please
>> enable lockdep (your last config had it already)
>>
> 
> I hope this helps.
> 
> irq event stamp: 103630856
> hardirqs last  enabled at (103630855): [<c031356a>] _spin_unlock_irq+0x22/0x43
> hardirqs last disabled at (103630856): [<c013a00e>] tick_nohz_stop_sched_tick+0x13/0x1fd
> softirqs last  enabled at (103630824): [<c01265df>] __do_softirq+0xe4/0xea
> softirqs last disabled at (103630819): [<c0106a75>] do_softirq+0x64/0xd1
> NOHZ: local_softirq_pending 20, 00000001
> 
> (gdb) l *0xc031356a
> 0xc031356a is in _spin_unlock_irq (include2/asm/irqflags.h:48).
> 43              __asm__ __volatile__("cli" : : : "memory");
> 44      }
> 45
> 46      static inline void raw_local_irq_enable(void)
> 47      {
> 48              __asm__ __volatile__("sti" : : : "memory");
> 49      }
> 50
> 51      /*
> 52       * Used in the idle loop; sti takes one instruction cycle
> (gdb) l *0xc013a00e
> 0xc013a00e is in tick_nohz_stop_sched_tick (/mnt/md0/devel/linux-git/kernel/time/tick-sched.c:158).
> 153             ktime_t last_update, expires, now, delta;
> 154             int cpu;
> 155
> 156             local_irq_save(flags);
> 157
> 158             cpu = smp_processor_id();
> 159             ts = &per_cpu(tick_cpu_sched, cpu);
> 160
> 161             if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
> 162                     goto end;
> (gdb) l *0xc01265df
> 0xc01265df is in __do_softirq (/mnt/md0/devel/linux-git/kernel/softirq.c:251).
> 246
> 247             trace_softirq_exit();
> 248
> 249             account_system_vtime(current);
> 250             _local_bh_enable();
> 251     }
> 252
> 253     #ifndef __ARCH_HAS_DO_SOFTIRQ
> 254
> 255     asmlinkage void do_softirq(void)
> (gdb) l *0xc0106a75
> 0xc0106a75 is in do_softirq (/mnt/md0/devel/linux-git/arch/i386/kernel/irq.c:222).
> 217                     irqctx->tinfo.previous_esp = current_stack_pointer;
> 218
> 219                     /* build the stack frame on the softirq stack */
> 220                     isp = (u32*) ((char*)irqctx + sizeof(*irqctx));
> 221
> 222                     asm volatile(
> 223                             "       xchgl   %%ebx,%%esp     \n"
> 224                             "       call    __do_softirq    \n"
> 225                             "       movl    %%ebx,%%esp     \n"
> 226                             : "=b"(isp)
> 

Here is more

BUG: at /mnt/md0/devel/linux-git/kernel/lockdep.c:2427 check_flags()
 [<c01050f9>] show_trace_log_lvl+0x1a/0x2f
 [<c01057e0>] show_trace+0x12/0x14
 [<c0105892>] dump_stack+0x16/0x18
 [<c013abf7>] check_flags+0x95/0x143
 [<c013d978>] lock_acquire+0x29/0x82
 [<c0136974>] down_write+0x3a/0x54
 [<c01630f8>] sys_munmap+0x23/0x3f
 [<c0104120>] syscall_call+0x7/0xb
 =======================
irq event stamp: 30788
hardirqs last  enabled at (30787): [<c0104249>] syscall_exit_work+0x11/0x26
hardirqs last disabled at (30788): [<c0103fc9>] ret_from_exception+0x9/0xc
softirqs last  enabled at (30202): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (30193): [<c0106a75>] do_softirq+0x64/0xd1
irq event stamp: 214661966
hardirqs last  enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last  enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 24, 00000001
irq event stamp: 137191105
hardirqs last  enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last  enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last  enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last  enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 24, 00000001
irq event stamp: 137191105
hardirqs last  enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last  enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 02, 00000001
irq event stamp: 214661966
hardirqs last  enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last  enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 06, 00000001
irq event stamp: 137191105
hardirqs last  enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last  enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 137191105
hardirqs last  enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last  enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last  enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last  enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 24, 00000001
irq event stamp: 137191105
hardirqs last  enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last  enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 02, 00000001
irq event stamp: 137191105
hardirqs last  enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last  enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last  enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last  enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 06, 00000001
irq event stamp: 137191105
hardirqs last  enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last  enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 02, 00000001
irq event stamp: 137191105
hardirqs last  enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last  enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 137191105
hardirqs last  enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last  enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last  enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last  enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 06, 00000001
irq event stamp: 137191105
hardirqs last  enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last  enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last  enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last  enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 06, 00000001
irq event stamp: 214661966
hardirqs last  enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last  enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 06, 00000001
irq event stamp: 137191105
hardirqs last  enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last  enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 137191105
hardirqs last  enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last  enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last  enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last  enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 24, 00000001
irq event stamp: 214661966
hardirqs last  enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last  enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 24, 00000001
irq event stamp: 137191105
hardirqs last  enabled at (137191105): [<c0102b80>] default_idle+0x3e/0x59
hardirqs last disabled at (137191104): [<c0102451>] cpu_idle+0xba/0xfb
softirqs last  enabled at (137190592): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (137190587): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 20, 00000001
irq event stamp: 214661966
hardirqs last  enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last  enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 24, 00000001
irq event stamp: 214661966
hardirqs last  enabled at (214661965): [<c0104168>] restore_nocheck+0x12/0x15
hardirqs last disabled at (214661966): [<c0313174>] _spin_lock_irq+0xf/0x48
softirqs last  enabled at (214661782): [<c01265df>] __do_softirq+0xe4/0xea
softirqs last disabled at (214661755): [<c0106a75>] do_softirq+0x64/0xd1
NOHZ: local_softirq_pending 06, 00000001


http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-config
http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-dmesg2

[<c01265df>] __do_softirq+0xe4/0xea
[<c0106a75>] do_softirq+0x64/0xd1

this repeats every time

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
  2007-02-22 10:50             ` Michal Piotrowski
@ 2007-02-23  6:08               ` Ingo Molnar
  2007-02-23  7:32                 ` Mike Galbraith
                                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Ingo Molnar @ 2007-02-23  6:08 UTC (permalink / raw)
  To: Michal Piotrowski; +Cc: Michal Piotrowski, tglx, LKML


Michal,

* Michal Piotrowski <michal.k.k.piotrowski@gmail.com> wrote:

> Here is more
> 
> hardirqs last  enabled at (30787): [<c0104249>] syscall_exit_work+0x11/0x26
> hardirqs last disabled at (30788): [<c0103fc9>] ret_from_exception+0x9/0xc
> softirqs last  enabled at (30202): [<c01265df>] __do_softirq+0xe4/0xea
> softirqs last disabled at (30193): [<c0106a75>] do_softirq+0x64/0xd1

could you please try the patch below? This is pretty much the only 
condition under which we can silently 'leak' pending softirqs, and 
trigger the new warning: if something does cond_resched_softirq() in 
non-runnable state. (which is a no-no, but nothing enforced this, so it 
could in theory happen.) So the question is, with this patch applied, do 
you get these new warnings from sched.c?

	Ingo

-------------------------->
Subject: [patch] add warning to cond_resched_softirq()
From: Ingo Molnar <mingo@elte.hu>

make sure that cond_resched_softirq() is always called with a runnable 
task - so that we do not leave softirq work pending indefinitely.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4689,6 +4689,7 @@ int __sched cond_resched_softirq(void)
 	BUG_ON(!in_softirq());
 
 	if (need_resched() && system_state == SYSTEM_RUNNING) {
+		WARN_ON(current->state != TASK_RUNNING);
 		raw_local_irq_disable();
 		_local_bh_enable();
 		raw_local_irq_enable();

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
  2007-02-23  6:08               ` Ingo Molnar
@ 2007-02-23  7:32                 ` Mike Galbraith
  2007-02-23 10:10                 ` Ingo Molnar
  2007-02-24 22:45                 ` Michal Piotrowski
  2 siblings, 0 replies; 15+ messages in thread
From: Mike Galbraith @ 2007-02-23  7:32 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Michal Piotrowski, Michal Piotrowski, tglx, LKML

On Fri, 2007-02-23 at 07:08 +0100, Ingo Molnar wrote:

> Index: linux/kernel/sched.c
> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -4689,6 +4689,7 @@ int __sched cond_resched_softirq(void)
>  	BUG_ON(!in_softirq());
>  
>  	if (need_resched() && system_state == SYSTEM_RUNNING) {
> +		WARN_ON(current->state != TASK_RUNNING);
>  		raw_local_irq_disable();
>  		_local_bh_enable();
>  		raw_local_irq_enable();

I'm getting NOHZ: local_softirq_pending 02 or 10 every few seconds if
the box is doing anything, and the WARN_ON() has yet to trigger.

	-Mike


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
  2007-02-23  6:08               ` Ingo Molnar
  2007-02-23  7:32                 ` Mike Galbraith
@ 2007-02-23 10:10                 ` Ingo Molnar
  2007-02-24 22:45                 ` Michal Piotrowski
  2 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2007-02-23 10:10 UTC (permalink / raw)
  To: Michal Piotrowski; +Cc: Michal Piotrowski, tglx, LKML


* Ingo Molnar <mingo@elte.hu> wrote:

> could you please try the patch below? This is pretty much the only 
> condition under which we can silently 'leak' pending softirqs, and 
> trigger the new warning: if something does cond_resched_softirq() in 
> non-runnable state. (which is a no-no, but nothing enforced this, so 
> it could in theory happen.) So the question is, with this patch 
> applied, do you get these new warnings from sched.c?

it just triggered on one of my boxes:

BUG: at kernel/sched.c:4692 cond_resched_softirq()
 [<c03ce128>] cond_resched_softirq+0x5f/0x7b
 [<c0369078>] release_sock+0x42/0x81
 [<c03693bc>] sk_wait_data+0x57/0x9d
 [<c0129a00>] autoremove_wake_function+0x0/0x33
 [<c03942ff>] tcp_recvmsg+0x39c/0x973
 [<c0368e39>] sock_common_recvmsg+0x3e/0x54
 [<c0367903>] sock_aio_read+0x106/0x112
 [<c0159b0c>] do_sync_read+0xc8/0x105
 [<c0129a00>] autoremove_wake_function+0x0/0x33
 [<c0159e82>] vfs_read+0xc1/0x15a
 [<c015a7d2>] sys_read+0x41/0x67
 [<c0103c10>] syscall_call+0x7/0xb
 =======================

so tcp_recvmsg() definitely gets into this condition.

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
  2007-02-23  6:08               ` Ingo Molnar
  2007-02-23  7:32                 ` Mike Galbraith
  2007-02-23 10:10                 ` Ingo Molnar
@ 2007-02-24 22:45                 ` Michal Piotrowski
  2007-02-25  9:53                   ` Thomas Gleixner
  2 siblings, 1 reply; 15+ messages in thread
From: Michal Piotrowski @ 2007-02-24 22:45 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Michal Piotrowski, tglx, LKML

Hi Ingo,

On 23/02/07, Ingo Molnar <mingo@elte.hu> wrote:
>
> Michal,
>
> * Michal Piotrowski <michal.k.k.piotrowski@gmail.com> wrote:
>
> > Here is more
> >
> > hardirqs last  enabled at (30787): [<c0104249>] syscall_exit_work+0x11/0x26
> > hardirqs last disabled at (30788): [<c0103fc9>] ret_from_exception+0x9/0xc
> > softirqs last  enabled at (30202): [<c01265df>] __do_softirq+0xe4/0xea
> > softirqs last disabled at (30193): [<c0106a75>] do_softirq+0x64/0xd1
>
> could you please try the patch below? This is pretty much the only
> condition under which we can silently 'leak' pending softirqs, and
> trigger the new warning: if something does cond_resched_softirq() in
> non-runnable state. (which is a no-no, but nothing enforced this, so it
> could in theory happen.) So the question is, with this patch applied, do
> you get these new warnings from sched.c?

Here is a dmesg after over 25 hours of uptime
http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-dmesg3
nothing new.

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
  2007-02-24 22:45                 ` Michal Piotrowski
@ 2007-02-25  9:53                   ` Thomas Gleixner
  2007-02-26 13:01                     ` Michal Piotrowski
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2007-02-25  9:53 UTC (permalink / raw)
  To: Michal Piotrowski; +Cc: Ingo Molnar, Michal Piotrowski, LKML

Michal,

On Sat, 2007-02-24 at 23:45 +0100, Michal Piotrowski wrote:
> Here is a dmesg after over 25 hours of uptime
> http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-dmesg3
> nothing new.

can you please turn off CONFIG_SCHED_SMT ? It seems all reports have one
thing in common:

CPU == P4 and CONFIG_SCHED_SMT turned on.

	tglx



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers?
  2007-02-25  9:53                   ` Thomas Gleixner
@ 2007-02-26 13:01                     ` Michal Piotrowski
  2007-02-26 13:25                       ` [patch] sched: fix SMT scheduler bug Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Piotrowski @ 2007-02-26 13:01 UTC (permalink / raw)
  To: tglx; +Cc: Ingo Molnar, Michal Piotrowski, LKML

On 25/02/07, Thomas Gleixner <tglx@linutronix.de> wrote:
> Michal,
>
> On Sat, 2007-02-24 at 23:45 +0100, Michal Piotrowski wrote:
> > Here is a dmesg after over 25 hours of uptime
> > http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc1/git-dmesg3
> > nothing new.
>
> can you please turn off CONFIG_SCHED_SMT ? It seems all reports have one
> thing in common:
>
> CPU == P4 and CONFIG_SCHED_SMT turned on.
>
>         tglx
>
>
>

Ok, I think that this is a SMT scheduler problem.

System works fine for fifteen hours.

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [patch] sched: fix SMT scheduler bug
  2007-02-26 13:01                     ` Michal Piotrowski
@ 2007-02-26 13:25                       ` Ingo Molnar
  0 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2007-02-26 13:25 UTC (permalink / raw)
  To: Michal Piotrowski; +Cc: tglx, Michal Piotrowski, LKML, Andrew Morton


* Michal Piotrowski <michal.k.k.piotrowski@gmail.com> wrote:

> Ok, I think that this is a SMT scheduler problem.
> 
> System works fine for fifteen hours.

hmmm. I think it's this piece of smt-nice code in sched.c:

        if (dependent_sleeper(cpu, rq, next))
                next = rq->idle;

this will skip to run this HT CPU if the sibling CPU is running a task 
that is more important. And dependent_sleeper() doesnt skip kernel 
threads such as ksoftirqd:

        /* kernel/rt threads do not participate in dependent sleeping */
        if (!p->mm || rt_task(p))
                return 0;

but ... what if the highest-prio thread is a user-space task /and/ there 
is ksoftirqd also queued, which is now delayed? We schedule the idle 
task instead of processing softirqs. Ouch!

To test this theory, could you turn the SMT scheduler back on but also 
apply the patch below. Does it still work fine?

	Ingo

---------------------->
Subject: [patch] sched: fix SMT scheduler bug
From: Ingo Molnar <mingo@elte.hu>

the SMT scheduler incorrectly skips kernel threads even if they
are runnable (but they are preempted by a higher-prio user-space
task which got SMT-delayed by an even higher-priority task
running on a sibling CPU).

fix this for now by only doing the SMT-nice optimization if
the to-be-delayed task is the only runnable task. (This should
cover most of the real-life cases anyway.)

this bug has been in the SMT scheduler since 2.6.17 or so, but has
only been noticed now by the active check in the dynticks code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3537,7 +3537,7 @@ need_resched_nonpreemptible:
 		}
 	}
 	next->sleep_type = SLEEP_NORMAL;
-	if (dependent_sleeper(cpu, rq, next))
+	if (rq->nr_running == 1 && dependent_sleeper(cpu, rq, next))
 		next = rq->idle;
 switch_tasks:
 	if (next == rq->idle)

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2007-02-26 13:31 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-20 18:54 2.6.20-git15 BUG: soft lockup detected on CPU#0! - timers? Michal Piotrowski
2007-02-20 21:56 ` Thomas Gleixner
2007-02-20 22:37   ` Michal Piotrowski
2007-02-21 15:33     ` Thomas Gleixner
2007-02-21 15:38       ` Michal Piotrowski
2007-02-21 20:00         ` Thomas Gleixner
2007-02-22  1:47           ` Michal Piotrowski
2007-02-22 10:50             ` Michal Piotrowski
2007-02-23  6:08               ` Ingo Molnar
2007-02-23  7:32                 ` Mike Galbraith
2007-02-23 10:10                 ` Ingo Molnar
2007-02-24 22:45                 ` Michal Piotrowski
2007-02-25  9:53                   ` Thomas Gleixner
2007-02-26 13:01                     ` Michal Piotrowski
2007-02-26 13:25                       ` [patch] sched: fix SMT scheduler bug Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).