* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
       [not found] <200801252259.m0PMxHmD012059@hera.kernel.org>
@ 2008-02-06  0:46 ` Andrew Morton
  2008-02-06 14:50   ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2008-02-06  0:46 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linux Kernel Mailing List

On Fri, 25 Jan 2008 22:59:17 GMT
Linux Kernel Mailing List <linux-kernel@vger.kernel.org> wrote:

> Gitweb:     http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=82a1fcb90287052aabfa235e7ffc693ea003fe69
> Commit:     82a1fcb90287052aabfa235e7ffc693ea003fe69
> Parent:     d0d23b5432fe61229dd3641c5e94d4130bc4e61b
> Author:     Ingo Molnar <mingo@elte.hu>
> AuthorDate: Fri Jan 25 21:08:02 2008 +0100
> Committer:  Ingo Molnar <mingo@elte.hu>
> CommitDate: Fri Jan 25 21:08:02 2008 +0100
> 
>     softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks

One of my test boxes (an 8-way x86_64 software-development thing from Intel
- I'm not sure what's inside it) no longer powers itself off when I run `halt
-pfn'.

During bisection I found two different problems.  Sometimes the machine
wouldn't power off at all.  Other times it would power off after a pause of
around twenty seconds.

Bisection indicates that this commit is what caused the 20-second pause. 
It could be that some later commit caused the infinity-second pause.



* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
  2008-02-06  0:46 ` softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks Andrew Morton
@ 2008-02-06 14:50   ` Peter Zijlstra
  2008-02-06 18:05     ` Andrew Morton
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2008-02-06 14:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, Linux Kernel Mailing List


On Tue, 2008-02-05 at 16:46 -0800, Andrew Morton wrote:
> On Fri, 25 Jan 2008 22:59:17 GMT
> Linux Kernel Mailing List <linux-kernel@vger.kernel.org> wrote:
> 
> > Gitweb:     http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=82a1fcb90287052aabfa235e7ffc693ea003fe69
> > Commit:     82a1fcb90287052aabfa235e7ffc693ea003fe69
> > Parent:     d0d23b5432fe61229dd3641c5e94d4130bc4e61b
> > Author:     Ingo Molnar <mingo@elte.hu>
> > AuthorDate: Fri Jan 25 21:08:02 2008 +0100
> > Committer:  Ingo Molnar <mingo@elte.hu>
> > CommitDate: Fri Jan 25 21:08:02 2008 +0100
> > 
> >     softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
> 
> One of my test boxes (an 8-way x86_64 software-development thing from Intel
> - I'm not sure what's inside it) no longer powers itself off when I run `halt
> -pfn'.
> 
> During bisection I found two different problems.  Sometimes the machine
> wouldn't power off at all.  Other times it would power off after a pause of
> around twenty seconds.
> 
> Bisection indicates that this commit is what caused the 20-second pause. 
> It could be that some later commit caused the infinity-second pause.


Does that kernel have:

commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Sat Feb 2 00:23:08 2008 +0100

    debug: softlockup looping fix





* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
  2008-02-06 14:50   ` Peter Zijlstra
@ 2008-02-06 18:05     ` Andrew Morton
  2008-02-07  0:04       ` Ingo Molnar
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2008-02-06 18:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Linux Kernel Mailing List

On Wed, 06 Feb 2008 15:50:02 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> 
> On Tue, 2008-02-05 at 16:46 -0800, Andrew Morton wrote:
> > On Fri, 25 Jan 2008 22:59:17 GMT
> > Linux Kernel Mailing List <linux-kernel@vger.kernel.org> wrote:
> > 
> > > Gitweb:     http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=82a1fcb90287052aabfa235e7ffc693ea003fe69
> > > Commit:     82a1fcb90287052aabfa235e7ffc693ea003fe69
> > > Parent:     d0d23b5432fe61229dd3641c5e94d4130bc4e61b
> > > Author:     Ingo Molnar <mingo@elte.hu>
> > > AuthorDate: Fri Jan 25 21:08:02 2008 +0100
> > > Committer:  Ingo Molnar <mingo@elte.hu>
> > > CommitDate: Fri Jan 25 21:08:02 2008 +0100
> > > 
> > >     softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
> > 
> > One of my test boxes (an 8-way x86_64 software-development thing from Intel
> > - I'm not sure what's inside it) no longer powers itself off when I run `halt
> > -pfn'.
> > 
> > During bisection I found two different problems.  Sometimes the machine
> > wouldn't power off at all.  Other times it would power off after a pause of
> > around twenty seconds.
> > 
> > Bisection indicates that this commit is what caused the 20-second pause. 
> > It could be that some later commit caused the infinity-second pause.
> 
> 
> Does that kernel have:
> 
> commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date:   Sat Feb 2 00:23:08 2008 +0100
> 
>     debug: softlockup looping fix
> 
> 

yup.  It was fetched less than 24 hours ago.


* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
  2008-02-06 18:05     ` Andrew Morton
@ 2008-02-07  0:04       ` Ingo Molnar
  2008-02-07  0:31         ` Andrew Morton
  0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2008-02-07  0:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Peter Zijlstra, Linux Kernel Mailing List


* Andrew Morton <akpm@linux-foundation.org> wrote:

> > Does that kernel have:
> > 
> > commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
> > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Date:   Sat Feb 2 00:23:08 2008 +0100
> > 
> >     debug: softlockup looping fix
> 
> yup.  It was fetched less than 24 hours ago.

does the patch below improve the situation?

	Ingo

---
 arch/x86/kernel/reboot.c |   16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

Index: linux-x86.q/arch/x86/kernel/reboot.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/reboot.c
+++ linux-x86.q/arch/x86/kernel/reboot.c
@@ -396,8 +396,20 @@ void machine_shutdown(void)
 	if (!cpu_isset(reboot_cpu_id, cpu_online_map))
 		reboot_cpu_id = smp_processor_id();
 
-	/* Make certain I only run on the appropriate processor */
-	set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+	/*
+	 * Make certain we only run on the appropriate processor,
+	 * and with sufficient priority:
+	 */
+	{
+		struct sched_param schedparm;
+		int ret;
+
+		schedparm.sched_priority = 99;
+		ret = sched_setscheduler(current, SCHED_RR, &schedparm);
+		WARN_ON_ONCE(ret);
+
+		set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+	}
 
 	/* O.K Now that I'm on the appropriate processor,
 	 * stop all of the others.
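
For readers who want to try the idea outside the kernel, here is a minimal user-space sketch of what the hunk above attempts: switch the caller to SCHED_RR at a high priority, then restrict it to a single CPU. It uses the sched_setscheduler(2) and sched_setaffinity(2) system calls, not the in-kernel functions the patch calls. The helper name boost_and_pin() is hypothetical, and the realtime switch normally needs root or CAP_SYS_NICE.

```c
#define _GNU_SOURCE          /* for cpu_set_t, CPU_ZERO, CPU_SET, sched_getcpu */
#include <assert.h>
#include <errno.h>
#include <sched.h>

/*
 * Hypothetical helper, not part of the patch: raise the calling thread
 * to SCHED_RR at the given priority, then pin it to one CPU.
 * Returns 0 on success or a negative errno value on failure.
 */
static int boost_and_pin(int cpu, int priority)
{
	struct sched_param param = { .sched_priority = priority };
	cpu_set_t mask;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);

	/* pid 0 means "the caller"; unprivileged callers usually get EPERM */
	if (sched_setscheduler(0, SCHED_RR, &param) != 0)
		return -errno;

	/* allow exactly one CPU in the affinity mask */
	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
		return -errno;

	return 0;
}
```

Calling boost_and_pin(sched_getcpu(), 99) mirrors the priority value used in the patch. Unlike the kernel hunk, the sketch reports failure to the caller so that a missing privilege is visible.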


* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
  2008-02-07  0:04       ` Ingo Molnar
@ 2008-02-07  0:31         ` Andrew Morton
  2008-02-07  0:47           ` Andrew Morton
  2008-02-07  0:51           ` Ingo Molnar
  0 siblings, 2 replies; 8+ messages in thread
From: Andrew Morton @ 2008-02-07  0:31 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: a.p.zijlstra, linux-kernel

On Thu, 7 Feb 2008 01:04:25 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > > Does that kernel have:
> > > 
> > > commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
> > > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > Date:   Sat Feb 2 00:23:08 2008 +0100
> > > 
> > >     debug: softlockup looping fix
> > 
> > yup.  It was fetched less than 24 hours ago.
> 
> does the patch below improve the situation?
> 

Nope.

But I tested it on mainline, and mainline exhibits the never-powers-off
symptom, whereas ed50d6cbc394cd0966469d3e249353c9dd1d38b9 demonstrates the
powers-off-after-20-seconds symptom.  

So we _may_ be dealing with two bugs here, and your patch might have fixed
the first, but that success is obscured by the second.  I guess I need to
prepare a tree which has ed50d6cbc394cd0966469d3e249353c9dd1d38b9 at its
tip.  (Wonders how to do that).  

btw, mainline (plus this patch, not that it changed anything) prints

<stopping disk stuff>
Disabling non-boot CPUs
CPU 1 is now offline

and that's it.   This machine has eight cpus.  Might be a hint?



* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
  2008-02-07  0:31         ` Andrew Morton
@ 2008-02-07  0:47           ` Andrew Morton
  2008-02-07  0:51           ` Ingo Molnar
  1 sibling, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2008-02-07  0:47 UTC (permalink / raw)
  To: mingo, a.p.zijlstra, linux-kernel

On Wed, 6 Feb 2008 16:31:11 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 7 Feb 2008 01:04:25 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > 
> > * Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > > > Does that kernel have:
> > > > 
> > > > commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
> > > > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > > Date:   Sat Feb 2 00:23:08 2008 +0100
> > > > 
> > > >     debug: softlockup looping fix
> > > 
> > > yup.  It was fetched less than 24 hours ago.
> > 
> > does the patch below improve the situation?
> > 
> 
> Nope.
> 
> But I tested it on mainline, and mainline exhibits the never-powers-off
> symptom, whereas ed50d6cbc394cd0966469d3e249353c9dd1d38b9 demonstrates the
> powers-off-after-20-seconds symptom.  
> 
> So we _may_ be dealing with two bugs here, and your patch might have fixed
> the first, but that success is obscured by the second.  I guess I need to
> prepare a tree which has ed50d6cbc394cd0966469d3e249353c9dd1d38b9 at its
> tip.  (Wonders how to do that).  

OK, I did this (tested on a ed50d6cbc394cd0966469d3e249353c9dd1d38b9-tipped
tree) and again, the patch made no difference: the machine still pauses
20-odd seconds before (correctly) powering off.



* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
  2008-02-07  0:31         ` Andrew Morton
  2008-02-07  0:47           ` Andrew Morton
@ 2008-02-07  0:51           ` Ingo Molnar
  2008-02-07  1:12             ` Andrew Morton
  1 sibling, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2008-02-07  0:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: a.p.zijlstra, linux-kernel, Gautham R Shenoy


* Andrew Morton <akpm@linux-foundation.org> wrote:

> Nope.
> 
> But I tested it on mainline, and mainline exhibits the 
> never-powers-off symptom, whereas 
> ed50d6cbc394cd0966469d3e249353c9dd1d38b9 demonstrates the 
> powers-off-after-20-seconds symptom.
> 
> So we _may_ be dealing with two bugs here, and your patch might have 
> fixed the first, but that success is obscured by the second.  I guess 
> I need to prepare a tree which has 
> ed50d6cbc394cd0966469d3e249353c9dd1d38b9 at its tip.  (Wonders how to 
> do that).

the way i do it in bisection is to do:

  mkdir patches
  git-log -1 -p ed50d6cbc394cd0966469d3 > patches/fix.patch
  echo fix.patch > patches/series

and then before testing a bisection point, i do a 'quilt push'. Before 
telling git-bisect about the quality of that bisection point (good/bad) 
i pop it off via 'quilt pop'.

this way the 'required fix' can be kept during the bisection, to find 
the secondary bug.

> btw, mainline (plus this patch, not that it changed anything) prints
> 
> <stopping disk stuff>
> Disabling non-boot CPUs
> CPU 1 is now offline
> 
> and that's it.   This machine has eight cpus.  Might be a hint?

what should be the proper message?

my suspects, besides there being something wrong in the hung-tasks code 
of the softlockup watchdog, would be the cpu-hotplug commits, or some 
arch/x86 commit. (although we didn't really have anything specifically 
touching the reboot path)

does a stupid patch like the one below tell you more about what the 
other CPUs are doing during this hang? [32-bit only patch]

	Ingo

---
 arch/i386/kernel/nmi.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux/arch/i386/kernel/nmi.c
===================================================================
--- linux.orig/arch/x86/kernel/nmi_64.c
+++ linux/arch/x86/kernel/nmi_64.c
@@ -331,6 +331,14 @@ __kprobes int nmi_watchdog_tick(struct p
 	int touched = 0;
 	int cpu = smp_processor_id();
 	int rc=0;
+	static int count[NR_CPUS];
+
+	if (!count[cpu]) {
+		count[cpu] = nmi_hz;
+		printk("CPU#%d, tick\n", cpu);
+		show_regs(regs);
+	}
+	count[cpu]--;
 
 	/* check for other users first */
 	if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)
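
The counter added by this hunk is a simple per-CPU countdown: it prints on the first tick a CPU sees, re-arms itself to nmi_hz, and stays quiet until the count runs out again. Below is a minimal user-space sketch of that logic on its own, assuming a hypothetical MAX_CPUS constant in place of NR_CPUS and a caller-supplied period in place of nmi_hz. The function name tick_should_report() is also made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CPUS 8           /* hypothetical stand-in for NR_CPUS */

/* zero-initialised, like the static array in the patch */
static int count[MAX_CPUS];

/*
 * Returns true when the caller should print on this tick: on the first
 * call for a given CPU, and then once every `period` calls after that.
 */
static bool tick_should_report(int cpu, int period)
{
	bool report = false;

	assert(cpu >= 0 && cpu < MAX_CPUS);
	assert(period > 0);

	if (!count[cpu]) {
		count[cpu] = period;    /* re-arm the countdown */
		report = true;
	}
	count[cpu]--;

	return report;
}
```

With a period of 3, calls 1, 4 and 7 for the same CPU return true and the rest return false. Each CPU counts down independently, so one busy CPU does not suppress the report from another.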


* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
  2008-02-07  0:51           ` Ingo Molnar
@ 2008-02-07  1:12             ` Andrew Morton
  0 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2008-02-07  1:12 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: a.p.zijlstra, linux-kernel, ego

On Thu, 7 Feb 2008 01:51:10 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > Nope.
> > 
> > But I tested it on mainline, and mainline exhibits the 
> > never-powers-off symptom, whereas 
> > ed50d6cbc394cd0966469d3e249353c9dd1d38b9 demonstrates the 
> > powers-off-after-20-seconds symptom.
> > 
> > So we _may_ be dealing with two bugs here, and your patch might have 
> > fixed the first, but that success is obscured by the second.  I guess 
> > I need to prepare a tree which has 
> > ed50d6cbc394cd0966469d3e249353c9dd1d38b9 at its tip.  (Wonders how to 
> > do that).
> 
> the way i do it in bisection is to do:
> 
>   mkdir patches
>   git-log -1 -p ed50d6cbc394cd0966469d3 > patches/fix.patch
>   echo fix.patch > patches/series
> 
> and then before testing a bisection point, i do a 'quilt push'. Before 
> telling git-bisect about the quality of that bisection point (good/bad) 
> i pop it off via 'quilt pop'.
> 
> this way the 'required fix' can be kept during the bisection, to find 
> the secondary bug.
> 
> > btw, mainline (plus this patch, not that it changed anything) prints
> > 
> > <stopping disk stuff>
> > Disabling non-boot CPUs
> > CPU 1 is now offline
> > 
> > and that's it.   This machine has eight cpus.  Might be a hint?
> 
> what should be the proper message?

Seems that it should be a stream of eight

CPU n is now offline
CPU n down

> my suspects, besides there being something wrong in the hung-tasks code 
> of the softlockup watchdog, would be the cpu-hotplug commits, or some 
> arch/x86 commit. (although we didn't really have anything specifically 
> touching the reboot path)
> 
> does a stupid patch like the one below tell you more about what the 
> other CPUs are doing during this hang? [32-bit only patch]
> 
> 	Ingo
> 
> ---
>  arch/i386/kernel/nmi.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> Index: linux/arch/i386/kernel/nmi.c
> ===================================================================
> --- linux.orig/arch/x86/kernel/nmi_64.c
> +++ linux/arch/x86/kernel/nmi_64.c
> @@ -331,6 +331,14 @@ __kprobes int nmi_watchdog_tick(struct p
>  	int touched = 0;
>  	int cpu = smp_processor_id();
>  	int rc=0;
> +	static int count[NR_CPUS];
> +
> +	if (!count[cpu]) {
> +		count[cpu] = nmi_hz;
> +		printk("CPU#%d, tick\n", cpu);
> +		show_regs(regs);
> +	}
> +	count[cpu]--;
>  
>  	/* check for other users first */
>  	if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)

I reworked that on top of ed50d6cbc394cd0966469d3e249353c9dd1d38b9: no
change.

However, I watched the VGA console this time (nothing is coming over
netconsole at this stage) and saw this:


CPU 1 is now offline
<10 second pause>
CPU 1 is down
CPU 2 is now offline
CPU 2 is down
CPU 3 is now offline
CPU 3 is down
CPU 4 is now offline
<10 second pause>

followed by a quick spew of the remaining CPUs going down and offline then
poweroff.

