LKML Archive on lore.kernel.org
* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
[not found] <200801252259.m0PMxHmD012059@hera.kernel.org>
@ 2008-02-06 0:46 ` Andrew Morton
2008-02-06 14:50 ` Peter Zijlstra
0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2008-02-06 0:46 UTC
To: Ingo Molnar; +Cc: Linux Kernel Mailing List
On Fri, 25 Jan 2008 22:59:17 GMT
Linux Kernel Mailing List <linux-kernel@vger.kernel.org> wrote:
> Gitweb: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=82a1fcb90287052aabfa235e7ffc693ea003fe69
> Commit: 82a1fcb90287052aabfa235e7ffc693ea003fe69
> Parent: d0d23b5432fe61229dd3641c5e94d4130bc4e61b
> Author: Ingo Molnar <mingo@elte.hu>
> AuthorDate: Fri Jan 25 21:08:02 2008 +0100
> Committer: Ingo Molnar <mingo@elte.hu>
> CommitDate: Fri Jan 25 21:08:02 2008 +0100
>
> softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
One of my test boxes (an 8-way x86_64 software-development thing from Intel
- I'm not sure what's inside it) no longer powers itself off when I run `halt
-pfn'.
During bisection I found two different problems. Sometimes the machine
wouldn't power off at all. Other times it would power off after a pause of
around twenty seconds.
Bisection indicates that this commit is what caused the 20-second pause.
It could be that some later commit caused the infinity-second pause.
* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
2008-02-06 0:46 ` softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks Andrew Morton
@ 2008-02-06 14:50 ` Peter Zijlstra
2008-02-06 18:05 ` Andrew Morton
0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2008-02-06 14:50 UTC
To: Andrew Morton; +Cc: Ingo Molnar, Linux Kernel Mailing List
On Tue, 2008-02-05 at 16:46 -0800, Andrew Morton wrote:
> On Fri, 25 Jan 2008 22:59:17 GMT
> Linux Kernel Mailing List <linux-kernel@vger.kernel.org> wrote:
>
> > Gitweb: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=82a1fcb90287052aabfa235e7ffc693ea003fe69
> > Commit: 82a1fcb90287052aabfa235e7ffc693ea003fe69
> > Parent: d0d23b5432fe61229dd3641c5e94d4130bc4e61b
> > Author: Ingo Molnar <mingo@elte.hu>
> > AuthorDate: Fri Jan 25 21:08:02 2008 +0100
> > Committer: Ingo Molnar <mingo@elte.hu>
> > CommitDate: Fri Jan 25 21:08:02 2008 +0100
> >
> > softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
>
> One of my test boxes (an 8-way x86_64 software-development thing from Intel
> - I'm not sure what's inside it) no longer powers itself off when I run `halt
> -pfn'.
>
> During bisection I found two different problems. Sometimes the machine
> wouldn't power off at all. Other times it would power off after a pause of
> around twenty seconds.
>
> Bisection indicates that this commit is what caused the 20-second pause.
> It could be that some later commit caused the infinity-second pause.
Does that kernel have:
commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Sat Feb 2 00:23:08 2008 +0100
debug: softlockup looping fix
* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
2008-02-06 14:50 ` Peter Zijlstra
@ 2008-02-06 18:05 ` Andrew Morton
2008-02-07 0:04 ` Ingo Molnar
0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2008-02-06 18:05 UTC
To: Peter Zijlstra; +Cc: Ingo Molnar, Linux Kernel Mailing List
On Wed, 06 Feb 2008 15:50:02 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> On Tue, 2008-02-05 at 16:46 -0800, Andrew Morton wrote:
> > On Fri, 25 Jan 2008 22:59:17 GMT
> > Linux Kernel Mailing List <linux-kernel@vger.kernel.org> wrote:
> >
> > > Gitweb: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=82a1fcb90287052aabfa235e7ffc693ea003fe69
> > > Commit: 82a1fcb90287052aabfa235e7ffc693ea003fe69
> > > Parent: d0d23b5432fe61229dd3641c5e94d4130bc4e61b
> > > Author: Ingo Molnar <mingo@elte.hu>
> > > AuthorDate: Fri Jan 25 21:08:02 2008 +0100
> > > Committer: Ingo Molnar <mingo@elte.hu>
> > > CommitDate: Fri Jan 25 21:08:02 2008 +0100
> > >
> > > softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
> >
> > One of my test boxes (an 8-way x86_64 software-development thing from Intel
> > - I'm not sure what's inside it) no longer powers itself off when I run `halt
> > -pfn'.
> >
> > During bisection I found two different problems. Sometimes the machine
> > wouldn't power off at all. Other times it would power off after a pause of
> > around twenty seconds.
> >
> > Bisection indicates that this commit is what caused the 20-second pause.
> > It could be that some later commit caused the infinity-second pause.
>
>
> Does that kernel have:
>
> commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Sat Feb 2 00:23:08 2008 +0100
>
> debug: softlockup looping fix
>
>
yup. It was fetched less than 24 hours ago.
* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
2008-02-06 18:05 ` Andrew Morton
@ 2008-02-07 0:04 ` Ingo Molnar
2008-02-07 0:31 ` Andrew Morton
0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2008-02-07 0:04 UTC
To: Andrew Morton; +Cc: Peter Zijlstra, Linux Kernel Mailing List
* Andrew Morton <akpm@linux-foundation.org> wrote:
> > Does that kernel have:
> >
> > commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
> > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Date: Sat Feb 2 00:23:08 2008 +0100
> >
> > debug: softlockup looping fix
>
> yup. It was fetched less than 24 hours ago.
does the patch below improve the situation?
Ingo
---
arch/x86/kernel/reboot.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
Index: linux-x86.q/arch/x86/kernel/reboot.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/reboot.c
+++ linux-x86.q/arch/x86/kernel/reboot.c
@@ -396,8 +396,20 @@ void machine_shutdown(void)
if (!cpu_isset(reboot_cpu_id, cpu_online_map))
reboot_cpu_id = smp_processor_id();
- /* Make certain I only run on the appropriate processor */
- set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+ /*
+ * Make certain we only run on the appropriate processor,
+ * and with sufficient priority:
+ */
+ {
+ struct sched_param schedparm;
+ int ret;
+
+ schedparm.sched_priority = 99;
+ ret = sched_setscheduler(current, SCHED_RR, &schedparm);
+ WARN_ON_ONCE(ret);
+
+ set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+ }
/* O.K Now that I'm on the appropriate processor,
* stop all of the others.
* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
2008-02-07 0:04 ` Ingo Molnar
@ 2008-02-07 0:31 ` Andrew Morton
2008-02-07 0:47 ` Andrew Morton
2008-02-07 0:51 ` Ingo Molnar
0 siblings, 2 replies; 8+ messages in thread
From: Andrew Morton @ 2008-02-07 0:31 UTC
To: Ingo Molnar; +Cc: a.p.zijlstra, linux-kernel
On Thu, 7 Feb 2008 01:04:25 +0100
Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > > Does that kernel have:
> > >
> > > commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
> > > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > Date: Sat Feb 2 00:23:08 2008 +0100
> > >
> > > debug: softlockup looping fix
> >
> > yup. It was fetched less than 24 hours ago.
>
> does the patch below improve the situation?
>
Nope.
But I tested it on mainline, and mainline exhibits the never-powers-off
symptom, whereas ed50d6cbc394cd0966469d3e249353c9dd1d38b9 demonstrates the
powers-off-after-20-seconds symptom.
So we _may_ be dealing with two bugs here, and your patch might have fixed
the first, but that success is obscured by the second. I guess I need to
prepare a tree which has ed50d6cbc394cd0966469d3e249353c9dd1d38b9 at its
tip. (Wonders how to do that).
btw, mainline (plus this patch, not that it changed anything) prints
<stopping disk stuff>
Disabling non-boot CPUs
CPU 1 is now offline
and that's it. This machine has eight cpus. Might be a hint?
* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
2008-02-07 0:31 ` Andrew Morton
@ 2008-02-07 0:47 ` Andrew Morton
2008-02-07 0:51 ` Ingo Molnar
1 sibling, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2008-02-07 0:47 UTC
To: mingo, a.p.zijlstra, linux-kernel
On Wed, 6 Feb 2008 16:31:11 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 7 Feb 2008 01:04:25 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
>
> >
> > * Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > > > Does that kernel have:
> > > >
> > > > commit ed50d6cbc394cd0966469d3e249353c9dd1d38b9
> > > > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > > Date: Sat Feb 2 00:23:08 2008 +0100
> > > >
> > > > debug: softlockup looping fix
> > >
> > > yup. It was fetched less than 24 hours ago.
> >
> > does the patch below improve the situation?
> >
>
> Nope.
>
> But I tested it on mainline, and mainline exhibits the never-powers-off
> symptom, whereas ed50d6cbc394cd0966469d3e249353c9dd1d38b9 demonstrates the
> powers-off-after-20-seconds symptom.
>
> So we _may_ be dealing with two bugs here, and your patch might have fixed
> the first, but that success is obscured by the second. I guess I need to
> prepare a tree which has ed50d6cbc394cd0966469d3e249353c9dd1d38b9 at its
> tip. (Wonders how to do that).
OK, I did this (tested on a ed50d6cbc394cd0966469d3e249353c9dd1d38b9-tipped
tree) and again, the patch made no difference: the machine still pauses
20-odd seconds before (correctly) powering off.
* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
2008-02-07 0:31 ` Andrew Morton
2008-02-07 0:47 ` Andrew Morton
@ 2008-02-07 0:51 ` Ingo Molnar
2008-02-07 1:12 ` Andrew Morton
1 sibling, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2008-02-07 0:51 UTC
To: Andrew Morton; +Cc: a.p.zijlstra, linux-kernel, Gautham R Shenoy
* Andrew Morton <akpm@linux-foundation.org> wrote:
> Nope.
>
> But I tested it on mainline, and mainline exhibits the
> never-powers-off symptom, whereas
> ed50d6cbc394cd0966469d3e249353c9dd1d38b9 demonstrates the
> powers-off-after-20-seconds symptom.
>
> So we _may_ be dealing with two bugs here, and your patch might have
> fixed the first, but that success is obscured by the second. I guess
> I need to prepare a tree which has
> ed50d6cbc394cd0966469d3e249353c9dd1d38b9 at its tip. (Wonders how to
> do that).
the way i do it in bisection is to do:
mkdir patches
git-log -1 -p ed50d6cbc394cd0966469d3 > patches/fix.patch
echo fix.patch > patches/series
and then before testing a bisection point, i do a 'quilt push'. Before
telling git-bisect about the quality of that bisection point (good/bad)
i pop it off via 'quilt pop'.
this way the 'required fix' can be kept during the bisection, to find
the secondary bug.
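The push/test/pop cycle described above can be sketched as a small shell helper. This is only an illustration, not a script anyone in this thread ran: the function names are made up, and the command passed to test_bisect_point stands in for whatever build-and-power-off check is actually used. It also spells the command `git log`, since current git no longer installs the dashed `git-log` form in PATH.

```shell
#!/bin/sh
# Sketch: keep a known-required fix applied while bisecting for a second bug.
# Assumptions (not from the thread): the helper names below are hypothetical,
# and the caller supplies the command that judges a bisection point.

FIX=ed50d6cbc394cd0966469d3e249353c9dd1d38b9

# One-time setup: export the fix as a one-patch quilt series.
setup_fix_series() {
    mkdir -p patches &&
    git log -1 -p "$FIX" > patches/fix.patch &&
    echo fix.patch > patches/series
}

# Test one bisection point with the fix temporarily applied.
# "$@" is the command that decides good (exit 0) or bad (non-zero).
test_bisect_point() {
    quilt push || { echo "fix does not apply at this point" >&2; return 1; }
    result=0
    "$@" || result=$?
    quilt pop        # restore the pristine tree before marking the commit
    if [ "$result" -eq 0 ]; then
        git bisect good
    else
        git bisect bad
    fi
}
```

After `git bisect start <bad> <good>` and one call to setup_fix_series, each step would be something like `test_bisect_point ./my-poweroff-check`, where my-poweroff-check is a placeholder. If the fix is already part of the tree at some bisection point, `quilt push` will fail and that point has to be tested by hand.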
> btw, mainline (plus this patch, not that it changed anything) prints
>
> <stopping disk stuff>
> Disabling non-boot CPUs
> CPU 1 is now offline
>
> and that's it. This machine has eight cpus. Might be a hint?
what should be the proper message?
my suspects, besides there being something wrong in the hung-tasks code
of the softlockup watchdog, would be the cpu-hotplug commits, or some
arch/x86 commit. (although we didn't really have anything specifically
touching the reboot path)
does a stupid patch like the one below tell you more about what the
other CPUs are doing during this hang? [32-bit only patch]
Ingo
---
arch/i386/kernel/nmi.c | 8 ++++++++
1 file changed, 8 insertions(+)
Index: linux/arch/i386/kernel/nmi.c
===================================================================
--- linux.orig/arch/x86/kernel/nmi_64.c
+++ linux/arch/x86/kernel/nmi_64.c
@@ -331,6 +331,14 @@ __kprobes int nmi_watchdog_tick(struct p
int touched = 0;
int cpu = smp_processor_id();
int rc=0;
+ static int count[NR_CPUS];
+
+ if (!count[cpu]) {
+ count[cpu] = nmi_hz;
+ printk("CPU#%d, tick\n", cpu);
+ show_regs(regs);
+ }
+ count[cpu]--;
/* check for other users first */
if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)
* Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
2008-02-07 0:51 ` Ingo Molnar
@ 2008-02-07 1:12 ` Andrew Morton
0 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2008-02-07 1:12 UTC
To: Ingo Molnar; +Cc: a.p.zijlstra, linux-kernel, ego
On Thu, 7 Feb 2008 01:51:10 +0100
Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > Nope.
> >
> > But I tested it on mainline, and mainline exhibits the
> > never-powers-off symptom, whereas
> > ed50d6cbc394cd0966469d3e249353c9dd1d38b9 demonstrates the
> > powers-off-after-20-seconds symptom.
> >
> > So we _may_ be dealing with two bugs here, and your patch might have
> > fixed the first, but that success is obscured by the second. I guess
> > I need to prepare a tree which has
> > ed50d6cbc394cd0966469d3e249353c9dd1d38b9 at its tip. (Wonders how to
> > do that).
>
> the way i do it in bisection is to do:
>
> mkdir patches
> git-log -1 -p ed50d6cbc394cd0966469d3 > patches/fix.patch
> echo fix.patch > patches/series
>
> and then before testing a bisection point, i do a 'quilt push'. Before
> telling git-bisect about the quality of that bisection point (good/bad)
> i pop it off via 'quilt pop'.
>
> this way the 'required fix' can be kept during the bisection, to find
> the secondary bug.
>
> > btw, mainline (plus this patch, not that it changed anything) prints
> >
> > <stopping disk stuff>
> > Disabling non-boot CPUs
> > CPU 1 is now offline
> >
> > and that's it. This machine has eight cpus. Might be a hint?
>
> what should be the proper message?
Seems that it should be a stream of eight
CPU n is now offline
CPU n down
> my suspects, besides there being something wrong in the hung-tasks code
> of the softlockup watchdog, would be the cpu-hotplug commits, or some
> arch/x86 commit. (although we didn't really have anything specifically
> touching the reboot path)
>
> does a stupid patch like the one below tell you more about what the
> other CPUs are doing during this hang? [32-bit only patch]
>
> Ingo
>
> ---
> arch/i386/kernel/nmi.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> Index: linux/arch/i386/kernel/nmi.c
> ===================================================================
> --- linux.orig/arch/x86/kernel/nmi_64.c
> +++ linux/arch/x86/kernel/nmi_64.c
> @@ -331,6 +331,14 @@ __kprobes int nmi_watchdog_tick(struct p
> int touched = 0;
> int cpu = smp_processor_id();
> int rc=0;
> + static int count[NR_CPUS];
> +
> + if (!count[cpu]) {
> + count[cpu] = nmi_hz;
> + printk("CPU#%d, tick\n", cpu);
> + show_regs(regs);
> + }
> + count[cpu]--;
>
> /* check for other users first */
> if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)
I reworked that on top of ed50d6cbc394cd0966469d3e249353c9dd1d38b9: no
change.
However, this time I watched the VGA console (nothing is coming over
netconsole at this stage) and saw this:
CPU 1 is now offline
<10 second pause>
CPU 1 is down
CPU 2 is now offline
CPU 2 is down
CPU 3 is now offline
CPU 3 is down
CPU 4 is now offline
<10 second pause>
followed by a quick spew of the remaining CPUs going offline and down, then
poweroff.
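
Since the pauses show up while individual CPUs are being taken down, one possible way to look at this outside the shutdown path is to offline the non-boot CPUs one at a time through sysfs and time each step. The following is a rough sketch under assumptions that are not from this thread: a kernel built with CPU hotplug support, root privileges, and the per-CPU `online` files under /sys/devices/system/cpu. The CPU_SYSFS variable and the function name are invented here; the variable exists only so the loop can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch: offline each hot-pluggable CPU via sysfs and report how long it took.
# Assumes CPU hotplug support and root; CPU_SYSFS is a hypothetical override.

CPU_SYSFS=${CPU_SYSFS:-/sys/devices/system/cpu}

time_cpu_offline() {
    for ctl in "$CPU_SYSFS"/cpu[0-9]*/online; do
        [ -e "$ctl" ] || continue      # glob matched nothing
        cpu=${ctl%/online}             # strip the trailing /online
        cpu=${cpu##*/}                 # keep only the cpuN component
        start=$(date +%s)
        echo 0 > "$ctl" || { echo "$cpu: offline failed" >&2; continue; }
        end=$(date +%s)
        echo "$cpu: offline took $((end - start))s"
    done
}
```

Writing 1 back to the same files brings the CPUs online again. The timing has one-second resolution, which is coarse but enough to tell a roughly ten-second stall from a normal offline.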