LKML Archive on lore.kernel.org
* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
@ 2011-02-14 13:36 Preeti Khurana
  2011-02-17  0:17 ` Ryan Underwood
  0 siblings, 1 reply; 26+ messages in thread
From: Preeti Khurana @ 2011-02-14 13:36 UTC (permalink / raw)
  To: linux-kernel

I am seeing an issue similar to the one reported in https://lkml.org/lkml/2011/2/10/187

Can someone tell me whether this is the same issue? I am seeing the problem on an Intel Xeon.

Thanks
Preeti


* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-14 13:36 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0 Preeti Khurana
@ 2011-02-17  0:17 ` Ryan Underwood
  2011-02-17  7:59   ` Cyrill Gorcunov
  0 siblings, 1 reply; 26+ messages in thread
From: Ryan Underwood @ 2011-02-17  0:17 UTC (permalink / raw)
  To: linux-kernel

Preeti Khurana <Preeti.Khurana <at> guavus.com> writes:

> 
> I am getting the similar issue as reported
> in https://lkml.org/lkml/2011/2/10/187
> 
> Can someone tell me if the same issue  because I am getting the
> problem on Intel Xeon..
> 

I am seeing exactly the same problem (on 2.6.35 as Preeti reported originally)
on some Xeon servers but only with recently shipped BIOS revisions. The OS is
CentOS 5.5.

In my case, the system sometimes hangs with no message, sometimes with an NMI
message immediately before hanging, and sometimes with a long backtrace
originating at cpu_idle().  The NMI reason code varies, but in my observation
it is usually 21 or 31.

The problem seems to be triggered by accessing a PCI card (via MMIO): until
the card is accessed, the system runs indefinitely with no problems.

Other servers of exactly the same model (Intel SR2500) but older BIOS revision
are working (working is 3/14/2008, non working is 3/9/2010).  All software is
identical in these cases.

Also, in one instance, kernel v2.6.18 is used on these servers with the
3/14/2008 BIOS revision without a problem.  The rest of the software is again
the same (except for kernel and drivers).

It seems to be a problem with newer kernels combined with the newer Intel BIOS.
I have not tried an older kernel on the newer BIOS yet.

I have not tried the following patches yet which seem to both be for spurious
NMI messages, not accompanied by system lockups:

https://lkml.org/lkml/2011/2/16/106
https://lkml.org/lkml/2011/2/1/286

Neither nmi_watchdog=0 nor pcie_aspm=off solves the problem.

I am not subscribed so please Cc me.



* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-17  0:17 ` Ryan Underwood
@ 2011-02-17  7:59   ` Cyrill Gorcunov
  2011-02-18  2:40     ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: Cyrill Gorcunov @ 2011-02-17  7:59 UTC (permalink / raw)
  To: Ryan Underwood; +Cc: linux-kernel

On Thu, Feb 17, 2011 at 3:17 AM, Ryan Underwood
<ryan.underwood@flightsafety.com> wrote:
> Preeti Khurana <Preeti.Khurana <at> guavus.com> writes:
>
>>
>> I am getting the similar issue as reported
>> in https://lkml.org/lkml/2011/2/10/187
>>
>> Can someone tell me if the same issue  because I am getting the
>> problem on Intel Xeon..
>>
>
> I am seeing exactly the same problem (on 2.6.35 as Preeti reported originally)
> on some Xeon servers but only with recently shipped BIOS revisions. The OS is
> CentOS 5.5.
>
...
> I have not tried the following patches yet which seem to both be for spurious
> NMI messages, not accompanied by system lockups:
>
> https://lkml.org/lkml/2011/2/16/106
> https://lkml.org/lkml/2011/2/1/286
>
> Both nmi_watchdog=0 and pcie_aspm=off options do not solve the problem.
>
> I am not subscribed so please Cc me.
>

Since nmi_watchdog=0 didn't help, I believe it's a different issue, unrelated
to the 'perf' patches you mentioned; help from the RCU people is probably
needed here.


* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-17  7:59   ` Cyrill Gorcunov
@ 2011-02-18  2:40     ` Paul E. McKenney
  2011-02-18 20:38       ` Underwood, Ryan
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2011-02-18  2:40 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: Ryan Underwood, linux-kernel

On Thu, Feb 17, 2011 at 10:59:43AM +0300, Cyrill Gorcunov wrote:
> On Thu, Feb 17, 2011 at 3:17 AM, Ryan Underwood
> <ryan.underwood@flightsafety.com> wrote:
> > Preeti Khurana <Preeti.Khurana <at> guavus.com> writes:
> >
> >>
> >> I am getting the similar issue as reported
> >> in https://lkml.org/lkml/2011/2/10/187
> >>
> >> Can someone tell me if the same issue  because I am getting the
> >> problem on Intel Xeon..
> >>
> >
> > I am seeing exactly the same problem (on 2.6.35 as Preeti reported originally)
> > on some Xeon servers but only with recently shipped BIOS revisions. The OS is
> > CentOS 5.5.
> >
> ...
> > I have not tried the following patches yet which seem to both be for spurious
> > NMI messages, not accompanied by system lockups:
> >
> > https://lkml.org/lkml/2011/2/16/106
> > https://lkml.org/lkml/2011/2/1/286
> >
> > Both nmi_watchdog=0 and pcie_aspm=off options do not solve the problem.
> >
> > I am not subscribed so please Cc me.

Given 2.6.35, has anyone tried applying the following patch?

	https://patchwork.kernel.org/patch/23985/

It turned out to resolve an otherwise mysterious RCU CPU stall warning
for someone running 2.6.36, IIRC.

							Thanx, Paul


* RE: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-18  2:40     ` Paul E. McKenney
@ 2011-02-18 20:38       ` Underwood, Ryan
  2011-02-21  6:56         ` Preeti Khurana
  0 siblings, 1 reply; 26+ messages in thread
From: Underwood, Ryan @ 2011-02-18 20:38 UTC (permalink / raw)
  To: paulmck, Cyrill Gorcunov, Preeti Khurana; +Cc: linux-kernel

> 
> Given 2.6.35, has anyone tried applying the following patch?
> 
> 	https://patchwork.kernel.org/patch/23985/
> 
> It turned out to resolve an otherwise mysterious RCU CPU stall warning
> for someone running 2.6.36, IIRC.
> 

Now I've tried 2.6.38-rc5 which already includes that patch, and the
same problems remain.  It also includes the following patch that Preeti
seems to have had some success with on 2.6.35, so my problem must really
be elsewhere:
https://lkml.org/lkml/2011/1/6/131

Since a previous BIOS version is known to work I may end up having to
do some BIOS-bisecting today...



* RE: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-18 20:38       ` Underwood, Ryan
@ 2011-02-21  6:56         ` Preeti Khurana
  2011-02-21 16:45           ` Underwood, Ryan
  0 siblings, 1 reply; 26+ messages in thread
From: Preeti Khurana @ 2011-02-21  6:56 UTC (permalink / raw)
  To: Underwood, Ryan, paulmck, Cyrill Gorcunov; +Cc: linux-kernel

> >
> > Given 2.6.35, has anyone tried applying the following patch?
> >
> > 	https://patchwork.kernel.org/patch/23985/
> >
> > It turned out to resolve an otherwise mysterious RCU CPU stall warning
> > for someone running 2.6.36, IIRC.
> >
> 
	This fix is already in 2.6.35, so this doesn't seem to be an issue.

> Now I've tried 2.6.38-rc5 which already includes that patch, and the same
> problems remain.  It also includes the following patch that Preeti seems to
> have had some success with on 2.6.35, so my problem must really be
> elsewhere:
> https://lkml.org/lkml/2011/1/6/131
> 
> Since a previous BIOS version is known to work I may end up having to do
> some BIOS-bisecting today...

Ryan,
   I can't say whether this patch (https://lkml.org/lkml/2011/1/6/131) worked
for me, since I am not able to reproduce the problem reliably, and I am now
not seeing it even under the original kernel without this patch.  I am still
wondering what triggers this problem.



* RE: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-21  6:56         ` Preeti Khurana
@ 2011-02-21 16:45           ` Underwood, Ryan
  0 siblings, 0 replies; 26+ messages in thread
From: Underwood, Ryan @ 2011-02-21 16:45 UTC (permalink / raw)
  To: Preeti Khurana, paulmck, Cyrill Gorcunov; +Cc: linux-kernel

> > Since a previous BIOS version is known to work I may end up having to do
> > some BIOS-bisecting today...
> 
> Ryan,
>    Cant say that this patch (https://lkml.org/lkml/2011/1/6/131) worked
> for me since I am not able to reproduce the problem quite reliably and now
> not getting the problem even under the original kernel without this patch.
> Just wondering what triggers this problem.

I found that even with downgrading to the same BIOS as the working systems,
the problem on the newer SR2500 systems remains!  There must have been a
recent change in the hardware causing this, or some arcane BIOS setting
that I am overlooking.  I still need to rule out our PCI hardware as the
source of the problem, since PCI parity errors seem to be a usual source
of NMIs, but I thought there would be a standard error code in that case...
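
For reference, the "reason" value in these messages is the status byte the
kernel reads from I/O port 0x61 when an NMI arrives.  The sketch below assumes
the conventional PC/AT layout of that port, where bit 7 reports a system error
(SERR#/memory parity) and bit 6 reports an I/O channel check; the macro and
function names are hypothetical, not the kernel's.  Whether a PCI parity error
on a given board actually sets bit 7 depends on how the chipset routes SERR#,
so treat that as an assumption as well.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the two status bits of the assumed port 0x61 layout. */
#define REASON_SERR   0x80u	/* bit 7: system error / memory parity */
#define REASON_IOCHK  0x40u	/* bit 6: I/O channel check */

/*
 * Returns true if the reason byte carries one of the two status bits that
 * the legacy NMI path can attribute to a hardware error.  When neither bit
 * is set, the kernel has nothing to blame and prints the "unknown reason"
 * message instead.
 */
static bool nmi_reason_is_known(uint8_t reason)
{
	return (reason & (REASON_SERR | REASON_IOCHK)) != 0;
}
```

Every code quoted in this thread (21, 31, 2d, 3c, 3d) has both bits clear,
which is consistent with the NMIs being reported as "unknown" rather than as
a parity or I/O check error.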



* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-17  2:56           ` Dave Airlie
@ 2011-02-17  7:48             ` Cyrill Gorcunov
  0 siblings, 0 replies; 26+ messages in thread
From: Cyrill Gorcunov @ 2011-02-17  7:48 UTC (permalink / raw)
  To: Dave Airlie
  Cc: George Spelvin, a.p.zijlstra, dzickus, eranian, linux-kernel,
	ming.m.lin, mingo

On Thu, Feb 17, 2011 at 5:56 AM, Dave Airlie <airlied@gmail.com> wrote:
>>
>> It's appended below for your convenience.  Are you using this
>> unsuccessfully?
>
> This patch quoted below fixes it for me.
>
> No more spurious NMIs on my P4.
>
> Tested-by: Dave Airlie <airlied@redhat.com>
>

Thanks Dave! Ingo has merged it into the -urgent branch, so it should
reach mainline soon.


* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-16 11:57         ` George Spelvin
@ 2011-02-17  2:56           ` Dave Airlie
  2011-02-17  7:48             ` Cyrill Gorcunov
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Airlie @ 2011-02-17  2:56 UTC (permalink / raw)
  To: George Spelvin
  Cc: gorcunov, a.p.zijlstra, dzickus, eranian, linux-kernel,
	ming.m.lin, mingo

>
> It's appended below for your convenience.  Are you using this
> unsuccessfully?

This patch quoted below fixes it for me.

No more spurious NMIs on my P4.

Tested-by: Dave Airlie <airlied@redhat.com>

>
>
> From: Cyrill Gorcunov <gorcunov@openvz.org>
> Subject: [PATCH] perf, x86: P4 PMU -- Fix unflagged overflows test
>
> A couple of people have reported an unknown NMI issue on p4 pmu.
> This patch should fix it.
>
> Reported-by: George Spelvin <linux@horizon.com>
> Reported-by: Meelis Roos <mroos@linux.ee>
> Reported-by: Don Zickus <dzickus@redhat.com>
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: Lin Ming <ming.m.lin@intel.com>
> CC: Don Zickus <dzickus@redhat.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  arch/x86/include/asm/perf_event_p4.h |    1 +
>  arch/x86/kernel/cpu/perf_event_p4.c  |   11 ++++++++---
>  2 files changed, 9 insertions(+), 3 deletions(-)
>
> Index: linux-2.6.tip/arch/x86/include/asm/perf_event_p4.h
> ===================================================================
> --- linux-2.6.tip.orig/arch/x86/include/asm/perf_event_p4.h
> +++ linux-2.6.tip/arch/x86/include/asm/perf_event_p4.h
> @@ -22,6 +22,7 @@
>
>  #define ARCH_P4_CNTRVAL_BITS   (40)
>  #define ARCH_P4_CNTRVAL_MASK   ((1ULL << ARCH_P4_CNTRVAL_BITS) - 1)
> +#define ARCH_P4_UNFLAGGED_BIT  ((1ULL) << (ARCH_P4_CNTRVAL_BITS - 1))
>
>  #define P4_ESCR_EVENT_MASK     0x7e000000U
>  #define P4_ESCR_EVENT_SHIFT    25
> Index: linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
> ===================================================================
> --- linux-2.6.tip.orig/arch/x86/kernel/cpu/perf_event_p4.c
> +++ linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
> @@ -770,9 +770,14 @@ static inline int p4_pmu_clear_cccr_ovf(
>                return 1;
>        }
>
> -       /* it might be unflagged overflow */
> -       rdmsrl(hwc->event_base + hwc->idx, v);
> -       if (!(v & ARCH_P4_CNTRVAL_MASK))
> +       /*
> +        * at some circumstances the overflow might issue NMI but did
> +        * not set P4_CCCR_OVF bit so since a counter holds a negative value
> +        * we simply check for high bit being set, if it's cleared it means
> +        * the counter has reached zero value and continued counting before
> +        * real NMI signal was received
> +        */
> +       if (!(v & ARCH_P4_UNFLAGGED_BIT))
>                return 1;
>
>        return 0;
>


* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-16  1:57       ` Dave Airlie
  2011-02-16  4:19         ` Cyrill Gorcunov
@ 2011-02-16 11:57         ` George Spelvin
  2011-02-17  2:56           ` Dave Airlie
  1 sibling, 1 reply; 26+ messages in thread
From: George Spelvin @ 2011-02-16 11:57 UTC (permalink / raw)
  To: airlied, gorcunov
  Cc: a.p.zijlstra, dzickus, eranian, linux-kernel, linux, ming.m.lin, mingo

> Ping on this problem, still seeing
> 
> Uhhuh. NMI received for unknown reason 3c on CPU 0.
> Do you have a strange power saving mode enabled?
> Dazed and confused, but trying to continue
> 
> on my Pentium-D system here with latest Linus head.
> 
> its sometimes 3c, sometimes 3d, I'm going to bisect and push for
> reverts if nobody still has any clue about how to fix this.

The second patch (not the one you quote) fixed it for me.  Almost 8 days
of uptime and no log spam.

It's appended below for your convenience.  Are you using this
unsuccessfully?


From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [PATCH] perf, x86: P4 PMU -- Fix unflagged overflows test

A couple of people have reported an unknown NMI issue on p4 pmu.
This patch should fix it.

Reported-by: George Spelvin <linux@horizon.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: Lin Ming <ming.m.lin@intel.com>
CC: Don Zickus <dzickus@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/include/asm/perf_event_p4.h |    1 +
 arch/x86/kernel/cpu/perf_event_p4.c  |   11 ++++++++---
 2 files changed, 9 insertions(+), 3 deletions(-)

Index: linux-2.6.tip/arch/x86/include/asm/perf_event_p4.h
===================================================================
--- linux-2.6.tip.orig/arch/x86/include/asm/perf_event_p4.h
+++ linux-2.6.tip/arch/x86/include/asm/perf_event_p4.h
@@ -22,6 +22,7 @@
 
 #define ARCH_P4_CNTRVAL_BITS	(40)
 #define ARCH_P4_CNTRVAL_MASK	((1ULL << ARCH_P4_CNTRVAL_BITS) - 1)
+#define ARCH_P4_UNFLAGGED_BIT	((1ULL) << (ARCH_P4_CNTRVAL_BITS - 1))
 
 #define P4_ESCR_EVENT_MASK	0x7e000000U
 #define P4_ESCR_EVENT_SHIFT	25
Index: linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
===================================================================
--- linux-2.6.tip.orig/arch/x86/kernel/cpu/perf_event_p4.c
+++ linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
@@ -770,9 +770,14 @@ static inline int p4_pmu_clear_cccr_ovf(
 		return 1;
 	}
 
-	/* it might be unflagged overflow */
-	rdmsrl(hwc->event_base + hwc->idx, v);
-	if (!(v & ARCH_P4_CNTRVAL_MASK))
+	/*
+	 * at some circumstances the overflow might issue NMI but did
+	 * not set P4_CCCR_OVF bit so since a counter holds a negative value
+	 * we simply check for high bit being set, if it's cleared it means
+	 * the counter has reached zero value and continued counting before
+	 * real NMI signal was received
+	 */
+	if (!(v & ARCH_P4_UNFLAGGED_BIT))
 		return 1;
 
 	return 0;
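
The difference between the old and new tests in the patch above can be
illustrated with a small user-space model.  This is only a sketch: it assumes
a 40-bit counter armed with the negated sampling period (so that it counts up
towards zero), a period smaller than 2^39, and that `v` holds the raw counter
value rather than the CCCR contents.  The helper names are hypothetical; only
the three constants mirror the macros in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define CNTRVAL_BITS   40
#define CNTRVAL_MASK   ((1ULL << CNTRVAL_BITS) - 1)
#define UNFLAGGED_BIT  (1ULL << (CNTRVAL_BITS - 1))

/* Arm the counter with -period, truncated to 40 bits (hypothetical helper). */
static uint64_t arm_counter(uint64_t period)
{
	return (0ULL - period) & CNTRVAL_MASK;
}

/* Advance the counter by n events, wrapping at 40 bits (hypothetical helper). */
static uint64_t count_events(uint64_t v, uint64_t n)
{
	return (v + n) & CNTRVAL_MASK;
}

/* Old test: overflow is detected only if the counter is exactly zero. */
static bool overflowed_old(uint64_t v)
{
	return !(v & CNTRVAL_MASK);
}

/* New test: overflow is detected once the sign bit (bit 39) is clear. */
static bool overflowed_new(uint64_t v)
{
	return !(v & UNFLAGGED_BIT);
}
```

With a period of 1000, both tests agree while the counter is still negative
and at the exact moment it reaches zero.  If three more events are counted
before the NMI handler runs, the counter holds 3: the old test misses the
overflow, while the new test still reports it.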


* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-16 10:09                   ` Ingo Molnar
@ 2011-02-16 11:08                     ` Cyrill Gorcunov
  0 siblings, 0 replies; 26+ messages in thread
From: Cyrill Gorcunov @ 2011-02-16 11:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Airlie, George Spelvin, a.p.zijlstra, dzickus, eranian,
	linux-kernel, ming.m.lin

[-- Attachment #1: Type: text/plain, Size: 626 bytes --]

On Wed, Feb 16, 2011 at 1:09 PM, Ingo Molnar <mingo@elte.hu> wrote:
...
>>
>> For nonkgdb case 'unflagged nmi fix' patch should be enough. i've
>> tested it on non-ht machine by self. without it there is no lockup
>> but only a message about unknown nmi.
>
> Ok, please submit it ASAP then - that ought to address the regression. Please Cc:
> Dave to the patch.
>
> Thanks,
>
>        Ingo
>

Ingo, both patches are already in the thread (attached to a previous mail,
since I have only web access at the moment). Just to be sure, the 'unflagged'
NMI patch is attached again; this is the fix that should be applied to
-tip/master.

[-- Attachment #2: perf-x86-p4-unflagged-nmi --]
[-- Type: application/octet-stream, Size: 2068 bytes --]

From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [PATCH] perf, x86: P4 PMU -- Fix unflagged overflows test

A couple of people have reported an unknown NMI issue on p4 pmu.
This patch should fix it.

Reported-by: George Spelvin <linux@horizon.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Don Zickus <dzickus@redhat.com>
Reported-by: Dave Airlie <airlied@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: Lin Ming <ming.m.lin@intel.com>
CC: Don Zickus <dzickus@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/include/asm/perf_event_p4.h |    1 +
 arch/x86/kernel/cpu/perf_event_p4.c  |   11 ++++++++---
 2 files changed, 9 insertions(+), 3 deletions(-)

Index: linux-2.6.tip/arch/x86/include/asm/perf_event_p4.h
===================================================================
--- linux-2.6.tip.orig/arch/x86/include/asm/perf_event_p4.h
+++ linux-2.6.tip/arch/x86/include/asm/perf_event_p4.h
@@ -22,6 +22,7 @@
 
 #define ARCH_P4_CNTRVAL_BITS	(40)
 #define ARCH_P4_CNTRVAL_MASK	((1ULL << ARCH_P4_CNTRVAL_BITS) - 1)
+#define ARCH_P4_UNFLAGGED_BIT	((1ULL) << (ARCH_P4_CNTRVAL_BITS - 1))
 
 #define P4_ESCR_EVENT_MASK	0x7e000000U
 #define P4_ESCR_EVENT_SHIFT	25
Index: linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
===================================================================
--- linux-2.6.tip.orig/arch/x86/kernel/cpu/perf_event_p4.c
+++ linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
@@ -770,9 +770,14 @@ static inline int p4_pmu_clear_cccr_ovf(
 		return 1;
 	}
 
-	/* it might be unflagged overflow */
-	rdmsrl(hwc->event_base + hwc->idx, v);
-	if (!(v & ARCH_P4_CNTRVAL_MASK))
+	/*
+	 * at some circumstances the overflow might issue NMI but did
+	 * not set P4_CCCR_OVF bit so since a counter holds a negative value
+	 * we simply check for high bit being set, if it's cleared it means
+	 * the counter has reached zero value and continued counting before
+	 * real NMI signal was received
+	 */
+	if (!(v & ARCH_P4_UNFLAGGED_BIT))
 		return 1;
 
 	return 0;


* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-16  9:33                 ` Cyrill Gorcunov
@ 2011-02-16 10:09                   ` Ingo Molnar
  2011-02-16 11:08                     ` Cyrill Gorcunov
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2011-02-16 10:09 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Dave Airlie, George Spelvin, a.p.zijlstra, dzickus, eranian,
	linux-kernel, ming.m.lin


* Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> On 2/16/11, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> >
> >> On Wed, Feb 16, 2011 at 11:37 AM, Ingo Molnar <mingo@elte.hu> wrote:
> >> ...
> >> >> >>
> >> >> >
> >> >> > Ping on this problem, still seeing
> >> >> >
> >> >> > Uhhuh. NMI received for unknown reason 3c on CPU 0.
> >> >> > Do you have a strange power saving mode enabled?
> >> >> > Dazed and confused, but trying to continue
> >> >> >
> >> >> > on my Pentium-D system here with latest Linus head.
> >> >> >
> >> >> > its sometimes 3c, sometimes 3d, I'm going to bisect and push for
> >> >> > reverts if nobody still has any clue about how to fix this.
> >> >> >
> >> >> > Dave.
> >> >> >
> >> >>
> >> >> We still trying to resolve it but without success yet. There is no
> >> >> easy way to revert it. One of the option might be to disable perf on
> >> >> p4 for a while. If this is acceptable -- i'll cook such patch and send
> >> >> it to Ingo. Hm?
> >> >
> >> > That's not really acceptable - need to fix it or revert it to the last
> >> > working
> >> > state. Which commit broke it?
> >> >
> >> > Thanks,
> >> >
> >> >        Ingo
> >> >
> >>
> >> I can't say you the commit id after which unknown-nmi start happening
> >> (i'm out of git tree
> >> at moment) but even then this commit should not be reverted since the
> >> problem is in
> >> p4 code not in the rest of perf system.
> >>
> >> I have two patches here (attached) and would really appreciate of
> >> their testing on HT machine
> >> together with kgdb bootup tests enabled. Dave could you please?
> >
> > Could these patches fix Dave's non-kgdb problem? Dave isnt using kgdb but is
> > probably using perf which triggers NMIs? Dave, can you confirm that?
> >
> > And it's a spurious NMI message, not actual lockup or other misbehavior,
> > right?
> >
> > Thanks,
> >
> > 	Ingo
> >
>
> For nonkgdb case 'unflagged nmi fix' patch should be enough. i've
> tested it on non-ht machine by self. without it there is no lockup
> but only a message about unknown nmi.

Ok, please submit it ASAP then - that ought to address the regression. Please Cc: 
Dave to the patch.

Thanks,

	Ingo


* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-16  8:56               ` Ingo Molnar
@ 2011-02-16  9:33                 ` Cyrill Gorcunov
  2011-02-16 10:09                   ` Ingo Molnar
  0 siblings, 1 reply; 26+ messages in thread
From: Cyrill Gorcunov @ 2011-02-16  9:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Airlie, George Spelvin, a.p.zijlstra, dzickus, eranian,
	linux-kernel, ming.m.lin

On 2/16/11, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>
>> On Wed, Feb 16, 2011 at 11:37 AM, Ingo Molnar <mingo@elte.hu> wrote:
>> ...
>> >> >>
>> >> >
>> >> > Ping on this problem, still seeing
>> >> >
>> >> > Uhhuh. NMI received for unknown reason 3c on CPU 0.
>> >> > Do you have a strange power saving mode enabled?
>> >> > Dazed and confused, but trying to continue
>> >> >
>> >> > on my Pentium-D system here with latest Linus head.
>> >> >
>> >> > its sometimes 3c, sometimes 3d, I'm going to bisect and push for
>> >> > reverts if nobody still has any clue about how to fix this.
>> >> >
>> >> > Dave.
>> >> >
>> >>
>> >> We still trying to resolve it but without success yet. There is no
>> >> easy way to revert it. One of the option might be to disable perf on
>> >> p4 for a while. If this is acceptable -- i'll cook such patch and send
>> >> it to Ingo. Hm?
>> >
>> > That's not really acceptable - need to fix it or revert it to the last
>> > working
>> > state. Which commit broke it?
>> >
>> > Thanks,
>> >
>> >        Ingo
>> >
>>
>> I can't say you the commit id after which unknown-nmi start happening
>> (i'm out of git tree
>> at moment) but even then this commit should not be reverted since the
>> problem is in
>> p4 code not in the rest of perf system.
>>
>> I have two patches here (attached) and would really appreciate of
>> their testing on HT machine
>> together with kgdb bootup tests enabled. Dave could you please?
>
> Could these patches fix Dave's non-kgdb problem? Dave isnt using kgdb but is
> probably using perf which triggers NMIs? Dave, can you confirm that?
>
> And it's a spurious NMI message, not actual lockup or other misbehavior,
> right?
>
> Thanks,
>
> 	Ingo
>
For the non-kgdb case the 'unflagged nmi fix' patch should be enough. I've
tested it on a non-HT machine myself. Without it there is no lockup, only
a message about an unknown NMI.

For an HT machine with kgdb things are harder: Don reported a lockup on
boot. The second patch might help, but I can't test it (I need help with
testing here).


* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-16  8:49             ` Cyrill Gorcunov
@ 2011-02-16  8:56               ` Ingo Molnar
  2011-02-16  9:33                 ` Cyrill Gorcunov
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2011-02-16  8:56 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Dave Airlie, George Spelvin, a.p.zijlstra, dzickus, eranian,
	linux-kernel, ming.m.lin


* Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> On Wed, Feb 16, 2011 at 11:37 AM, Ingo Molnar <mingo@elte.hu> wrote:
> ...
> >> >>
> >> >
> >> > Ping on this problem, still seeing
> >> >
> >> > Uhhuh. NMI received for unknown reason 3c on CPU 0.
> >> > Do you have a strange power saving mode enabled?
> >> > Dazed and confused, but trying to continue
> >> >
> >> > on my Pentium-D system here with latest Linus head.
> >> >
> >> > its sometimes 3c, sometimes 3d, I'm going to bisect and push for
> >> > reverts if nobody still has any clue about how to fix this.
> >> >
> >> > Dave.
> >> >
> >>
> >> We still trying to resolve it but without success yet. There is no
> >> easy way to revert it. One of the option might be to disable perf on
> >> p4 for a while. If this is acceptable -- i'll cook such patch and send
> >> it to Ingo. Hm?
> >
> > That's not really acceptable - need to fix it or revert it to the last working
> > state. Which commit broke it?
> >
> > Thanks,
> >
> >        Ingo
> >
> 
> I can't say you the commit id after which unknown-nmi start happening
> (i'm out of git tree
> at moment) but even then this commit should not be reverted since the
> problem is in
> p4 code not in the rest of perf system.
> 
> I have two patches here (attached) and would really appreciate of
> their testing on HT machine
> together with kgdb bootup tests enabled. Dave could you please?

Could these patches fix Dave's non-kgdb problem? Dave isn't using kgdb but is
probably using perf, which triggers NMIs? Dave, can you confirm that?

And it's a spurious NMI message, not actual lockup or other misbehavior, right?

Thanks,

	Ingo


* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-16  8:37           ` Ingo Molnar
@ 2011-02-16  8:49             ` Cyrill Gorcunov
  2011-02-16  8:56               ` Ingo Molnar
  0 siblings, 1 reply; 26+ messages in thread
From: Cyrill Gorcunov @ 2011-02-16  8:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Airlie, George Spelvin, a.p.zijlstra, dzickus, eranian,
	linux-kernel, ming.m.lin

[-- Attachment #1: Type: text/plain, Size: 1309 bytes --]

On Wed, Feb 16, 2011 at 11:37 AM, Ingo Molnar <mingo@elte.hu> wrote:
...
>> >>
>> >
>> > Ping on this problem, still seeing
>> >
>> > Uhhuh. NMI received for unknown reason 3c on CPU 0.
>> > Do you have a strange power saving mode enabled?
>> > Dazed and confused, but trying to continue
>> >
>> > on my Pentium-D system here with latest Linus head.
>> >
>> > its sometimes 3c, sometimes 3d, I'm going to bisect and push for
>> > reverts if nobody still has any clue about how to fix this.
>> >
>> > Dave.
>> >
>>
>> We still trying to resolve it but without success yet. There is no
>> easy way to revert it. One of the option might be to disable perf on
>> p4 for a while. If this is acceptable -- i'll cook such patch and send
>> it to Ingo. Hm?
>
> That's not really acceptable - need to fix it or revert it to the last working
> state. Which commit broke it?
>
> Thanks,
>
>        Ingo
>

I can't tell you the commit id after which the unknown NMIs started happening
(I'm away from my git tree at the moment), but even then that commit should
not be reverted, since the problem is in the p4 code, not in the rest of the
perf system.

I have two patches here (attached) and would really appreciate having them
tested on an HT machine with the kgdb bootup tests enabled. Dave, could you
please?

[-- Attachment #2: perf-x86-p4-extra-nmi --]
[-- Type: application/octet-stream, Size: 1042 bytes --]

perf, x86: P4 PMU -- capture extra nmi on overflow

It was observed that in case of performance nmi an extra
nmi may be issued, capture it otherwise 'unknown' nmi
message will appear.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
 arch/x86/kernel/cpu/perf_event_p4.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
===================================================================
--- linux-2.6.tip.orig/arch/x86/kernel/cpu/perf_event_p4.c
+++ linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
@@ -950,6 +950,15 @@ static int p4_pmu_handle_irq(struct pt_r
 	}
 
 	if (handled) {
+		/*
+		 * p4 quirk: extra NMI might happen so we
+		 * add one more here, the back-to-back
+		 * nmi logic will capture it; if more than
+		 * one counter triggered an extra nmi will
+		 * be captured anyway
+		 */
+		if (handled < 2)
+			handled++;
 		/* p4 quirk: unmask it again */
 		apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & ~APIC_LVT_MASKED);
 		inc_irq_stat(apic_perf_irqs);
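
The intent of the hunk above can be modelled in user space as follows.  This
is a simplified sketch, not the kernel's implementation: it assumes the
generic back-to-back logic swallows one unknown NMI when the preceding PMU
NMI reported more than one handled event, and it leaves out any other
conditions the real code may check.  The struct and function names are
hypothetical.

```c
#include <stdbool.h>

/* Hypothetical state: event count reported by the most recent PMU NMI. */
struct nmi_state {
	int last_handled;
};

/* Mirrors the patch hunk: if anything was handled, report at least two. */
static int p4_adjust_handled(int handled)
{
	if (handled && handled < 2)
		handled++;
	return handled;
}

/*
 * Assumed back-to-back rule: an otherwise unknown NMI is swallowed once if
 * the previous PMU NMI reported more than one handled event.
 */
static bool swallow_unknown_nmi(struct nmi_state *s)
{
	bool swallow = s->last_handled > 1;

	s->last_handled = 0;	/* only one extra NMI is forgiven */
	return swallow;
}
```

Under this model a single counter overflow is reported as two handled events,
so the extra NMI that the P4 may raise afterwards is absorbed instead of
producing the "unknown reason" message, while a second stray NMI would still
be reported.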

[-- Attachment #3: perf-x86-p4-unflagged-nmi --]
[-- Type: application/octet-stream, Size: 2023 bytes --]

From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [PATCH] perf, x86: P4 PMU -- Fix unflagged overflows test

A couple of people have reported an unknown NMI issue on p4 pmu.
This patch should fix it.

Reported-by: George Spelvin <linux@horizon.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: Lin Ming <ming.m.lin@intel.com>
CC: Don Zickus <dzickus@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/include/asm/perf_event_p4.h |    1 +
 arch/x86/kernel/cpu/perf_event_p4.c  |   11 ++++++++---
 2 files changed, 9 insertions(+), 3 deletions(-)

Index: linux-2.6.tip/arch/x86/include/asm/perf_event_p4.h
===================================================================
--- linux-2.6.tip.orig/arch/x86/include/asm/perf_event_p4.h
+++ linux-2.6.tip/arch/x86/include/asm/perf_event_p4.h
@@ -22,6 +22,7 @@
 
 #define ARCH_P4_CNTRVAL_BITS	(40)
 #define ARCH_P4_CNTRVAL_MASK	((1ULL << ARCH_P4_CNTRVAL_BITS) - 1)
+#define ARCH_P4_UNFLAGGED_BIT	((1ULL) << (ARCH_P4_CNTRVAL_BITS - 1))
 
 #define P4_ESCR_EVENT_MASK	0x7e000000U
 #define P4_ESCR_EVENT_SHIFT	25
Index: linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
===================================================================
--- linux-2.6.tip.orig/arch/x86/kernel/cpu/perf_event_p4.c
+++ linux-2.6.tip/arch/x86/kernel/cpu/perf_event_p4.c
@@ -770,9 +770,14 @@ static inline int p4_pmu_clear_cccr_ovf(
 		return 1;
 	}
 
-	/* it might be unflagged overflow */
-	rdmsrl(hwc->event_base + hwc->idx, v);
-	if (!(v & ARCH_P4_CNTRVAL_MASK))
+	/*
+	 * at some circumstances the overflow might issue NMI but did
+	 * not set P4_CCCR_OVF bit so since a counter holds a negative value
+	 * we simply check for high bit being set, if it's cleared it means
+	 * the counter has reached zero value and continued counting before
+	 * real NMI signal was received
+	 */
+	if (!(v & ARCH_P4_UNFLAGGED_BIT))
 		return 1;
 
 	return 0;


* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-16  4:19         ` Cyrill Gorcunov
@ 2011-02-16  8:37           ` Ingo Molnar
  2011-02-16  8:49             ` Cyrill Gorcunov
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2011-02-16  8:37 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Dave Airlie, George Spelvin, a.p.zijlstra, dzickus, eranian,
	linux-kernel, ming.m.lin


* Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> On 2/16/11, Dave Airlie <airlied@gmail.com> wrote:
> > On Wed, Feb 2, 2011 at 2:18 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> >> On 2/2/11, George Spelvin <linux@horizon.com> wrote:
> >>>>  But if you have a will or would like to help debug the problem -- mind
> >>>> to
> >>>> try the patch below? Note the patch is ugly at moment and must *not* be
> >>>> running on non-P4 system (and I only compile-tested it so no guarantees
> >>>> at all, and I've CC'ed a couple of people as well)
> >>>
> >>> Promising...  After 32 minute of uptime, no NMI complaints so far.
> >>>
> >>> I'll let it run overnight and see what happens.
> >>>
> >>
> >> Great, thanks. Though the patch didn't help for Don, ie there is still
> >> an issue which needs to be resolved as well.
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> Please read the FAQ at  http://www.tux.org/lkml/
> >>
> >
> > Ping on this problem, still seeing
> >
> > Uhhuh. NMI received for unknown reason 3c on CPU 0.
> > Do you have a strange power saving mode enabled?
> > Dazed and confused, but trying to continue
> >
> > on my Pentium-D system here with latest Linus head.
> >
> > its sometimes 3c, sometimes 3d, I'm going to bisect and push for
> > reverts if nobody still has any clue about how to fix this.
> >
> > Dave.
> >
> 
> We are still trying to resolve it, but without success so far. There is
> no easy way to revert it. One option might be to disable perf on P4 for
> a while. If this is acceptable, I'll cook up such a patch and send it to
> Ingo. Hm?

That's not really acceptable - need to fix it or revert it to the last working 
state. Which commit broke it?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-16  1:57       ` Dave Airlie
@ 2011-02-16  4:19         ` Cyrill Gorcunov
  2011-02-16  8:37           ` Ingo Molnar
  2011-02-16 11:57         ` George Spelvin
  1 sibling, 1 reply; 26+ messages in thread
From: Cyrill Gorcunov @ 2011-02-16  4:19 UTC (permalink / raw)
  To: Dave Airlie
  Cc: George Spelvin, a.p.zijlstra, dzickus, eranian, linux-kernel,
	ming.m.lin, mingo

On 2/16/11, Dave Airlie <airlied@gmail.com> wrote:
> On Wed, Feb 2, 2011 at 2:18 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>> On 2/2/11, George Spelvin <linux@horizon.com> wrote:
>>>>  But if you have a will or would like to help debug the problem -- mind
>>>> to
>>>> try the patch below? Note the patch is ugly at moment and must *not* be
>>>> running on non-P4 system (and I only compile-tested it so no guarantees
>>>> at all, and I've CC'ed a couple of people as well)
>>>
>>> Promising...  After 32 minute of uptime, no NMI complaints so far.
>>>
>>> I'll let it run overnight and see what happens.
>>>
>>
>> Great, thanks. Though the patch didn't help for Don, ie there is still
>> an issue which needs to be resolved as well.
>>
>
> Ping on this problem, still seeing
>
> Uhhuh. NMI received for unknown reason 3c on CPU 0.
> Do you have a strange power saving mode enabled?
> Dazed and confused, but trying to continue
>
> on my Pentium-D system here with latest Linus head.
>
> its sometimes 3c, sometimes 3d, I'm going to bisect and push for
> reverts if nobody still has any clue about how to fix this.
>
> Dave.
>

We are still trying to resolve it, but without success so far. There is
no easy way to revert it. One option might be to disable perf on P4 for
a while. If this is acceptable, I'll cook up such a patch and send it to
Ingo. Hm?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-02  4:18     ` Cyrill Gorcunov
@ 2011-02-16  1:57       ` Dave Airlie
  2011-02-16  4:19         ` Cyrill Gorcunov
  2011-02-16 11:57         ` George Spelvin
  0 siblings, 2 replies; 26+ messages in thread
From: Dave Airlie @ 2011-02-16  1:57 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: George Spelvin, a.p.zijlstra, dzickus, eranian, linux-kernel,
	ming.m.lin, mingo

On Wed, Feb 2, 2011 at 2:18 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On 2/2/11, George Spelvin <linux@horizon.com> wrote:
>>>  But if you have a will or would like to help debug the problem -- mind to
>>> try the patch below? Note the patch is ugly at moment and must *not* be
>>> running on non-P4 system (and I only compile-tested it so no guarantees
>>> at all, and I've CC'ed a couple of people as well)
>>
>> Promising...  After 32 minute of uptime, no NMI complaints so far.
>>
>> I'll let it run overnight and see what happens.
>>
>
> Great, thanks. Though the patch didn't help for Don, ie there is still
> an issue which needs to be resolved as well.
>

Ping on this problem, still seeing

Uhhuh. NMI received for unknown reason 3c on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue

on my Pentium-D system here with latest Linus head.

its sometimes 3c, sometimes 3d, I'm going to bisect and push for
reverts if nobody still has any clue about how to fix this.

Dave.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-02  2:36   ` George Spelvin
@ 2011-02-02  4:18     ` Cyrill Gorcunov
  2011-02-16  1:57       ` Dave Airlie
  0 siblings, 1 reply; 26+ messages in thread
From: Cyrill Gorcunov @ 2011-02-02  4:18 UTC (permalink / raw)
  To: George Spelvin
  Cc: a.p.zijlstra, dzickus, eranian, linux-kernel, ming.m.lin, mingo

On 2/2/11, George Spelvin <linux@horizon.com> wrote:
>>  But if you have a will or would like to help debug the problem -- mind to
>> try the patch below? Note the patch is ugly at moment and must *not* be
>> running on non-P4 system (and I only compile-tested it so no guarantees
>> at all, and I've CC'ed a couple of people as well)
>
> Promising...  After 32 minute of uptime, no NMI complaints so far.
>
> I'll let it run overnight and see what happens.
>

Great, thanks. Though the patch didn't help for Don, ie there is still
an issue which needs to be resolved as well.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-01 17:52 ` Cyrill Gorcunov
  2011-02-01 18:41   ` Don Zickus
@ 2011-02-02  2:36   ` George Spelvin
  2011-02-02  4:18     ` Cyrill Gorcunov
  1 sibling, 1 reply; 26+ messages in thread
From: George Spelvin @ 2011-02-02  2:36 UTC (permalink / raw)
  To: gorcunov, linux
  Cc: a.p.zijlstra, dzickus, eranian, linux-kernel, ming.m.lin, mingo

>  But if you have a will or would like to help debug the problem -- mind to
> try the patch below? Note the patch is ugly at moment and must *not* be
> running on non-P4 system (and I only compile-tested it so no guarantees
> at all, and I've CC'ed a couple of people as well)

Promising...  After 32 minute of uptime, no NMI complaints so far.

I'll let it run overnight and see what happens.

Thank you very much!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-01 18:51       ` Don Zickus
@ 2011-02-01 20:00         ` Cyrill Gorcunov
  0 siblings, 0 replies; 26+ messages in thread
From: Cyrill Gorcunov @ 2011-02-01 20:00 UTC (permalink / raw)
  To: Don Zickus
  Cc: George Spelvin, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Lin Ming, Stephane Eranian

On 02/01/2011 09:51 PM, Don Zickus wrote:
...
>>
>> You mean it didn't help?
> 
> Not that I noticed no.
> 
> Cheers,
> Don

 Thanks a lot for testing, Don! I'll check what else I can do.

-- 
    Cyrill

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-01 18:44     ` Cyrill Gorcunov
@ 2011-02-01 18:51       ` Don Zickus
  2011-02-01 20:00         ` Cyrill Gorcunov
  0 siblings, 1 reply; 26+ messages in thread
From: Don Zickus @ 2011-02-01 18:51 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: George Spelvin, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Lin Ming, Stephane Eranian

On Tue, Feb 01, 2011 at 09:44:15PM +0300, Cyrill Gorcunov wrote:
> On 02/01/2011 09:41 PM, Don Zickus wrote:
> ...
> > 
> > Unfortunately, I have not had success with patch below on my system. :-(
> > 
> > Cheers,
> > Don
> 
> You mean it didn't help?

Not that I noticed no.

Cheers,
Don

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-01 18:41   ` Don Zickus
@ 2011-02-01 18:44     ` Cyrill Gorcunov
  2011-02-01 18:51       ` Don Zickus
  0 siblings, 1 reply; 26+ messages in thread
From: Cyrill Gorcunov @ 2011-02-01 18:44 UTC (permalink / raw)
  To: Don Zickus
  Cc: George Spelvin, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Lin Ming, Stephane Eranian

On 02/01/2011 09:41 PM, Don Zickus wrote:
...
> 
> Unfortunately, I have not had success with patch below on my system. :-(
> 
> Cheers,
> Don

You mean it didn't help?

-- 
    Cyrill

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-01 17:52 ` Cyrill Gorcunov
@ 2011-02-01 18:41   ` Don Zickus
  2011-02-01 18:44     ` Cyrill Gorcunov
  2011-02-02  2:36   ` George Spelvin
  1 sibling, 1 reply; 26+ messages in thread
From: Don Zickus @ 2011-02-01 18:41 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: George Spelvin, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Lin Ming, Stephane Eranian

On Tue, Feb 01, 2011 at 08:52:19PM +0300, Cyrill Gorcunov wrote:
> On 02/01/2011 07:27 PM, George Spelvin wrote:
> > Since upgrading to -rc2 (-rc3 is compiling right now), I've been getting
> > complaints at irregular intervals.  This didn't used to happen with 2.6.37.
> > 
> ...
> > Should I bisect this, or does someone know what might be happening?
> > 
> > Thank you!
> > 
> 
>  I fear it's known issue at moment, we're trying to resolve it. There is
> an option -- to disable nmi_watchdog (nmi_watchdog=0 boot option).
> 
>  But if you have a will or would like to help debug the problem -- mind to
> try the patch below? Note the patch is ugly at moment and must *not* be
> running on non-P4 system (and I only compile-tested it so no guarantees
> at all, and I've CC'ed a couple of people as well)

Unfortunately, I have not had success with patch below on my system. :-(

Cheers,
Don

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
  2011-02-01 16:27 George Spelvin
@ 2011-02-01 17:52 ` Cyrill Gorcunov
  2011-02-01 18:41   ` Don Zickus
  2011-02-02  2:36   ` George Spelvin
  0 siblings, 2 replies; 26+ messages in thread
From: Cyrill Gorcunov @ 2011-02-01 17:52 UTC (permalink / raw)
  To: George Spelvin
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Don Zickus, Lin Ming,
	Stephane Eranian

On 02/01/2011 07:27 PM, George Spelvin wrote:
> Since upgrading to -rc2 (-rc3 is compiling right now), I've been getting
> complaints at irregular intervals.  This didn't used to happen with 2.6.37.
> 
...
> Should I bisect this, or does someone know what might be happening?
> 
> Thank you!
> 

 I fear it's a known issue at the moment; we're trying to resolve it. One
option is to disable the NMI watchdog (nmi_watchdog=0 boot option).

 But if you are willing to help debug the problem, would you mind trying
the patch below? Note that the patch is ugly at the moment and must *not*
be run on a non-P4 system (I have only compile-tested it, so no guarantees
at all; I've also CC'ed a couple of people).

    Cyrill

---
 arch/x86/kernel/cpu/perf_event.c    |   12 +++++++++++-
 arch/x86/kernel/cpu/perf_event_p4.c |    8 +++++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6.git/arch/x86/kernel/cpu/perf_event.c
=====================================================================
--- linux-2.6.git.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6.git/arch/x86/kernel/cpu/perf_event.c
@@ -1075,7 +1075,17 @@ static void x86_pmu_start(struct perf_ev

 	cpuc->events[idx] = event;
 	__set_bit(idx, cpuc->active_mask);
-	__set_bit(idx, cpuc->running);
+	if (1) {
+		/* running mask is shared across a core */
+		int leader_cpu;
+		struct cpu_hw_events *leader_cpuc;
+
+		leader_cpu	= cpumask_first(__get_cpu_var(cpu_sibling_map));
+		leader_cpuc	= &per_cpu(cpu_hw_events, leader_cpu);
+
+		__set_bit(idx, leader_cpuc->running);
+	} else
+		__set_bit(idx, cpuc->running);
 	x86_pmu.enable(event);
 	perf_event_update_userpage(event);
 }
Index: linux-2.6.git/arch/x86/kernel/cpu/perf_event_p4.c
=====================================================================
--- linux-2.6.git.orig/arch/x86/kernel/cpu/perf_event_p4.c
+++ linux-2.6.git/arch/x86/kernel/cpu/perf_event_p4.c
@@ -907,8 +907,14 @@ static int p4_pmu_handle_irq(struct pt_r
 		int overflow;

 		if (!test_bit(idx, cpuc->active_mask)) {
+			int leader_cpu;
+			struct cpu_hw_events *leader_cpuc;
+
+			leader_cpu	= cpumask_first(__get_cpu_var(cpu_sibling_map));
+			leader_cpuc	= &per_cpu(cpu_hw_events, leader_cpu);
+
 			/* catch in-flight IRQs */
-			if (__test_and_clear_bit(idx, cpuc->running))
+			if (__test_and_clear_bit(idx, leader_cpuc->running))
 				handled++;
 			continue;
 		}


^ permalink raw reply	[flat|nested] 26+ messages in thread

* 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0.
@ 2011-02-01 16:27 George Spelvin
  2011-02-01 17:52 ` Cyrill Gorcunov
  0 siblings, 1 reply; 26+ messages in thread
From: George Spelvin @ 2011-02-01 16:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux

Since upgrading to -rc2 (-rc3 is compiling right now), I've been getting
complaints at irregular intervals.  This didn't used to happen with 2.6.37.

It's an old crappy 1.6 GHz P4 (HP Pavilion) with an ASUS P4B266LA
motherboard and a 2001 Award BIOS.

00:00.0 Host bridge [0600]: Intel Corporation 82845 845 [Brookdale] Chipset Host Bridge [8086:1a30] (rev 04)
00:01.0 PCI bridge [0604]: Intel Corporation 82845 845 [Brookdale] Chipset AGP Bridge [8086:1a31] (rev 04)
00:1e.0 PCI bridge [0604]: Intel Corporation 82801 PCI Bridge [8086:244e] (rev 05)
00:1f.0 ISA bridge [0601]: Intel Corporation 82801BA ISA Bridge (LPC) [8086:2440] (rev 05)
00:1f.1 IDE interface [0101]: Intel Corporation 82801BA IDE U100 Controller [8086:244b] (rev 05)
00:1f.2 USB Controller [0c03]: Intel Corporation 82801BA/BAM USB Controller #1 [8086:2442] (rev 05)
00:1f.3 SMBus [0c05]: Intel Corporation 82801BA/BAM SMBus Controller [8086:2443] (rev 05)
00:1f.4 USB Controller [0c03]: Intel Corporation 82801BA/BAM USB Controller #1 [8086:2444] (rev 05)
00:1f.5 Multimedia audio controller [0401]: Intel Corporation 82801BA/BAM AC'97 Audio Controller [8086:2445] (rev 05)
01:00.0 VGA compatible controller [0300]: ATI Technologies Inc Radeon RV250 If [Radeon 9000] [1002:4966] (rev 01)
01:00.1 Display controller [0380]: ATI Technologies Inc Radeon RV250 [Radeon 9000] (Secondary) [1002:496e] (rev 01)
02:08.0 Ethernet controller [0200]: Intel Corporation 82801BA/BAM/CA/CAM Ethernet Controller [8086:2449] (rev 03)
02:09.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB12LV26 IEEE-1394 Controller (Link) [104c:8020]


Should I bisect this, or does someone know what might be happening?

Thank you!


Jan 30 13:13:25 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 13:13:25 kernel: Do you have a strange power saving mode enabled?
Jan 30 13:13:25 kernel: Dazed and confused, but trying to continue
Jan 30 17:51:10 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 17:51:10 kernel: Do you have a strange power saving mode enabled?
Jan 30 17:51:10 kernel: Dazed and confused, but trying to continue
Jan 30 18:05:11 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 30 18:05:11 kernel: Do you have a strange power saving mode enabled?
Jan 30 18:05:11 kernel: Dazed and confused, but trying to continue
Jan 30 18:19:16 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 30 18:19:16 kernel: Do you have a strange power saving mode enabled?
Jan 30 18:19:16 kernel: Dazed and confused, but trying to continue
Jan 30 18:33:33 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 18:33:33 kernel: Do you have a strange power saving mode enabled?
Jan 30 18:33:33 kernel: Dazed and confused, but trying to continue
Jan 30 18:48:23 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 18:48:23 kernel: Do you have a strange power saving mode enabled?
Jan 30 18:48:23 kernel: Dazed and confused, but trying to continue
Jan 30 21:39:58 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 21:39:58 kernel: Do you have a strange power saving mode enabled?
Jan 30 21:39:58 kernel: Dazed and confused, but trying to continue
Jan 30 22:01:46 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 22:01:46 kernel: Do you have a strange power saving mode enabled?
Jan 30 22:01:46 kernel: Dazed and confused, but trying to continue
Jan 30 22:03:13 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 22:03:13 kernel: Do you have a strange power saving mode enabled?
Jan 30 22:03:13 kernel: Dazed and confused, but trying to continue
Jan 30 22:04:38 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 30 22:04:38 kernel: Do you have a strange power saving mode enabled?
Jan 30 22:04:38 kernel: Dazed and confused, but trying to continue
Jan 30 22:06:03 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 30 22:06:03 kernel: Do you have a strange power saving mode enabled?
Jan 30 22:06:03 kernel: Dazed and confused, but trying to continue
Jan 30 22:07:23 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 30 22:07:23 kernel: Do you have a strange power saving mode enabled?
Jan 30 22:07:23 kernel: Dazed and confused, but trying to continue
Jan 31 01:00:28 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 31 01:00:28 kernel: Do you have a strange power saving mode enabled?
Jan 31 01:00:28 kernel: Dazed and confused, but trying to continue
Jan 31 03:00:02 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 31 03:00:02 kernel: Do you have a strange power saving mode enabled?
Jan 31 03:00:02 kernel: Dazed and confused, but trying to continue
Jan 31 06:27:52 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 31 06:27:52 kernel: Do you have a strange power saving mode enabled?
Jan 31 06:27:52 kernel: Dazed and confused, but trying to continue
Jan 31 07:36:54 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 31 07:36:54 kernel: Do you have a strange power saving mode enabled?
Jan 31 07:36:54 kernel: Dazed and confused, but trying to continue
Jan 31 10:08:08 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 31 10:08:08 kernel: Do you have a strange power saving mode enabled?
Jan 31 10:08:08 kernel: Dazed and confused, but trying to continue
Jan 31 16:42:02 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Jan 31 16:42:02 kernel: Do you have a strange power saving mode enabled?
Jan 31 16:42:02 kernel: Dazed and confused, but trying to continue
Jan 31 20:05:21 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Jan 31 20:05:21 kernel: Do you have a strange power saving mode enabled?
Jan 31 20:05:21 kernel: Dazed and confused, but trying to continue
Feb  1 01:00:19 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb  1 01:00:19 kernel: Do you have a strange power saving mode enabled?
Feb  1 01:00:19 kernel: Dazed and confused, but trying to continue
Feb  1 01:36:42 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb  1 01:36:42 kernel: Do you have a strange power saving mode enabled?
Feb  1 01:36:42 kernel: Dazed and confused, but trying to continue
Feb  1 02:01:04 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb  1 02:01:04 kernel: Do you have a strange power saving mode enabled?
Feb  1 02:01:04 kernel: Dazed and confused, but trying to continue
Feb  1 05:58:05 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb  1 05:58:05 kernel: Do you have a strange power saving mode enabled?
Feb  1 05:58:05 kernel: Dazed and confused, but trying to continue
Feb  1 06:28:18 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb  1 06:28:18 kernel: Do you have a strange power saving mode enabled?
Feb  1 06:28:18 kernel: Dazed and confused, but trying to continue
Feb  1 08:59:18 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb  1 08:59:18 kernel: Do you have a strange power saving mode enabled?
Feb  1 08:59:18 kernel: Dazed and confused, but trying to continue
Feb  1 11:04:43 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb  1 11:04:43 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:04:43 kernel: Dazed and confused, but trying to continue
Feb  1 11:05:47 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb  1 11:05:47 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:05:47 kernel: Dazed and confused, but trying to continue
Feb  1 11:06:48 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb  1 11:06:48 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:06:48 kernel: Dazed and confused, but trying to continue
Feb  1 11:07:50 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb  1 11:07:50 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:07:50 kernel: Dazed and confused, but trying to continue
Feb  1 11:08:52 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb  1 11:08:52 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:08:52 kernel: Dazed and confused, but trying to continue
Feb  1 11:09:54 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb  1 11:09:54 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:09:54 kernel: Dazed and confused, but trying to continue
Feb  1 11:10:56 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb  1 11:10:56 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:10:56 kernel: Dazed and confused, but trying to continue
Feb  1 11:11:58 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb  1 11:11:58 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:11:58 kernel: Dazed and confused, but trying to continue
Feb  1 11:13:00 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb  1 11:13:00 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:13:00 kernel: Dazed and confused, but trying to continue
Feb  1 11:14:01 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb  1 11:14:01 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:14:01 kernel: Dazed and confused, but trying to continue
Feb  1 11:15:04 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb  1 11:15:04 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:15:04 kernel: Dazed and confused, but trying to continue
Feb  1 11:16:05 kernel: Uhhuh. NMI received for unknown reason 3d on CPU 0.
Feb  1 11:16:05 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:16:05 kernel: Dazed and confused, but trying to continue
Feb  1 11:17:07 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb  1 11:17:07 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:17:07 kernel: Dazed and confused, but trying to continue
Feb  1 11:18:33 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0.
Feb  1 11:18:33 kernel: Do you have a strange power saving mode enabled?
Feb  1 11:18:33 kernel: Dazed and confused, but trying to continue
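
For readers wondering what the 2d and 3d values in the log mean: on PC-compatible hardware, kernels of this era take the NMI reason byte from I/O port 0x61 and print "unknown reason" when neither of the top two status bits is set. The sketch below is a user-space illustration, not kernel code. The macro and function names are hypothetical, and the bit meanings follow the conventional layout of that port (bit 7 SERR/parity status, bit 6 IOCHK status, bit 4 refresh toggle), which should be treated as an assumption for any particular chipset.

```c
#include <stdint.h>

/* Hypothetical names for the conventional port 0x61 bits (assumed layout). */
#define NMI_SERR_STATUS		0x80	/* bit 7: SERR / memory parity NMI */
#define NMI_IOCHK_STATUS	0x40	/* bit 6: I/O channel check NMI */
#define NMI_REFRESH_TOGGLE	0x10	/* bit 4: flips with refresh cycles */

/* True when neither known NMI source bit is set in the reason byte. */
static int nmi_reason_is_unknown(uint8_t reason)
{
	return !(reason & (NMI_SERR_STATUS | NMI_IOCHK_STATUS));
}

/* Bits that differ between two reported reason bytes. */
static uint8_t nmi_reason_diff(uint8_t a, uint8_t b)
{
	return (uint8_t)(a ^ b);
}
```

Under these assumptions, 2d, 3d and 3c all have bits 7 and 6 clear, which is why they are reported as unknown, and 2d and 3d differ only in the refresh toggle bit, so the two values most likely describe the same condition sampled at different moments.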

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2011-02-21 16:45 UTC | newest]

Thread overview: 26+ messages
2011-02-14 13:36 2.6.38-rc2: Uhhuh. NMI received for unknown reason 2d on CPU 0 Preeti Khurana
2011-02-17  0:17 ` Ryan Underwood
2011-02-17  7:59   ` Cyrill Gorcunov
2011-02-18  2:40     ` Paul E. McKenney
2011-02-18 20:38       ` Underwood, Ryan
2011-02-21  6:56         ` Preeti Khurana
2011-02-21 16:45           ` Underwood, Ryan
  -- strict thread matches above, loose matches on Subject: below --
2011-02-01 16:27 George Spelvin
2011-02-01 17:52 ` Cyrill Gorcunov
2011-02-01 18:41   ` Don Zickus
2011-02-01 18:44     ` Cyrill Gorcunov
2011-02-01 18:51       ` Don Zickus
2011-02-01 20:00         ` Cyrill Gorcunov
2011-02-02  2:36   ` George Spelvin
2011-02-02  4:18     ` Cyrill Gorcunov
2011-02-16  1:57       ` Dave Airlie
2011-02-16  4:19         ` Cyrill Gorcunov
2011-02-16  8:37           ` Ingo Molnar
2011-02-16  8:49             ` Cyrill Gorcunov
2011-02-16  8:56               ` Ingo Molnar
2011-02-16  9:33                 ` Cyrill Gorcunov
2011-02-16 10:09                   ` Ingo Molnar
2011-02-16 11:08                     ` Cyrill Gorcunov
2011-02-16 11:57         ` George Spelvin
2011-02-17  2:56           ` Dave Airlie
2011-02-17  7:48             ` Cyrill Gorcunov
