LKML Archive on lore.kernel.org
* RE: 2.6.7-rc1-bk: SMT scheduler bug / crashes on kernel boot
@ 2004-05-26 15:54 Nakajima, Jun
  2004-05-26 20:52 ` Zwane Mwaikambo
  0 siblings, 1 reply; 7+ messages in thread
From: Nakajima, Jun @ 2004-05-26 15:54 UTC (permalink / raw)
  To: Nick Piggin, Anton Altaparmakov; +Cc: mingo, lkml, Zwane Mwaikambo

>From: Nick Piggin [mailto:nickpiggin@yahoo.com.au]
>Sent: Wednesday, May 26, 2004 4:45 AM
>To: Anton Altaparmakov
>Cc: mingo@elte.hu; lkml; Zwane Mwaikambo; Nakajima, Jun
>Subject: Re: 2.6.7-rc1-bk: SMT scheduler bug / crashes on kernel boot
>
>Anton Altaparmakov wrote:
>> On Wed, 2004-05-26 at 12:28, Nick Piggin wrote:
>>
>
>The problem is that smp_num_siblings is set to 2, but phys_proc_id
>doesn't seem to be set up right (or it could be cpu_callout_map).
>That would certainly cause the sched-domains to get set up wrong.
>Maybe we should just BUG in this case? Or try to fix it up?
>
>Anyone have any idea why this is happening?
>
If I understand correctly, this happens when you run the kernel on top
of VMware? If so, it is likely that VMware is not emulating the behavior
of cpuid with respect to HT support (in particular, the mapping between
logical processors and the processor package). In that case you should
get the same result when you run a 2.6.6 kernel. Do you have dmesg
output from 2.6.6 or some older kernel?

Jun



* RE: 2.6.7-rc1-bk: SMT scheduler bug / crashes on kernel boot
  2004-05-26 15:54 2.6.7-rc1-bk: SMT scheduler bug / crashes on kernel boot Nakajima, Jun
@ 2004-05-26 20:52 ` Zwane Mwaikambo
  0 siblings, 0 replies; 7+ messages in thread
From: Zwane Mwaikambo @ 2004-05-26 20:52 UTC (permalink / raw)
  To: Nakajima, Jun; +Cc: Nick Piggin, Anton Altaparmakov, Ingo Molnar, lkml, zwane

On Wed, 26 May 2004, Nakajima, Jun wrote:

> >Anyone have any idea why this is happening?
> >
> If I understand correctly, this happens when you run the kernel on top
> of VMware? If so, it is likely that VMware is not emulating the behavior
> of cpuid with respect to HT support (in particular, the mapping between
> logical processors and the processor package). In that case you should
> get the same result when you run a 2.6.6 kernel. Do you have dmesg
> output from 2.6.6 or some older kernel?

As far as I know it's a VMware thing: it just runs cpuid on the host and
therefore reports what it finds there, but then the kernel gets confused
when it finds that it really doesn't have a sibling processor. The crash
on boot is just a side effect of the new sched domains code, which
depends more heavily on this information.
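
To make the cpuid point concrete, here is a minimal sketch of how the
HT-related fields of CPUID leaf 1 decode. It assumes the usual layout (HTT
flag in EDX bit 28, logical processor count per package in EBX bits 23:16).
The names ht_info and decode_cpuid_leaf1 are made up for illustration and are
not kernel symbols, and the register values are passed in as plain integers
so the sketch runs anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What a guest would conclude about HT after reading CPUID leaf 1. */
struct ht_info {
    int htt;                 /* EDX bit 28: HTT feature flag */
    int logical_per_package; /* EBX bits 23:16: logical processor count */
};

/*
 * Decode the HT-related fields from raw CPUID leaf 1 register values.
 * Hypothetical helper; the caller supplies the register contents.
 */
static void decode_cpuid_leaf1(uint32_t ebx, uint32_t edx, struct ht_info *out)
{
    assert(out != NULL);
    out->htt = (int)((edx >> 28) & 1u);
    /* Without the HTT flag the count field is not meaningful: assume 1. */
    out->logical_per_package = out->htt ? (int)((ebx >> 16) & 0xffu) : 1;
}
```

If the host's values are passed straight through to a guest that only has one
virtual CPU, the guest would still decode a count of 2 here, which is the
mismatch the warning reports.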


* Re: 2.6.7-rc1-bk: SMT scheduler bug / crashes on kernel boot
       [not found]   ` <1085572902.2666.105.camel@imp.csi.cam.ac.uk>
@ 2004-05-26 12:10     ` Nick Piggin
  0 siblings, 0 replies; 7+ messages in thread
From: Nick Piggin @ 2004-05-26 12:10 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: mingo, lkml

Anton Altaparmakov wrote:
> On Wed, 2004-05-26 at 12:13, Nick Piggin wrote:
> 
>>Anton Altaparmakov wrote:
>>
>>>Hi,
>>>
>>>Kernel 2.6.7-rc1-bk crashes on boot with a NULL pointer dereference. 
>>>The kernel is running under VMware if that matters but I don't think it
>>>should.  It was working fine with 2.6.6-rc3-bk kernels.
>>>
>>>I am afraid the only way I could capture the crash was to capture the
>>>vmware screen into a PNG image which is attached.  Maybe I need to set up
>>>some OCR software for the future...  (-;
>>>
>>>The system running VMware is a P4 2.6GHz with Hyper-Threading enabled and
>>>/proc/cpuinfo shows two cpus:
>>
>>OK, thanks for that. It would be quite helpful if you edit
>>kernel/sched.c and turn the line #undef SCHED_DOMAIN_DEBUG into
>>#define SCHED_DOMAIN_DEBUG, then compile a kernel with debugging
>>info enabled.
> 
> 
> Looking at kernel/sched.c it already says #define, not #undef!
> 

Oops, yes.

[snip]

> So the dereferencing of one of the two fails.  Considering the offset is
> 0x18 in the NULL dereference it must be the (p)->prio that causes the
> oops and hence p must be NULL.  I will leave you to figure out what that
> means...
> 

Nice detective work.

I'd say it tried to dereference a NULL idle thread, i.e. the CPU hasn't
been set up. Please try Ingo's patch.
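
The offset reasoning above can be sketched in a few lines of C. The struct
below is a hypothetical stand-in, not the real task_struct: it assumes a
32-bit layout in which six 4-byte fields precede prio, which is what would put
prio at offset 0x18. The field names are illustrative guesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit layout; every field is 4 bytes wide. */
struct task_sketch {
    uint32_t state;
    uint32_t thread_info; /* a pointer on a 32-bit machine */
    uint32_t usage;
    uint32_t flags;
    uint32_t ptrace;
    uint32_t lock_depth;
    uint32_t prio;        /* 6 * 4 = 24 = 0x18 */
};

/* Address the CPU touches when reading a field through a NULL base. */
static uintptr_t null_deref_fault_address(size_t field_offset)
{
    return (uintptr_t)0 + field_offset;
}

/* Faulting address for (p)->prio when p == NULL, under the layout above. */
static uintptr_t prio_fault_address(void)
{
    /* The sketch only holds if the compiler inserted no padding. */
    assert(sizeof(struct task_sketch) == 7 * sizeof(uint32_t));
    return null_deref_fault_address(offsetof(struct task_sketch, prio));
}
```

Under that assumed layout a faulting address of 0x18 points at prio being
read through a NULL task pointer, which matches the deduction quoted above.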


* Re: 2.6.7-rc1-bk: SMT scheduler bug / crashes on kernel boot
       [not found]     ` <1085571285.2666.75.camel@imp.csi.cam.ac.uk>
  2004-05-26 11:44       ` Ingo Molnar
@ 2004-05-26 11:45       ` Nick Piggin
  1 sibling, 0 replies; 7+ messages in thread
From: Nick Piggin @ 2004-05-26 11:45 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: mingo, lkml, Zwane Mwaikambo, Nakajima, Jun

Anton Altaparmakov wrote:
> On Wed, 2004-05-26 at 12:28, Nick Piggin wrote:
> 
>>Anton Altaparmakov wrote:
>>
>>
>>>The kernel shows a warning about the number of siblings being 2 but only
>>>1 being detected.  Perhaps this is the cause of the problem.  Even if
>>
>>Probably this is the cause of the problem.
>>
>>What is the exact message?
> 
> 
> You can see it in the PNG I attached in the original post.  But here it
> is copied out:
> 
> WARNING: 1 siblings found for CPU0, should be 2
> 

OK thanks... I think I even wrote that :P

The problem is that smp_num_siblings is set to 2, but phys_proc_id
doesn't seem to be set up right (or it could be cpu_callout_map).
That would certainly cause the sched-domains to get set up wrong.
Maybe we should just BUG in this case? Or try to fix it up?

Anyone have any idea why this is happening?



* Re: 2.6.7-rc1-bk: SMT scheduler bug / crashes on kernel boot
       [not found]     ` <1085571285.2666.75.camel@imp.csi.cam.ac.uk>
@ 2004-05-26 11:44       ` Ingo Molnar
  2004-05-26 11:45       ` Nick Piggin
  1 sibling, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2004-05-26 11:44 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Nick Piggin, lkml


* Anton Altaparmakov <aia21@cam.ac.uk> wrote:

> WARNING: 1 siblings found for CPU0, should be 2

does the patch below fix it?

	Ingo

--- linux/arch/i386/kernel/smpboot.c.orig	
+++ linux/arch/i386/kernel/smpboot.c	
@@ -1110,8 +1110,10 @@ static void __init smp_boot_cpus(unsigne
 			cpu_set(cpu, cpu_sibling_map[cpu]);
 		}
 
-		if (siblings != smp_num_siblings)
+		if (siblings != smp_num_siblings) {
 			printk(KERN_WARNING "WARNING: %d siblings found for CPU%d, should be %d\n", siblings, cpu, smp_num_siblings);
+			smp_num_siblings = siblings;
+		}
 	}
 
 	if (nmi_watchdog == NMI_LOCAL_APIC)
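
For readers without the tree at hand, the effect of the patch can be modelled
in plain C. This is a simplified sketch, not the smpboot.c code: topo_sketch,
count_siblings and fixup_num_siblings are hypothetical names, a plain int
array stands in for cpu_callout_map, and the printk is left out.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 8

/* Simplified stand-in for the boot-time topology tables. */
struct topo_sketch {
    int booted[SKETCH_NR_CPUS];       /* stand-in for cpu_callout_map */
    int phys_proc_id[SKETCH_NR_CPUS]; /* physical package ID per CPU */
};

/* Count the booted CPUs that share a package with 'cpu' (itself included). */
static int count_siblings(const struct topo_sketch *t, int cpu)
{
    int i, siblings = 0;

    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    for (i = 0; i < SKETCH_NR_CPUS; i++) {
        if (!t->booted[i])
            continue;
        if (t->phys_proc_id[i] == t->phys_proc_id[cpu])
            siblings++;
    }
    return siblings;
}

/*
 * Mirror of the patched check: when the number of siblings actually found
 * differs from the advertised count, trust what was found.
 */
static int fixup_num_siblings(const struct topo_sketch *t, int num_siblings)
{
    int cpu;

    for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
        int siblings;

        if (!t->booted[cpu])
            continue;
        siblings = num_siblings > 1 ? count_siblings(t, cpu) : 1;
        if (siblings != num_siblings)
            num_siblings = siblings; /* the line the patch adds */
    }
    return num_siblings;
}
```

With a single booted CPU and an advertised count of 2, the sketch returns 1,
so later code that sizes per-package structures from the count no longer
expects a second sibling that was never brought up.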


* Re: 2.6.7-rc1-bk: SMT scheduler bug / crashes on kernel boot
       [not found] ` <1085569838.2666.60.camel@imp.csi.cam.ac.uk>
@ 2004-05-26 11:28   ` Nick Piggin
       [not found]     ` <1085571285.2666.75.camel@imp.csi.cam.ac.uk>
  0 siblings, 1 reply; 7+ messages in thread
From: Nick Piggin @ 2004-05-26 11:28 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: mingo, lkml

Anton Altaparmakov wrote:

>The kernel shows a warning about the number of siblings being 2 but only
>1 being detected.  Perhaps this is the cause of the problem.  Even if

Probably this is the cause of the problem.

What is the exact message?


* Re: 2.6.7-rc1-bk: SMT scheduler bug / crashes on kernel boot
       [not found] <1085568719.2666.53.camel@imp.csi.cam.ac.uk>
@ 2004-05-26 11:13 ` Nick Piggin
       [not found]   ` <1085572902.2666.105.camel@imp.csi.cam.ac.uk>
       [not found] ` <1085569838.2666.60.camel@imp.csi.cam.ac.uk>
  1 sibling, 1 reply; 7+ messages in thread
From: Nick Piggin @ 2004-05-26 11:13 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: mingo, lkml

Anton Altaparmakov wrote:
> Hi,
> 
> Kernel 2.6.7-rc1-bk crashes on boot with a NULL pointer dereference. 
> The kernel is running under VMware if that matters but I don't think it
> should.  It was working fine with 2.6.6-rc3-bk kernels.
> 
> I am afraid the only way I could capture the crash was to capture the
> vmware screen into a PNG image which is attached.  Maybe I need to set up
> some OCR software for the future...  (-;
> 
> The system running VMware is a P4 2.6GHz with Hyper-Threading enabled and
> /proc/cpuinfo shows two cpus:

OK, thanks for that. It would be quite helpful if you edit
kernel/sched.c and turn the line #undef SCHED_DOMAIN_DEBUG into
#define SCHED_DOMAIN_DEBUG, then compile a kernel with debugging
info enabled.

Boot again, and capture another screenshot of the oops. This will
hopefully include the SCHED_DOMAIN_DEBUG output.

Also run
	
	addr2line -e /path/to/vmlinux EIP

vmlinux is the file generated in the root directory of the source
tree after compilation. EIP is the EIP value printed by the Oops.

Thanks.

