LKML Archive on lore.kernel.org
* State of kgdb on x86-64
@ 2008-01-14 19:09 Jan Kiszka
  2008-01-14 19:26 ` Jason Wessel
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Kiszka @ 2008-01-14 19:09 UTC
  To: Jason Wessel; +Cc: Linux Kernel Mailing List

Hi Jason,

what is the state of the kgdb git repos? What should work, what might be
broken?

I'm asking as today I tried to get kgdb up and running on a 4-way x86-64
Xeon box with both 2.6.24-rc6 and -rc7. Once kgdb is enabled in .config,
the boot stops early with this panic:

Kernel panic - not syncing: Attempted to kill init!

Would I have more success with 2.6.23? Has x86-64 ever been tested and
found working? I can dig deeper into the above issue, but before starting
blindly, I would like to assess whether there could be more issues ahead
on this arch.

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


* Re: State of kgdb on x86-64
  2008-01-14 19:09 State of kgdb on x86-64 Jan Kiszka
@ 2008-01-14 19:26 ` Jason Wessel
  2008-01-14 19:59   ` Jan Kiszka
       [not found]   ` <478C786A.3090709@siemens.com>
  0 siblings, 2 replies; 6+ messages in thread
From: Jason Wessel @ 2008-01-14 19:26 UTC
  To: Jan Kiszka; +Cc: Linux Kernel Mailing List

It was working at the point that I tested it with the 2.6.24-rc5 on
x86_64.  However I suspect my kernel config may differ drastically from
what you are using.

Without any other context provided than the generic message, it is hard
to know what might have happened. 

Jason.


* Re: State of kgdb on x86-64
  2008-01-14 19:26 ` Jason Wessel
@ 2008-01-14 19:59   ` Jan Kiszka
       [not found]   ` <478C786A.3090709@siemens.com>
  1 sibling, 0 replies; 6+ messages in thread
From: Jan Kiszka @ 2008-01-14 19:59 UTC
  To: Jason Wessel; +Cc: Jan Kiszka, Linux Kernel Mailing List


Jason Wessel wrote:
> It was working at the point that I tested it with the 2.6.24-rc5 on
> x86_64.  However I suspect my kernel config may differ drastically from
> what you are using.

Yeah, that might be the case. The only thing I tried to vary so far was 
applying maxcpus=1, but without success.

> 
> Without any other context provided than the generic message, it is hard
> to know what might have happened. 

OK, I will send my .config over tomorrow or on Wednesday; it's out of
reach ATM. And if you have a reference .config, I would appreciate it if
you could send it over.

Thanks,
Jan




* Re: State of kgdb on x86-64
       [not found]     ` <478CB724.3000900@windriver.com>
@ 2008-01-15 18:44       ` Jan Kiszka
  2008-01-16  4:10         ` Jason Wessel
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Kiszka @ 2008-01-15 18:44 UTC
  To: Jason Wessel; +Cc: Jan Kiszka, Linux Kernel Mailing List


Jason Wessel wrote:
> Jan Kiszka wrote:
>> Jason Wessel wrote:
>>   
>>> It was working at the point that I tested it with the 2.6.24-rc5 on
>>> x86_64.  However I suspect my kernel config may differ drastically from
>>> what you are using.
>>>
>>> Without any other context provided than the generic message, it is hard
>>> to know what might have happened. 
>>>     
>> Here is the promised .config. I could also dig out the backtrace of the
>> panic as kgdb sees it if that helps, just let me know.
>>
>> Jan
>>
>>   
> The backtrace might be very telling as to what happened.  More
> information is always better than less :-)
> 

My primary test box is again out of reach, but meanwhile I was able to
reproduce some kind of problem under QEMU, and that one at least is
triggered by SMP. With only one CPU, everything is apparently fine. Once
QEMU is booted with "-smp 2", this happens:

(gdb) tar remote /dev/pts/6
Remote debugging using /dev/pts/6
Not all CPUs have been synced for KGDB
breakpoint () at kernel/kgdb.c:1895
1895            wmb(); /* Sync point after breakpoint */
(gdb) c
Continuing.
Not all CPUs have been synced for KGDB
[New Thread 32769]

Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 32769]
0xffffffff8020adb7 in default_idle () at include/asm/irqflags_64.h:140
140             __asm__ __volatile__("sti; hlt" : : : "memory");
(gdb) bt
#0  0xffffffff8020adb7 in default_idle () at include/asm/irqflags_64.h:140
#1  0xffffffff8020ae65 in cpu_idle () at arch/x86/kernel/process_64.c:225
#2  0xffffffff8021ccb9 in start_secondary () at arch/x86/kernel/smpboot_64.c:375
#3  0x0000000000000000 in ?? ()
(gdb)

The problem seems to be related to continuing execution on SMP boxes.
I'm able to boot my box if I leave kgdb unattached. But when I attach
later and continue execution, I get the same crash. Any idea what goes
wrong, or any suggestion where to start digging? Maybe at "Not all CPUs
have been synced"?

Jan




* Re: State of kgdb on x86-64
  2008-01-15 18:44       ` Jan Kiszka
@ 2008-01-16  4:10         ` Jason Wessel
  2008-01-16 15:44           ` Jan Kiszka
  0 siblings, 1 reply; 6+ messages in thread
From: Jason Wessel @ 2008-01-16  4:10 UTC
  To: Jan Kiszka; +Cc: Jan Kiszka, Linux Kernel Mailing List

Generally speaking, when you get an error that the CPUs have not been
synced, it means that the IPI which was sent to all the non-master
processors failed.  I took a quick look and it appears that the DIE_TRAP
is occurring after kgdb sends the IPI to the non-master cores with the call:

    send_IPI_allbutself(APIC_DM_NMI);

In prior kernels that ultimately resulted in an NMI trap.  I am not sure
of the cause of the DIE_TRAP as a result of the IPI.  For now, if you
add the statement "case DIE_TRAP:" right before "case DIE_NMIWATCHDOG:"
in arch/x86/kernel/kgdb_64.c, it will sync the processors.  However, the
kernel should not be trapping with this error code from the IPI event.  I
suspect there has been some kind of change to the way the IPI/NMI
handling is done in the latest kernels.
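
To make the shape of that workaround concrete, here is a minimal
userspace sketch of how an added case label falls through to an existing
handler.  It is only a model, not the code in arch/x86/kernel/kgdb_64.c:
the enum, its values, and model_event_parks_cpu() are hypothetical names
invented for this illustration, and only the DIE_TRAP and
DIE_NMIWATCHDOG names come from the thread.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-ins for the kernel's die notifier event codes.
 * The names mirror DIE_TRAP and DIE_NMIWATCHDOG from the thread, but the
 * enum and its values are invented for this sketch.
 */
enum model_die_event {
	MODEL_DIE_OOPS,
	MODEL_DIE_TRAP,
	MODEL_DIE_NMIWATCHDOG,
};

/*
 * Returns true when the event would park the current CPU for the debugger.
 * debugger_active models "another CPU has entered the debugger and is
 * rounding up the remaining CPUs".
 */
static bool model_event_parks_cpu(enum model_die_event event,
				  bool debugger_active)
{
	switch (event) {
	case MODEL_DIE_TRAP:		/* added label: falls through */
	case MODEL_DIE_NMIWATCHDOG:	/* label that already existed */
		return debugger_active;
	default:
		return false;
	}
}
```

In this model, a MODEL_DIE_TRAP event parks the CPU only while the
debugger is active, exactly like MODEL_DIE_NMIWATCHDOG.  As noted above,
this would only mask the symptom, since the IPI should not be arriving
with that code in the first place.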

Jason.
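
The roundup failure described in this message can be pictured with a
small, deterministic C model.  This is a sketch under stated assumptions
rather than kgdb's implementation: struct roundup_model, MODEL_MAX_CPUS,
and model_roundup() are hypothetical, and "delivery" of the IPI is
reduced to a per-CPU flag.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_CPUS 8	/* arbitrary limit for this sketch */

/* Hypothetical model of one debugger roundup; not a kernel structure. */
struct roundup_model {
	size_t nr_cpus;
	bool ipi_delivered[MODEL_MAX_CPUS];	/* did the IPI reach this CPU? */
	bool parked[MODEL_MAX_CPUS];		/* did this CPU check in? */
};

/*
 * Simulates the master CPU signalling every other CPU and then counting
 * how many of them failed to check in.  A non-zero result corresponds to
 * the "Not all CPUs have been synced" condition in this model.
 */
static size_t model_roundup(struct roundup_model *m, size_t master)
{
	size_t missing = 0;

	for (size_t cpu = 0; cpu < m->nr_cpus; cpu++) {
		if (cpu == master)
			continue;	/* "allbutself": skip the sender */
		/* A CPU parks only if the IPI actually reached it. */
		m->parked[cpu] = m->ipi_delivered[cpu];
		if (!m->parked[cpu])
			missing++;
	}
	return missing;
}
```

With two CPUs and the IPI lost on CPU 1, model_roundup() called with
master 0 returns 1; once delivery succeeds it returns 0.  This matches
the "-smp 2" case in the thread only in the loose sense that a single
undelivered IPI is enough to leave the roundup incomplete.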


* Re: State of kgdb on x86-64
  2008-01-16  4:10         ` Jason Wessel
@ 2008-01-16 15:44           ` Jan Kiszka
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Kiszka @ 2008-01-16 15:44 UTC
  To: Jason Wessel; +Cc: Jan Kiszka, Linux Kernel Mailing List

Things I found out so far:

 - delivery of this IPI under QEMU somehow doesn't work
   (I would dare to say: at emulation level. But I'm not sure yet.)

 - the breakdown of my Xeon box with kgdb is a separate issue

 - stuffing my Xeon kernel into QEMU triggers the original bug there
   too, even under UP => debugging with QEMU should be feasible

But I'm first trying to identify the related config switch that makes it
pop up, because this mostly means compiling and quick testing (I can't
spend my full time on it yet). I will let you know of any news, and I
would appreciate hearing from you if you have any updates.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
