* panic() logic
@ 2008-10-17 21:48 Ani Sinha
  2008-10-17 21:50 ` Ani Sinha
  2008-10-18  7:56 ` Andi Kleen
  0 siblings, 2 replies; 5+ messages in thread
From: Ani Sinha @ 2008-10-17 21:48 UTC (permalink / raw)
  To: linux-kernel, torvalds; +Cc: kernel

Hi All:

I noticed an issue with the panic() firing on a back core in SMP
lately. We are mostly working on mips architectures but it might
affect other archs as well. Therefore, I am putting forward my
thoughts and comments to the whole Linux community. In the following,
by front core I mean core#0 and by back core I mean other cores.

The panic() call does a smp_send_stop() pretty early in the call
process. There is an issue with this:

smp_send_stop() basically marks all the other cores as 'down' and
updates the cpu bitmap. One implication of this is that you cannot
send an IPI to the other cores later on (smp_call_function() does a
'for_each_online_cpu'). This makes sense since you should not be
allowed to do anything on a down cpu. But what if a particular
architecture had logic to do specific things on the front core and
other things on the back cores as part of a 'graceful reboot' process?
For example, the arch dependent emergency_restart() might try to
send an IPI to the front core to do a graceful reboot and simply place
the others in an infinite loop. As a concrete example, the mips sibyte
processor has logic in its arch dependent restart code to call cfe_exit()
on the front core and run an infinite loop on the back cores. It does this
through an IPI, which obviously does not succeed because of the early
smp_send_stop().

So, my proposal is: can we just run panic() on the front core and put
the back cores in a loop right away? I mean, can we do this, in pseudo
code?

panic()
{
	/* do stuff that must be done on the current core */
	on_each_cpu(continue_panic, ...);
}

continue_panic()
{
	if (!smp_processor_id()) {
		smp_send_stop();
		/* rest of existing panic() logic */
	} else


* Re: panic() logic
  2008-10-17 21:48 panic() logic Ani Sinha
@ 2008-10-17 21:50 ` Ani Sinha
  2008-10-18  7:56 ` Andi Kleen
  1 sibling, 0 replies; 5+ messages in thread
From: Ani Sinha @ 2008-10-17 21:50 UTC (permalink / raw)
  To: linux-kernel, torvalds; +Cc: kernel

Hi All:

I noticed an issue with the panic() firing on a back core in SMP
lately. We are mostly working on mips architectures but it might
affect other archs as well. Therefore, I am putting forward my
thoughts and comments to the whole Linux community. In the following,
by front core I mean core#0 and by back core I mean other cores.

The panic() call does a smp_send_stop() pretty early in the call
process. There is an issue with this:

smp_send_stop() basically marks all the other cores as 'down' and
updates the cpu bitmap. One implication of this is that you cannot
send an IPI to the other cores later on (smp_call_function() does a
'for_each_online_cpu'). This makes sense since you should not be
allowed to do anything on a down cpu. But what if a particular
architecture had logic to do specific things on the front core and
other things on the back cores as part of a 'graceful reboot' process?
For example, the arch dependent emergency_restart() might try to
send an IPI to the front core to do a graceful reboot and simply place
the others in an infinite loop. As a concrete example, the mips sibyte
processor has logic in its arch dependent restart code to call cfe_exit()
on the front core and run an infinite loop on the back cores. It does this
through an IPI, which obviously does not succeed because of the early
smp_send_stop().
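
The failure described above can be reproduced in a small user-space model. This is only a sketch under stated assumptions: the online bitmap, the model_* functions and the four-CPU layout are all hypothetical stand-ins, not kernel code, and the "IPI" is just a direct function call. It shows that once the panicking back core has cleared every other CPU from the online mask, a later cross-call that walks only online CPUs can no longer reach core 0.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical stand-in for the kernel's online-CPU bitmap. */
static unsigned int online_mask = (1u << NR_CPUS) - 1;

/* CPU on which the modelled restart handler ran, or -1 if it never ran. */
static int restart_ran_on = -1;

bool model_cpu_online(int cpu)
{
	return (online_mask & (1u << cpu)) != 0;
}

/* Model of smp_send_stop(): every CPU except the caller goes offline. */
void model_send_stop(int self)
{
	assert(self >= 0 && self < NR_CPUS);
	online_mask = 1u << self;
}

/*
 * Model of a cross-CPU call that only walks online CPUs.
 * Returns the number of CPUs the "IPI" actually reached.
 */
int model_call_function(int self, void (*fn)(int cpu))
{
	int reached = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == self || !model_cpu_online(cpu))
			continue;
		fn(cpu);
		reached++;
	}
	return reached;
}

/* Model of the arch restart handler: only core 0 may call into firmware. */
void model_restart_handler(int cpu)
{
	if (cpu == 0)
		restart_ran_on = cpu;
}

/* Current ordering: stop the other cores first, attempt the restart later. */
int model_panic(int self)
{
	model_send_stop(self);
	return model_call_function(self, model_restart_handler);
}
```

Calling model_panic(1) in this model reaches no CPU at all, so the handler never runs on core 0, which corresponds to the hang described above.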

So, my proposal is: can we just run panic() on the front core and put
the back cores in a loop right away? I mean, can we do this, in pseudo
code?

panic()
{
	/* do stuff that must be done on the current core */
	on_each_cpu(continue_panic, ...);
	return;
}

continue_panic()
{
	if (!smp_processor_id()) {
		/* front core: stop the others, then finish the panic */
		smp_send_stop();
		/* rest of existing panic() logic */
	} else {
		/* back cores: spin forever */
		for (;;)
			;
	}
}

Is this going to be okay for all archs?
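
For readers who want to step through the proposed ordering outside the kernel, here is a minimal user-space sketch. Everything in it is an assumption made for illustration: the state array, the model_* names and the four-CPU layout are hypothetical, and the fan-out is a plain sequential loop, so it does not model the concurrency of a real on_each_cpu() or the timing between core 0's smp_send_stop() and the other cores entering their loop.

```c
#include <assert.h>

#define NR_CPUS 4

enum cpu_state { CPU_RUNNING, CPU_SPINNING, CPU_DID_RESTART };

/* Hypothetical per-CPU outcome, indexed by CPU number. */
enum cpu_state state[NR_CPUS];

/* Model of continue_panic(): core 0 finishes the panic, the rest spin. */
void model_continue_panic(int cpu)
{
	if (cpu == 0)
		state[cpu] = CPU_DID_RESTART; /* smp_send_stop() + rest of panic() */
	else
		state[cpu] = CPU_SPINNING;    /* stands in for the for (;;) loop */
}

/* Model of the proposed panic(): local work first, then fan out to all CPUs. */
void model_panic_proposed(int panicking_cpu)
{
	assert(panicking_cpu >= 0 && panicking_cpu < NR_CPUS);
	/* ... minimal local work, e.g. printing the message, would go here ... */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		model_continue_panic(cpu);
}
```

In this model the restart path ends up on core 0 no matter which core panicked, which is the property the pseudo code is after.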

Cheers!

Ani


* Re: panic() logic
  2008-10-17 21:48 panic() logic Ani Sinha
  2008-10-17 21:50 ` Ani Sinha
@ 2008-10-18  7:56 ` Andi Kleen
  2008-10-19  2:44   ` Anirban Sinha
  1 sibling, 1 reply; 5+ messages in thread
From: Andi Kleen @ 2008-10-18  7:56 UTC (permalink / raw)
  To: Ani Sinha; +Cc: linux-kernel, torvalds

"Ani Sinha" <kernel@anirban.org> writes:

> I noticed an issue with the panic() firing on a back core in SMP
> lately. We are mostly working on mips architectures but it might
> affect other archs as well. Therefore, I am putting forward my
> thoughts and comments to the whole Linux community. In the following,
> by front core I mean core#0 and by back core I mean other cores.

Why exactly is the "front core" special? 

> smp_send_stop() basically marks all the other cores as 'down' and
> updates the cpu bitmap. One implication of this is that you cannot
> send an IPI to the other cores later on (smp_call_function() does a
> 'for_each_online_cpu'). This makes sense since you should not be
> allowed to do anything on a down cpu. But what if a particular
> architecture had logic to do specific things on the front core and
> other things on the back cores as part of a 'graceful reboot' process?

Is that logic in Linux or in the platform?

Normally it's best to not rely on any specific CPU for panic.
What do you do when that CPU is so broken that it cannot
process IPIs anymore?

-Andi

-- 
ak@linux.intel.com


* Re: panic() logic
  2008-10-18  7:56 ` Andi Kleen
@ 2008-10-19  2:44   ` Anirban Sinha
  2008-10-19  5:07     ` Anirban Sinha
  0 siblings, 1 reply; 5+ messages in thread
From: Anirban Sinha @ 2008-10-19  2:44 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, Linus Torvalds, Anirban Sinha

Hi Andi:

Thanks for replying.
On 18-Oct-08, at 12:56 AM, Andi Kleen wrote:

> "Ani Sinha" <kernel@anirban.org> writes:
>
>> I noticed an issue with the panic() firing on a back core in SMP
>> lately. We are mostly working on mips architectures but it might
>> affect other archs as well. Therefore, I am putting forward my
>> thoughts and comments to the whole Linux community. In the following,
>> by front core I mean core#0 and by back core I mean other cores.
>
> Why exactly is the "front core" special?

I am not exactly a firmware (CFE) guy but if I understand it  
correctly, all the interrupts are tied to the front core and  
cfe_exit() can only be called from the front core. I have written to  
the other guy who specializes in the CFE area and I will get back to  
you when I get an answer from him.


>
>
>> smp_send_stop() basically marks all the other cores as 'down' and
>> updates the cpu bitmap. One implication of this is that you cannot
>> send an IPI to the other cores later on (smp_call_function() does a
>> 'for_each_online_cpu'). This makes sense since you should not be
>> allowed to do anything on a down cpu.


This part of the logic is in Linux and is arch independent.


>> But what if a particular
>> architecture had logic to do specific things on the front core and
>> other things on the back cores as part of a 'graceful reboot'
>> process?
>
> Is that logic in Linux or in the platform?


This logic is in arch specific code.

>
>
> Normally it's best to not rely on any specific CPU for panic.
> What do you do when that CPU is so broken that it cannot
> process IPIs anymore?


Agreed. That is why my pseudo code has a block (a comment, really)
telling you to do the absolute bare minimum that must be done in a
panic situation on the current core (without relying on IPIs to
succeed on other cores). What this bare minimum will be is a matter
of debate. Getting a message out to the console saying that something
bad has occurred (with details of the crash) can perhaps be a part of
that minimum hunk of code.

Currently, the arch independent logic defeats the main purpose of the
arch dependent emergency_restart() function, which is to restart the
system. If a panic occurs on a back core, the kernel halts with the
message "rebooting in 5 sec" and someone has to physically press the
reset button. In the vast majority of cases, we have a perfectly
sane and functional front core and we are just not able to gracefully
reboot the system because we are limited by the way panic() handles
things. If there are other archs that do a similar specific
operation on the front core as a part of 'emergency restart', they
are all defeated.

All I am trying to say is that perhaps there is a window of
possibility where we can handle a kernel panic better. I am not saying
that we should rearrange code just to accommodate mips archs,
but if it can be done without much pain and objection, then let's just
do it.

Thanks.

Ani



>
>
> -Andi
>
> -- 
> ak@linux.intel.com



* Re: panic() logic
  2008-10-19  2:44   ` Anirban Sinha
@ 2008-10-19  5:07     ` Anirban Sinha
  0 siblings, 0 replies; 5+ messages in thread
From: Anirban Sinha @ 2008-10-19  5:07 UTC (permalink / raw)
  To: Andi Kleen, linux-kernel, Linus Torvalds, Anirban Sinha; +Cc: Anirban Sinha

Hi All:

There is another way to solve this issue. Why don't the
archs take the responsibility of shutting down the cores? The actual
power down mechanism is arch dependent anyway, so I guess it can be
made a part of emergency_restart(). The arch independent kernel
code will then simply do the necessary arch independent things to
handle the panic and call emergency_restart() to do the rest of the
arch specific stuff, including powering down the cores.
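
A rough user-space sketch of that division of labour follows. It is not kernel code: struct arch_restart_ops, the step trace and the model_* functions are hypothetical names invented for illustration, and the arch hook only records what it would do. The point it illustrates is the ordering: the generic panic path no longer stops the other cores itself, and the arch hook does its core-specific work before stopping them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of the order in which steps ran. */
enum step { STEP_GENERIC_PANIC, STEP_ARCH_FRONT_CORE, STEP_ARCH_STOP_CPUS };

enum step trace[8];
size_t trace_len;

void record(enum step s)
{
	assert(trace_len < sizeof(trace) / sizeof(trace[0]));
	trace[trace_len++] = s;
}

/* Hypothetical arch hook table; the kernel has no struct by this name. */
struct arch_restart_ops {
	void (*emergency_restart)(void);
};

/* Arch side: do core-specific work while IPIs still work, then stop the rest. */
void model_arch_emergency_restart(void)
{
	record(STEP_ARCH_FRONT_CORE); /* e.g. ask core 0 to call into firmware */
	record(STEP_ARCH_STOP_CPUS);  /* stopping cores is now the arch's job */
}

/* Generic side: arch independent work only, no smp_send_stop() here. */
void model_generic_panic(const struct arch_restart_ops *ops)
{
	record(STEP_GENERIC_PANIC);
	ops->emergency_restart();
}
```

An arch with no special front core could supply a hook that stops the other cores first and then restarts locally, so the generic code would not need to know the difference.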

How does this sound?

Ani

On 18-Oct-08, at 7:44 PM, Anirban Sinha wrote:

> Hi Andi:
>
> Thanks for replying.
> On 18-Oct-08, at 12:56 AM, Andi Kleen wrote:
>
>> "Ani Sinha" <kernel@anirban.org> writes:
>>
>>> I noticed an issue with the panic() firing on a back core in SMP
>>> lately. We are mostly working on mips architectures but it might
>>> affect other archs as well. Therefore, I am putting forward my
>>> thoughts and comments to the whole Linux community. In the following,
>>> by front core I mean core#0 and by back core I mean other cores.
>>
>> Why exactly is the "front core" special?
>
> I am not exactly a firmware (CFE) guy but if I understand it  
> correctly, all the interrupts are tied to the front core and  
> cfe_exit() can only be called from the front core. I have written to  
> the other guy who specializes in the CFE area and I will get back to  
> you when I get an answer from him.
>
>
>>
>>
>>> smp_send_stop() basically marks all the other cores as 'down' and
>>> updates the cpu bitmap. One implication of this is that you cannot
>>> send an IPI to the other cores later on (smp_call_function() does a
>>> 'for_each_online_cpu'). This makes sense since you should not be
>>> allowed to do anything on a down cpu.
>
>
> This part of the logic is in Linux and is arch independent.
>
>
>>> But what if a particular
>>> architecture had logic to do specific things on the front core and
>>> other things on the back cores as part of a 'graceful reboot'
>>> process?
>>
>> Is that logic in Linux or in the platform?
>
>
> This logic is in arch specific code.
>
>>
>>
>> Normally it's best to not rely on any specific CPU for panic.
>> What do you do when that CPU is so broken that it cannot
>> process IPIs anymore?
>
>
> Agreed. That is why my pseudo code has a block (a comment, really)
> telling you to do the absolute bare minimum that must be done in a
> panic situation on the current core (without relying on IPIs to
> succeed on other cores). What this bare minimum will be is a matter
> of debate. Getting a message out to the console saying that something
> bad has occurred (with details of the crash) can perhaps be a part of
> that minimum hunk of code.
>
> Currently, the arch independent logic defeats the main purpose of the
> arch dependent emergency_restart() function, which is to restart the
> system. If a panic occurs on a back core, the kernel halts with the
> message "rebooting in 5 sec" and someone has to physically press the
> reset button. In the vast majority of cases, we have a perfectly
> sane and functional front core and we are just not able to gracefully
> reboot the system because we are limited by the way panic() handles
> things. If there are other archs that do a similar specific
> operation on the front core as a part of 'emergency restart', they
> are all defeated.
>
> All I am trying to say is that perhaps there is a window of
> possibility where we can handle a kernel panic better. I am not saying
> that we should rearrange code just to accommodate mips archs,
> but if it can be done without much pain and objection, then let's just
> do it.
>
> Thanks.
>
> Ani
>
>
>
>>
>>
>> -Andi
>>
>> -- 
>> ak@linux.intel.com
>

