LKML Archive on lore.kernel.org
* Likely race between sys_rt_sigtimedwait() and complete_signal()
@ 2011-02-10  7:32 Nikita V. Youshchenko
From: Nikita V. Youshchenko @ 2011-02-10  7:32 UTC
  To: linux-kernel; +Cc: Alexander Kaliadin, oishi.y

Hello linux-kernel.

Within the project we are working on, we are facing a "rare" situation
where setitimer()/sigwait()-based periodic task execution hangs. "Rare"
means once every several hours with a 1000 Hz timer.
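
For readers unfamiliar with the pattern, here is a minimal sketch of a
setitimer()/sigwait() periodic loop. It is not our application code: the
function name run_periodic_sigwait and its parameters are hypothetical,
and it uses sigprocmask() because the sketch is single-threaded (a
multi-threaded program would block SIGALRM with pthread_sigmask() in
every thread).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: waits for `iterations` SIGALRM ticks of an
 * ITIMER_REAL timer with the given period. Returns ticks seen, or -1. */
static int run_periodic_sigwait(long period_us, int iterations)
{
    sigset_t set, old;
    struct itimerval on, off;
    struct timespec zero = { 0, 0 };
    int ticks = 0;

    assert(period_us > 0 && period_us < 1000000);

    sigemptyset(&set);
    sigaddset(&set, SIGALRM);
    /* Block SIGALRM so it stays pending until sigwait() collects it. */
    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
        return -1;

    memset(&on, 0, sizeof on);
    on.it_interval.tv_usec = period_us;   /* reload value */
    on.it_value.tv_usec = period_us;      /* first expiry */
    if (setitimer(ITIMER_REAL, &on, NULL) != 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    while (ticks < iterations) {
        int sig;
        if (sigwait(&set, &sig) != 0)
            break;
        if (sig == SIGALRM)
            ticks++;                      /* periodic work would go here */
    }

    /* Disarm the timer and consume a possibly pending SIGALRM before
     * restoring the old mask, so the default action never fires. */
    memset(&off, 0, sizeof off);
    setitimer(ITIMER_REAL, &off, NULL);
    while (sigtimedwait(&set, NULL, &zero) == SIGALRM)
        ;
    sigprocmask(SIG_SETMASK, &old, NULL);
    return ticks;
}
```

Calling run_periodic_sigwait(1000, 5) waits for five ticks of a 1000 Hz
timer. The hang we see corresponds to the wait inside the loop never
returning.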

For the hung thread, cat /proc/pid/status shows

...
State:	S (sleeping)
...
SigPnd:	0000000000000000
ShdPnd:	0000000000002000
SigBlk:	0000000000000000
...

and SysRq - T shows

[<c015b1b0>] (__schedule+0x2fc/0x37c) from [<c015b7b8>] 
(schedule+0x1c/0x30)
[<c015b7b8>] (schedule+0x1c/0x30) from [<c015b8c4>] 
(schedule_timeout+0x18/0x1dc)
[<c015b8c4>] (schedule_timeout+0x18/0x1dc) from [<c004a084>] 
(sys_rt_sigtimedwait+0x1b4/0x288)
[<c004a084>] (sys_rt_sigtimedwait+0x1b4/0x288) from [<c001cf00>] 
(ret_fast_syscall+0x0/0x28)

All other threads have SIGALRM blocked as they should; looking
through /proc/X/status confirms this.

So for some reason, SIGALRM was successfully sent by the timer and its bit
was set in ShdPnd [I guess at the bottom of __send_signal()], but the
thread somehow still went to schedule() and did not wake.
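
The ShdPnd value can be checked by hand: in these /proc masks, bit
(n - 1) stands for signal n, so 0x2000 (bit 13) is signal 14, which is
SIGALRM on Linux. A small sketch of that decoding follows; the helper
names parse_status_mask and mask_has_signal are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse the hex string printed after "ShdPnd:" (or SigPnd/SigBlk). */
static uint64_t parse_status_mask(const char *hex)
{
    return (uint64_t)strtoull(hex, NULL, 16);
}

/* Bit (signo - 1) of the mask corresponds to signal number signo. */
static int mask_has_signal(uint64_t mask, int signo)
{
    assert(signo >= 1 && signo <= 64);
    return (int)((mask >> (signo - 1)) & 1u);
}
```

With the values from the report, mask_has_signal() is true for SIGALRM
in the ShdPnd mask and false for it in both SigPnd and SigBlk: the
signal is pending for the process, and the hung thread does not have it
blocked.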

I guess this is some sort of race between sys_rt_sigtimedwait() and 
complete_signal().

This is on an embedded system running a vendor 2.6.31-based kernel; moving
forward is unfortunately impossible because of hardware support issues.
However, I've looked through

git log -p HEAD..linus/master -- kernel/signal.c

and did not notice anything that could be related.

Unfortunately we don't currently have resources for further analysis,
especially since simple workarounds exist, such as switching to
timerfd-based periodic execution (which appears to work without hangs).

However, I guess the race we hit still exists in the current upstream
kernel, so maybe somebody on this mailing list would be interested in
looking at this?

Nikita


* Re: Likely race between sys_rt_sigtimedwait() and complete_signal()
  2011-04-09 13:45 ` Oleg Nesterov
@ 2011-04-09 19:44   ` Nikita V. Youshchenko
From: Nikita V. Youshchenko @ 2011-04-09 19:44 UTC
  To: Oleg Nesterov, Anders Ernevi
  Cc: Andrew Morton, Alexander Kaliadin, oishi.y, linux-kernel

Hello Oleg and all.
Thanks for looking into this.

Unfortunately I was never able to reproduce the hang in question myself. It
was only reproducible on the customer's side, and required hours of running
to happen.

I can only ask Anders [CCed] to test your fix, if possible. But I'm not
sure it is possible now that the problem has long been closed by switching
to timerfd-based periodic execution.

See other comments below ...

> Can't find the original email, replying to Andrew's fwd.
>
> On 04/07, Andrew Morton wrote:
> > Within the project we are working on, we are facing a "rare" situation
> > where setitimer()/sigwait()-based periodic task execution hangs.
> > "Rare" means once every several hours with a 1000 Hz timer.
> >
> > For the hung thread, cat /proc/pid/status shows
> >
> > ...
> > State:	S (sleeping)
> > ...
> > SigPnd:	0000000000000000
> > ShdPnd:	0000000000002000
> > SigBlk:	0000000000000000
> > ...
> >
> > and SysRq - T shows
> >
> > [<c015b1b0>] (__schedule+0x2fc/0x37c) from [<c015b7b8>]
> > (schedule+0x1c/0x30)
> > [<c015b7b8>] (schedule+0x1c/0x30) from [<c015b8c4>]
> > (schedule_timeout+0x18/0x1dc)
> > [<c015b8c4>] (schedule_timeout+0x18/0x1dc) from [<c004a084>]
> > (sys_rt_sigtimedwait+0x1b4/0x288)
> > [<c004a084>] (sys_rt_sigtimedwait+0x1b4/0x288) from [<c001cf00>]
> > (ret_fast_syscall+0x0/0x28)
>
> Is this thread the group leader?

I don't know.
Anders, could you please answer?
"Group leader" is the main thread, the one that entered the application's
main() function on startup.

> > All other threads have SIGALRM blocked as they should; looking
> > through /proc/X/status confirms this.
>
> Did they ever have SIGALRM unblocked?

As far as I understand, all threads are created, and their signal masks
set, at application startup, and the hang happens when the application has
already been running for a long time (and has executed thousands, if not
millions, of iterations successfully).

So, unless some libc or libGL routine plays hidden games with signal masks, 
SIGALRM should not become unblocked.

> > So for some reason, SIGALRM was successfully sent by the timer and its
> > bit was set in ShdPnd [I guess at the bottom of __send_signal()], but
> > the thread somehow still went to schedule() and did not wake.
>
> Thanks for the detailed report.
>
> There is an old, ancient problem which I constantly forget to fix.
> It _can_ perfectly explain the hang, at least in theory. I'll try
> to make the patch on Monday.
>
> In short: if a thread T runs with SIGALRM unblocked while another
> thread sleeps in sigtimedwait(), and then T blocks SIGALRM, the
> signal can be "lost" as above.
>
> Does your application do something like this? If not, then there
> is another problem.

As I've written above, I don't think this is the case - although I can't be 
100% sure.

> > This is on an embedded system running a vendor 2.6.31-based kernel;
> > moving forward is unfortunately impossible because of hardware support
> > issues.
>
> If I make the patch for 2.6.31, any chance you can test it?

See above. I can't. Maybe Anders can.

> > However, I guess the race we hit still exists in the current upstream
> > kernel,
>
> Yes, this is possible. OTOH, the bug can be anywhere, not necessarily in
> signal.c, and it might be already fixed.

Well, I wrote the original report just because I thought the results of my
analysis could be used by kernel developers to fix a probably-still-existing
issue that is hard to reproduce.

Thanks for looking into this anyway.

Nikita


* Re: Likely race between sys_rt_sigtimedwait() and complete_signal()
       [not found] <20110407141215.46d0b930.akpm@linux-foundation.org>
@ 2011-04-09 13:45 ` Oleg Nesterov
  2011-04-09 19:44   ` Nikita V. Youshchenko
From: Oleg Nesterov @ 2011-04-09 13:45 UTC
  To: Andrew Morton, Nikita V. Youshchenko, Alexander Kaliadin, oishi.y
  Cc: linux-kernel

Can't find the original email, replying to Andrew's fwd.

On 04/07, Andrew Morton wrote:
>
> Within the project we are working on, we are facing a "rare" situation
> where setitimer()/sigwait()-based periodic task execution hangs. "Rare"
> means once every several hours with a 1000 Hz timer.
>
> For the hung thread, cat /proc/pid/status shows
>
> ...
> State:	S (sleeping)
> ...
> SigPnd:	0000000000000000
> ShdPnd:	0000000000002000
> SigBlk:	0000000000000000
> ...
>
> and SysRq - T shows
>
> [<c015b1b0>] (__schedule+0x2fc/0x37c) from [<c015b7b8>]
> (schedule+0x1c/0x30)
> [<c015b7b8>] (schedule+0x1c/0x30) from [<c015b8c4>]
> (schedule_timeout+0x18/0x1dc)
> [<c015b8c4>] (schedule_timeout+0x18/0x1dc) from [<c004a084>]
> (sys_rt_sigtimedwait+0x1b4/0x288)
> [<c004a084>] (sys_rt_sigtimedwait+0x1b4/0x288) from [<c001cf00>]
> (ret_fast_syscall+0x0/0x28)

Is this thread the group leader?

> All other threads have SIGALRM blocked as they should; looking
> through /proc/X/status confirms this.

Did they ever have SIGALRM unblocked?

> So for some reason, SIGALRM was successfully sent by the timer and its
> bit was set in ShdPnd [I guess at the bottom of __send_signal()], but
> the thread somehow still went to schedule() and did not wake.

Thanks for the detailed report.

There is an old, ancient problem which I constantly forget to fix.
It _can_ perfectly explain the hang, at least in theory. I'll try
to make the patch on Monday.



In short: if a thread T runs with SIGALRM unblocked while another
thread sleeps in sigtimedwait(), and then T blocks SIGALRM, the
signal can be "lost" as above.

Does your application do something like this? If not, then there
is another problem.
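
To make the question concrete, here is a minimal sketch of the
configuration that stays clear of the scenario above: SIGALRM is blocked
before any thread is created, so every thread inherits the blocked mask
and only the waiter ever consumes the signal. This is not a reproducer
for the race. The names waiter_thread and demo_blocked_everywhere are
hypothetical, and kill() merely stands in for a process-directed timer
expiry.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical waiter: the only thread that ever consumes SIGALRM. */
static void *waiter_thread(void *arg)
{
    int *received = arg;
    sigset_t set;
    struct timespec timeout = { 2, 0 };

    sigemptyset(&set);
    sigaddset(&set, SIGALRM);
    /* SIGALRM is already blocked here: the mask is inherited. */
    *received = sigtimedwait(&set, NULL, &timeout);
    return NULL;
}

/* Returns the signal number seen by the waiter, or -1 on failure. */
static int demo_blocked_everywhere(void)
{
    sigset_t set, old;
    pthread_t tid;
    int received = -1;

    sigemptyset(&set);
    sigaddset(&set, SIGALRM);
    /* Block first, so no thread ever runs with SIGALRM unblocked. */
    if (pthread_sigmask(SIG_BLOCK, &set, &old) != 0)
        return -1;
    if (pthread_create(&tid, NULL, waiter_thread, &received) != 0) {
        pthread_sigmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    kill(getpid(), SIGALRM);   /* process-directed, like a timer expiry */
    pthread_join(tid, NULL);

    /* Restore the mask only if the signal was actually consumed. */
    if (received == SIGALRM)
        pthread_sigmask(SIG_SETMASK, &old, NULL);
    return received;
}
```

If some thread instead runs with SIGALRM unblocked for a while and
blocks it later, the application is in the situation I am asking about.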



> This is on an embedded system running a vendor 2.6.31-based kernel; moving
> forward is unfortunately impossible because of hardware support issues.

If I make the patch for 2.6.31, any chance you can test it?

> However, I guess the race we hit still exists in the current upstream
> kernel,

Yes, this is possible. OTOH, the bug can be anywhere, not necessarily in
signal.c, and it might be already fixed.

Oleg.

