LKML Archive on lore.kernel.org
* 2.6.21.5 june 30th to july 1st date hang?
@ 2007-07-03 12:44 Fortier,Vincent [Montreal]
  2007-07-03 12:55 ` Fortier,Vincent [Montreal]
                   ` (4 more replies)
  0 siblings, 5 replies; 35+ messages in thread
From: Fortier,Vincent [Montreal] @ 2007-07-03 12:44 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Hi all,

All my servers and workstations running a 2.6.21.5 kernel hung exactly
when the date shifted from June 30th to July 1st.

On my monitoring system, every single station running a 2.6.21.5 kernel
stopped responding right after midnight, when the date shifted from June
30th to July 1st.  Stations still running 2.6.18 to 2.6.20.11, however,
worked flawlessly.

I first thought there had been a power outage, but two of my
servers (Dell PE 2950, dual quad-core) on UPS in our server room also
hung:
Jun 30 23:55:01 urpdev1 /USR/SBIN/CRON[31298]: (root) CMD ([ -x
/usr/lib/sysstat/sa1 ] && { [ -r "$DEFAULT" ] && . "$DEFAULT" ; [
"$ENABLED" = "true" ] && exec /usr/lib/sysstat/sa1; })
Jul  3 11:54:03 urpdev1 syslogd 1.4.1#17: restart.

I could not get anything on any of the 20+ consoles...  All the systems
hung at almost exactly the same time... when the date shifted from June
30th to July 1st in UTC ...?

Any clue, anyone?

- vin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 12:44 2.6.21.5 june 30th to july 1st date hang? Fortier,Vincent [Montreal]
@ 2007-07-03 12:55 ` Fortier,Vincent [Montreal]
  2007-07-03 13:05 ` Clemens Koller
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 35+ messages in thread
From: Fortier,Vincent [Montreal] @ 2007-07-03 12:55 UTC (permalink / raw)
  To: Linux Kernel Mailing List

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On behalf of 
> Fortier,Vincent [Montreal]
> Sent: 3 July 2007 08:44
> 
> Hi all,
> 
> All my servers and workstations running a 2.6.21.5 kernel 
> hung exactly when the date shifted from June 30th to July 1st.
> 
> On my monitoring system, every single station running a 
> 2.6.21.5 kernel stopped responding right after midnight, when 
> the date shifted from June 30th to July 1st.  Stations still 
> running 2.6.18 to 2.6.20.11, however, worked flawlessly.
> 
> I first thought there had been a power outage, but two 
> of my servers (Dell PE 2950, dual quad-core) on UPS in our 
> server room also
> hung:
> Jun 30 23:55:01 urpdev1 /USR/SBIN/CRON[31298]: (root) CMD ([ -x
> /usr/lib/sysstat/sa1 ] && { [ -r "$DEFAULT" ] && . "$DEFAULT" 
> ; [ "$ENABLED" = "true" ] && exec /usr/lib/sysstat/sa1; }) 
> Jul  3 11:54:03 urpdev1 syslogd 1.4.1#17: restart.
> 
> I could not get anything on any of the 20+ consoles...  All 
> the systems hung at almost exactly the same time... when the 
> date shifted from June 30th to July 1st in UTC ...?
> 
> Any clue, anyone?

Forgot to mention:

- All stations that failed were running a 2.6.21 kernel + CFS v18 (I don't have any stations running a plain 2.6.21 kernel, so I can't tell)
- Config file can be found at: http://linux-dev.qc.ec.gc.ca/kernel/debian/CONFIG-i686-2.6.21-005
- Kernels can be found at: http://linux-dev.qc.ec.gc.ca/kernel/debian/sarge/i686/2.6.21/

> 
> - vin
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 12:44 2.6.21.5 june 30th to july 1st date hang? Fortier,Vincent [Montreal]
  2007-07-03 12:55 ` Fortier,Vincent [Montreal]
@ 2007-07-03 13:05 ` Clemens Koller
  2007-07-03 14:51   ` Fortier,Vincent [Montreal]
  2007-07-03 13:56 ` Uli Luckas
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 35+ messages in thread
From: Clemens Koller @ 2007-07-03 13:05 UTC (permalink / raw)
  To: Fortier,Vincent [Montreal]; +Cc: Linux Kernel Mailing List

Hi, Vincent!

Fortier,Vincent [Montreal] wrote:
> Hi all,
> 
> All my servers and workstations running a 2.6.21.5 kernel hung exactly
> when the date shifted from June 30th to July 1st.
> 
> On my monitoring system, every single station running a 2.6.21.5 kernel
> stopped responding right after midnight, when the date shifted from June
> 30th to July 1st.  Stations still running 2.6.18 to 2.6.20.11, however,
> worked flawlessly.
> 
> I first thought there had been a power outage, but two of my
> servers (Dell PE 2950, dual quad-core) on UPS in our server room also
> hung:
> Jun 30 23:55:01 urpdev1 /USR/SBIN/CRON[31298]: (root) CMD ([ -x
> /usr/lib/sysstat/sa1 ] && { [ -r "$DEFAULT" ] && . "$DEFAULT" ; [
> "$ENABLED" = "true" ] && exec /usr/lib/sysstat/sa1; })
> Jul  3 11:54:03 urpdev1 syslogd 1.4.1#17: restart.
> 
> I could not get anything on any of the 20+ consoles...  All the systems
> hung at almost exactly the same time... when the date shifted from June
> 30th to July 1st in UTC ...?
> 
> Any clue, anyone?

No problems over here with plain 2.6.21.5 on x686

You could just reset the date back on one of these machines and
check the transition again... and see if it was really the kernel
that crashed... and check your cron configuration.

Regards,
-- 
Clemens Koller
__________________________________
R&D Imaging Devices
Anagramm GmbH
Rupert-Mayer-Straße 45/1
Linhof Werksgelände
D-81379 München
Tel.089-741518-50
Fax 089-741518-19
http://www.anagramm-technology.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 12:44 2.6.21.5 june 30th to july 1st date hang? Fortier,Vincent [Montreal]
  2007-07-03 12:55 ` Fortier,Vincent [Montreal]
  2007-07-03 13:05 ` Clemens Koller
@ 2007-07-03 13:56 ` Uli Luckas
  2007-07-03 13:59 ` Florian Attenberger
  2007-07-03 15:59 ` Chris Friesen
  4 siblings, 0 replies; 35+ messages in thread
From: Uli Luckas @ 2007-07-03 13:56 UTC (permalink / raw)
  To: LKML; +Cc: Fortier,Vincent [Montreal]

On Tuesday, 3. July 2007, Fortier,Vincent [Montreal] wrote:
> Hi all,
>
> All my servers and workstations running a 2.6.21.5 kernel hung exactly
> when the date shifted from June 30th to July 1st.
>
Same thing here on two machines with plain vanilla 2.6.21.(3/4), on Debian
testing & Debian unstable.

regards,
Uli

-- 

------- ROAD ...the handyPC Company - - -  ) ) )

Uli Luckas
Software Development

ROAD GmbH
Bennigsenstr. 14 | 12159 Berlin | Germany
fon: +49 (30) 230069 - 64 | fax: +49 (30) 230069 - 69
url: www.road.de

Amtsgericht Charlottenburg: HRB 96688 B
Managing directors: Hans-Peter Constien, Hubertus von Streit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 12:44 2.6.21.5 june 30th to july 1st date hang? Fortier,Vincent [Montreal]
                   ` (2 preceding siblings ...)
  2007-07-03 13:56 ` Uli Luckas
@ 2007-07-03 13:59 ` Florian Attenberger
  2007-07-03 14:20   ` Arne Georg Gleditsch
  2007-07-03 15:59 ` Chris Friesen
  4 siblings, 1 reply; 35+ messages in thread
From: Florian Attenberger @ 2007-07-03 13:59 UTC (permalink / raw)
  To: linux-kernel

On Tue, Jul 03, 2007 at 08:44:00AM -0400, Fortier,Vincent [Montreal] wrote:
> Hi all,
> 
> All my servers and workstations running a 2.6.21.5 kernel hung exactly
> when the date shifted from June 30th to July 1st.
> 
> On my monitoring system, every single station running a 2.6.21.5 kernel
> stopped responding right after midnight, when the date shifted from June
> 30th to July 1st.  Stations still running 2.6.18 to 2.6.20.11, however,
> worked flawlessly.
> 
> I first thought there had been a power outage, but two of my
> servers (Dell PE 2950, dual quad-core) on UPS in our server room also
> hung:
> Jun 30 23:55:01 urpdev1 /USR/SBIN/CRON[31298]: (root) CMD ([ -x
> /usr/lib/sysstat/sa1 ] && { [ -r "$DEFAULT" ] && . "$DEFAULT" ; [
> "$ENABLED" = "true" ] && exec /usr/lib/sysstat/sa1; })
> Jul  3 11:54:03 urpdev1 syslogd 1.4.1#17: restart.
> 
> I could not get anything on any of the 20+ consoles...  All the systems
> hung at almost exactly the same time... when the date shifted from June
> 30th to July 1st in UTC ...?
> 
> Any clue, anyone?
> 
> - vin
>
There was one 'special' event on that date:
syslog.2.gz:Jul  1 01:59:59 master kernel: Clock: inserting leap second 23:59:60 UTC


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 13:59 ` Florian Attenberger
@ 2007-07-03 14:20   ` Arne Georg Gleditsch
  2007-07-03 15:02     ` Florian Attenberger
  0 siblings, 1 reply; 35+ messages in thread
From: Arne Georg Gleditsch @ 2007-07-03 14:20 UTC (permalink / raw)
  To: Florian Attenberger; +Cc: linux-kernel

Florian Attenberger <valdyn@gmail.com> writes:
> there was one 'special' event at that date:
> syslog.2.gz:Jul  1 01:59:59 master kernel: Clock: inserting leap second
> 23:59:60 UTC

As far as I can tell, no leap second was due to be inserted on July 1st
this year.  Is the year set correctly on this box?

-- 
								Arne.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 13:05 ` Clemens Koller
@ 2007-07-03 14:51   ` Fortier,Vincent [Montreal]
  0 siblings, 0 replies; 35+ messages in thread
From: Fortier,Vincent [Montreal] @ 2007-07-03 14:51 UTC (permalink / raw)
  To: Clemens Koller
  Cc: Linux Kernel Mailing List, Uli Luckas, Florian Attenberger,
	Arne Georg Gleditsch

> -----Original Message-----
> From: Clemens Koller [mailto:clemens.koller@anagramm.de] 
> Sent: 3 July 2007 09:05
> 
> Hi, Vincent!
> 
> Fortier,Vincent [Montreal] wrote:
> > Hi all,
> > 
> > All my servers and workstations running a 2.6.21.5 kernel hung 
> > exactly when the date shifted from June 30th to July 1st.
> > 
> > I could not get anything on any of the 20+ consoles...  All the 
> > systems hung at almost exactly the same time... when the date shifted 
> > from June 30th to July 1st in UTC ...?
> > 
> > Any clue, anyone?
> 
> No problems over here with plain 2.6.21.5 on x686
> 
> You could just reset the date back on one of these machines 
> and check the transition again... and see if it was really 
> the kernel that crashed... and check your cron configuration.

I tried setting the clock back to 23:50 on June 30th, and the system did not hang when switching to July 1st.  I tried it a few times without any problems.  So I deactivated all NTP-related jobs, set the date back to June 30th 23:10 and rebooted the system.  Wait and see.

> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On behalf of Uli Luckas
> 
> Same thing here on two machines with plain vanilla 
> 2.6.21.(3/4), on debian testing & debian unstable.
> 

I am also using Debian... but Sarge 3.1.  There might be a relation there, unless somebody comes up with the same problem on another distribution.  At least it's not CFS related.

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On behalf of 
> Arne Georg Gleditsch
> 
> Florian Attenberger <valdyn@gmail.com> writes:
> > there was one 'special' event at that date:
> > syslog.2.gz:Jul  1 01:59:59 master kernel: Clock: inserting leap 
> > second 23:59:60 UTC
> 
> As far as I can tell, no leap second was due to be inserted 
> on July 1st this year.  Is the year set correctly on this box?
> 

All my servers/workstations are in sync using NTP... And yes, the year is set properly on all of them.

Regards,

- vin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 14:20   ` Arne Georg Gleditsch
@ 2007-07-03 15:02     ` Florian Attenberger
  2007-07-03 15:26       ` Arne Georg Gleditsch
  0 siblings, 1 reply; 35+ messages in thread
From: Florian Attenberger @ 2007-07-03 15:02 UTC (permalink / raw)
  To: Arne Georg Gleditsch; +Cc: linux-kernel

On Tue, Jul 03, 2007 at 04:20:17PM +0200, Arne Georg Gleditsch wrote:
> Florian Attenberger <valdyn@gmail.com> writes:
> > there was one 'special' event at that date:
> > syslog.2.gz:Jul  1 01:59:59 master kernel: Clock: inserting leap second
> > 23:59:60 UTC
> 
> As far as I can tell, no leap second was due to be inserted on July 1st
> this year.  Is the year set correctly on this box?
>
Yep, controlled by ntpd.
You're right: according to
ftp://hpiers.obspm.fr/iers/bul/bulc/bulletinc.33
that event shouldn't have been there.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 15:02     ` Florian Attenberger
@ 2007-07-03 15:26       ` Arne Georg Gleditsch
  2007-07-03 15:36         ` Fortier,Vincent [Montreal]
  2007-07-03 19:28         ` Chris Friesen
  0 siblings, 2 replies; 35+ messages in thread
From: Arne Georg Gleditsch @ 2007-07-03 15:26 UTC (permalink / raw)
  To: Florian Attenberger; +Cc: linux-kernel

Florian Attenberger <valdyn@gmail.com> writes:
> yep, controlled by ntpd.
> You're right according to
> ftp://hpiers.obspm.fr/iers/bul/bulc/bulletinc.33
> that event shouldn't have been there.

I'm not all that versed in ntp-ish, but it appears that the leap
second insertion should be propagated through the NTP protocol.
Whether the leap second in question came from an NTP server giving out
wrong data or from a misinterpretation or bug in ntpd is of course
hard to say, but either way turning the clock back is unlikely to
reconstruct the circumstances.  An interesting exercise might be to
code up a small program to call adjtimex with timex.status |= STA_INS,
to see if this can trigger the problem.  (The bogus leap second might
be a red herring entirely, of course...)
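
Before setting anything, it may help to see what the kernel currently
reports.  The following is only a minimal sketch, assuming glibc's
adjtimex() wrapper from <sys/timex.h>; the helper names leap_flag_name,
time_state_name and print_leap_state are made up for illustration.  With
modes left at 0 the call only reads the state, so it should not need root.

```c
#include <stdio.h>
#include <sys/timex.h>

/* Describe the pending-leap bits of a timex status word. */
const char *leap_flag_name(int status)
{
	if ((status & STA_INS) && (status & STA_DEL))
		return "STA_INS and STA_DEL both set";
	if (status & STA_INS)
		return "STA_INS (leap second insertion pending)";
	if (status & STA_DEL)
		return "STA_DEL (leap second deletion pending)";
	return "none";
}

/* Name the clock state that adjtimex() returns on success. */
const char *time_state_name(int state)
{
	switch (state) {
	case TIME_OK:    return "TIME_OK";
	case TIME_INS:   return "TIME_INS";
	case TIME_DEL:   return "TIME_DEL";
	case TIME_OOP:   return "TIME_OOP";
	case TIME_WAIT:  return "TIME_WAIT";
	case TIME_ERROR: return "TIME_ERROR";
	default:         return "unknown";
	}
}

/* Read-only query: modes == 0 means nothing is modified. */
int print_leap_state(void)
{
	struct timex tx = { 0 };
	int state = adjtimex(&tx);

	if (state == -1) {
		perror("adjtimex");
		return -1;
	}
	printf("status: 0x%x, leap flag: %s, clock state: %s\n",
	       (unsigned int)tx.status, leap_flag_name(tx.status),
	       time_state_name(state));
	return state;
}
```

Calling print_leap_state() from main() on an affected box before midnight
UTC would show whether ntpd has actually armed STA_INS.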

-- 
								Arne.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 15:26       ` Arne Georg Gleditsch
@ 2007-07-03 15:36         ` Fortier,Vincent [Montreal]
  2007-07-03 17:19           ` Dave Jones
  2007-07-03 19:28         ` Chris Friesen
  1 sibling, 1 reply; 35+ messages in thread
From: Fortier,Vincent [Montreal] @ 2007-07-03 15:36 UTC (permalink / raw)
  To: Arne Georg Gleditsch, Florian Attenberger; +Cc: linux-kernel

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On behalf of 
> Arne Georg Gleditsch
> 
> Florian Attenberger <valdyn@gmail.com> writes:
> > yep, controlled by ntpd.
> > You're right according to
> > ftp://hpiers.obspm.fr/iers/bul/bulc/bulletinc.33
> > that event shouldn't have been there.
> 
> I'm not all that versed in ntp-ish, but it appears that the 
> leap second insertion should be propagated through the NTP protocol.
> Whether the leap second in question came from an NTP server 
> giving out wrong data or from a misinterpretation or bug in 
> ntpd is of course hard to say, but either way turning the 
> clock back is unlikely to reconstruct the circumstances.  An 
> interesting exercise might be to code up a small program to 
> call adjtimex with timex.status |= STA_INS, to see if this 
> can trigger the problem.  (The bogus leap second might be a 
> red herring entirely, of course...)

You are probably right; I did try to reproduce the problem, without
success...

Although it is weird that it happened only on 2.6.21 kernels... It did
not happen on any of my workstations/servers running either 2.6.18 or
2.6.20.

Could dynticks be involved?

- vin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 12:44 2.6.21.5 june 30th to july 1st date hang? Fortier,Vincent [Montreal]
                   ` (3 preceding siblings ...)
  2007-07-03 13:59 ` Florian Attenberger
@ 2007-07-03 15:59 ` Chris Friesen
  2007-07-03 16:00   ` Fortier,Vincent [Montreal]
  4 siblings, 1 reply; 35+ messages in thread
From: Chris Friesen @ 2007-07-03 15:59 UTC (permalink / raw)
  To: Fortier,Vincent [Montreal]; +Cc: Linux Kernel Mailing List

Fortier,Vincent [Montreal] wrote:
> Hi all,
> 
> All my servers and workstations running a 2.6.21.5 kernel hung exactly
> when the date shifted from June 30th to July 1st.

Interesting, I just sent out an email for a similar issue, but with a pair of 
2.6.10 machines.

I'm wondering if it's related to a spurious leap second event...

Chris

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 15:59 ` Chris Friesen
@ 2007-07-03 16:00   ` Fortier,Vincent [Montreal]
  2007-07-03 16:03     ` Chris Friesen
  0 siblings, 1 reply; 35+ messages in thread
From: Fortier,Vincent [Montreal] @ 2007-07-03 16:00 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Linux Kernel Mailing List

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On behalf of 
> Chris Friesen
> 
> Fortier,Vincent [Montreal] wrote:
> > Hi all,
> > 
> > All my servers and workstations running a 2.6.21.5 kernel hung 
> > exactly when the date shifted from June 30th to July 1st.
> 
> Interesting, I just sent out an email for a similar issue, 
> but with a pair of 2.6.10 machines.
> 
> I'm wondering if it's related to a spurious leap second event...
> 

Just wondering, what is your distribution?

- vin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 16:00   ` Fortier,Vincent [Montreal]
@ 2007-07-03 16:03     ` Chris Friesen
  2007-07-03 17:28       ` Chris Friesen
  0 siblings, 1 reply; 35+ messages in thread
From: Chris Friesen @ 2007-07-03 16:03 UTC (permalink / raw)
  To: Fortier,Vincent [Montreal]; +Cc: Linux Kernel Mailing List

Fortier,Vincent [Montreal] wrote:
>>-----Original Message-----
>>From: linux-kernel-owner@vger.kernel.org 
>>[mailto:linux-kernel-owner@vger.kernel.org] On behalf of 
>>Chris Friesen

>>Interesting, I just sent out an email for a similar issue, 
>>but with a pair of 2.6.10 machines.
>>
>>I'm wondering if it's related to a spurious leap second event...

> Just wondering, what is your distribution?

We're based off a WindRiver PNE-LE distribution.

Chris

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 15:36         ` Fortier,Vincent [Montreal]
@ 2007-07-03 17:19           ` Dave Jones
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Jones @ 2007-07-03 17:19 UTC (permalink / raw)
  To: Fortier,Vincent [Montreal]
  Cc: Arne Georg Gleditsch, Florian Attenberger, linux-kernel

On Tue, Jul 03, 2007 at 11:36:42AM -0400, Fortier,Vincent [Montreal] wrote:
 > > -----Original Message-----
 > > From: linux-kernel-owner@vger.kernel.org 
 > > [mailto:linux-kernel-owner@vger.kernel.org] On behalf of 
 > > Arne Georg Gleditsch
 > > 
 > > Florian Attenberger <valdyn@gmail.com> writes:
 > > > yep, controlled by ntpd.
 > > > You're right according to
 > > > ftp://hpiers.obspm.fr/iers/bul/bulc/bulletinc.33
 > > > that event shouldn't have been there.
 > > 
 > > I'm not all that versed in ntp-ish, but it appears that the 
 > > leap second insertion should be propagated through the ntp protocol.
 > > Whether the leap second in question came from a ntp server 
 > > giving out wrong data or from a misinterpretation or bug in 
 > > ntpd is of course hard to say, but either way turning the 
 > > clock back is unlikely to reconstruct the circumstances.  An 
 > > interesting exercise might be to code up a small program to 
 > > call adjtimex with timex.status |= STA_INS, to see if this 
 > > can trigger the problem.  (The bogus leap second might be a 
 > > red herring entirely, of course...)
 > 
 > You are probably right; I did try to reproduce the problem, without
 > success...
 > 
 > Although it is weird that it happened only on 2.6.21 kernels... It did
 > not happen on any of my workstations/servers running either 2.6.18 or
 > 2.6.20.
 > 
 > Could dynticks be involved?

I saw it on a box that happened to have lockdep enabled.
(I run it everywhere thankfully).  This is what it looked like..
http://www.codemonkey.org.uk/junk/img_0421.jpg

	Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 16:03     ` Chris Friesen
@ 2007-07-03 17:28       ` Chris Friesen
  0 siblings, 0 replies; 35+ messages in thread
From: Chris Friesen @ 2007-07-03 17:28 UTC (permalink / raw)
  To: Fortier,Vincent [Montreal]; +Cc: Linux Kernel Mailing List

Some more information....

I'm trying to get a console on the affected system to query the leap second info
from the ntp servers.

However, just for kicks I queried the local servers for my desktop following the
instructions that I found on a thread about spurious leap second notifications.
Interestingly, two of the associations show non-zero leap values...

Chris



[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -cas zcars0vr

ind assID status  conf reach auth condition  last_event cnt
===========================================================
   1 62124  b614   yes   yes  none  sys.peer   reachable  1
   2 62125  b4f4   yes   yes  none  candidat   reachable 15
   3 62126  b314   yes   yes  none   outlyer   reachable  1
   4 62127  b314   yes   yes  none   outlyer   reachable  1
   5 62128  8000   yes   yes  none    reject
   6 62129  b434   yes   yes  none  candidat   reachable  3
   7 62130  b424   yes   yes  none  candidat   reachable  2
   8 62131  a0f3   yes   yes  none    reject  lost reach 15


[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62124 leap" zcars0vr
assID=62124 status=b614 reach, conf, sel_sys.peer, 1 event, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62125 leap" zcars0vr
assID=62125 status=b0f4 reach, conf, 15 events, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62126 leap" zcars0vr
assID=62126 status=b314 reach, conf, sel_outlyer, 1 event, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62127 leap" zcars0vr
assID=62127 status=b414 reach, conf, sel_candidat, 1 event, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62128 leap" zcars0vr
assID=62128 status=8000 unreach, conf, no events,
leap=11
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62129 leap" zcars0vr
assID=62129 status=b434 reach, conf, sel_candidat, 3 events, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62130 leap" zcars0vr
assID=62130 status=b424 reach, conf, sel_candidat, 2 events, event_reach,
leap=00
[cfriesen@wcary14e ~]$ /usr/sbin/ntpq -c"rv 62131 leap" zcars0vr
assID=62131 status=a0f3 unreach, conf, 15 events, event_unreach,
leap=11
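
For reading the leap= values above: in the NTP packet format the leap
indicator is a two-bit field, and 11 is the alarm condition (clock not
synchronized) rather than a leap second announcement.  Both associations
showing leap=11 here are also the ones reported as unreach.  A small
decoding sketch follows; ntp_leap_meaning is a hypothetical helper, not
part of ntpq, and it assumes the field is passed as the two-character
string that ntpq prints.

```c
#include <string.h>

/* Map ntpq's two-bit "leap=" field to its meaning in the NTP protocol. */
const char *ntp_leap_meaning(const char *bits)
{
	if (strcmp(bits, "00") == 0)
		return "no warning";
	if (strcmp(bits, "01") == 0)
		return "last minute of the day has 61 seconds";
	if (strcmp(bits, "10") == 0)
		return "last minute of the day has 59 seconds";
	if (strcmp(bits, "11") == 0)
		return "alarm: clock not synchronized";
	return "unrecognized leap field";
}
```

A peer that was really announcing an inserted leap second would show
leap=01, so that is the value to look for on the affected systems.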

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 15:26       ` Arne Georg Gleditsch
  2007-07-03 15:36         ` Fortier,Vincent [Montreal]
@ 2007-07-03 19:28         ` Chris Friesen
  2007-07-03 21:02           ` Chris Friesen
  2007-07-03 21:02           ` Chuck Ebbert
  1 sibling, 2 replies; 35+ messages in thread
From: Chris Friesen @ 2007-07-03 19:28 UTC (permalink / raw)
  To: Arne Georg Gleditsch; +Cc: Florian Attenberger, linux-kernel

Arne Georg Gleditsch wrote:

> An interesting exercise might be to
> code up a small program to call adjtimex with timex.status |= STA_INS,
> to see if this can trigger the problem.

Setting the date to just before midnight June 30 UTC and then running the 
following as root triggered the crash on a modified 2.6.10.  Anyone see anything 
wrong with the code below, or is this a valid indication of a bug in the leap 
second code?

Chris


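#include <sys/timex.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

/* File scope, so buf starts zeroed: modes == 0 makes the first call read-only. */
struct timex buf;

int main(void)
{
	/* Read the current kernel time status without changing anything. */
	int rc = adjtimex(&buf);
	if (rc == -1) {
		printf("unable to read status: %s\n", strerror(errno));
		return 1;
	}
	printf("initial status: 0x%x\n", buf.status);

	/* Ask the kernel to insert a leap second at the end of the UTC day. */
	buf.status |= STA_INS;
	buf.modes = ADJ_STATUS;
	rc = adjtimex(&buf);
	if (rc == -1) {
		/* Typically EPERM when not run as root. */
		printf("unable to set status: %s\n", strerror(errno));
		return 1;
	}
	printf("rc: %d\n", rc);
	printf("final status: 0x%x\n", buf.status);
	return 0;
}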
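
Since the program above leaves STA_INS set, a companion that clears the
flag again may be handy on a machine that is only being inspected, or
that survives the test.  This is a sketch under the same assumptions
(glibc's adjtimex(), run as root); without_leap_flags and
clear_leap_flags are hypothetical names, not an existing interface.

```c
#include <sys/timex.h>

/* Return the status word with both pending-leap flags removed. */
int without_leap_flags(int status)
{
	return status & ~(STA_INS | STA_DEL);
}

/*
 * Clear any pending leap second in the kernel.
 * Returns the adjtimex() clock state, or -1 on error (errno is set,
 * typically EPERM when not run as root).
 */
int clear_leap_flags(void)
{
	struct timex tx = { 0 };

	/* First call is read-only (modes == 0) and fetches the status. */
	if (adjtimex(&tx) == -1)
		return -1;

	tx.status = without_leap_flags(tx.status);
	tx.modes = ADJ_STATUS;	/* write back only the status word */
	return adjtimex(&tx);
}
```

Note that a running ntpd may set the flag again on its next update, so
this only helps while NTP is stopped or once the servers stop
announcing the leap second.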

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 19:28         ` Chris Friesen
@ 2007-07-03 21:02           ` Chris Friesen
  2007-07-03 21:02           ` Chuck Ebbert
  1 sibling, 0 replies; 35+ messages in thread
From: Chris Friesen @ 2007-07-03 21:02 UTC (permalink / raw)
  To: Arne Georg Gleditsch; +Cc: Florian Attenberger, linux-kernel

Chris Friesen wrote:

> Setting the date to just before midnight June 30 UTC and then running 
> the following as root triggered the crash on a modified 2.6.10.  Anyone 
> see anything wrong with the code below, or is this a valid indication of 
> a bug in the leap second code?


As a further data point, the test app triggers problems on x86 uniprocessor and 
SMP as well as arm uniprocessor.  On ppc64 we see the leap second being added, 
but it doesn't hang, while on ppc we don't even see the leap second being 
added--leading me to wonder if the leap second code even works for ppc32.

The above is all for modified 2.6.10.

Chris

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 19:28         ` Chris Friesen
  2007-07-03 21:02           ` Chris Friesen
@ 2007-07-03 21:02           ` Chuck Ebbert
  2007-07-04  1:06             ` Fortier,Vincent [Montreal]
  2007-07-04  8:56             ` Uli Luckas
  1 sibling, 2 replies; 35+ messages in thread
From: Chuck Ebbert @ 2007-07-03 21:02 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Arne Georg Gleditsch, Florian Attenberger, linux-kernel

On 07/03/2007 03:28 PM, Chris Friesen wrote:
> Arne Georg Gleditsch wrote:
> 
>> An interesting exercise might be to
>> code up a small program to call adjtimex with timex.status |= STA_INS,
>> to see if this can trigger the problem.
> 
> Setting the date to just before midnight June 30 UTC and then running
> the following as root triggered the crash on a modified 2.6.10.  Anyone
> see anything wrong with the code below, or is this a valid indication of
> a bug in the leap second code?
> 

Fixed:
http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=746976a301ac9c9aa10d7d42454f8d6cdad8ff2b


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 21:02           ` Chuck Ebbert
@ 2007-07-04  1:06             ` Fortier,Vincent [Montreal]
  2007-07-04  8:56             ` Uli Luckas
  1 sibling, 0 replies; 35+ messages in thread
From: Fortier,Vincent [Montreal] @ 2007-07-04  1:06 UTC (permalink / raw)
  To: Chuck Ebbert, Chris Friesen
  Cc: Arne Georg Gleditsch, Florian Attenberger, linux-kernel

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On behalf of Chuck Ebbert
> Sent: 3 July 2007 17:03
> 
> On 07/03/2007 03:28 PM, Chris Friesen wrote:
> > Arne Georg Gleditsch wrote:
> > 
> >> An interesting exercise might be to
> >> code up a small program to call adjtimex with timex.status |= 
> >> STA_INS, to see if this can trigger the problem.
> > 
> > Setting the date to just before midnight June 30 UTC and then running 
> > the following as root triggered the crash on a modified 2.6.10.  
> > Anyone see anything wrong with the code below, or is this a valid 
> > indication of a bug in the leap second code?
> > 
> 
> Fixed:
> http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=746976a301ac9c9aa10d7d42454f8d6cdad8ff2b

Thanks a lot!  This was fast! (Beat that, closed source!)

- vin


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-03 21:02           ` Chuck Ebbert
  2007-07-04  1:06             ` Fortier,Vincent [Montreal]
@ 2007-07-04  8:56             ` Uli Luckas
  2007-07-04 16:53               ` Chris Wright
  1 sibling, 1 reply; 35+ messages in thread
From: Uli Luckas @ 2007-07-04  8:56 UTC (permalink / raw)
  To: Chris Wright; +Cc: LKML

On Tuesday, 3. July 2007, Chuck Ebbert wrote:
> On 07/03/2007 03:28 PM, Chris Friesen wrote:
> > Arne Georg Gleditsch wrote:
> >> An interesting exercise might be to
> >> code up a small program to call adjtimex with timex.status |= STA_INS,
> >> to see if this can trigger the problem.
> >
> > Setting the date to just before midnight June 30 UTC and then running
> > the following as root triggered the crash on a modified 2.6.10.  Anyone
> > see anything wrong with the code below, or is this a valid indication of
> > a bug in the leap second code?
>
> Fixed:
> http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=746976a301ac9c9aa10d7d42454f8d6cdad8ff2b
>
Hi Chris,
does that qualify for inclusion into 2.6.21.6?

regards,
Uli

-- 

------- ROAD ...the handyPC Company - - -  ) ) )

Uli Luckas
Software Development

ROAD GmbH
Bennigsenstr. 14 | 12159 Berlin | Germany
fon: +49 (30) 230069 - 64 | fax: +49 (30) 230069 - 69
url: www.road.de

Amtsgericht Charlottenburg: HRB 96688 B
Managing directors: Hans-Peter Constien, Hubertus von Streit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-04  8:56             ` Uli Luckas
@ 2007-07-04 16:53               ` Chris Wright
  2007-07-05 14:13                 ` Clemens Koller
  0 siblings, 1 reply; 35+ messages in thread
From: Chris Wright @ 2007-07-04 16:53 UTC (permalink / raw)
  To: Uli Luckas; +Cc: Chris Wright, LKML

* Uli Luckas (u.luckas@road.de) wrote:
> On Tuesday, 3. July 2007, Chuck Ebbert wrote:
> > On 07/03/2007 03:28 PM, Chris Friesen wrote:
> > > Arne Georg Gleditsch wrote:
> > >> An interesting exercise might be to
> > >> code up a small program to call adjtimex with timex.status |= STA_INS,
> > >> to see if this can trigger the problem.
> > >
> > > Setting the date to just before midnight June 30 UTC and then running
> > > the following as root triggered the crash on a modified 2.6.10.  Anyone
> > > see anything wrong with the code below, or is this a valid indication of
> > > a bug in the leap second code?
> >
> > Fixed:
> > http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=746976a301ac9c9aa10d7d42454f8d6cdad8ff2b
> >
> Hi Chris,
> does that qualify for inclusion into 2.6.21.6?

Yes, it has already been sent to -stable.

thanks,
-chris

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-04 16:53               ` Chris Wright
@ 2007-07-05 14:13                 ` Clemens Koller
  2007-07-05 17:48                   ` Chris Friesen
  0 siblings, 1 reply; 35+ messages in thread
From: Clemens Koller @ 2007-07-05 14:13 UTC (permalink / raw)
  To: Chris Wright; +Cc: Uli Luckas, LKML

Hello, Chris!

Chris Wright wrote:
> * Uli Luckas (u.luckas@road.de) wrote:
>> On Tuesday, 3. July 2007, Chuck Ebbert wrote:
>>> On 07/03/2007 03:28 PM, Chris Friesen wrote:
>>>> Arne Georg Gleditsch wrote:
>>>>> An interesting exercise might be to
>>>>> code up a small program to call adjtimex with timex.status |= STA_INS,
>>>>> to see if this can trigger the problem.
>>>> Setting the date to just before midnight June 30 UTC and then running
>>>> the following as root triggered the crash on a modified 2.6.10.  Anyone
>>>> see anything wrong with the code below, or is this a valid indication of
>>>> a bug in the leap second code?
>>> Fixed:
>>> http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=746976a301ac9c9aa10d7d42454f8d6cdad8ff2b
>>>
>> Hi Chris,
>> does that qualify for inclusion into 2.6.21.6?
> 
> Yes, it has already been sent to -stable.

Okay, we all survived Y2K and this little glitch. Phew! ;-)
Can you please explain in which configuration this problem got triggered?

Does it make sense to have some testing environments which have the date
set to about one month in the future to catch any crashes like that,
preventing machines in production from failing?!

Best regards,
-- 
Clemens Koller
__________________________________
R&D Imaging Devices
Anagramm GmbH
Rupert-Mayer-Straße 45/1
Linhof Werksgelände
D-81379 München
Tel.089-741518-50
Fax 089-741518-19
http://www.anagramm-technology.com


* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-05 14:13                 ` Clemens Koller
@ 2007-07-05 17:48                   ` Chris Friesen
  2007-07-05 18:34                     ` Clemens Koller
                                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Chris Friesen @ 2007-07-05 17:48 UTC (permalink / raw)
  To: Clemens Koller; +Cc: Chris Wright, Uli Luckas, LKML

Clemens Koller wrote:

> Okay, we all survived Y2K and this little glitch. Phew! ;-)
> Can you please explain in which configuration this problem got triggered?

As far as I can tell, many kernel versions contained the source code bug.
(I'd like some more information on exactly what the problem was, if
anyone cares to share... the proposed patch didn't give much in the way of
specifics.)

However, in order to trigger the problem you also need to have NTP 
servers that were erroneously broadcasting the addition of a leap second.

So most people didn't see the issue because there wasn't supposed to be 
a leap second added this year...but they would have seen it the next 
time a leap second was added.

Chris


* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-05 17:48                   ` Chris Friesen
@ 2007-07-05 18:34                     ` Clemens Koller
  2007-07-05 20:10                     ` Thomas Gleixner
  2007-07-05 22:28                     ` Ernie Petrides
  2 siblings, 0 replies; 35+ messages in thread
From: Clemens Koller @ 2007-07-05 18:34 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Chris Wright, Uli Luckas, LKML

Chris Friesen wrote:
> However, in order to trigger the problem you also need to have NTP 
> servers that were erroneously broadcasting the addition of a leap second.

No matter what NTP servers send, it shouldn't result in a DoS.

> So most people didn't see the issue because there wasn't supposed to be 
> a leap second added this year...but they would have seen it the next 
> time a leap second was added.

True. It seems like we will have another one next year.

Regards,
-- 
Clemens Koller
__________________________________
R&D Imaging Devices
Anagramm GmbH
Rupert-Mayer-Straße 45/1
Linhof Werksgelände
D-81379 München
Tel.089-741518-50
Fax 089-741518-19
http://www.anagramm-technology.com


* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-05 17:48                   ` Chris Friesen
  2007-07-05 18:34                     ` Clemens Koller
@ 2007-07-05 20:10                     ` Thomas Gleixner
  2007-07-05 21:02                       ` Chris Friesen
  2007-07-05 22:28                     ` Ernie Petrides
  2 siblings, 1 reply; 35+ messages in thread
From: Thomas Gleixner @ 2007-07-05 20:10 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Clemens Koller, Chris Wright, Uli Luckas, LKML

On Thu, 2007-07-05 at 11:48 -0600, Chris Friesen wrote:
> Clemens Koller wrote:
> 
> > Okay, we all survived Y2K and this little glitch. Phew! ;-)
> > Can you please explain in which configuration this problem got triggered?
> 
> As far as I can tell, many kernel versions contained the source code bug.
> (I'd like some more information on exactly what the problem was, if
> anyone cares to share... the proposed patch didn't give much in the way of
> specifics.)

It only happens with CONFIG_HIGH_RES_TIMERS=y; otherwise clock_was_set()
is a NOP. So only 2.6.21 kernels, and only on i386 and ARM, are affected.

	tglx




* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-05 20:10                     ` Thomas Gleixner
@ 2007-07-05 21:02                       ` Chris Friesen
  2007-07-05 21:17                         ` Thomas Gleixner
  0 siblings, 1 reply; 35+ messages in thread
From: Chris Friesen @ 2007-07-05 21:02 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Clemens Koller, Chris Wright, Uli Luckas, LKML

Thomas Gleixner wrote:

> It only happens with CONFIG_HIGH_RES_TIMERS=y; otherwise clock_was_set()
> is a NOP. So only 2.6.21 kernels, and only on i386 and ARM, are affected.

Are you certain?

Vanilla 2.6.10 shows a clock_was_set() function.  Does it just not call 
the dangerous code or something?

Also, our modified 2.6.10 has the high res timers patch applied, but the 
config option is turned off and we were still affected.

Chris


* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-05 21:02                       ` Chris Friesen
@ 2007-07-05 21:17                         ` Thomas Gleixner
  0 siblings, 0 replies; 35+ messages in thread
From: Thomas Gleixner @ 2007-07-05 21:17 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Clemens Koller, Chris Wright, Uli Luckas, LKML

On Thu, 2007-07-05 at 15:02 -0600, Chris Friesen wrote:
> Thomas Gleixner wrote:
> 
> > It only happens with CONFIG_HIGH_RES_TIMERS=y; otherwise clock_was_set()
> > is a NOP. So only 2.6.21 kernels, and only on i386 and ARM, are affected.
> 
> Are you certain?

At least for anything >= 2.6.16

> Vanilla 2.6.10 shows a clock_was_set() function.  Does it just not call 
> the dangerous code or something?

Ouch, the old posix timer code might be affected as well, but I did not
look.

> Also, our modified 2.6.10 has the high res timers patch applied, but the 
> config option is turned off and we were still affected.

You mean Anzinger's high-res patches? No idea about those.

	tglx




* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-05 17:48                   ` Chris Friesen
  2007-07-05 18:34                     ` Clemens Koller
  2007-07-05 20:10                     ` Thomas Gleixner
@ 2007-07-05 22:28                     ` Ernie Petrides
  2007-07-05 22:49                       ` Chris Friesen
  2 siblings, 1 reply; 35+ messages in thread
From: Ernie Petrides @ 2007-07-05 22:28 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Clemens Koller, Chris Wright, Uli Luckas, LKML

On Thursday, 5-Jul-2007 at 11:48 MDT, "Chris Friesen" wrote:

> Clemens Koller wrote:
> 
> > Okay, we all survived Y2K and this little glitch. Phew! ;-)
> > Can you please explain in which configuration this problem got triggered?
> 
> As far as I can tell, many kernel versions contained the source code bug.
> (I'd like some more information on exactly what the problem was, if
> anyone cares to share... the proposed patch didn't give much in the way of
> specifics.)
> 
> However, in order to trigger the problem you also need to have NTP 
> servers that were erroneously broadcasting the addition of a leap second.
> 
> So most people didn't see the issue because there wasn't supposed to be 
> a leap second added this year...but they would have seen it the next 
> time a leap second was added.

Only kernels built with the CONFIG_HIGH_RES_TIMERS option enabled were
vulnerable.

Cheers.  -ernie


* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-05 22:28                     ` Ernie Petrides
@ 2007-07-05 22:49                       ` Chris Friesen
  2007-07-05 23:12                         ` Ernie Petrides
  0 siblings, 1 reply; 35+ messages in thread
From: Chris Friesen @ 2007-07-05 22:49 UTC (permalink / raw)
  To: Ernie Petrides; +Cc: Clemens Koller, Chris Wright, Uli Luckas, LKML

Ernie Petrides wrote:

> Only kernels built with the CONFIG_HIGH_RES_TIMERS option enabled were
> vulnerable.

As I mentioned in my post to Thomas, we have high res timers disabled 
and were still affected.  Granted, our kernel has been modified so it is 
possible that vanilla would not be affected....I haven't tested it.

Chris


* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-05 22:49                       ` Chris Friesen
@ 2007-07-05 23:12                         ` Ernie Petrides
  2007-07-05 23:45                           ` Chris Friesen
  2007-07-06  5:17                           ` Thomas Gleixner
  0 siblings, 2 replies; 35+ messages in thread
From: Ernie Petrides @ 2007-07-05 23:12 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Thomas Gleixner, Clemens Koller, Chris Wright, Uli Luckas, LKML

On Thursday, 5-Jul-2007 at 16:49 MDT, Chris Friesen wrote:

> Ernie Petrides wrote:
> 
> > Only kernels built with the CONFIG_HIGH_RES_TIMERS option enabled were
> > vulnerable.
> 
> As I mentioned in my post to Thomas, we have high res timers disabled 
> and were still affected.  Granted, our kernel has been modified so it is 
> possible that vanilla would not be affected....I haven't tested it.
> 
> Chris

That's odd, because Thomas's patch removed two calls to clock_was_set(),
which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
the 2.6.21 source tree).

Also, I personally tested with the reproducer you posted here, initially
on a box running 2.6.22-rc4, and there were no problems (but I'm not sure
what config options were enabled on that kernel).  I did reproduce the
problem on a stock 2.6.21 kernel with CONFIG_HIGH_RES_TIMERS enabled.

Cheers.  -ernie


* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-05 23:12                         ` Ernie Petrides
@ 2007-07-05 23:45                           ` Chris Friesen
  2007-07-06  5:16                             ` Thomas Gleixner
  2007-07-06  5:17                           ` Thomas Gleixner
  1 sibling, 1 reply; 35+ messages in thread
From: Chris Friesen @ 2007-07-05 23:45 UTC (permalink / raw)
  To: Ernie Petrides
  Cc: Thomas Gleixner, Clemens Koller, Chris Wright, Uli Luckas, LKML

Ernie Petrides wrote:

> That's odd, because Thomas's patch removed two calls to clock_was_set(),
> which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
> the 2.6.21 source tree).

I'm using a modified 2.6.10 tree...I expect the timer code is different.

Chris


* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-05 23:45                           ` Chris Friesen
@ 2007-07-06  5:16                             ` Thomas Gleixner
  0 siblings, 0 replies; 35+ messages in thread
From: Thomas Gleixner @ 2007-07-06  5:16 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Ernie Petrides, Clemens Koller, Chris Wright, Uli Luckas, LKML

On Thu, 2007-07-05 at 17:45 -0600, Chris Friesen wrote:
> Ernie Petrides wrote:
> 
> > That's odd, because Thomas's patch removed two calls to clock_was_set(),
> > which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
> > the 2.6.21 source tree).
> 
> I'm using a modified 2.6.10 tree...I expect the timer code is different.

Way different and you have extra patches on top.

	tglx




* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-05 23:12                         ` Ernie Petrides
  2007-07-05 23:45                           ` Chris Friesen
@ 2007-07-06  5:17                           ` Thomas Gleixner
  2007-07-06 15:47                             ` Chris Friesen
  2007-07-06 20:03                             ` Ernie Petrides
  1 sibling, 2 replies; 35+ messages in thread
From: Thomas Gleixner @ 2007-07-06  5:17 UTC (permalink / raw)
  To: Ernie Petrides
  Cc: Chris Friesen, Clemens Koller, Chris Wright, Uli Luckas, LKML

On Thu, 2007-07-05 at 19:12 -0400, Ernie Petrides wrote:
> On Thursday, 5-Jul-2007 at 16:49 MDT, Chris Friesen wrote:
> 
> > Ernie Petrides wrote:
> > 
> > > Only kernels built with the CONFIG_HIGH_RES_TIMERS option enabled were
> > > vulnerable.
> > 
> > As I mentioned in my post to Thomas, we have high res timers disabled 
> > and were still affected.  Granted, our kernel has been modified so it is 
> > possible that vanilla would not be affected....I haven't tested it.
> > 
> > Chris
> 
> That's odd, because Thomas's patch removed two calls to clock_was_set(),
> which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
> the 2.6.21 source tree).
> 
> Also, I personally tested with the reproducer you posted here, initially
> on a box running 2.6.22-rc4, and there were no problems (but I'm not sure
> what config options were enabled on that kernel).  I did reproduce the
> problem on a stock 2.6.21 kernel with CONFIG_HIGH_RES_TIMERS enabled.

It needs a running smp_call_function() to be interrupted by the timer
interrupt, which calls clock_was_set(). So it's not that easy to
reproduce.

	tglx




* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-06  5:17                           ` Thomas Gleixner
@ 2007-07-06 15:47                             ` Chris Friesen
  2007-07-06 20:03                             ` Ernie Petrides
  1 sibling, 0 replies; 35+ messages in thread
From: Chris Friesen @ 2007-07-06 15:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ernie Petrides, Clemens Koller, Chris Wright, Uli Luckas, LKML

Thomas Gleixner wrote:

> It needs a running smp_call_function() to be interrupted by the timer
> interrupt, which calls clock_was_set(). So it's not that easy to
> reproduce.

On our 2.6.10-based kernel it's basically trivial to reproduce, and the 
posted fix doesn't solve the issue.

One of our guys is trying to track it down. As yet we don't know if it's 
the vanilla code or the patches on top that contain the bug.

Chris


* Re: 2.6.21.5 june 30th to july 1st date hang?
  2007-07-06  5:17                           ` Thomas Gleixner
  2007-07-06 15:47                             ` Chris Friesen
@ 2007-07-06 20:03                             ` Ernie Petrides
  1 sibling, 0 replies; 35+ messages in thread
From: Ernie Petrides @ 2007-07-06 20:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chris Friesen, Clemens Koller, Chris Wright, Uli Luckas, LKML

On Friday, 6-Jul-2007 at 7:17 +0200, Thomas Gleixner wrote:

> On Thu, 2007-07-05 at 19:12 -0400, Ernie Petrides wrote:
> > On Thursday, 5-Jul-2007 at 16:49 MDT, Chris Friesen wrote:
> >
> > > Ernie Petrides wrote:
> > >
> > > > Only kernels built with the CONFIG_HIGH_RES_TIMERS option enabled were
> > > > vulnerable.
> > >
> > > As I mentioned in my post to Thomas, we have high res timers disabled
> > > and were still affected.  Granted, our kernel has been modified so it is
> > > possible that vanilla would not be affected....I haven't tested it.
> > >
> > > Chris
> >
> > That's odd, because Thomas's patch removed two calls to clock_was_set(),
> > which is a no-op when CONFIG_HIGH_RES_TIMERS is not enabled (at least in
> > the 2.6.21 source tree).
> >
> > Also, I personally tested with the reproducer you posted here, initially
> > on a box running 2.6.22-rc4, and there were no problems (but I'm not sure
> > what config options were enabled on that kernel).  I did reproduce the
> > problem on a stock 2.6.21 kernel with CONFIG_HIGH_RES_TIMERS enabled.
>
> It needs a running smp_call_function() to be interrupted by the timer
> interrupt, which calls clock_was_set(). So it's not that easy to
> reproduce.

I think it's reproducible at will when CONFIG_BUG is enabled, because the
WARN_ON() on line 546 of arch/i386/kernel/smp.c fires in smp_call_function(),
causing lots of console output.  By the time on_each_cpu() later reenables
interrupts, another clock interrupt is pending, and (I think) causes a
self-deadlock on the xtime_lock in vmi_account_real_cycles().

That's my (unproven) theory, anyway.  :)

At any rate, I reproduced it twice in two tries on stock 2.6.21.

Cheers.  -ernie


end of thread, other threads:[~2007-07-06 20:05 UTC | newest]

Thread overview: 35+ messages
2007-07-03 12:44 2.6.21.5 june 30th to july 1st date hang? Fortier,Vincent [Montreal]
2007-07-03 12:55 ` Fortier,Vincent [Montreal]
2007-07-03 13:05 ` Clemens Koller
2007-07-03 14:51   ` Fortier,Vincent [Montreal]
2007-07-03 13:56 ` Uli Luckas
2007-07-03 13:59 ` Florian Attenberger
2007-07-03 14:20   ` Arne Georg Gleditsch
2007-07-03 15:02     ` Florian Attenberger
2007-07-03 15:26       ` Arne Georg Gleditsch
2007-07-03 15:36         ` Fortier,Vincent [Montreal]
2007-07-03 17:19           ` Dave Jones
2007-07-03 19:28         ` Chris Friesen
2007-07-03 21:02           ` Chris Friesen
2007-07-03 21:02           ` Chuck Ebbert
2007-07-04  1:06             ` Fortier,Vincent [Montreal]
2007-07-04  8:56             ` Uli Luckas
2007-07-04 16:53               ` Chris Wright
2007-07-05 14:13                 ` Clemens Koller
2007-07-05 17:48                   ` Chris Friesen
2007-07-05 18:34                     ` Clemens Koller
2007-07-05 20:10                     ` Thomas Gleixner
2007-07-05 21:02                       ` Chris Friesen
2007-07-05 21:17                         ` Thomas Gleixner
2007-07-05 22:28                     ` Ernie Petrides
2007-07-05 22:49                       ` Chris Friesen
2007-07-05 23:12                         ` Ernie Petrides
2007-07-05 23:45                           ` Chris Friesen
2007-07-06  5:16                             ` Thomas Gleixner
2007-07-06  5:17                           ` Thomas Gleixner
2007-07-06 15:47                             ` Chris Friesen
2007-07-06 20:03                             ` Ernie Petrides
2007-07-03 15:59 ` Chris Friesen
2007-07-03 16:00   ` Fortier,Vincent [Montreal]
2007-07-03 16:03     ` Chris Friesen
2007-07-03 17:28       ` Chris Friesen
