From: Matt Mackall <>
To: Chris Rankin <>
Cc: Mark Rustad <>, Alan <>,
Subject: Re: 2.6.18-stable release plans?
Date: Thu, 25 Jan 2007 15:04:29 -0600
Message-ID: <>
In-Reply-To: <20070124231153.6bc04063@localhost.localdomain>

On Wed, Jan 24, 2007 at 11:11:53PM +0000, Alan wrote:
> > I am going to assume that you are being facetious, because it would be the rarified pinnacle of
> > supreme arrogance to suggest that a cosmic ray event is a more likely explanation than a bug in
> > the kernel.
> A one off non repeatable error experienced by two people out of the
> millions using it does fit the cosmic ray description quite well. That's
> not to say there isn't a bug, but you don't have enough data to even
> begin debugging it unless it's rather more reproducible.

The soft error rate (cosmic rays, alpha decay, etc.) for modern memory
at sea level is estimated to be somewhere around 1000 - 5000
FIT/Mbit[1]. FIT is Failures in Time - errors per billion hours of
use. If you've got 1GB of memory, you've got 8000Mbits. So you'd
expect 8M - 40M errors per billion hours on your machine. Or 8 to 40
errors per 1000 hours. That's about one single-bit error per week to
one per day.
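
The arithmetic above can be checked with a few lines of Python. This is
only a back-of-the-envelope sketch using the figures quoted in this
message (1000 - 5000 FIT/Mbit, 8000 Mbit per GB); the function names are
made up for illustration and the real rate depends on the memory part,
altitude and so on.

```python
# Rough soft error estimate from the FIT figures quoted in the message.
# FIT = failures per billion (1e9) hours of use.
HOURS_PER_FIT_UNIT = 1e9
MBITS_PER_GB = 8000  # 1GB ~ 8000 Mbit, the approximation used in the text


def errors_per_hour(fit_per_mbit, mbits):
    """Expected soft errors per hour for `mbits` megabits of memory."""
    return fit_per_mbit * mbits / HOURS_PER_FIT_UNIT


def hours_between_errors(fit_per_mbit, mbits):
    """Mean time between soft errors, in hours."""
    return 1.0 / errors_per_hour(fit_per_mbit, mbits)


if __name__ == "__main__":
    for fit in (1000, 5000):
        hours = hours_between_errors(fit, MBITS_PER_GB)
        print(f"{fit} FIT/Mbit: one error every {hours:.0f} hours "
              f"({hours / 24:.1f} days)")
```

At 1000 FIT/Mbit this works out to one error every 125 hours (a bit over
five days), and at 5000 FIT/Mbit to one every 25 hours, which is where
the "one per week to one per day" range comes from.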

Yes, that's a lot. Can it really be that high? Big supercomputer
installations actually measure it in errors per day or hour.

Most of these errors will go completely unnoticed because they happen
in data structures that aren't revisited (stale cache, unused code,
empty memory). The remainder will often look like random disk read or
write errors or random application bugs/crashes. Sound familiar? That's why
people buy ECC memory.

Now if we say that 10% of that 1GB of RAM (~100MB) is kernel code/data
(not including page cache) and that, say, 1-10% of errors trigger
BUG/WARN code, we'll see these bug messages once every 100 days to
once every 1000 weeks (per GB per user).
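
The same estimate as a small Python sketch, in case anyone wants to plug
in their own guesses. The 10% kernel share and the 1-10% trigger
fraction are the assumptions stated above, not measurements, and the
function name is hypothetical.

```python
# Rough estimate of how often a soft error would surface as a kernel
# BUG/WARN message on a 1GB machine. All fractions are guesses taken
# from the text, not measured values.
HOURS_PER_FIT_UNIT = 1e9  # FIT = failures per billion hours
MBITS_PER_GB = 8000       # 1GB ~ 8000 Mbit


def days_between_reports(fit_per_mbit, kernel_fraction, trigger_fraction,
                         mbits=MBITS_PER_GB):
    """Mean days between soft errors that trip a BUG/WARN check."""
    errors_per_hour = fit_per_mbit * mbits / HOURS_PER_FIT_UNIT
    reports_per_hour = errors_per_hour * kernel_fraction * trigger_fraction
    return 1.0 / reports_per_hour / 24.0


if __name__ == "__main__":
    # High error rate, 10% of kernel hits trigger a check.
    frequent = days_between_reports(5000, 0.10, 0.10)
    # Low error rate, 1% of kernel hits trigger a check.
    rare = days_between_reports(1000, 0.10, 0.01)
    print(f"most frequent: once every {frequent:.0f} days")
    print(f"least frequent: once every {rare / 7:.0f} weeks")
```

With these inputs the frequent end comes out at roughly 104 days and the
rare end at roughly 744 weeks, so the "100 days to 1000 weeks" range is
an order-of-magnitude rounding of the same numbers.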

As for the relative error rate vs kernel bugs - there is no shortage
of Linux boxes with trouble-free uptimes much longer than the 100 days

So yes, if a user reports a bug that's attributable to a single bit
memory error that's otherwise unreproduced and unexplained, it's
totally reasonable to chalk it up to cosmic rays until some sort of
pattern of reports emerges.

As for your particular bug:

 Eeek! page_mapcount(page) went negative! (-1)
  page->flags = 14
  page->count = 0
  page->mapping = 00000000

This check occurs whenever the last mapping is removed from a page.
It's a very heavily used piece of code. The check is there as
sanity-checking from when this logic was introduced. If there were a
new bug here that could be triggered by gcc or telnet, odds are very
good that it would trigger for TONS of people.

So more likely theories are: a) pointer scribble from something
completely unrelated or b) cosmic rays. As the nearby data (flags,
count, mapping) doesn't appear to be scribbled on, (a) looks less


Mathematics is the supreme nostalgia of our time.
