LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* Re: [PATCH] mm: fix page_mkclean_one
@ 2007-02-01 22:40 Mark Groves
  2007-02-02  8:42 ` Nick Piggin
  0 siblings, 1 reply; 45+ messages in thread
From: Mark Groves @ 2007-02-01 22:40 UTC (permalink / raw)
  To: linux-kernel

Hi,


I have been been seeing a problem when using sendfile repeatedly on an
SMP server, which I believe is related to the problem that was
discovered recently with marking dirty pages. The bug, as well as a test
script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650.
Currently, we're experiencing errors where part of a previous packet is
being sent out rather than the current packet.

I have applied the patch Linus posted to a 2.6.19 kernel but am still
getting the problem. So I am wondering if there are any other places in
the kernel which mark pages as dirty which might require a similar
patch?


Regards,

Mark Groves
Researcher
University of Waterloo
Waterloo, Ontario, Canada


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2007-02-01 22:40 [PATCH] mm: fix page_mkclean_one Mark Groves
@ 2007-02-02  8:42 ` Nick Piggin
  2007-02-02 13:08   ` Evgeniy Polyakov
  0 siblings, 1 reply; 45+ messages in thread
From: Nick Piggin @ 2007-02-02  8:42 UTC (permalink / raw)
  To: Mark Groves; +Cc: linux-kernel, netdev

Mark Groves wrote:
> Hi,
> 
> 
> I have been been seeing a problem when using sendfile repeatedly on an
> SMP server, which I believe is related to the problem that was
> discovered recently with marking dirty pages. The bug, as well as a test
> script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650.
> Currently, we're experiencing errors where part of a previous packet is
> being sent out rather than the current packet.
> 
> I have applied the patch Linus posted to a 2.6.19 kernel but am still
> getting the problem. So I am wondering if there are any other places in
> the kernel which mark pages as dirty which might require a similar
> patch?

Your issue is not related, firstly because the page_mkclean bug did not
exist before 2.6.19 kernels.

Anyway, I had a look at your bugzilla test-case and managed to slim it
down to something that easily shows what the problem is (available on
request) -- the problem is that recipient of the sendfile is seeing
modifications that occur to the source file _after_ the sender has
completed the sendfile, because the file pages are not copied but
queued.

I think the usual approach to what you are trying to do is to set TCP_CORK,
then write(2) the header into the socket, then sendfile directly from the
file you want.

Another approach I guess is to implement an ack in your userland protocol
so you do not modify the sendfile source file until the client acks that
it has all the data.

I'm not sure if there are any other usual ways to do this (ie. a barrier
for sendfile, to ensure it will not pick up "future" modifications to the
file). netdev cc'ed, someone there might have additional comments.

Please close this bug if/when you are satisfied it is not a kernel problem.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2007-02-02  8:42 ` Nick Piggin
@ 2007-02-02 13:08   ` Evgeniy Polyakov
  0 siblings, 0 replies; 45+ messages in thread
From: Evgeniy Polyakov @ 2007-02-02 13:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Mark Groves, linux-kernel, netdev

On Fri, Feb 02, 2007 at 07:42:52PM +1100, Nick Piggin (nickpiggin@yahoo.com.au) wrote:
> Anyway, I had a look at your bugzilla test-case and managed to slim it
> down to something that easily shows what the problem is (available on
> request) -- the problem is that recipient of the sendfile is seeing
> modifications that occur to the source file _after_ the sender has
> completed the sendfile, because the file pages are not copied but
> queued.
> 
> I think the usual approach to what you are trying to do is to set TCP_CORK,
> then write(2) the header into the socket, then sendfile directly from the
> file you want.
> 
> Another approach I guess is to implement an ack in your userland protocol
> so you do not modify the sendfile source file until the client acks that
> it has all the data.

Mark, don't you use e1000 or other scatter-gather capable nic with
checksum offload? Likely yes.

Actual data sucking in that case happens when packet is supposed to be
transmitted by the NIC, not when sendfile() is returned. The same
applies to the case, when you have fancy egress filtering.

It is not allowed to modify pages until they are really transmitted, if
you want data integrity.

There are _no_ bugs in network or VFS cache in this test case.

> I'm not sure if there are any other usual ways to do this (ie. a barrier
> for sendfile, to ensure it will not pick up "future" modifications to the
> file). netdev cc'ed, someone there might have additional comments.
> 
> Please close this bug if/when you are satisfied it is not a kernel problem.
> 
> Thanks,
> Nick
> 
> -- 
> SUSE Labs, Novell Inc.
> Send instant messages to your online friends http://au.messenger.yahoo.com 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-29  2:50                           ` Segher Boessenkool
@ 2006-12-29  6:48                             ` Linus Torvalds
  0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-29  6:48 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh,
	linux-kernel, ranma, gordonfarquharson, akpm, a.p.zijlstra, tbm,
	arjan, andrei.popa



On Fri, 29 Dec 2006, Segher Boessenkool wrote:
>
> > I think what might be happening is that pdflush writes them out fine,
> > however we don't trap writes by the application _during_ that writeout.
> 
> Yeah.  I believe that more exactly it happens if the very last
> write to the page causes a writeback (due to dirty balancing)
> while another writeback for the page is already happening.
> 
> As usual in these cases, I have zero proof.

I actually have proof to the contrary, ie I have traces that say "the 
write was started" after the last write.

And the VM layer in this area is actually fairly sane and civilized. It 
has a bit that says "writeback in progress", and if that bit is set, it 
simply _will_not_ start a new write. It even has various BUG_ON()'s to 
that effect.

So everything I have ever seen says that the VM layer is actually doing 
everything right.

> It's the do_wp_page -> balance_dirty_pages -> generic_writepages
> path for sure.  Maybe it's enough to change
> 
>                         if (wbc->sync_mode != WB_SYNC_NONE)
>                                 wait_on_page_writeback(page);
> 
>                         if (PageWriteback(page) ||
>                             !clear_page_dirty_for_io(page)) {
>                                 unlock_page(page);
>                                 continue;
>                         }

Notive how this one basically says:

 - if it's under writeback, don't even clear the page dirty flag.

Your suggested change:

>                         if (wbc->sync_mode != WB_SYNC_NONE)
>                                 wait_on_page_writeback(page);
> 
>                         if (PageWriteback(page)) {
>                         	    redirty_page_for_writepage(wbc, page);

makes no sense, because we simply never _did_ the "clear_page_dirty()" if 
the thing was under writeback in the first place. That's how C 
conditionals work.  So there's no reason to "redirty" it, because it 
wasn't cleaned in the first place.

I've double- and triple-checked the dirty bits, including having traces 
that actually say that the IO was started (from a VM perspective) _after_ 
the last write was done. The IO just didn't hit the disk.

I'm personally fairly convinced that it's not a VM issue, but a "IO 
issue". Either in a low-level filesystem or in some of the fs/buffer.c 
helper routines.

But I'd love to be proven wrong. 

I do have a few interesting details from the trace I haven't really 
analyzed yet. Here's the trace for events on one of the pages that was 
corrupted. Note how the events are numbered (there were 171640 events 
total), so the thing you see is just a small set of events from the whole 
big trace, but it's the ones that talk about _that_ particular page.

I've grouped them so hat "consecutive" events group together. That just 
means that no events on any other pages happened in between those events, 
and it is usually a sign that it's really one single call-chain that 
causes all the events.

For example, for the first group of three events (44366-44368), it's the 
page fault that brings in the page, and since it's a write-fault, it will 
not only map the page, it will mark the page itself dirty and then also 
set the TAG_DIRTY on the mapping. So the "group" is just really a result 
of one single event happening, which causes several things to happen to 
that page. That's exactly what you'd expect.

Anyway, here is the list of events that page went through:

   44366  PG 00000f6d: mm/memory.c:2254 mapping at b789fc54 (write)
   44367  PG 00000f6d: mm/page-writeback.c:817 setting dirty
   44368  PG 00000f6d: fs/buffer.c:738 setting TAG_DIRTY

   64231  PG 00000f6d: mm/page-writeback.c:872 clean_for_io
   64232  PG 00000f6d: mm/rmap.c:451 cleaning PTE b789f000
   64233  PG 00000f6d: mm/page-writeback.c:914 set writeback
   64234  PG 00000f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
   64235  PG 00000f6d: mm/page-writeback.c:922 clearing TAG_DIRTY

   67570  PG 00000f6d: mm/page-writeback.c:891 end writeback
   67571  PG 00000f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

   76705  PG 00000f6d: mm/page-writeback.c:817 setting dirty
   76706  PG 00000f6d: fs/buffer.c:725 dirtied buffers
   76707  PG 00000f6d: fs/buffer.c:738 setting TAG_DIRTY

  105267  PG 00000f6d: mm/page-writeback.c:872 clean_for_io
  105268  PG 00000f6d: mm/rmap.c:451 cleaning PTE b789f000
  105269  PG 00000f6d: mm/page-writeback.c:914 set writeback
  105270  PG 00000f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
  105271  PG 00000f6d: mm/page-writeback.c:922 clearing TAG_DIRTY
  105272  PG 00000f6d: mm/page-writeback.c:891 end writeback
  105273  PG 00000f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

  128032  PG 00000f6d: mm/memory.c:670 unmapped at b789f000

  132662  PG 00000f6d: mm/filemap.c:119 Removing page cache

  148278  PG 00000f6d: mm/memory.c:2254 mapping at b789f000 (read)

  166326  PG 00000f6d: mm/memory.c:670 unmapped at b789f000

  171958  PG 00000f6d: mm/filemap.c:119 Removing page cache

And notice that big grouping of seven events (105267-105273). The five 
first events really _do_ make sense together: it's our page cleaning that 
happens. But notice how the "end writeback" happens _immediately_.

Here's another page cleaning event for the page that preceded that page, 
and did _not_ get corrupted:

  105262  PG 00000f6c: mm/page-writeback.c:872 clean_for_io
  105263  PG 00000f6c: mm/rmap.c:451 cleaning PTE b789e000
  105264  PG 00000f6c: mm/page-writeback.c:914 set writeback
  105265  PG 00000f6c: mm/page-writeback.c:916 setting TAG_WRITEBACK
  105266  PG 00000f6c: mm/page-writeback.c:922 clearing TAG_DIRTY

  108437  PG 00000f6c: mm/page-writeback.c:891 end writeback
  108438  PG 00000f6c: mm/page-writeback.c:893 clearing TAG_WRITEBACK

and this looks a lot more like what you'd expect: other thngs happened in 
between the "clear dirty, set writeback" stage and the "end writeback" 
stage. That's what you'd expect to see if there was actually overlapping 
IO and/or work. 

(And notice that that _was_ what you saw even for the corrupted page for 
the _first_ writeback: you saw the group-of-five that indicated a page 
cleaning event had started, and then a group-of-two to indicate that the 
writeback finished).

So I find this kind of pattern really suspicious. We have a missing 
writeout, and my traces show (I didn't analyze this _particular_ one 
closely, but I did the previous trace for another page that I posted) that 
the writeback was actually started after the write that went missing was 
done. AND I have this trace that seems to show that the writeback 
basically completed immediately, with no other work in between.

That to me says: "somebody didn't actually write out out". The VM layer 
asked the filesystem to do the write, but the filesystem just didn't do 
it. I personally think it's because some buffer-head BH_dirty bit got 
scrogged, but it could be some event that makes the filesystem simply not 
do the IO because it thinks the "disk queues are too full", so it just 
says "IO completed", without actually doing anything at all.

Now, the fact that it apparently happens for all of ext2, ext3 
and reiserfs (but NOT apparently with "data=writeback"), makes me suspect 
that there is some common interaction, and that it's somehow BH-related 
(they all share much of the buffer head infrastructure). So it doesn't 
look like it's just a bug in one random filesystem, I think it's a bug in 
some buffer-head infrastructure/helper function.

So I don't think it's "core VM". I don't think it's the "page cache". I 
think we handle the dirty state correctly at that level.

It looks more like "buffer cache" or "filesystem" to me by now.

(Btw, don't get me wrong - the above sequence numbers are in no way 
*proof* of anything. You could get big groups for one page just because 
something ended up being synchronous. I'll add some timestamps to my 
traces to make it easier to see where there was real IO going on and where 
there wasn't).

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 22:38                         ` David Miller
@ 2006-12-29  2:50                           ` Segher Boessenkool
  2006-12-29  6:48                             ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2006-12-29  2:50 UTC (permalink / raw)
  To: David Miller
  Cc: nickpiggin, kenneth.w.chen, guichaz, hugh, linux-kernel, ranma,
	torvalds, gordonfarquharson, akpm, a.p.zijlstra, tbm, arjan,
	andrei.popa

> I think what might be happening is that pdflush writes them out fine,
> however we don't trap writes by the application _during_ that writeout.

Yeah.  I believe that more exactly it happens if the very last
write to the page causes a writeback (due to dirty balancing)
while another writeback for the page is already happening.

As usual in these cases, I have zero proof.

> It's something that will only occur with writeback and MAP_SHARED
> writable access to the file pages.

It's the do_wp_page -> balance_dirty_pages -> generic_writepages
path for sure.  Maybe it's enough to change

                         if (wbc->sync_mode != WB_SYNC_NONE)
                                 wait_on_page_writeback(page);

                         if (PageWriteback(page) ||
                             !clear_page_dirty_for_io(page)) {
                                 unlock_page(page);
                                 continue;
                         }

to

                         if (wbc->sync_mode != WB_SYNC_NONE)
                                 wait_on_page_writeback(page);

                         if (PageWriteback(page)) {
                         	    redirty_page_for_writepage(wbc, page);
                                 unlock_page(page);
                                 continue;
                         }

                         if (!clear_page_dirty_for_io(page)) {
                                 unlock_page(page);
                                 continue;
                         }

or something along those lines.  Completely untested of course :-)


Segher


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 20:14                       ` Linus Torvalds
@ 2006-12-28 22:38                         ` David Miller
  2006-12-29  2:50                           ` Segher Boessenkool
  0 siblings, 1 reply; 45+ messages in thread
From: David Miller @ 2006-12-28 22:38 UTC (permalink / raw)
  To: torvalds
  Cc: akpm, guichaz, ranma, gordonfarquharson, tbm, a.p.zijlstra,
	andrei.popa, hugh, nickpiggin, arjan, linux-kernel,
	kenneth.w.chen

From: Linus Torvalds <torvalds@osdl.org>
Date: Thu, 28 Dec 2006 12:14:31 -0800 (PST)

> I get corruption - but the whole point is that it's very much pdflush that 
> should be writing these pages out.

I think what might be happening is that pdflush writes them out fine,
however we don't trap writes by the application _during_ that writeout.

These corruptions look exactly as if:

1) pdflush begins writeback of page X
2) page goes to disk
3) application writes a chunk to the page
4) pdflush et al. think the page is clean, so it gets tossed, losing
   the writes done in #3

So there's a missing PTE change in there, so that we never get proper
re-dirtying of the page if the application tries to write to the page
during the writeback.

It's something that will only occur with writeback and MAP_SHARED
writable access to the file pages.  That's why we never see this
with normal filesystem writes, since those explicitly manage the
page dirty state.

I think the dirty balancing logic etc. isn't where the problems are,
to me it's a PTE state update issue for sure.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 19:45                     ` Andrew Morton
  2006-12-28 20:14                       ` Linus Torvalds
@ 2006-12-28 22:35                       ` Mike Galbraith
  1 sibling, 0 replies; 45+ messages in thread
From: Mike Galbraith @ 2006-12-28 22:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Guillaume Chazarain, David Miller, ranma,
	gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, hugh,
	nickpiggin, arjan, Linux Kernel Mailing List, Chen Kenneth W

On Thu, 2006-12-28 at 11:45 -0800, Andrew Morton wrote:
> On Thu, 28 Dec 2006 11:28:52 -0800 (PST)
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
> > 
> > 
> > On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> > > 
> > > The attached patch fixes the corruption for me.
> > 
> > Well, that's a good hint, but it's really just a symptom. You effectively 
> > just made the test-program not even try to flush the data to disk, so the 
> > page cache would stay in memory, and you'd not see the corruption as well.
> > 
> > So you basically disabled the code that tried to trigger the bug more 
> > easily.
> > 
> > But the reason I say it's interesting is that "WB_SYNC_NONE" is very much 
> > implicated in mm/page-writeback.c, and if there is a bug triggered by 
> > WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also 
> > fails..
> > 
> 
> It would be interesting to convert your app to do fsync() before
> FADV_DONTNEED.  That would take WB_SYNC_NONE out of the picture as well
> (apart from pdflush activity).

I did fdatasync(), tried remapping before unmapping... nogo here.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 19:45                     ` Andrew Morton
@ 2006-12-28 20:14                       ` Linus Torvalds
  2006-12-28 22:38                         ` David Miller
  2006-12-28 22:35                       ` Mike Galbraith
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 20:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Guillaume Chazarain, David Miller, ranma, gordonfarquharson, tbm,
	Peter Zijlstra, andrei.popa, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List, Chen Kenneth W

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2468 bytes --]



On Thu, 28 Dec 2006, Andrew Morton wrote:
> 
> It would be interesting to convert your app to do fsync() before
> FADV_DONTNEED.  That would take WB_SYNC_NONE out of the picture as well
> (apart from pdflush activity).

I get corruption - but the whole point is that it's very much pdflush that 
should be writing these pages out.

Andrew - give my test-program a try. It can run in about 1 minute if you 
have a 256MB machine (I didn't, but "mem=256M" is my friend), and it seems 
to very consistently cause corruption.

What I do is:

	# Make sure we write back aggressively
	echo 5 > /proc/sys/vm/dirty_ratio

as root, and then just run the thing. Tons of corruption. But the 
corruption goes away if I just leave the default dirty ratio alone (but 
then I can increse the file size to trigger it, of course - but that 
also makes the test run a lot slower).

Now, with a pre-2.6.19 kernel, I bet you won't get the corruption as 
easily (at least with the "fsync()"), but that's less to do with anything 
new, and probably just because then you simply won't have any pdflushing 
going on - since the kernel won't even notice that you have tons of dirty 
pages ;)

It might also depend on the speed of your disk drive - the machine I test 
this on has a slow 4200 rpm laptop drive in it, and that probably makes 
things go south more easily. That's _especially_ true if this is related 
to any "bdi_write_congested()" logic.

Now, it could also be related to various code snippets like

	...
	if (wbc->sync_mode != WB_SYNC_NONE)
		wait_on_page_writeback(page);

	if (PageWriteback(page) ||
			!clear_page_dirty_for_io(page)) {
		unlock_page(page);
		continue;
	}
	...

where the WB_SYNC_NONE case will hit the "PageWriteback()" and just not do 
the writeback at all (but it also won't clear the dirty bit, so it's 
certainly not an *OBVIOUS* bug).

We also have code like this ("pageout()"):

	if (clear_page_dirty_for_io(page)) {
		int res;
		struct writeback_control wbc = {
			.sync_mode = WB_SYNC_NONE,
			..
		}
		...
		res = mapping->a_ops->writepage(page, &wbc);

and in this case, if the "WB_SYNC_NONE" means that the "writepage()" call 
won't do anything at all because of congestion, then that would be a _bad_ 
thing, and would certainly explain how something didn't get written out.

But that particular path should only trigger for the "shrink_page_list()" 
case, and it's not the case I seem to be testing with my "low dirty_ratio" 
testing.

		Linus

[-- Attachment #2: Type: TEXT/PLAIN, Size: 2872 bytes --]

#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (22 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

#define page_offset(buf, off) (unsigned)((unsigned long)(buf)+(off)-(unsigned long)(mapping))

static int chunkorder[NRCHUNKS];
static char *mapping;

static int order(int nr)
{
	int i;
	if (nr < 0 || nr >= NRCHUNKS)
		return -1;
	for (i = 0; i < NRCHUNKS; i++)
		if (chunkorder[i] == nr)
			return i;
	return -2;
}

static void checkmem(void *buf, int nr)
{
	unsigned int start = ~0u, end = 0;
	unsigned char c = nr, *p = buf, differs = 0;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		unsigned char got = *p++;
		if (got != c) {
			if (i < start)
				start = i;
			if (i > end)
				end = i;
			differs = got;
		}
	}
	if (start < end) {
		printf("Chunk %d corrupted (%u-%u)  (%x-%x)            \n", nr, start, end,
			page_offset(buf, start), page_offset(buf, end));
		printf("Expected %u, got %u\n", c, differs);
		printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1));
	}
}

static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
//		fsync(fd);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
	int fd, i;

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%)     \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);

	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%)     \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Clean up for next time */
	sleep(5);
	sync();
	sleep(5);
	munmap(mapping, SIZE);
	close(fd);
	unlink("mapfile");
	
	return 0;
}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 19:28                   ` Linus Torvalds
@ 2006-12-28 19:45                     ` Andrew Morton
  2006-12-28 20:14                       ` Linus Torvalds
  2006-12-28 22:35                       ` Mike Galbraith
  0 siblings, 2 replies; 45+ messages in thread
From: Andrew Morton @ 2006-12-28 19:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Guillaume Chazarain, David Miller, ranma, gordonfarquharson, tbm,
	Peter Zijlstra, andrei.popa, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List, Chen Kenneth W

On Thu, 28 Dec 2006 11:28:52 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> > 
> > The attached patch fixes the corruption for me.
> 
> Well, that's a good hint, but it's really just a symptom. You effectively 
> just made the test-program not even try to flush the data to disk, so the 
> page cache would stay in memory, and you'd not see the corruption as well.
> 
> So you basically disabled the code that tried to trigger the bug more 
> easily.
> 
> But the reason I say it's interesting is that "WB_SYNC_NONE" is very much 
> implicated in mm/page-writeback.c, and if there is a bug triggered by 
> WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also 
> fails..
> 

It would be interesting to convert your app to do fsync() before
FADV_DONTNEED.  That would take WB_SYNC_NONE out of the picture as well
(apart from pdflush activity).


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 19:19                 ` Guillaume Chazarain
@ 2006-12-28 19:28                   ` Linus Torvalds
  2006-12-28 19:45                     ` Andrew Morton
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 19:28 UTC (permalink / raw)
  To: Guillaume Chazarain
  Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
	andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List, Chen Kenneth W



On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> 
> The attached patch fixes the corruption for me.

Well, that's a good hint, but it's really just a symptom. You effectively 
just made the test-program not even try to flush the data to disk, so the 
page cache would stay in memory, and you'd not see the corruption as well.

So you basically disabled the code that tried to trigger the bug more 
easily.

But the reason I say it's interesting is that "WB_SYNC_NONE" is very much 
implicated in mm/page-writeback.c, and if there is a bug triggered by 
WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also 
fails..

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 15:09               ` Guillaume Chazarain
@ 2006-12-28 19:19                 ` Guillaume Chazarain
  2006-12-28 19:28                   ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Guillaume Chazarain @ 2006-12-28 19:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
	andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List, Chen Kenneth W

[-- Attachment #1: Type: text/plain, Size: 625 bytes --]

Guillaume Chazarain a écrit :
> I get this kind of corruption: 
> http://guichaz.free.fr/linux-bug/corruption.png

Actually in qemu, I get three different behaviours:
- no corruption at all : with linux-2.4
- corruption only on the first chunks: before  [PATCH] mm: balance dirty 
pages as identified by Kenneth
- corruption of all chunks: after the balance dirty pages patch

Bisecting in linux-2.5 land I found 
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.66/2.5.66-mm3/broken-out/fadvise-flush-data.patch 
to cause the corruption for me.

The attached patch fixes the corruption for me.

-- 
Guillaume


[-- Attachment #2: fadvise-dontneed.patch --]
[-- Type: text/x-patch, Size: 492 bytes --]

diff -r 3859b1144d3a mm/fadvise.c
--- a/mm/fadvise.c	Sun Dec 24 05:00:03 2006 +0000
+++ b/mm/fadvise.c	Thu Dec 28 19:53:40 2006 +0100
@@ -96,9 +96,6 @@ asmlinkage long sys_fadvise64_64(int fd,
 	case POSIX_FADV_NOREUSE:
 		break;
 	case POSIX_FADV_DONTNEED:
-		if (!bdi_write_congested(mapping->backing_dev_info))
-			filemap_flush(mapping);
-
 		/* First and last FULL page! */
 		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
 		end_index = (endbyte >> PAGE_CACHE_SHIFT);

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 18:44                         ` Russell King
@ 2006-12-28 19:01                           ` Linus Torvalds
  0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 19:01 UTC (permalink / raw)
  To: Russell King
  Cc: Gordon Farquharson, David Miller, ranma, tbm, Peter Zijlstra,
	andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List



On Thu, 28 Dec 2006, Russell King wrote:
> 
> Yup, but I have nothing to do with glibc because I refuse to do that
> silly copyright assignment FSF thing.  Hopefully someone else can
> resolve it, but...

Yeah, me too.

> _is_ a fix whether _you_ like it or not to work around the issue so
> people can at least run your test program.  I'm not saying it's a
> proper fix though.

My point was that it wasn't a "fix", it's a "workaround". The _fix_ would 
be in glibc.

Nothing more..

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 17:27                       ` Linus Torvalds
@ 2006-12-28 18:44                         ` Russell King
  2006-12-28 19:01                           ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Russell King @ 2006-12-28 18:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, David Miller, ranma, tbm, Peter Zijlstra,
	andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List

On Thu, Dec 28, 2006 at 09:27:12AM -0800, Linus Torvalds wrote:
> On Thu, 28 Dec 2006, Russell King wrote:
> > and if you look at glibc's memset() function, you'll notice that's exactly
> > what you expect if you pass a non-8bit value to it.  Ergo, what you're
> > seeing is utterly expected given glibc's memset() implementation on ARM.
> 
> Guys, you _really_ should fix memset(). What you describe is a _bug_. 

Yup, but I have nothing to do with glibc because I refuse to do that
silly copyright assignment FSF thing.  Hopefully someone else can
resolve it, but...

> > Fixing Linus' test program to pass nr & 255 to memset
> 
> No. I'm almost certain that that is not a "fix", it's a workaround for a 
> serious bug in your glibc crap.

_is_ a fix whether _you_ like it or not to work around the issue so
people can at least run your test program.  I'm not saying it's a
proper fix though.

Of course, if you prefer to be mislead by incorrect bug reports...

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 10:13                     ` Russell King
  2006-12-28 14:15                       ` Gordon Farquharson
@ 2006-12-28 17:27                       ` Linus Torvalds
  2006-12-28 18:44                         ` Russell King
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 17:27 UTC (permalink / raw)
  To: Russell King
  Cc: Gordon Farquharson, David Miller, ranma, tbm, Peter Zijlstra,
	andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List



On Thu, 28 Dec 2006, Russell King wrote:
> 
> and if you look at glibc's memset() function, you'll notice that's exactly
> what you expect if you pass a non-8bit value to it.  Ergo, what you're
> seeing is utterly expected given glibc's memset() implementation on ARM.

Guys, you _really_ should fix memset(). What you describe is a _bug_. 

"memset()" takes an "int" as its argument (always has), and has to convert 
it to a byte _itself_. It may not be common, but it's perfectly normal, to 
pass it values outside 0-255 (negative values that still fit in a "signed 
char" in particular are very normal, but my usage of "let the thing 
truncate it itself" is also quite fine).

> Fixing Linus' test program to pass nr & 255 to memset

No. I'm almost certain that that is not a "fix", it's a workaround for a 
serious bug in your glibc crap.

But it does explain all the unexpected strange behaviour (and the really 
small writeback size - now it doesn't need any /proc/sys/vm/dirty_ratio 
assumptions to be explicable.

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  9:15               ` Zhang, Yanmin
@ 2006-12-28 17:15                 ` Linus Torvalds
  0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 17:15 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
	andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List



On Thu, 28 Dec 2006, Zhang, Yanmin wrote:
> 
> The test program is a process to write/read data. pdflush might write data
> to disk asynchronously. After pdflush writes a page to disk, it will call (either by
> softirq) clear_page_dirty to clear the dirty bit after getting the interrupt
> notification.

That would indeed be a horrible bug. However, we don't do 
"clear_page_dirty()" _after_ the IO has completed, we do it _before_ the 
IO starts.

If you can actually find a place that does clear_page_dirty _after_ IO, 
then yes, you've just found the bug. But I haven't found it.

So the rule is _always_:

 - call "clear_page_dirty_for_io()" with the page lock held, and _before_ 
   the IO starts.
 - do "set_page_writeback()" before unlocking the page again
 - do a "end_page_writeback()" when the IO actually finishes.

and any code sequence that doesn't honor those rules would be buggy. A 
beer for anybody that finds it..

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH] mm: fix page_mkclean_one
  2006-12-28  6:10                 ` Chen, Kenneth W
  2006-12-28  6:27                   ` David Miller
@ 2006-12-28 17:10                   ` Linus Torvalds
  1 sibling, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 17:10 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
	andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List



On Wed, 27 Dec 2006, Chen, Kenneth W wrote:
> > 
> > Running the test code, git bisect points its finger at this commit. Reverting
> > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
> > 
> >     [PATCH] mm: balance dirty pages
> > 
> >     Now that we can detect writers of shared mappings, throttle them.  Avoids OOM
> >     by surprise.
> 
> Oh, never mind :-(  I just didn't create enough write out pressure when
> test this. I just saw bug got triggered on a kernel I previously thought
> was OK.

Btw, this is an important point - people have long felt that the new page 
balancing in 2.6.19 was to blame, but you've just confirmed the long-held 
suspicion (at least by me) that it's not actually a new bug at all, it's 
just that the dirty page balancing causes writeback to happen _earlier_, 
and thus is better able to _show_ a bug that we've likely had for a long 
long time.

			Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  5:58                       ` Gordon Farquharson
@ 2006-12-28 17:08                         ` Linus Torvalds
  0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 17:08 UTC (permalink / raw)
  To: Gordon Farquharson
  Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List



On Wed, 27 Dec 2006, Gordon Farquharson wrote:
> 
> 100kB and 200kB files always succeed on the ARM system. 400kB and
> larger always seem to fail.

Oh, wow. Yeah, I've just repressed how tiny 32MB is. And especially if you 
lowered the /proc/sys/vm/dirty_ratio to a smaller percentage, I guess 
400kB should be enough to cause writeback.

Ugh. I tested a 128MB machine a few weeks ago, and found it painful.

		Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 14:15                       ` Gordon Farquharson
@ 2006-12-28 15:53                         ` Martin Michlmayr
  0 siblings, 0 replies; 45+ messages in thread
From: Martin Michlmayr @ 2006-12-28 15:53 UTC (permalink / raw)
  To: Gordon Farquharson
  Cc: Linus Torvalds, David Miller, ranma, Peter Zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List

* Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-28 07:15]:
> Thanks for the fix, Russell.
> 
> I can now trigger the (real) problem by using a 25 MB file (100 << 18)
> and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM).

Me too (using 100 << 18).  Interestingly, I don't seem to get any
corruption on a different ARM board, an IOP32x based machine with 128
MB RAM.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
@ 2006-12-28 15:50 martin
  0 siblings, 0 replies; 45+ messages in thread
From: martin @ 2006-12-28 15:50 UTC (permalink / raw)
  To: Linus Torvalds, Guillaume Chazarain
  Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
	andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List


On Thu Dec 28 15:09 , Guillaume Chazarain sent:

>I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/
>I have corruption with every Fedora release kernel except the first, that is
>2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit 
>some corruption.

Confirm that I see corruption with Fedora kernel 2.6.18-1.2239.fc5:

...
Chunk 142969 corrupted (0-1459)  (2580-4039)
Expected 121, got 0
Written as (89632)127652(124721)
Chunk 142976 corrupted (0-1459)  (512-1971)
Expected 128, got 0
Written as (121734)128324(108589)
Checking chunk 143639/143640 (99%)
$ uname -a
Linux 2.6.18-1.2239.fc5 #1 Fri Nov 10 13:04:06 EST 2006 i686 athlon i386 GNU/Linux

/Martin

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 10:49                           ` Russell King
@ 2006-12-28 14:56                             ` Martin Michlmayr
  0 siblings, 0 replies; 45+ messages in thread
From: Martin Michlmayr @ 2006-12-28 14:56 UTC (permalink / raw)
  To: Gordon Farquharson, Linus Torvalds, David Miller, ranma,
	Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin,
	arjan, Linux Kernel Mailing List

* Russell King <rmk+lkml@arm.linux.org.uk> [2006-12-28 10:49]:
> > By the way, I just tried it with TARGETSIZE (100 << 12) on a different
> > ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
> > memory) and I see similar results to that from Gordon:
> 
> Work around the glibc memset() problem by passing nr & 255, and re-run
> the test.  You're getting false positives.

Yeah, I saw your message about this problem after I sent mine.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 10:13                     ` Russell King
@ 2006-12-28 14:15                       ` Gordon Farquharson
  2006-12-28 15:53                         ` Martin Michlmayr
  2006-12-28 17:27                       ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Gordon Farquharson @ 2006-12-28 14:15 UTC (permalink / raw)
  To: Gordon Farquharson, Linus Torvalds, David Miller, ranma, tbm,
	Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin,
	arjan, Linux Kernel Mailing List

On 12/28/06, Russell King <rmk+lkml@arm.linux.org.uk> wrote:

> Fixing Linus' test program to pass nr & 255 to memset results in clean
> passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM
> machines (as would be expected.)

Thanks for the fix, Russell.

I can now trigger the (real) problem by using a 25 MB file (100 << 18)
and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM).

$ ./linus-test
Writing chunk 17954/17955 (99%)
Chunk 514 corrupted (0-1459)  (872-2331)
Expected 2, got 0
Written as (8479)11160(10312)
Chunk 516 corrupted (0-303)  (3792-4095)
Expected 4, got 0
Written as (10312)10569(4426)
Chunk 959 corrupted (0-691)  (3404-4095)
Expected 191, got 0
Written as (687)4881(1522)
Chunk 1895 corrupted (0-1459)  (1900-3359)
Expected 103, got 0
Written as (7746)8389(6231)
Chunk 2702 corrupted (0-1459)  (472-1931)
Expected 142, got 0
Written as (4866)7103(2409)
Chunk 3314 corrupted (0-1459)  (1064-2523)
Expected 242, got 0
Written as (4287)7064(1730)
Chunk 4043 corrupted (0-1459)  (444-1903)
Expected 203, got 0
Written as (6495)8509(4464)
Chunk 5180 corrupted (0-1459)  (1584-3043)
Expected 60, got 0
Written as (11056)12826(10797)
Chunk 5672 corrupted (0-991)  (3104-4095)
Expected 40, got 0
Written as (9944)4872(41)
Chunk 5793 corrupted (460-1459)  (0-999)
Expected 161, got 0
Written as (7059)5038(4377)
Chunk 6089 corrupted (0-1459)  (1620-3079)
Expected 201, got 0
Written as (4672)5230(4403)
Chunk 6545 corrupted (268-1459)  (0-1191)
Expected 145, got 0
Written as (3701)5969(4668)
Chunk 7578 corrupted (0-1459)  (584-2043)
Expected 154, got 0
Written as (10015)5082(1648)
Chunk 7880 corrupted (864-1459)  (0-595)
Expected 200, got 0
Written as (17869)5064(4745)
Chunk 8086 corrupted (0-1459)  (888-2347)
Expected 150, got 0
Written as (10206)11050(10374)
Chunk 8749 corrupted (0-1459)  (2212-3671)
Expected 45, got 0
Written as (15263)7132(4825)
Chunk 9068 corrupted (0-1459)  (1008-2467)
Expected 108, got 0
Written as (5557)7571(6771)
Chunk 9193 corrupted (812-1459)  (0-647)
Expected 233, got 0
Written as (9238)7277(4757)
Chunk 10032 corrupted (576-1459)  (0-883)
Expected 48, got 0
Written as (15741)10012(1753)
Chunk 10056 corrupted (0-1459)  (1696-3155)
Expected 72, got 0
Written as (5379)7431(262)
Chunk 10395 corrupted (0-1459)  (1020-2479)
Expected 155, got 0
Written as (21)7442(5902)
Chunk 10791 corrupted (0-1459)  (1644-3103)
Expected 39, got 0
Written as (4753)5925(5926)
Chunk 10792 corrupted (0-991)  (3104-4095)
Expected 40, got 0
Written as (5925)5926(8555)
Chunk 11036 corrupted (0-1103)  (2992-4095)
Expected 28, got 0
Written as (13755)14449(7458)
Chunk 11387 corrupted (644-1459)  (0-815)
Expected 123, got 0
Written as (10853)11459(9445)
Chunk 11586 corrupted (920-1459)  (0-539)
Expected 66, got 0
Written as (3769)11691(11123)
Chunk 11882 corrupted (0-1459)  (1160-2619)
Expected 106, got 0
Written as (10736)11696(2788)
Chunk 12397 corrupted (0-603)  (3492-4095)
Expected 109, got 0
Written as (2352)7515(2437)
Chunk 12669 corrupted (0-795)  (3300-4095)
Expected 125, got 0
Written as (1191)7661(5266)
Chunk 13162 corrupted (0-1459)  (2184-3643)
Expected 106, got 0
Written as (9383)13662(11544)
Chunk 14653 corrupted (0-27)  (4068-4095)
Expected 61, got 0
Written as (8100)9456(1275)
Chunk 17332 corrupted (0-367)  (3728-4095)
Expected 180, got 0
Written as (760)12247(1244)
Chunk 17445 corrupted (0-1459)  (772-2231)
Expected 37, got 0
Written as (8007)16481(14439)
Chunk 17556 corrupted (0-1007)  (3088-4095)
Expected 148, got 0
Written as (10113)10657(10477)
Chunk 17859 corrupted (0-995)  (3100-4095)
Expected 195, got 0
Written as (14472)14767(11426)
Checking chunk 17954/17955 (99%)

Gordon

-- 
Gordon Farquharson

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  3:04             ` Linus Torvalds
                                 ` (2 preceding siblings ...)
  2006-12-28  9:15               ` Zhang, Yanmin
@ 2006-12-28 11:50               ` Petri Kaukasoina
  2006-12-28 15:09               ` Guillaume Chazarain
  4 siblings, 0 replies; 45+ messages in thread
From: Petri Kaukasoina @ 2006-12-28 11:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
	andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List

On Wed, Dec 27, 2006 at 07:04:34PM -0800, Linus Torvalds wrote:
> [ Modified test-program that tells you where the corruption happens (and 
>   when the missing parts were supposed to be written out) appended, in 
>   case people care. ]

Hi

2.6.18 (and 2.6.18.6) is ok, 2.6.19-rc1 is broken. I tried some snapshots
between them but they hung before shell (2.6.18-git11, 2.6.18-git16,
2.6.18-git20, 2.6.18-git21). 2.6.18-git22 booted and was broken.

(UP, no preempt)

-Petri

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 10:16                         ` Martin Michlmayr
@ 2006-12-28 10:49                           ` Russell King
  2006-12-28 14:56                             ` Martin Michlmayr
  0 siblings, 1 reply; 45+ messages in thread
From: Russell King @ 2006-12-28 10:49 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Gordon Farquharson, Linus Torvalds, David Miller, ranma,
	Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin,
	arjan, Linux Kernel Mailing List

On Thu, Dec 28, 2006 at 11:16:59AM +0100, Martin Michlmayr wrote:
> * Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-27 22:38]:
> > >> #define TARGETSIZE (100 << 12)
> > >
> > >That's just 400kB!
> > >
> > >There's no way you should see corruption with that kind of value. It
> > >should all stay solidly in the cache.
> > >
> > >Is this perhaps with ARM nommu or something else strange? It may be that
> > >the program just doesn't work at all if mmap() is faked out with a malloc
> > >or similar.
> > 
> > Definitely a question for the ARM gurus. I'm out of my depth.
> 
> By the way, I just tried it with TARGETSIZE (100 << 12) on a different
> ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
> memory) and I see similar results to that from Gordon:

Work around the glibc memset() problem by passing nr & 255, and re-run
the test.  You're getting false positives.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  5:38                       ` Gordon Farquharson
  2006-12-28  9:30                         ` Martin Michlmayr
@ 2006-12-28 10:16                         ` Martin Michlmayr
  2006-12-28 10:49                           ` Russell King
  1 sibling, 1 reply; 45+ messages in thread
From: Martin Michlmayr @ 2006-12-28 10:16 UTC (permalink / raw)
  To: Gordon Farquharson
  Cc: Linus Torvalds, David Miller, ranma, Peter Zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List

* Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-27 22:38]:
> >> #define TARGETSIZE (100 << 12)
> >
> >That's just 400kB!
> >
> >There's no way you should see corruption with that kind of value. It
> >should all stay solidly in the cache.
> >
> >Is this perhaps with ARM nommu or something else strange? It may be that
> >the program just doesn't work at all if mmap() is faked out with a malloc
> >or similar.
> 
> Definitely a question for the ARM gurus. I'm out of my depth.

By the way, I just tried it with TARGETSIZE (100 << 12) on a different
ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
memory) and I see similar results to that from Gordon:

Writing chunk 279/280 (99%)
Chunk 256 corrupted (1-1455)  (1025-2479)
Expected 0, got 1
Written as (199)43(184)
Chunk 258 corrupted (1-1455)  (3945-1303)
Expected 2, got 3
Written as (184)74(145)
Chunk 260 corrupted (1-1455)  (2769-127)
Expected 4, got 5
Written as (145)89(237)
Chunk 262 corrupted (1-1455)  (1593-3047)
Expected 6, got 7
Written as (237)168(174)
Chunk 264 corrupted (1-1455)  (417-1871)
Expected 8, got 9
Written as (174)135(161)
Chunk 266 corrupted (1-1455)  (3337-695)
Expected 10, got 11
Written as (161)123(180)
Chunk 268 corrupted (1-1455)  (2161-3615)
Expected 12, got 13
Written as (180)13(19)
Chunk 270 corrupted (1-1455)  (985-2439)
Expected 14, got 15
Written as (19)172(106)
Chunk 272 corrupted (1-1455)  (3905-1263)
Expected 16, got 17
Written as (106)212(140)
Chunk 274 corrupted (1-1455)  (2729-87)
Expected 18, got 19
Written as (140)124(73)
Chunk 276 corrupted (1-1455)  (1553-3007)
Expected 20, got 21
Written as (73)151(52)
Chunk 278 corrupted (1-1455)  (377-1831)
Expected 22, got 23
Written as (52)215(99)
Checking chunk 279/280 (99%)

-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  5:20                   ` Gordon Farquharson
  2006-12-28  5:41                     ` David Miller
@ 2006-12-28 10:13                     ` Russell King
  2006-12-28 14:15                       ` Gordon Farquharson
  2006-12-28 17:27                       ` Linus Torvalds
  1 sibling, 2 replies; 45+ messages in thread
From: Russell King @ 2006-12-28 10:13 UTC (permalink / raw)
  To: Gordon Farquharson
  Cc: Linus Torvalds, David Miller, ranma, tbm, Peter Zijlstra,
	andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List

On Wed, Dec 27, 2006 at 10:20:20PM -0700, Gordon Farquharson wrote:
> I have run the program a few times, and the output is pretty
> consistent. However, when I increase the target size, the difference
> between the expected and actual values is larger.
> 
> Written as (749)935(738)
> Chunk 1113 corrupted (1-1455)  (2965-323)
> Expected 89, got 93

This is not the corruption Linus is after.  Note that the corruption starts
at offset '1'.  Also note that:

89 = 1113 & 255
93 = 1113 & 255 | (1113 >> 8)

and if you look at glibc's memset() function, you'll notice that's exactly
what you expect if you pass a non-8bit value to it.  Ergo, what you're
seeing is utterly expected given glibc's memset() implementation on ARM.

Fixing Linus' test program to pass nr & 255 to memset results in clean
passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM
machines (as would be expected.)

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  5:38                       ` Gordon Farquharson
@ 2006-12-28  9:30                         ` Martin Michlmayr
  2006-12-28 10:16                         ` Martin Michlmayr
  1 sibling, 0 replies; 45+ messages in thread
From: Martin Michlmayr @ 2006-12-28  9:30 UTC (permalink / raw)
  To: Gordon Farquharson
  Cc: Linus Torvalds, David Miller, ranma, Peter Zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List

* Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-27 22:38]:
> >That's just 400kB!
> >
> >There's no way you should see corruption with that kind of value. It
> >should all stay solidly in the cache.
> >
> >Is this perhaps with ARM nommu or something else strange? It may be that
> >the program just doesn't work at all if mmap() is faked out with a malloc
> >or similar.
> 
> Definitely a question for the ARM gurus. I'm out of my depth.

The CPU has a MMU.  For reference, it's a IXP4xx based device with 32 MB
of memory.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  3:04             ` Linus Torvalds
  2006-12-28  4:32               ` Gordon Farquharson
  2006-12-28  5:55               ` Chen, Kenneth W
@ 2006-12-28  9:15               ` Zhang, Yanmin
  2006-12-28 17:15                 ` Linus Torvalds
  2006-12-28 11:50               ` Petri Kaukasoina
  2006-12-28 15:09               ` Guillaume Chazarain
  4 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yanmin @ 2006-12-28  9:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
	andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List

On Wed, 2006-12-27 at 19:04 -0800, Linus Torvalds wrote:
> 
> On Wed, 27 Dec 2006, David Miller wrote:
> > > 
> > > I still don't see _why_, though. But maybe smarter people than me can see 
> > > it..
> > 
> > FWIW this program definitely triggers the bug for me.
> 
> Ok, now that I have something simple to do repeatable stuff with, I can 
> say what the pattern is.. It's not all that surprising, but it's still 
> worth just stating for the record.
> 
> What happens is that when I do the "packetized writes" in random order, 
> the _last_ write to a page occasionally just goes missing. It's not always 
> at the end of a page, as shown by for example:
> 
>  - A whole chunk got dropped:
> 
> 	Chunk 2094 corrupted (0-1459)  (1624-3083)
> 	Expected 46, got 0
> 	Written as (30912)55414(10000)
> 
>    That "Written as (x)y(z)" line means that the corrupted chunk was 
>    written as chunk #y, and the preceding and following chunks (that were 
>    _not_ corrupt) on the page was written as #x and #z respectively.
> 
>    In other words, the missing chunk (which is still zero) was written 
>    much later than the ones that were ok, and never hit the disk. It's a 
>    contiguous chunk in the middle of the page (chunks are 1460 bytes in 
>    size)
> 
>    The first line means that all bytes of the chunk (0-1459) were 
>    corrupted, and the values in parenthesis are the offsets within a page. 
>    In other words, this was a chunk in the _middle_ of a page.
> 
>  - The missing data can also be at the beginning or ends of pages:
> 
>    Beginning of the chunk missing, it was at the end of a page (page 
>    offsets 3288-4095) and the _next_ page got written out fine:
> 
> 	Chunk 2126 corrupted (0-807)  (3288-4095)
> 	Expected 78, got 0
> 	Written as (32713)55573(14301)
> 
>    End of a chunk missing, it was the beginning of a page (and the 
>    _previous_ page that contained the beginning of the chunk was written 
>    out fine)
> 
> 	Chunk 2179 corrupted (1252-1459)  (0-207)
> 	Expected 131, got 0
> 	Written as (45189)55489(15515)
> 
> Now, the reason I say this isn't surprising is that this is entirely 
> consistent with the dirty bit being dropped on the floor somewhere, and 
> likely through some interaction with the previous changes being in the 
> process of being written out.
> 
> Something (incorrectly) ends up deciding that it doesn't need to write the 
> page, since it's already written, or alternatively clears the dirty bit 
> too late (clears it because an _earlier_ write finished, never mind that 
> the new dirty data didn't make it).
There might be a narrow race between set_page_dirty and clear_page_dirty.

The test program is a process to write/read data. pdflush might write data
to disk asynchronously. After pdflush writes a page to disk, it will call (either by
softirq) clear_page_dirty to clear the dirty bit after getting the interrupt
notification. But just after the page is written to disk and before the interrupt
reports the result, the test process might change the data and unmap the area. When
the area is unmapped, the page is marked as dirty again, but just after that, the
interrupt arrives and the dirty bit is cleared, so the late data will not be written
to disk.

Function zap_pte_range checks pte to set page dirty if needed, but it doesn't
hold page lock. If the page lock is held before set page dirty, the race might
be avoided.

Yanmin

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  6:10                 ` Chen, Kenneth W
@ 2006-12-28  6:27                   ` David Miller
  2006-12-28 17:10                   ` Linus Torvalds
  1 sibling, 0 replies; 45+ messages in thread
From: David Miller @ 2006-12-28  6:27 UTC (permalink / raw)
  To: kenneth.w.chen
  Cc: torvalds, ranma, gordonfarquharson, tbm, a.p.zijlstra,
	andrei.popa, akpm, hugh, nickpiggin, arjan, linux-kernel

From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Date: Wed, 27 Dec 2006 22:10:52 -0800

> Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM
> > Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> > > On Wed, 27 Dec 2006, David Miller wrote:
> > > > > 
> > > > > I still don't see _why_, though. But maybe smarter people than me can see 
> > > > > it..
> > > > 
> > > > FWIW this program definitely triggers the bug for me.
> > > 
> > > Ok, now that I have something simple to do repeatable stuff with, I can 
> > > say what the pattern is.. It's not all that surprising, but it's still 
> > > worth just stating for the record.
> > 
> > 
> > Running the test code, git bisect points its finger at this commit. Reverting
> > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
> > 
> > edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
> > commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
> > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Date:   Mon Sep 25 23:30:58 2006 -0700
> > 
> >     [PATCH] mm: balance dirty pages
> > 
> >     Now that we can detect writers of shared mappings, throttle them.  Avoids OOM
> >     by surprise.
> 
> 
> Oh, never mind :-(  I just didn't create enough write out pressure when
> test this. I just saw bug got triggered on a kernel I previously thought
> was OK.

Besides, I'm pretty sure that from the Debian bug entry it's been
established that the dirty-page tracking changes from a few releases
ago introduced this problem.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH] mm: fix page_mkclean_one
  2006-12-28  5:55               ` Chen, Kenneth W
@ 2006-12-28  6:10                 ` Chen, Kenneth W
  2006-12-28  6:27                   ` David Miller
  2006-12-28 17:10                   ` Linus Torvalds
  0 siblings, 2 replies; 45+ messages in thread
From: Chen, Kenneth W @ 2006-12-28  6:10 UTC (permalink / raw)
  To: 'Linus Torvalds', David Miller
  Cc: ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List

Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM
> Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> > On Wed, 27 Dec 2006, David Miller wrote:
> > > > 
> > > > I still don't see _why_, though. But maybe smarter people than me can see 
> > > > it..
> > > 
> > > FWIW this program definitely triggers the bug for me.
> > 
> > Ok, now that I have something simple to do repeatable stuff with, I can 
> > say what the pattern is.. It's not all that surprising, but it's still 
> > worth just stating for the record.
> 
> 
> Running the test code, git bisect points its finger at this commit. Reverting
> this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
> 
> edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
> commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date:   Mon Sep 25 23:30:58 2006 -0700
> 
>     [PATCH] mm: balance dirty pages
> 
>     Now that we can detect writers of shared mappings, throttle them.  Avoids OOM
>     by surprise.


Oh, never mind :-(  I just didn't create enough write out pressure when
test this. I just saw bug got triggered on a kernel I previously thought
was OK.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
       [not found]                     ` <Pine.LNX.4.64.0612272120180.4473@woody.osdl.org>
  2006-12-28  5:38                       ` Gordon Farquharson
@ 2006-12-28  5:58                       ` Gordon Farquharson
  2006-12-28 17:08                         ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Gordon Farquharson @ 2006-12-28  5:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List

On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote:

> That's just 400kB!
>
> There's no way you should see corruption with that kind of value. It
> should all stay solidly in the cache.

100kB and 200kB files always succeed on the ARM system. 400kB and
larger always seem to fail.

Does the following help interpret the results on ARM at all ?

$ free
             total       used       free     shared    buffers     cached
Mem:         30000      23620       6380          0        808      15676
-/+ buffers/cache:       7136      22864
Swap:        88316       3664      84652

Gordon

-- 
Gordon Farquharson

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH] mm: fix page_mkclean_one
  2006-12-28  3:04             ` Linus Torvalds
  2006-12-28  4:32               ` Gordon Farquharson
@ 2006-12-28  5:55               ` Chen, Kenneth W
  2006-12-28  6:10                 ` Chen, Kenneth W
  2006-12-28  9:15               ` Zhang, Yanmin
                                 ` (2 subsequent siblings)
  4 siblings, 1 reply; 45+ messages in thread
From: Chen, Kenneth W @ 2006-12-28  5:55 UTC (permalink / raw)
  To: 'Linus Torvalds', David Miller
  Cc: ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List

Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> On Wed, 27 Dec 2006, David Miller wrote:
> > > 
> > > I still don't see _why_, though. But maybe smarter people than me can see 
> > > it..
> > 
> > FWIW this program definitely triggers the bug for me.
> 
> Ok, now that I have something simple to do repeatable stuff with, I can 
> say what the pattern is.. It's not all that surprising, but it's still 
> worth just stating for the record.


Running the test code, git bisect points its finger at this commit. Reverting
this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.


edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Mon Sep 25 23:30:58 2006 -0700

    [PATCH] mm: balance dirty pages

    Now that we can detect writers of shared mappings, throttle them.  Avoids OOM
    by surprise.

    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Hugh Dickins <hugh@veritas.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  5:41                     ` David Miller
@ 2006-12-28  5:47                       ` Gordon Farquharson
  0 siblings, 0 replies; 45+ messages in thread
From: Gordon Farquharson @ 2006-12-28  5:47 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, ranma, tbm, a.p.zijlstra, andrei.popa, akpm, hugh,
	nickpiggin, arjan, linux-kernel

Hi David

On 12/27/06, David Miller <davem@davemloft.net> wrote:

> Me too, I added "-D_POSIX_C_SOURCE=200112" to "fix" this.

That works for me. Thanks for the tip.

Gordon

-- 
Gordon Farquharson

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  5:20                   ` Gordon Farquharson
@ 2006-12-28  5:41                     ` David Miller
  2006-12-28  5:47                       ` Gordon Farquharson
  2006-12-28 10:13                     ` Russell King
  1 sibling, 1 reply; 45+ messages in thread
From: David Miller @ 2006-12-28  5:41 UTC (permalink / raw)
  To: gordonfarquharson
  Cc: torvalds, ranma, tbm, a.p.zijlstra, andrei.popa, akpm, hugh,
	nickpiggin, arjan, linux-kernel

From: "Gordon Farquharson" <gordonfarquharson@gmail.com>
Date: Wed, 27 Dec 2006 22:20:20 -0700

> and for some reason I get
> 
> linus-test.c: In function 'remap':
> linus-test.c:61: error: 'POSIX_FADV_DONTNEED' undeclared (first use in
> this function)
> 
> when I compile the program, so I replaced POSIX_FADV_DONTNEED with 4
> as defined in /usr/include/bits/fcntl.h.

Me too, I added "-D_POSIX_C_SOURCE=200112" to "fix" this.

Perhaps Linus's GCC sets that by default and our's doesn't.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
       [not found]                     ` <Pine.LNX.4.64.0612272120180.4473@woody.osdl.org>
@ 2006-12-28  5:38                       ` Gordon Farquharson
  2006-12-28  9:30                         ` Martin Michlmayr
  2006-12-28 10:16                         ` Martin Michlmayr
  2006-12-28  5:58                       ` Gordon Farquharson
  1 sibling, 2 replies; 45+ messages in thread
From: Gordon Farquharson @ 2006-12-28  5:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List

On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote:

> On Wed, 27 Dec 2006, Gordon Farquharson wrote:
> >
> > I don't think so. I did reduce the target size
> >
> > #define TARGETSIZE (100 << 12)
>
> That's just 400kB!
>
> There's no way you should see corruption with that kind of value. It
> should all stay solidly in the cache.
>
> Is this perhaps with ARM nommu or something else strange? It may be that
> the program just doesn't work at all if mmap() is faked out with a malloc
> or similar.

Definitely a question for the ARM gurus. I'm out of my depth.

Gordon

-- 
Gordon Farquharson

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  4:53                 ` Linus Torvalds
@ 2006-12-28  5:20                   ` Gordon Farquharson
  2006-12-28  5:41                     ` David Miller
  2006-12-28 10:13                     ` Russell King
       [not found]                   ` <97a0a9ac0612272115g4cce1f08n3c3c8498a6076bd5@mail.gmail.com>
  1 sibling, 2 replies; 45+ messages in thread
From: Gordon Farquharson @ 2006-12-28  5:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List

[Oops - forgot to hit "Reply to All" first time round.]

Hi Linus

On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote:

> For all I know, my test-program is buggy wrt the ordering printouts,
> though. Did you perhaps change the logic in any way?

I don't think so. I did reduce the target size

#define TARGETSIZE (100 << 12)

to make the program finish a little quicker, and for some reason I get

linus-test.c: In function 'remap':
linus-test.c:61: error: 'POSIX_FADV_DONTNEED' undeclared (first use in
this function)

when I compile the program, so I replaced POSIX_FADV_DONTNEED with 4
as defined in /usr/include/bits/fcntl.h.

Other than these two changes, the program is identical to the version
you posted.

I have run the program a few times, and the output is pretty
consistent. However, when I increase the target size, the difference
between the expected and actual values is larger.

Written as (749)935(738)
Chunk 1113 corrupted (1-1455)  (2965-323)
Expected 89, got 93
Written as (935)738(538)
Chunk 1114 corrupted (1-1455)  (329-1783)
Expected 90, got 94
Written as (738)538(678)
Chunk 1115 corrupted (1-1455)  (1789-3243)
Expected 91, got 95
Written as (538)678(989)
Chunk 1120 corrupted (1-1455)  (897-2351)
Expected 96, got 100
Written as (537)265(1005)
Chunk 1121 corrupted (1-1455)  (2357-3811)
Expected 97, got 101
Written as (265)1005(-1)

--- linus-test.c.orig   2006-12-28 06:17:24.000000000 +0100
+++ linus-test.c        2006-12-28 06:18:24.000000000 +0100
@@ -6,7 +6,7 @@
 #include <stdio.h>
 #include <time.h>

-#define TARGETSIZE (100 << 20)
+#define TARGETSIZE (100 << 14)
 #define CHUNKSIZE (1460)
 #define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
 #define SIZE (NRCHUNKS * CHUNKSIZE)
@@ -61,7 +61,7 @@
 {
        if (mapping) {
                munmap(mapping, SIZE);
-                posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
+                posix_fadvise(fd, 0, SIZE, 4);
        }
        return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

Gordon

-- 
Gordon Farquharson

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  4:32               ` Gordon Farquharson
@ 2006-12-28  4:53                 ` Linus Torvalds
  2006-12-28  5:20                   ` Gordon Farquharson
       [not found]                   ` <97a0a9ac0612272115g4cce1f08n3c3c8498a6076bd5@mail.gmail.com>
  0 siblings, 2 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28  4:53 UTC (permalink / raw)
  To: Gordon Farquharson
  Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List



On Wed, 27 Dec 2006, Gordon Farquharson wrote:
> 
> It is at all suprising that the second offset within a page can be
> less than the first offset within a page ? e.g.
> 
> Chunk 260 corrupted (1-1455)  (2769-127)

No, that just means that it went over to the next page (so you actually 
had two consecutive pages that weren't written out).

That said, your output is very different from mine in another way. You 
don't have zeroes in your pages, rather the thing seems to have data from 
the next block (ie the chunk that should have 20 is reported as having 21 
etc). You also have your offsets shifted up by one (ie offset 0 looks ok 
for you, and then you have a strange pattern of corruption at bytes 
1...1455 instead of 0..1459.

You also seem to have an example of the _earlier_ writes being corrupted, 
rather than the later ones. For example (but it's also a page-crosser, so 
maybe that's part of it):

	Chunk 274 corrupted (1-1455)  (2729-87)
	Expected 18, got 19
	Written as (154)11(85)

says that block chunk 274 is the corrupt one, but it was written fairly 
early as #11, and the blocks around it (chunks 273 and 275) were actually 
written later.

For all I know, my test-program is buggy wrt the ordering printouts, 
though. Did you perhaps change the logic in any way?

			Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  3:04             ` Linus Torvalds
@ 2006-12-28  4:32               ` Gordon Farquharson
  2006-12-28  4:53                 ` Linus Torvalds
  2006-12-28  5:55               ` Chen, Kenneth W
                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 45+ messages in thread
From: Gordon Farquharson @ 2006-12-28  4:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List

On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote:

> [ Modified test-program that tells you where the corruption happens (and
>   when the missing parts were supposed to be written out) appended, in
>   case people care. ]

For the record, this is the output from a run on our ARM machine (32
MB RAM) with 2.6.18 + the following patches:

   mm: tracking shared dirty pages
   mm: balance dirty pages
   mm: optimize the new mprotect() code a bit
   mm: small cleanup of install_page()
   mm: fixup do_wp_page()
   mm: msync() cleanup

It is at all suprising that the second offset within a page can be
less than the first offset within a page ? e.g.

Chunk 260 corrupted (1-1455)  (2769-127)

$ ./linus-test
Writing chunk 279/280 (99%)
Chunk 256 corrupted (1-1455)  (1025-2479)
Expected 0, got 1
Written as (82)175(56)
Chunk 258 corrupted (1-1455)  (3945-1303)
Expected 2, got 3
Written as (56)51(20)
Chunk 260 corrupted (1-1455)  (2769-127)
Expected 4, got 5
Written as (20)30(18)
Chunk 262 corrupted (1-1455)  (1593-3047)
Expected 6, got 7
Written as (18)196(158)
Chunk 264 corrupted (1-1455)  (417-1871)
Expected 8, got 9
Written as (158)133(146)
Chunk 266 corrupted (1-1455)  (3337-695)
Expected 10, got 11
Written as (146)43(77)
Chunk 268 corrupted (1-1455)  (2161-3615)
Expected 12, got 13
Written as (77)251(211)
Chunk 270 corrupted (1-1455)  (985-2439)
Expected 14, got 15
Written as (211)257(231)
Chunk 272 corrupted (1-1455)  (3905-1263)
Expected 16, got 17
Written as (231)254(154)
Chunk 274 corrupted (1-1455)  (2729-87)
Expected 18, got 19
Written as (154)11(85)
Chunk 276 corrupted (1-1455)  (1553-3007)
Expected 20, got 21
Written as (85)230(134)
Chunk 278 corrupted (1-1455)  (377-1831)
Expected 22, got 23
Written as (134)233(103)
Checking chunk 279/280 (99%)

Gordon

-- 
Gordon Farquharson

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  0:52           ` David Miller
@ 2006-12-28  3:04             ` Linus Torvalds
  2006-12-28  4:32               ` Gordon Farquharson
                                 ` (4 more replies)
  0 siblings, 5 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28  3:04 UTC (permalink / raw)
  To: David Miller
  Cc: ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List



On Wed, 27 Dec 2006, David Miller wrote:
> > 
> > I still don't see _why_, though. But maybe smarter people than me can see 
> > it..
> 
> FWIW this program definitely triggers the bug for me.

Ok, now that I have something simple to do repeatable stuff with, I can 
say what the pattern is.. It's not all that surprising, but it's still 
worth just stating for the record.

What happens is that when I do the "packetized writes" in random order, 
the _last_ write to a page occasionally just goes missing. It's not always 
at the end of a page, as shown by for example:

 - A whole chunk got dropped:

	Chunk 2094 corrupted (0-1459)  (1624-3083)
	Expected 46, got 0
	Written as (30912)55414(10000)

   That "Written as (x)y(z)" line means that the corrupted chunk was 
   written as chunk #y, and the preceding and following chunks (that were 
   _not_ corrupt) on the page was written as #x and #z respectively.

   In other words, the missing chunk (which is still zero) was written 
   much later than the ones that were ok, and never hit the disk. It's a 
   contiguous chunk in the middle of the page (chunks are 1460 bytes in 
   size)

   The first line means that all bytes of the chunk (0-1459) were 
   corrupted, and the values in parenthesis are the offsets within a page. 
   In other words, this was a chunk in the _middle_ of a page.

 - The missing data can also be at the beginning or ends of pages:

   Beginning of the chunk missing, it was at the end of a page (page 
   offsets 3288-4095) and the _next_ page got written out fine:

	Chunk 2126 corrupted (0-807)  (3288-4095)
	Expected 78, got 0
	Written as (32713)55573(14301)

   End of a chunk missing, it was the beginning of a page (and the 
   _previous_ page that contained the beginning of the chunk was written 
   out fine)

	Chunk 2179 corrupted (1252-1459)  (0-207)
	Expected 131, got 0
	Written as (45189)55489(15515)

Now, the reason I say this isn't surprising is that this is entirely 
consistent with the dirty bit being dropped on the floor somewhere, and 
likely through some interaction with the previous changes being in the 
process of being written out.

Something (incorrectly) ends up deciding that it doesn't need to write the 
page, since it's already written, or alternatively clears the dirty bit 
too late (clears it because an _earlier_ write finished, never mind that 
the new dirty data didn't make it).

I also figured out that it's not the low-memory situation that does it, it 
really must be the "page_mkclean()" triggering. Becuase I can do

	echo 5 > /proc/sys/vm/dirty_ratio
	echo 3 > /proc/sys/vm/dirty_background_ratio

to make it clean the pages much more aggressively than the default, and I 
can see corruption on my 256MB machine with just a 40MB shared file, and 
70MB of memory consistently free.

So this thing is definitely giving some answers. It's NOT about low 
memory, and it very much seems to be about the whole "balance_dirty_ratio" 
thing. I don't think I triggered the actual low-memory stuff once in that 
situation..

So I have some more data on the behaviour, but I _still_ don't see the 
reason behind it. It's probably something really obvious once it's pointed 
out..

[ Modified test-program that tells you where the corruption happens (and 
  when the missing parts were supposed to be written out) appended, in 
  case people care. ]

			Linus

---
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (100 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

#define page_offset(buf, off) (0xfff & ((unsigned)(unsigned long)(buf)+(off)))

static int chunkorder[NRCHUNKS];

static int order(int nr)
{
	int i;
	if (nr < 0 || nr >= NRCHUNKS)
		return -1;
	for (i = 0; i < NRCHUNKS; i++)
		if (chunkorder[i] == nr)
			return i;
	return -2;
}

static void checkmem(void *buf, int nr)
{
	unsigned int start = ~0u, end = 0;
	unsigned char c = nr, *p = buf, differs = 0;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		unsigned char got = *p++;
		if (got != c) {
			if (i < start)
				start = i;
			if (i > end)
				end = i;
			differs = got;
		}
	}
	if (start < end) {
		printf("Chunk %d corrupted (%u-%u)  (%u-%u)            \n", nr, start, end,
			page_offset(buf, start), page_offset(buf, end));
		printf("Expected %u, got %u\n", c, differs);
		printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1));
	}
}

static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
	char *mapping;
	int fd, i;

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%)     \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);

	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%)     \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");
	
	return 0;
}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  0:39         ` Linus Torvalds
@ 2006-12-28  0:52           ` David Miller
  2006-12-28  3:04             ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: David Miller @ 2006-12-28  0:52 UTC (permalink / raw)
  To: torvalds
  Cc: ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa, akpm,
	hugh, nickpiggin, arjan, linux-kernel

From: Linus Torvalds <torvalds@osdl.org>
Date: Wed, 27 Dec 2006 16:39:43 -0800 (PST)

> 
> 
> On Wed, 27 Dec 2006, Linus Torvalds wrote:
> > 
> > I think the test-case could probably be improved by having a munmap() and 
> > page-cache flush in between the writing and the checking, to see whether 
> > that shows the corruption easier (and possibly without having to start 
> > paging in order to throw the pages out, which would simplify testing a 
> > lot).
> 
> I think the page-writeout is implicated, because I do seem to need it, but 
> the page-cache flush does seem to make corruption _easier_ to see. I now 
> seem about to trigger it with a 100MB file on a 256MB machine in a minute 
> or so, with this slight modification.
> 
> I still don't see _why_, though. But maybe smarter people than me can see 
> it..

FWIW this program definitely triggers the bug for me.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  0:42   ` Linus Torvalds
@ 2006-12-28  0:52     ` David Miller
  0 siblings, 0 replies; 45+ messages in thread
From: David Miller @ 2006-12-28  0:52 UTC (permalink / raw)
  To: torvalds
  Cc: schwidefsky, a.p.zijlstra, tbm, hugh, nickpiggin, arjan,
	andrei.popa, akpm, linux-kernel, fw, mh+linux-kernel,
	heiko.carstens, arnd.bergmann, gordonfarquharson

From: Linus Torvalds <torvalds@osdl.org>
Date: Wed, 27 Dec 2006 16:42:40 -0800 (PST)

> That's fine. In that situation, you shouldn't need any atomic ops at all, 
> I think all our sw page-table operations are already done under the pte 
> lock. 

This is true, but there is one subtlety to this I want to
point out in passing.

That lock can possibly only protect a page of PTEs.

When NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS, the locking is done per page
of PTEs, not for all of the page tables of an address space at once.

What this means is that it's really difficult to forcefully block out
all page table operations for a given mm, and I actually needed to do
something like this on sparc64 (when growing the TLB lookup hash
table, you can't let any PTEs change state while the table is
changing).  For my case, I added a spinlock to mm->context since
actually what I need is to block modifications to the hash table
itself during PTE changes.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28  0:16       ` Linus Torvalds
@ 2006-12-28  0:39         ` Linus Torvalds
  2006-12-28  0:52           ` David Miller
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28  0:39 UTC (permalink / raw)
  To: David Miller
  Cc: ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List



On Wed, 27 Dec 2006, Linus Torvalds wrote:
> 
> I think the test-case could probably be improved by having a munmap() and 
> page-cache flush in between the writing and the checking, to see whether 
> that shows the corruption easier (and possibly without having to start 
> paging in order to throw the pages out, which would simplify testing a 
> lot).

I think the page-writeout is implicated, because I do seem to need it, but 
the page-cache flush does seem to make corruption _easier_ to see. I now 
seem about to trigger it with a 100MB file on a 256MB machine in a minute 
or so, with this slight modification.

I still don't see _why_, though. But maybe smarter people than me can see 
it..

		Linus

---
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (100 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

static void checkmem(void *start, int nr)
{
	unsigned char c = nr, *p = start;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		if (*p++ != c) {
			printf("Chunk %d corrupted           \n", nr);
			return;
		}
	}
}

static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
	char *mapping;
	int fd, i;
	static int chunkorder[NRCHUNKS];

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%)     \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);

	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%)     \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");
	
	return 0;
}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-27  4:55     ` [PATCH] mm: fix page_mkclean_one David Miller
  2006-12-27  7:00       ` Linus Torvalds
@ 2006-12-28  0:16       ` Linus Torvalds
  2006-12-28  0:39         ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28  0:16 UTC (permalink / raw)
  To: David Miller
  Cc: ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa,
	Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List


On Tue, 26 Dec 2006, David Miller wrote:
> 
> I've seen it on sparc64, UP kernel, no preempt.

Ok, I still don't have a clue, but I think I at least have a new 
test-case.

It can probably be improved upon, but this would _seem_ to trigger the 
problem. Can people check?

You'd want to make sure you get page-put activity, by making TARGETSIZE be 
big enough to cause memory pressure (and rather than making it bigger, you 
might want to make your memory smaller instead, to make it run more 
quickly. Either using "mem=128M" or big compiles or something...).

If it finds corruption, you'll see something like

	Writing chunk 183858/183859 (99%)
	Chunk ..
	Chunk 120887 corrupted
	Chunk 122372 corrupted
	Chunk ...
	Checking chunk 183858/183859 (99%)

otherwise it will just say

	Writing chunk 183858/183859 (99%)
	Checking chunk 183858/183859 (99%)

and exit.

I didn't spend a lot of time verifying this, but I _was_ able to cause 
those "Chunk xxx corrupted" messages with this. There's probably a more 
efficient better way to do it, but this is better than trying to use 
rtorrent, and also makes any worries about what rtorrent does go away.

Of course, maybe it's this test-program that is buggy now, although it 
looks trivial enough that I don't think it is.

I think my earlier stress-tester may not have triggered this, because it 
just did all its writing in a linear order, so any LRU logic will happen 
to write back old pages that we are no longer touching. The randomization 
(and using a chunksize that isn't a multiple of a page-size) makes sure 
that we're actually going to have lots of rewriting going on.

I think the test-case could probably be improved by having a munmap() and 
page-cache flush in between the writing and the checking, to see whether 
that shows the corruption easier (and possibly without having to start 
paging in order to throw the pages out, which would simplify testing a 
lot). But I haven't tested. I decided to post this asap, now that I've 
recreated the corruption with something else, and something that is 
possibly easier to analyze..

		Linus

----
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (256 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

static void checkmem(void *start, int nr)
{
	unsigned char c = nr, *p = start;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		if (*p++ != c) {
			printf("Chunk %d corrupted               \n", nr);
			return;
		}
	}
}

int main(int argc, char **argv)
{
	char *mapping;
	int fd, i;
	static int chunkorder[NRCHUNKS];

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%)     \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%)     \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");
	
	return 0;
}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-27  7:00       ` Linus Torvalds
@ 2006-12-27  8:39         ` Andrei Popa
  0 siblings, 0 replies; 45+ messages in thread
From: Andrei Popa @ 2006-12-27  8:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, ranma, gordonfarquharson, tbm, a.p.zijlstra, akpm,
	hugh, nickpiggin, arjan, linux-kernel

I have corrupted files...

> ---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 263f88e..4652ef1 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
>  	do {
>  		if (!buffer_mapped(bh))
>  			continue;
> -		/*
> -		 * If it's a fully non-blocking write attempt and we cannot
> -		 * lock the buffer then redirty the page.  Note that this can
> -		 * potentially cause a busy-wait loop from pdflush and kswapd
> -		 * activity, but those code paths have their own higher-level
> -		 * throttling.
> -		 */
> -		if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
> -			lock_buffer(bh);
> -		} else if (test_set_buffer_locked(bh)) {
> -			redirty_page_for_writepage(wbc, page);
> -			continue;
> -		}
> +		lock_buffer(bh);
>  		if (test_clear_buffer_dirty(bh)) {
>  			mark_buffer_async_write(bh);
>  		} else {


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-27  4:55     ` [PATCH] mm: fix page_mkclean_one David Miller
@ 2006-12-27  7:00       ` Linus Torvalds
  2006-12-27  8:39         ` Andrei Popa
  2006-12-28  0:16       ` Linus Torvalds
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2006-12-27  7:00 UTC (permalink / raw)
  To: David Miller
  Cc: ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa, akpm,
	hugh, nickpiggin, arjan, linux-kernel



On Tue, 26 Dec 2006, David Miller wrote:
> 
> I've seen it on sparc64, UP kernel, no preempt.

Btw, having tried to debug the writeback code, there's one very special 
case that just makes me go "hmm".

If we have a buffer that is "busy" when we try to write back a page, we 
have this magic "wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking" mode, 
where we won't wait for it, but instead we'll redirty the page and redo 
the whole thing.

Looking at the code, that should all work, but at the same time, it 
triggers some of my debug messages about having a dirty page during 
writeback, and one way to trigger that debug message is to try to run 
rtorrent on the machine.. 

I dunno. Witht he writeback being suspicious, and the normal 
"block_write_full_page()" path being implicated in at least ext2, I just 
wonder. This is one of those "let's see if behaviour changes" patches, 
that I'm just throwing out there..

		Linus

---
diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..4652ef1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	do {
 		if (!buffer_mapped(bh))
 			continue;
-		/*
-		 * If it's a fully non-blocking write attempt and we cannot
-		 * lock the buffer then redirty the page.  Note that this can
-		 * potentially cause a busy-wait loop from pdflush and kswapd
-		 * activity, but those code paths have their own higher-level
-		 * throttling.
-		 */
-		if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
-			lock_buffer(bh);
-		} else if (test_set_buffer_locked(bh)) {
-			redirty_page_for_writepage(wbc, page);
-			continue;
-		}
+		lock_buffer(bh);
 		if (test_clear_buffer_dirty(bh)) {
 			mark_buffer_async_write(bh);
 		} else {

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-26 16:17   ` Tobias Diedrich
@ 2006-12-27  4:55     ` David Miller
  2006-12-27  7:00       ` Linus Torvalds
  2006-12-28  0:16       ` Linus Torvalds
  0 siblings, 2 replies; 45+ messages in thread
From: David Miller @ 2006-12-27  4:55 UTC (permalink / raw)
  To: ranma
  Cc: torvalds, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa,
	akpm, hugh, nickpiggin, arjan, linux-kernel

From: Tobias Diedrich <ranma@tdiedrich.de>
Date: Tue, 26 Dec 2006 17:17:00 +0100

> Linus Torvalds wrote:
> > I don't think it's a page table issue any more, it just doesn't look 
> > likely with the ARM UP corruption. It's also not apparently even on a 
> > cacheline boundary, so it probably is really a dirty bit that got cleared 
> > wrogn due to some race with IO.
> 
> So, until now it's only been reported for SMP on i386?
> I'm seeing the issue on my Pentium-M Notebook (Thinkpad R52) over
> here, UP kernel, no preempt.

I've seen it on sparc64, UP kernel, no preempt.

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2007-02-02 13:09 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-01 22:40 [PATCH] mm: fix page_mkclean_one Mark Groves
2007-02-02  8:42 ` Nick Piggin
2007-02-02 13:08   ` Evgeniy Polyakov
  -- strict thread matches above, loose matches on Subject: below --
2006-12-28 15:50 martin
2006-12-24  8:10 [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Gordon Farquharson
2006-12-24  8:43 ` Linus Torvalds
2006-12-26 16:17   ` Tobias Diedrich
2006-12-27  4:55     ` [PATCH] mm: fix page_mkclean_one David Miller
2006-12-27  7:00       ` Linus Torvalds
2006-12-27  8:39         ` Andrei Popa
2006-12-28  0:16       ` Linus Torvalds
2006-12-28  0:39         ` Linus Torvalds
2006-12-28  0:52           ` David Miller
2006-12-28  3:04             ` Linus Torvalds
2006-12-28  4:32               ` Gordon Farquharson
2006-12-28  4:53                 ` Linus Torvalds
2006-12-28  5:20                   ` Gordon Farquharson
2006-12-28  5:41                     ` David Miller
2006-12-28  5:47                       ` Gordon Farquharson
2006-12-28 10:13                     ` Russell King
2006-12-28 14:15                       ` Gordon Farquharson
2006-12-28 15:53                         ` Martin Michlmayr
2006-12-28 17:27                       ` Linus Torvalds
2006-12-28 18:44                         ` Russell King
2006-12-28 19:01                           ` Linus Torvalds
     [not found]                   ` <97a0a9ac0612272115g4cce1f08n3c3c8498a6076bd5@mail.gmail.com>
     [not found]                     ` <Pine.LNX.4.64.0612272120180.4473@woody.osdl.org>
2006-12-28  5:38                       ` Gordon Farquharson
2006-12-28  9:30                         ` Martin Michlmayr
2006-12-28 10:16                         ` Martin Michlmayr
2006-12-28 10:49                           ` Russell King
2006-12-28 14:56                             ` Martin Michlmayr
2006-12-28  5:58                       ` Gordon Farquharson
2006-12-28 17:08                         ` Linus Torvalds
2006-12-28  5:55               ` Chen, Kenneth W
2006-12-28  6:10                 ` Chen, Kenneth W
2006-12-28  6:27                   ` David Miller
2006-12-28 17:10                   ` Linus Torvalds
2006-12-28  9:15               ` Zhang, Yanmin
2006-12-28 17:15                 ` Linus Torvalds
2006-12-28 11:50               ` Petri Kaukasoina
2006-12-28 15:09               ` Guillaume Chazarain
2006-12-28 19:19                 ` Guillaume Chazarain
2006-12-28 19:28                   ` Linus Torvalds
2006-12-28 19:45                     ` Andrew Morton
2006-12-28 20:14                       ` Linus Torvalds
2006-12-28 22:38                         ` David Miller
2006-12-29  2:50                           ` Segher Boessenkool
2006-12-29  6:48                             ` Linus Torvalds
2006-12-28 22:35                       ` Mike Galbraith
2006-12-21 20:01 [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Linus Torvalds
2006-12-28  0:00 ` Martin Schwidefsky
2006-12-28  0:42   ` Linus Torvalds
2006-12-28  0:52     ` [PATCH] mm: fix page_mkclean_one David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).