LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* Re: [PATCH] mm: fix page_mkclean_one
@ 2007-02-01 22:40 Mark Groves
2007-02-02 8:42 ` Nick Piggin
0 siblings, 1 reply; 45+ messages in thread
From: Mark Groves @ 2007-02-01 22:40 UTC (permalink / raw)
To: linux-kernel
Hi,
I have been been seeing a problem when using sendfile repeatedly on an
SMP server, which I believe is related to the problem that was
discovered recently with marking dirty pages. The bug, as well as a test
script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650.
Currently, we're experiencing errors where part of a previous packet is
being sent out rather than the current packet.
I have applied the patch Linus posted to a 2.6.19 kernel but am still
getting the problem. So I am wondering if there are any other places in
the kernel which mark pages as dirty which might require a similar
patch?
Regards,
Mark Groves
Researcher
University of Waterloo
Waterloo, Ontario, Canada
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2007-02-01 22:40 [PATCH] mm: fix page_mkclean_one Mark Groves
@ 2007-02-02 8:42 ` Nick Piggin
2007-02-02 13:08 ` Evgeniy Polyakov
0 siblings, 1 reply; 45+ messages in thread
From: Nick Piggin @ 2007-02-02 8:42 UTC (permalink / raw)
To: Mark Groves; +Cc: linux-kernel, netdev
Mark Groves wrote:
> Hi,
>
>
> I have been been seeing a problem when using sendfile repeatedly on an
> SMP server, which I believe is related to the problem that was
> discovered recently with marking dirty pages. The bug, as well as a test
> script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650.
> Currently, we're experiencing errors where part of a previous packet is
> being sent out rather than the current packet.
>
> I have applied the patch Linus posted to a 2.6.19 kernel but am still
> getting the problem. So I am wondering if there are any other places in
> the kernel which mark pages as dirty which might require a similar
> patch?
Your issue is not related, firstly because the page_mkclean bug did not
exist before 2.6.19 kernels.
Anyway, I had a look at your bugzilla test-case and managed to slim it
down to something that easily shows what the problem is (available on
request) -- the problem is that recipient of the sendfile is seeing
modifications that occur to the source file _after_ the sender has
completed the sendfile, because the file pages are not copied but
queued.
I think the usual approach to what you are trying to do is to set TCP_CORK,
then write(2) the header into the socket, then sendfile directly from the
file you want.
Another approach I guess is to implement an ack in your userland protocol
so you do not modify the sendfile source file until the client acks that
it has all the data.
I'm not sure if there are any other usual ways to do this (ie. a barrier
for sendfile, to ensure it will not pick up "future" modifications to the
file). netdev cc'ed, someone there might have additional comments.
Please close this bug if/when you are satisfied it is not a kernel problem.
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2007-02-02 8:42 ` Nick Piggin
@ 2007-02-02 13:08 ` Evgeniy Polyakov
0 siblings, 0 replies; 45+ messages in thread
From: Evgeniy Polyakov @ 2007-02-02 13:08 UTC (permalink / raw)
To: Nick Piggin; +Cc: Mark Groves, linux-kernel, netdev
On Fri, Feb 02, 2007 at 07:42:52PM +1100, Nick Piggin (nickpiggin@yahoo.com.au) wrote:
> Anyway, I had a look at your bugzilla test-case and managed to slim it
> down to something that easily shows what the problem is (available on
> request) -- the problem is that recipient of the sendfile is seeing
> modifications that occur to the source file _after_ the sender has
> completed the sendfile, because the file pages are not copied but
> queued.
>
> I think the usual approach to what you are trying to do is to set TCP_CORK,
> then write(2) the header into the socket, then sendfile directly from the
> file you want.
>
> Another approach I guess is to implement an ack in your userland protocol
> so you do not modify the sendfile source file until the client acks that
> it has all the data.
Mark, don't you use e1000 or other scatter-gather capable nic with
checksum offload? Likely yes.
Actual data sucking in that case happens when packet is supposed to be
transmitted by the NIC, not when sendfile() is returned. The same
applies to the case, when you have fancy egress filtering.
It is not allowed to modify pages until they are really transmitted, if
you want data integrity.
There are _no_ bugs in network or VFS cache in this test case.
> I'm not sure if there are any other usual ways to do this (ie. a barrier
> for sendfile, to ensure it will not pick up "future" modifications to the
> file). netdev cc'ed, someone there might have additional comments.
>
> Please close this bug if/when you are satisfied it is not a kernel problem.
>
> Thanks,
> Nick
>
> --
> SUSE Labs, Novell Inc.
> Send instant messages to your online friends http://au.messenger.yahoo.com
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-29 2:50 ` Segher Boessenkool
@ 2006-12-29 6:48 ` Linus Torvalds
0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-29 6:48 UTC (permalink / raw)
To: Segher Boessenkool
Cc: David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh,
linux-kernel, ranma, gordonfarquharson, akpm, a.p.zijlstra, tbm,
arjan, andrei.popa
On Fri, 29 Dec 2006, Segher Boessenkool wrote:
>
> > I think what might be happening is that pdflush writes them out fine,
> > however we don't trap writes by the application _during_ that writeout.
>
> Yeah. I believe that more exactly it happens if the very last
> write to the page causes a writeback (due to dirty balancing)
> while another writeback for the page is already happening.
>
> As usual in these cases, I have zero proof.
I actually have proof to the contrary, ie I have traces that say "the
write was started" after the last write.
And the VM layer in this area is actually fairly sane and civilized. It
has a bit that says "writeback in progress", and if that bit is set, it
simply _will_not_ start a new write. It even has various BUG_ON()'s to
that effect.
So everything I have ever seen says that the VM layer is actually doing
everything right.
> It's the do_wp_page -> balance_dirty_pages -> generic_writepages
> path for sure. Maybe it's enough to change
>
> if (wbc->sync_mode != WB_SYNC_NONE)
> wait_on_page_writeback(page);
>
> if (PageWriteback(page) ||
> !clear_page_dirty_for_io(page)) {
> unlock_page(page);
> continue;
> }
Notive how this one basically says:
- if it's under writeback, don't even clear the page dirty flag.
Your suggested change:
> if (wbc->sync_mode != WB_SYNC_NONE)
> wait_on_page_writeback(page);
>
> if (PageWriteback(page)) {
> redirty_page_for_writepage(wbc, page);
makes no sense, because we simply never _did_ the "clear_page_dirty()" if
the thing was under writeback in the first place. That's how C
conditionals work. So there's no reason to "redirty" it, because it
wasn't cleaned in the first place.
I've double- and triple-checked the dirty bits, including having traces
that actually say that the IO was started (from a VM perspective) _after_
the last write was done. The IO just didn't hit the disk.
I'm personally fairly convinced that it's not a VM issue, but a "IO
issue". Either in a low-level filesystem or in some of the fs/buffer.c
helper routines.
But I'd love to be proven wrong.
I do have a few interesting details from the trace I haven't really
analyzed yet. Here's the trace for events on one of the pages that was
corrupted. Note how the events are numbered (there were 171640 events
total), so the thing you see is just a small set of events from the whole
big trace, but it's the ones that talk about _that_ particular page.
I've grouped them so hat "consecutive" events group together. That just
means that no events on any other pages happened in between those events,
and it is usually a sign that it's really one single call-chain that
causes all the events.
For example, for the first group of three events (44366-44368), it's the
page fault that brings in the page, and since it's a write-fault, it will
not only map the page, it will mark the page itself dirty and then also
set the TAG_DIRTY on the mapping. So the "group" is just really a result
of one single event happening, which causes several things to happen to
that page. That's exactly what you'd expect.
Anyway, here is the list of events that page went through:
44366 PG 00000f6d: mm/memory.c:2254 mapping at b789fc54 (write)
44367 PG 00000f6d: mm/page-writeback.c:817 setting dirty
44368 PG 00000f6d: fs/buffer.c:738 setting TAG_DIRTY
64231 PG 00000f6d: mm/page-writeback.c:872 clean_for_io
64232 PG 00000f6d: mm/rmap.c:451 cleaning PTE b789f000
64233 PG 00000f6d: mm/page-writeback.c:914 set writeback
64234 PG 00000f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
64235 PG 00000f6d: mm/page-writeback.c:922 clearing TAG_DIRTY
67570 PG 00000f6d: mm/page-writeback.c:891 end writeback
67571 PG 00000f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK
76705 PG 00000f6d: mm/page-writeback.c:817 setting dirty
76706 PG 00000f6d: fs/buffer.c:725 dirtied buffers
76707 PG 00000f6d: fs/buffer.c:738 setting TAG_DIRTY
105267 PG 00000f6d: mm/page-writeback.c:872 clean_for_io
105268 PG 00000f6d: mm/rmap.c:451 cleaning PTE b789f000
105269 PG 00000f6d: mm/page-writeback.c:914 set writeback
105270 PG 00000f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
105271 PG 00000f6d: mm/page-writeback.c:922 clearing TAG_DIRTY
105272 PG 00000f6d: mm/page-writeback.c:891 end writeback
105273 PG 00000f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK
128032 PG 00000f6d: mm/memory.c:670 unmapped at b789f000
132662 PG 00000f6d: mm/filemap.c:119 Removing page cache
148278 PG 00000f6d: mm/memory.c:2254 mapping at b789f000 (read)
166326 PG 00000f6d: mm/memory.c:670 unmapped at b789f000
171958 PG 00000f6d: mm/filemap.c:119 Removing page cache
And notice that big grouping of seven events (105267-105273). The five
first events really _do_ make sense together: it's our page cleaning that
happens. But notice how the "end writeback" happens _immediately_.
Here's another page cleaning event for the page that preceded that page,
and did _not_ get corrupted:
105262 PG 00000f6c: mm/page-writeback.c:872 clean_for_io
105263 PG 00000f6c: mm/rmap.c:451 cleaning PTE b789e000
105264 PG 00000f6c: mm/page-writeback.c:914 set writeback
105265 PG 00000f6c: mm/page-writeback.c:916 setting TAG_WRITEBACK
105266 PG 00000f6c: mm/page-writeback.c:922 clearing TAG_DIRTY
108437 PG 00000f6c: mm/page-writeback.c:891 end writeback
108438 PG 00000f6c: mm/page-writeback.c:893 clearing TAG_WRITEBACK
and this looks a lot more like what you'd expect: other thngs happened in
between the "clear dirty, set writeback" stage and the "end writeback"
stage. That's what you'd expect to see if there was actually overlapping
IO and/or work.
(And notice that that _was_ what you saw even for the corrupted page for
the _first_ writeback: you saw the group-of-five that indicated a page
cleaning event had started, and then a group-of-two to indicate that the
writeback finished).
So I find this kind of pattern really suspicious. We have a missing
writeout, and my traces show (I didn't analyze this _particular_ one
closely, but I did the previous trace for another page that I posted) that
the writeback was actually started after the write that went missing was
done. AND I have this trace that seems to show that the writeback
basically completed immediately, with no other work in between.
That to me says: "somebody didn't actually write out out". The VM layer
asked the filesystem to do the write, but the filesystem just didn't do
it. I personally think it's because some buffer-head BH_dirty bit got
scrogged, but it could be some event that makes the filesystem simply not
do the IO because it thinks the "disk queues are too full", so it just
says "IO completed", without actually doing anything at all.
Now, the fact that it apparently happens for all of ext2, ext3
and reiserfs (but NOT apparently with "data=writeback"), makes me suspect
that there is some common interaction, and that it's somehow BH-related
(they all share much of the buffer head infrastructure). So it doesn't
look like it's just a bug in one random filesystem, I think it's a bug in
some buffer-head infrastructure/helper function.
So I don't think it's "core VM". I don't think it's the "page cache". I
think we handle the dirty state correctly at that level.
It looks more like "buffer cache" or "filesystem" to me by now.
(Btw, don't get me wrong - the above sequence numbers are in no way
*proof* of anything. You could get big groups for one page just because
something ended up being synchronous. I'll add some timestamps to my
traces to make it easier to see where there was real IO going on and where
there wasn't).
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 22:38 ` David Miller
@ 2006-12-29 2:50 ` Segher Boessenkool
2006-12-29 6:48 ` Linus Torvalds
0 siblings, 1 reply; 45+ messages in thread
From: Segher Boessenkool @ 2006-12-29 2:50 UTC (permalink / raw)
To: David Miller
Cc: nickpiggin, kenneth.w.chen, guichaz, hugh, linux-kernel, ranma,
torvalds, gordonfarquharson, akpm, a.p.zijlstra, tbm, arjan,
andrei.popa
> I think what might be happening is that pdflush writes them out fine,
> however we don't trap writes by the application _during_ that writeout.
Yeah. I believe that more exactly it happens if the very last
write to the page causes a writeback (due to dirty balancing)
while another writeback for the page is already happening.
As usual in these cases, I have zero proof.
> It's something that will only occur with writeback and MAP_SHARED
> writable access to the file pages.
It's the do_wp_page -> balance_dirty_pages -> generic_writepages
path for sure. Maybe it's enough to change
if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);
if (PageWriteback(page) ||
!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}
to
if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);
if (PageWriteback(page)) {
redirty_page_for_writepage(wbc, page);
unlock_page(page);
continue;
}
if (!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}
or something along those lines. Completely untested of course :-)
Segher
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 20:14 ` Linus Torvalds
@ 2006-12-28 22:38 ` David Miller
2006-12-29 2:50 ` Segher Boessenkool
0 siblings, 1 reply; 45+ messages in thread
From: David Miller @ 2006-12-28 22:38 UTC (permalink / raw)
To: torvalds
Cc: akpm, guichaz, ranma, gordonfarquharson, tbm, a.p.zijlstra,
andrei.popa, hugh, nickpiggin, arjan, linux-kernel,
kenneth.w.chen
From: Linus Torvalds <torvalds@osdl.org>
Date: Thu, 28 Dec 2006 12:14:31 -0800 (PST)
> I get corruption - but the whole point is that it's very much pdflush that
> should be writing these pages out.
I think what might be happening is that pdflush writes them out fine,
however we don't trap writes by the application _during_ that writeout.
These corruptions look exactly as if:
1) pdflush begins writeback of page X
2) page goes to disk
3) application writes a chunk to the page
4) pdflush et al. think the page is clean, so it gets tossed, losing
the writes done in #3
So there's a missing PTE change in there, so that we never get proper
re-dirtying of the page if the application tries to write to the page
during the writeback.
It's something that will only occur with writeback and MAP_SHARED
writable access to the file pages. That's why we never see this
with normal filesystem writes, since those explicitly manage the
page dirty state.
I think the dirty balancing logic etc. isn't where the problems are,
to me it's a PTE state update issue for sure.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 19:45 ` Andrew Morton
2006-12-28 20:14 ` Linus Torvalds
@ 2006-12-28 22:35 ` Mike Galbraith
1 sibling, 0 replies; 45+ messages in thread
From: Mike Galbraith @ 2006-12-28 22:35 UTC (permalink / raw)
To: Andrew Morton
Cc: Linus Torvalds, Guillaume Chazarain, David Miller, ranma,
gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, hugh,
nickpiggin, arjan, Linux Kernel Mailing List, Chen Kenneth W
On Thu, 2006-12-28 at 11:45 -0800, Andrew Morton wrote:
> On Thu, 28 Dec 2006 11:28:52 -0800 (PST)
> Linus Torvalds <torvalds@osdl.org> wrote:
>
> >
> >
> > On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> > >
> > > The attached patch fixes the corruption for me.
> >
> > Well, that's a good hint, but it's really just a symptom. You effectively
> > just made the test-program not even try to flush the data to disk, so the
> > page cache would stay in memory, and you'd not see the corruption as well.
> >
> > So you basically disabled the code that tried to trigger the bug more
> > easily.
> >
> > But the reason I say it's interesting is that "WB_SYNC_NONE" is very much
> > implicated in mm/page-writeback.c, and if there is a bug triggered by
> > WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also
> > fails..
> >
>
> It would be interesting to convert your app to do fsync() before
> FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well
> (apart from pdflush activity).
I did fdatasync(), tried remapping before unmapping... nogo here.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 19:45 ` Andrew Morton
@ 2006-12-28 20:14 ` Linus Torvalds
2006-12-28 22:38 ` David Miller
2006-12-28 22:35 ` Mike Galbraith
1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 20:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Guillaume Chazarain, David Miller, ranma, gordonfarquharson, tbm,
Peter Zijlstra, andrei.popa, hugh, nickpiggin, arjan,
Linux Kernel Mailing List, Chen Kenneth W
[-- Attachment #1: Type: TEXT/PLAIN, Size: 2468 bytes --]
On Thu, 28 Dec 2006, Andrew Morton wrote:
>
> It would be interesting to convert your app to do fsync() before
> FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well
> (apart from pdflush activity).
I get corruption - but the whole point is that it's very much pdflush that
should be writing these pages out.
Andrew - give my test-program a try. It can run in about 1 minute if you
have a 256MB machine (I didn't, but "mem=256M" is my friend), and it seems
to very consistently cause corruption.
What I do is:
# Make sure we write back aggressively
echo 5 > /proc/sys/vm/dirty_ratio
as root, and then just run the thing. Tons of corruption. But the
corruption goes away if I just leave the default dirty ratio alone (but
then I can increse the file size to trigger it, of course - but that
also makes the test run a lot slower).
Now, with a pre-2.6.19 kernel, I bet you won't get the corruption as
easily (at least with the "fsync()"), but that's less to do with anything
new, and probably just because then you simply won't have any pdflushing
going on - since the kernel won't even notice that you have tons of dirty
pages ;)
It might also depend on the speed of your disk drive - the machine I test
this on has a slow 4200 rpm laptop drive in it, and that probably makes
things go south more easily. That's _especially_ true if this is related
to any "bdi_write_congested()" logic.
Now, it could also be related to various code snippets like
...
if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);
if (PageWriteback(page) ||
!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}
...
where the WB_SYNC_NONE case will hit the "PageWriteback()" and just not do
the writeback at all (but it also won't clear the dirty bit, so it's
certainly not an *OBVIOUS* bug).
We also have code like this ("pageout()"):
if (clear_page_dirty_for_io(page)) {
int res;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
..
}
...
res = mapping->a_ops->writepage(page, &wbc);
and in this case, if the "WB_SYNC_NONE" means that the "writepage()" call
won't do anything at all because of congestion, then that would be a _bad_
thing, and would certainly explain how something didn't get written out.
But that particular path should only trigger for the "shrink_page_list()"
case, and it's not the case I seem to be testing with my "low dirty_ratio"
testing.
Linus
[-- Attachment #2: Type: TEXT/PLAIN, Size: 2872 bytes --]
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#define TARGETSIZE (22 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)
static void fillmem(void *start, int nr)
{
memset(start, nr, CHUNKSIZE);
}
#define page_offset(buf, off) (unsigned)((unsigned long)(buf)+(off)-(unsigned long)(mapping))
static int chunkorder[NRCHUNKS];
static char *mapping;
static int order(int nr)
{
int i;
if (nr < 0 || nr >= NRCHUNKS)
return -1;
for (i = 0; i < NRCHUNKS; i++)
if (chunkorder[i] == nr)
return i;
return -2;
}
static void checkmem(void *buf, int nr)
{
unsigned int start = ~0u, end = 0;
unsigned char c = nr, *p = buf, differs = 0;
int i;
for (i = 0; i < CHUNKSIZE; i++) {
unsigned char got = *p++;
if (got != c) {
if (i < start)
start = i;
if (i > end)
end = i;
differs = got;
}
}
if (start < end) {
printf("Chunk %d corrupted (%u-%u) (%x-%x) \n", nr, start, end,
page_offset(buf, start), page_offset(buf, end));
printf("Expected %u, got %u\n", c, differs);
printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1));
}
}
static char *remap(int fd, char *mapping)
{
if (mapping) {
munmap(mapping, SIZE);
// fsync(fd);
posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
}
return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
int main(int argc, char **argv)
{
int fd, i;
/*
* Make some random ordering of writing the chunks to the
* memory map..
*
* Start with fully ordered..
*/
for (i = 0; i < NRCHUNKS; i++)
chunkorder[i] = i;
/* ..and then mix it up randomly */
srandom(time(NULL));
for (i = 0; i < NRCHUNKS; i++) {
int index = (unsigned int) random() % NRCHUNKS;
int nr = chunkorder[index];
chunkorder[index] = chunkorder[i];
chunkorder[i] = nr;
}
fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
if (fd < 0)
return -1;
if (ftruncate(fd, SIZE) < 0)
return -1;
mapping = remap(fd, NULL);
if (-1 == (int)(long)mapping)
return -1;
for (i = 0; i < NRCHUNKS; i++) {
int chunk = chunkorder[i];
printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
fillmem(mapping + chunk * CHUNKSIZE, chunk);
}
printf("\n");
/* Unmap, drop, and remap.. */
mapping = remap(fd, mapping);
/* .. and check */
for (i = 0; i < NRCHUNKS; i++) {
int chunk = i;
printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
checkmem(mapping + chunk * CHUNKSIZE, chunk);
}
printf("\n");
/* Clean up for next time */
sleep(5);
sync();
sleep(5);
munmap(mapping, SIZE);
close(fd);
unlink("mapfile");
return 0;
}
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 19:28 ` Linus Torvalds
@ 2006-12-28 19:45 ` Andrew Morton
2006-12-28 20:14 ` Linus Torvalds
2006-12-28 22:35 ` Mike Galbraith
0 siblings, 2 replies; 45+ messages in thread
From: Andrew Morton @ 2006-12-28 19:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: Guillaume Chazarain, David Miller, ranma, gordonfarquharson, tbm,
Peter Zijlstra, andrei.popa, hugh, nickpiggin, arjan,
Linux Kernel Mailing List, Chen Kenneth W
On Thu, 28 Dec 2006 11:28:52 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> >
> > The attached patch fixes the corruption for me.
>
> Well, that's a good hint, but it's really just a symptom. You effectively
> just made the test-program not even try to flush the data to disk, so the
> page cache would stay in memory, and you'd not see the corruption as well.
>
> So you basically disabled the code that tried to trigger the bug more
> easily.
>
> But the reason I say it's interesting is that "WB_SYNC_NONE" is very much
> implicated in mm/page-writeback.c, and if there is a bug triggered by
> WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also
> fails..
>
It would be interesting to convert your app to do fsync() before
FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well
(apart from pdflush activity).
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 19:19 ` Guillaume Chazarain
@ 2006-12-28 19:28 ` Linus Torvalds
2006-12-28 19:45 ` Andrew Morton
0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 19:28 UTC (permalink / raw)
To: Guillaume Chazarain
Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List, Chen Kenneth W
On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
>
> The attached patch fixes the corruption for me.
Well, that's a good hint, but it's really just a symptom. You effectively
just made the test-program not even try to flush the data to disk, so the
page cache would stay in memory, and you'd not see the corruption as well.
So you basically disabled the code that tried to trigger the bug more
easily.
But the reason I say it's interesting is that "WB_SYNC_NONE" is very much
implicated in mm/page-writeback.c, and if there is a bug triggered by
WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also
fails..
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 15:09 ` Guillaume Chazarain
@ 2006-12-28 19:19 ` Guillaume Chazarain
2006-12-28 19:28 ` Linus Torvalds
0 siblings, 1 reply; 45+ messages in thread
From: Guillaume Chazarain @ 2006-12-28 19:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List, Chen Kenneth W
[-- Attachment #1: Type: text/plain, Size: 625 bytes --]
Guillaume Chazarain a écrit :
> I get this kind of corruption:
> http://guichaz.free.fr/linux-bug/corruption.png
Actually in qemu, I get three different behaviours:
- no corruption at all : with linux-2.4
- corruption only on the first chunks: before [PATCH] mm: balance dirty
pages as identified by Kenneth
- corruption of all chunks: after the balance dirty pages patch
Bisecting in linux-2.5 land I found
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.66/2.5.66-mm3/broken-out/fadvise-flush-data.patch
to cause the corruption for me.
The attached patch fixes the corruption for me.
--
Guillaume
[-- Attachment #2: fadvise-dontneed.patch --]
[-- Type: text/x-patch, Size: 492 bytes --]
diff -r 3859b1144d3a mm/fadvise.c
--- a/mm/fadvise.c Sun Dec 24 05:00:03 2006 +0000
+++ b/mm/fadvise.c Thu Dec 28 19:53:40 2006 +0100
@@ -96,9 +96,6 @@ asmlinkage long sys_fadvise64_64(int fd,
case POSIX_FADV_NOREUSE:
break;
case POSIX_FADV_DONTNEED:
- if (!bdi_write_congested(mapping->backing_dev_info))
- filemap_flush(mapping);
-
/* First and last FULL page! */
start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
end_index = (endbyte >> PAGE_CACHE_SHIFT);
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 18:44 ` Russell King
@ 2006-12-28 19:01 ` Linus Torvalds
0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 19:01 UTC (permalink / raw)
To: Russell King
Cc: Gordon Farquharson, David Miller, ranma, tbm, Peter Zijlstra,
andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Thu, 28 Dec 2006, Russell King wrote:
>
> Yup, but I have nothing to do with glibc because I refuse to do that
> silly copyright assignment FSF thing. Hopefully someone else can
> resolve it, but...
Yeah, me too.
> _is_ a fix whether _you_ like it or not to work around the issue so
> people can at least run your test program. I'm not saying it's a
> proper fix though.
My point was that it wasn't a "fix", it's a "workaround". The _fix_ would
be in glibc.
Nothing more..
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 17:27 ` Linus Torvalds
@ 2006-12-28 18:44 ` Russell King
2006-12-28 19:01 ` Linus Torvalds
0 siblings, 1 reply; 45+ messages in thread
From: Russell King @ 2006-12-28 18:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Gordon Farquharson, David Miller, ranma, tbm, Peter Zijlstra,
andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Thu, Dec 28, 2006 at 09:27:12AM -0800, Linus Torvalds wrote:
> On Thu, 28 Dec 2006, Russell King wrote:
> > and if you look at glibc's memset() function, you'll notice that's exactly
> > what you expect if you pass a non-8bit value to it. Ergo, what you're
> > seeing is utterly expected given glibc's memset() implementation on ARM.
>
> Guys, you _really_ should fix memset(). What you describe is a _bug_.
Yup, but I have nothing to do with glibc because I refuse to do that
silly copyright assignment FSF thing. Hopefully someone else can
resolve it, but...
> > Fixing Linus' test program to pass nr & 255 to memset
>
> No. I'm almost certain that that is not a "fix", it's a workaround for a
> serious bug in your glibc crap.
_is_ a fix whether _you_ like it or not to work around the issue so
people can at least run your test program. I'm not saying it's a
proper fix though.
Of course, if you prefer to be mislead by incorrect bug reports...
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 10:13 ` Russell King
2006-12-28 14:15 ` Gordon Farquharson
@ 2006-12-28 17:27 ` Linus Torvalds
2006-12-28 18:44 ` Russell King
1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 17:27 UTC (permalink / raw)
To: Russell King
Cc: Gordon Farquharson, David Miller, ranma, tbm, Peter Zijlstra,
andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Thu, 28 Dec 2006, Russell King wrote:
>
> and if you look at glibc's memset() function, you'll notice that's exactly
> what you expect if you pass a non-8bit value to it. Ergo, what you're
> seeing is utterly expected given glibc's memset() implementation on ARM.
Guys, you _really_ should fix memset(). What you describe is a _bug_.
"memset()" takes an "int" as its argument (always has), and has to convert
it to a byte _itself_. It may not be common, but it's perfectly normal, to
pass it values outside 0-255 (negative values that still fit in a "signed
char" in particular are very normal, but my usage of "let the thing
truncate it itself" is also quite fine).
> Fixing Linus' test program to pass nr & 255 to memset
No. I'm almost certain that that is not a "fix", it's a workaround for a
serious bug in your glibc crap.
But it does explain all the unexpected strange behaviour (and the really
small writeback size - now it doesn't need any /proc/sys/vm/dirty_ratio
assumptions to be explicable.
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 9:15 ` Zhang, Yanmin
@ 2006-12-28 17:15 ` Linus Torvalds
0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 17:15 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Thu, 28 Dec 2006, Zhang, Yanmin wrote:
>
> The test program is a process to write/read data. pdflush might write data
> to disk asynchronously. After pdflush writes a page to disk, it will call (either by
> softirq) clear_page_dirty to clear the dirty bit after getting the interrupt
> notification.
That would indeed be a horrible bug. However, we don't do
"clear_page_dirty()" _after_ the IO has completed, we do it _before_ the
IO starts.
If you can actually find a place that does clear_page_dirty _after_ IO,
then yes, you've just found the bug. But I haven't found it.
So the rule is _always_:
- call "clear_page_dirty_for_io()" with the page lock held, and _before_
the IO starts.
- do "set_page_writeback()" before unlocking the page again
- do a "end_page_writeback()" when the IO actually finishes.
and any code sequence that doesn't honor those rules would be buggy. A
beer for anybody that finds it..
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [PATCH] mm: fix page_mkclean_one
2006-12-28 6:10 ` Chen, Kenneth W
2006-12-28 6:27 ` David Miller
@ 2006-12-28 17:10 ` Linus Torvalds
1 sibling, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 17:10 UTC (permalink / raw)
To: Chen, Kenneth W
Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Wed, 27 Dec 2006, Chen, Kenneth W wrote:
> >
> > Running the test code, git bisect points its finger at this commit. Reverting
> > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
> >
> > [PATCH] mm: balance dirty pages
> >
> > Now that we can detect writers of shared mappings, throttle them. Avoids OOM
> > by surprise.
>
> Oh, never mind :-( I just didn't create enough write out pressure when
> test this. I just saw bug got triggered on a kernel I previously thought
> was OK.
Btw, this is an important point - people have long felt that the new page
balancing in 2.6.19 was to blame, but you've just confirmed the long-held
suspicion (at least by me) that it's not actually a new bug at all, it's
just that the dirty page balancing causes writeback to happen _earlier_,
and thus is better able to _show_ a bug that we've likely had for a long
long time.
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 5:58 ` Gordon Farquharson
@ 2006-12-28 17:08 ` Linus Torvalds
0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 17:08 UTC (permalink / raw)
To: Gordon Farquharson
Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Wed, 27 Dec 2006, Gordon Farquharson wrote:
>
> 100kB and 200kB files always succeed on the ARM system. 400kB and
> larger always seem to fail.
Oh, wow. Yeah, I've just repressed how tiny 32MB is. And especially if you
lowered the /proc/sys/vm/dirty_ratio to a smaller percentage, I guess
400kB should be enough to cause writeback.
Ugh. I tested a 128MB machine a few weeks ago, and found it painful.
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 14:15 ` Gordon Farquharson
@ 2006-12-28 15:53 ` Martin Michlmayr
0 siblings, 0 replies; 45+ messages in thread
From: Martin Michlmayr @ 2006-12-28 15:53 UTC (permalink / raw)
To: Gordon Farquharson
Cc: Linus Torvalds, David Miller, ranma, Peter Zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
* Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-28 07:15]:
> Thanks for the fix, Russell.
>
> I can now trigger the (real) problem by using a 25 MB file (100 << 18)
> and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM).
Me too (using 100 << 18). Interestingly, I don't seem to get any
corruption on a different ARM board, an IOP32x based machine with 128
MB RAM.
--
Martin Michlmayr
http://www.cyrius.com/
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
@ 2006-12-28 15:50 martin
0 siblings, 0 replies; 45+ messages in thread
From: martin @ 2006-12-28 15:50 UTC (permalink / raw)
To: Linus Torvalds, Guillaume Chazarain
Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Thu Dec 28 15:09 , Guillaume Chazarain sent:
>I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/
>I have corruption with every Fedora release kernel except the first, that is
>2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit
>some corruption.
Confirm that I see corruption with Fedora kernel 2.6.18-1.2239.fc5:
...
Chunk 142969 corrupted (0-1459) (2580-4039)
Expected 121, got 0
Written as (89632)127652(124721)
Chunk 142976 corrupted (0-1459) (512-1971)
Expected 128, got 0
Written as (121734)128324(108589)
Checking chunk 143639/143640 (99%)
$ uname -a
Linux 2.6.18-1.2239.fc5 #1 Fri Nov 10 13:04:06 EST 2006 i686 athlon i386 GNU/Linux
/Martin
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 10:49 ` Russell King
@ 2006-12-28 14:56 ` Martin Michlmayr
0 siblings, 0 replies; 45+ messages in thread
From: Martin Michlmayr @ 2006-12-28 14:56 UTC (permalink / raw)
To: Gordon Farquharson, Linus Torvalds, David Miller, ranma,
Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin,
arjan, Linux Kernel Mailing List
* Russell King <rmk+lkml@arm.linux.org.uk> [2006-12-28 10:49]:
> > By the way, I just tried it with TARGETSIZE (100 << 12) on a different
> > ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
> > memory) and I see similar results to that from Gordon:
>
> Work around the glibc memset() problem by passing nr & 255, and re-run
> the test. You're getting false positives.
Yeah, I saw your message about this problem after I sent mine.
--
Martin Michlmayr
http://www.cyrius.com/
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 10:13 ` Russell King
@ 2006-12-28 14:15 ` Gordon Farquharson
2006-12-28 15:53 ` Martin Michlmayr
2006-12-28 17:27 ` Linus Torvalds
1 sibling, 1 reply; 45+ messages in thread
From: Gordon Farquharson @ 2006-12-28 14:15 UTC (permalink / raw)
To: Gordon Farquharson, Linus Torvalds, David Miller, ranma, tbm,
Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin,
arjan, Linux Kernel Mailing List
On 12/28/06, Russell King <rmk+lkml@arm.linux.org.uk> wrote:
> Fixing Linus' test program to pass nr & 255 to memset results in clean
> passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM
> machines (as would be expected.)
Thanks for the fix, Russell.
I can now trigger the (real) problem by using a 25 MB file (100 << 18)
and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM).
$ ./linus-test
Writing chunk 17954/17955 (99%)
Chunk 514 corrupted (0-1459) (872-2331)
Expected 2, got 0
Written as (8479)11160(10312)
Chunk 516 corrupted (0-303) (3792-4095)
Expected 4, got 0
Written as (10312)10569(4426)
Chunk 959 corrupted (0-691) (3404-4095)
Expected 191, got 0
Written as (687)4881(1522)
Chunk 1895 corrupted (0-1459) (1900-3359)
Expected 103, got 0
Written as (7746)8389(6231)
Chunk 2702 corrupted (0-1459) (472-1931)
Expected 142, got 0
Written as (4866)7103(2409)
Chunk 3314 corrupted (0-1459) (1064-2523)
Expected 242, got 0
Written as (4287)7064(1730)
Chunk 4043 corrupted (0-1459) (444-1903)
Expected 203, got 0
Written as (6495)8509(4464)
Chunk 5180 corrupted (0-1459) (1584-3043)
Expected 60, got 0
Written as (11056)12826(10797)
Chunk 5672 corrupted (0-991) (3104-4095)
Expected 40, got 0
Written as (9944)4872(41)
Chunk 5793 corrupted (460-1459) (0-999)
Expected 161, got 0
Written as (7059)5038(4377)
Chunk 6089 corrupted (0-1459) (1620-3079)
Expected 201, got 0
Written as (4672)5230(4403)
Chunk 6545 corrupted (268-1459) (0-1191)
Expected 145, got 0
Written as (3701)5969(4668)
Chunk 7578 corrupted (0-1459) (584-2043)
Expected 154, got 0
Written as (10015)5082(1648)
Chunk 7880 corrupted (864-1459) (0-595)
Expected 200, got 0
Written as (17869)5064(4745)
Chunk 8086 corrupted (0-1459) (888-2347)
Expected 150, got 0
Written as (10206)11050(10374)
Chunk 8749 corrupted (0-1459) (2212-3671)
Expected 45, got 0
Written as (15263)7132(4825)
Chunk 9068 corrupted (0-1459) (1008-2467)
Expected 108, got 0
Written as (5557)7571(6771)
Chunk 9193 corrupted (812-1459) (0-647)
Expected 233, got 0
Written as (9238)7277(4757)
Chunk 10032 corrupted (576-1459) (0-883)
Expected 48, got 0
Written as (15741)10012(1753)
Chunk 10056 corrupted (0-1459) (1696-3155)
Expected 72, got 0
Written as (5379)7431(262)
Chunk 10395 corrupted (0-1459) (1020-2479)
Expected 155, got 0
Written as (21)7442(5902)
Chunk 10791 corrupted (0-1459) (1644-3103)
Expected 39, got 0
Written as (4753)5925(5926)
Chunk 10792 corrupted (0-991) (3104-4095)
Expected 40, got 0
Written as (5925)5926(8555)
Chunk 11036 corrupted (0-1103) (2992-4095)
Expected 28, got 0
Written as (13755)14449(7458)
Chunk 11387 corrupted (644-1459) (0-815)
Expected 123, got 0
Written as (10853)11459(9445)
Chunk 11586 corrupted (920-1459) (0-539)
Expected 66, got 0
Written as (3769)11691(11123)
Chunk 11882 corrupted (0-1459) (1160-2619)
Expected 106, got 0
Written as (10736)11696(2788)
Chunk 12397 corrupted (0-603) (3492-4095)
Expected 109, got 0
Written as (2352)7515(2437)
Chunk 12669 corrupted (0-795) (3300-4095)
Expected 125, got 0
Written as (1191)7661(5266)
Chunk 13162 corrupted (0-1459) (2184-3643)
Expected 106, got 0
Written as (9383)13662(11544)
Chunk 14653 corrupted (0-27) (4068-4095)
Expected 61, got 0
Written as (8100)9456(1275)
Chunk 17332 corrupted (0-367) (3728-4095)
Expected 180, got 0
Written as (760)12247(1244)
Chunk 17445 corrupted (0-1459) (772-2231)
Expected 37, got 0
Written as (8007)16481(14439)
Chunk 17556 corrupted (0-1007) (3088-4095)
Expected 148, got 0
Written as (10113)10657(10477)
Chunk 17859 corrupted (0-995) (3100-4095)
Expected 195, got 0
Written as (14472)14767(11426)
Checking chunk 17954/17955 (99%)
Gordon
--
Gordon Farquharson
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 3:04 ` Linus Torvalds
` (2 preceding siblings ...)
2006-12-28 9:15 ` Zhang, Yanmin
@ 2006-12-28 11:50 ` Petri Kaukasoina
2006-12-28 15:09 ` Guillaume Chazarain
4 siblings, 0 replies; 45+ messages in thread
From: Petri Kaukasoina @ 2006-12-28 11:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Wed, Dec 27, 2006 at 07:04:34PM -0800, Linus Torvalds wrote:
> [ Modified test-program that tells you where the corruption happens (and
> when the missing parts were supposed to be written out) appended, in
> case people care. ]
Hi
2.6.18 (and 2.6.18.6) is ok, 2.6.19-rc1 is broken. I tried some snapshots
between them but they hung before shell (2.6.18-git11, 2.6.18-git16,
2.6.18-git20, 2.6.18-git21). 2.6.18-git22 booted and was broken.
(UP, no preempt)
-Petri
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 10:16 ` Martin Michlmayr
@ 2006-12-28 10:49 ` Russell King
2006-12-28 14:56 ` Martin Michlmayr
0 siblings, 1 reply; 45+ messages in thread
From: Russell King @ 2006-12-28 10:49 UTC (permalink / raw)
To: Martin Michlmayr
Cc: Gordon Farquharson, Linus Torvalds, David Miller, ranma,
Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin,
arjan, Linux Kernel Mailing List
On Thu, Dec 28, 2006 at 11:16:59AM +0100, Martin Michlmayr wrote:
> * Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-27 22:38]:
> > >> #define TARGETSIZE (100 << 12)
> > >
> > >That's just 400kB!
> > >
> > >There's no way you should see corruption with that kind of value. It
> > >should all stay solidly in the cache.
> > >
> > >Is this perhaps with ARM nommu or something else strange? It may be that
> > >the program just doesn't work at all if mmap() is faked out with a malloc
> > >or similar.
> >
> > Definitely a question for the ARM gurus. I'm out of my depth.
>
> By the way, I just tried it with TARGETSIZE (100 << 12) on a different
> ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
> memory) and I see similar results to that from Gordon:
Work around the glibc memset() problem by passing nr & 255, and re-run
the test. You're getting false positives.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 5:38 ` Gordon Farquharson
2006-12-28 9:30 ` Martin Michlmayr
@ 2006-12-28 10:16 ` Martin Michlmayr
2006-12-28 10:49 ` Russell King
1 sibling, 1 reply; 45+ messages in thread
From: Martin Michlmayr @ 2006-12-28 10:16 UTC (permalink / raw)
To: Gordon Farquharson
Cc: Linus Torvalds, David Miller, ranma, Peter Zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
* Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-27 22:38]:
> >> #define TARGETSIZE (100 << 12)
> >
> >That's just 400kB!
> >
> >There's no way you should see corruption with that kind of value. It
> >should all stay solidly in the cache.
> >
> >Is this perhaps with ARM nommu or something else strange? It may be that
> >the program just doesn't work at all if mmap() is faked out with a malloc
> >or similar.
>
> Definitely a question for the ARM gurus. I'm out of my depth.
By the way, I just tried it with TARGETSIZE (100 << 12) on a different
ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
memory) and I see similar results to that from Gordon:
Writing chunk 279/280 (99%)
Chunk 256 corrupted (1-1455) (1025-2479)
Expected 0, got 1
Written as (199)43(184)
Chunk 258 corrupted (1-1455) (3945-1303)
Expected 2, got 3
Written as (184)74(145)
Chunk 260 corrupted (1-1455) (2769-127)
Expected 4, got 5
Written as (145)89(237)
Chunk 262 corrupted (1-1455) (1593-3047)
Expected 6, got 7
Written as (237)168(174)
Chunk 264 corrupted (1-1455) (417-1871)
Expected 8, got 9
Written as (174)135(161)
Chunk 266 corrupted (1-1455) (3337-695)
Expected 10, got 11
Written as (161)123(180)
Chunk 268 corrupted (1-1455) (2161-3615)
Expected 12, got 13
Written as (180)13(19)
Chunk 270 corrupted (1-1455) (985-2439)
Expected 14, got 15
Written as (19)172(106)
Chunk 272 corrupted (1-1455) (3905-1263)
Expected 16, got 17
Written as (106)212(140)
Chunk 274 corrupted (1-1455) (2729-87)
Expected 18, got 19
Written as (140)124(73)
Chunk 276 corrupted (1-1455) (1553-3007)
Expected 20, got 21
Written as (73)151(52)
Chunk 278 corrupted (1-1455) (377-1831)
Expected 22, got 23
Written as (52)215(99)
Checking chunk 279/280 (99%)
--
Martin Michlmayr
http://www.cyrius.com/
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 5:20 ` Gordon Farquharson
2006-12-28 5:41 ` David Miller
@ 2006-12-28 10:13 ` Russell King
2006-12-28 14:15 ` Gordon Farquharson
2006-12-28 17:27 ` Linus Torvalds
1 sibling, 2 replies; 45+ messages in thread
From: Russell King @ 2006-12-28 10:13 UTC (permalink / raw)
To: Gordon Farquharson
Cc: Linus Torvalds, David Miller, ranma, tbm, Peter Zijlstra,
andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Wed, Dec 27, 2006 at 10:20:20PM -0700, Gordon Farquharson wrote:
> I have run the program a few times, and the output is pretty
> consistent. However, when I increase the target size, the difference
> between the expected and actual values is larger.
>
> Written as (749)935(738)
> Chunk 1113 corrupted (1-1455) (2965-323)
> Expected 89, got 93
This is not the corruption Linus is after. Note that the corruption starts
at offset '1'. Also note that:
89 = 1113 & 255
93 = 1113 & 255 | (1113 >> 8)
and if you look at glibc's memset() function, you'll notice that's exactly
what you expect if you pass a non-8bit value to it. Ergo, what you're
seeing is utterly expected given glibc's memset() implementation on ARM.
Fixing Linus' test program to pass nr & 255 to memset results in clean
passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM
machines (as would be expected.)
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 5:38 ` Gordon Farquharson
@ 2006-12-28 9:30 ` Martin Michlmayr
2006-12-28 10:16 ` Martin Michlmayr
1 sibling, 0 replies; 45+ messages in thread
From: Martin Michlmayr @ 2006-12-28 9:30 UTC (permalink / raw)
To: Gordon Farquharson
Cc: Linus Torvalds, David Miller, ranma, Peter Zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
* Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-27 22:38]:
> >That's just 400kB!
> >
> >There's no way you should see corruption with that kind of value. It
> >should all stay solidly in the cache.
> >
> >Is this perhaps with ARM nommu or something else strange? It may be that
> >the program just doesn't work at all if mmap() is faked out with a malloc
> >or similar.
>
> Definitely a question for the ARM gurus. I'm out of my depth.
The CPU has a MMU. For reference, it's a IXP4xx based device with 32 MB
of memory.
--
Martin Michlmayr
http://www.cyrius.com/
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 3:04 ` Linus Torvalds
2006-12-28 4:32 ` Gordon Farquharson
2006-12-28 5:55 ` Chen, Kenneth W
@ 2006-12-28 9:15 ` Zhang, Yanmin
2006-12-28 17:15 ` Linus Torvalds
2006-12-28 11:50 ` Petri Kaukasoina
2006-12-28 15:09 ` Guillaume Chazarain
4 siblings, 1 reply; 45+ messages in thread
From: Zhang, Yanmin @ 2006-12-28 9:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Wed, 2006-12-27 at 19:04 -0800, Linus Torvalds wrote:
>
> On Wed, 27 Dec 2006, David Miller wrote:
> > >
> > > I still don't see _why_, though. But maybe smarter people than me can see
> > > it..
> >
> > FWIW this program definitely triggers the bug for me.
>
> Ok, now that I have something simple to do repeatable stuff with, I can
> say what the pattern is.. It's not all that surprising, but it's still
> worth just stating for the record.
>
> What happens is that when I do the "packetized writes" in random order,
> the _last_ write to a page occasionally just goes missing. It's not always
> at the end of a page, as shown by for example:
>
> - A whole chunk got dropped:
>
> Chunk 2094 corrupted (0-1459) (1624-3083)
> Expected 46, got 0
> Written as (30912)55414(10000)
>
> That "Written as (x)y(z)" line means that the corrupted chunk was
> written as chunk #y, and the preceding and following chunks (that were
> _not_ corrupt) on the page was written as #x and #z respectively.
>
> In other words, the missing chunk (which is still zero) was written
> much later than the ones that were ok, and never hit the disk. It's a
> contiguous chunk in the middle of the page (chunks are 1460 bytes in
> size)
>
> The first line means that all bytes of the chunk (0-1459) were
> corrupted, and the values in parenthesis are the offsets within a page.
> In other words, this was a chunk in the _middle_ of a page.
>
> - The missing data can also be at the beginning or ends of pages:
>
> Beginning of the chunk missing, it was at the end of a page (page
> offsets 3288-4095) and the _next_ page got written out fine:
>
> Chunk 2126 corrupted (0-807) (3288-4095)
> Expected 78, got 0
> Written as (32713)55573(14301)
>
> End of a chunk missing, it was the beginning of a page (and the
> _previous_ page that contained the beginning of the chunk was written
> out fine)
>
> Chunk 2179 corrupted (1252-1459) (0-207)
> Expected 131, got 0
> Written as (45189)55489(15515)
>
> Now, the reason I say this isn't surprising is that this is entirely
> consistent with the dirty bit being dropped on the floor somewhere, and
> likely through some interaction with the previous changes being in the
> process of being written out.
>
> Something (incorrectly) ends up deciding that it doesn't need to write the
> page, since it's already written, or alternatively clears the dirty bit
> too late (clears it because an _earlier_ write finished, never mind that
> the new dirty data didn't make it).
There might be a narrow race between set_page_dirty and clear_page_dirty.
The test program is a process to write/read data. pdflush might write data
to disk asynchronously. After pdflush writes a page to disk, it will call (either by
softirq) clear_page_dirty to clear the dirty bit after getting the interrupt
notification. But just after the page is written to disk and before the interrupt
reports the result, the test process might change the data and unmap the area. When
the area is unmapped, the page is marked as dirty again, but just after that, the
interrupt arrives and the dirty bit is cleared, so the late data will not be written
to disk.
Function zap_pte_range checks pte to set page dirty if needed, but it doesn't
hold page lock. If the page lock is held before set page dirty, the race might
be avoided.
Yanmin
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 6:10 ` Chen, Kenneth W
@ 2006-12-28 6:27 ` David Miller
2006-12-28 17:10 ` Linus Torvalds
1 sibling, 0 replies; 45+ messages in thread
From: David Miller @ 2006-12-28 6:27 UTC (permalink / raw)
To: kenneth.w.chen
Cc: torvalds, ranma, gordonfarquharson, tbm, a.p.zijlstra,
andrei.popa, akpm, hugh, nickpiggin, arjan, linux-kernel
From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Date: Wed, 27 Dec 2006 22:10:52 -0800
> Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM
> > Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> > > On Wed, 27 Dec 2006, David Miller wrote:
> > > > >
> > > > > I still don't see _why_, though. But maybe smarter people than me can see
> > > > > it..
> > > >
> > > > FWIW this program definitely triggers the bug for me.
> > >
> > > Ok, now that I have something simple to do repeatable stuff with, I can
> > > say what the pattern is.. It's not all that surprising, but it's still
> > > worth just stating for the record.
> >
> >
> > Running the test code, git bisect points its finger at this commit. Reverting
> > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
> >
> > edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
> > commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
> > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Date: Mon Sep 25 23:30:58 2006 -0700
> >
> > [PATCH] mm: balance dirty pages
> >
> > Now that we can detect writers of shared mappings, throttle them. Avoids OOM
> > by surprise.
>
>
> Oh, never mind :-( I just didn't create enough write out pressure when
> test this. I just saw bug got triggered on a kernel I previously thought
> was OK.
Besides, I'm pretty sure that from the Debian bug entry it's been
established that the dirty-page tracking changes from a few releases
ago introduced this problem.
^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [PATCH] mm: fix page_mkclean_one
2006-12-28 5:55 ` Chen, Kenneth W
@ 2006-12-28 6:10 ` Chen, Kenneth W
2006-12-28 6:27 ` David Miller
2006-12-28 17:10 ` Linus Torvalds
0 siblings, 2 replies; 45+ messages in thread
From: Chen, Kenneth W @ 2006-12-28 6:10 UTC (permalink / raw)
To: 'Linus Torvalds', David Miller
Cc: ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM
> Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> > On Wed, 27 Dec 2006, David Miller wrote:
> > > >
> > > > I still don't see _why_, though. But maybe smarter people than me can see
> > > > it..
> > >
> > > FWIW this program definitely triggers the bug for me.
> >
> > Ok, now that I have something simple to do repeatable stuff with, I can
> > say what the pattern is.. It's not all that surprising, but it's still
> > worth just stating for the record.
>
>
> Running the test code, git bisect points its finger at this commit. Reverting
> this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
>
> edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
> commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Mon Sep 25 23:30:58 2006 -0700
>
> [PATCH] mm: balance dirty pages
>
> Now that we can detect writers of shared mappings, throttle them. Avoids OOM
> by surprise.
Oh, never mind :-( I just didn't create enough write out pressure when
test this. I just saw bug got triggered on a kernel I previously thought
was OK.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
[not found] ` <Pine.LNX.4.64.0612272120180.4473@woody.osdl.org>
2006-12-28 5:38 ` Gordon Farquharson
@ 2006-12-28 5:58 ` Gordon Farquharson
2006-12-28 17:08 ` Linus Torvalds
1 sibling, 1 reply; 45+ messages in thread
From: Gordon Farquharson @ 2006-12-28 5:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote:
> That's just 400kB!
>
> There's no way you should see corruption with that kind of value. It
> should all stay solidly in the cache.
100kB and 200kB files always succeed on the ARM system. 400kB and
larger always seem to fail.
Does the following help interpret the results on ARM at all ?
$ free
total used free shared buffers cached
Mem: 30000 23620 6380 0 808 15676
-/+ buffers/cache: 7136 22864
Swap: 88316 3664 84652
Gordon
--
Gordon Farquharson
^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [PATCH] mm: fix page_mkclean_one
2006-12-28 3:04 ` Linus Torvalds
2006-12-28 4:32 ` Gordon Farquharson
@ 2006-12-28 5:55 ` Chen, Kenneth W
2006-12-28 6:10 ` Chen, Kenneth W
2006-12-28 9:15 ` Zhang, Yanmin
` (2 subsequent siblings)
4 siblings, 1 reply; 45+ messages in thread
From: Chen, Kenneth W @ 2006-12-28 5:55 UTC (permalink / raw)
To: 'Linus Torvalds', David Miller
Cc: ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> On Wed, 27 Dec 2006, David Miller wrote:
> > >
> > > I still don't see _why_, though. But maybe smarter people than me can see
> > > it..
> >
> > FWIW this program definitely triggers the bug for me.
>
> Ok, now that I have something simple to do repeatable stuff with, I can
> say what the pattern is.. It's not all that surprising, but it's still
> worth just stating for the record.
Running the test code, git bisect points its finger at this commit. Reverting
this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Sep 25 23:30:58 2006 -0700
[PATCH] mm: balance dirty pages
Now that we can detect writers of shared mappings, throttle them. Avoids OOM
by surprise.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 5:41 ` David Miller
@ 2006-12-28 5:47 ` Gordon Farquharson
0 siblings, 0 replies; 45+ messages in thread
From: Gordon Farquharson @ 2006-12-28 5:47 UTC (permalink / raw)
To: David Miller
Cc: torvalds, ranma, tbm, a.p.zijlstra, andrei.popa, akpm, hugh,
nickpiggin, arjan, linux-kernel
Hi David
On 12/27/06, David Miller <davem@davemloft.net> wrote:
> Me too, I added "-D_POSIX_C_SOURCE=200112" to "fix" this.
That works for me. Thanks for the tip.
Gordon
--
Gordon Farquharson
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 5:20 ` Gordon Farquharson
@ 2006-12-28 5:41 ` David Miller
2006-12-28 5:47 ` Gordon Farquharson
2006-12-28 10:13 ` Russell King
1 sibling, 1 reply; 45+ messages in thread
From: David Miller @ 2006-12-28 5:41 UTC (permalink / raw)
To: gordonfarquharson
Cc: torvalds, ranma, tbm, a.p.zijlstra, andrei.popa, akpm, hugh,
nickpiggin, arjan, linux-kernel
From: "Gordon Farquharson" <gordonfarquharson@gmail.com>
Date: Wed, 27 Dec 2006 22:20:20 -0700
> and for some reason I get
>
> linus-test.c: In function 'remap':
> linus-test.c:61: error: 'POSIX_FADV_DONTNEED' undeclared (first use in
> this function)
>
> when I compile the program, so I replaced POSIX_FADV_DONTNEED with 4
> as defined in /usr/include/bits/fcntl.h.
Me too, I added "-D_POSIX_C_SOURCE=200112" to "fix" this.
Perhaps Linus's GCC sets that by default and our's doesn't.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
[not found] ` <Pine.LNX.4.64.0612272120180.4473@woody.osdl.org>
@ 2006-12-28 5:38 ` Gordon Farquharson
2006-12-28 9:30 ` Martin Michlmayr
2006-12-28 10:16 ` Martin Michlmayr
2006-12-28 5:58 ` Gordon Farquharson
1 sibling, 2 replies; 45+ messages in thread
From: Gordon Farquharson @ 2006-12-28 5:38 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote:
> On Wed, 27 Dec 2006, Gordon Farquharson wrote:
> >
> > I don't think so. I did reduce the target size
> >
> > #define TARGETSIZE (100 << 12)
>
> That's just 400kB!
>
> There's no way you should see corruption with that kind of value. It
> should all stay solidly in the cache.
>
> Is this perhaps with ARM nommu or something else strange? It may be that
> the program just doesn't work at all if mmap() is faked out with a malloc
> or similar.
Definitely a question for the ARM gurus. I'm out of my depth.
Gordon
--
Gordon Farquharson
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 4:53 ` Linus Torvalds
@ 2006-12-28 5:20 ` Gordon Farquharson
2006-12-28 5:41 ` David Miller
2006-12-28 10:13 ` Russell King
[not found] ` <97a0a9ac0612272115g4cce1f08n3c3c8498a6076bd5@mail.gmail.com>
1 sibling, 2 replies; 45+ messages in thread
From: Gordon Farquharson @ 2006-12-28 5:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
[Oops - forgot to hit "Reply to All" first time round.]
Hi Linus
On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote:
> For all I know, my test-program is buggy wrt the ordering printouts,
> though. Did you perhaps change the logic in any way?
I don't think so. I did reduce the target size
#define TARGETSIZE (100 << 12)
to make the program finish a little quicker, and for some reason I get
linus-test.c: In function 'remap':
linus-test.c:61: error: 'POSIX_FADV_DONTNEED' undeclared (first use in
this function)
when I compile the program, so I replaced POSIX_FADV_DONTNEED with 4
as defined in /usr/include/bits/fcntl.h.
Other than these two changes, the program is identical to the version
you posted.
I have run the program a few times, and the output is pretty
consistent. However, when I increase the target size, the difference
between the expected and actual values is larger.
Written as (749)935(738)
Chunk 1113 corrupted (1-1455) (2965-323)
Expected 89, got 93
Written as (935)738(538)
Chunk 1114 corrupted (1-1455) (329-1783)
Expected 90, got 94
Written as (738)538(678)
Chunk 1115 corrupted (1-1455) (1789-3243)
Expected 91, got 95
Written as (538)678(989)
Chunk 1120 corrupted (1-1455) (897-2351)
Expected 96, got 100
Written as (537)265(1005)
Chunk 1121 corrupted (1-1455) (2357-3811)
Expected 97, got 101
Written as (265)1005(-1)
--- linus-test.c.orig 2006-12-28 06:17:24.000000000 +0100
+++ linus-test.c 2006-12-28 06:18:24.000000000 +0100
@@ -6,7 +6,7 @@
#include <stdio.h>
#include <time.h>
-#define TARGETSIZE (100 << 20)
+#define TARGETSIZE (100 << 14)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)
@@ -61,7 +61,7 @@
{
if (mapping) {
munmap(mapping, SIZE);
- posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
+ posix_fadvise(fd, 0, SIZE, 4);
}
return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
Gordon
--
Gordon Farquharson
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 4:32 ` Gordon Farquharson
@ 2006-12-28 4:53 ` Linus Torvalds
2006-12-28 5:20 ` Gordon Farquharson
[not found] ` <97a0a9ac0612272115g4cce1f08n3c3c8498a6076bd5@mail.gmail.com>
0 siblings, 2 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 4:53 UTC (permalink / raw)
To: Gordon Farquharson
Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Wed, 27 Dec 2006, Gordon Farquharson wrote:
>
> It is at all suprising that the second offset within a page can be
> less than the first offset within a page ? e.g.
>
> Chunk 260 corrupted (1-1455) (2769-127)
No, that just means that it went over to the next page (so you actually
had two consecutive pages that weren't written out).
That said, your output is very different from mine in another way. You
don't have zeroes in your pages, rather the thing seems to have data from
the next block (ie the chunk that should have 20 is reported as having 21
etc). You also have your offsets shifted up by one (ie offset 0 looks ok
for you, and then you have a strange pattern of corruption at bytes
1...1455 instead of 0..1459.
You also seem to have an example of the _earlier_ writes being corrupted,
rather than the later ones. For example (but it's also a page-crosser, so
maybe that's part of it):
Chunk 274 corrupted (1-1455) (2729-87)
Expected 18, got 19
Written as (154)11(85)
says that block chunk 274 is the corrupt one, but it was written fairly
early as #11, and the blocks around it (chunks 273 and 275) were actually
written later.
For all I know, my test-program is buggy wrt the ordering printouts,
though. Did you perhaps change the logic in any way?
Linus
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 3:04 ` Linus Torvalds
@ 2006-12-28 4:32 ` Gordon Farquharson
2006-12-28 4:53 ` Linus Torvalds
2006-12-28 5:55 ` Chen, Kenneth W
` (3 subsequent siblings)
4 siblings, 1 reply; 45+ messages in thread
From: Gordon Farquharson @ 2006-12-28 4:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote:
> [ Modified test-program that tells you where the corruption happens (and
> when the missing parts were supposed to be written out) appended, in
> case people care. ]
For the record, this is the output from a run on our ARM machine (32
MB RAM) with 2.6.18 + the following patches:
mm: tracking shared dirty pages
mm: balance dirty pages
mm: optimize the new mprotect() code a bit
mm: small cleanup of install_page()
mm: fixup do_wp_page()
mm: msync() cleanup
It is at all suprising that the second offset within a page can be
less than the first offset within a page ? e.g.
Chunk 260 corrupted (1-1455) (2769-127)
$ ./linus-test
Writing chunk 279/280 (99%)
Chunk 256 corrupted (1-1455) (1025-2479)
Expected 0, got 1
Written as (82)175(56)
Chunk 258 corrupted (1-1455) (3945-1303)
Expected 2, got 3
Written as (56)51(20)
Chunk 260 corrupted (1-1455) (2769-127)
Expected 4, got 5
Written as (20)30(18)
Chunk 262 corrupted (1-1455) (1593-3047)
Expected 6, got 7
Written as (18)196(158)
Chunk 264 corrupted (1-1455) (417-1871)
Expected 8, got 9
Written as (158)133(146)
Chunk 266 corrupted (1-1455) (3337-695)
Expected 10, got 11
Written as (146)43(77)
Chunk 268 corrupted (1-1455) (2161-3615)
Expected 12, got 13
Written as (77)251(211)
Chunk 270 corrupted (1-1455) (985-2439)
Expected 14, got 15
Written as (211)257(231)
Chunk 272 corrupted (1-1455) (3905-1263)
Expected 16, got 17
Written as (231)254(154)
Chunk 274 corrupted (1-1455) (2729-87)
Expected 18, got 19
Written as (154)11(85)
Chunk 276 corrupted (1-1455) (1553-3007)
Expected 20, got 21
Written as (85)230(134)
Chunk 278 corrupted (1-1455) (377-1831)
Expected 22, got 23
Written as (134)233(103)
Checking chunk 279/280 (99%)
Gordon
--
Gordon Farquharson
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 0:52 ` David Miller
@ 2006-12-28 3:04 ` Linus Torvalds
2006-12-28 4:32 ` Gordon Farquharson
` (4 more replies)
0 siblings, 5 replies; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 3:04 UTC (permalink / raw)
To: David Miller
Cc: ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Wed, 27 Dec 2006, David Miller wrote:
> >
> > I still don't see _why_, though. But maybe smarter people than me can see
> > it..
>
> FWIW this program definitely triggers the bug for me.
Ok, now that I have something simple to do repeatable stuff with, I can
say what the pattern is.. It's not all that surprising, but it's still
worth just stating for the record.
What happens is that when I do the "packetized writes" in random order,
the _last_ write to a page occasionally just goes missing. It's not always
at the end of a page, as shown by for example:
- A whole chunk got dropped:
Chunk 2094 corrupted (0-1459) (1624-3083)
Expected 46, got 0
Written as (30912)55414(10000)
That "Written as (x)y(z)" line means that the corrupted chunk was
written as chunk #y, and the preceding and following chunks (that were
_not_ corrupt) on the page was written as #x and #z respectively.
In other words, the missing chunk (which is still zero) was written
much later than the ones that were ok, and never hit the disk. It's a
contiguous chunk in the middle of the page (chunks are 1460 bytes in
size)
The first line means that all bytes of the chunk (0-1459) were
corrupted, and the values in parenthesis are the offsets within a page.
In other words, this was a chunk in the _middle_ of a page.
- The missing data can also be at the beginning or ends of pages:
Beginning of the chunk missing, it was at the end of a page (page
offsets 3288-4095) and the _next_ page got written out fine:
Chunk 2126 corrupted (0-807) (3288-4095)
Expected 78, got 0
Written as (32713)55573(14301)
End of a chunk missing, it was the beginning of a page (and the
_previous_ page that contained the beginning of the chunk was written
out fine)
Chunk 2179 corrupted (1252-1459) (0-207)
Expected 131, got 0
Written as (45189)55489(15515)
Now, the reason I say this isn't surprising is that this is entirely
consistent with the dirty bit being dropped on the floor somewhere, and
likely through some interaction with the previous changes being in the
process of being written out.
Something (incorrectly) ends up deciding that it doesn't need to write the
page, since it's already written, or alternatively clears the dirty bit
too late (clears it because an _earlier_ write finished, never mind that
the new dirty data didn't make it).
I also figured out that it's not the low-memory situation that does it, it
really must be the "page_mkclean()" triggering. Becuase I can do
echo 5 > /proc/sys/vm/dirty_ratio
echo 3 > /proc/sys/vm/dirty_background_ratio
to make it clean the pages much more aggressively than the default, and I
can see corruption on my 256MB machine with just a 40MB shared file, and
70MB of memory consistently free.
So this thing is definitely giving some answers. It's NOT about low
memory, and it very much seems to be about the whole "balance_dirty_ratio"
thing. I don't think I triggered the actual low-memory stuff once in that
situation..
So I have some more data on the behaviour, but I _still_ don't see the
reason behind it. It's probably something really obvious once it's pointed
out..
[ Modified test-program that tells you where the corruption happens (and
when the missing parts were supposed to be written out) appended, in
case people care. ]
Linus
---
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#define TARGETSIZE (100 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)
static void fillmem(void *start, int nr)
{
memset(start, nr, CHUNKSIZE);
}
#define page_offset(buf, off) (0xfff & ((unsigned)(unsigned long)(buf)+(off)))
static int chunkorder[NRCHUNKS];
static int order(int nr)
{
int i;
if (nr < 0 || nr >= NRCHUNKS)
return -1;
for (i = 0; i < NRCHUNKS; i++)
if (chunkorder[i] == nr)
return i;
return -2;
}
static void checkmem(void *buf, int nr)
{
unsigned int start = ~0u, end = 0;
unsigned char c = nr, *p = buf, differs = 0;
int i;
for (i = 0; i < CHUNKSIZE; i++) {
unsigned char got = *p++;
if (got != c) {
if (i < start)
start = i;
if (i > end)
end = i;
differs = got;
}
}
if (start < end) {
printf("Chunk %d corrupted (%u-%u) (%u-%u) \n", nr, start, end,
page_offset(buf, start), page_offset(buf, end));
printf("Expected %u, got %u\n", c, differs);
printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1));
}
}
static char *remap(int fd, char *mapping)
{
if (mapping) {
munmap(mapping, SIZE);
posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
}
return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
int main(int argc, char **argv)
{
char *mapping;
int fd, i;
/*
* Make some random ordering of writing the chunks to the
* memory map..
*
* Start with fully ordered..
*/
for (i = 0; i < NRCHUNKS; i++)
chunkorder[i] = i;
/* ..and then mix it up randomly */
srandom(time(NULL));
for (i = 0; i < NRCHUNKS; i++) {
int index = (unsigned int) random() % NRCHUNKS;
int nr = chunkorder[index];
chunkorder[index] = chunkorder[i];
chunkorder[i] = nr;
}
fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
if (fd < 0)
return -1;
if (ftruncate(fd, SIZE) < 0)
return -1;
mapping = remap(fd, NULL);
if (-1 == (int)(long)mapping)
return -1;
for (i = 0; i < NRCHUNKS; i++) {
int chunk = chunkorder[i];
printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
fillmem(mapping + chunk * CHUNKSIZE, chunk);
}
printf("\n");
/* Unmap, drop, and remap.. */
mapping = remap(fd, mapping);
/* .. and check */
for (i = 0; i < NRCHUNKS; i++) {
int chunk = i;
printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
checkmem(mapping + chunk * CHUNKSIZE, chunk);
}
printf("\n");
return 0;
}
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 0:39 ` Linus Torvalds
@ 2006-12-28 0:52 ` David Miller
2006-12-28 3:04 ` Linus Torvalds
0 siblings, 1 reply; 45+ messages in thread
From: David Miller @ 2006-12-28 0:52 UTC (permalink / raw)
To: torvalds
Cc: ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa, akpm,
hugh, nickpiggin, arjan, linux-kernel
From: Linus Torvalds <torvalds@osdl.org>
Date: Wed, 27 Dec 2006 16:39:43 -0800 (PST)
>
>
> On Wed, 27 Dec 2006, Linus Torvalds wrote:
> >
> > I think the test-case could probably be improved by having a munmap() and
> > page-cache flush in between the writing and the checking, to see whether
> > that shows the corruption easier (and possibly without having to start
> > paging in order to throw the pages out, which would simplify testing a
> > lot).
>
> I think the page-writeout is implicated, because I do seem to need it, but
> the page-cache flush does seem to make corruption _easier_ to see. I now
> seem about to trigger it with a 100MB file on a 256MB machine in a minute
> or so, with this slight modification.
>
> I still don't see _why_, though. But maybe smarter people than me can see
> it..
FWIW this program definitely triggers the bug for me.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 0:42 ` Linus Torvalds
@ 2006-12-28 0:52 ` David Miller
0 siblings, 0 replies; 45+ messages in thread
From: David Miller @ 2006-12-28 0:52 UTC (permalink / raw)
To: torvalds
Cc: schwidefsky, a.p.zijlstra, tbm, hugh, nickpiggin, arjan,
andrei.popa, akpm, linux-kernel, fw, mh+linux-kernel,
heiko.carstens, arnd.bergmann, gordonfarquharson
From: Linus Torvalds <torvalds@osdl.org>
Date: Wed, 27 Dec 2006 16:42:40 -0800 (PST)
> That's fine. In that situation, you shouldn't need any atomic ops at all,
> I think all our sw page-table operations are already done under the pte
> lock.
This is true, but there is one subtlety to this I want to
point out in passing.
That lock can possibly only protect a page of PTEs.
When NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS, the locking is done per page
of PTEs, not for all of the page tables of an address space at once.
What this means is that it's really difficult to forcefully block out
all page table operations for a given mm, and I actually needed to do
something like this on sparc64 (when growing the TLB lookup hash
table, you can't let any PTEs change state while the table is
changing). For my case, I added a spinlock to mm->context since
actually what I need is to block modifications to the hash table
itself during PTE changes.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-28 0:16 ` Linus Torvalds
@ 2006-12-28 0:39 ` Linus Torvalds
2006-12-28 0:52 ` David Miller
0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 0:39 UTC (permalink / raw)
To: David Miller
Cc: ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Wed, 27 Dec 2006, Linus Torvalds wrote:
>
> I think the test-case could probably be improved by having a munmap() and
> page-cache flush in between the writing and the checking, to see whether
> that shows the corruption easier (and possibly without having to start
> paging in order to throw the pages out, which would simplify testing a
> lot).
I think the page-writeout is implicated, because I do seem to need it, but
the page-cache flush does seem to make corruption _easier_ to see. I now
seem about to trigger it with a 100MB file on a 256MB machine in a minute
or so, with this slight modification.
I still don't see _why_, though. But maybe smarter people than me can see
it..
Linus
---
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#define TARGETSIZE (100 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)
static void fillmem(void *start, int nr)
{
memset(start, nr, CHUNKSIZE);
}
static void checkmem(void *start, int nr)
{
unsigned char c = nr, *p = start;
int i;
for (i = 0; i < CHUNKSIZE; i++) {
if (*p++ != c) {
printf("Chunk %d corrupted \n", nr);
return;
}
}
}
static char *remap(int fd, char *mapping)
{
if (mapping) {
munmap(mapping, SIZE);
posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
}
return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
int main(int argc, char **argv)
{
char *mapping;
int fd, i;
static int chunkorder[NRCHUNKS];
/*
* Make some random ordering of writing the chunks to the
* memory map..
*
* Start with fully ordered..
*/
for (i = 0; i < NRCHUNKS; i++)
chunkorder[i] = i;
/* ..and then mix it up randomly */
srandom(time(NULL));
for (i = 0; i < NRCHUNKS; i++) {
int index = (unsigned int) random() % NRCHUNKS;
int nr = chunkorder[index];
chunkorder[index] = chunkorder[i];
chunkorder[i] = nr;
}
fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
if (fd < 0)
return -1;
if (ftruncate(fd, SIZE) < 0)
return -1;
mapping = remap(fd, NULL);
if (-1 == (int)(long)mapping)
return -1;
for (i = 0; i < NRCHUNKS; i++) {
int chunk = chunkorder[i];
printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
fillmem(mapping + chunk * CHUNKSIZE, chunk);
}
printf("\n");
/* Unmap, drop, and remap.. */
mapping = remap(fd, mapping);
/* .. and check */
for (i = 0; i < NRCHUNKS; i++) {
int chunk = i;
printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
checkmem(mapping + chunk * CHUNKSIZE, chunk);
}
printf("\n");
return 0;
}
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-27 4:55 ` [PATCH] mm: fix page_mkclean_one David Miller
2006-12-27 7:00 ` Linus Torvalds
@ 2006-12-28 0:16 ` Linus Torvalds
2006-12-28 0:39 ` Linus Torvalds
1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2006-12-28 0:16 UTC (permalink / raw)
To: David Miller
Cc: ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa,
Andrew Morton, hugh, nickpiggin, arjan,
Linux Kernel Mailing List
On Tue, 26 Dec 2006, David Miller wrote:
>
> I've seen it on sparc64, UP kernel, no preempt.
Ok, I still don't have a clue, but I think I at least have a new
test-case.
It can probably be improved upon, but this would _seem_ to trigger the
problem. Can people check?
You'd want to make sure you get page-put activity, by making TARGETSIZE be
big enough to cause memory pressure (and rather than making it bigger, you
might want to make your memory smaller instead, to make it run more
quickly. Either using "mem=128M" or big compiles or something...).
If it finds corruption, you'll see something like
Writing chunk 183858/183859 (99%)
Chunk ..
Chunk 120887 corrupted
Chunk 122372 corrupted
Chunk ...
Checking chunk 183858/183859 (99%)
otherwise it will just say
Writing chunk 183858/183859 (99%)
Checking chunk 183858/183859 (99%)
and exit.
I didn't spend a lot of time verifying this, but I _was_ able to cause
those "Chunk xxx corrupted" messages with this. There's probably a more
efficient better way to do it, but this is better than trying to use
rtorrent, and also makes any worries about what rtorrent does go away.
Of course, maybe it's this test-program that is buggy now, although it
looks trivial enough that I don't think it is.
I think my earlier stress-tester may not have triggered this, because it
just did all its writing in a linear order, so any LRU logic will happen
to write back old pages that we are no longer touching. The randomization
(and using a chunksize that isn't a multiple of a page-size) makes sure
that we're actually going to have lots of rewriting going on.
I think the test-case could probably be improved by having a munmap() and
page-cache flush in between the writing and the checking, to see whether
that shows the corruption easier (and possibly without having to start
paging in order to throw the pages out, which would simplify testing a
lot). But I haven't tested. I decided to post this asap, now that I've
recreated the corruption with something else, and something that is
possibly easier to analyze..
Linus
----
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#define TARGETSIZE (256 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)
static void fillmem(void *start, int nr)
{
memset(start, nr, CHUNKSIZE);
}
static void checkmem(void *start, int nr)
{
unsigned char c = nr, *p = start;
int i;
for (i = 0; i < CHUNKSIZE; i++) {
if (*p++ != c) {
printf("Chunk %d corrupted \n", nr);
return;
}
}
}
int main(int argc, char **argv)
{
char *mapping;
int fd, i;
static int chunkorder[NRCHUNKS];
/*
* Make some random ordering of writing the chunks to the
* memory map..
*
* Start with fully ordered..
*/
for (i = 0; i < NRCHUNKS; i++)
chunkorder[i] = i;
/* ..and then mix it up randomly */
srandom(time(NULL));
for (i = 0; i < NRCHUNKS; i++) {
int index = (unsigned int) random() % NRCHUNKS;
int nr = chunkorder[index];
chunkorder[index] = chunkorder[i];
chunkorder[i] = nr;
}
fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
if (fd < 0)
return -1;
if (ftruncate(fd, SIZE) < 0)
return -1;
mapping = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (-1 == (int)(long)mapping)
return -1;
for (i = 0; i < NRCHUNKS; i++) {
int chunk = chunkorder[i];
printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
fillmem(mapping + chunk * CHUNKSIZE, chunk);
}
printf("\n");
for (i = 0; i < NRCHUNKS; i++) {
int chunk = i;
printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
checkmem(mapping + chunk * CHUNKSIZE, chunk);
}
printf("\n");
return 0;
}
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-27 7:00 ` Linus Torvalds
@ 2006-12-27 8:39 ` Andrei Popa
0 siblings, 0 replies; 45+ messages in thread
From: Andrei Popa @ 2006-12-27 8:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Miller, ranma, gordonfarquharson, tbm, a.p.zijlstra, akpm,
hugh, nickpiggin, arjan, linux-kernel
I have corrupted files...
> ---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 263f88e..4652ef1 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> do {
> if (!buffer_mapped(bh))
> continue;
> - /*
> - * If it's a fully non-blocking write attempt and we cannot
> - * lock the buffer then redirty the page. Note that this can
> - * potentially cause a busy-wait loop from pdflush and kswapd
> - * activity, but those code paths have their own higher-level
> - * throttling.
> - */
> - if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
> - lock_buffer(bh);
> - } else if (test_set_buffer_locked(bh)) {
> - redirty_page_for_writepage(wbc, page);
> - continue;
> - }
> + lock_buffer(bh);
> if (test_clear_buffer_dirty(bh)) {
> mark_buffer_async_write(bh);
> } else {
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-27 4:55 ` [PATCH] mm: fix page_mkclean_one David Miller
@ 2006-12-27 7:00 ` Linus Torvalds
2006-12-27 8:39 ` Andrei Popa
2006-12-28 0:16 ` Linus Torvalds
1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2006-12-27 7:00 UTC (permalink / raw)
To: David Miller
Cc: ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa, akpm,
hugh, nickpiggin, arjan, linux-kernel
On Tue, 26 Dec 2006, David Miller wrote:
>
> I've seen it on sparc64, UP kernel, no preempt.
Btw, having tried to debug the writeback code, there's one very special
case that just makes me go "hmm".
If we have a buffer that is "busy" when we try to write back a page, we
have this magic "wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking" mode,
where we won't wait for it, but instead we'll redirty the page and redo
the whole thing.
Looking at the code, that should all work, but at the same time, it
triggers some of my debug messages about having a dirty page during
writeback, and one way to trigger that debug message is to try to run
rtorrent on the machine..
I dunno. Witht he writeback being suspicious, and the normal
"block_write_full_page()" path being implicated in at least ext2, I just
wonder. This is one of those "let's see if behaviour changes" patches,
that I'm just throwing out there..
Linus
---
diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..4652ef1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
do {
if (!buffer_mapped(bh))
continue;
- /*
- * If it's a fully non-blocking write attempt and we cannot
- * lock the buffer then redirty the page. Note that this can
- * potentially cause a busy-wait loop from pdflush and kswapd
- * activity, but those code paths have their own higher-level
- * throttling.
- */
- if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
- lock_buffer(bh);
- } else if (test_set_buffer_locked(bh)) {
- redirty_page_for_writepage(wbc, page);
- continue;
- }
+ lock_buffer(bh);
if (test_clear_buffer_dirty(bh)) {
mark_buffer_async_write(bh);
} else {
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
2006-12-26 16:17 ` Tobias Diedrich
@ 2006-12-27 4:55 ` David Miller
2006-12-27 7:00 ` Linus Torvalds
2006-12-28 0:16 ` Linus Torvalds
0 siblings, 2 replies; 45+ messages in thread
From: David Miller @ 2006-12-27 4:55 UTC (permalink / raw)
To: ranma
Cc: torvalds, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa,
akpm, hugh, nickpiggin, arjan, linux-kernel
From: Tobias Diedrich <ranma@tdiedrich.de>
Date: Tue, 26 Dec 2006 17:17:00 +0100
> Linus Torvalds wrote:
> > I don't think it's a page table issue any more, it just doesn't look
> > likely with the ARM UP corruption. It's also not apparently even on a
> > cacheline boundary, so it probably is really a dirty bit that got cleared
> > wrogn due to some race with IO.
>
> So, until now it's only been reported for SMP on i386?
> I'm seeing the issue on my Pentium-M Notebook (Thinkpad R52) over
> here, UP kernel, no preempt.
I've seen it on sparc64, UP kernel, no preempt.
^ permalink raw reply [flat|nested] 45+ messages in thread
end of thread, other threads:[~2007-02-02 13:09 UTC | newest]
Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-01 22:40 [PATCH] mm: fix page_mkclean_one Mark Groves
2007-02-02 8:42 ` Nick Piggin
2007-02-02 13:08 ` Evgeniy Polyakov
-- strict thread matches above, loose matches on Subject: below --
2006-12-28 15:50 martin
2006-12-24 8:10 [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Gordon Farquharson
2006-12-24 8:43 ` Linus Torvalds
2006-12-26 16:17 ` Tobias Diedrich
2006-12-27 4:55 ` [PATCH] mm: fix page_mkclean_one David Miller
2006-12-27 7:00 ` Linus Torvalds
2006-12-27 8:39 ` Andrei Popa
2006-12-28 0:16 ` Linus Torvalds
2006-12-28 0:39 ` Linus Torvalds
2006-12-28 0:52 ` David Miller
2006-12-28 3:04 ` Linus Torvalds
2006-12-28 4:32 ` Gordon Farquharson
2006-12-28 4:53 ` Linus Torvalds
2006-12-28 5:20 ` Gordon Farquharson
2006-12-28 5:41 ` David Miller
2006-12-28 5:47 ` Gordon Farquharson
2006-12-28 10:13 ` Russell King
2006-12-28 14:15 ` Gordon Farquharson
2006-12-28 15:53 ` Martin Michlmayr
2006-12-28 17:27 ` Linus Torvalds
2006-12-28 18:44 ` Russell King
2006-12-28 19:01 ` Linus Torvalds
[not found] ` <97a0a9ac0612272115g4cce1f08n3c3c8498a6076bd5@mail.gmail.com>
[not found] ` <Pine.LNX.4.64.0612272120180.4473@woody.osdl.org>
2006-12-28 5:38 ` Gordon Farquharson
2006-12-28 9:30 ` Martin Michlmayr
2006-12-28 10:16 ` Martin Michlmayr
2006-12-28 10:49 ` Russell King
2006-12-28 14:56 ` Martin Michlmayr
2006-12-28 5:58 ` Gordon Farquharson
2006-12-28 17:08 ` Linus Torvalds
2006-12-28 5:55 ` Chen, Kenneth W
2006-12-28 6:10 ` Chen, Kenneth W
2006-12-28 6:27 ` David Miller
2006-12-28 17:10 ` Linus Torvalds
2006-12-28 9:15 ` Zhang, Yanmin
2006-12-28 17:15 ` Linus Torvalds
2006-12-28 11:50 ` Petri Kaukasoina
2006-12-28 15:09 ` Guillaume Chazarain
2006-12-28 19:19 ` Guillaume Chazarain
2006-12-28 19:28 ` Linus Torvalds
2006-12-28 19:45 ` Andrew Morton
2006-12-28 20:14 ` Linus Torvalds
2006-12-28 22:38 ` David Miller
2006-12-29 2:50 ` Segher Boessenkool
2006-12-29 6:48 ` Linus Torvalds
2006-12-28 22:35 ` Mike Galbraith
2006-12-21 20:01 [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Linus Torvalds
2006-12-28 0:00 ` Martin Schwidefsky
2006-12-28 0:42 ` Linus Torvalds
2006-12-28 0:52 ` [PATCH] mm: fix page_mkclean_one David Miller
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).