LKML Archive on lore.kernel.org
* [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
@ 2007-04-27 7:59 Mike Galbraith
2007-04-27 8:33 ` Andrew Morton
2007-04-27 15:18 ` Linus Torvalds
0 siblings, 2 replies; 66+ messages in thread
From: Mike Galbraith @ 2007-04-27 7:59 UTC (permalink / raw)
To: LKML; +Cc: Andrew Morton, Linus Torvalds, Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 6835 bytes --]
Greetings,
As subject states, my GUI is going away for extended periods of time
when my very full and likely highly fragmented (how to find out)
filesystem is under heavy write load. While write is under way, if
amarok (mp3 player) is running, no song change will occur until write is
finished, and the GUI can go _entirely_ comatose for very long periods.
Usually, it will come back to life after write is finished, but
occasionally, a complete GUI restart is necessary.
The longest comatose period to date was ~20 minutes with 2.6.20.7 a few
days ago. I was letting SuSE's software update programs update my SuSE
10.2 system, and started a bonnie while it was running (because I had
been seeing this on recent kernels, and wanted to see if it was in
stable as well), WHAM, instant dead GUI. When this happens, kbd and
mouse events work fine, so I hot-keyed to a VT via CTRL+ALT+F1, and
killed the bonnie. No joy, GUI stayed utterly comatose until the
updater finished roughly 20 minutes later, at which time the shells I'd
tried to start popped up, and all worked as if nothing bad had ever
happened. During the time in between, no window could be brought into
focus, nada.
While a bonnie is writing, if I poke KDE's menu button, that will
instantly trigger nastiness, and a trace (this one was with a cfs
kernel, but I just did same with virgin 2.6.21) shows that "kicker",
KDE's launcher proggy does an fdatasync for some reason, and that's the
end of its world for ages. When clicking on amarok's icon, it does an
fsync, and that's the last thing that will happen in its world until
write is done as well. I've repeated this with CFQ and AS IO
schedulers.
I have a couple of old kernels lying around that I can test with, but I
think it's going to be the same. Seems to be ext3's journal that is
causing my woes. Below this trace of kicker is one of amarok during
its dead-to-the-world time.
Box is 3GHz P4 intel ICH5, SMP/UP doesn't matter. The .config of the latest
kernel tested is attached. Mount options are
noatime,nodiratime,acl,user_xattr.
[ 308.046646] kicker D 00000044 0 5897 1 (NOTLB)
[ 308.052611] f32abe4c 00200082 83398b5a 00000044 c01c251e f32ab000 f32ab000 c01169b6
[ 308.060926] f772fbcc cdc7e694 00000039 8339857a 00000044 83398b5a 00000044 00000000
[ 308.069422] c1b5ab00 c1b5aac0 00353928 f32abe80 c01c7ab8 00000000 f32ab000 c1b5ab10
[ 308.077927] Call Trace:
[ 308.080568] [<c01c7ab8>] log_wait_commit+0x9d/0x11f
[ 308.085549] [<c01c250f>] journal_stop+0x1a1/0x22a
[ 308.090364] [<c01c2fce>] journal_force_commit+0x1d/0x20
[ 308.095699] [<c01bac1e>] ext3_force_commit+0x24/0x26
[ 308.100774] [<c01b50ea>] ext3_write_inode+0x2d/0x3b
[ 308.105771] [<c0186cf0>] __writeback_single_inode+0x2df/0x3a9
[ 308.111633] [<c0187641>] sync_inode+0x15/0x38
[ 308.116093] [<c01b1695>] ext3_sync_file+0xbd/0xc8
[ 308.120900] [<c0189a09>] do_fsync+0x58/0x8b
[ 308.125188] [<c0189a5c>] __do_fsync+0x20/0x2f
[ 308.129656] [<c0189a7b>] sys_fdatasync+0x10/0x12
[ 308.134384] [<c0103eec>] sysenter_past_esp+0x5d/0x81
[ 308.139441] =======================
[ 311.755953] bonnie D 00000046 0 6146 5929 (NOTLB)
[ 311.761929] e7622a60 00200082 04d7e5fe 00000046 03332bd5 00000000 e7622000 c02c0c54
[ 311.770244] d8eaabcc e7622a64 f7d0c3ec 04d7e521 00000046 04d7e5fe 00000046 00000000
[ 311.778758] e7622aa4 f4f94400 e7622aac e7622a68 c04a2f06 e7622a70 c018b105 e7622a8c
[ 311.787261] Call Trace:
[ 311.789904] [<c04a2f06>] io_schedule+0xe/0x16
[ 311.794373] [<c018b105>] sync_buffer+0x2e/0x32
[ 311.798927] [<c04a3756>] __wait_on_bit_lock+0x3f/0x62
[ 311.804089] [<c04a37d8>] out_of_line_wait_on_bit_lock+0x5f/0x67
[ 311.810115] [<c018b248>] __lock_buffer+0x2b/0x31
[ 311.814846] [<c018bb56>] sync_dirty_buffer+0x88/0xc3
[ 311.819921] [<c01c3ce4>] journal_dirty_data+0x1dd/0x205
[ 311.825256] [<c01b3300>] ext3_journal_dirty_data+0x12/0x37
[ 311.830858] [<c01b333a>] journal_dirty_data_fn+0x15/0x1c
[ 311.836280] [<c01b277d>] walk_page_buffers+0x36/0x68
[ 311.841347] [<c01b552f>] ext3_ordered_writepage+0x11a/0x191
[ 311.847027] [<c0152133>] generic_writepages+0x1f3/0x305
[ 311.852344] [<c015227c>] do_writepages+0x37/0x39
[ 311.857064] [<c0186aa7>] __writeback_single_inode+0x96/0x3a9
[ 311.862842] [<c0187037>] sync_sb_inodes+0x1bc/0x27f
[ 311.867830] [<c01875e3>] writeback_inodes+0x98/0xe1
[ 311.872819] [<c015240a>] balance_dirty_pages_ratelimited_nr+0xc4/0x1bf
[ 311.879461] [<c014df55>] generic_file_buffered_write+0x32e/0x677
[ 311.885576] [<c014e580>] __generic_file_aio_write_nolock+0x2e2/0x57f
[ 311.892044] [<c014e87d>] generic_file_aio_write+0x60/0xd4
[ 311.897553] [<c01b14f7>] ext3_file_write+0x27/0xa5
[ 311.902455] [<c016ab7b>] do_sync_write+0xcd/0x103
[ 311.907270] [<c016b37a>] vfs_write+0xa8/0x128
[ 311.911738] [<c016b873>] sys_write+0x3d/0x64
[ 311.916111] [<c0103eec>] sysenter_past_esp+0x5d/0x81
[ 311.921185] =======================
[ 311.924763] pdflush D 00000046 0 6147 5 (L-TLB)
[ 311.930739] ec7e2ef0 00000046 03f2b0ea 00000046 ec7e2f0c c0186b45 ec7e2000 c01169b6
[ 311.939052] ea14069c ec7e2f00 ec7e2f00 03f2afc9 00000046 03f2b0ea 00000046 00000282
[ 311.947557] ec7e2f00 ffffab4c ec7e2f30 ec7e2f20 c04a3689 00000400 00000840 c0681648
[ 311.956062] Call Trace:
[ 311.958703] [<c04a3689>] schedule_timeout+0x44/0xa4
[ 311.963683] [<c04a2eda>] io_schedule_timeout+0xe/0x16
[ 311.968827] [<c01568c7>] congestion_wait+0x4c/0x61
[ 311.973721] [<c0152603>] background_writeout+0x2f/0x8f
[ 311.978969] [<c0152b68>] pdflush+0xe7/0x1d6
[ 311.983255] [<c012e8fb>] kthread+0xc5/0xc9
[ 311.987465] [<c0104aab>] kernel_thread_helper+0x7/0x1c
[ 1421.790647] amarokapp D 00000148 0 6428 1 (NOTLB)
[ 1421.796620] e303ce4c 00000082 823c9fd2 00000148 00000148 ee1a3030 e303c000 00000001
[ 1421.804944] ee1a316c dfef91d0 00000000 823c9f15 00000148 823c9fd2 00000148 00000246
[ 1421.813447] dfef91c0 dfef9180 0035465c e303ce80 c01c7ab8 00000000 e303c000 dfef91d0
[ 1421.821962] Call Trace:
[ 1421.824603] [<c01c7ab8>] log_wait_commit+0x9d/0x11f
[ 1421.829582] [<c01c250f>] journal_stop+0x1a1/0x22a
[ 1421.834397] [<c01c2fce>] journal_force_commit+0x1d/0x20
[ 1421.839732] [<c01bac1e>] ext3_force_commit+0x24/0x26
[ 1421.844790] [<c01b50ea>] ext3_write_inode+0x2d/0x3b
[ 1421.849771] [<c0186cf0>] __writeback_single_inode+0x2df/0x3a9
[ 1421.855633] [<c0187641>] sync_inode+0x15/0x38
[ 1421.860103] [<c01b1695>] ext3_sync_file+0xbd/0xc8
[ 1421.864918] [<c0189a09>] do_fsync+0x58/0x8b
[ 1421.869213] [<c0189a5c>] __do_fsync+0x20/0x2f
[ 1421.873673] [<c0189a8a>] sys_fsync+0xd/0xf
[ 1421.877874] [<c0103eec>] sysenter_past_esp+0x5d/0x81
[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 13535 bytes --]
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 7:59 [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation) Mike Galbraith
@ 2007-04-27 8:33 ` Andrew Morton
2007-04-27 9:23 ` Mike Galbraith
` (3 more replies)
2007-04-27 15:18 ` Linus Torvalds
1 sibling, 4 replies; 66+ messages in thread
From: Andrew Morton @ 2007-04-27 8:33 UTC (permalink / raw)
To: Mike Galbraith; +Cc: LKML, Linus Torvalds, Jens Axboe, linux-ext4
On Fri, 27 Apr 2007 09:59:27 +0200 Mike Galbraith <efault@gmx.de> wrote:
> Greetings,
>
> As subject states, my GUI is going away for extended periods of time
> when my very full and likely highly fragmented (how to find out)
> filesystem is under heavy write load. While write is under way, if
> amarok (mp3 player) is running, no song change will occur until write is
> finished, and the GUI can go _entirely_ comatose for very long periods.
> Usually, it will come back to life after write is finished, but
> occasionally, a complete GUI restart is necessary.
I'd be suspecting a GUI bug if a restart is necessary. Perhaps it went to
lunch for so long in the kernel that some time-based thing went bad.
> The longest comatose period to date was ~20 minutes with 2.6.20.7 a few
> days ago. I was letting SuSE's software update programs update my SuSE
> 10.2 system, and started a bonnie while it was running (because I had
> been seeing this on recent kernels, and wanted to see if it was in
> stable as well), WHAM, instant dead GUI. When this happens, kbd and
> mouse events work fine, so I hot-keyed to a VT via CTRL+ALT+F1, and
> killed the bonnie. No joy, GUI stayed utterly comatose until the
> updater finished roughly 20 minutes later, at which time the shells I'd
> tried to start popped up, and all worked as if nothing bad had ever
> happened. During the time in between, no window could be brought into
> focus, nada.
>
> While a bonnie is writing, if I poke KDE's menu button, that will
> instantly trigger nastiness, and a trace (this one was with a cfs
> kernel, but I just did same with virgin 2.6.21) shows that "kicker",
> KDE's launcher proggy does an fdatasync for some reason, and that's the
> end of its world for ages. When clicking on amarok's icon, it does an
> fsync, and that's the last thing that will happen in its world until
> write is done as well. I've repeated this with CFQ and AS IO
> schedulers.
Well that all sucks.
> I have a couple of old kernels lying around that I can test with, but I
> think it's going to be the same. Seems to be ext3's journal that is
> causing my woes. Below this trace of kicker is one of amarok during
> its dead-to-the-world time.
>
> Box is 3GHz P4 intel ICH5, SMP/UP doesn't matter. The .config of the latest
> kernel tested is attached. Mount options are
> noatime,nodiratime,acl,user_xattr.
>
>
> [ 308.046646] kicker D 00000044 0 5897 1 (NOTLB)
> [ 308.052611] f32abe4c 00200082 83398b5a 00000044 c01c251e f32ab000 f32ab000 c01169b6
> [ 308.060926] f772fbcc cdc7e694 00000039 8339857a 00000044 83398b5a 00000044 00000000
> [ 308.069422] c1b5ab00 c1b5aac0 00353928 f32abe80 c01c7ab8 00000000 f32ab000 c1b5ab10
> [ 308.077927] Call Trace:
> [ 308.080568] [<c01c7ab8>] log_wait_commit+0x9d/0x11f
> [ 308.085549] [<c01c250f>] journal_stop+0x1a1/0x22a
> [ 308.090364] [<c01c2fce>] journal_force_commit+0x1d/0x20
> [ 308.095699] [<c01bac1e>] ext3_force_commit+0x24/0x26
> [ 308.100774] [<c01b50ea>] ext3_write_inode+0x2d/0x3b
> [ 308.105771] [<c0186cf0>] __writeback_single_inode+0x2df/0x3a9
> [ 308.111633] [<c0187641>] sync_inode+0x15/0x38
> [ 308.116093] [<c01b1695>] ext3_sync_file+0xbd/0xc8
> [ 308.120900] [<c0189a09>] do_fsync+0x58/0x8b
> [ 308.125188] [<c0189a5c>] __do_fsync+0x20/0x2f
> [ 308.129656] [<c0189a7b>] sys_fdatasync+0x10/0x12
> [ 308.134384] [<c0103eec>] sysenter_past_esp+0x5d/0x81
> [ 308.139441] =======================
Right. One possibility here is that bonnie is stuffing new dirty blocks
onto the committing transaction's ordered-data list and JBD commit is
livelocking. Only we're not supposed to be putting those blocks on that
list.
Another livelock possibility is that bonnie is redirtying pages faster than
commit can write them out, so commit got livelocked:
When I was doing the original port-from-2.2 I found that an application
which does
	for ( ; ; )
		pwrite(fd, "", 1, 0);
would permanently livelock the fs. I fixed that, but it was six years ago,
and perhaps we later unfixed it.
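A minimal sketch of that test as a bounded helper, for anyone who wants to repeat it without leaving an endless loop running. The function name `redirty_first_byte` and the iteration cap are assumptions for illustration; the original loop is unbounded and used an already-open `fd`.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <unistd.h>

/*
 * Rewrite the first byte of `path` `iterations` times, redirtying the
 * same page over and over. Returns the number of successful writes,
 * or -1 if the file could not be opened.
 * (Hypothetical helper: name and iteration cap are illustrative only.)
 */
long redirty_first_byte(const char *path, long iterations)
{
	int fd = open(path, O_WRONLY | O_CREAT, 0644);
	long done = 0;

	if (fd < 0)
		return -1;

	for (long i = 0; i < iterations; i++) {
		/* Same call as the loop above: one NUL byte at offset 0. */
		if (pwrite(fd, "", 1, 0) != 1)
			break;
		done++;
	}

	close(fd);
	return done;
}
```

Point it at a file on the ext3 filesystem under test and watch whether other tasks doing fsync() on that filesystem still make progress while it runs.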
It would be most interesting to try data=writeback.
> [ 311.755953] bonnie D 00000046 0 6146 5929 (NOTLB)
> [ 311.761929] e7622a60 00200082 04d7e5fe 00000046 03332bd5 00000000 e7622000 c02c0c54
> [ 311.770244] d8eaabcc e7622a64 f7d0c3ec 04d7e521 00000046 04d7e5fe 00000046 00000000
> [ 311.778758] e7622aa4 f4f94400 e7622aac e7622a68 c04a2f06 e7622a70 c018b105 e7622a8c
> [ 311.787261] Call Trace:
> [ 311.789904] [<c04a2f06>] io_schedule+0xe/0x16
> [ 311.794373] [<c018b105>] sync_buffer+0x2e/0x32
> [ 311.798927] [<c04a3756>] __wait_on_bit_lock+0x3f/0x62
> [ 311.804089] [<c04a37d8>] out_of_line_wait_on_bit_lock+0x5f/0x67
> [ 311.810115] [<c018b248>] __lock_buffer+0x2b/0x31
> [ 311.814846] [<c018bb56>] sync_dirty_buffer+0x88/0xc3
> [ 311.819921] [<c01c3ce4>] journal_dirty_data+0x1dd/0x205
> [ 311.825256] [<c01b3300>] ext3_journal_dirty_data+0x12/0x37
> [ 311.830858] [<c01b333a>] journal_dirty_data_fn+0x15/0x1c
> [ 311.836280] [<c01b277d>] walk_page_buffers+0x36/0x68
> [ 311.841347] [<c01b552f>] ext3_ordered_writepage+0x11a/0x191
> [ 311.847027] [<c0152133>] generic_writepages+0x1f3/0x305
> [ 311.852344] [<c015227c>] do_writepages+0x37/0x39
> [ 311.857064] [<c0186aa7>] __writeback_single_inode+0x96/0x3a9
> [ 311.862842] [<c0187037>] sync_sb_inodes+0x1bc/0x27f
> [ 311.867830] [<c01875e3>] writeback_inodes+0x98/0xe1
> [ 311.872819] [<c015240a>] balance_dirty_pages_ratelimited_nr+0xc4/0x1bf
> [ 311.879461] [<c014df55>] generic_file_buffered_write+0x32e/0x677
> [ 311.885576] [<c014e580>] __generic_file_aio_write_nolock+0x2e2/0x57f
> [ 311.892044] [<c014e87d>] generic_file_aio_write+0x60/0xd4
> [ 311.897553] [<c01b14f7>] ext3_file_write+0x27/0xa5
> [ 311.902455] [<c016ab7b>] do_sync_write+0xcd/0x103
> [ 311.907270] [<c016b37a>] vfs_write+0xa8/0x128
> [ 311.911738] [<c016b873>] sys_write+0x3d/0x64
> [ 311.916111] [<c0103eec>] sysenter_past_esp+0x5d/0x81
That's normal. But bonnie _is_ blocking here, so it obviously cannot be
dirtying buffers while it's doing that.
> [ 311.921185] =======================
> [ 311.924763] pdflush D 00000046 0 6147 5 (L-TLB)
> [ 311.930739] ec7e2ef0 00000046 03f2b0ea 00000046 ec7e2f0c c0186b45 ec7e2000 c01169b6
> [ 311.939052] ea14069c ec7e2f00 ec7e2f00 03f2afc9 00000046 03f2b0ea 00000046 00000282
> [ 311.947557] ec7e2f00 ffffab4c ec7e2f30 ec7e2f20 c04a3689 00000400 00000840 c0681648
> [ 311.956062] Call Trace:
> [ 311.958703] [<c04a3689>] schedule_timeout+0x44/0xa4
> [ 311.963683] [<c04a2eda>] io_schedule_timeout+0xe/0x16
> [ 311.968827] [<c01568c7>] congestion_wait+0x4c/0x61
> [ 311.973721] [<c0152603>] background_writeout+0x2f/0x8f
> [ 311.978969] [<c0152b68>] pdflush+0xe7/0x1d6
> [ 311.983255] [<c012e8fb>] kthread+0xc5/0xc9
> [ 311.987465] [<c0104aab>] kernel_thread_helper+0x7/0x1c
OK, that's normal.
> [ 1421.790647] amarokapp D 00000148 0 6428 1 (NOTLB)
> [ 1421.796620] e303ce4c 00000082 823c9fd2 00000148 00000148 ee1a3030 e303c000 00000001
> [ 1421.804944] ee1a316c dfef91d0 00000000 823c9f15 00000148 823c9fd2 00000148 00000246
> [ 1421.813447] dfef91c0 dfef9180 0035465c e303ce80 c01c7ab8 00000000 e303c000 dfef91d0
> [ 1421.821962] Call Trace:
> [ 1421.824603] [<c01c7ab8>] log_wait_commit+0x9d/0x11f
> [ 1421.829582] [<c01c250f>] journal_stop+0x1a1/0x22a
> [ 1421.834397] [<c01c2fce>] journal_force_commit+0x1d/0x20
> [ 1421.839732] [<c01bac1e>] ext3_force_commit+0x24/0x26
> [ 1421.844790] [<c01b50ea>] ext3_write_inode+0x2d/0x3b
> [ 1421.849771] [<c0186cf0>] __writeback_single_inode+0x2df/0x3a9
> [ 1421.855633] [<c0187641>] sync_inode+0x15/0x38
> [ 1421.860103] [<c01b1695>] ext3_sync_file+0xbd/0xc8
> [ 1421.864918] [<c0189a09>] do_fsync+0x58/0x8b
> [ 1421.869213] [<c0189a5c>] __do_fsync+0x20/0x2f
> [ 1421.873673] [<c0189a8a>] sys_fsync+0xd/0xf
> [ 1421.877874] [<c0103eec>] sysenter_past_esp+0x5d/0x81
>
hm, fsync.
Aside: why the heck do applications think that their data is so important
that they need to fsync it all the time. I used to run a kernel on my
laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
pleasurable.
But wedging for 20 minutes is probably excessive punishment.
Bottom line: no idea. Please see if data=writeback changes things, maybe
try the pwrite() loop.
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 8:33 ` Andrew Morton
@ 2007-04-27 9:23 ` Mike Galbraith
2007-04-27 10:17 ` Mike Galbraith
` (2 subsequent siblings)
3 siblings, 0 replies; 66+ messages in thread
From: Mike Galbraith @ 2007-04-27 9:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Linus Torvalds, Jens Axboe, linux-ext4
On Fri, 2007-04-27 at 01:33 -0700, Andrew Morton wrote:
> On Fri, 27 Apr 2007 09:59:27 +0200 Mike Galbraith <efault@gmx.de> wrote:
>
> > Greetings,
> >
> > As subject states, my GUI is going away for extended periods of time
> > when my very full and likely highly fragmented (how to find out)
> > filesystem is under heavy write load. While write is under way, if
> > amarok (mp3 player) is running, no song change will occur until write is
> > finished, and the GUI can go _entirely_ comatose for very long periods.
> > Usually, it will come back to life after write is finished, but
> > occasionally, a complete GUI restart is necessary.
>
> I'd be suspecting a GUI bug if a restart is necessary. Perhaps it went to
> lunch for so long in the kernel that some time-based thing went bad.
Yeah, there have been some KDE updates, maybe something went south. I
know for sure that nothing this horrible used to happen during IO. But
then when I used to regularly test IO, my disk heads didn't have to
traverse nearly as much either.
> Right. One possibility here is that bonnie is stuffing new dirty blocks
> onto the committing transaction's ordered-data list and JBD commit is
> livelocking. Only we're not supposed to be putting those blocks on that
> list.
>
> Another livelock possibility is that bonnie is redirtying pages faster than
> commit can write them out, so commit got livelocked:
>
> When I was doing the original port-from-2.2 I found that an application
> which does
>
> 	for ( ; ; )
> 		pwrite(fd, "", 1, 0);
>
> would permanently livelock the fs. I fixed that, but it was six years ago,
> and perhaps we later unfixed it.
I'll try that.
> It would be most interesting to try data=writeback.
Seems somewhat better, but nothing close to tolerable. I still had to
hot-key to a VT and kill the bonnie.
> hm, fsync.
>
> Aside: why the heck do applications think that their data is so important
> that they need to fsync it all the time. I used to run a kernel on my
> laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> pleasurable.
I thought unkind thoughts when I saw those traces :)
Thanks,
-Mike
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 8:33 ` Andrew Morton
2007-04-27 9:23 ` Mike Galbraith
@ 2007-04-27 10:17 ` Mike Galbraith
2007-04-27 11:59 ` Marat Buharov
2007-04-28 20:46 ` Mikulas Patocka
3 siblings, 0 replies; 66+ messages in thread
From: Mike Galbraith @ 2007-04-27 10:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Linus Torvalds, Jens Axboe, linux-ext4
On Fri, 2007-04-27 at 01:33 -0700, Andrew Morton wrote:
> Another livelock possibility is that bonnie is redirtying pages faster than
> commit can write them out, so commit got livelocked:
>
> When I was doing the original port-from-2.2 I found that an application
> which does
>
> 	for ( ; ; )
> 		pwrite(fd, "", 1, 0);
>
> would permanently livelock the fs. I fixed that, but it was six years ago,
> and perhaps we later unfixed it.
Well, box doesn't seem the least bit upset after quite a while now, so I
guess it didn't get unfixed.
-Mike
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 8:33 ` Andrew Morton
2007-04-27 9:23 ` Mike Galbraith
2007-04-27 10:17 ` Mike Galbraith
@ 2007-04-27 11:59 ` Marat Buharov
2007-04-27 12:30 ` Peter Zijlstra
` (2 more replies)
2007-04-28 20:46 ` Mikulas Patocka
3 siblings, 3 replies; 66+ messages in thread
From: Marat Buharov @ 2007-04-27 11:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Mike Galbraith, LKML, Linus Torvalds, Jens Axboe, linux-ext4
On 4/27/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> Aside: why the heck do applications think that their data is so important
> that they need to fsync it all the time. I used to run a kernel on my
> laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> pleasurable.
So, if having fake fsync() and fdatasync() is pleasurable for laptop
and desktop, maybe it's time to add an option to Kconfig which
disables normal fsync behaviour in favor of a robust desktop?
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 11:59 ` Marat Buharov
@ 2007-04-27 12:30 ` Peter Zijlstra
2007-04-27 13:50 ` Mark Lord
2007-04-27 12:39 ` Manoj Joseph
2007-04-27 15:30 ` Linus Torvalds
2 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2007-04-27 12:30 UTC (permalink / raw)
To: Marat Buharov
Cc: Andrew Morton, Mike Galbraith, LKML, Linus Torvalds, Jens Axboe,
linux-ext4
On Fri, 2007-04-27 at 15:59 +0400, Marat Buharov wrote:
> On 4/27/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> > Aside: why the heck do applications think that their data is so important
> > that they need to fsync it all the time. I used to run a kernel on my
> > laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> > pleasurable.
>
> So, if having fake fsync() and fdatasync() is pleasurable for laptop
> and desktop, maybe it's time to add an option to Kconfig which
> disables normal fsync behaviour in favor of a robust desktop?
Nah, just teaching user-space to behave themselves should be sufficient;
there is just no way kicker can justify doing a fdatasync(), I mean,
come on, it's just showing a friggin menu. I have always wondered why that
thing was so damn slow, like it needs to fetch stuff like that from all
four corners of disk, feh!
Just sliding over a sub-menu can take more than a second; I mean, it
_really_ is just faster to just start things from your favourite shell.
No way is globally disabling fsync() a good thing. I guess Andrew just
is a sucker for punishment :-)
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 11:59 ` Marat Buharov
2007-04-27 12:30 ` Peter Zijlstra
@ 2007-04-27 12:39 ` Manoj Joseph
2007-04-27 15:30 ` Linus Torvalds
2 siblings, 0 replies; 66+ messages in thread
From: Manoj Joseph @ 2007-04-27 12:39 UTC (permalink / raw)
To: Marat Buharov
Cc: Andrew Morton, Mike Galbraith, LKML, Linus Torvalds, Jens Axboe,
linux-ext4
Marat Buharov wrote:
> On 4/27/07, Andrew Morton <akpm@linux-foundation.org> wrote:
>> Aside: why the heck do applications think that their data is so important
>> that they need to fsync it all the time. I used to run a kernel on my
>> laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
>> pleasurable.
>
> So, if having fake fsync() and fdatasync() is pleasurable for laptop
> and desktop, maybe it's time to add an option to Kconfig which
> disables normal fsync behaviour in favor of a robust desktop?
Sure, a noop fsync/fdatasync would speed up some things. And I am sure
Andrew Morton knew what he was doing and the consequences.
But unless you care nothing about your data, you should not do it. It is
as simple as that. No, it does not give you a robust desktop!!
-Manoj
--
Manoj Joseph
http://kerneljunkie.blogspot.com/
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 12:30 ` Peter Zijlstra
@ 2007-04-27 13:50 ` Mark Lord
0 siblings, 0 replies; 66+ messages in thread
From: Mark Lord @ 2007-04-27 13:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Marat Buharov, Andrew Morton, Mike Galbraith, LKML,
Linus Torvalds, Jens Axboe, linux-ext4
Peter Zijlstra wrote:
>
> No way is globally disabling fsync() a good thing. I guess Andrew just
> is a sucker for punishment :-)
Mmm... perhaps another nice thing to include in laptop-mode operation?
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 7:59 [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation) Mike Galbraith
2007-04-27 8:33 ` Andrew Morton
@ 2007-04-27 15:18 ` Linus Torvalds
2007-04-27 15:41 ` John Anthony Kazos Jr.
` (6 more replies)
1 sibling, 7 replies; 66+ messages in thread
From: Linus Torvalds @ 2007-04-27 15:18 UTC (permalink / raw)
To: Mike Galbraith; +Cc: LKML, Andrew Morton, Jens Axboe
On Fri, 27 Apr 2007, Mike Galbraith wrote:
>
> As subject states, my GUI is going away for extended periods of time
> when my very full and likely highly fragmented (how to find out)
> filesystem is under heavy write load. While write is under way, if
> amarok (mp3 player) is running, no song change will occur until write is
> finished, and the GUI can go _entirely_ comatose for very long periods.
> Usually, it will come back to life after write is finished, but
> occasionally, a complete GUI restart is necessary.
One thing to try out (and dammit, I should make it the default now in
2.6.21) is to just make the dirty limits much lower. We've been talking
about this for ages, I think this might be the right time to do it.
Especially with lots of memory, allowing 40% of that memory to be dirty is
just insane (even if we limit it to "just" 40% of the normal memory zone).
That can be gigabytes. And no amount of IO scheduling will make it
pleasant to try to handle the situation where that much memory is dirty.
So I do believe that we could probably do something about the IO
scheduling _too_:
- break up large write requests (yeah, it will make for worse IO
throughput, but if we make it configurable, and especially with
controllers that don't have insane overheads per command, the
difference between 128kB requests and 16MB requests is probably not
really even noticeable - SCSI things with large per-command overheads
are just stupid)
Generating huge requests will automatically mean that they are
"unbreakable" from an IO scheduler perspective, so it's bad for latency
for other requests once they've started.
- maybe be more aggressive about prioritizing reads over writes.
but in the meantime, what happens if you apply this patch?
Actually, you don't need to apply the patch - just do
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio
and say if it seems to improve things. I think those are much saner
defaults especially for a desktop system (and probably for most servers
too, for that matter).
Even 10% of memory dirty can be a whole lot of RAM, but it should
hopefully be _better_ than the insane default we have now.
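To put rough numbers on those ratios, here is a small arithmetic sketch. It assumes the percentage is applied to a single memory figure in bytes; the kernel's real calculation uses its own notion of dirtyable memory, so treat the helper `dirty_threshold_bytes` (a hypothetical name) as back-of-the-envelope only.

```c
#include <stdint.h>

/*
 * Bytes of dirty page cache permitted when `ratio_percent` of
 * `mem_bytes` may be dirty. Illustrative arithmetic only; this is not
 * the kernel's own computation.
 */
uint64_t dirty_threshold_bytes(uint64_t mem_bytes, unsigned ratio_percent)
{
	return mem_bytes * ratio_percent / 100;
}
```

On a hypothetical box with 2 GiB of memory, 40% works out to roughly 819 MiB of dirty data, while 10% is roughly 205 MiB.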
Historical note: allowing about half of memory to contain dirty pages made
more sense back in the days when people had 16-64MB of memory, and a
single untar of even fairly small projects would otherwise hit the disk.
But memory sizes have grown *much* more quickly than disk speeds (and
latency requirements have gone down, not up), so a default that may
actually have been perfectly fine at some point seems crazy these days..
Linus
---
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f469e3c..a794945 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -67,12 +67,12 @@ static inline long sync_writeback_pages(void)
 /*
  * Start background writeback (via pdflush) at this percentage
  */
-int dirty_background_ratio = 10;
+int dirty_background_ratio = 5;
 
 /*
  * The generator of dirty data starts writeback at this percentage
  */
-int vm_dirty_ratio = 40;
+int vm_dirty_ratio = 10;
 
 /*
  * The interval between `kupdate'-style writebacks, in jiffies
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 11:59 ` Marat Buharov
2007-04-27 12:30 ` Peter Zijlstra
2007-04-27 12:39 ` Manoj Joseph
@ 2007-04-27 15:30 ` Linus Torvalds
2007-04-27 19:31 ` Andreas Dilger
2007-04-28 8:44 ` Matthias Andree
2 siblings, 2 replies; 66+ messages in thread
From: Linus Torvalds @ 2007-04-27 15:30 UTC (permalink / raw)
To: Marat Buharov; +Cc: Andrew Morton, Mike Galbraith, LKML, Jens Axboe, linux-ext4
On Fri, 27 Apr 2007, Marat Buharov wrote:
>
> On 4/27/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> > Aside: why the heck do applications think that their data is so important
> > that they need to fsync it all the time. I used to run a kernel on my
> > laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> > pleasurable.
>
> So, if having fake fsync() and fdatasync() is pleasurable for laptop
> and desktop, maybe it's time to add an option to Kconfig which
> disables normal fsync behaviour in favor of a robust desktop?
This really is an ext3 issue, not "fsync()".
On a good filesystem, when you do "fsync()" on a file, nothing at all
happens to any other files. On ext3, it seems to sync the global journal,
which means that just about *everything* that writes even a single byte
(well, at least anything journalled, which would be all the normal
directory ops etc) to disk will just *stop* dead cold!
It's horrid. And it really is ext3, not "fsync()".
I used to run reiserfs, and it had its problems, but this was the
"feature" of ext3 that I've disliked most. If you run a MUA with local
mail, it will do fsync's for most things, and things really hiccup if you
are doing some other writes at the same time. In contrast, with reiser, if
you did a big untar or some other big write, if somebody fsync'ed a small
file, it wasn't even a blip on the radar - the fsync would sync just that
small thing.
Maybe I'm wrong on the exact details (I'm not really up on the ext3
journal handling ;^), but you don't even have to know about any internals
at all: you can just test it. Gaak.
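One way such a test might look, as a sketch: time a single fsync() of a tiny file, once on an idle filesystem and once while a big write is running on the same filesystem. The helper name `fsync_latency_ms` and the record contents are assumptions for illustration, not anything from this thread.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/*
 * Append one small record to `path`, fsync() it, and return how long
 * the fsync() call alone took, in milliseconds. Returns -1.0 on error.
 * (Hypothetical helper for illustration.)
 */
double fsync_latency_ms(const char *path)
{
	static const char record[] = "tiny record\n";
	struct timespec start, end;
	int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
	int rc;

	if (fd < 0)
		return -1.0;
	if (write(fd, record, sizeof(record) - 1) != (ssize_t)(sizeof(record) - 1)) {
		close(fd);
		return -1.0;
	}

	/* Time only the fsync(), not the open() or write(). */
	clock_gettime(CLOCK_MONOTONIC, &start);
	rc = fsync(fd);
	clock_gettime(CLOCK_MONOTONIC, &end);
	close(fd);

	if (rc != 0)
		return -1.0;
	return (end.tv_sec - start.tv_sec) * 1000.0 +
	       (end.tv_nsec - start.tv_nsec) / 1e6;
}
```

If the fsync really only syncs that one small file, the two measurements should be in the same ballpark; a jump from milliseconds to seconds or minutes under load would match the behaviour described above.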
Linus
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 15:18 ` Linus Torvalds
@ 2007-04-27 15:41 ` John Anthony Kazos Jr.
2007-04-27 15:54 ` Linus Torvalds
2007-04-27 18:31 ` Andrew Morton
` (5 subsequent siblings)
6 siblings, 1 reply; 66+ messages in thread
From: John Anthony Kazos Jr. @ 2007-04-27 15:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Mike Galbraith, LKML, Andrew Morton, Jens Axboe
> One thing to try out (and dammit, I should make it the default now in
> 2.6.21) is to just make the dirty limits much lower. We've been talking
> about this for ages, I think this might be the right time to do it.
Could[/should] this stuff be changed from ratios to amounts? Or a quick
boot-time test to use a ratio if the memory is small and an amount (like
tax brackets, I would expect) if it's great?
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 15:41 ` John Anthony Kazos Jr.
@ 2007-04-27 15:54 ` Linus Torvalds
2007-04-27 16:24 ` Chuck Ebbert
2007-04-27 19:43 ` Marko Macek
0 siblings, 2 replies; 66+ messages in thread
From: Linus Torvalds @ 2007-04-27 15:54 UTC (permalink / raw)
To: John Anthony Kazos Jr.; +Cc: Mike Galbraith, LKML, Andrew Morton, Jens Axboe
On Fri, 27 Apr 2007, John Anthony Kazos Jr. wrote:
>
> Could[/should] this stuff be changed from ratios to amounts? Or a quick
> boot-time test to use a ratio if the memory is small and an amount (like
> tax brackets, I would expect) if it's great?
Yes, the "percentage" thing was likely wrong. That said, there *is* some
correlation between "lots of memory" and "high-end machine", and that in
turn tends to correlate with "fast disk", so I don't think the percentage
approach is really *horribly* wrong.
The main issue with the percentage is that we do export them as such
through the /proc/ interface, and they are easy to change and understand.
So changing them to amounts is non-trivial if you also want to support the
old interfaces - and the advantage isn't obvious enough that it's a
clear-cut case.
Linus
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 15:54 ` Linus Torvalds
@ 2007-04-27 16:24 ` Chuck Ebbert
2007-04-27 19:43 ` Marko Macek
1 sibling, 0 replies; 66+ messages in thread
From: Chuck Ebbert @ 2007-04-27 16:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: John Anthony Kazos Jr., Mike Galbraith, LKML, Andrew Morton, Jens Axboe
Linus Torvalds wrote:
>
> On Fri, 27 Apr 2007, John Anthony Kazos Jr. wrote:
>> Could[/should] this stuff be changed from ratios to amounts? Or a quick
>> boot-time test to use a ratio if the memory is small and an amount (like
>> tax brackets, I would expect) if it's great?
>
> Yes, the "percentage" thing was likely wrong. That said, there *is* some
> correlation between "lots of memory" and "high-end machine", and that in
> turn tends to correlate with "fast disk", so I don't think the percentage
> approach is really *horribly* wrong.
>
> The main issue with the percentage is that we do export them as such
> through the /proc/ interface, and they are easy to change and understand.
> So changing them to amounts is non-trivial if you also want to support the
> old interfaces - and the advantage isn't obvious enough that it's a
> clear-cut case.
>
We could add a new "limit" field, though. If it defaulted to 0 (unlimited)
the default behavior wouldn't change.
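
A minimal sketch of that idea in Python, not kernel code. The function name, the `limit_bytes` parameter and the use of total memory as the base are assumptions for illustration; only the percentage tunable and the "0 means unlimited" default come from the discussion above.

```python
def dirty_threshold_bytes(memory_bytes, ratio_percent, limit_bytes=0):
    """Return how many bytes may be dirty before writers get throttled.

    ratio_percent mirrors the existing percentage tunable.
    limit_bytes is the hypothetical new field; 0 means "no absolute cap".
    """
    by_ratio = memory_bytes * ratio_percent // 100
    if limit_bytes == 0:
        return by_ratio                  # default: behaviour is unchanged
    return min(by_ratio, limit_bytes)    # cap only ever lowers the threshold


MiB = 1024 * 1024
# Small machine: the ratio is the binding constraint.
print(dirty_threshold_bytes(64 * MiB, 10, limit_bytes=256 * MiB) // MiB)
# Large machine: the absolute cap takes over.
print(dirty_threshold_bytes(8192 * MiB, 10, limit_bytes=256 * MiB) // MiB)
```

Because the cap is applied with min(), old users of the percentage interface would see no difference until someone sets the new field.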
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 15:18 ` Linus Torvalds
2007-04-27 15:41 ` John Anthony Kazos Jr.
@ 2007-04-27 18:31 ` Andrew Morton
2007-04-27 19:09 ` Zan Lynx
` (2 more replies)
2007-04-27 19:28 ` Mike Galbraith
` (4 subsequent siblings)
6 siblings, 3 replies; 66+ messages in thread
From: Andrew Morton @ 2007-04-27 18:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Mike Galbraith, LKML, Jens Axboe
On Fri, 27 Apr 2007 08:18:34 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> echo 5 > /proc/sys/vm/dirty_background_ratio
> echo 10 > /proc/sys/vm/dirty_ratio
That'll help a lot.
ext3's problem here is that a single fsync() requires that ext3 sync the
whole filesystem. Because
- a journal commit can contain metadata from multiple files, and if we
want to journal one file's metadata via fsync(), we unavoidably journal
all the other files' metadata at the same time.
- ordered mode requires that we write a file's data blocks prior to
journalling the metadata which refers to those blocks.
net result: syncing anything syncs the whole world.
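
The two points above can be illustrated with a toy model in Python. This is a sketch of the reasoning only, not of the ext3/JBD code: the class, its fields and the file names are made up, and it assumes every write allocates blocks and so adds metadata to the single running transaction.

```python
from collections import defaultdict


class ToyOrderedJournal:
    """Toy model: one running transaction shared by every file."""

    def __init__(self):
        self.dirty_data = defaultdict(int)  # file name -> dirty data bytes
        self.txn_metadata = set()           # files with metadata in the transaction

    def write(self, name, nbytes):
        # Assumption: each write allocates blocks, so it journals metadata.
        self.dirty_data[name] += nbytes
        self.txn_metadata.add(name)

    def fsync(self, name):
        """Commit the running transaction; return the data bytes flushed first."""
        # The commit carries every file's metadata, and ordered mode wants each
        # of those files' data on disk before the commit record is written.
        # Note that `name` plays no part in deciding what gets flushed.
        flushed = sum(self.dirty_data[f] for f in self.txn_metadata)
        for f in self.txn_metadata:
            self.dirty_data[f] = 0
        self.txn_metadata.clear()
        return flushed


j = ToyOrderedJournal()
j.write("bonnie.big", 500 * 1024 * 1024)   # heavy streaming writer
j.write("kickerrc", 4096)                  # small config file (hypothetical name)
print(j.fsync("kickerrc"))                 # 524292096: the small fsync pays for both
```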
There are a few areas in which this could conceivably be tuned up: if a
particular file doesn't currently have any metadata in the commit, we don't
actually need to sync its data blocks: we could just transfer them into
next commit. Hard, unlikely to be of benefit.
Arguably, we could get away without syncing overwritten data blocks. Users
would occasionally see older data than they otherwise would have after a
crash. Could help a bit in some circumstances.
But none of this explains a 20-minute hang, unless a *lot* of fsyncs are
being performed, perhaps.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 18:31 ` Andrew Morton
@ 2007-04-27 19:09 ` Zan Lynx
2007-04-27 22:07 ` Andrew Morton
2007-04-27 19:27 ` Mike Galbraith
2007-04-28 8:51 ` Matthias Andree
2 siblings, 1 reply; 66+ messages in thread
From: Zan Lynx @ 2007-04-27 19:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linus Torvalds, Mike Galbraith, LKML, Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 1233 bytes --]
On Fri, 2007-04-27 at 11:31 -0700, Andrew Morton wrote:
[snip]
> ext3's problem here is that a single fsync() requires that ext3 sync the
> whole filesystem. Because
>
> - a journal commit can contain metadata from multiple files, and if we
> want to journal one file's metadata via fsync(), we unavoidably journal
> all the other files' metadata at the same time.
>
> - ordered mode requires that we write a file's data blocks prior to
> journalling the metadata which refers to those blocks.
>
> net result: syncing anything syncs the whole world.
>
> There are a few areas in which this could conceivably be tuned up: if a
> particular file doesn't currently have any metadata in the commit, we don't
> actually need to sync its data blocks: we could just transfer them into
> next commit. Hard, unlikely to be of benefit.
[snip]
How about mixing the ordered and data journal modes? If the data blocks
would fit, have fsync write them into the journal as is done in
data=journal mode. Then that file data is committed to disk as fsync
requires, but it shouldn't require flushing all the previous metadata to
get an ordered guarantee.
Or so it seems to me.
--
Zan Lynx <zlynx@acm.org>
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 18:31 ` Andrew Morton
2007-04-27 19:09 ` Zan Lynx
@ 2007-04-27 19:27 ` Mike Galbraith
2007-04-28 8:51 ` Matthias Andree
2 siblings, 0 replies; 66+ messages in thread
From: Mike Galbraith @ 2007-04-27 19:27 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linus Torvalds, LKML, Jens Axboe
On Fri, 2007-04-27 at 11:31 -0700, Andrew Morton wrote:
> But none of this explains a 20-minute hang, unless a *lot* of fsyncs are
> being performed, perhaps.
Yes. I need to do a lot more testing. All I see is one, and it's game
over. Bizarre.
-Mike
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 15:18 ` Linus Torvalds
2007-04-27 15:41 ` John Anthony Kazos Jr.
2007-04-27 18:31 ` Andrew Morton
@ 2007-04-27 19:28 ` Mike Galbraith
2007-04-27 20:06 ` Jan Engelhardt
` (3 subsequent siblings)
6 siblings, 0 replies; 66+ messages in thread
From: Mike Galbraith @ 2007-04-27 19:28 UTC (permalink / raw)
To: Linus Torvalds; +Cc: LKML, Andrew Morton, Jens Axboe
On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote:
> Actually, you don't need to apply the patch - just do
>
> echo 5 > /proc/sys/vm/dirty_background_ratio
> echo 10 > /proc/sys/vm/dirty_ratio
>
I'll try this, and do some testing with other kernels as well.
-Mike
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 15:30 ` Linus Torvalds
@ 2007-04-27 19:31 ` Andreas Dilger
2007-04-27 19:44 ` Mike Galbraith
` (2 more replies)
2007-04-28 8:44 ` Matthias Andree
1 sibling, 3 replies; 66+ messages in thread
From: Andreas Dilger @ 2007-04-27 19:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Marat Buharov, Andrew Morton, Mike Galbraith, LKML, Jens Axboe,
linux-ext4, Alex Tomas
On Apr 27, 2007 08:30 -0700, Linus Torvalds wrote:
> On a good filesystem, when you do "fsync()" on a file, nothing at all
> happens to any other files. On ext3, it seems to sync the global journal,
> which means that just about *everything* that writes even a single byte
> (well, at least anything journalled, which would be all the normal
> directory ops etc) to disk will just *stop* dead cold!
>
> It's horrid. And it really is ext3, not "fsync()".
>
> I used to run reiserfs, and it had its problems, but this was the
> "feature" of ext3 that I've disliked most. If you run a MUA with local
> mail, it will do fsync's for most things, and things really hiccup if you
> are doing some other writes at the same time. In contrast, with reiser, if
> you did a big untar or some other big write, if somebody fsync'ed a small
> file, it wasn't even a blip on the radar - the fsync would sync just that
> small thing.
It's true that this is a "feature" of ext3 with data=ordered (the default),
but I suspect the same thing is now true in reiserfs too. The reason is
that if a journal commit doesn't flush the data as well then a crash will
result in garbage (from old deleted files) being visible in the newly
allocated file. People used to complain about this with reiserfs all the
time having corrupt data in new files after a crash, which is why I believe
it was fixed.
There definitely are some problems with the ext3 journal commit though.
If the journal is full it will cause the whole journal to checkpoint out
to the filesystem synchronously even if just space for a small transaction
is needed. That is doubly bad if you have a very large journal. I believe
Alex has a patch to have it checkpoint much smaller chunks to the fs.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 15:54 ` Linus Torvalds
2007-04-27 16:24 ` Chuck Ebbert
@ 2007-04-27 19:43 ` Marko Macek
1 sibling, 0 replies; 66+ messages in thread
From: Marko Macek @ 2007-04-27 19:43 UTC (permalink / raw)
To: Linus Torvalds
Cc: John Anthony Kazos Jr., Mike Galbraith, LKML, Andrew Morton, Jens Axboe
Linus Torvalds wrote:
>
> On Fri, 27 Apr 2007, John Anthony Kazos Jr. wrote:
>> Could[/should] this stuff be changed from ratios to amounts? Or a quick
>> boot-time test to use a ratio if the memory is small and an amount (like
>> tax brackets, I would expect) if it's great?
>
> Yes, the "percentage" thing was likely wrong. That said, there *is* some
> correlation between "lots of memory" and "high-end machine", and that in
> turn tends to correlate with "fast disk", so I don't think the percentage
> approach is really *horribly* wrong.
>
> The main issue with the percentage is that we do export them as such
> through the /proc/ interface, and they are easy to change and understand.
> So changing them to amounts is non-trivial if you also want to support the
> old interfaces - and the advantage isn't obvious enough that it's a
> clear-cut case.
I wonder if it would be useful if the limit was 'data we can write out
in 1 (configurable) second'. This would typically mean either one 50 MB
(depending on disk) contiguous block or 100-200 scattered blocks (since
the typical disk latency is about 5-10 ms).
Has anyone tried something like this?
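
The arithmetic behind those numbers, as a small Python sketch. The function name is made up, and the 50 MB/s sequential rate and 5-10 ms per-seek latency are the rough figures assumed above, not measurements.

```python
def writable_in_window(window_s, seq_mb_per_s, seek_ms):
    """Return (contiguous MB, scattered block count) writable in window_s seconds."""
    contiguous_mb = seq_mb_per_s * window_s
    # Each scattered block costs roughly one seek, so the count is window / latency.
    scattered_blocks = (window_s * 1000) // seek_ms
    return contiguous_mb, scattered_blocks


for seek_ms in (5, 10):
    print(seek_ms, writable_in_window(1, 50, seek_ms))
```

With a 1 second window this gives 50 MB contiguous, or 200 blocks at 5 ms per seek and 100 blocks at 10 ms.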
Mark
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 19:31 ` Andreas Dilger
@ 2007-04-27 19:44 ` Mike Galbraith
2007-04-27 19:50 ` Linus Torvalds
2007-04-27 22:18 ` Andrew Morton
2 siblings, 0 replies; 66+ messages in thread
From: Mike Galbraith @ 2007-04-27 19:44 UTC (permalink / raw)
To: Andreas Dilger
Cc: Linus Torvalds, Marat Buharov, Andrew Morton, LKML, Jens Axboe,
linux-ext4, Alex Tomas
On Fri, 2007-04-27 at 13:31 -0600, Andreas Dilger wrote:
> I believe
> Alex has a patch to have it checkpoint much smaller chunks to the fs.
I wouldn't be averse to test driving such a patch (understatement). You
have a pointer?
-Mike
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 19:31 ` Andreas Dilger
2007-04-27 19:44 ` Mike Galbraith
@ 2007-04-27 19:50 ` Linus Torvalds
2007-04-27 20:05 ` Hua Zhong
` (6 more replies)
2007-04-27 22:18 ` Andrew Morton
2 siblings, 7 replies; 66+ messages in thread
From: Linus Torvalds @ 2007-04-27 19:50 UTC (permalink / raw)
To: Andreas Dilger
Cc: Marat Buharov, Andrew Morton, Mike Galbraith, LKML, Jens Axboe,
linux-ext4, Alex Tomas
On Fri, 27 Apr 2007, Andreas Dilger wrote:
>
> It's true that this is a "feature" of ext3 with data=ordered (the default),
> but I suspect the same thing is now true in reiserfs too.
Oh, well.. Journalling sucks.
I was actually _really_ hoping that somebody would come along and tell
everybody that this whole journal-logging is stupid, and that it's just
better to not ever re-write blocks on disk, but instead write to new
blocks with version numbers (and not re-use old blocks until new versions
are stable on disk).
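
A toy sketch of that scheme in Python, to make the idea concrete. It is not modelled on any real filesystem; the class, its fields and the commit() step are assumptions for illustration. Writes always go to a fresh physical block with a bumped version number, and an old block only becomes reusable once the newer version is stable.

```python
class ToyVersionedStore:
    """Never overwrite in place; reuse old blocks only after a commit."""

    def __init__(self):
        self.blocks = {}     # physical block -> (logical id, version, data)
        self.stable = {}     # logical id -> physical block of last stable version
        self.pending = {}    # logical id -> physical block not yet stable
        self.free = []       # physical blocks that are safe to reuse
        self.next_phys = 0

    def _alloc(self):
        if self.free:
            return self.free.pop()
        phys = self.next_phys
        self.next_phys += 1
        return phys

    def write(self, logical, data):
        prev = self.pending.get(logical, self.stable.get(logical))
        version = self.blocks[prev][1] + 1 if prev is not None else 1
        phys = self._alloc()                  # always a different block
        self.blocks[phys] = (logical, version, data)
        if logical in self.pending:
            # The superseded copy never became stable, so nothing refers to it.
            self.free.append(self.pending[logical])
        self.pending[logical] = phys
        return phys

    def commit(self):
        # New versions are now stable on disk; the old ones may be reused.
        for logical, phys in self.pending.items():
            old = self.stable.get(logical)
            if old is not None:
                self.free.append(old)
            self.stable[logical] = phys
        self.pending.clear()

    def read_after_crash(self, logical):
        # After a crash only stable versions are visible.
        phys = self.stable.get(logical)
        return None if phys is None else self.blocks[phys][2]


s = ToyVersionedStore()
s.write("a", "v1")
s.commit()
s.write("a", "v2")
print(s.read_after_crash("a"))   # still the old version until the next commit
```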
There was even somebody who did something like that for a PhD thesis, I
forget the details (and it apparently died when the thesis was presumably
accepted ;).
Linus
^ permalink raw reply [flat|nested] 66+ messages in thread
* RE: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 19:50 ` Linus Torvalds
@ 2007-04-27 20:05 ` Hua Zhong
2007-04-27 20:12 ` Miquel van Smoorenburg
` (5 subsequent siblings)
6 siblings, 0 replies; 66+ messages in thread
From: Hua Zhong @ 2007-04-27 20:05 UTC (permalink / raw)
To: 'Linus Torvalds', 'Andreas Dilger'
Cc: 'Marat Buharov', 'Andrew Morton',
'Mike Galbraith', 'LKML', 'Jens Axboe',
linux-ext4, 'Alex Tomas'
The idea has not died and some NAS/file server vendors have already been
doing this for some time. (I am not sure but is WAFS the same thing?)
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Linus Torvalds
> Sent: Friday, April 27, 2007 12:51 PM
> To: Andreas Dilger
> Cc: Marat Buharov; Andrew Morton; Mike Galbraith; LKML; Jens Axboe;
> linux-ext4@vger.kernel.org; Alex Tomas
> Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose
> when FS is under heavy write load (massive starvation)
>
>
>
> On Fri, 27 Apr 2007, Andreas Dilger wrote:
> >
> > It's true that this is a "feature" of ext3 with data=ordered (the
> default),
> > but I suspect the same thing is now true in reiserfs too.
>
> Oh, well.. Journalling sucks.
>
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new
> versions
> are stable on disk).
>
> There was even somebody who did something like that for a PhD thesis, I
> forget the details (and it apparently died when the thesis was
> presumably
> accepted ;).
>
> Linus
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 15:18 ` Linus Torvalds
` (2 preceding siblings ...)
2007-04-27 19:28 ` Mike Galbraith
@ 2007-04-27 20:06 ` Jan Engelhardt
2007-04-27 21:22 ` Linus Torvalds
2007-04-28 4:25 ` Mike Galbraith
` (2 subsequent siblings)
6 siblings, 1 reply; 66+ messages in thread
From: Jan Engelhardt @ 2007-04-27 20:06 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Mike Galbraith, LKML, Andrew Morton, Jens Axboe
On Apr 27 2007 08:18, Linus Torvalds wrote:
>
>Actually, you don't need to apply the patch - just do
>
> echo 5 > /proc/sys/vm/dirty_background_ratio
> echo 10 > /proc/sys/vm/dirty_ratio
>
>and say if it seems to improve things. I think those are much saner
>defaults especially for a desktop system (and probably for most servers
>too, for that matter).
Interesting. For my laptop, I have configured like 90 for
dirty_background_ratio and 95 for dirty_ratio. Makes for a nice
delayed write, but I do not do workloads bigger than extracting kernel
tarballs (~250 MB) and coding away on that machine (488 MB RAM) anyway.
Setting it to something like 95, I could probably rm -Rf the kernel
tree again and the disk never gets active because it is all cached.
But if dirty_ratio is lowered, the disk will get active soon.
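
A rough Python sketch of what such settings imply for writeback time. It treats the ratio as a share of total RAM, which is a simplification, and the 30 MB/s disk throughput is an assumed figure for a laptop disk, not a measurement. The function name is made up.

```python
def worst_case_flush_seconds(ram_mb, dirty_ratio_percent, disk_mb_per_s):
    """Seconds needed to write out the maximum allowed amount of dirty data."""
    dirty_mb = ram_mb * dirty_ratio_percent / 100
    return dirty_mb / disk_mb_per_s


ASSUMED_DISK_MB_PER_S = 30   # assumption, not a measurement

for ratio in (95, 10):
    secs = worst_case_flush_seconds(488, ratio, ASSUMED_DISK_MB_PER_S)
    print(f"dirty_ratio={ratio}: up to {secs:.1f} s of writeback")
```

Under these assumptions a full cache at 95 takes roughly 15 seconds to drain, against under 2 seconds at 10, which is the delay anything waiting on that writeback could see.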
>Historical note: allowing about half of memory to contain dirty pages made
>more sense back in the days when people had 16-64MB of memory, and a
>single untar of even fairly small projects would otherwise hit the disk.
>But memory sizes have grown *much* more quickly than disk speeds (and
>latency requirements have gone down, not up), so a default that may
>actually have been perfectly fine at some point seems crazy these days..
Jan
--
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 19:50 ` Linus Torvalds
2007-04-27 20:05 ` Hua Zhong
@ 2007-04-27 20:12 ` Miquel van Smoorenburg
2007-04-27 20:12 ` Bill Huey
` (4 subsequent siblings)
6 siblings, 0 replies; 66+ messages in thread
From: Miquel van Smoorenburg @ 2007-04-27 20:12 UTC (permalink / raw)
To: torvalds; +Cc: linux-kernel
In article <alpine.LFD.0.98.0704271246550.9964@woody.linux-foundation.org> you write:
>I was actually _really_ hoping that somebody would come along and tell
>everybody that this whole journal-logging is stupid, and that it's just
>better to not ever re-write blocks on disk, but instead write to new
>blocks with version numbers (and not re-use old blocks until new versions
>are stable on disk).
>
>There was even somebody who did something like that for a PhD thesis, I
>forget the details (and it apparently died when the thesis was presumably
>accepted ;).
If you mean tux2, it died because of patent issues:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0208.3/0332.html
Mike.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 19:50 ` Linus Torvalds
2007-04-27 20:05 ` Hua Zhong
2007-04-27 20:12 ` Miquel van Smoorenburg
@ 2007-04-27 20:12 ` Bill Huey
2007-04-28 5:37 ` Mikulas Patocka
2007-04-27 20:29 ` Gabriel C
` (3 subsequent siblings)
6 siblings, 1 reply; 66+ messages in thread
From: Bill Huey @ 2007-04-27 20:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andreas Dilger, Marat Buharov, Andrew Morton, Mike Galbraith,
LKML, Jens Axboe, linux-ext4, Alex Tomas, Bill Huey (hui)
On Fri, Apr 27, 2007 at 12:50:34PM -0700, Linus Torvalds wrote:
> Oh, well.. Journalling sucks.
>
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new versions
> are stable on disk).
>
> There was even somebody who did something like that for a PhD thesis, I
> forget the details (and it apparently died when the thesis was presumably
> accepted ;).
That sounds a whole lot like NetApp's WAFL file system and is heavily patented.
bill
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 19:50 ` Linus Torvalds
` (2 preceding siblings ...)
2007-04-27 20:12 ` Bill Huey
@ 2007-04-27 20:29 ` Gabriel C
2007-04-27 20:45 ` Stephen Clark
` (2 subsequent siblings)
6 siblings, 0 replies; 66+ messages in thread
From: Gabriel C @ 2007-04-27 20:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andreas Dilger, Marat Buharov, Andrew Morton, Mike Galbraith,
LKML, Jens Axboe, linux-ext4, Alex Tomas, Mikulas Patocka
Linus Torvalds wrote:
> There was even somebody who did something like that for a PhD thesis, I
> forget the details (and it apparently died when the thesis was presumably
> accepted ;).
>
>
You mean SpadFS[1], right?
> Linus
>
Gabriel
[1] http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 19:50 ` Linus Torvalds
` (3 preceding siblings ...)
2007-04-27 20:29 ` Gabriel C
@ 2007-04-27 20:45 ` Stephen Clark
2007-04-27 20:54 ` Manoj Joseph
2007-04-28 8:45 ` Matthias Andree
6 siblings, 0 replies; 66+ messages in thread
From: Stephen Clark @ 2007-04-27 20:45 UTC (permalink / raw)
To: linux-kernel
Linus Torvalds wrote:
>On Fri, 27 Apr 2007, Andreas Dilger wrote:
>
>
>>It's true that this is a "feature" of ext3 with data=ordered (the default),
>>but I suspect the same thing is now true in reiserfs too.
>>
>>
>
>Oh, well.. Journalling sucks.
>
>I was actually _really_ hoping that somebody would come along and tell
>everybody that this whole journal-logging is stupid, and that it's just
>better to not ever re-write blocks on disk, but instead write to new
>blocks with version numbers (and not re-use old blocks until new versions
>are stable on disk).
>
>
>
That sort of sounds like something NCR used to do in the mainframe days:
files had generation numbers, and multiple generations of the files were
kept around, with the OS automatically removing the older ones.
>There was even somebody who did something like that for a PhD thesis, I
>forget the details (and it apparently died when the thesis was presumably
>accepted ;).
>
> Linus
--
"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety." (Ben Franklin)
"The course of history shows that as a government grows, liberty
decreases." (Thomas Jefferson)
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 19:50 ` Linus Torvalds
` (4 preceding siblings ...)
2007-04-27 20:45 ` Stephen Clark
@ 2007-04-27 20:54 ` Manoj Joseph
2007-04-28 8:45 ` Matthias Andree
6 siblings, 0 replies; 66+ messages in thread
From: Manoj Joseph @ 2007-04-27 20:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andreas Dilger, Marat Buharov, Andrew Morton, Mike Galbraith,
LKML, Jens Axboe, linux-ext4, Alex Tomas
Linus Torvalds wrote:
>
> On Fri, 27 Apr 2007, Andreas Dilger wrote:
>> It's true that this is a "feature" of ext3 with data=ordered (the default),
>> but I suspect the same thing is now true in reiserfs too.
>
> Oh, well.. Journalling sucks.
Go back to ext2? ;)
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new versions
> are stable on disk).
Ah, "copy on write"! ZFS (Sun) and WAFL (NetApp) do this. Don't know
about WAFL, but ZFS does logging too.
-Manoj
--
Manoj Joseph
http://kerneljunkie.blogspot.com/
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 20:06 ` Jan Engelhardt
@ 2007-04-27 21:22 ` Linus Torvalds
0 siblings, 0 replies; 66+ messages in thread
From: Linus Torvalds @ 2007-04-27 21:22 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Mike Galbraith, LKML, Andrew Morton, Jens Axboe
On Fri, 27 Apr 2007, Jan Engelhardt wrote:
>
> Interesting. For my laptop, I have configured like 90 for
> dirty_background_ratio and 95 for dirty_ratio. Makes for a nice
> delayed write, but I do not do workloads bigger than extracting kernel
> tarballs (~250 MB) and coding away on that machine (488 MB RAM) anyway.
> Setting it to something like 95, I could probably rm -Rf the kernel
> tree again and the disk never gets active because it is all cached.
> But if dirty_ratio is lowered, the disk will get active soon.
Yes. For laptops, you may want to
- raise the dirty limits
- increase the dirty scan times
but you do realize that if you then need memory for something else,
latency just becomes *horrible*. So even on laptops, it's not obviously
the right thing to do (these days, throwing money at the problem instead,
and getting one of the nice new 1.8" flash disks, will solve all issues:
you'd have no reason to try to delay spinning up the disk anyway).
Linus
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 19:09 ` Zan Lynx
@ 2007-04-27 22:07 ` Andrew Morton
0 siblings, 0 replies; 66+ messages in thread
From: Andrew Morton @ 2007-04-27 22:07 UTC (permalink / raw)
To: Zan Lynx; +Cc: Linus Torvalds, Mike Galbraith, LKML, Jens Axboe
On Fri, 27 Apr 2007 13:09:06 -0600
Zan Lynx <zlynx@acm.org> wrote:
> On Fri, 2007-04-27 at 11:31 -0700, Andrew Morton wrote:
> [snip]
> > ext3's problem here is that a single fsync() requires that ext3 sync the
> > whole filesystem. Because
> >
> > - a journal commit can contain metadata from multiple files, and if we
> > want to journal one file's metadata via fsync(), we unavoidably journal
> > all the other files' metadata at the same time.
> >
> > - ordered mode requires that we write a file's data blocks prior to
> > journalling the metadata which refers to those blocks.
> >
> > net result: syncing anything syncs the whole world.
> >
> > There are a few areas in which this could conceivably be tuned up: if a
> > particular file doesn't currently have any metadata in the commit, we don't
> > actually need to sync its data blocks: we could just transfer them into
> > next commit. Hard, unlikely to be of benefit.
> [snip]
>
> How about mixing the ordered and data journal modes? If the data blocks
> would fit, have fsync write them into the journal as is done in
> data=journal mode. Then that file data is committed to disk as fsync
> requires, but it shouldn't require flushing all the previous metadata to
> get an ordered guarantee.
In some ways that would be quite neat: if a process does a small write then
fsyncs it, write it all into the journal. That avoids a seek out to the
file's data blocks.
However it'd be quite hard to do, I expect: we don't know until commit time
how much data has been written to this file (actually, we don't even know
at commit-time, but we could, with quite some work, find out).
But none of this will solve the problem, because even with your optimised
fsync(), we still need to write out bonnie's large file at commit time,
when we fsync() your small write to a different file.
(And when I say "this problem" I refer to the known-about problem which
we're discussing here. I suspect this in fact isn't Mike's problem - 20
minutes is crazy - it's not attributable to the fsync-syncs-everything
problem unless Mike's GUI is doing a huge number of separate fsyncs)
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 19:31 ` Andreas Dilger
2007-04-27 19:44 ` Mike Galbraith
2007-04-27 19:50 ` Linus Torvalds
@ 2007-04-27 22:18 ` Andrew Morton
2007-05-03 17:38 ` Alex Tomas
2 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2007-04-27 22:18 UTC (permalink / raw)
To: Andreas Dilger
Cc: Linus Torvalds, Marat Buharov, Mike Galbraith, LKML, Jens Axboe,
linux-ext4, Alex Tomas
On Fri, 27 Apr 2007 13:31:30 -0600
Andreas Dilger <adilger@clusterfs.com> wrote:
> On Apr 27, 2007 08:30 -0700, Linus Torvalds wrote:
> > On a good filesystem, when you do "fsync()" on a file, nothing at all
> > happens to any other files. On ext3, it seems to sync the global journal,
> > which means that just about *everything* that writes even a single byte
> > (well, at least anything journalled, which would be all the normal
> > directory ops etc) to disk will just *stop* dead cold!
> >
> > It's horrid. And it really is ext3, not "fsync()".
> >
> > I used to run reiserfs, and it had its problems, but this was the
> > "feature" of ext3 that I've disliked most. If you run a MUA with local
> > mail, it will do fsync's for most things, and things really hiccup if you
> > are doing some other writes at the same time. In contrast, with reiser, if
> > you did a big untar or some other big write, if somebody fsync'ed a small
> > file, it wasn't even a blip on the radar - the fsync would sync just that
> > small thing.
>
> It's true that this is a "feature" of ext3 with data=ordered (the default),
> but I suspect the same thing is now true in reiserfs too. The reason is
> that if a journal commit doesn't flush the data as well then a crash will
> result in garbage (from old deleted files) being visible in the newly
> allocated file. People used to complain about this with reiserfs all the
> time having corrupt data in new files after a crash, which is why I believe
> it was fixed.
People still complain about hey-my-files-are-all-full-of-zeroes on XFS.
> There definitely are some problems with the ext3 journal commit though.
> If the journal is full it will cause the whole journal to checkpoint out
> to the filesystem synchronously even if just space for a small transaction
> is needed. That is doubly bad if you have a very large journal. I believe
> Alex has a patch to have it checkpoint much smaller chunks to the fs.
>
We can make great improvements here, and I've (twice) previously described
how: hoist the entire ordered-mode data handling out of ext3, and out of
the buffer_head layer and move it up into the VFS pagecache layer.
Basically, do ordered-data with a commit-time inode walk, calling
do_sync_mapping_range().
Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too.
Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.
And guess what? We can then partly fix _this_ problem too. If we're
running a commit on behalf of fsync(inode1) and we come across an inode2
which doesn't have any block allocation metadata in this commit, we don't
need to sync inode2's pages.
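
As a sketch of that selection rule only (Python, not VFS code; the function and argument names are made up): during the commit-time inode walk, an inode's pages are written out if it is the fsync target or if it has block allocation metadata in this commit.

```python
def inodes_to_flush(dirty_inodes, allocated_in_commit, fsync_target):
    """Return the dirty inodes whose data must be written before this commit."""
    return [
        ino for ino in dirty_inodes
        if ino == fsync_target or ino in allocated_in_commit
    ]


dirty = ["inode1", "inode2", "inode3"]
allocating = {"inode3"}          # only inode3 allocated blocks in this commit
print(inodes_to_flush(dirty, allocating, fsync_target="inode1"))
```

Here inode2 is skipped because it only overwrote existing blocks. A writer that is still allocating, such as a growing file, would still be flushed, which is why this only partly fixes the problem.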
Weep. It's times like this when I want to escape all this patch-wrangling
nonsense and go do some real stuff.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 15:18 ` Linus Torvalds
` (3 preceding siblings ...)
2007-04-27 20:06 ` Jan Engelhardt
@ 2007-04-28 4:25 ` Mike Galbraith
2007-04-28 6:32 ` Mike Galbraith
2007-04-28 6:32 ` Mikulas Patocka
2007-05-02 6:53 ` Jens Axboe
6 siblings, 1 reply; 66+ messages in thread
From: Mike Galbraith @ 2007-04-28 4:25 UTC (permalink / raw)
To: Linus Torvalds; +Cc: LKML, Andrew Morton, Jens Axboe
On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote:
> Actually, you don't need to apply the patch - just do
>
> echo 5 > /proc/sys/vm/dirty_background_ratio
> echo 10 > /proc/sys/vm/dirty_ratio
That seems to have done the trick. Amarok and GUI aren't exactly speed
demons while writeout is happening, but they are not hanging for
eternities.
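
For anyone repeating the test, a small Python helper to record the values before and after changing them. The two file names are the ones given above; the function name and the `vm_dir` parameter are made up for illustration.

```python
from pathlib import Path


def read_dirty_ratios(vm_dir="/proc/sys/vm"):
    """Return the current writeback thresholds as integers (percent)."""
    names = ("dirty_background_ratio", "dirty_ratio")
    return {
        name: int((Path(vm_dir) / name).read_text().strip())
        for name in names
    }
```

Calling read_dirty_ratios() with no argument on a Linux box reads the live values; passing another directory is only there to make the helper testable.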
-Mike
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 20:12 ` Bill Huey
@ 2007-04-28 5:37 ` Mikulas Patocka
2007-04-28 5:45 ` Mikulas Patocka
2007-04-28 21:57 ` Bill Huey
0 siblings, 2 replies; 66+ messages in thread
From: Mikulas Patocka @ 2007-04-28 5:37 UTC (permalink / raw)
To: Bill Huey
Cc: Linus Torvalds, Andreas Dilger, Marat Buharov, Andrew Morton,
Mike Galbraith, LKML, Jens Axboe, linux-ext4, Alex Tomas
On Fri, 27 Apr 2007, Bill Huey wrote:
> On Fri, Apr 27, 2007 at 12:50:34PM -0700, Linus Torvalds wrote:
>> Oh, well.. Journalling sucks.
>>
>> I was actually _really_ hoping that somebody would come along and tell
>> everybody that this whole journal-logging is stupid, and that it's just
>> better to not ever re-write blocks on disk, but instead write to new
>> blocks with version numbers (and not re-use old blocks until new versions
>> are stable on disk).
>>
>> There was even somebody who did something like that for a PhD thesis, I
>> forget the details (and it apparently died when the thesis was presumably
>> accepted ;).
>
> That sounds a whole lot like NetApp's WAFL file system and is heavily
> patented.
>
> bill
Hi
SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
phase tree filesystems (TUX2); it writes inside normal used structures,
but it marks each structure with generation tags --- when it updates
global table of tags, it atomically makes several structures valid. I
don't know about this idea being used elsewhere.
Its fsync is slow too (needs to write all (meta)data too), but it at
least doesn't livelock --- fsync is basically:
* write all buffers and wait for completion
* take lock preventing metadata updates
* write all buffers again (those that were updated while previous write
was in progress) and wait for completion
* update global generation count table
* release the lock
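
The steps above can be sketched as a small in-memory model. This is an illustration under assumptions, not SpadFS code: `ToyFS` and all of its members are hypothetical names, and "writing" a buffer just moves it between two sets.

```python
import threading


class ToyFS:
    """Hypothetical in-memory model; none of these names come from SpadFS."""

    def __init__(self):
        self.metadata_lock = threading.Lock()
        self.dirty = set()      # buffers modified since the last write
        self.on_disk = set()    # buffers that have reached the disk
        self.generation = 0     # stands in for the generation count table

    def write_all_and_wait(self):
        # Write every dirty buffer and wait for completion.
        self.on_disk |= self.dirty
        self.dirty.clear()

    def fsync(self):
        self.write_all_and_wait()        # 1. bulk of the work, done unlocked
        with self.metadata_lock:         # 2. no metadata updates from here on
            self.write_all_and_wait()    # 3. only what was dirtied during step 1
            self.generation += 1         # 4. update the generation table
        # 5. lock released on leaving the with-block


fs = ToyFS()
fs.dirty.update({"inode-table", "dir-block"})
fs.fsync()
print(fs.generation, sorted(fs.on_disk))
```

The second pass runs with updates blocked, so the set it has to write cannot keep growing; that is the property that avoids the livelock.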
Maybe Suse will be paying me from this autumn to make more features to it
--- so far it works, doesn't eat data, but isn't much known :)
Mikulas
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 5:37 ` Mikulas Patocka
@ 2007-04-28 5:45 ` Mikulas Patocka
2007-04-28 21:57 ` Bill Huey
1 sibling, 0 replies; 66+ messages in thread
From: Mikulas Patocka @ 2007-04-28 5:45 UTC (permalink / raw)
To: Bill Huey
Cc: Linus Torvalds, Andreas Dilger, Marat Buharov, Andrew Morton,
Mike Galbraith, LKML, Jens Axboe, linux-ext4, Alex Tomas
On Sat, 28 Apr 2007, Mikulas Patocka wrote:
> On Fri, 27 Apr 2007, Bill Huey wrote:
> Hi
>
> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
> phase tree filesystems (TUX2);
--- BTW, I don't think that writing to unallocated parts of disk is good
idea. These filesystems have cool write benchmarks, but one subtle (and
unbenchmarkable) problem:
They group files according to time when they were created and not
according to directory hierarchy.
When the user has directory with project files and he edited different
files at different times, normal filesystems will place the files near
each other (so that "grep blabla *" is fast) and log-structured
filesystems will scatter the files over the whole disk.
Mikulas
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 4:25 ` Mike Galbraith
@ 2007-04-28 6:32 ` Mike Galbraith
2007-04-28 7:01 ` Andrew Morton
0 siblings, 1 reply; 66+ messages in thread
From: Mike Galbraith @ 2007-04-28 6:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: LKML, Andrew Morton, Jens Axboe
On Sat, 2007-04-28 at 06:25 +0200, Mike Galbraith wrote:
> On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote:
>
> > Actually, you don't need to apply the patch - just do
> >
> > echo 5 > /proc/sys/vm/dirty_background_ratio
> > echo 10 > /proc/sys/vm/dirty_ratio
>
> That seems to have done the trick. Amarok and GUI aren't exactly speed
> demons while writeout is happening, but they are not hanging for
> eternities.
As promised, I tested with a kernel that I know for fact that I have
tested heavy IO on previously, and behavior was identically horrid, so
it's not something new that snuck in ~recently, my disk just got a _lot_
fuller in the meantime (12k mp3s munch a lot).
I also verified that I don't need to use the dirty data restrictions
with ext2, all is just peachy using stock settings. Amarok switches
songs quickly, and GUI doesn't hang. Behavior is that expected of a
heavily loaded IO subsystem, and is 1000% better than ext3 with my very
full disk.
Journaling is very nice, but I think I'll be much better off without it
responsiveness wise.
-Mike
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 15:18 ` Linus Torvalds
` (4 preceding siblings ...)
2007-04-28 4:25 ` Mike Galbraith
@ 2007-04-28 6:32 ` Mikulas Patocka
2007-04-28 16:05 ` Linus Torvalds
2007-05-02 6:53 ` Jens Axboe
6 siblings, 1 reply; 66+ messages in thread
From: Mikulas Patocka @ 2007-04-28 6:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Mike Galbraith, LKML, Andrew Morton, Jens Axboe
On Fri, 27 Apr 2007, Linus Torvalds wrote:
>
>
> On Fri, 27 Apr 2007, Mike Galbraith wrote:
>>
>> As subject states, my GUI is going away for extended periods of time
>> when my very full and likely highly fragmented (how to find out)
>> filesystem is under heavy write load. While write is under way, if
>> amarok (mp3 player) is running, no song change will occur until write is
>> finished, and the GUI can go _entirely_ comatose for very long periods.
>> Usually, it will come back to life after write is finished, but
>> occasionally, a complete GUI restart is necessary.
>
> One thing to try out (and dammit, I should make it the default now in
> 2.6.21) is to just make the dirty limits much lower. We've been talking
> about this for ages, I think this might be the right time to do it.
>
> Especially with lots of memory, allowing 40% of that memory to be dirty is
> just insane (even if we limit it to "just" 40% of the normal memory zone).
> That can be gigabytes. And no amount of IO scheduling will make it
> pleasant to try to handle the situation where that much memory is dirty.
What about using different dirtypage limits for different processes?
--- i.e. every process has dirtypage activity counter, that is increased
when it dirties a page and decreased over time. Compute the limit for
process as some inverse of this counter --- so that processes that dirtied
a lot of pages will be blocked at lower limit and processes that dirtied
few pages will be blocked at higher limit.
The main problem is that if the user extracts tar archive, tar eventually
blocks on writeback I/O --- O.K. But if bash attempts to write one page to
.bash_history file at the same time, it blocks too --- bad, the user is
annoyed.
(I don't have time to write and test it, it is just an idea --- I found
these writeback lockups of the whole system annoying too)
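
To make the idea concrete, a minimal sketch of one possible shape for it. Everything here is an assumption for illustration: the class name, the halving decay, the `scale` constant and the particular "inverse" formula are not taken from any kernel code.

```python
class DirtyActivity:
    """Hypothetical per-process counter: rises per dirtied page, decays over time."""

    def __init__(self, decay=0.5):
        self.count = 0.0
        self.decay = decay  # fraction kept per tick (assumed value)

    def dirtied(self, pages=1):
        self.count += pages

    def tick(self):
        self.count *= self.decay


def process_limit(base_limit, activity, scale=1000.0):
    """One possible 'inverse' mapping: heavier dirtiers get a lower limit."""
    return base_limit / (1.0 + activity.count / scale)


tar, bash = DirtyActivity(), DirtyActivity()
tar.dirtied(100_000)   # untarring a large archive
bash.dirtied(1)        # one page for .bash_history
base = 25_000          # dirty limit in pages, assumed

print(process_limit(base, tar), process_limit(base, bash))
```

With these numbers tar is throttled at a few hundred pages while bash keeps almost the whole limit. Note that the sketch does nothing to bound the sum of dirty pages across all processes.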
Mikulas
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 6:32 ` Mike Galbraith
@ 2007-04-28 7:01 ` Andrew Morton
2007-04-28 7:12 ` Mike Galbraith
0 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2007-04-28 7:01 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Linus Torvalds, LKML, Jens Axboe
On Sat, 28 Apr 2007 08:32:32 +0200 Mike Galbraith <efault@gmx.de> wrote:
> On Sat, 2007-04-28 at 06:25 +0200, Mike Galbraith wrote:
> > On Fri, 2007-04-27 at 08:18 -0700, Linus Torvalds wrote:
> >
> > > Actually, you don't need to apply the patch - just do
> > >
> > > echo 5 > /proc/sys/vm/dirty_background_ratio
> > > echo 10 > /proc/sys/vm/dirty_ratio
> >
> > That seems to have done the trick. Amarok and GUI aren't exactly speed
> > demons while writeout is happening, but they are not hanging for
> > eternities.
>
> As promised, I tested with a kernel that I know for fact that I have
> tested heavy IO on previously, and behavior was identically horrid, so
> it's not something new that snuck in ~recently, my disk just got a _lot_
> fuller in the meantime (12k mp3s munch a lot).
Just to clarify here - you're saying that some older kernel is as sucky as
2.6.21, and that (presumably) dropping the dirty ratios makes things a bit
better on the old kernel as well?
> I also verified that I don't need to use the dirty data restrictions
> with ext2, all is just peachy using stock settings. Amarok switches
> songs quickly, and GUI doesn't hang. Behavior is that expected of a
> heavily loaded IO subsystem, and is 1000% better than ext3 with my very
> full disk.
Yes, the very full disk could explain why things are _so_ bad. Not only
does fsync() force vast amounts of writeout, it's also seeky writeout.
> Journaling is very nice, but I think I'll be much better off without it
> responsiveness wise.
Well, physical journalling with ordered data is bad here. Other forms of
journalling which don't introduce this great contention point shouldn't be
as bad.
Actually, I'm surprised that data=writeback didn't help much. If the
present theories are correct it should have helped quite a lot, because in
data=writeback mode fsync(small-file) will not cause
fdatasync(everything-else).
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 7:01 ` Andrew Morton
@ 2007-04-28 7:12 ` Mike Galbraith
0 siblings, 0 replies; 66+ messages in thread
From: Mike Galbraith @ 2007-04-28 7:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linus Torvalds, LKML, Jens Axboe
On Sat, 2007-04-28 at 00:01 -0700, Andrew Morton wrote:
> On Sat, 28 Apr 2007 08:32:32 +0200 Mike Galbraith <efault@gmx.de> wrote:
>
> > On Sat, 2007-04-28 at 06:25 +0200, Mike Galbraith wrote:
> > As promised, I tested with a kernel that I know for fact that I have
> > tested heavy IO on previously, and behavior was identically horrid, so
> > it's not something new that snuck in ~recently, my disk just got a _lot_
> > fuller in the meantime (12k mp3s munch a lot).
>
> Just to clarify here - you're saying that some older kernel is as sucky as
> 2.6.21, and that (presumably) dropping the dirty ratios makes things a bit
> better on the old kernel as well?
I didn't drop dirty ratios, only verified that behavior was just as
horrible as 2.6.21.
> Actually, I'm surprised that data=writeback didn't help much. If the
> present theories are correct it should have helped quite a lot, because in
> data=writeback mode fsync(small-file) will not cause
> fdatasync(everything-else).
data=writeback did help quite noticeably, just not enough.
-Mike
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 15:30 ` Linus Torvalds
2007-04-27 19:31 ` Andreas Dilger
@ 2007-04-28 8:44 ` Matthias Andree
1 sibling, 0 replies; 66+ messages in thread
From: Matthias Andree @ 2007-04-28 8:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Marat Buharov, Andrew Morton, Mike Galbraith, LKML, Jens Axboe,
linux-ext4
On Fri, 27 Apr 2007, Linus Torvalds wrote:
>
>
> On Fri, 27 Apr 2007, Marat Buharov wrote:
> >
> > On 4/27/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> > > Aside: why the heck do applications think that their data is so important
> > > that they need to fsync it all the time. I used to run a kernel on my
> > > laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> > > pleasurable.
> >
> > So, if having fake fsync() and fdatasync() is pleasurable for laptop
> > and desktop, may be it's time to add option into Kconfig which
> > disables normal fsync behaviour in favor of robust desktop?
>
> This really is an ext3 issue, not "fsync()".
>
> On a good filesystem, when you do "fsync()" on a file, nothing at all
> happens to any other files. On ext3, it seems to sync the global journal,
This behavior has been in Linux and sort of official since the early
2.4.X days - remember the discussion on fsync()ing directory changes for
MTAs that led to the mount option "dirsync" for ext?fs so that rename(),
link() and stuff like that became synchronous even without fsync()ing
the parent directory? I can look up archive references if need be.
Surely four years ago, if not five (this is from the top of my head, not
a quotable fact I verified from the LKML archives though).
> I used to run reiserfs, and it had its problems, but this was the
> "feature" of ext3 that I've disliked most. If you run a MUA with local
> mail, it will do fsync's for most things, and things really hiccup if you
> are doing some other writes at the same time. In contrast, with reiser, if
> you did a big untar or some other big write, if somebody fsync'ed a small
> file, it wasn't even a blip on the radar - the fsync would sync just that
> small thing.
It's not as though I'd recommend reiserfs. I have seen one major
corruption recently in openSUSE 10.2 with ext3, but I've had constant
headaches with reiserfs since the day it went into S.u.S.E. kernels at
the time until I switched away from reiserfs some years ago.
--
Matthias Andree
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 19:50 ` Linus Torvalds
` (5 preceding siblings ...)
2007-04-27 20:54 ` Manoj Joseph
@ 2007-04-28 8:45 ` Matthias Andree
6 siblings, 0 replies; 66+ messages in thread
From: Matthias Andree @ 2007-04-28 8:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andreas Dilger, Marat Buharov, Andrew Morton, Mike Galbraith,
LKML, Jens Axboe, linux-ext4, Alex Tomas
On Fri, 27 Apr 2007, Linus Torvalds wrote:
> Oh, well.. Journalling sucks.
>
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new versions
> are stable on disk).
Only that you need direct-overwrite support to be able to safely trash
data you no longer need...
--
Matthias Andree
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 18:31 ` Andrew Morton
2007-04-27 19:09 ` Zan Lynx
2007-04-27 19:27 ` Mike Galbraith
@ 2007-04-28 8:51 ` Matthias Andree
2007-04-28 8:59 ` Andrew Morton
2007-04-28 16:30 ` Linus Torvalds
2 siblings, 2 replies; 66+ messages in thread
From: Matthias Andree @ 2007-04-28 8:51 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linus Torvalds, Mike Galbraith, LKML, Jens Axboe
On Fri, 27 Apr 2007, Andrew Morton wrote:
> But none of this explains a 20-minute hang, unless a *lot* of fsyncs are
> being performed, perhaps.
Another thing that is rather unpleasant (haven't yet tried fiddling with
the dirty limits) is UDF to DVD-RAM - try rsyncing /home to a DVD-RAM,
that's going to leave you with tons of dirty buffers that clear slowly
-- "watch -n 1 grep -i dirty /proc/meminfo" is boring, but elucidating...
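
For logging the same numbers from a script, a small sketch that picks the "dirty" fields out of /proc/meminfo text. The sample string below is made up for the example; only the "Name: value kB" line layout is relied on.

```python
def dirty_kib(meminfo_text):
    """Return {field: KiB} for /proc/meminfo lines whose name contains 'dirty'."""
    result = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and "dirty" in name.lower():
            result[name.strip()] = int(rest.split()[0])
    return result


# Made-up sample in the /proc/meminfo layout.
sample = "MemTotal:  1035204 kB\nDirty:      183456 kB\nWriteback:    2048 kB\n"
print(dirty_kib(sample))

# On a live system:
#   with open("/proc/meminfo") as f:
#       print(dirty_kib(f.read()))
```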
--
Matthias Andree
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 8:51 ` Matthias Andree
@ 2007-04-28 8:59 ` Andrew Morton
2007-04-28 16:30 ` Linus Torvalds
1 sibling, 0 replies; 66+ messages in thread
From: Andrew Morton @ 2007-04-28 8:59 UTC (permalink / raw)
To: Matthias Andree; +Cc: Linus Torvalds, Mike Galbraith, LKML, Jens Axboe
On Sat, 28 Apr 2007 10:51:48 +0200 Matthias Andree <matthias.andree@gmx.de> wrote:
> On Fri, 27 Apr 2007, Andrew Morton wrote:
>
> > But none of this explains a 20-minute hang, unless a *lot* of fsyncs are
> > being performed, perhaps.
>
> Another thing that is rather unpleasant (haven't yet tried fiddling with
> the dirty limits) is UDF to DVD-RAM - try rsyncing /home to a DVD-RAM,
> that's going to leave you with tons of dirty buffers that clear slowly
> -- "watch -n 1 grep -i dirty /proc/meminfo" is boring, but elucidating...
>
yes, a few people are attacking that from various angles at present. It's
tricky - writeback has to juggle a lot of balls. We'll get there.
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 6:32 ` Mikulas Patocka
@ 2007-04-28 16:05 ` Linus Torvalds
2007-04-28 16:37 ` Ingo Molnar
` (2 more replies)
0 siblings, 3 replies; 66+ messages in thread
From: Linus Torvalds @ 2007-04-28 16:05 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: Mike Galbraith, LKML, Andrew Morton, Jens Axboe
On Sat, 28 Apr 2007, Mikulas Patocka wrote:
> >
> > Especially with lots of memory, allowing 40% of that memory to be dirty is
> > just insane (even if we limit it to "just" 40% of the normal memory zone).
> > That can be gigabytes. And no amount of IO scheduling will make it
> > pleasant to try to handle the situation where that much memory is dirty.
>
> What about using different dirtypage limits for different processes?
Not good. We inadvertently actually had a very strange case of that, in the
sense that we had different dirtypage limits depending on the type of the
allocation: if somebody used GFP_HIGHUSER, he'd be looking at the
percentage as a percentage of _all_ memory, but if somebody used
GFP_KERNEL he'd look at it as a percentage of just the normal low memory.
So effectively they had different limits (the percentage may have been the
same, but the _meaning_ of the percentage changed ;)
And it's really problematic, because it means that the process that has a
high tolerance for dirty memory will happily dirty a lot of RAM, and then
when the process that has a _low_ tolerance comes along, it might write
just a single byte, and go "oh, damn, I'm way over my dirty limits, I will
now have to start doing writeouts like mad".
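
A quick arithmetic sketch of that mismatch. The machine is an assumed example (4 GiB of RAM with 896 MiB of low memory, as on a typical 32-bit highmem setup), and the same 40% is applied to both bases.

```python
MIB = 1024 * 1024

total = 4096 * MIB    # assumed total RAM
lowmem = 896 * MIB    # assumed size of the normal low-memory zone
ratio = 40            # same percentage for everybody


def limit(base, pct):
    return base * pct // 100


high_limit = limit(total, ratio)    # what the GFP_HIGHUSER-style writer sees
low_limit = limit(lowmem, ratio)    # what the GFP_KERNEL-style writer sees

# Writer A stays just under its own limit ...
dirty = high_limit - 1 * MIB
# ... and writer B, arriving with a single byte, finds itself far over its own.
print(high_limit // MIB, low_limit // MIB, dirty > low_limit)
```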
Your form is much better:
> --- i.e. every process has dirtypage activity counter, that is increased when
> it dirties a page and decreased over time.
..but is really hard to do, and in particular, it's really hard to make
any kinds of guarantees that when you have a hundred processes, they won't
go over the total dirty limit together!
And one of the reasons for the dirty limit is that the VM really wants to
know that it always has enough clean memory that it can throw away that
even if it needs to do allocations while under IO, it's not totally
screwed. An example of this is using dirty mmap with a networked
filesystem: with 2.6.20 and later, this should actually _work_ fairly
reliably, exactly because we now also count the dirty mapped pages in the
dirty limits, so we never get into the situation that we used to be able
to get into, where some process had mapped all of RAM, and dirtied it
without the kernel even realizing, and then when the kernel needed more
memory (in order to write some of it back), it was totally screwed.
So we do need the "global limit", as just a VM safety issue. We could do
some per-process counters in addition to that, but generally, the global
limit actually ends up doing the right thing: heavy writers are more
likely to _hit_ the limit, so statistically the people who write most are
also the people who end up having to clean up - so it's all fair.
> The main problem is that if the user extracts tar archive, tar eventually
> blocks on writeback I/O --- O.K. But if bash attempts to write one page to
> .bash_history file at the same time, it blocks too --- bad, the user is
> annoyed.
Right, but it's actually very unlikely. Think about it: the person who
extracts the tar-archive is perhaps dirtying a thousand pages, while the
.bash_history writeback is doing a single one. Which process do you think
is going to hit the "oops, we went over the limit" case 99.9% of the time?
The _really_ annoying problem is when you just have absolutely tons of
memory dirty, and you start doing the writeback: if you saturate the IO
queues totally, it simply doesn't matter _who_ starts the writeback,
because anybody who needs to do any IO at all (not necessarily writing) is
going to be blocked.
This is why having gigabytes of dirty data (or even "just" hundreds of
megs) can be so annoying.
Even with a good software IO scheduler, when you have disks that do tagged
queueing, if you fill up the disk queue with a few dozen (depends on the
disk what the queue limit is) huge write requests, it doesn't really
matter if the _software_ queuing then gives a big advantage to reads
coming in. They'll _still_ be waiting for a long time, especially since
you don't know what the disk firmware is going to do.
It's possible that we could do things like refusing to use all tag entries
on the disk for writing. That would probably help latency a _lot_. Right
now, if we do writeback, and fill up all the slots on the disk, we cannot
even feed the disk the read request immediately - we'll have to wait for
some of the writes to finish before we can even queue the read to the
disk.
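
A toy model of that idea, purely as a sketch: `TagPool`, the queue depth of 32 and the write cap of 24 are all made-up illustration values, and no real driver or block-layer interface is implied.

```python
class TagPool:
    """Toy model of a disk's tagged-queue slots with a cap on writes."""

    def __init__(self, queue_depth=32, write_cap=24):
        self.queue_depth = queue_depth
        self.write_cap = write_cap
        self.reads = 0
        self.writes = 0

    def try_issue(self, is_write):
        if self.reads + self.writes >= self.queue_depth:
            return False    # device queue is completely full
        if is_write and self.writes >= self.write_cap:
            return False    # remaining slots are kept for reads
        if is_write:
            self.writes += 1
        else:
            self.reads += 1
        return True


pool = TagPool()
issued_writes = sum(pool.try_issue(is_write=True) for _ in range(100))
read_ok = pool.try_issue(is_write=False)
print(issued_writes, read_ok)
```

However hard writeback pushes, the read still finds a free slot and can be queued to the disk right away.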
(Of course, if disks don't support tagged queueing, you'll never have this
problem at all, but most disks do these days, and I strongly suspect it
really can aggravate latency numbers a lot).
Jens? Comments? Or do you do that already?
Linus
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 8:51 ` Matthias Andree
2007-04-28 8:59 ` Andrew Morton
@ 2007-04-28 16:30 ` Linus Torvalds
2007-04-28 16:56 ` Paolo Ornati
1 sibling, 1 reply; 66+ messages in thread
From: Linus Torvalds @ 2007-04-28 16:30 UTC (permalink / raw)
To: Matthias Andree; +Cc: Andrew Morton, Mike Galbraith, LKML, Jens Axboe
On Sat, 28 Apr 2007, Matthias Andree wrote:
>
> Another thing that is rather unpleasant (haven't yet tried fiddling with
> the dirty limits) is UDF to DVD-RAM - try rsyncing /home to a DVD-RAM,
> that's going to leave you with tons of dirty buffers that clear slowly
> -- "watch -n 1 grep -i dirty /proc/meminfo" is boring, but elucidating...
Now *this* is actually really really nasty.
There are worse examples. Try connecting some flash disk over USB-1, and
untar to it. Ugh.
I'd love to have some per-device dirty limit, but it's harder than it
should be.
Linus
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 16:05 ` Linus Torvalds
@ 2007-04-28 16:37 ` Ingo Molnar
2007-04-28 17:11 ` Mikulas Patocka
2007-04-28 17:55 ` Mikulas Patocka
2007-04-30 6:56 ` Jens Axboe
2 siblings, 1 reply; 66+ messages in thread
From: Ingo Molnar @ 2007-04-28 16:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mikulas Patocka, Mike Galbraith, LKML, Andrew Morton, Jens Axboe
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Even with a good software IO scheduler, when you have disks that do
> tagged queueing, if you fill up the disk queue with a few dozen
> (depends on the disk what the queue limit is) huge write requests, it
> doesn't really matter if the _software_ queuing then gives a big
> advantage to reads coming in. They'll _still_ be waiting for a long
> time, especially since you don't know what the disk firmware is going
> to do.
by far the largest advantage of tagged queueing is when we go from 1
pending request to 2 pending requests. The rest helps too for certain
workloads (especially benchmarks), but if the IRQ handling is fast
enough, having just 2 is more than enough to get 80% of the advantage of
say of hardware-queue with a depth of 64.
So perhaps if there's any privileged reads going on then we should limit
writes to a depth of 2 at most, with some timeout mechanism that would
gradually allow the deepening of the hardware queue, as long as no
highprio reads come in between? With 2 pending requests and even assuming
worst-case seeks the user-visible latency would be on the order of 20-30
msecs, which is at the edge of human perception. The problem comes when
a hardware queue of 32-64 entries starves that one highprio read which
then results in a 2+ seconds latency.
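
The arithmetic behind those two figures, as a back-of-the-envelope sketch. The per-request service times are assumptions picked to match the orders of magnitude above (a worst-case seek for a small request, longer for a huge write); they are not measurements.

```python
def worst_case_wait_ms(requests_ahead, service_ms):
    """Time a new read waits if everything already queued is served first."""
    return requests_ahead * service_ms


# Shallow write queue: 2 requests ahead, about 15 ms per worst-case seek (assumed).
print(worst_case_wait_ms(2, 15))

# Full hardware queue of huge writes: 64 ahead, about 32 ms each (assumed).
print(worst_case_wait_ms(64, 32))
```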
Ingo
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 16:30 ` Linus Torvalds
@ 2007-04-28 16:56 ` Paolo Ornati
0 siblings, 0 replies; 66+ messages in thread
From: Paolo Ornati @ 2007-04-28 16:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthias Andree, Andrew Morton, Mike Galbraith, LKML, Jens Axboe
On Sat, 28 Apr 2007 09:30:06 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> There are worse examples. Try connecting some flash disk over USB-1, and
> untar to it. Ugh.
>
> I'd love to have some per-device dirty limit, but it's harder than it
> should be.
this one should help:
Patch: per device dirty throttling
http://lwn.net/Articles/226709/
--
Paolo Ornati
Linux 2.6.21-cfs-v7-g13fe02de on x86_64
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 16:37 ` Ingo Molnar
@ 2007-04-28 17:11 ` Mikulas Patocka
2007-04-30 6:57 ` Jens Axboe
0 siblings, 1 reply; 66+ messages in thread
From: Mikulas Patocka @ 2007-04-28 17:11 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Mike Galbraith, LKML, Andrew Morton, Jens Axboe
> So perhaps if there's any privileged reads going on then we should limit
> writes to a depth of 2 at most, with some timeout mechanism that would
SCSI has a "high priority" bit in the command block, so you can just set
it --- but I am not sure how well disks support it.
Mikulas
> gradually allow the deepening of the hardware queue, as long as no
> highprio reads come in between? With 2 pending requests and even assuming
> worst-case seeks the user-visible latency would be on the order of 20-30
> msecs, which is at the edge of human perception. The problem comes when
> a hardware queue of 32-64 entries starves that one highprio read which
> then results in a 2+ seconds latency.
>
> Ingo
>
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 16:05 ` Linus Torvalds
2007-04-28 16:37 ` Ingo Molnar
@ 2007-04-28 17:55 ` Mikulas Patocka
2007-04-30 6:56 ` Jens Axboe
2 siblings, 0 replies; 66+ messages in thread
From: Mikulas Patocka @ 2007-04-28 17:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Mike Galbraith, LKML, Andrew Morton, Jens Axboe
On Sat, 28 Apr 2007, Linus Torvalds wrote:
>> The main problem is that if the user extracts tar archive, tar eventually
>> blocks on writeback I/O --- O.K. But if bash attempts to write one page to
>> .bash_history file at the same time, it blocks too --- bad, the user is
>> annoyed.
>
> Right, but it's actually very unlikely. Think about it: the person who
> extracts the tar-archive is perhaps dirtying a thousand pages, while the
> .bash_history writeback is doing a single one. Which process do you think
> is going to hit the "oops, we went over the limit" case 99.9% of the time?
Both. See balance_dirty_pages --- you loop there if
global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS) +
global_page_state(NR_WRITEBACK) is over limit.
So tar gets there first, start writeback, blocks. Innocent process calling
one small write() gets there too (while writeback has not yet finished),
sees that the expression is over limit and blocks too.
Really, you go to balance_dirty_pages with 1/8 probability, so small
writers will block with that probability --- better than blocking always,
but still annoying.
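
A simplified model of that check, to show why both writers block. This is a sketch, not the kernel function: the page counts and the threshold are assumed numbers, and the rate limiting that decides how often a writer enters the check at all is left out.

```python
def over_dirty_limit(nr_file_dirty, nr_unstable_nfs, nr_writeback, limit):
    """The global test described above: it looks only at system-wide counts."""
    return nr_file_dirty + nr_unstable_nfs + nr_writeback > limit


def must_block(pages_dirtied_by_caller, counters, limit):
    # pages_dirtied_by_caller is deliberately unused: the check is global,
    # so a one-page writer and a heavy writer get the same answer.
    return over_dirty_limit(*counters, limit)


counters = (30_000, 0, 5_000)   # assumed page counts once tar has filled memory
limit = 25_000                  # assumed threshold in pages

print(must_block(100_000, counters, limit), must_block(1, counters, limit))
```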
> The _really_ annoying problem is when you just have absolutely tons of
> memory dirty, and you start doing the writeback: if you saturate the IO
> queues totally, it simply doesn't matter _who_ starts the writeback,
> because anybody who needs to do any IO at all (not necessarily writing) is
> going to be blocked.
I saw this writeback problem on machine that had a lot of memory (1G),
internal fast disk where the distribution was installed and very slow
external SCSI disk (6MB/s or so). When I did heavy write on the external
disk and writeback started, the computer almost completely locked up ---
any process trying to write anything to the fast disk blocked until
writeback on the slow disk finishes.
(that machine had some old RHEL kernel and it is not mine so I can't test
new kernels on it --- but the above fragment of code shows that the
problem still exists today)
Mikulas
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 8:33 ` Andrew Morton
` (2 preceding siblings ...)
2007-04-27 11:59 ` Marat Buharov
@ 2007-04-28 20:46 ` Mikulas Patocka
2007-04-28 21:12 ` Lee Revell
3 siblings, 1 reply; 66+ messages in thread
From: Mikulas Patocka @ 2007-04-28 20:46 UTC (permalink / raw)
To: Andrew Morton
Cc: Mike Galbraith, LKML, Linus Torvalds, Jens Axboe, linux-ext4
> hm, fsync.
>
> Aside: why the heck do applications think that their data is so important
> that they need to fsync it all the time. I used to run a kernel on my
> laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> pleasurable.
>
> But wedging for 20 minutes is probably excessive punishment.
I most wonder, why vim fsyncs its swapfile regularly (blocking typing
during that) and doesn't fsync the resulting file on :w :-/
Mikulas
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 20:46 ` Mikulas Patocka
@ 2007-04-28 21:12 ` Lee Revell
2007-04-29 20:49 ` Mark Lord
2007-04-29 21:17 ` Mikulas Patocka
0 siblings, 2 replies; 66+ messages in thread
From: Lee Revell @ 2007-04-28 21:12 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Andrew Morton, Mike Galbraith, LKML, Linus Torvalds, Jens Axboe,
linux-ext4
On 4/28/07, Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
> I most wonder, why vim fsyncs its swapfile regularly (blocking typing
> during that) and doesn't fsync the resulting file on :w :-/
Never seen this. Why would fsync block typing unless vim was doing
disk IO for every keystroke?
Lee
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 5:37 ` Mikulas Patocka
2007-04-28 5:45 ` Mikulas Patocka
@ 2007-04-28 21:57 ` Bill Huey
2007-04-28 22:38 ` Mikulas Patocka
1 sibling, 1 reply; 66+ messages in thread
From: Bill Huey @ 2007-04-28 21:57 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Linus Torvalds, Andreas Dilger, Marat Buharov, Andrew Morton,
Mike Galbraith, LKML, Jens Axboe, linux-ext4, Alex Tomas,
Bill Huey (hui)
On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:
> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
> phase tree filesystems (TUX2); it writes inside normal used structures,
> but it marks each structure with generation tags --- when it updates
> global table of tags, it atomically makes several structures valid. I
> don't know about this idea being used elsewhere.
So how is this generation structure organized ? paper ?
bill
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 21:57 ` Bill Huey
@ 2007-04-28 22:38 ` Mikulas Patocka
0 siblings, 0 replies; 66+ messages in thread
From: Mikulas Patocka @ 2007-04-28 22:38 UTC (permalink / raw)
To: Bill Huey
Cc: Linus Torvalds, Andreas Dilger, Marat Buharov, Andrew Morton,
Mike Galbraith, LKML, Jens Axboe, linux-ext4, Alex Tomas
> On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:
>> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
>> phase tree filesystems (TUX2); it writes inside normal used structures,
>> but it marks each structure with generation tags --- when it updates
>> global table of tags, it atomically makes several structures valid. I
>> don't know about this idea being used elsewhere.
>
> So how is this generation structure organized ? paper ?
Paper is in CITSA 2006 proceedings (but you likely don't have them and I
signed some statement that I can't post it elsewhere :-( )
Basically the idea is this:
* You have an array containing 65536 32-bit numbers --- the crash count table ---
that array is on disk and in memory (see struct __spadfs->cct in my sources)
* You have a 16-bit value --- the crash count; that value is on disk and in memory
too (see struct __spadfs->cc)
* On mount, you load the crash count table and crash count from disk to
memory. You increment the crash count on disk (but leave the old value in memory).
You increment one entry in the crash count table - cct[cc] - in memory, but leave
the old value on disk.
* On sync you write all metadata buffers, do a write barrier, write one
sector of the crash count table from memory to disk and do a write
barrier again.
* On unmount, you sync and decrement the crash count on disk.
--- so the crash count counts crashes --- it is increased each time you mount
and don't unmount.
Consistency of structures:
* Each directory entry has two tags --- a 32-bit transaction count (txc)
and a 16-bit crash count (cc).
* You create a directory entry with entry->txc = fs->cct[fs->cc] and
entry->cc = fs->cc
* A directory entry is considered valid if fs->cct[entry->cc] >= entry->txc
(see macro CC_VALID)
* If the directory entry is not valid, it is skipped during directory
scan, as if it wasn't there
--- so you create a directory entry and it's valid. If the system crashes,
it will load the crash count table from disk, where the value is one less than
entry->txc, so the entry will be invalid. It will also run with an increased
cc, so it will never touch the table at the old index, so the entry will stay
invalid forever.
--- if you sync, you write the crash count table to disk and the directory entry
is atomically made valid forever (because values in the crash count table
never decrease)
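The rules above can be tried out with a toy Python model. This is only a sketch under stated assumptions and is not taken from the SpadFS sources: the class and method names are made up, the write barriers are reduced to a comment, and the point at which the in-memory counter advances after a sync is an assumption (the description does not say when that happens).

```python
class CrashCountFS:
    """Toy model of the crash-count scheme; the disk_* fields survive a crash."""

    def __init__(self):
        # "On-disk" state (hypothetical field names).
        self.disk_cc = 0                # 16-bit crash count
        self.disk_cct = [0] * 65536     # crash count table of 32-bit counters
        self.entries = []               # directory entries: (name, cc, txc)
        self.mount()

    def mount(self):
        # Load crash count and table from disk into memory.
        self.cc = self.disk_cc
        self.cct = list(self.disk_cct)
        # Increment cc on disk only, and cct[cc] in memory only.
        self.disk_cc = (self.disk_cc + 1) & 0xFFFF
        self.cct[self.cc] += 1

    def create(self, name):
        # New entries are tagged with the in-memory (cc, cct[cc]) pair.
        self.entries.append((name, self.cc, self.cct[self.cc]))

    def sync(self):
        # Real code: write metadata, barrier, write one cct sector, barrier.
        self.disk_cct[self.cc] = self.cct[self.cc]
        # Assumption: the in-memory counter advances after each sync, so
        # entries created later stay uncommitted until the next sync.
        self.cct[self.cc] += 1

    def unmount(self):
        self.sync()
        self.disk_cc = (self.disk_cc - 1) & 0xFFFF

    def crash(self):
        # In-memory state is lost; remount from whatever reached the disk.
        self.mount()

    def visible(self):
        # CC_VALID-style check: the table must have caught up with the tag.
        return [n for (n, cc, txc) in self.entries if self.cct[cc] >= txc]
```

In this model an entry created after the last sync is visible while the system is up, disappears after crash(), and never comes back, because the remounted filesystem runs with a higher cc and no longer touches the old table slot.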
In my implementation, the top bit of entry->txc is used to mark whether
the entry is scheduled for addition or deletion, so that you can atomically
add one directory entry and delete another.
Space allocation bitmaps or lists are managed in such a way that there are
two copies and cc/txc pair determining which one is valid.
Files are extended in such a way that each file has two "size" entries and
cc/txc pair denoting which one is valid, so that you can atomically
extend/truncate file and mark its space allocated/freed in bitmaps or
lists (BTW. this cc/txc pair is the same one that denotes if the directory
entry is valid and another bit determines one of these two functions ---
to save space).
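A rough Python illustration of the "two size entries plus a cc/txc pair" idea follows. The slot convention is an assumption made for the example (slot 0 holds the last committed size, slot 1 the size written under the tag); the description above does not say which slot is which, and the function name is hypothetical.

```python
def live_size(sizes, cc, txc, cct):
    """Return whichever of the two stored sizes is currently valid.

    Assumed convention (not confirmed by the description): sizes[0] is the
    last committed size, sizes[1] is the size written under tag (cc, txc).
    `cct` maps a crash count to its counter in the crash count table.
    """
    tag_valid = cct[cc] >= txc          # same test as for directory entries
    return sizes[1] if tag_valid else sizes[0]
```

If the crash count table was synced after the extend, the new size wins; if the system crashed first, the tag never becomes valid and the old size is used, so the file appears never to have been extended.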
Mikulas
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 21:12 ` Lee Revell
@ 2007-04-29 20:49 ` Mark Lord
2007-04-29 21:17 ` Mikulas Patocka
1 sibling, 0 replies; 66+ messages in thread
From: Mark Lord @ 2007-04-29 20:49 UTC (permalink / raw)
To: Lee Revell
Cc: Mikulas Patocka, Andrew Morton, Mike Galbraith, LKML,
Linus Torvalds, Jens Axboe, linux-ext4
Lee Revell wrote:
> On 4/28/07, Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
>> I most wonder, why vim fsyncs its swapfile regularly (blocking typing
>> during that) and doesn't fsync the resulting file on :w :-/
>
> Never seen this. Why would fsync block typing unless vim was doing
> disk IO for every keystroke?
It does do that, for the crash-recovery files it maintains.
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 21:12 ` Lee Revell
2007-04-29 20:49 ` Mark Lord
@ 2007-04-29 21:17 ` Mikulas Patocka
1 sibling, 0 replies; 66+ messages in thread
From: Mikulas Patocka @ 2007-04-29 21:17 UTC (permalink / raw)
To: Lee Revell; +Cc: LKML
On Sat, 28 Apr 2007, Lee Revell wrote:
> On 4/28/07, Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
>> I most wonder, why vim fsyncs its swapfile regularly (blocking typing
>> during that) and doesn't fsync the resulting file on :w :-/
>
> Never seen this. Why would fsync block typing unless vim was doing
> disk IO for every keystroke?
>
> Lee
Not for every keystroke, but after some time it calls fsync(). During
execution of that call, the keyboard is blocked. That is not normally a problem
(fsync executes very quickly), but it starts to be a problem on an
extremely overloaded system.
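To see how long a single fsync() holds up the caller, a small Python helper like the one below can be used. It is a sketch only (the function name is made up, and it says nothing about what vim does internally); it appends some data, calls fsync, and reports the time spent inside the call.

```python
import os
import time


def timed_fsync(path, data):
    """Append `data` to `path`, fsync it, and return seconds spent in fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, data)
        start = time.monotonic()
        os.fsync(fd)    # the calling thread blocks here until the data is stable
        return time.monotonic() - start
    finally:
        os.close(fd)
```

On an idle disk the returned value is usually tiny; run it while a large write load is in progress to see the kind of stall described above.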
Mikulas
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 16:05 ` Linus Torvalds
2007-04-28 16:37 ` Ingo Molnar
2007-04-28 17:55 ` Mikulas Patocka
@ 2007-04-30 6:56 ` Jens Axboe
2 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2007-04-30 6:56 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Mikulas Patocka, Mike Galbraith, LKML, Andrew Morton
On Sat, Apr 28 2007, Linus Torvalds wrote:
> > The main problem is that if the user extracts tar archive, tar eventually
> > blocks on writeback I/O --- O.K. But if bash attempts to write one page to
> > .bash_history file at the same time, it blocks too --- bad, the user is
> > annoyed.
>
> Right, but it's actually very unlikely. Think about it: the person who
> extracts the tar-archive is perhaps dirtying a thousand pages, while the
> .bash_history writeback is doing a single one. Which process do you think
> is going to hit the "oops, we went over the limit" case 99.9% of the time?
>
> The _really_ annoying problem is when you just have absolutely tons of
> memory dirty, and you start doing the writeback: if you saturate the IO
> queues totally, it simply doesn't matter _who_ starts the writeback,
> because anybody who needs to do any IO at all (not necessarily writing) is
> going to be blocked.
>
> This is why having gigabytes of dirty data (or even "just" hundreds of
> megs) can be so annoying.
>
> Even with a good software IO scheduler, when you have disks that do tagged
> queueing, if you fill up the disk queue with a few dozen (depends on the
> disk what the queue limit is) huge write requests, it doesn't really
> matter if the _software_ queuing then gives a big advantage to reads
> coming in. They'll _still_ be waiting for a long time, especially since
> you don't know what the disk firmware is going to do.
>
> It's possible that we could do things like refusing to use all tag entries
> on the disk for writing. That would probably help latency a _lot_. Right
> now, if we do writeback, and fill up all the slots on the disk, we cannot
> even feed the disk the read request immediately - we'll have to wait for
> some of the writes to finish before we can even queue the read to the
> disk.
>
> (Of course, if disks don't support tagged queueing, you'll never have this
> problem at all, but most disks do these days, and I strongly suspect it
> really can aggravate latency numbers a lot).
>
> Jens? Comments? Or do you do that already?
Yes, CFQ tries to handle that quite aggressively already. With the
emergence of NCQ on SATA, it has become a much bigger problem since it's
seen so easily on the desktop. The SCSI people usually don't care about
latency that much, so not many complaints there.
The recently posted patch series for CFQ that I will submit soon for
2.6.22 has more fixes/tweaks for this.
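The idea quoted above, not letting writes occupy every tag on the disk, can be sketched as a toy queue in Python. This is not how CFQ is implemented; the class, its parameters and the default numbers are hypothetical and only show the policy.

```python
from collections import deque


class TagLimitedQueue:
    """Toy device queue: `depth` tags in total, at most `max_write_tags` for writes."""

    def __init__(self, depth=32, max_write_tags=24):
        self.depth = depth
        self.max_write_tags = max_write_tags
        self.in_flight = []                     # (kind, request id) pairs holding a tag
        self.pending = {"R": deque(), "W": deque()}

    def writes_in_flight(self):
        return sum(1 for kind, _ in self.in_flight if kind == "W")

    def submit(self, kind, req_id):
        self.pending[kind].append(req_id)
        self._dispatch()

    def complete(self, kind, req_id):
        self.in_flight.remove((kind, req_id))
        self._dispatch()

    def _dispatch(self):
        # Reads may take any free tag.
        while self.pending["R"] and len(self.in_flight) < self.depth:
            self.in_flight.append(("R", self.pending["R"].popleft()))
        # Writes stop at their cap, so some tags stay free for later reads.
        while (self.pending["W"]
               and len(self.in_flight) < self.depth
               and self.writes_in_flight() < self.max_write_tags):
            self.in_flight.append(("W", self.pending["W"].popleft()))
```

With a depth of 4 and a write cap of 3, a burst of writes leaves one tag free, so a read that arrives later is sent to the device immediately instead of waiting for a write to finish.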
--
Jens Axboe
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-28 17:11 ` Mikulas Patocka
@ 2007-04-30 6:57 ` Jens Axboe
0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2007-04-30 6:57 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Ingo Molnar, Linus Torvalds, Mike Galbraith, LKML, Andrew Morton
On Sat, Apr 28 2007, Mikulas Patocka wrote:
> >So perhaps if there's any privileged reads going on then we should limit
> >writes to a depth of 2 at most, with some timeout mechanism that would
>
> SCSI has a "high priority" bit in the command block, so you can just set
> it --- but I am not sure how well do disks support it.
I'd be surprised if it was useful.
--
Jens Axboe
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 15:18 ` Linus Torvalds
` (5 preceding siblings ...)
2007-04-28 6:32 ` Mikulas Patocka
@ 2007-05-02 6:53 ` Jens Axboe
2007-05-02 7:36 ` Mike Galbraith
6 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2007-05-02 6:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Mike Galbraith, LKML, Andrew Morton
On Fri, Apr 27 2007, Linus Torvalds wrote:
> So I do believe that we could probably do something about the IO
> scheduling _too_:
>
> - break up large write requests (yeah, it will make for worse IO
> throughput, but if we make it configurable, and especially with
> controllers that don't have insane overheads per command, the
> difference between 128kB requests and 16MB requests is probably not
> really even noticeable - SCSI things with large per-command overheads
> are just stupid)
>
> Generating huge requests will automatically mean that they are
> "unbreakable" from an IO scheduler perspective, so it's bad for latency
> for other requests once they've started.
Overlooked this one initially... We actually don't generate huge
requests, exactly because of that. Even if the device can do large
requests (most SATA disks today can do 32meg), we default to 512kB as
the largest one that we will build due to file system requests. It's
trivial to reduce that limit, see /sys/block/<dev>/queue/max_sectors_kb.
That controls the maximum per-request size.
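For scripting, the limit can be read and changed with a few lines of Python. This is a minimal sketch: the helper names and the `sysfs_root` parameter are made up for the example (the parameter only exists so the code can be tried against a scratch directory), and writing the real file requires root.

```python
from pathlib import Path


def max_sectors_path(dev, sysfs_root="/sys"):
    # /sys/block/<dev>/queue/max_sectors_kb, as named in the mail above.
    return Path(sysfs_root) / "block" / dev / "queue" / "max_sectors_kb"


def read_max_sectors_kb(dev, sysfs_root="/sys"):
    """Return the current per-request size limit in kB for block device `dev`."""
    return int(max_sectors_path(dev, sysfs_root).read_text().strip())


def write_max_sectors_kb(dev, kb, sysfs_root="/sys"):
    """Set the per-request size limit in kB (needs root on a real system)."""
    max_sectors_path(dev, sysfs_root).write_text(f"{kb}\n")
```

For example, `write_max_sectors_kb("sda", 128)` would lower the limit for a disk named sda (the device name is just an example).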
--
Jens Axboe
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-05-02 6:53 ` Jens Axboe
@ 2007-05-02 7:36 ` Mike Galbraith
0 siblings, 0 replies; 66+ messages in thread
From: Mike Galbraith @ 2007-05-02 7:36 UTC (permalink / raw)
To: Jens Axboe; +Cc: Linus Torvalds, LKML, Andrew Morton
On Wed, 2007-05-02 at 08:53 +0200, Jens Axboe wrote:
> On Fri, Apr 27 2007, Linus Torvalds wrote:
> > So I do believe that we could probably do something about the IO
> > scheduling _too_:
> >
> > - break up large write requests (yeah, it will make for worse IO
> > throughput, but if we make it configurable, and especially with
> > controllers that don't have insane overheads per command, the
> > difference between 128kB requests and 16MB requests is probably not
> > really even noticeable - SCSI things with large per-command overheads
> > are just stupid)
> >
> > Generating huge requests will automatically mean that they are
> > "unbreakable" from an IO scheduler perspective, so it's bad for latency
> > for other requests once they've started.
>
> Overlooked this one initially... We actually don't generate huge
> requests, exactly because of that. Even if the device can do large
> requests (most SATA disks today can do 32meg), we default to 512kB as
> the largest one that we will build due to file system requests. It's
> trivial to reduce that limit, see /sys/block/<dev>/queue/max_sectors_kb.
> That controls the maximum per-request size.
For the record, I haven't been able to stall KDE for ages with
data=writeback.
-Mike
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-04-27 22:18 ` Andrew Morton
@ 2007-05-03 17:38 ` Alex Tomas
2007-05-03 23:54 ` Andrew Morton
0 siblings, 1 reply; 66+ messages in thread
From: Alex Tomas @ 2007-05-03 17:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Andreas Dilger, Linus Torvalds, Marat Buharov, Mike Galbraith,
LKML, Jens Axboe, linux-ext4
Andrew Morton wrote:
> We can make great improvements here, and I've (twice) previously described
> how: hoist the entire ordered-mode data handling out of ext3, and out of
> the buffer_head layer and move it up into the VFS pagecache layer.
> Basically, do ordered-data with a commit-time inode walk, calling
> do_sync_mapping_range().
>
> Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too.
> Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.
I'm not sure it's that easy.
if we move to pages, then we have to mark pages to be flushed holding
transaction open. now take delayed allocation into account: we need
to allocate number of blocks at once and then mark all pages mapped,
again within context of the same transaction. so, an implementation
would look like the following?
generic_writepages() {
    /* collect set of contig. dirty pages */
    foo_get_blocks() {
        foo_journal_start();
        foo_new_blocks();
        foo_attach_blocks_to_inode();
        generic_mark_pages_mapped();
        foo_journal_stop();
    }
}
another question is will it scale well given number of dirty inodes
can be much larger than number of inodes with dirty mapped blocks
(in delayed allocation case, for example) ?
thanks, Alex
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-05-03 17:38 ` Alex Tomas
@ 2007-05-03 23:54 ` Andrew Morton
2007-05-04 6:18 ` Alex Tomas
0 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2007-05-03 23:54 UTC (permalink / raw)
To: Alex Tomas
Cc: Andreas Dilger, Linus Torvalds, Marat Buharov, Mike Galbraith,
LKML, Jens Axboe, linux-ext4
On Thu, 03 May 2007 21:38:10 +0400
Alex Tomas <alex@clusterfs.com> wrote:
> Andrew Morton wrote:
> > We can make great improvements here, and I've (twice) previously described
> > how: hoist the entire ordered-mode data handling out of ext3, and out of
> > the buffer_head layer and move it up into the VFS pagecache layer.
> > Basically, do ordered-data with a commit-time inode walk, calling
> > do_sync_mapping_range().
> >
> > Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too.
> > Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.
>
> I'm not sure it's that easy.
>
> if we move to pages, then we have to mark pages to be flushed holding
> transaction open. now take delayed allocation into account: we need
> to allocate number of blocks at once and then mark all pages mapped,
> again within context of the same transaction.
Yes, there can be issues with needing to allocate journal space within the
context of a commit. But
a) If the page has newly allocated space on disk then the metadata which
refers to that page is already in the journal: no new journal space
needed.
b) If the page doesn't have space allocated on disk then we don't need
to write it out at ordered-mode commit time, because the post-recovery
filesystem will not have any references to that page.
c) If the page is dirty due to overwrite then no metadata update was required.
IOW, under what circumstances would an ordered-mode commit need to allocate
space for a delayed-allocate page?
However b) might lead to the hey-my-file-is-full-of-zeroes problem.
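The three cases can be written down as a small decision function. This is only an illustration of the argument above, in Python; the function and its boolean parameters are hypothetical and do not correspond to real page flags.

```python
def classify_page(dirty, mapped, newly_allocated):
    """Return (case, write_at_commit) for a page seen by an ordered-mode commit.

    None of the cases needs new journal space at commit time.
    """
    if not dirty:
        return ("clean", False)
    if not mapped:
        # Case b: no block on disk yet (delayed allocation). The post-recovery
        # filesystem will not reference it, so it can be skipped.
        return ("b", False)
    if newly_allocated:
        # Case a: the metadata pointing at the block is already in the journal.
        return ("a", True)
    # Case c: plain overwrite, no metadata update was needed.
    return ("c", True)
```

Skipping case b is exactly where the file-full-of-zeroes concern comes from: the data stays only in memory until blocks are allocated for it.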
> so, an implementation
> would look like the following?
>
> generic_writepages() {
>     /* collect set of contig. dirty pages */
>     foo_get_blocks() {
>         foo_journal_start();
>         foo_new_blocks();
>         foo_attach_blocks_to_inode();
>         generic_mark_pages_mapped();
>         foo_journal_stop();
>     }
> }
>
> another question is will it scale well given number of dirty inodes
> can be much larger than number of inodes with dirty mapped blocks
> (in delayed allocation case, for example) ?
Possibly - zillions of dirty-for-atime inodes might get in the way. A
short-term fix would be to create a separate dirty-inode list on the
superblock (ug). A long-term fix is to rip all the per-superblock
dirty-inode lists and use a radix-tree. Not for lookup purposes, but for
the tree's ability to do tagged and restartable searches.
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-05-03 23:54 ` Andrew Morton
@ 2007-05-04 6:18 ` Alex Tomas
2007-05-04 6:38 ` Andrew Morton
0 siblings, 1 reply; 66+ messages in thread
From: Alex Tomas @ 2007-05-04 6:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Andreas Dilger, Linus Torvalds, Marat Buharov, Mike Galbraith,
LKML, Jens Axboe, linux-ext4
Andrew Morton wrote:
> Yes, there can be issues with needing to allocate journal space within the
> context of a commit. But
no-no, this isn't required. we only need to mark pages/blocks within
transaction, otherwise race is possible when we allocate blocks in transaction,
then transaction starts to commit, then we mark pages/blocks to be flushed
before commit.
> a) If the page has newly allocated space on disk then the metadata which
> refers to that page is already in the journal: no new journal space
> needed.
>
> b) If the page doesn't have space allocated on disk then we don't need
> to write it out at ordered-mode commit time, because the post-recovery
> filesystem will not have any references to that page.
>
> c) If the page is dirty due to overwrite then no metadata update was required.
>
> IOW, under what circumstances would an ordered-mode commit need to allocate
> space for a delayed-allocate page?
no need to allocate space within commit thread, I think. only to take care
of the race I described above. in hackish version of data=ordered for delayed
allocation I used counter of submitted bio's with newly-allocated blocks and
commit thread waits for the counter to reach 0.
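The counter approach can be sketched in userspace Python with a condition variable. This is not the actual patch; the class and method names are hypothetical, and a kernel version would use its own primitives.

```python
import threading


class PendingBioCounter:
    """Counts submitted-but-unfinished bios; the commit side waits for zero."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = 0

    def submit(self):
        # Called when a bio covering newly allocated blocks is submitted.
        with self._cond:
            self._pending += 1

    def complete(self):
        # Called from the I/O completion path.
        with self._cond:
            self._pending -= 1
            if self._pending == 0:
                self._cond.notify_all()

    def wait_for_zero(self, timeout=None):
        # Called by the commit thread; returns False if the timeout expired.
        with self._cond:
            return self._cond.wait_for(lambda: self._pending == 0, timeout)
```

The commit thread calls wait_for_zero() before writing the commit record, so metadata for newly allocated blocks does not become durable ahead of the data.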
>
> However b) might lead to the hey-my-file-is-full-of-zeroes problem.
>
thanks, Alex
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-05-04 6:18 ` Alex Tomas
@ 2007-05-04 6:38 ` Andrew Morton
2007-05-04 6:57 ` Alex Tomas
0 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2007-05-04 6:38 UTC (permalink / raw)
To: Alex Tomas
Cc: Andreas Dilger, Linus Torvalds, Marat Buharov, Mike Galbraith,
LKML, Jens Axboe, linux-ext4
On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <alex@clusterfs.com> wrote:
> Andrew Morton wrote:
> > Yes, there can be issues with needing to allocate journal space within the
> > context of a commit. But
>
> no-no, this isn't required. we only need to mark pages/blocks within
> transaction, otherwise race is possible when we allocate blocks in transaction,
> then transaction starts to commit, then we mark pages/blocks to be flushed
> before commit.
I don't understand. Can you please describe the race in more detail?
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-05-04 6:38 ` Andrew Morton
@ 2007-05-04 6:57 ` Alex Tomas
2007-05-04 7:18 ` Andrew Morton
0 siblings, 1 reply; 66+ messages in thread
From: Alex Tomas @ 2007-05-04 6:57 UTC (permalink / raw)
To: Andrew Morton
Cc: Andreas Dilger, Linus Torvalds, Marat Buharov, Mike Galbraith,
LKML, Jens Axboe, linux-ext4
Andrew Morton wrote:
> On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <alex@clusterfs.com> wrote:
>
>> Andrew Morton wrote:
>>> Yes, there can be issues with needing to allocate journal space within the
>>> context of a commit. But
>> no-no, this isn't required. we only need to mark pages/blocks within
>> transaction, otherwise race is possible when we allocate blocks in transaction,
>> then transaction starts to commit, then we mark pages/blocks to be flushed
>> before commit.
>
> I don't understand. Can you please describe the race in more detail?
if I understood your idea right, then in data=ordered mode, commit thread writes
all dirty mapped blocks before real commit.
say, we have two threads: t1 is a thread doing flushing and t2 is a commit thread
t1: find dirty inode I
t1: find some dirty unallocated blocks
t1: journal_start()
t1: allocate blocks
t1: attach them to I
t1: journal_stop()
t2: going to commit
t2: find inode I dirty
t2: do NOT find these blocks because they're
    allocated only, but pages/bhs aren't mapped
    to them
t2: start commit
t1: map pages/bhs to the just-allocated blocks
so, either we mark pages/bhs someway within journal_start()--journal_stop() or
commit thread should do lookup for all dirty pages. the latter doesn't sound nice, IMHO.
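The interleaving above can be replayed deterministically in a few lines of Python. This is a toy model with made-up names (one page, one block number), not kernel code; it only shows which blocks end up committed without their data having been flushed.

```python
def simulate(mark_inside_transaction):
    """Replay the t1/t2 interleaving; return blocks committed without data."""
    page = {"dirty": True, "block": None}   # one dirty page, not yet mapped
    allocated_in_txn = []

    # t1: journal_start(); allocate blocks; attach them to the inode
    block = 1000                             # arbitrary example block number
    allocated_in_txn.append(block)
    if mark_inside_transaction:
        page["block"] = block                # mapped before journal_stop()
    # t1: journal_stop()

    # t2: the commit walks dirty pages and flushes those mapped to disk
    flushed = []
    if page["dirty"] and page["block"] is not None:
        flushed.append(page["block"])
    # t2: commit the metadata, which references every allocated block
    committed_without_data = [b for b in allocated_in_txn if b not in flushed]

    # t1 (too late in the racy case): map the page to its block
    page["block"] = block
    return committed_without_data
```

simulate(False) returns [1000], the block whose metadata was committed before its data was flushed; simulate(True) returns an empty list because the page was mapped inside the transaction.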
thanks, Alex
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-05-04 6:57 ` Alex Tomas
@ 2007-05-04 7:18 ` Andrew Morton
2007-05-04 7:39 ` Alex Tomas
0 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2007-05-04 7:18 UTC (permalink / raw)
To: Alex Tomas
Cc: Andreas Dilger, Linus Torvalds, Marat Buharov, Mike Galbraith,
LKML, Jens Axboe, linux-ext4
On Fri, 04 May 2007 10:57:12 +0400 Alex Tomas <alex@clusterfs.com> wrote:
> Andrew Morton wrote:
> > On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <alex@clusterfs.com> wrote:
> >
> >> Andrew Morton wrote:
> >>> Yes, there can be issues with needing to allocate journal space within the
> >>> context of a commit. But
> >> no-no, this isn't required. we only need to mark pages/blocks within
> >> transaction, otherwise race is possible when we allocate blocks in transaction,
> >> then transaction starts to commit, then we mark pages/blocks to be flushed
> >> before commit.
> >
> > I don't understand. Can you please describe the race in more detail?
>
> if I understood your idea right, then in data=ordered mode, commit thread writes
> all dirty mapped blocks before real commit.
>
> say, we have two threads: t1 is a thread doing flushing and t2 is a commit thread
>
> t1 t2
> find dirty inode I
> find some dirty unallocated blocks
> journal_start()
> allocate blocks
> attach them to I
> journal_stop()
I'm still not understanding. The terms you're using are a bit ambiguous.
What does "find some dirty unallocated blocks" mean? Find a page which is
dirty and which does not have a disk mapping?
Normally the above operation would be implemented via
ext4_writeback_writepage(), and it runs under lock_page().
> going to commit
> find inode I dirty
> do NOT find these blocks because they're
> allocated only, but pages/bhs aren't mapped
> to them
> start commit
I think you're assuming here that commit would be using ->t_sync_datalist
to locate dirty buffer_heads.
But under this proposal, t_sync_datalist just gets removed: the new
ordered-data mode _only_ need to do the sb->inode->page walk. So if I'm
understanding you, the way in which we'd handle any such race is to make
kjournald's writeback of the dirty pages block in lock_page(). Once it
gets the page lock it can look to see if some other thread has mapped the
page to disk.
It may turn out that kjournald needs a private way of getting at the
I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we
had the radix-tree-of-dirty-inodes thing then that's easy enough to do
anyway, with a tagged search. But I expect that a single pass through the
superblock's dirty inodes would suffice for ordered-data. Files which
have chattr +j would screw things up, as usual.
I assume (hope) that your delayed allocation code implements
->writepages()? Doing the allocation one-page-at-a-time sounds painful...
>
> map pages/bhs to just-allocated blocks
>
>
> so, either we mark pages/bhs someway within journal_start()--journal_stop() or
> commit thread should do lookup for all dirty pages. the latter doesn't sound nice, IMHO.
>
I don't think I'm understanding you fully yet.
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-05-04 7:18 ` Andrew Morton
@ 2007-05-04 7:39 ` Alex Tomas
2007-05-04 8:02 ` Andrew Morton
0 siblings, 1 reply; 66+ messages in thread
From: Alex Tomas @ 2007-05-04 7:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Andreas Dilger, Linus Torvalds, Marat Buharov, Mike Galbraith,
LKML, Jens Axboe, linux-ext4
Andrew Morton wrote:
> I'm still not understanding. The terms you're using are a bit ambiguous.
>
> What does "find some dirty unallocated blocks" mean? Find a page which is
> dirty and which does not have a disk mapping?
>
> Normally the above operation would be implemented via
> ext4_writeback_writepage(), and it runs under lock_page().
I'm mostly worried about delayed allocation case. My impression was that
holding number of pages locked isn't a good idea, even if they're locked
in index order. so, I was going to turn number of pages writeback, then
allocate blocks for all of them at once, then put proper blocknr's into
bh's (or PG_mappedtodisk?).
>
>
>> going to commit
>> find inode I dirty
>> do NOT find these blocks because they're
>> allocated only, but pages/bhs aren't mapped
>> to them
>> start commit
>
> I think you're assuming here that commit would be using ->t_sync_datalist
> to locate dirty buffer_heads.
nope, I mean sb->inode->page walk.
> But under this proposal, t_sync_datalist just gets removed: the new
> ordered-data mode _only_ need to do the sb->inode->page walk. So if I'm
> understanding you, the way in which we'd handle any such race is to make
> kjournald's writeback of the dirty pages block in lock_page(). Once it
> gets the page lock it can look to see if some other thread has mapped the
> page to disk.
if I'm right holding number of pages locked, then they won't be locked, but
writeback. of course kjournald can block on writeback as well, but how does
it find pages with *newly allocated* blocks only?
> It may turn out that kjournald needs a private way of getting at the
> I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we
> had the radix-tree-of-dirty-inodes thing then that's easy enough to do
> anyway, with a tagged search. But I expect that a single pass through the
> superblock's dirty inodes would suffice for ordered-data. Files which
> have chattr +j would screw things up, as usual.
not dirty inodes only, but rather some fast way to find pages with newly
allocated pages.
> I assume (hope) that your delayed allocation code implements
> ->writepages()? Doing the allocation one-page-at-a-time sounds painful...
indeed. this is a root cause of all this complexity.
thanks, Alex
* Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
2007-05-04 7:39 ` Alex Tomas
@ 2007-05-04 8:02 ` Andrew Morton
0 siblings, 0 replies; 66+ messages in thread
From: Andrew Morton @ 2007-05-04 8:02 UTC (permalink / raw)
To: Alex Tomas
Cc: Andreas Dilger, Linus Torvalds, Marat Buharov, Mike Galbraith,
LKML, Jens Axboe, linux-ext4
On Fri, 04 May 2007 11:39:22 +0400 Alex Tomas <alex@clusterfs.com> wrote:
> Andrew Morton wrote:
> > I'm still not understanding. The terms you're using are a bit ambiguous.
> >
> > What does "find some dirty unallocated blocks" mean? Find a page which is
> > dirty and which does not have a disk mapping?
> >
> > Normally the above operation would be implemented via
> > ext4_writeback_writepage(), and it runs under lock_page().
>
> I'm mostly worried about delayed allocation case. My impression was that
> holding number of pages locked isn't a good idea, even if they're locked
> in index order. so, I was going to turn number of pages writeback, then
> allocate blocks for all of them at once, then put proper blocknr's into
> bh's (or PG_mappedtodisk?).
ooh, that sounds hacky and quite worrisome. If someone comes in and does
an fsync() we've lost our synchronisation point. Yes, all callers happen
to do
    lock_page();
    wait_on_page_writeback();
(I think) but we've never considered a bare PageWriteback() as something
which protects page internals. We're OK wrt page reclaim and we're OK wrt
truncate and invalidate. As long as the page is uptodate we _should_ be OK
wrt readpage(). But still, it'd be better to use the standard locking
rather than inventing new rules, if poss.
I'd be 100% OK with locking multiple pages in ascending pgoff_t order.
Locking the page is the standard way of doing this synchronisation and the
only problem I can think of is that having a tremendous number of pages
locked could cause the wake_up_page() waitqueue hashes to get overloaded
and go slow. But it's also possible to lock many, many pages with
readahead and nobody has reported problems in there.
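Locking in ascending index order is the usual way to keep two lockers of overlapping page sets from deadlocking. A userspace Python sketch of the ordering rule follows; the `Page` class and helper names are hypothetical stand-ins, with a mutex playing the role of the page lock.

```python
import threading


class Page:
    def __init__(self, index):
        self.index = index              # stands in for the page's pgoff_t
        self.lock = threading.Lock()    # stands in for the page lock


def lock_pages(pages):
    """Lock all pages in ascending index order and return them in that order."""
    ordered = sorted(pages, key=lambda p: p.index)
    for p in ordered:
        p.lock.acquire()
    return ordered


def unlock_pages(ordered):
    """Release the locks taken by lock_pages(), in reverse order."""
    for p in reversed(ordered):
        p.lock.release()
```

Because every caller takes the locks in the same global order, two threads that want overlapping sets of pages can block each other for a while but cannot deadlock.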
> >
> >
> >> going to commit
> >> find inode I dirty
> >> do NOT find these blocks because they're
> >> allocated only, but pages/bhs aren't mapped
> >> to them
> >> start commit
> >
> > I think you're assuming here that commit would be using ->t_sync_datalist
> > to locate dirty buffer_heads.
>
> nope, I mean sb->inode->page walk.
>
> > But under this proposal, t_sync_datalist just gets removed: the new
> > ordered-data mode _only_ need to do the sb->inode->page walk. So if I'm
> > understanding you, the way in which we'd handle any such race is to make
> > kjournald's writeback of the dirty pages block in lock_page(). Once it
> > gets the page lock it can look to see if some other thread has mapped the
> > page to disk.
>
> if I'm right holding number of pages locked, then they won't be locked, but
> writeback. of course kjournald can block on writeback as well, but how does
> it find pages with *newly allocated* blocks only?
I don't think we'd want kjournald to do that. Even if a page was dirtied
by an overwrite, we'd want to write it back during commit, just from a
quality-of-implementation point of view. If we were to leave these pages
unwritten during commit then a post-recovery file could have a mix of
up-to-five-second-old data and up-to-30-seconds-old data.
> > It may turn out that kjournald needs a private way of getting at the
> > I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we
> > had the radix-tree-of-dirty-inodes thing then that's easy enough to do
> > anyway, with a tagged search. But I expect that a single pass through the
> > superblock's dirty inodes would suffice for ordered-data. Files which
> > have chattr +j would screw things up, as usual.
>
> not dirty inodes only, but rather some fast way to find pages with newly
> allocated pages.
Newly allocated blocks, you mean?
Just write out the overwritten blocks as well as the new ones, I reckon.
It's what we do now.
end of thread, other threads:[~2007-05-04 8:03 UTC | newest]
Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-27 7:59 [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation) Mike Galbraith
2007-04-27 8:33 ` Andrew Morton
2007-04-27 9:23 ` Mike Galbraith
2007-04-27 10:17 ` Mike Galbraith
2007-04-27 11:59 ` Marat Buharov
2007-04-27 12:30 ` Peter Zijlstra
2007-04-27 13:50 ` Mark Lord
2007-04-27 12:39 ` Manoj Joseph
2007-04-27 15:30 ` Linus Torvalds
2007-04-27 19:31 ` Andreas Dilger
2007-04-27 19:44 ` Mike Galbraith
2007-04-27 19:50 ` Linus Torvalds
2007-04-27 20:05 ` Hua Zhong
2007-04-27 20:12 ` Miquel van Smoorenburg
2007-04-27 20:12 ` Bill Huey
2007-04-28 5:37 ` Mikulas Patocka
2007-04-28 5:45 ` Mikulas Patocka
2007-04-28 21:57 ` Bill Huey
2007-04-28 22:38 ` Mikulas Patocka
2007-04-27 20:29 ` Gabriel C
2007-04-27 20:45 ` Stephen Clark
2007-04-27 20:54 ` Manoj Joseph
2007-04-28 8:45 ` Matthias Andree
2007-04-27 22:18 ` Andrew Morton
2007-05-03 17:38 ` Alex Tomas
2007-05-03 23:54 ` Andrew Morton
2007-05-04 6:18 ` Alex Tomas
2007-05-04 6:38 ` Andrew Morton
2007-05-04 6:57 ` Alex Tomas
2007-05-04 7:18 ` Andrew Morton
2007-05-04 7:39 ` Alex Tomas
2007-05-04 8:02 ` Andrew Morton
2007-04-28 8:44 ` Matthias Andree
2007-04-28 20:46 ` Mikulas Patocka
2007-04-28 21:12 ` Lee Revell
2007-04-29 20:49 ` Mark Lord
2007-04-29 21:17 ` Mikulas Patocka
2007-04-27 15:18 ` Linus Torvalds
2007-04-27 15:41 ` John Anthony Kazos Jr.
2007-04-27 15:54 ` Linus Torvalds
2007-04-27 16:24 ` Chuck Ebbert
2007-04-27 19:43 ` Marko Macek
2007-04-27 18:31 ` Andrew Morton
2007-04-27 19:09 ` Zan Lynx
2007-04-27 22:07 ` Andrew Morton
2007-04-27 19:27 ` Mike Galbraith
2007-04-28 8:51 ` Matthias Andree
2007-04-28 8:59 ` Andrew Morton
2007-04-28 16:30 ` Linus Torvalds
2007-04-28 16:56 ` Paolo Ornati
2007-04-27 19:28 ` Mike Galbraith
2007-04-27 20:06 ` Jan Engelhardt
2007-04-27 21:22 ` Linus Torvalds
2007-04-28 4:25 ` Mike Galbraith
2007-04-28 6:32 ` Mike Galbraith
2007-04-28 7:01 ` Andrew Morton
2007-04-28 7:12 ` Mike Galbraith
2007-04-28 6:32 ` Mikulas Patocka
2007-04-28 16:05 ` Linus Torvalds
2007-04-28 16:37 ` Ingo Molnar
2007-04-28 17:11 ` Mikulas Patocka
2007-04-30 6:57 ` Jens Axboe
2007-04-28 17:55 ` Mikulas Patocka
2007-04-30 6:56 ` Jens Axboe
2007-05-02 6:53 ` Jens Axboe
2007-05-02 7:36 ` Mike Galbraith