LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH] Fix private_list handling
@ 2008-01-10 15:50 Jan Kara
2008-01-10 15:55 ` Jan Kara
0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2008-01-10 15:50 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Supriya Kannery
Hi,
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Fix private_list handling
2008-01-10 15:50 [PATCH] Fix private_list handling Jan Kara
@ 2008-01-10 15:55 ` Jan Kara
2008-01-11 0:36 ` Andrew Morton
0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2008-01-10 15:55 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Supriya Kannery
Hi,
sorry for the previous empty email...
Supriya noted in his testing that sometimes buffers removed by
__remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error
won't be properly propagated). Actually, looking more into the code I found
there are some more races. The patch below should fix them. It survived
beating with LTP and fsstress on ext2 filesystem on my testing machine so
it should be reasonably bugfree... Andrew, would you put the patch into
-mm? Thanks.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
---
There are two possible races in handling of private_list in buffer cache.
1) When fsync_buffers_list() processes a private_list, it clears
b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes,
sees a buffer is on list so it calls __remove_assoc_queue() which complains
about b_assoc_mapping being cleared (as it cannot propagate possible IO error).
This race has been actually observed in the wild.
2) When fsync_buffers_list() processes a private_list,
mark_buffer_dirty_inode() can be called on bh which is already on the private
list of fsync_buffers_list(). As buffer is on some list (note that the check is
performed without private_lock), it is not readded to the mapping's
private_list and after fsync_buffers_list() finishes, we have a dirty buffer
which should be on private_list but it isn't. This race has not been reported,
probably because most (but not all) callers of mark_buffer_dirty_inode() hold
i_mutex and thus are serialized with fsync().
Fix these issues by rewriting fsync_buffers_list() to not move buffers to
dedicated private list (which has an additional bonus of reducing code size)
and scan original private_list instead (some care has to be taken to avoid
racing with drop_buffers()). Also mark_buffer_dirty_inode() is modified
to avoid race 2).
Signed-off-by: Jan Kara <jack@suse.cz>
diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..8691fa5 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -563,15 +563,6 @@ EXPORT_SYMBOL(mark_buffer_async_write);
* take an address_space, not an inode. And it should be called
* mark_buffer_dirty_fsync() to clearly define why those buffers are being
* queued up.
- *
- * FIXME: mark_buffer_dirty_inode() doesn't need to add the buffer to the
- * list if it is already on a list. Because if the buffer is on a list,
- * it *must* already be on the right one. If not, the filesystem is being
- * silly. This will save a ton of locking. But first we have to ensure
- * that buffers are taken *off* the old inode's list when they are freed
- * (presumably in truncate). That requires careful auditing of all
- * filesystems (do it inside bforget()). It could also be done by bringing
- * b_inode back.
*/
/*
@@ -591,41 +582,6 @@ int inode_has_buffers(struct inode *inode)
return !list_empty(&inode->i_data.private_list);
}
-/*
- * osync is designed to support O_SYNC io. It waits synchronously for
- * all already-submitted IO to complete, but does not queue any new
- * writes to the disk.
- *
- * To do O_SYNC writes, just queue the buffer writes with ll_rw_block as
- * you dirty the buffers, and then use osync_inode_buffers to wait for
- * completion. Any other dirty buffers which are not yet queued for
- * write will not be flushed to disk by the osync.
- */
-static int osync_buffers_list(spinlock_t *lock, struct list_head *list)
-{
- struct buffer_head *bh;
- struct list_head *p;
- int err = 0;
-
- spin_lock(lock);
-repeat:
- list_for_each_prev(p, list) {
- bh = BH_ENTRY(p);
- if (buffer_locked(bh)) {
- get_bh(bh);
- spin_unlock(lock);
- wait_on_buffer(bh);
- if (!buffer_uptodate(bh))
- err = -EIO;
- brelse(bh);
- spin_lock(lock);
- goto repeat;
- }
- }
- spin_unlock(lock);
- return err;
-}
-
/**
* sync_mapping_buffers - write out and wait upon a mapping's "associated"
* buffers
@@ -634,6 +590,11 @@ repeat:
* Starts I/O against the buffers at mapping->private_list, and waits upon
* that I/O.
*
+ * The caller must make sure (usually by holding i_mutex) that we cannot race
+ * with invalidate_inode_buffers() or remove_inode_buffers() calling
+ * __remove_assoc_queue(). Racing with drop_buffers() is fine and is taken care
+ * of.
+ *
* Basically, this is a convenience function for fsync().
* @mapping is a file or directory which needs those buffers to be written for
* a successful fsync().
@@ -667,22 +628,28 @@ void write_boundary_block(struct block_device *bdev,
}
}
+/* Mark buffer as dirty and add it to inode mapping's private_list. We play
+ * tricks to avoid locking private_lock when not needed and still keep
+ * fsync_buffers_list() from removing the buffer we are just adding. */
void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
{
struct address_space *mapping = inode->i_mapping;
struct address_space *buffer_mapping = bh->b_page->mapping;
- mark_buffer_dirty(bh);
- if (!mapping->assoc_mapping) {
+ if (!mapping->assoc_mapping)
mapping->assoc_mapping = buffer_mapping;
- } else {
+ else
BUG_ON(mapping->assoc_mapping != buffer_mapping);
- }
- if (list_empty(&bh->b_assoc_buffers)) {
+
+ if (!buffer_dirty(bh) || !bh->b_assoc_map) {
+ mark_buffer_dirty(bh);
spin_lock(&buffer_mapping->private_lock);
- list_move_tail(&bh->b_assoc_buffers,
+ /* Recheck under spinlock */
+ if (!bh->b_assoc_map) {
+ bh->b_assoc_map = mapping;
+ list_add_tail(&bh->b_assoc_buffers,
&mapping->private_list);
- bh->b_assoc_map = mapping;
+ }
spin_unlock(&buffer_mapping->private_lock);
}
}
@@ -772,72 +739,69 @@ int __set_page_dirty_buffers(struct page *page)
EXPORT_SYMBOL(__set_page_dirty_buffers);
/*
- * Write out and wait upon a list of buffers.
+ * Write out and wait upon a list of buffers.
*
- * We have conflicting pressures: we want to make sure that all
- * initially dirty buffers get waited on, but that any subsequently
- * dirtied buffers don't. After all, we don't want fsync to last
- * forever if somebody is actively writing to the file.
+ * We have conflicting pressures: we want to make sure that all initially dirty
+ * buffers get waited on, but that any subsequently dirtied buffers don't.
+ * After all, we don't want fsync to last forever if somebody is actively
+ * writing to the file.
*
- * Do this in two main stages: first we copy dirty buffers to a
- * temporary inode list, queueing the writes as we go. Then we clean
- * up, waiting for those writes to complete.
+ * Do this in two sweeps of the private_list. In the first one we queue writes
+ * of dirty buffers, in the second one we wait for locked buffers.
*
- * During this second stage, any subsequent updates to the file may end
- * up refiling the buffer on the original inode's dirty list again, so
- * there is a chance we will end up with a buffer queued for write but
- * not yet completed on that list. So, as a final cleanup we go through
- * the osync code to catch these locked, dirty buffers without requeuing
- * any newly dirty buffers for write.
+ * The locking here is a bit subtle. We hold i_mutex so we are safe from most
+ * other users of private_list (especially the ones removing buffers from the
+ * list). Only drop_buffers() can still remove a buffer from private_list and
+ * from that we are guarded by the reference to the buffer we acquire. So we
+ * continue scanning the list after we reacquire the private lock.
*/
static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
{
struct buffer_head *bh;
- struct list_head tmp;
- int err = 0, err2;
-
- INIT_LIST_HEAD(&tmp);
+ struct list_head *cur, *next;
+ int err = 0;
spin_lock(lock);
- while (!list_empty(list)) {
- bh = BH_ENTRY(list->next);
- __remove_assoc_queue(bh);
- if (buffer_dirty(bh) || buffer_locked(bh)) {
- list_add(&bh->b_assoc_buffers, &tmp);
- if (buffer_dirty(bh)) {
- get_bh(bh);
- spin_unlock(lock);
- /*
- * Ensure any pending I/O completes so that
- * ll_rw_block() actually writes the current
- * contents - it is a noop if I/O is still in
- * flight on potentially older contents.
- */
- ll_rw_block(SWRITE, 1, &bh);
- brelse(bh);
- spin_lock(lock);
- }
+ list_for_each_entry(bh, list, b_assoc_buffers) {
+ if (buffer_dirty(bh)) {
+ get_bh(bh);
+ spin_unlock(lock);
+ /*
+ * Ensure any pending I/O completes so that
+ * ll_rw_block() actually writes the current
+ * contents - it is a noop if I/O is still in
+ * flight on potentially older contents.
+ */
+ ll_rw_block(SWRITE, 1, &bh);
+ spin_lock(lock);
+ BUG_ON(list_empty(&bh->b_assoc_buffers));
+ brelse(bh);
}
}
- while (!list_empty(&tmp)) {
- bh = BH_ENTRY(tmp.prev);
- list_del_init(&bh->b_assoc_buffers);
+ for (cur = list->next; cur != list; cur = next) {
+ bh = list_entry(cur, struct buffer_head, b_assoc_buffers);
+ if (!buffer_locked(bh)) {
+ next = cur->next;
+ if (!buffer_dirty(bh))
+ __remove_assoc_queue(bh);
+ continue;
+ }
get_bh(bh);
spin_unlock(lock);
wait_on_buffer(bh);
if (!buffer_uptodate(bh))
err = -EIO;
- brelse(bh);
spin_lock(lock);
+ BUG_ON(list_empty(&bh->b_assoc_buffers));
+ next = cur->next;
+ if (!buffer_dirty(bh) && !buffer_locked(bh))
+ __remove_assoc_queue(bh);
+ brelse(bh);
}
-
spin_unlock(lock);
- err2 = osync_buffers_list(lock, list);
- if (err)
- return err;
- else
- return err2;
+
+ return err;
}
/*
@@ -1195,7 +1159,7 @@ void __brelse(struct buffer_head * buf)
void __bforget(struct buffer_head *bh)
{
clear_buffer_dirty(bh);
- if (!list_empty(&bh->b_assoc_buffers)) {
+ if (bh->b_assoc_map) {
struct address_space *buffer_mapping = bh->b_page->mapping;
spin_lock(&buffer_mapping->private_lock);
@@ -3037,7 +3001,7 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
do {
struct buffer_head *next = bh->b_this_page;
- if (!list_empty(&bh->b_assoc_buffers))
+ if (bh->b_assoc_map)
__remove_assoc_queue(bh);
bh = next;
} while (bh != head);
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Fix private_list handling
2008-01-10 15:55 ` Jan Kara
@ 2008-01-11 0:36 ` Andrew Morton
2008-01-11 14:21 ` Jan Kara
0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2008-01-11 0:36 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-kernel, supriyak
On Thu, 10 Jan 2008 16:55:13 +0100
Jan Kara <jack@suse.cz> wrote:
> Hi,
>
> sorry for the previous empty email...
>
> Supriya noted in his testing that sometimes buffers removed by
> __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error
> won't be properly propagated). Actually, looking more into the code I found
> there are some more races. The patch below should fix them. It survived
> beating with LTP and fsstress on ext2 filesystem on my testing machine so
> it should be reasonably bugfree... Andrew, would you put the patch into
> -mm? Thanks.
>
> Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> ---
>
> There are two possible races in handling of private_list in buffer cache.
> 1) When fsync_buffers_list() processes a private_list, it clears
> b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes,
> sees a buffer is on list so it calls __remove_assoc_queue() which complains
> about b_assoc_mapping being cleared (as it cannot propagate possible IO error).
> This race has been actually observed in the wild.
private_lock should prevent this race.
Which call to drop_buffers() is the culprit? The first one in
try_to_free_buffers(), I assume? The "can this still happen?" one?
If so, it can happen. How? Perhaps this is a bug.
> 2) When fsync_buffers_list() processes a private_list,
> mark_buffer_dirty_inode() can be called on bh which is already on the private
> list of fsync_buffers_list(). As buffer is on some list (note that the check is
> performed without private_lock), it is not readded to the mapping's
> private_list and after fsync_buffers_list() finishes, we have a dirty buffer
> which should be on private_list but it isn't. This race has not been reported,
> probably because most (but not all) callers of mark_buffer_dirty_inode() hold
> i_mutex and thus are serialized with fsync().
Maybe fsync_buffers_list should put the buffer back onto private_list if it
got dirtied again.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Fix private_list handling
2008-01-11 0:36 ` Andrew Morton
@ 2008-01-11 14:21 ` Jan Kara
2008-01-11 23:33 ` Andrew Morton
0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2008-01-11 14:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, supriyak
On Thu 10-01-08 16:36:35, Andrew Morton wrote:
> On Thu, 10 Jan 2008 16:55:13 +0100
> Jan Kara <jack@suse.cz> wrote:
>
> > Hi,
> >
> > sorry for the previous empty email...
> >
> > Supriya noted in his testing that sometimes buffers removed by
> > __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error
> > won't be properly propagated). Actually, looking more into the code I found
> > there are some more races. The patch below should fix them. It survived
> > beating with LTP and fsstress on ext2 filesystem on my testing machine so
> > it should be reasonably bugfree... Andrew, would you put the patch into
> > -mm? Thanks.
> >
> > Honza
> > --
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
> > ---
> >
> > There are two possible races in handling of private_list in buffer cache.
> > 1) When fsync_buffers_list() processes a private_list, it clears
> > b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes,
> > sees a buffer is on list so it calls __remove_assoc_queue() which complains
> > about b_assoc_mapping being cleared (as it cannot propagate possible IO error).
> > This race has been actually observed in the wild.
>
> private_lock should prevent this race.
>
> Which call to drop_buffers() is the culprit? The first one in
> try_to_free_buffers(), I assume? The "can this still happen?" one?
>
> If so, it can happen. How? Perhaps this is a bug.
Good question but I don't think so. The problem is that
fsync_buffers_list() drops the private_lock() e.g. when it does
wait_on_buffer(). And buffers on the list fsync_buffers_list() constructs
have b_assoc_mapping set to NULL but the test
!list_empty(bh->b_assoc_buffers) succeeds for them and thus
__remove_assoc_queue() is called and it complains.
We could also silence the warning by leaving b_assoc_mapping set when we
move the buffer to the constructed list. But given the problem below
I've decided to do a more complete cleanup of the code.
> > 2) When fsync_buffers_list() processes a private_list,
> > mark_buffer_dirty_inode() can be called on bh which is already on the private
> > list of fsync_buffers_list(). As buffer is on some list (note that the check is
> > performed without private_lock), it is not readded to the mapping's
> > private_list and after fsync_buffers_list() finishes, we have a dirty buffer
> > which should be on private_list but it isn't. This race has not been reported,
> > probably because most (but not all) callers of mark_buffer_dirty_inode() hold
> > i_mutex and thus are serialized with fsync().
>
> Maybe fsync_buffers_list should put the buffer back onto private_list if it
> got dirtied again.
Yes, that's what it does in my new version. Only the locking is a bit
subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode()
when the buffer is already on some list...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Fix private_list handling
2008-01-11 14:21 ` Jan Kara
@ 2008-01-11 23:33 ` Andrew Morton
2008-01-14 11:59 ` Jan Kara
2008-01-14 18:14 ` Jan Kara
0 siblings, 2 replies; 8+ messages in thread
From: Andrew Morton @ 2008-01-11 23:33 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-kernel, supriyak
On Fri, 11 Jan 2008 15:21:31 +0100
Jan Kara <jack@suse.cz> wrote:
> On Thu 10-01-08 16:36:35, Andrew Morton wrote:
> > On Thu, 10 Jan 2008 16:55:13 +0100
> > Jan Kara <jack@suse.cz> wrote:
> >
> > > Hi,
> > >
> > > sorry for the previous empty email...
> > >
> > > Supriya noted in his testing that sometimes buffers removed by
> > > __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error
> > > won't be properly propagated). Actually, looking more into the code I found
> > > there are some more races. The patch below should fix them. It survived
> > > beating with LTP and fsstress on ext2 filesystem on my testing machine so
> > > it should be reasonably bugfree... Andrew, would you put the patch into
> > > -mm? Thanks.
> > >
> > > Honza
> > > --
> > > Jan Kara <jack@suse.cz>
> > > SUSE Labs, CR
> > > ---
> > >
> > > There are two possible races in handling of private_list in buffer cache.
> > > 1) When fsync_buffers_list() processes a private_list, it clears
> > > b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes,
> > > sees a buffer is on list so it calls __remove_assoc_queue() which complains
> > > about b_assoc_mapping being cleared (as it cannot propagate possible IO error).
> > > This race has been actually observed in the wild.
> >
> > private_lock should prevent this race.
> >
> > Which call to drop_buffers() is the culprit? The first one in
> > try_to_free_buffers(), I assume? The "can this still happen?" one?
> >
> > If so, it can happen. How? Perhaps this is a bug.
> Good question but I don't think so. The problem is that
> fsync_buffers_list() drops the private_lock() e.g. when it does
> wait_on_buffer(). And buffers on the list fsync_buffers_list() constructs
> have b_assoc_mapping set to NULL but the test
> !list_empty(bh->b_assoc_buffers) succeeds for them and thus
> __remove_assoc_queue() is called and it complains.
> We could also silence the warning by leaving b_assoc_mapping set when we
> move the buffer to the constructed list.
Could fsync_buffers_list() leave the buffer not on a list when it does the
ll_rw_block() and only add it to tmp afterwards if it sees that the buffer
is still not on a list?
I guess not, as it still needs to find the buffer to wait upon it.
> But given the problem below
> I've decided to do a more complete cleanup of the code.
Well. Large-scale changes to long-established code like this are a last
resort. We should fully investigate little local fixups first.
> > > 2) When fsync_buffers_list() processes a private_list,
> > > mark_buffer_dirty_inode() can be called on bh which is already on the private
> > > list of fsync_buffers_list(). As buffer is on some list (note that the check is
> > > performed without private_lock), it is not readded to the mapping's
> > > private_list and after fsync_buffers_list() finishes, we have a dirty buffer
> > > which should be on private_list but it isn't. This race has not been reported,
> > > probably because most (but not all) callers of mark_buffer_dirty_inode() hold
> > > i_mutex and thus are serialized with fsync().
> >
> > Maybe fsync_buffers_list should put the buffer back onto private_list if it
> > got dirtied again.
> Yes, that's what it does in my new version. Only the locking is a bit
> subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode()
> when the buffer is already on some list...
I fear rewrites in there. It took five years to find this bug. How long
will it take to find new ones which get added?
Sigh. lists do suck for this sort of thing.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Fix private_list handling
2008-01-11 23:33 ` Andrew Morton
@ 2008-01-14 11:59 ` Jan Kara
2008-01-14 18:14 ` Jan Kara
1 sibling, 0 replies; 8+ messages in thread
From: Jan Kara @ 2008-01-14 11:59 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, supriyak
On Fri 11-01-08 15:33:54, Andrew Morton wrote:
> On Fri, 11 Jan 2008 15:21:31 +0100
> Jan Kara <jack@suse.cz> wrote:
>
> > On Thu 10-01-08 16:36:35, Andrew Morton wrote:
> > > On Thu, 10 Jan 2008 16:55:13 +0100
> > > Jan Kara <jack@suse.cz> wrote:
> > >
> > > > Hi,
> > > >
> > > > sorry for the previous empty email...
> > > >
> > > > Supriya noted in his testing that sometimes buffers removed by
> > > > __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error
> > > > won't be properly propagated). Actually, looking more into the code I found
> > > > there are some more races. The patch below should fix them. It survived
> > > > beating with LTP and fsstress on ext2 filesystem on my testing machine so
> > > > it should be reasonably bugfree... Andrew, would you put the patch into
> > > > -mm? Thanks.
> > > >
> > > > Honza
> > > > --
> > > > Jan Kara <jack@suse.cz>
> > > > SUSE Labs, CR
> > > > ---
> > > >
> > > > There are two possible races in handling of private_list in buffer cache.
> > > > 1) When fsync_buffers_list() processes a private_list, it clears
> > > > b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes,
> > > > sees a buffer is on list so it calls __remove_assoc_queue() which complains
> > > > about b_assoc_mapping being cleared (as it cannot propagate possible IO error).
> > > > This race has been actually observed in the wild.
> > >
> > > private_lock should prevent this race.
> > >
> > > Which call to drop_buffers() is the culprit? The first one in
> > > try_to_free_buffers(), I assume? The "can this still happen?" one?
> > >
> > > If so, it can happen. How? Perhaps this is a bug.
> > Good question but I don't think so. The problem is that
> > fsync_buffers_list() drops the private_lock() e.g. when it does
> > wait_on_buffer(). And buffers on the list fsync_buffers_list() constructs
> > have b_assoc_mapping set to NULL but the test
> > !list_empty(bh->b_assoc_buffers) succeeds for them and thus
> > __remove_assoc_queue() is called and it complains.
> > We could also silence the warning by leaving b_assoc_mapping set when we
> > move the buffer to the constructed list.
>
> Could fsync_buffers_list() leave the buffer not on a list when it does the
> ll_rw_block() and only add it to tmp afterwards if it sees that the buffer
> is still not on a list?
>
> I guess not, as it still needs to find the buffer to wait upon it.
Exactly... We have to have submitted buffers on some list.
> > But given the problem below
> > I've decided to do a more complete cleanup of the code.
>
> Well. Large-scale changes to long-established code like this are a last
> resort. We should fully investigate little local fixups first.
I see. Well, I could do the following minimal changes if it'd make you more
comfortable about it):
1) Don't clear b_assoc_map in fsync_buffers_list() (that would get rid of
race with drop_buffers())
2) Change fsync_buffers_list() to readd buffer into private_list when it
is dirty after we've waited on it. That should get rid of the second
race.
But I still think that unnecessary temporary list is worth a cleanup ;).
> > > > 2) When fsync_buffers_list() processes a private_list,
> > > > mark_buffer_dirty_inode() can be called on bh which is already on the private
> > > > list of fsync_buffers_list(). As buffer is on some list (note that the check is
> > > > performed without private_lock), it is not readded to the mapping's
> > > > private_list and after fsync_buffers_list() finishes, we have a dirty buffer
> > > > which should be on private_list but it isn't. This race has not been reported,
> > > > probably because most (but not all) callers of mark_buffer_dirty_inode() hold
> > > > i_mutex and thus are serialized with fsync().
> > >
> > > Maybe fsync_buffers_list should put the buffer back onto private_list if it
> > > got dirtied again.
> > Yes, that's what it does in my new version. Only the locking is a bit
> > subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode()
> > when the buffer is already on some list...
>
> I fear rewrites in there. It took five years to find this bug. How long
> will it take to find new ones which get added?
Well, I've added that WARN_ON() which warned us about the problem about
14 month ago so it was not that bad. But I agree that bugs are hard to
detect here as they result either in lost IO error or in not writing all
buffers on fsync both of which go mostly unnoticed.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Fix private_list handling
2008-01-11 23:33 ` Andrew Morton
2008-01-14 11:59 ` Jan Kara
@ 2008-01-14 18:14 ` Jan Kara
1 sibling, 0 replies; 8+ messages in thread
From: Jan Kara @ 2008-01-14 18:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, supriyak
On Fri 11-01-08 15:33:54, Andrew Morton wrote:
> On Fri, 11 Jan 2008 15:21:31 +0100
> Jan Kara <jack@suse.cz> wrote:
>
> > On Thu 10-01-08 16:36:35, Andrew Morton wrote:
> > > On Thu, 10 Jan 2008 16:55:13 +0100
> > > Jan Kara <jack@suse.cz> wrote:
> > >
> > > > Hi,
> > > >
> > > > sorry for the previous empty email...
> > > >
> > > > Supriya noted in his testing that sometimes buffers removed by
> > > > __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error
> > > > won't be properly propagated). Actually, looking more into the code I found
> > > > there are some more races. The patch below should fix them. It survived
> > > > beating with LTP and fsstress on ext2 filesystem on my testing machine so
> > > > it should be reasonably bugfree... Andrew, would you put the patch into
> > > > -mm? Thanks.
> > > >
> > > > Honza
> > > > --
> > > > Jan Kara <jack@suse.cz>
> > > > SUSE Labs, CR
> > > > ---
> > > >
> > > > There are two possible races in handling of private_list in buffer cache.
> > > > 1) When fsync_buffers_list() processes a private_list, it clears
> > > > b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes,
> > > > sees a buffer is on list so it calls __remove_assoc_queue() which complains
> > > > about b_assoc_mapping being cleared (as it cannot propagate possible IO error).
> > > > This race has been actually observed in the wild.
> > >
> > > private_lock should prevent this race.
> > >
> > > Which call to drop_buffers() is the culprit? The first one in
> > > try_to_free_buffers(), I assume? The "can this still happen?" one?
> > >
> > > If so, it can happen. How? Perhaps this is a bug.
> > Good question but I don't think so. The problem is that
> > fsync_buffers_list() drops the private_lock() e.g. when it does
> > wait_on_buffer(). And buffers on the list fsync_buffers_list() constructs
> > have b_assoc_mapping set to NULL but the test
> > !list_empty(bh->b_assoc_buffers) succeeds for them and thus
> > __remove_assoc_queue() is called and it complains.
> > We could also silence the warning by leaving b_assoc_mapping set when we
> > move the buffer to the constructed list.
>
> Could fsync_buffers_list() leave the buffer not on a list when it does the
> ll_rw_block() and only add it to tmp afterwards if it sees that the buffer
> is still not on a list?
>
> I guess not, as it still needs to find the buffer to wait upon it.
>
> > But given the problem below
> > I've decided to do a more complete cleanup of the code.
>
> Well. Large-scale changes to long-established code like this are a last
> resort. We should fully investigate little local fixups first.
>
> > > > 2) When fsync_buffers_list() processes a private_list,
> > > > mark_buffer_dirty_inode() can be called on bh which is already on the private
> > > > list of fsync_buffers_list(). As buffer is on some list (note that the check is
> > > > performed without private_lock), it is not readded to the mapping's
> > > > private_list and after fsync_buffers_list() finishes, we have a dirty buffer
> > > > which should be on private_list but it isn't. This race has not been reported,
> > > > probably because most (but not all) callers of mark_buffer_dirty_inode() hold
> > > > i_mutex and thus are serialized with fsync().
> > >
> > > Maybe fsync_buffers_list should put the buffer back onto private_list if it
> > > got dirtied again.
> > Yes, that's what it does in my new version. Only the locking is a bit
> > subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode()
> > when the buffer is already on some list...
>
> I fear rewrites in there. It took five years to find this bug. How long
> will it take to find new ones which get added?
>
> Sigh. lists do suck for this sort of thing.
OK, below is a minimal fix for the problems...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
---
There are two possible races in handling of private_list in buffer cache.
1) When fsync_buffers_list() processes a private_list, it clears
b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes,
sees a buffer is on list so it calls __remove_assoc_queue() which complains
about b_assoc_mapping being cleared (as it cannot propagate possible IO error).
This race has been actually observed in the wild.
2) When fsync_buffers_list() processes a private_list,
mark_buffer_dirty_inode() can be called on bh which is already on the private
list of fsync_buffers_list(). As buffer is on some list (note that the check is
performed without private_lock), it is not readded to the mapping's
private_list and after fsync_buffers_list() finishes, we have a dirty buffer
which should be on private_list but it isn't. This race has not been reported,
probably because most (but not all) callers of mark_buffer_dirty_inode() hold
i_mutex and thus are serialized with fsync().
Fix these issues by not clearing b_assoc_map when fsync_buffers_list() moves
buffer to a dedicated list and by reinserting buffer in private_list when
it is found dirty after the IO has completed.
Signed-off-by: Jan Kara <jack@suse.cz>
diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..3ffb2b6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -794,6 +794,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
{
struct buffer_head *bh;
struct list_head tmp;
+ struct address_space *mapping;
int err = 0, err2;
INIT_LIST_HEAD(&tmp);
@@ -801,9 +802,11 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
spin_lock(lock);
while (!list_empty(list)) {
bh = BH_ENTRY(list->next);
+ mapping = bh->b_assoc_map;
__remove_assoc_queue(bh);
if (buffer_dirty(bh) || buffer_locked(bh)) {
list_add(&bh->b_assoc_buffers, &tmp);
+ bh->b_assoc_map = mapping;
if (buffer_dirty(bh)) {
get_bh(bh);
spin_unlock(lock);
@@ -828,8 +831,13 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
wait_on_buffer(bh);
if (!buffer_uptodate(bh))
err = -EIO;
- brelse(bh);
spin_lock(lock);
+ if (buffer_dirty(bh) && list_empty(&bh->b_assoc_buffers)) {
+ BUG_ON(!bh->b_assoc_map);
+ list_add(&bh->b_assoc_buffers,
+ &bh->b_assoc_map->private_list);
+ }
+ brelse(bh);
}
spin_unlock(lock);
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH] Fix private_list handling
@ 2008-02-04 16:44 Jan Kara
0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2008-02-04 16:44 UTC (permalink / raw)
To: Andrew Morton; +Cc: Nick Piggin, linux-kernel
There are two possible races in handling of private_list in buffer cache.
1) When fsync_buffers_list() processes a private_list, it clears
b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes,
sees a buffer is on list so it calls __remove_assoc_queue() which complains
about b_assoc_mapping being cleared (as it cannot propagate possible IO error).
This race has been actually observed in the wild.
2) When fsync_buffers_list() processes a private_list,
mark_buffer_dirty_inode() can be called on bh which is already on the private
list of fsync_buffers_list(). As buffer is on some list (note that the check is
performed without private_lock), it is not readded to the mapping's
private_list and after fsync_buffers_list() finishes, we have a dirty buffer
which should be on private_list but it isn't. This race has not been reported,
probably because most (but not all) callers of mark_buffer_dirty_inode() hold
i_mutex and thus are serialized with fsync().
Fix these issues by not clearing b_assoc_map when fsync_buffers_list() moves
buffer to a dedicated list and by reinserting buffer in private_list when
it is found dirty after we have submitted buffer for IO. We also change the
tests whether a buffer is on a private list from
!list_empty(&bh->b_assoc_buffers) to bh->b_assoc_map
so that they are single word reads and hence lockless checks are safe.
Signed-off-by: Jan Kara <jack@suse.cz>
CC: Nick Piggin <npiggin@suse.de>
---
Hi Andrew,
this is the latest version of the fix for private_list handling races. The
changes are now smaller so I hope you'll like the patch more... Please
apply if you now think the patch is fine. Thanks.
Honza
diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..76e1ab4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -678,7 +678,7 @@ void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
} else {
BUG_ON(mapping->assoc_mapping != buffer_mapping);
}
- if (list_empty(&bh->b_assoc_buffers)) {
+ if (!bh->b_assoc_map) {
spin_lock(&buffer_mapping->private_lock);
list_move_tail(&bh->b_assoc_buffers,
&mapping->private_list);
@@ -794,6 +794,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
{
struct buffer_head *bh;
struct list_head tmp;
+ struct address_space *mapping;
int err = 0, err2;
INIT_LIST_HEAD(&tmp);
@@ -801,9 +802,14 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
spin_lock(lock);
while (!list_empty(list)) {
bh = BH_ENTRY(list->next);
+ mapping = bh->b_assoc_map;
__remove_assoc_queue(bh);
+ /* Avoid race with mark_buffer_dirty_inode() which does
+ * a lockless check and we rely on seeing the dirty bit */
+ smp_mb();
if (buffer_dirty(bh) || buffer_locked(bh)) {
list_add(&bh->b_assoc_buffers, &tmp);
+ bh->b_assoc_map = mapping;
if (buffer_dirty(bh)) {
get_bh(bh);
spin_unlock(lock);
@@ -822,8 +828,17 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
while (!list_empty(&tmp)) {
bh = BH_ENTRY(tmp.prev);
- list_del_init(&bh->b_assoc_buffers);
get_bh(bh);
+ mapping = bh->b_assoc_map;
+ __remove_assoc_queue(bh);
+ /* Avoid race with mark_buffer_dirty_inode() which does
+ * a lockless check and we rely on seeing the dirty bit */
+ smp_mb();
+ if (buffer_dirty(bh)) {
+ list_add(&bh->b_assoc_buffers,
+ &bh->b_assoc_map->private_list);
+ bh->b_assoc_map = mapping;
+ }
spin_unlock(lock);
wait_on_buffer(bh);
if (!buffer_uptodate(bh))
@@ -1195,7 +1210,7 @@ void __brelse(struct buffer_head * buf)
void __bforget(struct buffer_head *bh)
{
clear_buffer_dirty(bh);
- if (!list_empty(&bh->b_assoc_buffers)) {
+ if (bh->b_assoc_map) {
struct address_space *buffer_mapping = bh->b_page->mapping;
spin_lock(&buffer_mapping->private_lock);
@@ -3037,7 +3052,7 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
do {
struct buffer_head *next = bh->b_this_page;
- if (!list_empty(&bh->b_assoc_buffers))
+ if (bh->b_assoc_map)
__remove_assoc_queue(bh);
bh = next;
} while (bh != head);
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2008-02-04 16:44 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-10 15:50 [PATCH] Fix private_list handling Jan Kara
2008-01-10 15:55 ` Jan Kara
2008-01-11 0:36 ` Andrew Morton
2008-01-11 14:21 ` Jan Kara
2008-01-11 23:33 ` Andrew Morton
2008-01-14 11:59 ` Jan Kara
2008-01-14 18:14 ` Jan Kara
2008-02-04 16:44 Jan Kara
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).