LKML Archive on lore.kernel.org
* Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Mikulas Patocka @ 2008-11-10 13:11 UTC
  To: Chandra Seetharaman; +Cc: Alasdair G Kergon, dm-devel, linux-kernel, axboe

Hi

Your analysis of the bug is correct.

Regarding the patch --- you need to wake up md->wait when the decrement 
drops md->pending to zero, so that dm_suspend does not wait forever.
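
Something like this at the decrement site (just a sketch of the missing 
wakeup; md->pending and md->wait already exist in struct mapped_device):

        /* the last dropper of md->pending must wake dm_suspend(), which
         * sleeps on md->wait until the pending count reaches zero */
        if (atomic_dec_and_test(&md->pending))
                wake_up(&md->wait);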

Otherwise the patch is correct, but it contains some unneeded parts: if 
md->pending is used to block removal, we no longer need to hold a 
reference on the table at all. If the code is already called under a 
spinlock (as in the upstream kernel), we could execute everything under 
the md->map_lock spinlock and drop the reference counting altogether. But 
the code can take an indefinite amount of time, so it would be best not 
to call it under a spinlock in the first place.

For upstream Linux developers: you are holding a spinlock while calling 
the bdi*_congested functions, which can take an indefinite amount of time 
(there are even users reporting 50 or so disks in one logical volume). I 
think it would be good to move these calls out of spinlocks.

What are the general locking rules for these block queue upcalls 
(congested_fn, unplug_fn, merge_bvec_fn and others)? It looks like there 
may be more bugs like this in other upcalls. Can you please specify what 
an upcall may and may not do?
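
For reference, the signatures of these upcalls, paraphrased from 
include/linux/backing-dev.h and include/linux/blkdev.h of this era 
(check the exact tree before relying on them):

        typedef int (congested_fn)(void *congested_data, int bdi_bits);
        typedef void (unplug_fn)(struct request_queue *q);
        typedef int (merge_bvec_fn)(struct request_queue *q,
                                    struct bvec_merge_data *bvm,
                                    struct bio *bio);

congested_fn and its congested_data argument live in struct 
backing_dev_info; the other hooks hang off struct request_queue.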

Mikulas

On Wed, 5 Nov 2008, Chandra Seetharaman wrote:

> dm_any_congested() just checks the DMF_BLOCK_IO flag and has no
> code to make sure that suspend waits for dm_any_congested() to
> complete.
> 
> This leads to a case where table_destroy(), free_multipath(),
> and some other sleeping functions are called (through dm_table_put())
> from dm_any_congested().
> 
> This leads to two problems:
> 1. Sleeping functions are called from the congested code, whose caller
>    holds a spinlock.
> 2. An ABBA deadlock between pdflush and multipathd. The two locks
>    in contention are the inode lock and the kernel lock.
>    Here is the crash analysis:
> PID: 16879  TASK: ffff81013a06a140  CPU: 3   COMMAND: "pdflush"
> Owns inode_lock and waiting for kernel_sem
> 
> PID: 8299   TASK: ffff81024f03e100  CPU: 2   COMMAND: "multipathd"
> Owns kernel_sem and waiting for inode_lock.
> 
> 
> PID: 8299   TASK: ffff81024f03e100  CPU: 2   COMMAND: "multipathd"
>  #0 [ffff81024ad338c8] schedule at ffffffff8128534c
>  #1 [ffff81024ad33980] rt_spin_lock_slowlock at ffffffff81286d15  << Waiting for inode_lock
>  #2 [ffff81024ad33a40] __rt_spin_lock at ffffffff812873b0
>  #3 [ffff81024ad33a50] rt_spin_lock at ffffffff812873bb
>  #4 [ffff81024ad33a60] ifind_fast at ffffffff810c32a1
>  #5 [ffff81024ad33a90] iget_locked at ffffffff810c3be8
>  #6 [ffff81024ad33ad0] proc_get_inode at ffffffff810ebaff
>  #7 [ffff81024ad33b10] proc_lookup at ffffffff810f082a
>  #8 [ffff81024ad33b40] proc_root_lookup at ffffffff810ec312
>  #9 [ffff81024ad33b70] do_lookup at ffffffff810b78f7
> #10 [ffff81024ad33bc0] __link_path_walk at ffffffff810b9a93
> #11 [ffff81024ad33c60] link_path_walk at ffffffff810b9fc1
> #12 [ffff81024ad33d30] path_walk at ffffffff810ba073
> #13 [ffff81024ad33d40] do_path_lookup at ffffffff810ba37a
> #14 [ffff81024ad33d90] __path_lookup_intent_open at ffffffff810baeb0
> #15 [ffff81024ad33de0] path_lookup_open at ffffffff810baf60
> #16 [ffff81024ad33df0] open_namei at ffffffff810bb071
> #17 [ffff81024ad33e80] do_filp_open at ffffffff810ae610
> #18 [ffff81024ad33f30] do_sys_open at ffffffff810ae67f
> #19 [ffff81024ad33f70] sys_open at ffffffff810ae729
> 
> 
> PID: 16879  TASK: ffff81013a06a140  CPU: 3   COMMAND: "pdflush"
>  #0 [ffff810023063aa0] schedule at ffffffff8128534c
>  #1 [ffff810023063b58] rt_mutex_slowlock at ffffffff81286ac5  << Waiting for the kernel lock
>  #2 [ffff810023063c28] rt_mutex_lock at ffffffff81285fb4
>  #3 [ffff810023063c38] rt_down at ffffffff8105fec7
>  #4 [ffff810023063c58] lock_kernel at ffffffff81287b8c
>  #5 [ffff810023063c78] __blkdev_put at ffffffff810d5d31
>  #6 [ffff810023063cb8] blkdev_put at ffffffff810d5e68
>  #7 [ffff810023063cc8] close_dev at ffffffff8819e547
>  #8 [ffff810023063ce8] dm_put_device at ffffffff8819e579
>  #9 [ffff810023063d08] free_priority_group at ffffffff881c0e86
> #10 [ffff810023063d58] free_multipath at ffffffff881c0f11
> #11 [ffff810023063d78] multipath_dtr at ffffffff881c0f73
> #12 [ffff810023063d98] dm_table_put at ffffffff8819e347
> #13 [ffff810023063dc8] dm_any_congested at ffffffff8819d074
> #14 [ffff810023063df8] sync_sb_inodes at ffffffff810cd451
> #15 [ffff810023063e38] writeback_inodes at ffffffff810cd7b5
> #16 [ffff810023063e68] background_writeout at ffffffff8108be38
> #17 [ffff810023063ed8] pdflush at ffffffff8108c79a
> #18 [ffff810023063f28] kthread at ffffffff81051477
> #19 [ffff810023063f48] kernel_thread at ffffffff8100d048
> 
> crash> kernel_sem
> kernel_sem = $5 = {
>   count = {
>     counter = 0
>   }, 
>   lock = {
>     wait_lock = {
>       raw_lock = {
>         slock = 34952
>       }, 
>       break_lock = 0
>     }, 
>     wait_list = {
>       prio_list = {
>         next = 0xffff810023063b88, 
>         prev = 0xffff81014941fdc0
>       }, 
>       node_list = {
>         next = 0xffff810023063b98, 
>         prev = 0xffff81014d04ddd0
>       }
>     }, 
>     owner = 0xffff81024f03e102  << multipathd owns it (task 0xffff81024f03e100);
>                                    the last two bits are flags, so treat them as 0.
>   }
> }
> crash> inode_lock
> inode_lock = $6 = {
>   lock = {
>     wait_lock = {
>       raw_lock = {
>         slock = 3341
>       }, 
>       break_lock = 0
>     }, 
>     wait_list = {
>       prio_list = {
>         next = 0xffff81024ad339a0, 
>         prev = 0xffff8100784fdd78
>       }, 
>       node_list = {
>         next = 0xffff81024ad339b0, 
>         prev = 0xffff81024afb3a70
>       }
>     }, 
>     owner = 0xffff81013a06a142  << pdflush owns it (task 0xffff81013a06a140)
>   }, 
>   break_lock = 0
> }
> crash> 
> ----------------
> 
> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
> ----
> 
> Index: linux-2.6.28-rc3/drivers/md/dm.c
> ===================================================================
> --- linux-2.6.28-rc3.orig/drivers/md/dm.c
> +++ linux-2.6.28-rc3/drivers/md/dm.c
> @@ -937,16 +937,21 @@ static void dm_unplug_all(struct request
>  
>  static int dm_any_congested(void *congested_data, int bdi_bits)
>  {
> -	int r;
> +	int r = bdi_bits;
>  	struct mapped_device *md = (struct mapped_device *) congested_data;
> -	struct dm_table *map = dm_get_table(md);
> +	struct dm_table *map;
>  
> -	if (!map || test_bit(DMF_BLOCK_IO, &md->flags))
> -		r = bdi_bits;
> -	else
> -		r = dm_table_any_congested(map, bdi_bits);
> +	atomic_inc(&md->pending);
> +	if (test_bit(DMF_BLOCK_IO, &md->flags))
> +		goto done;
>  
> -	dm_table_put(map);
> +	map = dm_get_table(md);
> +	if (map) {
> +		r = dm_table_any_congested(map, bdi_bits);
> +		dm_table_put(map);
> +	}
> +done:
> +	atomic_dec(&md->pending);
>  	return r;
>  }
>  
> 
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 


* Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Christoph Hellwig @ 2008-11-10 13:54 UTC
  To: Mikulas Patocka
  Cc: Chandra Seetharaman, Alasdair G Kergon, dm-devel, linux-kernel, axboe

On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> For upstream Linux developers: you are holding a spinlock while calling 
> the bdi*_congested functions, which can take an indefinite amount of time 
> (there are even users reporting 50 or so disks in one logical volume). I 
> think it would be good to move these calls out of spinlocks.

Umm, they shouldn't block that long, as that completely defeats their
purpose.  These functions are mostly used to avoid throwing more I/O at
a congested device if pdflush could do more useful things instead.  But
if it blocks in those functions anyway we wouldn't have to bother using
them.  Do you have more details about the use cases where this happens
and where the routines spend so much time?



* Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Mikulas Patocka @ 2008-11-10 14:19 UTC
  To: Christoph Hellwig
  Cc: Chandra Seetharaman, Alasdair G Kergon, dm-devel, linux-kernel, axboe

On Mon, 10 Nov 2008, Christoph Hellwig wrote:

> On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> > For upstream Linux developers: you are holding a spinlock while calling 
> > the bdi*_congested functions, which can take an indefinite amount of time 
> > (there are even users reporting 50 or so disks in one logical volume). I 
> > think it would be good to move these calls out of spinlocks.
> 
> Umm, they shouldn't block that long, as that completely defeats their
> purpose.  These functions are mostly used to avoid throwing more I/O at
> a congested device if pdflush could do more useful things instead.  But
> if it blocks in those functions anyway we wouldn't have to bother using
> them.  Do you have more details about the use cases where this happens
> and where the routines spend so much time?

For device mapper, congested_fn asks every device in the tree and ORs 
their bits together --- so if the user has 50 devices, it asks them all.

For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the 
same --- it asks every device.
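
The walk is roughly the following (paraphrasing dm_table_any_congested 
from dm-table.c; treat it as a sketch rather than the exact source):

        int dm_table_any_congested(struct dm_table *t, int bdi_bits)
        {
                struct dm_dev *dd;
                struct list_head *devices = dm_table_get_devices(t);
                int r = 0;

                /* OR together the congestion bits of every table member */
                list_for_each_entry(dd, devices, list) {
                        struct request_queue *q = bdev_get_queue(dd->bdev);

                        r |= bdi_congested(&q->backing_dev_info, bdi_bits);
                }

                return r;
        }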

If you have a better idea how to implement congested_fn, please say so.

Mikulas


* Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Peter Zijlstra @ 2008-11-10 14:26 UTC
  To: Mikulas Patocka
  Cc: Christoph Hellwig, Chandra Seetharaman, Alasdair G Kergon,
	dm-devel, linux-kernel, axboe

On Mon, 2008-11-10 at 09:19 -0500, Mikulas Patocka wrote:
> On Mon, 10 Nov 2008, Christoph Hellwig wrote:
> 
> > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> > > For upstream Linux developers: you are holding a spinlock while calling 
> > > the bdi*_congested functions, which can take an indefinite amount of time 
> > > (there are even users reporting 50 or so disks in one logical volume). I 
> > > think it would be good to move these calls out of spinlocks.
> > 
> > Umm, they shouldn't block that long, as that completely defeats their
> > purpose.  These functions are mostly used to avoid throwing more I/O at
> > a congested device if pdflush could do more useful things instead.  But
> > if it blocks in those functions anyway we wouldn't have to bother using
> > them.  Do you have more details about the use cases where this happens
> > and where the routines spend so much time?
> 
> For device mapper, congested_fn asks every device in the tree and ORs 
> their bits together --- so if the user has 50 devices, it asks them all.
> 
> For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the 
> same --- it asks every device.
> 
> If you have a better idea how to implement congested_fn, please say so.

Fix the infrastructure by adding a function call so that you can have
the individual devices report their congestion state to the aggregate.

Then congested_fn can return a valid state in O(1) because the state is
kept up-to-date by the individual state changes.

IOW, add a set_congested_fn() and clear_congested_fn().
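
A hypothetical shape of that interface (the names here are invented, 
just to illustrate the O(1) aggregate):

        struct aggregate_bdi {
                atomic_t nr_congested;  /* members currently congested */
        };

        /* called by a member device on its congestion transitions */
        static void member_set_congested(struct aggregate_bdi *a)
        {
                atomic_inc(&a->nr_congested);
        }

        static void member_clear_congested(struct aggregate_bdi *a)
        {
                atomic_dec(&a->nr_congested);
        }

        /* the aggregate's congested_fn becomes a constant-time read */
        static int aggregate_congested(void *congested_data, int bdi_bits)
        {
                struct aggregate_bdi *a = congested_data;

                return atomic_read(&a->nr_congested) ? bdi_bits : 0;
        }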


* Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Alasdair G Kergon @ 2008-11-10 14:27 UTC
  To: Christoph Hellwig; +Cc: Mikulas Patocka, axboe, dm-devel, linux-kernel

On Mon, Nov 10, 2008 at 08:54:01AM -0500, Christoph Hellwig wrote:
> On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> > For upstream Linux developers: you are holding a spinlock while calling 
> > the bdi*_congested functions, which can take an indefinite amount of time 
> > (there are even users reporting 50 or so disks in one logical volume). I 
> > think it would be good to move these calls out of spinlocks.
> Umm, they shouldn't block that long, as that completely defeats their
> purpose.  

Indeed - the blocking was a bug for which there's a patch, but that doesn't
deal with how the function should be operating in the first place.

- If one device is found to be congested, why bother checking the remaining
devices?

- If the device is suspended, the response should be that it is congested, I'd
have thought.

Alasdair
-- 
agk@redhat.com


* Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Mikulas Patocka @ 2008-11-10 14:32 UTC
  To: Peter Zijlstra
  Cc: Christoph Hellwig, Chandra Seetharaman, Alasdair G Kergon,
	dm-devel, linux-kernel, axboe



On Mon, 10 Nov 2008, Peter Zijlstra wrote:

> On Mon, 2008-11-10 at 09:19 -0500, Mikulas Patocka wrote:
> > On Mon, 10 Nov 2008, Christoph Hellwig wrote:
> > 
> > > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> > > > For upstream Linux developers: you are holding a spinlock while calling 
> > > > the bdi*_congested functions, which can take an indefinite amount of time 
> > > > (there are even users reporting 50 or so disks in one logical volume). I 
> > > > think it would be good to move these calls out of spinlocks.
> > > 
> > > Umm, they shouldn't block that long, as that completely defeats their
> > > purpose.  These functions are mostly used to avoid throwing more I/O at
> > > a congested device if pdflush could do more useful things instead.  But
> > > if it blocks in those functions anyway we wouldn't have to bother using
> > > them.  Do you have more details about the use cases where this happens
> > > and where the routines spend so much time?
> > 
> > For device mapper, congested_fn asks every device in the tree and ORs 
> > their bits together --- so if the user has 50 devices, it asks them all.
> > 
> > For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the 
> > same --- it asks every device.
> > 
> > If you have a better idea how to implement congested_fn, please say so.
> 
> Fix the infrastructure by adding a function call so that you can have
> the individual devices report their congestion state to the aggregate.
> 
> Then congested_fn can return a valid state in O(1) because the state is
> kept up-to-date by the individual state changes.
> 
> IOW, add a set_congested_fn() and clear_congested_fn().

If you have a physical disk with many LVM volumes on it, you end up in 
a situation where a disk congestion state change is reported to all the 
volumes. So it would create an O(n) problem on the other side.

Mikulas


* Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Alasdair G Kergon @ 2008-11-10 14:35 UTC
  To: Peter Zijlstra
  Cc: Mikulas Patocka, Christoph Hellwig, Chandra Seetharaman,
	dm-devel, linux-kernel, axboe

On Mon, Nov 10, 2008 at 03:26:50PM +0100, Peter Zijlstra wrote:
> Fix the infrastructure by adding a function call so that you can have
> the individual devices report their congestion state to the aggregate.
 
Which requires knowing/accessing some device hierarchy, which would be nice
for other things too (like propagation of notification of changes to device
properties such as size, hardsect size etc.).

Alasdair
-- 
agk@redhat.com


* Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Mikulas Patocka @ 2008-11-10 14:40 UTC
  To: Alasdair G Kergon; +Cc: Christoph Hellwig, axboe, dm-devel, linux-kernel



On Mon, 10 Nov 2008, Alasdair G Kergon wrote:

> On Mon, Nov 10, 2008 at 08:54:01AM -0500, Christoph Hellwig wrote:
> > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> > > For upstream Linux developers: you are holding a spinlock while calling 
> > > the bdi*_congested functions, which can take an indefinite amount of time 
> > > (there are even users reporting 50 or so disks in one logical volume). I 
> > > think it would be good to move these calls out of spinlocks.
> > Umm, they shouldn't block that long, as that completely defeats their
> > purpose.  
> 
> Indeed - the blocking was a bug for which there's a patch, but that doesn't
> deal with how the function should be operating in the first place.

To me it looks like the bug is more severe --- holding a reference to 
the table can happen in other upcalls too, and in many other dm places.

I'm considering whether we could call the table destructor at just one 
place (one that would wait until all references are gone), instead of 
calling the destructor at each dm_table_put point (there are many of 
these points and it's hard to check all of them for correctness).
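
A sketch of that idea (t->holders is the existing reference count; 
t->wait would be a new wait queue head that dm_table_put would wake):

        void dm_table_destroy(struct dm_table *t)
        {
                /* wait until all transient references (e.g. from upcalls)
                 * are dropped, then free the table in exactly one place */
                wait_event(t->wait, atomic_read(&t->holders) == 0);
                table_destroy(t);
        }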

> - If one device is found to be congested, why bother checking the remaining
> devices?

Yes, that could be improved. But it doesn't solve the O(n) problem.
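
Concretely, the early exit would be something like this inside the 
dm_table_any_congested walk (a sketch, not a tested patch):

        struct list_head *devices = dm_table_get_devices(t);

        list_for_each_entry(dd, devices, list) {
                struct request_queue *q = bdev_get_queue(dd->bdev);

                r |= bdi_congested(&q->backing_dev_info, bdi_bits);
                if ((r & bdi_bits) == bdi_bits)
                        break;  /* every bit the caller asked about is set */
        }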

> - If the device is suspended, the response should be that it is 
> congested, I'd have thought.

Yes, but these congestion upcalls are used only as an optimization, and 
the device is suspended for such a short time that it doesn't matter 
whether we optimize I/O access during that brief moment or not.

Mikulas

> Alasdair
> -- 
> agk@redhat.com
> 


* Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Peter Zijlstra @ 2008-11-10 14:46 UTC
  To: Mikulas Patocka
  Cc: Christoph Hellwig, Chandra Seetharaman, Alasdair G Kergon,
	dm-devel, linux-kernel, axboe

On Mon, 2008-11-10 at 09:32 -0500, Mikulas Patocka wrote:
> 
> On Mon, 10 Nov 2008, Peter Zijlstra wrote:
> 
> > On Mon, 2008-11-10 at 09:19 -0500, Mikulas Patocka wrote:
> > > On Mon, 10 Nov 2008, Christoph Hellwig wrote:
> > > 
> > > > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> > > > > For upstream Linux developers: you are holding a spinlock while calling 
> > > > > the bdi*_congested functions, which can take an indefinite amount of time 
> > > > > (there are even users reporting 50 or so disks in one logical volume). I 
> > > > > think it would be good to move these calls out of spinlocks.
> > > > 
> > > > Umm, they shouldn't block that long, as that completely defeats their
> > > > purpose.  These functions are mostly used to avoid throwing more I/O at
> > > > a congested device if pdflush could do more useful things instead.  But
> > > > if it blocks in those functions anyway we wouldn't have to bother using
> > > > them.  Do you have more details about the use cases where this happens
> > > > and where the routines spend so much time?
> > > 
> > > For device mapper, congested_fn asks every device in the tree and ORs 
> > > their bits together --- so if the user has 50 devices, it asks them all.
> > > 
> > > For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the 
> > > same --- it asks every device.
> > > 
> > > If you have a better idea how to implement congested_fn, please say so.
> > 
> > Fix the infrastructure by adding a function call so that you can have
> > the individual devices report their congestion state to the aggregate.
> > 
> > Then congested_fn can return a valid state in O(1) because the state is
> > kept up-to-date by the individual state changes.
> > 
> > IOW, add a set_congested_fn() and clear_congested_fn().
> 
> If you have a physical disk with many LVM volumes on it, you end up in 
> a situation where a disk congestion state change is reported to all the 
> volumes. So it would create an O(n) problem on the other side.

*sigh* I can almost understand why people want to use lvm to combine
multiple disks, but why make the partition thing even worse...


* Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Mikulas Patocka @ 2008-11-10 14:55 UTC
  To: Peter Zijlstra
  Cc: Christoph Hellwig, Chandra Seetharaman, Alasdair G Kergon,
	dm-devel, linux-kernel, axboe

On Mon, 10 Nov 2008, Peter Zijlstra wrote:

> > If you have a physical disk with many LVM volumes on it, you end up in 
> > a situation where a disk congestion state change is reported to all the 
> > volumes. So it would create an O(n) problem on the other side.
> 
> *sigh* I can almost understand why people want to use lvm to combine
> multiple disks, but why make the partition thing even worse...

To protect the services from each other --- so that a mail storm won't 
flood users' directories, users who fill up their space can't break the 
system updates, logs are kept safely even if another partition overflows, 
and so on.

I just know some admins who prefer separate filesystems to quotas for 
this purpose.

Mikulas


* Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Christoph Hellwig @ 2008-11-11 18:18 UTC
  To: Mikulas Patocka
  Cc: Christoph Hellwig, Chandra Seetharaman, Alasdair G Kergon,
	dm-devel, linux-kernel, axboe

On Mon, Nov 10, 2008 at 09:19:21AM -0500, Mikulas Patocka wrote:
> For device mapper, congested_fn asks every device in the tree and ORs 
> their bits together --- so if the user has 50 devices, it asks them all.
> 
> For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the 
> same --- it asks every device.
> 
> If you have a better idea how to implement congested_fn, please say so.

Well, even asking 50 devices shouldn't take that long normally.  Do you
take some sleeping lock that might block forever?  In that case I would
just return congested for that device as it would delay pdflush
otherwise.
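
I.e. something like this in the upcall (an illustration only; the lock 
name is invented):

        /* never sleep in a congested upcall: if the lock is contended,
         * claim congestion rather than block pdflush */
        if (!spin_trylock(&md->some_lock))
                return bdi_bits;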



* Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Christoph Hellwig @ 2008-11-11 18:21 UTC
  To: Christoph Hellwig, Mikulas Patocka, axboe, dm-devel, linux-kernel

On Mon, Nov 10, 2008 at 02:27:15PM +0000, Alasdair G Kergon wrote:
> Indeed - the blocking was a bug for which there's a patch, but that doesn't
> deal with how the function should be operating in the first place.
> 
> - If one device is found to be congested, why bother checking the remaining
> devices?
> 
> - If the device is suspended, the response should be that it is congested, I'd
> have thought.

Yes.  The device congested check is quick and dirty: if you encounter 
anything that takes long, just return congested if I/O would hit that 
long delay too, or not congested if you can't really know.



* Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Christoph Hellwig @ 2008-11-11 18:22 UTC
  To: Mikulas Patocka
  Cc: Alasdair G Kergon, Christoph Hellwig, axboe, dm-devel, linux-kernel

On Mon, Nov 10, 2008 at 09:40:18AM -0500, Mikulas Patocka wrote:
> > - If the device is suspended, the response should be that it is 
> > congested, I'd have thought.
> 
> Yes, but these congestion upcalls are used only as an optimization, and 
> the device is suspended for such a short time that it doesn't matter 
> whether we optimize I/O access during that brief moment or not.

We use the device congested check to avoid blocking pdflush on a device 
that can't currently accept I/O.  For a suspended device, congested is 
always the correct answer; pdflush will just come back a little later and 
in the meantime write out stuff that doesn't delay it.



* Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)
From: Mikulas Patocka @ 2008-11-11 22:08 UTC
  To: Christoph Hellwig
  Cc: Chandra Seetharaman, Alasdair G Kergon, dm-devel, linux-kernel, axboe

On Tue, 11 Nov 2008, Christoph Hellwig wrote:

> On Mon, Nov 10, 2008 at 09:19:21AM -0500, Mikulas Patocka wrote:
> > For device mapper, congested_fn asks every device in the tree and ORs 
> > their bits together --- so if the user has 50 devices, it asks them all.
> > 
> > For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the 
> > same --- it asks every device.
> > 
> > If you have a better idea how to implement congested_fn, please say so.
> 
> Well, even asking 50 devices shouldn't take that long normally.

There are some atomic operations inside congested_fn.

I remember that there was some database benchmark on a setup with 48 
devices. Just dropping one spinlock from the unplug function resulted in 
an 18% overall improvement (in the whole run, not in microbenchmarks). 
The slowdown with unplug is similar to that with congested_fn --- unplug 
walks all the physical devices for a given logical device and unplugs 
them.

So simplifying congested_fn might help too. Do you have an idea how often 
it is called? I thought about caching the return value for one jiffy.
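
The caching could look something like this (congested_cache and 
congested_stamp would be new, so far hypothetical fields in struct 
mapped_device, and a real version would also have to account for callers 
passing different bdi_bits):

        static int dm_any_congested(void *congested_data, int bdi_bits)
        {
                struct mapped_device *md = congested_data;

                /* recompute at most once per jiffy; otherwise return the
                 * cached answer without walking the device tree */
                if (time_after(jiffies, md->congested_stamp)) {
                        /* __dm_recompute_congested() stands for the
                         * existing walk over the table's devices */
                        md->congested_cache =
                                __dm_recompute_congested(md, bdi_bits);
                        md->congested_stamp = jiffies;
                }

                return md->congested_cache;
        }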

> Do you take some sleeping lock that might block forever?  In that case I 
> would just return congested for that device as it would delay pdflush 
> otherwise.

Sometimes we do, and we shouldn't, because congested_fn is called under 
a spinlock. That's what I'm working on now.

BTW, how "realtime" do you expect the Linux kernel to be? Is there some 
effort to ensure all spinlocks are held only for a finite time (so that 
someone could one day calculate the hold times of all spinlocks, take 
the maximum, and give Linux a "realtime" label)? If so, then the 
bdi_congested call should be moved out of spinlocks. If there are already 
many places where a spinlock can be taken and a list of indefinite length 
walked inside it, then it isn't worth optimizing for this one.

Mikulas


