LKML Archive on lore.kernel.org
* [PATCH 0/10] On-stack explicit block queue plugging
@ 2011-01-22  1:17 Jens Axboe
  2011-01-22  1:17 ` [PATCH 01/10] block: add API for delaying work/request_fn a little bit Jens Axboe
                   ` (9 more replies)
  0 siblings, 10 replies; 152+ messages in thread
From: Jens Axboe @ 2011-01-22  1:17 UTC (permalink / raw)
  To: jaxboe, linux-kernel; +Cc: hch

Hi,

This is something that I have been sitting on for a while and
I finally got a bit of time to bring it up to date and at least
ensure that basic functionality was there.

Currently we use what I call implicit IO plugging for block devices.
This means that the target block device queue may or may not be
"plugged" when someone submits IO. The IO submitter has to
ensure that the IO is sent off by calling a function to submit
it. This ugliness propagates through to the vm, which needs a
->sync_page() hook to ensure that things are submitted if someone
ends up waiting on a page.

Additionally, queue plugging ends up being a burden on the
queue lock (which is already heavily contended in some cases).
By moving to an explicit plugging scheme we make the API nicer,
get rid of the ->sync_page() vm hack, and allow IO to be queued
up in an on-stack structure and submitted in batches to the
block device queue.

Right now only submission of mergeable IO is lockless; the next
step is making rq allocation less queue lock intensive and
getting some benefits there as well. There's a batching change
in this series that doesn't really belong here, but it is the
start of that work.
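
The lockless merge works because the plug list is private to the
submitting task. A minimal model of that scan (sector arithmetic
only, illustrative names; not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* A plugged request covering sectors [start, start+len). Because the
 * plug list belongs to the current task, no queue lock is needed to
 * walk it and attempt a merge. */
struct preq {
	long start, len;
};

/* Back-merge bio [start, start+len) into req if it is contiguous. */
bool try_back_merge(struct preq *req, long start, long len)
{
	if (req->start + req->len != start)
		return false;
	req->len += len;
	return true;
}

/* Scan the task-private plug list newest-first, as the real
 * check_plug_merge() does, merging where possible. */
bool plug_merge(struct preq *plug, int n, long start, long len)
{
	for (int i = n - 1; i >= 0; i--)
		if (try_back_merge(&plug[i], start, len))
			return true;
	return false;
}
```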

The patch boots and runs on my laptop, but apart from that I make
no guarantees as to the state of it. Particularly the md and dm
changes are quite invasive and need both careful review (and then,
I'm sure, bug fixing) and testing.

Patches are against 2.6.38-rc1 and can also be found in the block
git tree, in the for-2.6.39/stack-plug branch.

I'm traveling, so I'll tend to replies/comments/reviews/bugs on
this patch series when I get back early next week.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* [PATCH 01/10] block: add API for delaying work/request_fn a little bit
  2011-01-22  1:17 [PATCH 0/10] On-stack explicit block queue plugging Jens Axboe
@ 2011-01-22  1:17 ` Jens Axboe
  2011-01-22  1:17 ` [PATCH 02/10] ide-cd: convert to blk_delay_queue() for a short pause Jens Axboe
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-01-22  1:17 UTC (permalink / raw)
  To: jaxboe, linux-kernel; +Cc: hch

Currently we use plugging for that, but as plugging is going away,
we need an alternative mechanism.
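
The semantics the patch below adds can be modeled in a few lines of
userspace C: blk_delay_queue() arms a one-shot timer that re-invokes
the queue's request_fn after the given delay (names and the explicit
"jiffy" tick are illustrative assumptions, not kernel code):

```c
#include <assert.h>

/* Minimal model of delayed queue re-invocation. */
struct queue {
	long run_at;	/* "jiffy" at which to re-run, -1 = idle */
	int runs;	/* times request_fn has fired */
};

static void request_fn(struct queue *q)
{
	q->runs++;
}

/* Like blk_delay_queue(): arm a one-shot re-run of the queue,
 * analogous to schedule_delayed_work(). */
void delay_queue(struct queue *q, long now, long delay)
{
	q->run_at = now + delay;
}

/* Timer tick: fire the callback once when the deadline passes. */
void tick(struct queue *q, long now)
{
	if (q->run_at >= 0 && now >= q->run_at) {
		q->run_at = -1;		/* one-shot, like delayed work */
		request_fn(q);
	}
}
```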

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 block/blk-core.c       |   29 +++++++++++++++++++++++++++++
 include/linux/blkdev.h |    6 ++++++
 2 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2f4002f..960f12c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -207,6 +207,32 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
 }
 EXPORT_SYMBOL(blk_dump_rq_flags);
 
+static void blk_delay_work(struct work_struct *work)
+{
+	struct request_queue *q;
+
+	q = container_of(work, struct request_queue, delay_work.work);
+	spin_lock_irq(q->queue_lock);
+	q->request_fn(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+/**
+ * blk_delay_queue - restart queueing after defined interval
+ * @q:		The &struct request_queue in question
+ * @msecs:	Delay in msecs
+ *
+ * Description:
+ *   Sometimes queueing needs to be postponed for a little while, to allow
+ *   resources to come back. This function will make sure that queueing is
+ *   restarted around the specified time.
+ */
+void blk_delay_queue(struct request_queue *q, unsigned long msecs)
+{
+	schedule_delayed_work(&q->delay_work, msecs_to_jiffies(msecs));
+}
+EXPORT_SYMBOL(blk_delay_queue);
+
 /*
  * "plug" the device if there are no outstanding requests: this will
  * force the transfer to start only after we have put all the requests
@@ -373,6 +399,7 @@ EXPORT_SYMBOL(blk_start_queue);
 void blk_stop_queue(struct request_queue *q)
 {
 	blk_remove_plug(q);
+	cancel_delayed_work(&q->delay_work);
 	queue_flag_set(QUEUE_FLAG_STOPPED, q);
 }
 EXPORT_SYMBOL(blk_stop_queue);
@@ -397,6 +424,7 @@ void blk_sync_queue(struct request_queue *q)
 	del_timer_sync(&q->timeout);
 	cancel_work_sync(&q->unplug_work);
 	throtl_shutdown_timer_wq(q);
+	cancel_delayed_work_sync(&q->delay_work);
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
@@ -542,6 +570,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	INIT_LIST_HEAD(&q->timeout_list);
 	INIT_LIST_HEAD(&q->pending_flushes);
 	INIT_WORK(&q->unplug_work, blk_unplug_work);
+	INIT_DELAYED_WORK(&q->delay_work, blk_delay_work);
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4d18ff3..b4812f9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -294,6 +294,11 @@ struct request_queue
 	unsigned long		unplug_delay;	/* After this many jiffies */
 	struct work_struct	unplug_work;
 
+	/*
+	 * Delayed queue handling
+	 */
+	struct delayed_work	delay_work;
+
 	struct backing_dev_info	backing_dev_info;
 
 	/*
@@ -670,6 +675,7 @@ extern int blk_insert_cloned_request(struct request_queue *q,
 extern void blk_plug_device(struct request_queue *);
 extern void blk_plug_device_unlocked(struct request_queue *);
 extern int blk_remove_plug(struct request_queue *);
+extern void blk_delay_queue(struct request_queue *, unsigned long);
 extern void blk_recount_segments(struct request_queue *, struct bio *);
 extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			  unsigned int, void __user *);
-- 
1.7.3.2.146.gca209



* [PATCH 02/10] ide-cd: convert to blk_delay_queue() for a short pause
  2011-01-22  1:17 [PATCH 0/10] On-stack explicit block queue plugging Jens Axboe
  2011-01-22  1:17 ` [PATCH 01/10] block: add API for delaying work/request_fn a little bit Jens Axboe
@ 2011-01-22  1:17 ` Jens Axboe
  2011-01-22  1:19   ` David Miller
  2011-01-22  1:17 ` [PATCH 03/10] scsi: convert to blk_delay_queue() Jens Axboe
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-01-22  1:17 UTC (permalink / raw)
  To: jaxboe, linux-kernel; +Cc: hch

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 drivers/ide/ide-cd.c |   13 ++-----------
 1 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 0c73fe3..7ce9caf 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -258,17 +258,10 @@ static int ide_cd_breathe(ide_drive_t *drive, struct request *rq)
 	if (time_after(jiffies, info->write_timeout))
 		return 0;
 	else {
-		struct request_queue *q = drive->queue;
-		unsigned long flags;
-
 		/*
-		 * take a breather relying on the unplug timer to kick us again
+		 * take a breather
 		 */
-
-		spin_lock_irqsave(q->queue_lock, flags);
-		blk_plug_device(q);
-		spin_unlock_irqrestore(q->queue_lock, flags);
-
+		blk_delay_queue(drive->queue, 1);
 		return 1;
 	}
 }
@@ -1514,8 +1507,6 @@ static int ide_cdrom_setup(ide_drive_t *drive)
 	blk_queue_dma_alignment(q, 31);
 	blk_queue_update_dma_pad(q, 15);
 
-	q->unplug_delay = max((1 * HZ) / 1000, 1);
-
 	drive->dev_flags |= IDE_DFLAG_MEDIA_CHANGED;
 	drive->atapi_flags = IDE_AFLAG_NO_EJECT | ide_cd_flags(id);
 
-- 
1.7.3.2.146.gca209



* [PATCH 03/10] scsi: convert to blk_delay_queue()
  2011-01-22  1:17 [PATCH 0/10] On-stack explicit block queue plugging Jens Axboe
  2011-01-22  1:17 ` [PATCH 01/10] block: add API for delaying work/request_fn a little bit Jens Axboe
  2011-01-22  1:17 ` [PATCH 02/10] ide-cd: convert to blk_delay_queue() for a short pause Jens Axboe
@ 2011-01-22  1:17 ` Jens Axboe
  2011-01-22  1:17 ` [PATCH 04/10] block: initial patch for on-stack per-task plugging Jens Axboe
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-01-22  1:17 UTC (permalink / raw)
  To: jaxboe, linux-kernel; +Cc: hch

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 drivers/scsi/scsi_lib.c |   44 +++++++++++++++++++-------------------------
 1 files changed, 19 insertions(+), 25 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9045c52..5a0ae7a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -67,6 +67,13 @@ static struct scsi_host_sg_pool scsi_sg_pools[] = {
 
 struct kmem_cache *scsi_sdb_cache;
 
+/*
+ * When to reinvoke queueing after a resource shortage. It's 3 msecs to
+ * match the behaviour of the previous unplug mechanism; experimentation
+ * may prove this needs changing.
+ */
+#define SCSI_QUEUE_DELAY	3
+
 static void scsi_run_queue(struct request_queue *q);
 
 /*
@@ -149,14 +156,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
 	/*
 	 * Requeue this command.  It will go before all other commands
 	 * that are already in the queue.
-	 *
-	 * NOTE: there is magic here about the way the queue is plugged if
-	 * we have no outstanding commands.
-	 * 
-	 * Although we *don't* plug the queue, we call the request
-	 * function.  The SCSI request function detects the blocked condition
-	 * and plugs the queue appropriately.
-         */
+	 */
 	spin_lock_irqsave(q->queue_lock, flags);
 	blk_requeue_request(q, cmd->request);
 	spin_unlock_irqrestore(q->queue_lock, flags);
@@ -1194,11 +1194,11 @@ int scsi_prep_return(struct request_queue *q, struct request *req, int ret)
 	case BLKPREP_DEFER:
 		/*
 		 * If we defer, the blk_peek_request() returns NULL, but the
-		 * queue must be restarted, so we plug here if no returning
-		 * command will automatically do that.
+		 * queue must be restarted, so we schedule a callback to happen
+		 * shortly.
 		 */
 		if (sdev->device_busy == 0)
-			blk_plug_device(q);
+			blk_delay_queue(q, SCSI_QUEUE_DELAY);
 		break;
 	default:
 		req->cmd_flags |= REQ_DONTPREP;
@@ -1237,7 +1237,7 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 				   sdev_printk(KERN_INFO, sdev,
 				   "unblocking device at zero depth\n"));
 		} else {
-			blk_plug_device(q);
+			blk_delay_queue(q, SCSI_QUEUE_DELAY);
 			return 0;
 		}
 	}
@@ -1467,7 +1467,7 @@ static void scsi_request_fn(struct request_queue *q)
 	 * the host is no longer able to accept any more requests.
 	 */
 	shost = sdev->host;
-	while (!blk_queue_plugged(q)) {
+	for (;;) {
 		int rtn;
 		/*
 		 * get next queueable request.  We do this early to make sure
@@ -1546,15 +1546,8 @@ static void scsi_request_fn(struct request_queue *q)
 		 */
 		rtn = scsi_dispatch_cmd(cmd);
 		spin_lock_irq(q->queue_lock);
-		if(rtn) {
-			/* we're refusing the command; because of
-			 * the way locks get dropped, we need to 
-			 * check here if plugging is required */
-			if(sdev->device_busy == 0)
-				blk_plug_device(q);
-
-			break;
-		}
+		if (rtn)
+			goto out_delay;
 	}
 
 	goto out;
@@ -1573,9 +1566,10 @@ static void scsi_request_fn(struct request_queue *q)
 	spin_lock_irq(q->queue_lock);
 	blk_requeue_request(q, req);
 	sdev->device_busy--;
-	if(sdev->device_busy == 0)
-		blk_plug_device(q);
- out:
+out_delay:
+	if (sdev->device_busy == 0)
+		blk_delay_queue(q, SCSI_QUEUE_DELAY);
+out:
 	/* must be careful here...if we trigger the ->remove() function
 	 * we cannot be holding the q lock */
 	spin_unlock_irq(q->queue_lock);
-- 
1.7.3.2.146.gca209



* [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-01-22  1:17 [PATCH 0/10] On-stack explicit block queue plugging Jens Axboe
                   ` (2 preceding siblings ...)
  2011-01-22  1:17 ` [PATCH 03/10] scsi: convert to blk_delay_queue() Jens Axboe
@ 2011-01-22  1:17 ` Jens Axboe
  2011-01-24 19:36   ` Jeff Moyer
                     ` (2 more replies)
  2011-01-22  1:17 ` [PATCH 05/10] block: remove per-queue plugging Jens Axboe
                   ` (5 subsequent siblings)
  9 siblings, 3 replies; 152+ messages in thread
From: Jens Axboe @ 2011-01-22  1:17 UTC (permalink / raw)
  To: jaxboe, linux-kernel; +Cc: hch

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
 block/elevator.c          |    6 +-
 include/linux/blk_types.h |    2 +
 include/linux/blkdev.h    |   30 ++++
 include/linux/elevator.h  |    1 +
 include/linux/sched.h     |    6 +
 kernel/exit.c             |    1 +
 kernel/fork.c             |    3 +
 kernel/sched.c            |   11 ++-
 9 files changed, 317 insertions(+), 100 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 960f12c..42dbfcc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -27,6 +27,7 @@
 #include <linux/writeback.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/fault-inject.h>
+#include <linux/list_sort.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
@@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
 
 	q = container_of(work, struct request_queue, delay_work.work);
 	spin_lock_irq(q->queue_lock);
-	q->request_fn(q);
+	__blk_run_queue(q);
 	spin_unlock_irq(q->queue_lock);
 }
 
@@ -694,6 +695,8 @@ int blk_get_queue(struct request_queue *q)
 
 static inline void blk_free_request(struct request_queue *q, struct request *rq)
 {
+	BUG_ON(rq->cmd_flags & REQ_ON_PLUG);
+
 	if (rq->cmd_flags & REQ_ELVPRIV)
 		elv_put_request(q, rq);
 	mempool_free(rq, q->rq.rq_pool);
@@ -1038,6 +1041,13 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
 }
 EXPORT_SYMBOL(blk_requeue_request);
 
+static void add_acct_request(struct request_queue *q, struct request *rq,
+			     int where)
+{
+	drive_stat_acct(rq, 1);
+	__elv_add_request(q, rq, where, 0);
+}
+
 /**
  * blk_insert_request - insert a special request into a request queue
  * @q:		request queue where request should be inserted
@@ -1080,8 +1090,7 @@ void blk_insert_request(struct request_queue *q, struct request *rq,
 	if (blk_rq_tagged(rq))
 		blk_queue_end_tag(q, rq);
 
-	drive_stat_acct(rq, 1);
-	__elv_add_request(q, rq, where, 0);
+	add_acct_request(q, rq, where);
 	__blk_run_queue(q);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
@@ -1202,6 +1211,113 @@ void blk_add_request_payload(struct request *rq, struct page *page,
 }
 EXPORT_SYMBOL_GPL(blk_add_request_payload);
 
+static bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
+				   struct bio *bio)
+{
+	const int ff = bio->bi_rw & REQ_FAILFAST_MASK;
+
+	/*
+	 * Debug stuff, kill later
+	 */
+	if (!rq_mergeable(req)) {
+		blk_dump_rq_flags(req, "back");
+		return false;
+	}
+
+	if (!ll_back_merge_fn(q, req, bio))
+		return false;
+
+	trace_block_bio_backmerge(q, bio);
+
+	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
+		blk_rq_set_mixed_merge(req);
+
+	req->biotail->bi_next = bio;
+	req->biotail = bio;
+	req->__data_len += bio->bi_size;
+	req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
+
+	drive_stat_acct(req, 0);
+	return true;
+}
+
+static bool bio_attempt_front_merge(struct request_queue *q,
+				    struct request *req, struct bio *bio)
+{
+	const int ff = bio->bi_rw & REQ_FAILFAST_MASK;
+	sector_t sector;
+
+	/*
+	 * Debug stuff, kill later
+	 */
+	if (!rq_mergeable(req)) {
+		blk_dump_rq_flags(req, "front");
+		return false;
+	}
+
+	if (!ll_front_merge_fn(q, req, bio))
+		return false;
+
+	trace_block_bio_frontmerge(q, bio);
+
+	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
+		blk_rq_set_mixed_merge(req);
+
+	sector = bio->bi_sector;
+
+	bio->bi_next = req->bio;
+	req->bio = bio;
+
+	/*
+	 * may not be valid. if the low level driver said
+	 * it didn't need a bounce buffer then it better
+	 * not touch req->buffer either...
+	 */
+	req->buffer = bio_data(bio);
+	req->__sector = bio->bi_sector;
+	req->__data_len += bio->bi_size;
+	req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
+
+	drive_stat_acct(req, 0);
+	return true;
+}
+
+/*
+ * Attempts to merge with the plugged list in the current process. Returns
+ * true if merge was successful, otherwise false.
+ */
+static bool check_plug_merge(struct task_struct *tsk, struct request_queue *q,
+			     struct bio *bio)
+{
+	struct blk_plug *plug;
+	struct request *rq;
+	bool ret = false;
+
+	plug = tsk->plug;
+	if (!plug)
+		goto out;
+
+	list_for_each_entry_reverse(rq, &plug->list, queuelist) {
+		int el_ret;
+
+		if (rq->q != q)
+			continue;
+
+		el_ret = elv_try_merge(rq, bio);
+		if (el_ret == ELEVATOR_BACK_MERGE) {
+			ret = bio_attempt_back_merge(q, rq, bio);
+			if (ret)
+				break;
+		} else if (el_ret == ELEVATOR_FRONT_MERGE) {
+			ret = bio_attempt_front_merge(q, rq, bio);
+			if (ret)
+				break;
+		}
+	}
+out:
+	return ret;
+}
+
 void init_request_from_bio(struct request *req, struct bio *bio)
 {
 	req->cpu = bio->bi_comp_cpu;
@@ -1217,26 +1333,12 @@ void init_request_from_bio(struct request *req, struct bio *bio)
 	blk_rq_bio_prep(req->q, req, bio);
 }
 
-/*
- * Only disabling plugging for non-rotational devices if it does tagging
- * as well, otherwise we do need the proper merging
- */
-static inline bool queue_should_plug(struct request_queue *q)
-{
-	return !(blk_queue_nonrot(q) && blk_queue_tagged(q));
-}
-
 static int __make_request(struct request_queue *q, struct bio *bio)
 {
-	struct request *req;
-	int el_ret;
-	unsigned int bytes = bio->bi_size;
-	const unsigned short prio = bio_prio(bio);
 	const bool sync = !!(bio->bi_rw & REQ_SYNC);
-	const bool unplug = !!(bio->bi_rw & REQ_UNPLUG);
-	const unsigned long ff = bio->bi_rw & REQ_FAILFAST_MASK;
-	int where = ELEVATOR_INSERT_SORT;
-	int rw_flags;
+	struct blk_plug *plug;
+	int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
+	struct request *req;
 
 	/*
 	 * low level driver can indicate that it wants pages above a
@@ -1245,78 +1347,36 @@ static int __make_request(struct request_queue *q, struct bio *bio)
 	 */
 	blk_queue_bounce(q, &bio);
 
-	spin_lock_irq(q->queue_lock);
-
 	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
+		spin_lock_irq(q->queue_lock);
 		where = ELEVATOR_INSERT_FRONT;
 		goto get_rq;
 	}
 
-	if (elv_queue_empty(q))
-		goto get_rq;
-
-	el_ret = elv_merge(q, &req, bio);
-	switch (el_ret) {
-	case ELEVATOR_BACK_MERGE:
-		BUG_ON(!rq_mergeable(req));
-
-		if (!ll_back_merge_fn(q, req, bio))
-			break;
-
-		trace_block_bio_backmerge(q, bio);
-
-		if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
-			blk_rq_set_mixed_merge(req);
-
-		req->biotail->bi_next = bio;
-		req->biotail = bio;
-		req->__data_len += bytes;
-		req->ioprio = ioprio_best(req->ioprio, prio);
-		if (!blk_rq_cpu_valid(req))
-			req->cpu = bio->bi_comp_cpu;
-		drive_stat_acct(req, 0);
-		elv_bio_merged(q, req, bio);
-		if (!attempt_back_merge(q, req))
-			elv_merged_request(q, req, el_ret);
+	/*
+	 * Check if we can merge with the plugged list before grabbing
+	 * any locks.
+	 */
+	if (check_plug_merge(current, q, bio))
 		goto out;
 
-	case ELEVATOR_FRONT_MERGE:
-		BUG_ON(!rq_mergeable(req));
-
-		if (!ll_front_merge_fn(q, req, bio))
-			break;
-
-		trace_block_bio_frontmerge(q, bio);
+	spin_lock_irq(q->queue_lock);
 
-		if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff) {
-			blk_rq_set_mixed_merge(req);
-			req->cmd_flags &= ~REQ_FAILFAST_MASK;
-			req->cmd_flags |= ff;
+	el_ret = elv_merge(q, &req, bio);
+	if (el_ret == ELEVATOR_BACK_MERGE) {
+		BUG_ON(req->cmd_flags & REQ_ON_PLUG);
+		if (bio_attempt_back_merge(q, req, bio)) {
+			if (!attempt_back_merge(q, req))
+				elv_merged_request(q, req, el_ret);
+			goto out_unlock;
+		}
+	} else if (el_ret == ELEVATOR_FRONT_MERGE) {
+		BUG_ON(req->cmd_flags & REQ_ON_PLUG);
+		if (bio_attempt_front_merge(q, req, bio)) {
+			if (!attempt_front_merge(q, req))
+				elv_merged_request(q, req, el_ret);
+			goto out_unlock;
 		}
-
-		bio->bi_next = req->bio;
-		req->bio = bio;
-
-		/*
-		 * may not be valid. if the low level driver said
-		 * it didn't need a bounce buffer then it better
-		 * not touch req->buffer either...
-		 */
-		req->buffer = bio_data(bio);
-		req->__sector = bio->bi_sector;
-		req->__data_len += bytes;
-		req->ioprio = ioprio_best(req->ioprio, prio);
-		if (!blk_rq_cpu_valid(req))
-			req->cpu = bio->bi_comp_cpu;
-		drive_stat_acct(req, 0);
-		elv_bio_merged(q, req, bio);
-		if (!attempt_front_merge(q, req))
-			elv_merged_request(q, req, el_ret);
-		goto out;
-
-	/* ELV_NO_MERGE: elevator says don't/can't merge. */
-	default:
-		;
 	}
 
 get_rq:
@@ -1343,20 +1403,35 @@ get_rq:
 	 */
 	init_request_from_bio(req, bio);
 
-	spin_lock_irq(q->queue_lock);
 	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
-	    bio_flagged(bio, BIO_CPU_AFFINE))
-		req->cpu = blk_cpu_to_group(smp_processor_id());
-	if (queue_should_plug(q) && elv_queue_empty(q))
-		blk_plug_device(q);
-
-	/* insert the request into the elevator */
-	drive_stat_acct(req, 1);
-	__elv_add_request(q, req, where, 0);
+	    bio_flagged(bio, BIO_CPU_AFFINE)) {
+		req->cpu = blk_cpu_to_group(get_cpu());
+		put_cpu();
+	}
+
+	plug = current->plug;
+	if (plug && !sync) {
+		if (!plug->should_sort && !list_empty(&plug->list)) {
+			struct request *__rq;
+
+			__rq = list_entry_rq(plug->list.prev);
+			if (__rq->q != q)
+				plug->should_sort = 1;
+		}
+		/*
+		 * Debug flag, kill later
+		 */
+		req->cmd_flags |= REQ_ON_PLUG;
+		list_add_tail(&req->queuelist, &plug->list);
+		drive_stat_acct(req, 1);
+	} else {
+		spin_lock_irq(q->queue_lock);
+		add_acct_request(q, req, where);
+		__blk_run_queue(q);
+out_unlock:
+		spin_unlock_irq(q->queue_lock);
+	}
 out:
-	if (unplug || !queue_should_plug(q))
-		__generic_unplug_device(q);
-	spin_unlock_irq(q->queue_lock);
 	return 0;
 }
 
@@ -1759,9 +1834,7 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 	 */
 	BUG_ON(blk_queued_rq(rq));
 
-	drive_stat_acct(rq, 1);
-	__elv_add_request(q, rq, ELEVATOR_INSERT_BACK, 0);
-
+	add_acct_request(q, rq, ELEVATOR_INSERT_BACK);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	return 0;
@@ -2646,6 +2719,94 @@ int kblockd_schedule_delayed_work(struct request_queue *q,
 }
 EXPORT_SYMBOL(kblockd_schedule_delayed_work);
 
+#define PLUG_MAGIC	0x91827364
+
+void blk_start_plug(struct blk_plug *plug)
+{
+	struct task_struct *tsk = current;
+
+	plug->magic = PLUG_MAGIC;
+	INIT_LIST_HEAD(&plug->list);
+	plug->should_sort = 0;
+
+	/*
+	 * Store ordering should not be needed here, since a potential
+	 * preempt will imply a full memory barrier
+	 */
+	tsk->plug = plug;
+}
+EXPORT_SYMBOL(blk_start_plug);
+
+static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
+{
+	struct request *rqa = container_of(a, struct request, queuelist);
+	struct request *rqb = container_of(b, struct request, queuelist);
+
+	return !(rqa->q == rqb->q);
+}
+
+static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
+{
+	struct request_queue *q = NULL;
+	unsigned long flags;
+	struct request *rq;
+
+	local_irq_save(flags);
+
+	if (!list_empty(&plug->list) && plug != tsk->plug)
+		BUG();
+	if (plug == tsk->plug)
+		tsk->plug = NULL;
+
+	BUG_ON(plug->magic != PLUG_MAGIC);
+
+	if (plug->should_sort)
+		list_sort(NULL, &plug->list, plug_rq_cmp);
+
+	while (!list_empty(&plug->list)) {
+		rq = list_entry_rq(plug->list.next);
+		list_del_init(&rq->queuelist);
+		BUG_ON(!(rq->cmd_flags & REQ_ON_PLUG));
+		BUG_ON(!rq->q);
+		if (rq->q != q) {
+			if (q) {
+				__blk_run_queue(q);
+				spin_unlock(q->queue_lock);
+			}
+			q = rq->q;
+			spin_lock(q->queue_lock);
+		}
+		rq->cmd_flags &= ~REQ_ON_PLUG;
+
+		/*
+		 * rq is already accounted, so use raw insert
+		 */
+		__elv_add_request(q, rq, ELEVATOR_INSERT_SORT, 0);
+	}
+
+	if (q) {
+		__blk_run_queue(q);
+		spin_unlock(q->queue_lock);
+	}
+
+	BUG_ON(!list_empty(&plug->list));
+	local_irq_restore(flags);
+}
+
+void blk_finish_plug(struct blk_plug *plug)
+{
+	if (plug)
+		__blk_finish_plug(current, plug);
+}
+EXPORT_SYMBOL(blk_finish_plug);
+
+void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
+{
+	__blk_finish_plug(tsk, plug);
+	tsk->plug = plug;
+}
+EXPORT_SYMBOL(__blk_flush_plug);
+
 int __init blk_dev_init(void)
 {
 	BUILD_BUG_ON(__REQ_NR_BITS > 8 *
diff --git a/block/elevator.c b/block/elevator.c
index 2569512..a9fe237 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -113,7 +113,7 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 }
 EXPORT_SYMBOL(elv_rq_merge_ok);
 
-static inline int elv_try_merge(struct request *__rq, struct bio *bio)
+int elv_try_merge(struct request *__rq, struct bio *bio)
 {
 	int ret = ELEVATOR_NO_MERGE;
 
@@ -421,6 +421,8 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
 	struct list_head *entry;
 	int stop_flags;
 
+	BUG_ON(rq->cmd_flags & REQ_ON_PLUG);
+
 	if (q->last_merge == rq)
 		q->last_merge = NULL;
 
@@ -691,6 +693,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 void __elv_add_request(struct request_queue *q, struct request *rq, int where,
 		       int plug)
 {
+	BUG_ON(rq->cmd_flags & REQ_ON_PLUG);
+
 	if (rq->cmd_flags & REQ_SOFTBARRIER) {
 		/* barriers are scheduling boundary, update end_sector */
 		if (rq->cmd_type == REQ_TYPE_FS ||
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 46ad519..a755762 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -151,6 +151,7 @@ enum rq_flag_bits {
 	__REQ_IO_STAT,		/* account I/O stat */
 	__REQ_MIXED_MERGE,	/* merge of different types, fail separately */
 	__REQ_SECURE,		/* secure discard (used with __REQ_DISCARD) */
+	__REQ_ON_PLUG,		/* on plug list */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -191,5 +192,6 @@ enum rq_flag_bits {
 #define REQ_IO_STAT		(1 << __REQ_IO_STAT)
 #define REQ_MIXED_MERGE		(1 << __REQ_MIXED_MERGE)
 #define REQ_SECURE		(1 << __REQ_SECURE)
+#define REQ_ON_PLUG		(1 << __REQ_ON_PLUG)
 
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b4812f9..3d246a9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -864,6 +864,24 @@ struct request_queue *blk_alloc_queue(gfp_t);
 struct request_queue *blk_alloc_queue_node(gfp_t, int);
 extern void blk_put_queue(struct request_queue *);
 
+struct blk_plug {
+	unsigned long magic;
+	struct list_head list;
+	unsigned int should_sort;
+};
+
+extern void blk_start_plug(struct blk_plug *);
+extern void blk_finish_plug(struct blk_plug *);
+extern void __blk_flush_plug(struct task_struct *, struct blk_plug *);
+
+static inline void blk_flush_plug(struct task_struct *tsk)
+{
+	struct blk_plug *plug = tsk->plug;
+
+	if (unlikely(plug))
+		__blk_flush_plug(tsk, plug);
+}
+
 /*
  * tag stuff
  */
@@ -1287,6 +1305,18 @@ static inline long nr_blockdev_pages(void)
 	return 0;
 }
 
+static inline void blk_start_plug(struct list_head *list)
+{
+}
+
+static inline void blk_finish_plug(struct list_head *list)
+{
+}
+
+static inline void blk_flush_plug(struct task_struct *tsk)
+{
+}
+
 #endif /* CONFIG_BLOCK */
 
 #endif
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 4d85797..ac2b7a0 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -105,6 +105,7 @@ extern void elv_add_request(struct request_queue *, struct request *, int, int);
 extern void __elv_add_request(struct request_queue *, struct request *, int, int);
 extern void elv_insert(struct request_queue *, struct request *, int);
 extern int elv_merge(struct request_queue *, struct request **, struct bio *);
+extern int elv_try_merge(struct request *, struct bio *);
 extern void elv_merge_requests(struct request_queue *, struct request *,
 			       struct request *);
 extern void elv_merged_request(struct request_queue *, struct request *, int);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d747f94..ed74f1d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -99,6 +99,7 @@ struct robust_list_head;
 struct bio_list;
 struct fs_struct;
 struct perf_event_context;
+struct blk_plug;
 
 /*
  * List of flags we want to share for kernel threads,
@@ -1429,6 +1430,11 @@ struct task_struct {
 /* stacked block device info */
 	struct bio_list *bio_list;
 
+#ifdef CONFIG_BLOCK
+/* stack plugging */
+	struct blk_plug *plug;
+#endif
+
 /* VM state */
 	struct reclaim_state *reclaim_state;
 
diff --git a/kernel/exit.c b/kernel/exit.c
index f9a45eb..360f0f3 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -908,6 +908,7 @@ NORET_TYPE void do_exit(long code)
 	profile_task_exit(tsk);
 
 	WARN_ON(atomic_read(&tsk->fs_excl));
+	WARN_ON(tsk->plug && !list_empty(&tsk->plug->list));
 
 	if (unlikely(in_interrupt()))
 		panic("Aiee, killing interrupt handler!");
diff --git a/kernel/fork.c b/kernel/fork.c
index 25e4291..027c80e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1204,6 +1204,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 * Clear TID on mm_release()?
 	 */
 	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr: NULL;
+#ifdef CONFIG_BLOCK
+	p->plug = NULL;
+#endif
 #ifdef CONFIG_FUTEX
 	p->robust_list = NULL;
 #ifdef CONFIG_COMPAT
diff --git a/kernel/sched.c b/kernel/sched.c
index ea3e5ef..0d15f78 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3947,7 +3947,6 @@ need_resched:
 
 	release_kernel_lock(prev);
 need_resched_nonpreemptible:
-
 	schedule_debug(prev);
 
 	if (sched_feat(HRTICK))
@@ -3973,6 +3972,14 @@ need_resched_nonpreemptible:
 				if (to_wakeup)
 					try_to_wake_up_local(to_wakeup);
 			}
+			/*
+			 * If this task has IO plugged, make sure it
+			 * gets flushed out to the devices before we go
+			 * to sleep
+			 */
+			blk_flush_plug(prev);
+			BUG_ON(prev->plug && !list_empty(&prev->plug->list));
+
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 		}
 		switch_count = &prev->nvcsw;
@@ -5332,6 +5339,7 @@ void __sched io_schedule(void)
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
+	blk_flush_plug(current);
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
@@ -5347,6 +5355,7 @@ long __sched io_schedule_timeout(long timeout)
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
+	blk_flush_plug(current);
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;
-- 
1.7.3.2.146.gca209
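
Why __blk_finish_plug() bothers sorting the list when should_sort is
set: requests for the same queue become adjacent, so the flush loop
takes each queue lock once instead of once per queue switch. A
userspace sketch of that effect (arrays instead of list_sort(), purely
illustrative):

```c
#include <assert.h>
#include <stdlib.h>

struct req {
	int qid;	/* stands in for the request's queue pointer */
};

/* Comparator in the spirit of plug_rq_cmp(): group by queue. */
static int cmp_by_queue(const void *a, const void *b)
{
	return ((const struct req *)a)->qid - ((const struct req *)b)->qid;
}

/* Walk the plug list the way the flush loop does, re-locking whenever
 * the queue changes; return the number of lock acquisitions. */
int flush_lock_rounds(const struct req *reqs, int n)
{
	int rounds = 0, cur = -1;

	for (int i = 0; i < n; i++) {
		if (reqs[i].qid != cur) {
			cur = reqs[i].qid;	/* queue switch: relock */
			rounds++;
		}
	}
	return rounds;
}
```

Interleaved requests for two queues cost a lock round per element;
after sorting, one round per distinct queue.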



* [PATCH 05/10] block: remove per-queue plugging
  2011-01-22  1:17 [PATCH 0/10] On-stack explicit block queue plugging Jens Axboe
                   ` (3 preceding siblings ...)
  2011-01-22  1:17 ` [PATCH 04/10] block: initial patch for on-stack per-task plugging Jens Axboe
@ 2011-01-22  1:17 ` Jens Axboe
  2011-01-22  1:31   ` Nick Piggin
                     ` (2 more replies)
  2011-01-22  1:17 ` [PATCH 06/10] block: kill request allocation batching Jens Axboe
                   ` (4 subsequent siblings)
  9 siblings, 3 replies; 152+ messages in thread
From: Jens Axboe @ 2011-01-22  1:17 UTC (permalink / raw)
  To: jaxboe, linux-kernel; +Cc: hch

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 Documentation/block/biodoc.txt     |    5 -
 block/blk-core.c                   |  173 ++++-------------------------------
 block/blk-exec.c                   |    4 +-
 block/blk-flush.c                  |    3 +-
 block/blk-settings.c               |    8 --
 block/blk-sysfs.c                  |    3 +
 block/blk.h                        |    2 -
 block/cfq-iosched.c                |    8 --
 block/deadline-iosched.c           |    9 --
 block/elevator.c                   |   38 +-------
 block/noop-iosched.c               |    8 --
 drivers/block/cciss.c              |    6 --
 drivers/block/cpqarray.c           |    3 -
 drivers/block/drbd/drbd_actlog.c   |    2 -
 drivers/block/drbd/drbd_bitmap.c   |    1 -
 drivers/block/drbd/drbd_int.h      |   14 ---
 drivers/block/drbd/drbd_main.c     |   33 +-------
 drivers/block/drbd/drbd_receiver.c |   20 +----
 drivers/block/drbd/drbd_req.c      |    4 -
 drivers/block/drbd/drbd_worker.c   |    1 -
 drivers/block/drbd/drbd_wrappers.h |   18 ----
 drivers/block/floppy.c             |    1 -
 drivers/block/loop.c               |   13 ---
 drivers/block/pktcdvd.c            |    2 -
 drivers/block/umem.c               |   16 +---
 drivers/ide/ide-atapi.c            |    3 +-
 drivers/ide/ide-io.c               |    4 -
 drivers/ide/ide-park.c             |    2 +-
 drivers/md/bitmap.c                |    3 +-
 drivers/md/dm-crypt.c              |    9 +--
 drivers/md/dm-kcopyd.c             |   43 +--------
 drivers/md/dm-raid.c               |    2 +-
 drivers/md/dm-raid1.c              |    2 -
 drivers/md/dm-table.c              |   24 -----
 drivers/md/dm.c                    |   33 +------
 drivers/md/linear.c                |   17 ----
 drivers/md/md.c                    |    7 --
 drivers/md/multipath.c             |   31 -------
 drivers/md/raid0.c                 |   16 ----
 drivers/md/raid1.c                 |   82 ++++--------------
 drivers/md/raid10.c                |   86 ++++--------------
 drivers/md/raid5.c                 |   63 ++-----------
 drivers/md/raid5.h                 |    2 +-
 drivers/message/i2o/i2o_block.c    |    6 +-
 drivers/mmc/card/queue.c           |    3 +-
 drivers/s390/block/dasd.c          |    2 +-
 drivers/s390/char/tape_block.c     |    1 -
 drivers/scsi/scsi_transport_fc.c   |    2 +-
 drivers/scsi/scsi_transport_sas.c  |    6 +-
 fs/adfs/inode.c                    |    1 -
 fs/affs/file.c                     |    2 -
 fs/aio.c                           |    4 +-
 fs/befs/linuxvfs.c                 |    1 -
 fs/bfs/file.c                      |    1 -
 fs/block_dev.c                     |    1 -
 fs/btrfs/disk-io.c                 |   79 ----------------
 fs/btrfs/inode.c                   |    1 -
 fs/btrfs/volumes.c                 |   91 +++-----------------
 fs/buffer.c                        |   31 +------
 fs/cifs/file.c                     |   30 ------
 fs/direct-io.c                     |    5 +-
 fs/efs/inode.c                     |    1 -
 fs/exofs/inode.c                   |    1 -
 fs/ext2/inode.c                    |    2 -
 fs/ext3/inode.c                    |    3 -
 fs/ext4/inode.c                    |    4 -
 fs/fat/inode.c                     |    1 -
 fs/freevxfs/vxfs_subr.c            |    1 -
 fs/fuse/inode.c                    |    1 -
 fs/gfs2/aops.c                     |    3 -
 fs/gfs2/meta_io.c                  |    1 -
 fs/hfs/inode.c                     |    2 -
 fs/hfsplus/inode.c                 |    2 -
 fs/hpfs/file.c                     |    1 -
 fs/isofs/inode.c                   |    1 -
 fs/jfs/inode.c                     |    1 -
 fs/jfs/jfs_metapage.c              |    1 -
 fs/logfs/dev_bdev.c                |    2 -
 fs/minix/inode.c                   |    1 -
 fs/nilfs2/btnode.c                 |    6 +-
 fs/nilfs2/gcinode.c                |    1 -
 fs/nilfs2/inode.c                  |    1 -
 fs/nilfs2/mdt.c                    |    9 +--
 fs/nilfs2/page.c                   |    8 +-
 fs/nilfs2/page.h                   |    3 +-
 fs/ntfs/aops.c                     |    4 -
 fs/ntfs/compress.c                 |    3 +-
 fs/ocfs2/aops.c                    |    1 -
 fs/ocfs2/cluster/heartbeat.c       |    4 -
 fs/omfs/file.c                     |    1 -
 fs/qnx4/inode.c                    |    1 -
 fs/reiserfs/inode.c                |    1 -
 fs/sysv/itree.c                    |    1 -
 fs/ubifs/super.c                   |    1 -
 fs/udf/file.c                      |    1 -
 fs/udf/inode.c                     |    1 -
 fs/ufs/inode.c                     |    1 -
 fs/ufs/truncate.c                  |    2 +-
 fs/xfs/linux-2.6/xfs_aops.c        |    1 -
 fs/xfs/linux-2.6/xfs_buf.c         |   13 +--
 include/linux/backing-dev.h        |   16 ----
 include/linux/blkdev.h             |   31 ++-----
 include/linux/buffer_head.h        |    1 -
 include/linux/device-mapper.h      |    5 -
 include/linux/elevator.h           |    7 +-
 include/linux/fs.h                 |    1 -
 include/linux/pagemap.h            |   12 ---
 include/linux/swap.h               |    2 -
 mm/backing-dev.c                   |    6 --
 mm/filemap.c                       |   67 ++-------------
 mm/memory-failure.c                |    6 +-
 mm/nommu.c                         |    4 -
 mm/page-writeback.c                |    2 +-
 mm/readahead.c                     |   12 ---
 mm/shmem.c                         |    1 -
 mm/swap_state.c                    |    5 +-
 mm/swapfile.c                      |   37 --------
 mm/vmscan.c                        |    2 +-
 118 files changed, 153 insertions(+), 1248 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index b9a83dd..2a7b38c 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -963,11 +963,6 @@ elevator_dispatch_fn*		fills the dispatch queue with ready requests.
 
 elevator_add_req_fn*		called to add a new request into the scheduler
 
-elevator_queue_empty_fn		returns true if the merge queue is empty.
-				Drivers shouldn't use this, but rather check
-				if elv_next_request is NULL (without losing the
-				request if one exists!)
-
 elevator_former_req_fn
 elevator_latter_req_fn		These return the request before or after the
 				one specified in disk sort order. Used by the
diff --git a/block/blk-core.c b/block/blk-core.c
index 42dbfcc..7ab6620 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -208,6 +208,19 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
 }
 EXPORT_SYMBOL(blk_dump_rq_flags);
 
+/*
+ * Make sure that plugs that were pending when this function was entered
+ * are now complete and the requests pushed to the queue.
+ */
+static inline void queue_sync_plugs(struct request_queue *q)
+{
+	/*
+	 * If the current process is plugged and has barriers submitted,
+	 * we will livelock if we don't unplug first.
+	 */
+	blk_flush_plug(current);
+}
+
 static void blk_delay_work(struct work_struct *work)
 {
 	struct request_queue *q;
@@ -234,137 +247,6 @@ void blk_delay_queue(struct request_queue *q, unsigned long msecs)
 }
 EXPORT_SYMBOL(blk_delay_queue);
 
-/*
- * "plug" the device if there are no outstanding requests: this will
- * force the transfer to start only after we have put all the requests
- * on the list.
- *
- * This is called with interrupts off and no requests on the queue and
- * with the queue lock held.
- */
-void blk_plug_device(struct request_queue *q)
-{
-	WARN_ON(!irqs_disabled());
-
-	/*
-	 * don't plug a stopped queue, it must be paired with blk_start_queue()
-	 * which will restart the queueing
-	 */
-	if (blk_queue_stopped(q))
-		return;
-
-	if (!queue_flag_test_and_set(QUEUE_FLAG_PLUGGED, q)) {
-		mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
-		trace_block_plug(q);
-	}
-}
-EXPORT_SYMBOL(blk_plug_device);
-
-/**
- * blk_plug_device_unlocked - plug a device without queue lock held
- * @q:    The &struct request_queue to plug
- *
- * Description:
- *   Like @blk_plug_device(), but grabs the queue lock and disables
- *   interrupts.
- **/
-void blk_plug_device_unlocked(struct request_queue *q)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	blk_plug_device(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
-}
-EXPORT_SYMBOL(blk_plug_device_unlocked);
-
-/*
- * remove the queue from the plugged list, if present. called with
- * queue lock held and interrupts disabled.
- */
-int blk_remove_plug(struct request_queue *q)
-{
-	WARN_ON(!irqs_disabled());
-
-	if (!queue_flag_test_and_clear(QUEUE_FLAG_PLUGGED, q))
-		return 0;
-
-	del_timer(&q->unplug_timer);
-	return 1;
-}
-EXPORT_SYMBOL(blk_remove_plug);
-
-/*
- * remove the plug and let it rip..
- */
-void __generic_unplug_device(struct request_queue *q)
-{
-	if (unlikely(blk_queue_stopped(q)))
-		return;
-	if (!blk_remove_plug(q) && !blk_queue_nonrot(q))
-		return;
-
-	q->request_fn(q);
-}
-
-/**
- * generic_unplug_device - fire a request queue
- * @q:    The &struct request_queue in question
- *
- * Description:
- *   Linux uses plugging to build bigger requests queues before letting
- *   the device have at them. If a queue is plugged, the I/O scheduler
- *   is still adding and merging requests on the queue. Once the queue
- *   gets unplugged, the request_fn defined for the queue is invoked and
- *   transfers started.
- **/
-void generic_unplug_device(struct request_queue *q)
-{
-	if (blk_queue_plugged(q)) {
-		spin_lock_irq(q->queue_lock);
-		__generic_unplug_device(q);
-		spin_unlock_irq(q->queue_lock);
-	}
-}
-EXPORT_SYMBOL(generic_unplug_device);
-
-static void blk_backing_dev_unplug(struct backing_dev_info *bdi,
-				   struct page *page)
-{
-	struct request_queue *q = bdi->unplug_io_data;
-
-	blk_unplug(q);
-}
-
-void blk_unplug_work(struct work_struct *work)
-{
-	struct request_queue *q =
-		container_of(work, struct request_queue, unplug_work);
-
-	trace_block_unplug_io(q);
-	q->unplug_fn(q);
-}
-
-void blk_unplug_timeout(unsigned long data)
-{
-	struct request_queue *q = (struct request_queue *)data;
-
-	trace_block_unplug_timer(q);
-	kblockd_schedule_work(q, &q->unplug_work);
-}
-
-void blk_unplug(struct request_queue *q)
-{
-	/*
-	 * devices don't necessarily have an ->unplug_fn defined
-	 */
-	if (q->unplug_fn) {
-		trace_block_unplug_io(q);
-		q->unplug_fn(q);
-	}
-}
-EXPORT_SYMBOL(blk_unplug);
-
 /**
  * blk_start_queue - restart a previously stopped queue
  * @q:    The &struct request_queue in question
@@ -399,7 +281,6 @@ EXPORT_SYMBOL(blk_start_queue);
  **/
 void blk_stop_queue(struct request_queue *q)
 {
-	blk_remove_plug(q);
 	cancel_delayed_work(&q->delay_work);
 	queue_flag_set(QUEUE_FLAG_STOPPED, q);
 }
@@ -421,11 +302,10 @@ EXPORT_SYMBOL(blk_stop_queue);
  */
 void blk_sync_queue(struct request_queue *q)
 {
-	del_timer_sync(&q->unplug_timer);
 	del_timer_sync(&q->timeout);
-	cancel_work_sync(&q->unplug_work);
 	throtl_shutdown_timer_wq(q);
 	cancel_delayed_work_sync(&q->delay_work);
+	queue_sync_plugs(q);
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
@@ -440,14 +320,9 @@ EXPORT_SYMBOL(blk_sync_queue);
  */
 void __blk_run_queue(struct request_queue *q)
 {
-	blk_remove_plug(q);
-
 	if (unlikely(blk_queue_stopped(q)))
 		return;
 
-	if (elv_queue_empty(q))
-		return;
-
 	/*
 	 * Only recurse once to avoid overrunning the stack, let the unplug
 	 * handling reinvoke the handler shortly if we already got there.
@@ -455,10 +330,8 @@ void __blk_run_queue(struct request_queue *q)
 	if (!queue_flag_test_and_set(QUEUE_FLAG_REENTER, q)) {
 		q->request_fn(q);
 		queue_flag_clear(QUEUE_FLAG_REENTER, q);
-	} else {
-		queue_flag_set(QUEUE_FLAG_PLUGGED, q);
-		kblockd_schedule_work(q, &q->unplug_work);
-	}
+	} else
+		queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
 }
 EXPORT_SYMBOL(__blk_run_queue);
 
@@ -545,8 +418,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (!q)
 		return NULL;
 
-	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
-	q->backing_dev_info.unplug_io_data = q;
 	q->backing_dev_info.ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 	q->backing_dev_info.state = 0;
@@ -566,11 +437,9 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 	setup_timer(&q->backing_dev_info.laptop_mode_wb_timer,
 		    laptop_mode_timer_fn, (unsigned long) q);
-	init_timer(&q->unplug_timer);
 	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
 	INIT_LIST_HEAD(&q->timeout_list);
 	INIT_LIST_HEAD(&q->pending_flushes);
-	INIT_WORK(&q->unplug_work, blk_unplug_work);
 	INIT_DELAYED_WORK(&q->delay_work, blk_delay_work);
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
@@ -660,7 +529,6 @@ blk_init_allocated_queue_node(struct request_queue *q, request_fn_proc *rfn,
 	q->request_fn		= rfn;
 	q->prep_rq_fn		= NULL;
 	q->unprep_rq_fn		= NULL;
-	q->unplug_fn		= generic_unplug_device;
 	q->queue_flags		= QUEUE_FLAG_DEFAULT;
 	q->queue_lock		= lock;
 
@@ -897,8 +765,8 @@ out:
 }
 
 /*
- * No available requests for this queue, unplug the device and wait for some
- * requests to become available.
+ * No available requests for this queue, wait for some requests to become
+ * available.
  *
  * Called with q->queue_lock held, and returns with it unlocked.
  */
@@ -919,7 +787,6 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 
 		trace_block_sleeprq(q, bio, rw_flags & 1);
 
-		__generic_unplug_device(q);
 		spin_unlock_irq(q->queue_lock);
 		io_schedule();
 
@@ -1045,7 +912,7 @@ static void add_acct_request(struct request_queue *q, struct request *rq,
 			     int where)
 {
 	drive_stat_acct(rq, 1);
-	__elv_add_request(q, rq, where, 0);
+	__elv_add_request(q, rq, where);
 }
 
 /**
@@ -2781,7 +2648,7 @@ static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
 		/*
 		 * rq is already accounted, so use raw insert
 		 */
-		__elv_add_request(q, rq, ELEVATOR_INSERT_SORT, 0);
+		__elv_add_request(q, rq, ELEVATOR_INSERT_SORT);
 	}
 
 	if (q) {
diff --git a/block/blk-exec.c b/block/blk-exec.c
index cf1456a..81e3181 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -54,8 +54,8 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 	rq->end_io = done;
 	WARN_ON(irqs_disabled());
 	spin_lock_irq(q->queue_lock);
-	__elv_add_request(q, rq, where, 1);
-	__generic_unplug_device(q);
+	__elv_add_request(q, rq, where);
+	__blk_run_queue(q);
 	/* the queue is stopped so it won't be plugged+unplugged */
 	if (rq->cmd_type == REQ_TYPE_PM_RESUME)
 		q->request_fn(q);
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 54b123d..c0a07aa 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -59,7 +59,6 @@ static struct request *blk_flush_complete_seq(struct request_queue *q,
 static void blk_flush_complete_seq_end_io(struct request_queue *q,
 					  unsigned seq, int error)
 {
-	bool was_empty = elv_queue_empty(q);
 	struct request *next_rq;
 
 	next_rq = blk_flush_complete_seq(q, seq, error);
@@ -68,7 +67,7 @@ static void blk_flush_complete_seq_end_io(struct request_queue *q,
 	 * Moving a request silently to empty queue_head may stall the
 	 * queue.  Kick the queue in those cases.
 	 */
-	if (was_empty && next_rq)
+	if (next_rq)
 		__blk_run_queue(q);
 }
 
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 36c8c1f..c8d6892 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -164,14 +164,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	blk_queue_congestion_threshold(q);
 	q->nr_batching = BLK_BATCH_REQ;
 
-	q->unplug_thresh = 4;		/* hmm */
-	q->unplug_delay = msecs_to_jiffies(3);	/* 3 milliseconds */
-	if (q->unplug_delay == 0)
-		q->unplug_delay = 1;
-
-	q->unplug_timer.function = blk_unplug_timeout;
-	q->unplug_timer.data = (unsigned long)q;
-
 	blk_set_default_limits(&q->limits);
 	blk_queue_max_hw_sectors(q, BLK_SAFE_MAX_SECTORS);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 41fb691..8d3e40c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -481,6 +481,9 @@ static void blk_release_queue(struct kobject *kobj)
 
 	blk_trace_shutdown(q);
 
+#if 0
+	cleanup_qrcu_struct(&q->qrcu);
+#endif
 	bdi_destroy(&q->backing_dev_info);
 	kmem_cache_free(blk_requestq_cachep, q);
 }
diff --git a/block/blk.h b/block/blk.h
index 2db8f32..2c3d2e7 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -18,8 +18,6 @@ int blk_rq_append_bio(struct request_queue *q, struct request *rq,
 void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 
-void blk_unplug_work(struct work_struct *work);
-void blk_unplug_timeout(unsigned long data);
 void blk_rq_timed_out_timer(unsigned long data);
 void blk_delete_timer(struct request *);
 void blk_add_timer(struct request *);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 501ffdf..0a5f731 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -501,13 +501,6 @@ static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
 	}
 }
 
-static int cfq_queue_empty(struct request_queue *q)
-{
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	return !cfqd->rq_queued;
-}
-
 /*
  * Scale schedule slice based on io priority. Use the sync time slice only
  * if a queue is marked sync and has sync io queued. A sync queue with async
@@ -4092,7 +4085,6 @@ static struct elevator_type iosched_cfq = {
 		.elevator_add_req_fn =		cfq_insert_request,
 		.elevator_activate_req_fn =	cfq_activate_request,
 		.elevator_deactivate_req_fn =	cfq_deactivate_request,
-		.elevator_queue_empty_fn =	cfq_queue_empty,
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index b547cbc..5139c0e 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -326,14 +326,6 @@ dispatch_request:
 	return 1;
 }
 
-static int deadline_queue_empty(struct request_queue *q)
-{
-	struct deadline_data *dd = q->elevator->elevator_data;
-
-	return list_empty(&dd->fifo_list[WRITE])
-		&& list_empty(&dd->fifo_list[READ]);
-}
-
 static void deadline_exit_queue(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
@@ -445,7 +437,6 @@ static struct elevator_type iosched_deadline = {
 		.elevator_merge_req_fn =	deadline_merged_requests,
 		.elevator_dispatch_fn =		deadline_dispatch_requests,
 		.elevator_add_req_fn =		deadline_add_request,
-		.elevator_queue_empty_fn =	deadline_queue_empty,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
 		.elevator_init_fn =		deadline_init_queue,
diff --git a/block/elevator.c b/block/elevator.c
index a9fe237..d5d17a4 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -619,8 +619,6 @@ void elv_quiesce_end(struct request_queue *q)
 
 void elv_insert(struct request_queue *q, struct request *rq, int where)
 {
-	int unplug_it = 1;
-
 	trace_block_rq_insert(q, rq);
 
 	rq->q = q;
@@ -632,8 +630,6 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 		 * don't force unplug of the queue for that case.
 		 * Clear unplug_it and fall through.
 		 */
-		unplug_it = 0;
-
 	case ELEVATOR_INSERT_FRONT:
 		rq->cmd_flags |= REQ_SOFTBARRIER;
 		list_add(&rq->queuelist, &q->queue_head);
@@ -674,24 +670,14 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 		 */
 		q->elevator->ops->elevator_add_req_fn(q, rq);
 		break;
-
 	default:
 		printk(KERN_ERR "%s: bad insertion point %d\n",
 		       __func__, where);
 		BUG();
 	}
-
-	if (unplug_it && blk_queue_plugged(q)) {
-		int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
-				- queue_in_flight(q);
-
-		if (nrq >= q->unplug_thresh)
-			__generic_unplug_device(q);
-	}
 }
 
-void __elv_add_request(struct request_queue *q, struct request *rq, int where,
-		       int plug)
+void __elv_add_request(struct request_queue *q, struct request *rq, int where)
 {
 	BUG_ON(rq->cmd_flags & REQ_ON_PLUG);
 
@@ -706,38 +692,20 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where,
 		    where == ELEVATOR_INSERT_SORT)
 		where = ELEVATOR_INSERT_BACK;
 
-	if (plug)
-		blk_plug_device(q);
-
 	elv_insert(q, rq, where);
 }
 EXPORT_SYMBOL(__elv_add_request);
 
-void elv_add_request(struct request_queue *q, struct request *rq, int where,
-		     int plug)
+void elv_add_request(struct request_queue *q, struct request *rq, int where)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(q->queue_lock, flags);
-	__elv_add_request(q, rq, where, plug);
+	__elv_add_request(q, rq, where);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL(elv_add_request);
 
-int elv_queue_empty(struct request_queue *q)
-{
-	struct elevator_queue *e = q->elevator;
-
-	if (!list_empty(&q->queue_head))
-		return 0;
-
-	if (e->ops->elevator_queue_empty_fn)
-		return e->ops->elevator_queue_empty_fn(q);
-
-	return 1;
-}
-EXPORT_SYMBOL(elv_queue_empty);
-
 struct request *elv_latter_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 232c4b3..06389e9 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -39,13 +39,6 @@ static void noop_add_request(struct request_queue *q, struct request *rq)
 	list_add_tail(&rq->queuelist, &nd->queue);
 }
 
-static int noop_queue_empty(struct request_queue *q)
-{
-	struct noop_data *nd = q->elevator->elevator_data;
-
-	return list_empty(&nd->queue);
-}
-
 static struct request *
 noop_former_request(struct request_queue *q, struct request *rq)
 {
@@ -90,7 +83,6 @@ static struct elevator_type elevator_noop = {
 		.elevator_merge_req_fn		= noop_merged_requests,
 		.elevator_dispatch_fn		= noop_dispatch,
 		.elevator_add_req_fn		= noop_add_request,
-		.elevator_queue_empty_fn	= noop_queue_empty,
 		.elevator_former_req_fn		= noop_former_request,
 		.elevator_latter_req_fn		= noop_latter_request,
 		.elevator_init_fn		= noop_init_queue,
diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 516d5bb..37d1545 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -3170,12 +3170,6 @@ static void do_cciss_request(struct request_queue *q)
 	int sg_index = 0;
 	int chained = 0;
 
-	/* We call start_io here in case there is a command waiting on the
-	 * queue that has not been sent.
-	 */
-	if (blk_queue_plugged(q))
-		goto startio;
-
       queue:
 	creq = blk_peek_request(q);
 	if (!creq)
diff --git a/drivers/block/cpqarray.c b/drivers/block/cpqarray.c
index 946dad4..b2fceb5 100644
--- a/drivers/block/cpqarray.c
+++ b/drivers/block/cpqarray.c
@@ -911,9 +911,6 @@ static void do_ida_request(struct request_queue *q)
 	struct scatterlist tmp_sg[SG_MAX];
 	int i, dir, seg;
 
-	if (blk_queue_plugged(q))
-		goto startio;
-
 queue_next:
 	creq = blk_peek_request(q);
 	if (!creq)
diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index ba95cba..2096628 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -689,8 +689,6 @@ void drbd_al_to_on_disk_bm(struct drbd_conf *mdev)
 		}
 	}
 
-	drbd_blk_run_queue(bdev_get_queue(mdev->ldev->md_bdev));
-
 	/* always (try to) flush bitmap to stable storage */
 	drbd_md_flush(mdev);
 
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index fd42832..0645ca8 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -840,7 +840,6 @@ static int bm_rw(struct drbd_conf *mdev, int rw) __must_hold(local)
 	for (i = 0; i < num_pages; i++)
 		bm_page_io_async(mdev, b, i, rw);
 
-	drbd_blk_run_queue(bdev_get_queue(mdev->ldev->md_bdev));
 	wait_event(b->bm_io_wait, atomic_read(&b->bm_async_io) == 0);
 
 	if (test_bit(BM_MD_IO_ERROR, &b->bm_flags)) {
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 3803a03..0b5718e 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -2382,20 +2382,6 @@ static inline int drbd_queue_order_type(struct drbd_conf *mdev)
 	return QUEUE_ORDERED_NONE;
 }
 
-static inline void drbd_blk_run_queue(struct request_queue *q)
-{
-	if (q && q->unplug_fn)
-		q->unplug_fn(q);
-}
-
-static inline void drbd_kick_lo(struct drbd_conf *mdev)
-{
-	if (get_ldev(mdev)) {
-		drbd_blk_run_queue(bdev_get_queue(mdev->ldev->backing_bdev));
-		put_ldev(mdev);
-	}
-}
-
 static inline void drbd_md_flush(struct drbd_conf *mdev)
 {
 	int r;
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 29cd0dc..6049cb8 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2719,35 +2719,6 @@ static int drbd_release(struct gendisk *gd, fmode_t mode)
 	return 0;
 }
 
-static void drbd_unplug_fn(struct request_queue *q)
-{
-	struct drbd_conf *mdev = q->queuedata;
-
-	/* unplug FIRST */
-	spin_lock_irq(q->queue_lock);
-	blk_remove_plug(q);
-	spin_unlock_irq(q->queue_lock);
-
-	/* only if connected */
-	spin_lock_irq(&mdev->req_lock);
-	if (mdev->state.pdsk >= D_INCONSISTENT && mdev->state.conn >= C_CONNECTED) {
-		D_ASSERT(mdev->state.role == R_PRIMARY);
-		if (test_and_clear_bit(UNPLUG_REMOTE, &mdev->flags)) {
-			/* add to the data.work queue,
-			 * unless already queued.
-			 * XXX this might be a good addition to drbd_queue_work
-			 * anyways, to detect "double queuing" ... */
-			if (list_empty(&mdev->unplug_work.list))
-				drbd_queue_work(&mdev->data.work,
-						&mdev->unplug_work);
-		}
-	}
-	spin_unlock_irq(&mdev->req_lock);
-
-	if (mdev->state.disk >= D_INCONSISTENT)
-		drbd_kick_lo(mdev);
-}
-
 static void drbd_set_defaults(struct drbd_conf *mdev)
 {
 	/* This way we get a compile error when sync_conf grows,
@@ -3222,9 +3193,7 @@ struct drbd_conf *drbd_new_device(unsigned int minor)
 	blk_queue_max_segment_size(q, DRBD_MAX_SEGMENT_SIZE);
 	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
 	blk_queue_merge_bvec(q, drbd_merge_bvec);
-	q->queue_lock = &mdev->req_lock; /* needed since we use */
-		/* plugging on a queue, that actually has no requests! */
-	q->unplug_fn = drbd_unplug_fn;
+	q->queue_lock = &mdev->req_lock;
 
 	mdev->md_io_page = alloc_page(GFP_KERNEL);
 	if (!mdev->md_io_page)
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 24487d4..84132f8 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -187,15 +187,6 @@ static struct page *drbd_pp_first_pages_or_try_alloc(struct drbd_conf *mdev, int
 	return NULL;
 }
 
-/* kick lower level device, if we have more than (arbitrary number)
- * reference counts on it, which typically are locally submitted io
- * requests.  don't use unacked_cnt, so we speed up proto A and B, too. */
-static void maybe_kick_lo(struct drbd_conf *mdev)
-{
-	if (atomic_read(&mdev->local_cnt) >= mdev->net_conf->unplug_watermark)
-		drbd_kick_lo(mdev);
-}
-
 static void reclaim_net_ee(struct drbd_conf *mdev, struct list_head *to_be_freed)
 {
 	struct drbd_epoch_entry *e;
@@ -219,7 +210,6 @@ static void drbd_kick_lo_and_reclaim_net(struct drbd_conf *mdev)
 	LIST_HEAD(reclaimed);
 	struct drbd_epoch_entry *e, *t;
 
-	maybe_kick_lo(mdev);
 	spin_lock_irq(&mdev->req_lock);
 	reclaim_net_ee(mdev, &reclaimed);
 	spin_unlock_irq(&mdev->req_lock);
@@ -436,8 +426,7 @@ void _drbd_wait_ee_list_empty(struct drbd_conf *mdev, struct list_head *head)
 	while (!list_empty(head)) {
 		prepare_to_wait(&mdev->ee_wait, &wait, TASK_UNINTERRUPTIBLE);
 		spin_unlock_irq(&mdev->req_lock);
-		drbd_kick_lo(mdev);
-		schedule();
+		io_schedule();
 		finish_wait(&mdev->ee_wait, &wait);
 		spin_lock_irq(&mdev->req_lock);
 	}
@@ -1147,7 +1136,6 @@ next_bio:
 
 		drbd_generic_make_request(mdev, fault_type, bio);
 	} while (bios);
-	maybe_kick_lo(mdev);
 	return 0;
 
 fail:
@@ -1167,9 +1155,6 @@ static int receive_Barrier(struct drbd_conf *mdev, enum drbd_packets cmd, unsign
 
 	inc_unacked(mdev);
 
-	if (mdev->net_conf->wire_protocol != DRBD_PROT_C)
-		drbd_kick_lo(mdev);
-
 	mdev->current_epoch->barrier_nr = p->barrier;
 	rv = drbd_may_finish_epoch(mdev, mdev->current_epoch, EV_GOT_BARRIER_NR);
 
@@ -3556,9 +3541,6 @@ static int receive_skip(struct drbd_conf *mdev, enum drbd_packets cmd, unsigned
 
 static int receive_UnplugRemote(struct drbd_conf *mdev, enum drbd_packets cmd, unsigned int data_size)
 {
-	if (mdev->state.disk >= D_INCONSISTENT)
-		drbd_kick_lo(mdev);
-
 	/* Make sure we've acked all the TCP data associated
 	 * with the data requests being unplugged */
 	drbd_tcp_quickack(mdev->data.socket);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 11a75d3..ad3fc62 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -960,10 +960,6 @@ allocate_barrier:
 			bio_endio(req->private_bio, -EIO);
 	}
 
-	/* we need to plug ALWAYS since we possibly need to kick lo_dev.
-	 * we plug after submit, so we won't miss an unplug event */
-	drbd_plug_device(mdev);
-
 	return 0;
 
 fail_conflicting:
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 34f224b..e027446 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -792,7 +792,6 @@ int drbd_resync_finished(struct drbd_conf *mdev)
 		 * queue (or even the read operations for those packets
 		 * is not finished by now).   Retry in 100ms. */
 
-		drbd_kick_lo(mdev);
 		__set_current_state(TASK_INTERRUPTIBLE);
 		schedule_timeout(HZ / 10);
 		w = kmalloc(sizeof(struct drbd_work), GFP_ATOMIC);
diff --git a/drivers/block/drbd/drbd_wrappers.h b/drivers/block/drbd/drbd_wrappers.h
index defdb50..53586fa 100644
--- a/drivers/block/drbd/drbd_wrappers.h
+++ b/drivers/block/drbd/drbd_wrappers.h
@@ -45,24 +45,6 @@ static inline void drbd_generic_make_request(struct drbd_conf *mdev,
 		generic_make_request(bio);
 }
 
-static inline void drbd_plug_device(struct drbd_conf *mdev)
-{
-	struct request_queue *q;
-	q = bdev_get_queue(mdev->this_bdev);
-
-	spin_lock_irq(q->queue_lock);
-
-/* XXX the check on !blk_queue_plugged is redundant,
- * implicitly checked in blk_plug_device */
-
-	if (!blk_queue_plugged(q)) {
-		blk_plug_device(q);
-		del_timer(&q->unplug_timer);
-		/* unplugging should not happen automatically... */
-	}
-	spin_unlock_irq(q->queue_lock);
-}
-
 static inline int drbd_crypto_is_hash(struct crypto_tfm *tfm)
 {
         return (crypto_tfm_alg_type(tfm) & CRYPTO_ALG_TYPE_HASH_MASK)
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index b9ba04f..271142b 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3837,7 +3837,6 @@ static int __floppy_read_block_0(struct block_device *bdev)
 	bio.bi_end_io = floppy_rb0_complete;
 
 	submit_bio(READ, &bio);
-	generic_unplug_device(bdev_get_queue(bdev));
 	process_fd_request();
 	wait_for_completion(&complete);
 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 44e18c0..03cf2c9 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -541,17 +541,6 @@ out:
 	return 0;
 }
 
-/*
- * kick off io on the underlying address space
- */
-static void loop_unplug(struct request_queue *q)
-{
-	struct loop_device *lo = q->queuedata;
-
-	queue_flag_clear_unlocked(QUEUE_FLAG_PLUGGED, q);
-	blk_run_address_space(lo->lo_backing_file->f_mapping);
-}
-
 struct switch_request {
 	struct file *file;
 	struct completion wait;
@@ -918,7 +907,6 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 	 */
 	blk_queue_make_request(lo->lo_queue, loop_make_request);
 	lo->lo_queue->queuedata = lo;
-	lo->lo_queue->unplug_fn = loop_unplug;
 
 	if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
 		blk_queue_flush(lo->lo_queue, REQ_FLUSH);
@@ -1020,7 +1008,6 @@ static int loop_clr_fd(struct loop_device *lo, struct block_device *bdev)
 
 	kthread_stop(lo->lo_thread);
 
-	lo->lo_queue->unplug_fn = NULL;
 	lo->lo_backing_file = NULL;
 
 	loop_release_xfer(lo);
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 77d70ee..d20e13f 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1606,8 +1606,6 @@ static int kcdrwd(void *foobar)
 					min_sleep_time = pkt->sleep_time;
 			}
 
-			generic_unplug_device(bdev_get_queue(pd->bdev));
-
 			VPRINTK("kcdrwd: sleeping\n");
 			residue = schedule_timeout(min_sleep_time);
 			VPRINTK("kcdrwd: wake up\n");
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index 8be5715..653439f 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -241,8 +241,7 @@ static void dump_dmastat(struct cardinfo *card, unsigned int dmastat)
  *
  * Whenever IO on the active page completes, the Ready page is activated
  * and the ex-Active page is clean out and made Ready.
- * Otherwise the Ready page is only activated when it becomes full, or
- * when mm_unplug_device is called via the unplug_io_fn.
+ * Otherwise the Ready page is only activated when it becomes full.
  *
  * If a request arrives while both pages a full, it is queued, and b_rdev is
  * overloaded to record whether it was a read or a write.
@@ -333,17 +332,6 @@ static inline void reset_page(struct mm_page *page)
 	page->biotail = &page->bio;
 }
 
-static void mm_unplug_device(struct request_queue *q)
-{
-	struct cardinfo *card = q->queuedata;
-	unsigned long flags;
-
-	spin_lock_irqsave(&card->lock, flags);
-	if (blk_remove_plug(q))
-		activate(card);
-	spin_unlock_irqrestore(&card->lock, flags);
-}
-
 /*
  * If there is room on Ready page, take
  * one bh off list and add it.
@@ -535,7 +523,6 @@ static int mm_make_request(struct request_queue *q, struct bio *bio)
 	*card->biotail = bio;
 	bio->bi_next = NULL;
 	card->biotail = &bio->bi_next;
-	blk_plug_device(q);
 	spin_unlock_irq(&card->lock);
 
 	return 0;
@@ -907,7 +894,6 @@ static int __devinit mm_pci_probe(struct pci_dev *dev,
 	blk_queue_make_request(card->queue, mm_make_request);
 	card->queue->queue_lock = &card->lock;
 	card->queue->queuedata = card;
-	card->queue->unplug_fn = mm_unplug_device;
 
 	tasklet_init(&card->tasklet, process_page, (unsigned long)card);
 
diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index e88a2cf..6f218e01 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -233,8 +233,7 @@ int ide_queue_sense_rq(ide_drive_t *drive, void *special)
 
 	drive->hwif->rq = NULL;
 
-	elv_add_request(drive->queue, &drive->sense_rq,
-			ELEVATOR_INSERT_FRONT, 0);
+	elv_add_request(drive->queue, &drive->sense_rq, ELEVATOR_INSERT_FRONT);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(ide_queue_sense_rq);
diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index 999dac0..f407784 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -549,8 +549,6 @@ plug_device_2:
 
 	if (rq)
 		blk_requeue_request(q, rq);
-	if (!elv_queue_empty(q))
-		blk_plug_device(q);
 }
 
 void ide_requeue_and_plug(ide_drive_t *drive, struct request *rq)
@@ -562,8 +560,6 @@ void ide_requeue_and_plug(ide_drive_t *drive, struct request *rq)
 
 	if (rq)
 		blk_requeue_request(q, rq);
-	if (!elv_queue_empty(q))
-		blk_plug_device(q);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
diff --git a/drivers/ide/ide-park.c b/drivers/ide/ide-park.c
index 88a380c..6ab9ab2 100644
--- a/drivers/ide/ide-park.c
+++ b/drivers/ide/ide-park.c
@@ -52,7 +52,7 @@ static void issue_park_cmd(ide_drive_t *drive, unsigned long timeout)
 	rq->cmd[0] = REQ_UNPARK_HEADS;
 	rq->cmd_len = 1;
 	rq->cmd_type = REQ_TYPE_SPECIAL;
-	elv_add_request(q, rq, ELEVATOR_INSERT_FRONT, 1);
+	elv_add_request(q, rq, ELEVATOR_INSERT_FRONT);
 
 out:
 	return;
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 9a35320..54bfc27 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1339,8 +1339,7 @@ int bitmap_startwrite(struct bitmap *bitmap, sector_t offset, unsigned long sect
 			prepare_to_wait(&bitmap->overflow_wait, &__wait,
 					TASK_UNINTERRUPTIBLE);
 			spin_unlock_irq(&bitmap->lock);
-			md_unplug(bitmap->mddev);
-			schedule();
+			io_schedule();
 			finish_wait(&bitmap->overflow_wait, &__wait);
 			continue;
 		}
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 4e054bd..2c62c11 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -991,11 +991,6 @@ static void clone_init(struct dm_crypt_io *io, struct bio *clone)
 	clone->bi_destructor = dm_crypt_bio_destructor;
 }
 
-static void kcryptd_unplug(struct crypt_config *cc)
-{
-	blk_unplug(bdev_get_queue(cc->dev->bdev));
-}
-
 static int kcryptd_io_read(struct dm_crypt_io *io, gfp_t gfp)
 {
 	struct crypt_config *cc = io->target->private;
@@ -1008,10 +1003,8 @@ static int kcryptd_io_read(struct dm_crypt_io *io, gfp_t gfp)
 	 * one in order to decrypt the whole bio data *afterwards*.
 	 */
 	clone = bio_alloc_bioset(gfp, bio_segments(base_bio), cc->bs);
-	if (!clone) {
-		kcryptd_unplug(cc);
+	if (!clone)
 		return 1;
-	}
 
 	crypt_inc_pending(io);
 
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 924f5f0..e8429ce 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -315,31 +315,6 @@ static int run_complete_job(struct kcopyd_job *job)
 	return 0;
 }
 
-/*
- * Unplug the block device at the specified index.
- */
-static void unplug(struct dm_kcopyd_client *kc, int rw)
-{
-	if (kc->unplug[rw] != NULL) {
-		blk_unplug(bdev_get_queue(kc->unplug[rw]));
-		kc->unplug[rw] = NULL;
-	}
-}
-
-/*
- * Prepare block device unplug. If there's another device
- * to be unplugged at the same array index, we unplug that
- * device first.
- */
-static void prepare_unplug(struct dm_kcopyd_client *kc, int rw,
-			   struct block_device *bdev)
-{
-	if (likely(kc->unplug[rw] == bdev))
-		return;
-	unplug(kc, rw);
-	kc->unplug[rw] = bdev;
-}
-
 static void complete_io(unsigned long error, void *context)
 {
 	struct kcopyd_job *job = (struct kcopyd_job *) context;
@@ -386,15 +361,12 @@ static int run_io_job(struct kcopyd_job *job)
 		.client = job->kc->io_client,
 	};
 
 	if (job->rw == READ) {
 		r = dm_io(&io_req, 1, &job->source, NULL);
-		prepare_unplug(job->kc, READ, job->source.bdev);
 	} else {
 		if (job->num_dests > 1)
 			io_req.bi_rw |= REQ_UNPLUG;
 		r = dm_io(&io_req, job->num_dests, job->dests, NULL);
-		if (!(io_req.bi_rw & REQ_UNPLUG))
-			prepare_unplug(job->kc, WRITE, job->dests[0].bdev);
 	}
 
 	return r;
@@ -466,6 +438,7 @@ static void do_work(struct work_struct *work)
 {
 	struct dm_kcopyd_client *kc = container_of(work,
 					struct dm_kcopyd_client, kcopyd_work);
+	struct blk_plug plug;
 
 	/*
 	 * The order that these are called is *very* important.
@@ -473,18 +446,12 @@ static void do_work(struct work_struct *work)
 	 * Pages jobs when successful will jump onto the io jobs
 	 * list.  io jobs call wake when they complete and it all
 	 * starts again.
-	 *
-	 * Note that io_jobs add block devices to the unplug array,
-	 * this array is cleared with "unplug" calls. It is thus
-	 * forbidden to run complete_jobs after io_jobs and before
-	 * unplug because the block device could be destroyed in
-	 * job completion callback.
 	 */
+	blk_start_plug(&plug);
 	process_jobs(&kc->complete_jobs, kc, run_complete_job);
 	process_jobs(&kc->pages_jobs, kc, run_pages_job);
 	process_jobs(&kc->io_jobs, kc, run_io_job);
-	unplug(kc, READ);
-	unplug(kc, WRITE);
+	blk_finish_plug(&plug);
 }
 
 /*
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index b9e1e15..5ef136c 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -394,7 +394,7 @@ static void raid_unplug(struct dm_target_callbacks *cb)
 {
 	struct raid_set *rs = container_of(cb, struct raid_set, callbacks);
 
-	md_raid5_unplug_device(rs->md.private);
+	md_raid5_kick_device(rs->md.private);
 }
 
 /*
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index dee3267..976ad46 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -842,8 +842,6 @@ static void do_mirror(struct work_struct *work)
 	do_reads(ms, &reads);
 	do_writes(ms, &writes);
 	do_failures(ms, &failures);
-
-	dm_table_unplug_all(ms->ti->table);
 }
 
 /*-----------------------------------------------------------------
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 38e4eb1..f50a7b9 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1275,29 +1275,6 @@ int dm_table_any_busy_target(struct dm_table *t)
 	return 0;
 }
 
-void dm_table_unplug_all(struct dm_table *t)
-{
-	struct dm_dev_internal *dd;
-	struct list_head *devices = dm_table_get_devices(t);
-	struct dm_target_callbacks *cb;
-
-	list_for_each_entry(dd, devices, list) {
-		struct request_queue *q = bdev_get_queue(dd->dm_dev.bdev);
-		char b[BDEVNAME_SIZE];
-
-		if (likely(q))
-			blk_unplug(q);
-		else
-			DMWARN_LIMIT("%s: Cannot unplug nonexistent device %s",
-				     dm_device_name(t->md),
-				     bdevname(dd->dm_dev.bdev, b));
-	}
-
-	list_for_each_entry(cb, &t->target_callbacks, list)
-		if (cb->unplug_fn)
-			cb->unplug_fn(cb);
-}
-
 struct mapped_device *dm_table_get_md(struct dm_table *t)
 {
 	return t->md;
@@ -1345,4 +1322,3 @@ EXPORT_SYMBOL(dm_table_get_mode);
 EXPORT_SYMBOL(dm_table_get_md);
 EXPORT_SYMBOL(dm_table_put);
 EXPORT_SYMBOL(dm_table_get);
-EXPORT_SYMBOL(dm_table_unplug_all);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index eaa3af0..d22b990 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -807,8 +807,6 @@ void dm_requeue_unmapped_request(struct request *clone)
 	dm_unprep_request(rq);
 
 	spin_lock_irqsave(q->queue_lock, flags);
-	if (elv_queue_empty(q))
-		blk_plug_device(q);
 	blk_requeue_request(q, rq);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
@@ -1613,10 +1611,10 @@ static void dm_request_fn(struct request_queue *q)
 	 * number of in-flight I/Os after the queue is stopped in
 	 * dm_suspend().
 	 */
-	while (!blk_queue_plugged(q) && !blk_queue_stopped(q)) {
+	while (!blk_queue_stopped(q)) {
 		rq = blk_peek_request(q);
 		if (!rq)
-			goto plug_and_out;
+			goto delay_and_out;
 
 		/* always use block 0 to find the target for flushes for now */
 		pos = 0;
@@ -1627,7 +1625,7 @@ static void dm_request_fn(struct request_queue *q)
 		BUG_ON(!dm_target_is_valid(ti));
 
 		if (ti->type->busy && ti->type->busy(ti))
-			goto plug_and_out;
+			goto delay_and_out;
 
 		blk_start_request(rq);
 		clone = rq->special;
@@ -1647,11 +1645,8 @@ requeued:
 	BUG_ON(!irqs_disabled());
 	spin_lock(q->queue_lock);
 
-plug_and_out:
-	if (!elv_queue_empty(q))
-		/* Some requests still remain, retry later */
-		blk_plug_device(q);
-
+delay_and_out:
+	blk_delay_queue(q, HZ / 10);
 out:
 	dm_table_put(map);
 
@@ -1680,20 +1675,6 @@ static int dm_lld_busy(struct request_queue *q)
 	return r;
 }
 
-static void dm_unplug_all(struct request_queue *q)
-{
-	struct mapped_device *md = q->queuedata;
-	struct dm_table *map = dm_get_live_table(md);
-
-	if (map) {
-		if (dm_request_based(md))
-			generic_unplug_device(q);
-
-		dm_table_unplug_all(map);
-		dm_table_put(map);
-	}
-}
-
 static int dm_any_congested(void *congested_data, int bdi_bits)
 {
 	int r = bdi_bits;
@@ -1817,7 +1798,6 @@ static void dm_init_md_queue(struct mapped_device *md)
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
 	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
-	md->queue->unplug_fn = dm_unplug_all;
 	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
 	blk_queue_flush(md->queue, REQ_FLUSH | REQ_FUA);
 }
@@ -2263,8 +2243,6 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
 	int r = 0;
 	DECLARE_WAITQUEUE(wait, current);
 
-	dm_unplug_all(md->queue);
-
 	add_wait_queue(&md->wait, &wait);
 
 	while (1) {
@@ -2539,7 +2517,6 @@ int dm_resume(struct mapped_device *md)
 
 	clear_bit(DMF_SUSPENDED, &md->flags);
 
-	dm_table_unplug_all(map);
 	r = 0;
 out:
 	dm_table_put(map);
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 8a2f767..38861b5 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -87,22 +87,6 @@ static int linear_mergeable_bvec(struct request_queue *q,
 	return maxsectors << 9;
 }
 
-static void linear_unplug(struct request_queue *q)
-{
-	mddev_t *mddev = q->queuedata;
-	linear_conf_t *conf;
-	int i;
-
-	rcu_read_lock();
-	conf = rcu_dereference(mddev->private);
-
-	for (i=0; i < mddev->raid_disks; i++) {
-		struct request_queue *r_queue = bdev_get_queue(conf->disks[i].rdev->bdev);
-		blk_unplug(r_queue);
-	}
-	rcu_read_unlock();
-}
-
 static int linear_congested(void *data, int bits)
 {
 	mddev_t *mddev = data;
@@ -225,7 +209,6 @@ static int linear_run (mddev_t *mddev)
 	md_set_array_sectors(mddev, linear_size(mddev, 0, 0));
 
 	blk_queue_merge_bvec(mddev->queue, linear_mergeable_bvec);
-	mddev->queue->unplug_fn = linear_unplug;
 	mddev->queue->backing_dev_info.congested_fn = linear_congested;
 	mddev->queue->backing_dev_info.congested_data = mddev;
 	md_integrity_register(mddev);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b76cfc8..d1326ac 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4807,7 +4807,6 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
 		__md_stop_writes(mddev);
 		md_stop(mddev);
 		mddev->queue->merge_bvec_fn = NULL;
-		mddev->queue->unplug_fn = NULL;
 		mddev->queue->backing_dev_info.congested_fn = NULL;
 
 		/* tell userspace to handle 'inactive' */
@@ -6662,8 +6661,6 @@ EXPORT_SYMBOL_GPL(md_allow_write);
 
 void md_unplug(mddev_t *mddev)
 {
-	if (mddev->queue)
-		blk_unplug(mddev->queue);
 	if (mddev->plug)
 		mddev->plug->unplug_fn(mddev->plug);
 }
@@ -6846,7 +6843,6 @@ void md_do_sync(mddev_t *mddev)
 		     >= mddev->resync_max - mddev->curr_resync_completed
 			    )) {
 			/* time to update curr_resync_completed */
-			md_unplug(mddev);
 			wait_event(mddev->recovery_wait,
 				   atomic_read(&mddev->recovery_active) == 0);
 			mddev->curr_resync_completed = j;
@@ -6922,7 +6918,6 @@ void md_do_sync(mddev_t *mddev)
 		 * about not overloading the IO subsystem. (things like an
 		 * e2fsck being done on the RAID array should execute fast)
 		 */
-		md_unplug(mddev);
 		cond_resched();
 
 		currspeed = ((unsigned long)(io_sectors-mddev->resync_mark_cnt))/2
@@ -6941,8 +6936,6 @@ void md_do_sync(mddev_t *mddev)
 	 * this also signals 'finished resyncing' to md_stop
 	 */
  out:
-	md_unplug(mddev);
-
 	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
 
 	/* tell personality that we are finished */
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 6d7ddf3..1cc8ed4 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -106,36 +106,6 @@ static void multipath_end_request(struct bio *bio, int error)
 	rdev_dec_pending(rdev, conf->mddev);
 }
 
-static void unplug_slaves(mddev_t *mddev)
-{
-	multipath_conf_t *conf = mddev->private;
-	int i;
-
-	rcu_read_lock();
-	for (i=0; i<mddev->raid_disks; i++) {
-		mdk_rdev_t *rdev = rcu_dereference(conf->multipaths[i].rdev);
-		if (rdev && !test_bit(Faulty, &rdev->flags)
-		    && atomic_read(&rdev->nr_pending)) {
-			struct request_queue *r_queue = bdev_get_queue(rdev->bdev);
-
-			atomic_inc(&rdev->nr_pending);
-			rcu_read_unlock();
-
-			blk_unplug(r_queue);
-
-			rdev_dec_pending(rdev, mddev);
-			rcu_read_lock();
-		}
-	}
-	rcu_read_unlock();
-}
-
-static void multipath_unplug(struct request_queue *q)
-{
-	unplug_slaves(q->queuedata);
-}
-
-
 static int multipath_make_request(mddev_t *mddev, struct bio * bio)
 {
 	multipath_conf_t *conf = mddev->private;
@@ -518,7 +488,6 @@ static int multipath_run (mddev_t *mddev)
 	 */
 	md_set_array_sectors(mddev, multipath_size(mddev, 0, 0));
 
-	mddev->queue->unplug_fn = multipath_unplug;
 	mddev->queue->backing_dev_info.congested_fn = multipath_congested;
 	mddev->queue->backing_dev_info.congested_data = mddev;
 	md_integrity_register(mddev);
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index a39f4c3..3cf4279 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -25,21 +25,6 @@
 #include "raid0.h"
 #include "raid5.h"
 
-static void raid0_unplug(struct request_queue *q)
-{
-	mddev_t *mddev = q->queuedata;
-	raid0_conf_t *conf = mddev->private;
-	mdk_rdev_t **devlist = conf->devlist;
-	int raid_disks = conf->strip_zone[0].nb_dev;
-	int i;
-
-	for (i=0; i < raid_disks; i++) {
-		struct request_queue *r_queue = bdev_get_queue(devlist[i]->bdev);
-
-		blk_unplug(r_queue);
-	}
-}
-
 static int raid0_congested(void *data, int bits)
 {
 	mddev_t *mddev = data;
@@ -264,7 +249,6 @@ static int create_strip_zones(mddev_t *mddev, raid0_conf_t **private_conf)
 		       mdname(mddev),
 		       (unsigned long long)smallest->sectors);
 	}
-	mddev->queue->unplug_fn = raid0_unplug;
 	mddev->queue->backing_dev_info.congested_fn = raid0_congested;
 	mddev->queue->backing_dev_info.congested_data = mddev;
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a23ffa3..8a61fcc 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -52,23 +52,16 @@
 #define	NR_RAID1_BIOS 256
 
 
-static void unplug_slaves(mddev_t *mddev);
-
 static void allow_barrier(conf_t *conf);
 static void lower_barrier(conf_t *conf);
 
 static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
 	struct pool_info *pi = data;
-	r1bio_t *r1_bio;
 	int size = offsetof(r1bio_t, bios[pi->raid_disks]);
 
 	/* allocate a r1bio with room for raid_disks entries in the bios array */
-	r1_bio = kzalloc(size, gfp_flags);
-	if (!r1_bio && pi->mddev)
-		unplug_slaves(pi->mddev);
-
-	return r1_bio;
+	return kzalloc(size, gfp_flags);
 }
 
 static void r1bio_pool_free(void *r1_bio, void *data)
@@ -91,10 +84,8 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 	int i, j;
 
 	r1_bio = r1bio_pool_alloc(gfp_flags, pi);
-	if (!r1_bio) {
-		unplug_slaves(pi->mddev);
+	if (!r1_bio)
 		return NULL;
-	}
 
 	/*
 	 * Allocate bios : 1 for reading, n-1 for writing
@@ -520,37 +511,6 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
 	return new_disk;
 }
 
-static void unplug_slaves(mddev_t *mddev)
-{
-	conf_t *conf = mddev->private;
-	int i;
-
-	rcu_read_lock();
-	for (i=0; i<mddev->raid_disks; i++) {
-		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
-		if (rdev && !test_bit(Faulty, &rdev->flags) && atomic_read(&rdev->nr_pending)) {
-			struct request_queue *r_queue = bdev_get_queue(rdev->bdev);
-
-			atomic_inc(&rdev->nr_pending);
-			rcu_read_unlock();
-
-			blk_unplug(r_queue);
-
-			rdev_dec_pending(rdev, mddev);
-			rcu_read_lock();
-		}
-	}
-	rcu_read_unlock();
-}
-
-static void raid1_unplug(struct request_queue *q)
-{
-	mddev_t *mddev = q->queuedata;
-
-	unplug_slaves(mddev);
-	md_wakeup_thread(mddev->thread);
-}
-
 static int raid1_congested(void *data, int bits)
 {
 	mddev_t *mddev = data;
@@ -580,20 +540,17 @@ static int raid1_congested(void *data, int bits)
 }
 
 
-static int flush_pending_writes(conf_t *conf)
+static void flush_pending_writes(conf_t *conf)
 {
 	/* Any writes that have been queued but are awaiting
 	 * bitmap updates get flushed here.
-	 * We return 1 if any requests were actually submitted.
 	 */
-	int rv = 0;
 
 	spin_lock_irq(&conf->device_lock);
 
 	if (conf->pending_bio_list.head) {
 		struct bio *bio;
 		bio = bio_list_get(&conf->pending_bio_list);
-		blk_remove_plug(conf->mddev->queue);
 		spin_unlock_irq(&conf->device_lock);
 		/* flush any pending bitmap writes to
 		 * disk before proceeding w/ I/O */
@@ -605,10 +562,14 @@ static int flush_pending_writes(conf_t *conf)
 			generic_make_request(bio);
 			bio = next;
 		}
-		rv = 1;
 	} else
 		spin_unlock_irq(&conf->device_lock);
-	return rv;
+}
+
+static void md_kick_device(mddev_t *mddev)
+{
+	blk_flush_plug(current);
+	md_wakeup_thread(mddev->thread);
 }
 
 /* Barriers....
@@ -640,8 +601,7 @@ static void raise_barrier(conf_t *conf)
 
 	/* Wait until no block IO is waiting */
 	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
-			    conf->resync_lock,
-			    raid1_unplug(conf->mddev->queue));
+			    conf->resync_lock, md_kick_device(conf->mddev));
 
 	/* block any new IO from starting */
 	conf->barrier++;
@@ -649,8 +609,7 @@ static void raise_barrier(conf_t *conf)
 	/* Now wait for all pending IO to complete */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
-			    conf->resync_lock,
-			    raid1_unplug(conf->mddev->queue));
+			    conf->resync_lock, md_kick_device(conf->mddev));
 
 	spin_unlock_irq(&conf->resync_lock);
 }
@@ -672,7 +631,7 @@ static void wait_barrier(conf_t *conf)
 		conf->nr_waiting++;
 		wait_event_lock_irq(conf->wait_barrier, !conf->barrier,
 				    conf->resync_lock,
-				    raid1_unplug(conf->mddev->queue));
+				    md_kick_device(conf->mddev));
 		conf->nr_waiting--;
 	}
 	conf->nr_pending++;
@@ -709,7 +668,7 @@ static void freeze_array(conf_t *conf)
 			    conf->nr_pending == conf->nr_queued+1,
 			    conf->resync_lock,
 			    ({ flush_pending_writes(conf);
-			       raid1_unplug(conf->mddev->queue); }));
+			       md_kick_device(conf->mddev); }));
 	spin_unlock_irq(&conf->resync_lock);
 }
 static void unfreeze_array(conf_t *conf)
@@ -959,7 +918,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 		atomic_inc(&r1_bio->remaining);
 		spin_lock_irqsave(&conf->device_lock, flags);
 		bio_list_add(&conf->pending_bio_list, mbio);
-		blk_plug_device(mddev->queue);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
 	r1_bio_write_done(r1_bio, bio->bi_vcnt, behind_pages, behind_pages != NULL);
@@ -968,7 +926,7 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 	/* In case raid1d snuck in to freeze_array */
 	wake_up(&conf->wait_barrier);
 
-	if (do_sync)
+	if (do_sync || !bitmap)
 		md_wakeup_thread(mddev->thread);
 
 	return 0;
@@ -1558,7 +1516,6 @@ static void raid1d(mddev_t *mddev)
 	unsigned long flags;
 	conf_t *conf = mddev->private;
 	struct list_head *head = &conf->retry_list;
-	int unplug=0;
 	mdk_rdev_t *rdev;
 
 	md_check_recovery(mddev);
@@ -1566,7 +1523,7 @@ static void raid1d(mddev_t *mddev)
 	for (;;) {
 		char b[BDEVNAME_SIZE];
 
-		unplug += flush_pending_writes(conf);
+		flush_pending_writes(conf);
 
 		spin_lock_irqsave(&conf->device_lock, flags);
 		if (list_empty(head)) {
@@ -1580,10 +1537,9 @@ static void raid1d(mddev_t *mddev)
 
 		mddev = r1_bio->mddev;
 		conf = mddev->private;
 		if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
 			sync_request_write(mddev, r1_bio);
-			unplug = 1;
 		} else {
 			int disk;
 
 			/* we got a read error. Maybe the drive is bad.  Maybe just
@@ -1633,14 +1589,11 @@ static void raid1d(mddev_t *mddev)
 				bio->bi_end_io = raid1_end_read_request;
 				bio->bi_rw = READ | do_sync;
 				bio->bi_private = r1_bio;
-				unplug = 1;
 				generic_make_request(bio);
 			}
 		}
 		cond_resched();
 	}
-	if (unplug)
-		unplug_slaves(mddev);
 }
 
 
@@ -2064,7 +2017,6 @@ static int run(mddev_t *mddev)
 
 	md_set_array_sectors(mddev, raid1_size(mddev, 0, 0));
 
-	mddev->queue->unplug_fn = raid1_unplug;
 	mddev->queue->backing_dev_info.congested_fn = raid1_congested;
 	mddev->queue->backing_dev_info.congested_data = mddev;
 	md_integrity_register(mddev);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 69b6595..5811845 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -57,23 +57,16 @@
  */
 #define	NR_RAID10_BIOS 256
 
-static void unplug_slaves(mddev_t *mddev);
-
 static void allow_barrier(conf_t *conf);
 static void lower_barrier(conf_t *conf);
 
 static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
 	conf_t *conf = data;
-	r10bio_t *r10_bio;
 	int size = offsetof(struct r10bio_s, devs[conf->copies]);
 
 	/* allocate a r10bio with room for raid_disks entries in the bios array */
-	r10_bio = kzalloc(size, gfp_flags);
-	if (!r10_bio && conf->mddev)
-		unplug_slaves(conf->mddev);
-
-	return r10_bio;
+	return kzalloc(size, gfp_flags);
 }
 
 static void r10bio_pool_free(void *r10_bio, void *data)
@@ -106,10 +99,8 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
 	int nalloc;
 
 	r10_bio = r10bio_pool_alloc(gfp_flags, conf);
-	if (!r10_bio) {
-		unplug_slaves(conf->mddev);
+	if (!r10_bio)
 		return NULL;
-	}
 
 	if (test_bit(MD_RECOVERY_SYNC, &conf->mddev->recovery))
 		nalloc = conf->copies; /* resync */
@@ -597,37 +588,6 @@ rb_out:
 	return disk;
 }
 
-static void unplug_slaves(mddev_t *mddev)
-{
-	conf_t *conf = mddev->private;
-	int i;
-
-	rcu_read_lock();
-	for (i=0; i < conf->raid_disks; i++) {
-		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
-		if (rdev && !test_bit(Faulty, &rdev->flags) && atomic_read(&rdev->nr_pending)) {
-			struct request_queue *r_queue = bdev_get_queue(rdev->bdev);
-
-			atomic_inc(&rdev->nr_pending);
-			rcu_read_unlock();
-
-			blk_unplug(r_queue);
-
-			rdev_dec_pending(rdev, mddev);
-			rcu_read_lock();
-		}
-	}
-	rcu_read_unlock();
-}
-
-static void raid10_unplug(struct request_queue *q)
-{
-	mddev_t *mddev = q->queuedata;
-
-	unplug_slaves(q->queuedata);
-	md_wakeup_thread(mddev->thread);
-}
-
 static int raid10_congested(void *data, int bits)
 {
 	mddev_t *mddev = data;
@@ -649,20 +609,16 @@ static int raid10_congested(void *data, int bits)
 	return ret;
 }
 
-static int flush_pending_writes(conf_t *conf)
+static void flush_pending_writes(conf_t *conf)
 {
 	/* Any writes that have been queued but are awaiting
 	 * bitmap updates get flushed here.
-	 * We return 1 if any requests were actually submitted.
 	 */
-	int rv = 0;
-
 	spin_lock_irq(&conf->device_lock);
 
 	if (conf->pending_bio_list.head) {
 		struct bio *bio;
 		bio = bio_list_get(&conf->pending_bio_list);
-		blk_remove_plug(conf->mddev->queue);
 		spin_unlock_irq(&conf->device_lock);
 		/* flush any pending bitmap writes to disk
 		 * before proceeding w/ I/O */
@@ -674,11 +631,16 @@ static int flush_pending_writes(conf_t *conf)
 			generic_make_request(bio);
 			bio = next;
 		}
-		rv = 1;
 	} else
 		spin_unlock_irq(&conf->device_lock);
-	return rv;
 }
+
+static void md_kick_device(mddev_t *mddev)
+{
+	blk_flush_plug(current);
+	md_wakeup_thread(mddev->thread);
+}
+
 /* Barriers....
  * Sometimes we need to suspend IO while we do something else,
  * either some resync/recovery, or reconfigure the array.
@@ -708,8 +670,7 @@ static void raise_barrier(conf_t *conf, int force)
 
 	/* Wait until no block IO is waiting (unless 'force') */
 	wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting,
-			    conf->resync_lock,
-			    raid10_unplug(conf->mddev->queue));
+			    conf->resync_lock, md_kick_device(conf->mddev));
 
 	/* block any new IO from starting */
 	conf->barrier++;
@@ -717,8 +678,7 @@ static void raise_barrier(conf_t *conf, int force)
 	/* No wait for all pending IO to complete */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
-			    conf->resync_lock,
-			    raid10_unplug(conf->mddev->queue));
+			    conf->resync_lock, md_kick_device(conf->mddev));
 
 	spin_unlock_irq(&conf->resync_lock);
 }
@@ -739,7 +699,7 @@ static void wait_barrier(conf_t *conf)
 		conf->nr_waiting++;
 		wait_event_lock_irq(conf->wait_barrier, !conf->barrier,
 				    conf->resync_lock,
-				    raid10_unplug(conf->mddev->queue));
+				    md_kick_device(conf->mddev));
 		conf->nr_waiting--;
 	}
 	conf->nr_pending++;
@@ -776,7 +736,7 @@ static void freeze_array(conf_t *conf)
 			    conf->nr_pending == conf->nr_queued+1,
 			    conf->resync_lock,
 			    ({ flush_pending_writes(conf);
-			       raid10_unplug(conf->mddev->queue); }));
+			       md_kick_device(conf->mddev); }));
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -971,7 +931,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 		atomic_inc(&r10_bio->remaining);
 		spin_lock_irqsave(&conf->device_lock, flags);
 		bio_list_add(&conf->pending_bio_list, mbio);
-		blk_plug_device(mddev->queue);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
 
@@ -988,7 +947,7 @@ static int make_request(mddev_t *mddev, struct bio * bio)
 	/* In case raid10d snuck in to freeze_array */
 	wake_up(&conf->wait_barrier);
 
-	if (do_sync)
+	if (do_sync || !mddev->bitmap)
 		md_wakeup_thread(mddev->thread);
 
 	return 0;
@@ -1681,7 +1640,6 @@ static void raid10d(mddev_t *mddev)
 	unsigned long flags;
 	conf_t *conf = mddev->private;
 	struct list_head *head = &conf->retry_list;
-	int unplug=0;
 	mdk_rdev_t *rdev;
 
 	md_check_recovery(mddev);
@@ -1689,7 +1647,7 @@ static void raid10d(mddev_t *mddev)
 	for (;;) {
 		char b[BDEVNAME_SIZE];
 
-		unplug += flush_pending_writes(conf);
+		flush_pending_writes(conf);
 
 		spin_lock_irqsave(&conf->device_lock, flags);
 		if (list_empty(head)) {
@@ -1703,13 +1661,11 @@ static void raid10d(mddev_t *mddev)
 
 		mddev = r10_bio->mddev;
 		conf = mddev->private;
 		if (test_bit(R10BIO_IsSync, &r10_bio->state)) {
 			sync_request_write(mddev, r10_bio);
-			unplug = 1;
-		} else 	if (test_bit(R10BIO_IsRecover, &r10_bio->state)) {
+		} else if (test_bit(R10BIO_IsRecover, &r10_bio->state)) {
 			recovery_request_write(mddev, r10_bio);
-			unplug = 1;
 		} else {
 			int mirror;
 			/* we got a read error. Maybe the drive is bad.  Maybe just
 			 * the block and we can fix it.
@@ -1756,14 +1712,11 @@ static void raid10d(mddev_t *mddev)
 				bio->bi_rw = READ | do_sync;
 				bio->bi_private = r10_bio;
 				bio->bi_end_io = raid10_end_read_request;
-				unplug = 1;
 				generic_make_request(bio);
 			}
 		}
 		cond_resched();
 	}
-	if (unplug)
-		unplug_slaves(mddev);
 }
 
 
@@ -2376,7 +2329,6 @@ static int run(mddev_t *mddev)
 	md_set_array_sectors(mddev, size);
 	mddev->resync_max_sectors = size;
 
-	mddev->queue->unplug_fn = raid10_unplug;
 	mddev->queue->backing_dev_info.congested_fn = raid10_congested;
 	mddev->queue->backing_dev_info.congested_data = mddev;
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5044bab..f1f2d3c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -433,8 +433,6 @@ static int has_failed(raid5_conf_t *conf)
 	return 0;
 }
 
-static void unplug_slaves(mddev_t *mddev);
-
 static struct stripe_head *
 get_active_stripe(raid5_conf_t *conf, sector_t sector,
 		  int previous, int noblock, int noquiesce)
@@ -463,8 +461,7 @@ get_active_stripe(raid5_conf_t *conf, sector_t sector,
 						     < (conf->max_nr_stripes *3/4)
 						     || !conf->inactive_blocked),
 						    conf->device_lock,
-						    md_raid5_unplug_device(conf)
-					);
+						    md_raid5_kick_device(conf));
 				conf->inactive_blocked = 0;
 			} else
 				init_stripe(sh, sector, previous);
@@ -1473,8 +1470,7 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 		wait_event_lock_irq(conf->wait_for_stripe,
 				    !list_empty(&conf->inactive_list),
 				    conf->device_lock,
-				    unplug_slaves(conf->mddev)
-			);
+				    blk_flush_plug(current));
 		osh = get_free_stripe(conf);
 		spin_unlock_irq(&conf->device_lock);
 		atomic_set(&nsh->count, 1);
@@ -3645,58 +3641,19 @@ static void activate_bit_delay(raid5_conf_t *conf)
 	}
 }
 
-static void unplug_slaves(mddev_t *mddev)
-{
-	raid5_conf_t *conf = mddev->private;
-	int i;
-	int devs = max(conf->raid_disks, conf->previous_raid_disks);
-
-	rcu_read_lock();
-	for (i = 0; i < devs; i++) {
-		mdk_rdev_t *rdev = rcu_dereference(conf->disks[i].rdev);
-		if (rdev && !test_bit(Faulty, &rdev->flags) && atomic_read(&rdev->nr_pending)) {
-			struct request_queue *r_queue = bdev_get_queue(rdev->bdev);
-
-			atomic_inc(&rdev->nr_pending);
-			rcu_read_unlock();
-
-			blk_unplug(r_queue);
-
-			rdev_dec_pending(rdev, mddev);
-			rcu_read_lock();
-		}
-	}
-	rcu_read_unlock();
-}
-
-void md_raid5_unplug_device(raid5_conf_t *conf)
+void md_raid5_kick_device(raid5_conf_t *conf)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&conf->device_lock, flags);
-
-	if (plugger_remove_plug(&conf->plug)) {
-		conf->seq_flush++;
-		raid5_activate_delayed(conf);
-	}
+	blk_flush_plug(current);
+	raid5_activate_delayed(conf);
 	md_wakeup_thread(conf->mddev->thread);
-
-	spin_unlock_irqrestore(&conf->device_lock, flags);
-
-	unplug_slaves(conf->mddev);
 }
-EXPORT_SYMBOL_GPL(md_raid5_unplug_device);
+EXPORT_SYMBOL_GPL(md_raid5_kick_device);
 
 static void raid5_unplug(struct plug_handle *plug)
 {
 	raid5_conf_t *conf = container_of(plug, raid5_conf_t, plug);
-	md_raid5_unplug_device(conf);
-}
 
-static void raid5_unplug_queue(struct request_queue *q)
-{
-	mddev_t *mddev = q->queuedata;
-	md_raid5_unplug_device(mddev->private);
+	md_raid5_kick_device(conf);
 }
 
 int md_raid5_congested(mddev_t *mddev, int bits)
@@ -4100,7 +4057,7 @@ static int make_request(mddev_t *mddev, struct bio * bi)
 				 * add failed due to overlap.  Flush everything
 				 * and wait a while
 				 */
-				md_raid5_unplug_device(conf);
+				md_raid5_kick_device(conf);
 				release_stripe(sh);
 				schedule();
 				goto retry;
@@ -4365,7 +4322,6 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 
 	if (sector_nr >= max_sector) {
 		/* just being told to finish up .. nothing much to do */
-		unplug_slaves(mddev);
 
 		if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
 			end_reshape(conf);
@@ -4569,7 +4525,6 @@ static void raid5d(mddev_t *mddev)
 	spin_unlock_irq(&conf->device_lock);
 
 	async_tx_issue_pending_all();
-	unplug_slaves(mddev);
 
 	pr_debug("--- raid5d inactive\n");
 }
@@ -5201,11 +5156,9 @@ static int run(mddev_t *mddev)
 			mddev->queue->backing_dev_info.ra_pages = 2 * stripe;
 
 		blk_queue_merge_bvec(mddev->queue, raid5_mergeable_bvec);
-
 		mddev->queue->backing_dev_info.congested_data = mddev;
 		mddev->queue->backing_dev_info.congested_fn = raid5_congested;
 		mddev->queue->queue_lock = &conf->device_lock;
-		mddev->queue->unplug_fn = raid5_unplug_queue;
 
 		chunk_size = mddev->chunk_sectors << 9;
 		blk_queue_io_min(mddev->queue, chunk_size);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 2ace058..8d563a4 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -503,6 +503,6 @@ static inline int algorithm_is_DDF(int layout)
 }
 
 extern int md_raid5_congested(mddev_t *mddev, int bits);
-extern void md_raid5_unplug_device(raid5_conf_t *conf);
+extern void md_raid5_kick_device(raid5_conf_t *conf);
 extern int raid5_set_cache_size(mddev_t *mddev, int size);
 #endif
diff --git a/drivers/message/i2o/i2o_block.c b/drivers/message/i2o/i2o_block.c
index ae7cad1..b29eb4e 100644
--- a/drivers/message/i2o/i2o_block.c
+++ b/drivers/message/i2o/i2o_block.c
@@ -895,11 +895,7 @@ static void i2o_block_request_fn(struct request_queue *q)
 {
 	struct request *req;
 
-	while (!blk_queue_plugged(q)) {
-		req = blk_peek_request(q);
-		if (!req)
-			break;
-
+	while ((req = blk_peek_request(q)) != NULL) {
 		if (req->cmd_type == REQ_TYPE_FS) {
 			struct i2o_block_delayed_request *dreq;
 			struct i2o_block_request *ireq = req->special;
diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index 4e42d03..2ae7275 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -55,8 +55,7 @@ static int mmc_queue_thread(void *d)
 
 		spin_lock_irq(q->queue_lock);
 		set_current_state(TASK_INTERRUPTIBLE);
-		if (!blk_queue_plugged(q))
-			req = blk_fetch_request(q);
+		req = blk_fetch_request(q);
 		mq->req = req;
 		spin_unlock_irq(q->queue_lock);
 
diff --git a/drivers/s390/block/dasd.c b/drivers/s390/block/dasd.c
index 794bfd9..4d2df2f 100644
--- a/drivers/s390/block/dasd.c
+++ b/drivers/s390/block/dasd.c
@@ -1917,7 +1917,7 @@ static void __dasd_process_request_queue(struct dasd_block *block)
 		return;
 	}
 	/* Now we try to fetch requests from the request queue */
-	while (!blk_queue_plugged(queue) && (req = blk_peek_request(queue))) {
+	while ((req = blk_peek_request(queue))) {
 		if (basedev->features & DASD_FEATURE_READONLY &&
 		    rq_data_dir(req) == WRITE) {
 			DBF_DEV_EVENT(DBF_ERR, basedev,
diff --git a/drivers/s390/char/tape_block.c b/drivers/s390/char/tape_block.c
index 55d2d0f..f061b25 100644
--- a/drivers/s390/char/tape_block.c
+++ b/drivers/s390/char/tape_block.c
@@ -161,7 +161,6 @@ tapeblock_requeue(struct work_struct *work) {
 
 	spin_lock_irq(&device->blk_data.request_queue_lock);
 	while (
-		!blk_queue_plugged(queue) &&
 		blk_peek_request(queue) &&
 		nr_queued < TAPEBLOCK_MIN_REQUEUE
 	) {
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 998c01b..2cefabd 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -3913,7 +3913,7 @@ fc_bsg_request_handler(struct request_queue *q, struct Scsi_Host *shost,
 	if (!get_device(dev))
 		return;
 
-	while (!blk_queue_plugged(q)) {
+	while (1) {
 		if (rport && (rport->port_state == FC_PORTSTATE_BLOCKED) &&
 		    !(rport->flags & FC_RPORT_FAST_FAIL_TIMEDOUT))
 			break;
diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index 927e99c..c6fcf76 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -173,11 +173,7 @@ static void sas_smp_request(struct request_queue *q, struct Scsi_Host *shost,
 	int ret;
 	int (*handler)(struct Scsi_Host *, struct sas_rphy *, struct request *);
 
-	while (!blk_queue_plugged(q)) {
-		req = blk_fetch_request(q);
-		if (!req)
-			break;
-
+	while ((req = blk_fetch_request(q)) != NULL) {
 		spin_unlock_irq(q->queue_lock);
 
 		handler = to_sas_internal(shost->transportt)->f->smp_handler;
diff --git a/fs/adfs/inode.c b/fs/adfs/inode.c
index 65794b8..1cc84b2 100644
--- a/fs/adfs/inode.c
+++ b/fs/adfs/inode.c
@@ -73,7 +73,6 @@ static sector_t _adfs_bmap(struct address_space *mapping, sector_t block)
 static const struct address_space_operations adfs_aops = {
 	.readpage	= adfs_readpage,
 	.writepage	= adfs_writepage,
-	.sync_page	= block_sync_page,
 	.write_begin	= adfs_write_begin,
 	.write_end	= generic_write_end,
 	.bmap		= _adfs_bmap
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 0a90dcd..acf321b 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -429,7 +429,6 @@ static sector_t _affs_bmap(struct address_space *mapping, sector_t block)
 const struct address_space_operations affs_aops = {
 	.readpage = affs_readpage,
 	.writepage = affs_writepage,
-	.sync_page = block_sync_page,
 	.write_begin = affs_write_begin,
 	.write_end = generic_write_end,
 	.bmap = _affs_bmap
@@ -786,7 +785,6 @@ out:
 const struct address_space_operations affs_aops_ofs = {
 	.readpage = affs_readpage_ofs,
 	//.writepage = affs_writepage_ofs,
-	//.sync_page = affs_sync_page_ofs,
 	.write_begin = affs_write_begin_ofs,
 	.write_end = affs_write_end_ofs
 };
diff --git a/fs/aio.c b/fs/aio.c
index fc557a3..c5ea494 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1550,9 +1550,11 @@ static void aio_batch_free(struct hlist_head *batch_hash)
 	struct hlist_node *pos, *n;
 	int i;
 
+	/*
+	 * TODO: kill this
+	 */
 	for (i = 0; i < AIO_BATCH_HASH_SIZE; i++) {
 		hlist_for_each_entry_safe(abe, pos, n, &batch_hash[i], list) {
-			blk_run_address_space(abe->mapping);
 			iput(abe->mapping->host);
 			hlist_del(&abe->list);
 			mempool_free(abe, abe_pool);
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index b1d0c79..06457ed 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -75,7 +75,6 @@ static const struct inode_operations befs_dir_inode_operations = {
 
 static const struct address_space_operations befs_aops = {
 	.readpage	= befs_readpage,
-	.sync_page	= block_sync_page,
 	.bmap		= befs_bmap,
 };
 
diff --git a/fs/bfs/file.c b/fs/bfs/file.c
index eb67edd..f20e8a7 100644
--- a/fs/bfs/file.c
+++ b/fs/bfs/file.c
@@ -186,7 +186,6 @@ static sector_t bfs_bmap(struct address_space *mapping, sector_t block)
 const struct address_space_operations bfs_aops = {
 	.readpage	= bfs_readpage,
 	.writepage	= bfs_writepage,
-	.sync_page	= block_sync_page,
 	.write_begin	= bfs_write_begin,
 	.write_end	= generic_write_end,
 	.bmap		= bfs_bmap,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 333a7bb..6dea657 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1521,7 +1521,6 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
 static const struct address_space_operations def_blk_aops = {
 	.readpage	= blkdev_readpage,
 	.writepage	= blkdev_writepage,
-	.sync_page	= block_sync_page,
 	.write_begin	= blkdev_write_begin,
 	.write_end	= blkdev_write_end,
 	.writepages	= generic_writepages,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b531c36..bb5c93a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -843,7 +843,6 @@ static const struct address_space_operations btree_aops = {
 	.writepages	= btree_writepages,
 	.releasepage	= btree_releasepage,
 	.invalidatepage = btree_invalidatepage,
-	.sync_page	= block_sync_page,
 #ifdef CONFIG_MIGRATION
 	.migratepage	= btree_migratepage,
 #endif
@@ -1327,82 +1326,6 @@ static int btrfs_congested_fn(void *congested_data, int bdi_bits)
 }
 
 /*
- * this unplugs every device on the box, and it is only used when page
- * is null
- */
-static void __unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
-{
-	struct btrfs_device *device;
-	struct btrfs_fs_info *info;
-
-	info = (struct btrfs_fs_info *)bdi->unplug_io_data;
-	list_for_each_entry(device, &info->fs_devices->devices, dev_list) {
-		if (!device->bdev)
-			continue;
-
-		bdi = blk_get_backing_dev_info(device->bdev);
-		if (bdi->unplug_io_fn)
-			bdi->unplug_io_fn(bdi, page);
-	}
-}
-
-static void btrfs_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
-{
-	struct inode *inode;
-	struct extent_map_tree *em_tree;
-	struct extent_map *em;
-	struct address_space *mapping;
-	u64 offset;
-
-	/* the generic O_DIRECT read code does this */
-	if (1 || !page) {
-		__unplug_io_fn(bdi, page);
-		return;
-	}
-
-	/*
-	 * page->mapping may change at any time.  Get a consistent copy
-	 * and use that for everything below
-	 */
-	smp_mb();
-	mapping = page->mapping;
-	if (!mapping)
-		return;
-
-	inode = mapping->host;
-
-	/*
-	 * don't do the expensive searching for a small number of
-	 * devices
-	 */
-	if (BTRFS_I(inode)->root->fs_info->fs_devices->open_devices <= 2) {
-		__unplug_io_fn(bdi, page);
-		return;
-	}
-
-	offset = page_offset(page);
-
-	em_tree = &BTRFS_I(inode)->extent_tree;
-	read_lock(&em_tree->lock);
-	em = lookup_extent_mapping(em_tree, offset, PAGE_CACHE_SIZE);
-	read_unlock(&em_tree->lock);
-	if (!em) {
-		__unplug_io_fn(bdi, page);
-		return;
-	}
-
-	if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
-		free_extent_map(em);
-		__unplug_io_fn(bdi, page);
-		return;
-	}
-	offset = offset - em->start;
-	btrfs_unplug_page(&BTRFS_I(inode)->root->fs_info->mapping_tree,
-			  em->block_start + offset, page);
-	free_extent_map(em);
-}
-
-/*
  * If this fails, caller must call bdi_destroy() to get rid of the
  * bdi again.
  */
@@ -1416,8 +1339,6 @@ static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
 		return err;
 
 	bdi->ra_pages	= default_backing_dev_info.ra_pages;
-	bdi->unplug_io_fn	= btrfs_unplug_io_fn;
-	bdi->unplug_io_data	= info;
 	bdi->congested_fn	= btrfs_congested_fn;
 	bdi->congested_data	= info;
 	return 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 160b55b..a4da8bc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7204,7 +7204,6 @@ static const struct address_space_operations btrfs_aops = {
 	.writepage	= btrfs_writepage,
 	.writepages	= btrfs_writepages,
 	.readpages	= btrfs_readpages,
-	.sync_page	= block_sync_page,
 	.direct_IO	= btrfs_direct_IO,
 	.invalidatepage = btrfs_invalidatepage,
 	.releasepage	= btrfs_releasepage,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d158530..a5d4417 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -162,7 +162,6 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
 	struct bio *cur;
 	int again = 0;
 	unsigned long num_run;
-	unsigned long num_sync_run;
 	unsigned long batch_run = 0;
 	unsigned long limit;
 	unsigned long last_waited = 0;
@@ -173,11 +172,6 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
 	limit = btrfs_async_submit_limit(fs_info);
 	limit = limit * 2 / 3;
 
-	/* we want to make sure that every time we switch from the sync
-	 * list to the normal list, we unplug
-	 */
-	num_sync_run = 0;
-
 loop:
 	spin_lock(&device->io_lock);
 
@@ -223,15 +217,6 @@ loop_lock:
 
 	spin_unlock(&device->io_lock);
 
-	/*
-	 * if we're doing the regular priority list, make sure we unplug
-	 * for any high prio bios we've sent down
-	 */
-	if (pending_bios == &device->pending_bios && num_sync_run > 0) {
-		num_sync_run = 0;
-		blk_run_backing_dev(bdi, NULL);
-	}
-
 	while (pending) {
 
 		rmb();
@@ -259,19 +244,11 @@ loop_lock:
 
 		BUG_ON(atomic_read(&cur->bi_cnt) == 0);
 
-		if (cur->bi_rw & REQ_SYNC)
-			num_sync_run++;
-
 		submit_bio(cur->bi_rw, cur);
 		num_run++;
 		batch_run++;
-		if (need_resched()) {
-			if (num_sync_run) {
-				blk_run_backing_dev(bdi, NULL);
-				num_sync_run = 0;
-			}
+		if (need_resched())
 			cond_resched();
-		}
 
 		/*
 		 * we made progress, there is more work to do and the bdi
@@ -304,13 +281,8 @@ loop_lock:
 				 * against it before looping
 				 */
 				last_waited = ioc->last_waited;
-				if (need_resched()) {
-					if (num_sync_run) {
-						blk_run_backing_dev(bdi, NULL);
-						num_sync_run = 0;
-					}
+				if (need_resched())
 					cond_resched();
-				}
 				continue;
 			}
 			spin_lock(&device->io_lock);
@@ -323,22 +295,6 @@ loop_lock:
 		}
 	}
 
-	if (num_sync_run) {
-		num_sync_run = 0;
-		blk_run_backing_dev(bdi, NULL);
-	}
-	/*
-	 * IO has already been through a long path to get here.  Checksumming,
-	 * async helper threads, perhaps compression.  We've done a pretty
-	 * good job of collecting a batch of IO and should just unplug
-	 * the device right away.
-	 *
-	 * This will help anyone who is waiting on the IO, they might have
-	 * already unplugged, but managed to do so before the bio they
-	 * cared about found its way down here.
-	 */
-	blk_run_backing_dev(bdi, NULL);
-
 	cond_resched();
 	if (again)
 		goto loop;
@@ -2931,7 +2887,7 @@ static int find_live_mirror(struct map_lookup *map, int first, int num,
 static int __btrfs_map_block(struct btrfs_mapping_tree *map_tree, int rw,
 			     u64 logical, u64 *length,
 			     struct btrfs_multi_bio **multi_ret,
-			     int mirror_num, struct page *unplug_page)
+			     int mirror_num)
 {
 	struct extent_map *em;
 	struct map_lookup *map;
@@ -2963,11 +2919,6 @@ again:
 	em = lookup_extent_mapping(em_tree, logical, *length);
 	read_unlock(&em_tree->lock);
 
-	if (!em && unplug_page) {
-		kfree(multi);
-		return 0;
-	}
-
 	if (!em) {
 		printk(KERN_CRIT "unable to find logical %llu len %llu\n",
 		       (unsigned long long)logical,
@@ -3023,13 +2974,13 @@ again:
 		*length = em->len - offset;
 	}
 
-	if (!multi_ret && !unplug_page)
+	if (!multi_ret)
 		goto out;
 
 	num_stripes = 1;
 	stripe_index = 0;
 	if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
-		if (unplug_page || (rw & REQ_WRITE))
+		if (rw & REQ_WRITE)
 			num_stripes = map->num_stripes;
 		else if (mirror_num)
 			stripe_index = mirror_num - 1;
@@ -3051,7 +3002,7 @@ again:
 		stripe_index = do_div(stripe_nr, factor);
 		stripe_index *= map->sub_stripes;
 
-		if (unplug_page || (rw & REQ_WRITE))
+		if (rw & REQ_WRITE)
 			num_stripes = map->sub_stripes;
 		else if (mirror_num)
 			stripe_index += mirror_num - 1;
@@ -3071,22 +3022,10 @@ again:
 	BUG_ON(stripe_index >= map->num_stripes);
 
 	for (i = 0; i < num_stripes; i++) {
-		if (unplug_page) {
-			struct btrfs_device *device;
-			struct backing_dev_info *bdi;
-
-			device = map->stripes[stripe_index].dev;
-			if (device->bdev) {
-				bdi = blk_get_backing_dev_info(device->bdev);
-				if (bdi->unplug_io_fn)
-					bdi->unplug_io_fn(bdi, unplug_page);
-			}
-		} else {
-			multi->stripes[i].physical =
-				map->stripes[stripe_index].physical +
-				stripe_offset + stripe_nr * map->stripe_len;
-			multi->stripes[i].dev = map->stripes[stripe_index].dev;
-		}
+		multi->stripes[i].physical =
+			map->stripes[stripe_index].physical +
+			stripe_offset + stripe_nr * map->stripe_len;
+		multi->stripes[i].dev = map->stripes[stripe_index].dev;
 		stripe_index++;
 	}
 	if (multi_ret) {
@@ -3104,7 +3043,7 @@ int btrfs_map_block(struct btrfs_mapping_tree *map_tree, int rw,
 		      struct btrfs_multi_bio **multi_ret, int mirror_num)
 {
 	return __btrfs_map_block(map_tree, rw, logical, length, multi_ret,
-				 mirror_num, NULL);
+				 mirror_num);
 }
 
 int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
@@ -3172,14 +3111,6 @@ int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
 	return 0;
 }
 
-int btrfs_unplug_page(struct btrfs_mapping_tree *map_tree,
-		      u64 logical, struct page *page)
-{
-	u64 length = PAGE_CACHE_SIZE;
-	return __btrfs_map_block(map_tree, READ, logical, &length,
-				 NULL, 0, page);
-}
-
 static void end_bio_multi_stripe(struct bio *bio, int err)
 {
 	struct btrfs_multi_bio *multi = bio->bi_private;
diff --git a/fs/buffer.c b/fs/buffer.c
index 2219a76..f903f2e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -54,23 +54,15 @@ init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
 }
 EXPORT_SYMBOL(init_buffer);
 
-static int sync_buffer(void *word)
+static int sleep_on_buffer(void *word)
 {
-	struct block_device *bd;
-	struct buffer_head *bh
-		= container_of(word, struct buffer_head, b_state);
-
-	smp_mb();
-	bd = bh->b_bdev;
-	if (bd)
-		blk_run_address_space(bd->bd_inode->i_mapping);
 	io_schedule();
 	return 0;
 }
 
 void __lock_buffer(struct buffer_head *bh)
 {
-	wait_on_bit_lock(&bh->b_state, BH_Lock, sync_buffer,
+	wait_on_bit_lock(&bh->b_state, BH_Lock, sleep_on_buffer,
 							TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_buffer);
@@ -90,7 +82,7 @@ EXPORT_SYMBOL(unlock_buffer);
  */
 void __wait_on_buffer(struct buffer_head * bh)
 {
-	wait_on_bit(&bh->b_state, BH_Lock, sync_buffer, TASK_UNINTERRUPTIBLE);
+	wait_on_bit(&bh->b_state, BH_Lock, sleep_on_buffer, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__wait_on_buffer);
 
@@ -749,7 +741,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 {
 	struct buffer_head *bh;
 	struct list_head tmp;
-	struct address_space *mapping, *prev_mapping = NULL;
+	struct address_space *mapping;
 	int err = 0, err2;
 
 	INIT_LIST_HEAD(&tmp);
@@ -783,10 +775,6 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 				 * wait_on_buffer() will do that for us
 				 * through sync_buffer().
 				 */
-				if (prev_mapping && prev_mapping != mapping)
-					blk_run_address_space(prev_mapping);
-				prev_mapping = mapping;
-
 				brelse(bh);
 				spin_lock(lock);
 			}
@@ -3138,17 +3126,6 @@ out:
 }
 EXPORT_SYMBOL(try_to_free_buffers);
 
-void block_sync_page(struct page *page)
-{
-	struct address_space *mapping;
-
-	smp_mb();
-	mapping = page_mapping(page);
-	if (mapping)
-		blk_run_backing_dev(mapping->backing_dev_info, page);
-}
-EXPORT_SYMBOL(block_sync_page);
-
 /*
  * There are no bdflush tunables left.  But distributions are
  * still running obsolete flush daemons, so we terminate them here.
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index d843631..b6431b0 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1551,34 +1551,6 @@ int cifs_fsync(struct file *file, int datasync)
 	return rc;
 }
 
-/* static void cifs_sync_page(struct page *page)
-{
-	struct address_space *mapping;
-	struct inode *inode;
-	unsigned long index = page->index;
-	unsigned int rpages = 0;
-	int rc = 0;
-
-	cFYI(1, "sync page %p", page);
-	mapping = page->mapping;
-	if (!mapping)
-		return 0;
-	inode = mapping->host;
-	if (!inode)
-		return; */
-
-/*	fill in rpages then
-	result = cifs_pagein_inode(inode, index, rpages); */ /* BB finish */
-
-/*	cFYI(1, "rpages is %d for sync page of Index %ld", rpages, index);
-
-#if 0
-	if (rc < 0)
-		return rc;
-	return 0;
-#endif
-} */
-
 /*
  * As file closes, flush all cached write data for this inode checking
  * for write behind errors.
@@ -2232,7 +2204,6 @@ const struct address_space_operations cifs_addr_ops = {
 	.set_page_dirty = __set_page_dirty_nobuffers,
 	.releasepage = cifs_release_page,
 	.invalidatepage = cifs_invalidate_page,
-	/* .sync_page = cifs_sync_page, */
 	/* .direct_IO = */
 };
 
@@ -2250,6 +2221,5 @@ const struct address_space_operations cifs_addr_ops_smallbuf = {
 	.set_page_dirty = __set_page_dirty_nobuffers,
 	.releasepage = cifs_release_page,
 	.invalidatepage = cifs_invalidate_page,
-	/* .sync_page = cifs_sync_page, */
 	/* .direct_IO = */
 };
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 85882f6..0a9b085 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1106,11 +1106,8 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
 	    ((rw & READ) || (dio->result == dio->size)))
 		ret = -EIOCBQUEUED;
 
-	if (ret != -EIOCBQUEUED) {
-		/* All IO is now issued, send it on its way */
-		blk_run_address_space(inode->i_mapping);
+	if (ret != -EIOCBQUEUED)
 		dio_await_completion(dio);
-	}
 
 	/*
 	 * Sync will always be dropping the final ref and completing the
diff --git a/fs/efs/inode.c b/fs/efs/inode.c
index a8e7797..9c13412 100644
--- a/fs/efs/inode.c
+++ b/fs/efs/inode.c
@@ -23,7 +23,6 @@ static sector_t _efs_bmap(struct address_space *mapping, sector_t block)
 }
 static const struct address_space_operations efs_aops = {
 	.readpage = efs_readpage,
-	.sync_page = block_sync_page,
 	.bmap = _efs_bmap
 };
 
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 4268542..bd56fed 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -795,7 +795,6 @@ const struct address_space_operations exofs_aops = {
 	.direct_IO	= NULL, /* TODO: Should be trivial to do */
 
 	/* With these NULL has special meaning or default is not exported */
-	.sync_page	= NULL,
 	.get_xip_mem	= NULL,
 	.migratepage	= NULL,
 	.launder_page	= NULL,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 40ad210..c47f706 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -860,7 +860,6 @@ const struct address_space_operations ext2_aops = {
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
 	.writepage		= ext2_writepage,
-	.sync_page		= block_sync_page,
 	.write_begin		= ext2_write_begin,
 	.write_end		= ext2_write_end,
 	.bmap			= ext2_bmap,
@@ -880,7 +879,6 @@ const struct address_space_operations ext2_nobh_aops = {
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
 	.writepage		= ext2_nobh_writepage,
-	.sync_page		= block_sync_page,
 	.write_begin		= ext2_nobh_write_begin,
 	.write_end		= nobh_write_end,
 	.bmap			= ext2_bmap,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index ae94f6d..fe2541d 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1894,7 +1894,6 @@ static const struct address_space_operations ext3_ordered_aops = {
 	.readpage		= ext3_readpage,
 	.readpages		= ext3_readpages,
 	.writepage		= ext3_ordered_writepage,
-	.sync_page		= block_sync_page,
 	.write_begin		= ext3_write_begin,
 	.write_end		= ext3_ordered_write_end,
 	.bmap			= ext3_bmap,
@@ -1910,7 +1909,6 @@ static const struct address_space_operations ext3_writeback_aops = {
 	.readpage		= ext3_readpage,
 	.readpages		= ext3_readpages,
 	.writepage		= ext3_writeback_writepage,
-	.sync_page		= block_sync_page,
 	.write_begin		= ext3_write_begin,
 	.write_end		= ext3_writeback_write_end,
 	.bmap			= ext3_bmap,
@@ -1926,7 +1924,6 @@ static const struct address_space_operations ext3_journalled_aops = {
 	.readpage		= ext3_readpage,
 	.readpages		= ext3_readpages,
 	.writepage		= ext3_journalled_writepage,
-	.sync_page		= block_sync_page,
 	.write_begin		= ext3_write_begin,
 	.write_end		= ext3_journalled_write_end,
 	.set_page_dirty		= ext3_journalled_set_page_dirty,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f7f9e4..9297ad4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3903,7 +3903,6 @@ static const struct address_space_operations ext4_ordered_aops = {
 	.readpage		= ext4_readpage,
 	.readpages		= ext4_readpages,
 	.writepage		= ext4_writepage,
-	.sync_page		= block_sync_page,
 	.write_begin		= ext4_write_begin,
 	.write_end		= ext4_ordered_write_end,
 	.bmap			= ext4_bmap,
@@ -3919,7 +3918,6 @@ static const struct address_space_operations ext4_writeback_aops = {
 	.readpage		= ext4_readpage,
 	.readpages		= ext4_readpages,
 	.writepage		= ext4_writepage,
-	.sync_page		= block_sync_page,
 	.write_begin		= ext4_write_begin,
 	.write_end		= ext4_writeback_write_end,
 	.bmap			= ext4_bmap,
@@ -3935,7 +3933,6 @@ static const struct address_space_operations ext4_journalled_aops = {
 	.readpage		= ext4_readpage,
 	.readpages		= ext4_readpages,
 	.writepage		= ext4_writepage,
-	.sync_page		= block_sync_page,
 	.write_begin		= ext4_write_begin,
 	.write_end		= ext4_journalled_write_end,
 	.set_page_dirty		= ext4_journalled_set_page_dirty,
@@ -3951,7 +3948,6 @@ static const struct address_space_operations ext4_da_aops = {
 	.readpages		= ext4_readpages,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_da_writepages,
-	.sync_page		= block_sync_page,
 	.write_begin		= ext4_da_write_begin,
 	.write_end		= ext4_da_write_end,
 	.bmap			= ext4_bmap,
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 86753fe..f4ff09f 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -236,7 +236,6 @@ static const struct address_space_operations fat_aops = {
 	.readpages	= fat_readpages,
 	.writepage	= fat_writepage,
 	.writepages	= fat_writepages,
-	.sync_page	= block_sync_page,
 	.write_begin	= fat_write_begin,
 	.write_end	= fat_write_end,
 	.direct_IO	= fat_direct_IO,
diff --git a/fs/freevxfs/vxfs_subr.c b/fs/freevxfs/vxfs_subr.c
index 1429f3ae..5d318c4 100644
--- a/fs/freevxfs/vxfs_subr.c
+++ b/fs/freevxfs/vxfs_subr.c
@@ -44,7 +44,6 @@ static sector_t		vxfs_bmap(struct address_space *, sector_t);
 const struct address_space_operations vxfs_aops = {
 	.readpage =		vxfs_readpage,
 	.bmap =			vxfs_bmap,
-	.sync_page =		block_sync_page,
 };
 
 inline void
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 9e3f68c..09e8d51 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -868,7 +868,6 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
 
 	fc->bdi.name = "fuse";
 	fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
-	fc->bdi.unplug_io_fn = default_unplug_io_fn;
 	/* fuse does it's own writeback accounting */
 	fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB;
 
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 4f36f88..2f87ad2 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -1116,7 +1116,6 @@ static const struct address_space_operations gfs2_writeback_aops = {
 	.writepages = gfs2_writeback_writepages,
 	.readpage = gfs2_readpage,
 	.readpages = gfs2_readpages,
-	.sync_page = block_sync_page,
 	.write_begin = gfs2_write_begin,
 	.write_end = gfs2_write_end,
 	.bmap = gfs2_bmap,
@@ -1132,7 +1131,6 @@ static const struct address_space_operations gfs2_ordered_aops = {
 	.writepage = gfs2_ordered_writepage,
 	.readpage = gfs2_readpage,
 	.readpages = gfs2_readpages,
-	.sync_page = block_sync_page,
 	.write_begin = gfs2_write_begin,
 	.write_end = gfs2_write_end,
 	.set_page_dirty = gfs2_set_page_dirty,
@@ -1150,7 +1148,6 @@ static const struct address_space_operations gfs2_jdata_aops = {
 	.writepages = gfs2_jdata_writepages,
 	.readpage = gfs2_readpage,
 	.readpages = gfs2_readpages,
-	.sync_page = block_sync_page,
 	.write_begin = gfs2_write_begin,
 	.write_end = gfs2_write_end,
 	.set_page_dirty = gfs2_set_page_dirty,
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 939739c..a566331 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -94,7 +94,6 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
 const struct address_space_operations gfs2_meta_aops = {
 	.writepage = gfs2_aspace_writepage,
 	.releasepage = gfs2_releasepage,
-	.sync_page = block_sync_page,
 };
 
 /**
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index dffb4e9..fff16c9 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -150,7 +150,6 @@ static int hfs_writepages(struct address_space *mapping,
 const struct address_space_operations hfs_btree_aops = {
 	.readpage	= hfs_readpage,
 	.writepage	= hfs_writepage,
-	.sync_page	= block_sync_page,
 	.write_begin	= hfs_write_begin,
 	.write_end	= generic_write_end,
 	.bmap		= hfs_bmap,
@@ -160,7 +159,6 @@ const struct address_space_operations hfs_btree_aops = {
 const struct address_space_operations hfs_aops = {
 	.readpage	= hfs_readpage,
 	.writepage	= hfs_writepage,
-	.sync_page	= block_sync_page,
 	.write_begin	= hfs_write_begin,
 	.write_end	= generic_write_end,
 	.bmap		= hfs_bmap,
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index a8df651..b248a6c 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -146,7 +146,6 @@ static int hfsplus_writepages(struct address_space *mapping,
 const struct address_space_operations hfsplus_btree_aops = {
 	.readpage	= hfsplus_readpage,
 	.writepage	= hfsplus_writepage,
-	.sync_page	= block_sync_page,
 	.write_begin	= hfsplus_write_begin,
 	.write_end	= generic_write_end,
 	.bmap		= hfsplus_bmap,
@@ -156,7 +155,6 @@ const struct address_space_operations hfsplus_btree_aops = {
 const struct address_space_operations hfsplus_aops = {
 	.readpage	= hfsplus_readpage,
 	.writepage	= hfsplus_writepage,
-	.sync_page	= block_sync_page,
 	.write_begin	= hfsplus_write_begin,
 	.write_end	= generic_write_end,
 	.bmap		= hfsplus_bmap,
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index c034088..9e84257 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -120,7 +120,6 @@ static sector_t _hpfs_bmap(struct address_space *mapping, sector_t block)
 const struct address_space_operations hpfs_aops = {
 	.readpage = hpfs_readpage,
 	.writepage = hpfs_writepage,
-	.sync_page = block_sync_page,
 	.write_begin = hpfs_write_begin,
 	.write_end = generic_write_end,
 	.bmap = _hpfs_bmap
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index a0f3833..3db5ba4 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -1158,7 +1158,6 @@ static sector_t _isofs_bmap(struct address_space *mapping, sector_t block)
 
 static const struct address_space_operations isofs_aops = {
 	.readpage = isofs_readpage,
-	.sync_page = block_sync_page,
 	.bmap = _isofs_bmap
 };
 
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index 9978803..eddbb37 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -352,7 +352,6 @@ const struct address_space_operations jfs_aops = {
 	.readpages	= jfs_readpages,
 	.writepage	= jfs_writepage,
 	.writepages	= jfs_writepages,
-	.sync_page	= block_sync_page,
 	.write_begin	= jfs_write_begin,
 	.write_end	= nobh_write_end,
 	.bmap		= jfs_bmap,
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index 48b44bd..6740d34 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -583,7 +583,6 @@ static void metapage_invalidatepage(struct page *page, unsigned long offset)
 const struct address_space_operations jfs_metapage_aops = {
 	.readpage	= metapage_readpage,
 	.writepage	= metapage_writepage,
-	.sync_page	= block_sync_page,
 	.releasepage	= metapage_releasepage,
 	.invalidatepage	= metapage_invalidatepage,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
diff --git a/fs/logfs/dev_bdev.c b/fs/logfs/dev_bdev.c
index 723bc5b..1adc8d4 100644
--- a/fs/logfs/dev_bdev.c
+++ b/fs/logfs/dev_bdev.c
@@ -39,7 +39,6 @@ static int sync_request(struct page *page, struct block_device *bdev, int rw)
 	bio.bi_end_io = request_complete;
 
 	submit_bio(rw, &bio);
-	generic_unplug_device(bdev_get_queue(bdev));
 	wait_for_completion(&complete);
 	return test_bit(BIO_UPTODATE, &bio.bi_flags) ? 0 : -EIO;
 }
@@ -168,7 +167,6 @@ static void bdev_writeseg(struct super_block *sb, u64 ofs, size_t len)
 	}
 	len = PAGE_ALIGN(len);
 	__bdev_writeseg(sb, ofs, ofs >> PAGE_SHIFT, len >> PAGE_SHIFT);
-	generic_unplug_device(bdev_get_queue(logfs_super(sb)->s_bdev));
 }
 
 
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index ae0b83f..adcdc0a 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -399,7 +399,6 @@ static sector_t minix_bmap(struct address_space *mapping, sector_t block)
 static const struct address_space_operations minix_aops = {
 	.readpage = minix_readpage,
 	.writepage = minix_writepage,
-	.sync_page = block_sync_page,
 	.write_begin = minix_write_begin,
 	.write_end = generic_write_end,
 	.bmap = minix_bmap
diff --git a/fs/nilfs2/btnode.c b/fs/nilfs2/btnode.c
index 388e9e8..f4f1c08 100644
--- a/fs/nilfs2/btnode.c
+++ b/fs/nilfs2/btnode.c
@@ -40,14 +40,10 @@ void nilfs_btnode_cache_init_once(struct address_space *btnc)
 	nilfs_mapping_init_once(btnc);
 }
 
-static const struct address_space_operations def_btnode_aops = {
-	.sync_page		= block_sync_page,
-};
-
 void nilfs_btnode_cache_init(struct address_space *btnc,
 			     struct backing_dev_info *bdi)
 {
-	nilfs_mapping_init(btnc, bdi, &def_btnode_aops);
+	nilfs_mapping_init(btnc, bdi);
 }
 
 void nilfs_btnode_cache_clear(struct address_space *btnc)
diff --git a/fs/nilfs2/gcinode.c b/fs/nilfs2/gcinode.c
index caf9a6a..1c2a3e2 100644
--- a/fs/nilfs2/gcinode.c
+++ b/fs/nilfs2/gcinode.c
@@ -49,7 +49,6 @@
 #include "ifile.h"
 
 static const struct address_space_operations def_gcinode_aops = {
-	.sync_page		= block_sync_page,
 };
 
 /*
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 2fd440d..c89d5d1 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -262,7 +262,6 @@ nilfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 const struct address_space_operations nilfs_aops = {
 	.writepage		= nilfs_writepage,
 	.readpage		= nilfs_readpage,
-	.sync_page		= block_sync_page,
 	.writepages		= nilfs_writepages,
 	.set_page_dirty		= nilfs_set_page_dirty,
 	.readpages		= nilfs_readpages,
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index 6a0e2a1..3fdb61d 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -399,7 +399,6 @@ nilfs_mdt_write_page(struct page *page, struct writeback_control *wbc)
 
 static const struct address_space_operations def_mdt_aops = {
 	.writepage		= nilfs_mdt_write_page,
-	.sync_page		= block_sync_page,
 };
 
 static const struct inode_operations def_mdt_iops;
@@ -438,10 +437,6 @@ void nilfs_mdt_set_entry_size(struct inode *inode, unsigned entry_size,
 	mi->mi_first_entry_offset = DIV_ROUND_UP(header_size, entry_size);
 }
 
-static const struct address_space_operations shadow_map_aops = {
-	.sync_page		= block_sync_page,
-};
-
 /**
  * nilfs_mdt_setup_shadow_map - setup shadow map and bind it to metadata file
  * @inode: inode of the metadata file
@@ -455,9 +450,9 @@ int nilfs_mdt_setup_shadow_map(struct inode *inode,
 
 	INIT_LIST_HEAD(&shadow->frozen_buffers);
 	nilfs_mapping_init_once(&shadow->frozen_data);
-	nilfs_mapping_init(&shadow->frozen_data, bdi, &shadow_map_aops);
+	nilfs_mapping_init(&shadow->frozen_data, bdi);
 	nilfs_mapping_init_once(&shadow->frozen_btnodes);
-	nilfs_mapping_init(&shadow->frozen_btnodes, bdi, &shadow_map_aops);
+	nilfs_mapping_init(&shadow->frozen_btnodes, bdi);
 	mi->mi_shadow = shadow;
 	return 0;
 }
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 0c43241..c0d4381 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -505,16 +505,18 @@ void nilfs_mapping_init_once(struct address_space *mapping)
 	INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
 }
 
+static const struct address_space_operations def_btnode_aops = {
+};
+
 void nilfs_mapping_init(struct address_space *mapping,
-			struct backing_dev_info *bdi,
-			const struct address_space_operations *aops)
+			struct backing_dev_info *bdi)
 {
 	mapping->host = NULL;
 	mapping->flags = 0;
 	mapping_set_gfp_mask(mapping, GFP_NOFS);
 	mapping->assoc_mapping = NULL;
 	mapping->backing_dev_info = bdi;
-	mapping->a_ops = aops;
+	mapping->a_ops = &def_btnode_aops;
 }
 
 /*
diff --git a/fs/nilfs2/page.h b/fs/nilfs2/page.h
index 622df27..ba4d6fd 100644
--- a/fs/nilfs2/page.h
+++ b/fs/nilfs2/page.h
@@ -63,8 +63,7 @@ void nilfs_copy_back_pages(struct address_space *, struct address_space *);
 void nilfs_clear_dirty_pages(struct address_space *);
 void nilfs_mapping_init_once(struct address_space *mapping);
 void nilfs_mapping_init(struct address_space *mapping,
-			struct backing_dev_info *bdi,
-			const struct address_space_operations *aops);
+			struct backing_dev_info *bdi);
 unsigned nilfs_page_count_clean_buffers(struct page *, unsigned, unsigned);
 unsigned long nilfs_find_uncommitted_extent(struct inode *inode,
 					    sector_t start_blk,
diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c
index c3c2c7a..0b1e885b 100644
--- a/fs/ntfs/aops.c
+++ b/fs/ntfs/aops.c
@@ -1543,8 +1543,6 @@ err_out:
  */
 const struct address_space_operations ntfs_aops = {
 	.readpage	= ntfs_readpage,	/* Fill page with data. */
-	.sync_page	= block_sync_page,	/* Currently, just unplugs the
-						   disk request queue. */
 #ifdef NTFS_RW
 	.writepage	= ntfs_writepage,	/* Write dirty page to disk. */
 #endif /* NTFS_RW */
@@ -1560,8 +1558,6 @@ const struct address_space_operations ntfs_aops = {
  */
 const struct address_space_operations ntfs_mst_aops = {
 	.readpage	= ntfs_readpage,	/* Fill page with data. */
-	.sync_page	= block_sync_page,	/* Currently, just unplugs the
-						   disk request queue. */
 #ifdef NTFS_RW
 	.writepage	= ntfs_writepage,	/* Write dirty page to disk. */
 	.set_page_dirty	= __set_page_dirty_nobuffers,	/* Set the page dirty
diff --git a/fs/ntfs/compress.c b/fs/ntfs/compress.c
index 6551c7c..ef9ed85 100644
--- a/fs/ntfs/compress.c
+++ b/fs/ntfs/compress.c
@@ -698,8 +698,7 @@ lock_retry_remap:
 					"uptodate! Unplugging the disk queue "
 					"and rescheduling.");
 			get_bh(tbh);
-			blk_run_address_space(mapping);
-			schedule();
+			io_schedule();
 			put_bh(tbh);
 			if (unlikely(!buffer_uptodate(tbh)))
 				goto read_err;
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 1fbb0e2..daea035 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2043,7 +2043,6 @@ const struct address_space_operations ocfs2_aops = {
 	.write_begin		= ocfs2_write_begin,
 	.write_end		= ocfs2_write_end,
 	.bmap			= ocfs2_bmap,
-	.sync_page		= block_sync_page,
 	.direct_IO		= ocfs2_direct_IO,
 	.invalidatepage		= ocfs2_invalidatepage,
 	.releasepage		= ocfs2_releasepage,
diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index b108e86..1adab28 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -367,11 +367,7 @@ static inline void o2hb_bio_wait_dec(struct o2hb_bio_wait_ctxt *wc,
 static void o2hb_wait_on_io(struct o2hb_region *reg,
 			    struct o2hb_bio_wait_ctxt *wc)
 {
-	struct address_space *mapping = reg->hr_bdev->bd_inode->i_mapping;
-
-	blk_run_address_space(mapping);
 	o2hb_bio_wait_dec(wc, 1);
-
 	wait_for_completion(&wc->wc_io_complete);
 }
 
diff --git a/fs/omfs/file.c b/fs/omfs/file.c
index 8a6d34f..d738a7e 100644
--- a/fs/omfs/file.c
+++ b/fs/omfs/file.c
@@ -372,7 +372,6 @@ const struct address_space_operations omfs_aops = {
 	.readpages = omfs_readpages,
 	.writepage = omfs_writepage,
 	.writepages = omfs_writepages,
-	.sync_page = block_sync_page,
 	.write_begin = omfs_write_begin,
 	.write_end = generic_write_end,
 	.bmap = omfs_bmap,
diff --git a/fs/qnx4/inode.c b/fs/qnx4/inode.c
index e63b417..2b06466 100644
--- a/fs/qnx4/inode.c
+++ b/fs/qnx4/inode.c
@@ -335,7 +335,6 @@ static sector_t qnx4_bmap(struct address_space *mapping, sector_t block)
 static const struct address_space_operations qnx4_aops = {
 	.readpage	= qnx4_readpage,
 	.writepage	= qnx4_writepage,
-	.sync_page	= block_sync_page,
 	.write_begin	= qnx4_write_begin,
 	.write_end	= generic_write_end,
 	.bmap		= qnx4_bmap
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index 0bae036..0367467 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -3212,7 +3212,6 @@ const struct address_space_operations reiserfs_address_space_operations = {
 	.readpages = reiserfs_readpages,
 	.releasepage = reiserfs_releasepage,
 	.invalidatepage = reiserfs_invalidatepage,
-	.sync_page = block_sync_page,
 	.write_begin = reiserfs_write_begin,
 	.write_end = reiserfs_write_end,
 	.bmap = reiserfs_aop_bmap,
diff --git a/fs/sysv/itree.c b/fs/sysv/itree.c
index 9ca6627..fa8d43c 100644
--- a/fs/sysv/itree.c
+++ b/fs/sysv/itree.c
@@ -488,7 +488,6 @@ static sector_t sysv_bmap(struct address_space *mapping, sector_t block)
 const struct address_space_operations sysv_aops = {
 	.readpage = sysv_readpage,
 	.writepage = sysv_writepage,
-	.sync_page = block_sync_page,
 	.write_begin = sysv_write_begin,
 	.write_end = generic_write_end,
 	.bmap = sysv_bmap
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 6e11c29..81368d4 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1979,7 +1979,6 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
 	 */
 	c->bdi.name = "ubifs",
 	c->bdi.capabilities = BDI_CAP_MAP_COPY;
-	c->bdi.unplug_io_fn = default_unplug_io_fn;
 	err  = bdi_init(&c->bdi);
 	if (err)
 		goto out_close;
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 89c7848..94e4553 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -98,7 +98,6 @@ static int udf_adinicb_write_end(struct file *file,
 const struct address_space_operations udf_adinicb_aops = {
 	.readpage	= udf_adinicb_readpage,
 	.writepage	= udf_adinicb_writepage,
-	.sync_page	= block_sync_page,
 	.write_begin = simple_write_begin,
 	.write_end = udf_adinicb_write_end,
 };
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index c6a2e78..fa96fc0 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -133,7 +133,6 @@ static sector_t udf_bmap(struct address_space *mapping, sector_t block)
 const struct address_space_operations udf_aops = {
 	.readpage	= udf_readpage,
 	.writepage	= udf_writepage,
-	.sync_page	= block_sync_page,
 	.write_begin		= udf_write_begin,
 	.write_end		= generic_write_end,
 	.bmap		= udf_bmap,
diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
index 2b251f2..83b2844 100644
--- a/fs/ufs/inode.c
+++ b/fs/ufs/inode.c
@@ -588,7 +588,6 @@ static sector_t ufs_bmap(struct address_space *mapping, sector_t block)
 const struct address_space_operations ufs_aops = {
 	.readpage = ufs_readpage,
 	.writepage = ufs_writepage,
-	.sync_page = block_sync_page,
 	.write_begin = ufs_write_begin,
 	.write_end = generic_write_end,
 	.bmap = ufs_bmap
diff --git a/fs/ufs/truncate.c b/fs/ufs/truncate.c
index a58f915..ff0e792 100644
--- a/fs/ufs/truncate.c
+++ b/fs/ufs/truncate.c
@@ -481,7 +481,7 @@ int ufs_truncate(struct inode *inode, loff_t old_i_size)
 			break;
 		if (IS_SYNC(inode) && (inode->i_state & I_DIRTY))
 			ufs_sync_inode (inode);
-		blk_run_address_space(inode->i_mapping);
+		blk_flush_plug(current);
 		yield();
 	}
 
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index ec7bbb5..83c1c20 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -1495,7 +1495,6 @@ const struct address_space_operations xfs_address_space_operations = {
 	.readpages		= xfs_vm_readpages,
 	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
-	.sync_page		= block_sync_page,
 	.releasepage		= xfs_vm_releasepage,
 	.invalidatepage		= xfs_vm_invalidatepage,
 	.write_begin		= xfs_vm_write_begin,
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index ac1c7e8..4f8f53c 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -991,7 +991,7 @@ xfs_buf_lock(
 	if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
 		xfs_log_force(bp->b_target->bt_mount, 0);
 	if (atomic_read(&bp->b_io_remaining))
-		blk_run_address_space(bp->b_target->bt_mapping);
+		blk_flush_plug(current);
 	down(&bp->b_sema);
 	XB_SET_OWNER(bp);
 
@@ -1035,9 +1035,7 @@ xfs_buf_wait_unpin(
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		if (atomic_read(&bp->b_pin_count) == 0)
 			break;
-		if (atomic_read(&bp->b_io_remaining))
-			blk_run_address_space(bp->b_target->bt_mapping);
-		schedule();
+		io_schedule();
 	}
 	remove_wait_queue(&bp->b_waiters, &wait);
 	set_current_state(TASK_RUNNING);
@@ -1443,7 +1441,7 @@ xfs_buf_iowait(
 	trace_xfs_buf_iowait(bp, _RET_IP_);
 
 	if (atomic_read(&bp->b_io_remaining))
-		blk_run_address_space(bp->b_target->bt_mapping);
+		blk_flush_plug(current);
 	wait_for_completion(&bp->b_iowait);
 
 	trace_xfs_buf_iowait_done(bp, _RET_IP_);
@@ -1667,7 +1665,6 @@ xfs_mapping_buftarg(
 	struct inode		*inode;
 	struct address_space	*mapping;
 	static const struct address_space_operations mapping_aops = {
-		.sync_page = block_sync_page,
 		.migratepage = fail_migrate_page,
 	};
 
@@ -1948,7 +1945,7 @@ xfsbufd(
 			count++;
 		}
 		if (count)
-			blk_run_address_space(target->bt_mapping);
+			blk_flush_plug(current);
 
 	} while (!kthread_should_stop());
 
@@ -1996,7 +1993,7 @@ xfs_flush_buftarg(
 
 	if (wait) {
 		/* Expedite and wait for IO to complete. */
-		blk_run_address_space(target->bt_mapping);
+		blk_flush_plug(current);
 		while (!list_empty(&wait_list)) {
 			bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 4ce34fa..96f4094 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -66,8 +66,6 @@ struct backing_dev_info {
 	unsigned int capabilities; /* Device capabilities */
 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
 	void *congested_data;	/* Pointer to aux data for congested func */
-	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
-	void *unplug_io_data;
 
 	char *name;
 
@@ -251,7 +249,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 
 extern struct backing_dev_info default_backing_dev_info;
 extern struct backing_dev_info noop_backing_dev_info;
-void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page);
 
 int writeback_in_progress(struct backing_dev_info *bdi);
 
@@ -336,17 +333,4 @@ static inline int bdi_sched_wait(void *word)
 	return 0;
 }
 
-static inline void blk_run_backing_dev(struct backing_dev_info *bdi,
-				       struct page *page)
-{
-	if (bdi && bdi->unplug_io_fn)
-		bdi->unplug_io_fn(bdi, page);
-}
-
-static inline void blk_run_address_space(struct address_space *mapping)
-{
-	if (mapping)
-		blk_run_backing_dev(mapping->backing_dev_info, NULL);
-}
-
 #endif		/* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3d246a9..dfb6ffd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -190,7 +190,6 @@ typedef void (request_fn_proc) (struct request_queue *q);
 typedef int (make_request_fn) (struct request_queue *q, struct bio *bio);
 typedef int (prep_rq_fn) (struct request_queue *, struct request *);
 typedef void (unprep_rq_fn) (struct request_queue *, struct request *);
-typedef void (unplug_fn) (struct request_queue *);
 
 struct bio_vec;
 struct bvec_merge_data {
@@ -273,7 +272,6 @@ struct request_queue
 	make_request_fn		*make_request_fn;
 	prep_rq_fn		*prep_rq_fn;
 	unprep_rq_fn		*unprep_rq_fn;
-	unplug_fn		*unplug_fn;
 	merge_bvec_fn		*merge_bvec_fn;
 	softirq_done_fn		*softirq_done_fn;
 	rq_timed_out_fn		*rq_timed_out_fn;
@@ -287,14 +285,6 @@ struct request_queue
 	struct request		*boundary_rq;
 
 	/*
-	 * Auto-unplugging state
-	 */
-	struct timer_list	unplug_timer;
-	int			unplug_thresh;	/* After this many requests */
-	unsigned long		unplug_delay;	/* After this many jiffies */
-	struct work_struct	unplug_work;
-
-	/*
 	 * Delayed queue handling
 	 */
 	struct delayed_work	delay_work;
@@ -392,14 +382,13 @@ struct request_queue
 #define QUEUE_FLAG_ASYNCFULL	4	/* write queue has been filled */
 #define QUEUE_FLAG_DEAD		5	/* queue being torn down */
 #define QUEUE_FLAG_REENTER	6	/* Re-entrancy avoidance */
-#define QUEUE_FLAG_PLUGGED	7	/* queue is plugged */
-#define QUEUE_FLAG_ELVSWITCH	8	/* don't use elevator, just do FIFO */
-#define QUEUE_FLAG_BIDI		9	/* queue supports bidi requests */
-#define QUEUE_FLAG_NOMERGES    10	/* disable merge attempts */
-#define QUEUE_FLAG_SAME_COMP   11	/* force complete on same CPU */
-#define QUEUE_FLAG_FAIL_IO     12	/* fake timeout */
-#define QUEUE_FLAG_STACKABLE   13	/* supports request stacking */
-#define QUEUE_FLAG_NONROT      14	/* non-rotational device (SSD) */
+#define QUEUE_FLAG_ELVSWITCH	7	/* don't use elevator, just do FIFO */
+#define QUEUE_FLAG_BIDI		8	/* queue supports bidi requests */
+#define QUEUE_FLAG_NOMERGES     9	/* disable merge attempts */
+#define QUEUE_FLAG_SAME_COMP   10	/* force complete on same CPU */
+#define QUEUE_FLAG_FAIL_IO     11	/* fake timeout */
+#define QUEUE_FLAG_STACKABLE   12	/* supports request stacking */
+#define QUEUE_FLAG_NONROT      13	/* non-rotational device (SSD) */
 #define QUEUE_FLAG_VIRT        QUEUE_FLAG_NONROT /* paravirt device */
 #define QUEUE_FLAG_IO_STAT     15	/* do IO stats */
 #define QUEUE_FLAG_DISCARD     16	/* supports DISCARD */
@@ -477,7 +466,6 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 	__clear_bit(flag, &q->queue_flags);
 }
 
-#define blk_queue_plugged(q)	test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
 #define blk_queue_tagged(q)	test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags)
 #define blk_queue_stopped(q)	test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
 #define blk_queue_nomerges(q)	test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
@@ -672,9 +660,6 @@ extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 extern void blk_rq_unprep_clone(struct request *rq);
 extern int blk_insert_cloned_request(struct request_queue *q,
 				     struct request *rq);
-extern void blk_plug_device(struct request_queue *);
-extern void blk_plug_device_unlocked(struct request_queue *);
-extern int blk_remove_plug(struct request_queue *);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
 extern void blk_recount_segments(struct request_queue *, struct bio *);
 extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
@@ -719,7 +704,6 @@ extern int blk_execute_rq(struct request_queue *, struct gendisk *,
 			  struct request *, int);
 extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
 				  struct request *, int, rq_end_io_fn *);
-extern void blk_unplug(struct request_queue *q);
 
 static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
 {
@@ -856,7 +840,6 @@ extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bd
 
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
 extern void blk_dump_rq_flags(struct request *, char *);
-extern void generic_unplug_device(struct request_queue *);
 extern long nr_blockdev_pages(void);
 
 int blk_get_queue(struct request_queue *);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 68d1fe7..f5df235 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -219,7 +219,6 @@ int generic_cont_expand_simple(struct inode *inode, loff_t size);
 int block_commit_write(struct page *page, unsigned from, unsigned to);
 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 				get_block_t get_block);
-void block_sync_page(struct page *);
 sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
 int block_truncate_page(struct address_space *, loff_t, get_block_t *);
 int nobh_write_begin(struct address_space *, loff_t, unsigned, unsigned,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 272496d..e276883 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -286,11 +286,6 @@ void dm_table_add_target_callbacks(struct dm_table *t, struct dm_target_callback
 int dm_table_complete(struct dm_table *t);
 
 /*
- * Unplug all devices in a table.
- */
-void dm_table_unplug_all(struct dm_table *t);
-
-/*
  * Table reference counting.
  */
 struct dm_table *dm_get_live_table(struct mapped_device *md);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index ac2b7a0..82a563c 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -20,7 +20,6 @@ typedef void (elevator_bio_merged_fn) (struct request_queue *,
 typedef int (elevator_dispatch_fn) (struct request_queue *, int);
 
 typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
-typedef int (elevator_queue_empty_fn) (struct request_queue *);
 typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
 typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_may_queue_fn) (struct request_queue *, int);
@@ -46,7 +45,6 @@ struct elevator_ops
 	elevator_activate_req_fn *elevator_activate_req_fn;
 	elevator_deactivate_req_fn *elevator_deactivate_req_fn;
 
-	elevator_queue_empty_fn *elevator_queue_empty_fn;
 	elevator_completed_req_fn *elevator_completed_req_fn;
 
 	elevator_request_list_fn *elevator_former_req_fn;
@@ -101,8 +99,8 @@ struct elevator_queue
  */
 extern void elv_dispatch_sort(struct request_queue *, struct request *);
 extern void elv_dispatch_add_tail(struct request_queue *, struct request *);
-extern void elv_add_request(struct request_queue *, struct request *, int, int);
-extern void __elv_add_request(struct request_queue *, struct request *, int, int);
+extern void elv_add_request(struct request_queue *, struct request *, int);
+extern void __elv_add_request(struct request_queue *, struct request *, int);
 extern void elv_insert(struct request_queue *, struct request *, int);
 extern int elv_merge(struct request_queue *, struct request **, struct bio *);
 extern int elv_try_merge(struct request *, struct bio *);
@@ -112,7 +110,6 @@ extern void elv_merged_request(struct request_queue *, struct request *, int);
 extern void elv_bio_merged(struct request_queue *q, struct request *,
 				struct bio *);
 extern void elv_requeue_request(struct request_queue *, struct request *);
-extern int elv_queue_empty(struct request_queue *);
 extern struct request *elv_former_request(struct request_queue *, struct request *);
 extern struct request *elv_latter_request(struct request_queue *, struct request *);
 extern int elv_register_queue(struct request_queue *q);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 32b38cd..c53311c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -583,7 +583,6 @@ typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
-	void (*sync_page)(struct page *);
 
 	/* Write back some dirty pages from this mapping. */
 	int (*writepages)(struct address_space *, struct writeback_control *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9c66e99..e112b8d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -298,7 +298,6 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
 
 extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
-extern void __lock_page_nosync(struct page *page);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 				unsigned int flags);
 extern void unlock_page(struct page *page);
@@ -342,17 +341,6 @@ static inline int lock_page_killable(struct page *page)
 }
 
 /*
- * lock_page_nosync should only be used if we can't pin the page's inode.
- * Doesn't play quite so well with block device plugging.
- */
-static inline void lock_page_nosync(struct page *page)
-{
-	might_sleep();
-	if (!trylock_page(page))
-		__lock_page_nosync(page);
-}
-	
-/*
  * lock_page_or_retry - Lock the page, unless this would block and the
  * caller indicated that it can handle a retry.
  */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4d55932..9ee3218 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -299,8 +299,6 @@ extern void mem_cgroup_get_shmem_target(struct inode *inode, pgoff_t pgoff,
 					struct page **pagep, swp_entry_t *ent);
 #endif
 
-extern void swap_unplug_io_fn(struct backing_dev_info *, struct page *);
-
 #ifdef CONFIG_SWAP
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct page *);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 027100d..c91e139 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -14,17 +14,11 @@
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
-void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
-{
-}
-EXPORT_SYMBOL(default_unplug_io_fn);
-
 struct backing_dev_info default_backing_dev_info = {
 	.name		= "default",
 	.ra_pages	= VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
 	.state		= 0,
 	.capabilities	= BDI_CAP_MAP_COPY,
-	.unplug_io_fn	= default_unplug_io_fn,
 };
 EXPORT_SYMBOL_GPL(default_backing_dev_info);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 83a45d3..380776c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -155,45 +155,15 @@ void remove_from_page_cache(struct page *page)
 }
 EXPORT_SYMBOL(remove_from_page_cache);
 
-static int sync_page(void *word)
+static int sleep_on_page(void *word)
 {
-	struct address_space *mapping;
-	struct page *page;
-
-	page = container_of((unsigned long *)word, struct page, flags);
-
-	/*
-	 * page_mapping() is being called without PG_locked held.
-	 * Some knowledge of the state and use of the page is used to
-	 * reduce the requirements down to a memory barrier.
-	 * The danger here is of a stale page_mapping() return value
-	 * indicating a struct address_space different from the one it's
-	 * associated with when it is associated with one.
-	 * After smp_mb(), it's either the correct page_mapping() for
-	 * the page, or an old page_mapping() and the page's own
-	 * page_mapping() has gone NULL.
-	 * The ->sync_page() address_space operation must tolerate
-	 * page_mapping() going NULL. By an amazing coincidence,
-	 * this comes about because none of the users of the page
-	 * in the ->sync_page() methods make essential use of the
-	 * page_mapping(), merely passing the page down to the backing
-	 * device's unplug functions when it's non-NULL, which in turn
-	 * ignore it for all cases but swap, where only page_private(page) is
-	 * of interest. When page_mapping() does go NULL, the entire
-	 * call stack gracefully ignores the page and returns.
-	 * -- wli
-	 */
-	smp_mb();
-	mapping = page_mapping(page);
-	if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
-		mapping->a_ops->sync_page(page);
 	io_schedule();
 	return 0;
 }
 
-static int sync_page_killable(void *word)
+static int sleep_on_page_killable(void *word)
 {
-	sync_page(word);
+	sleep_on_page(word);
 	return fatal_signal_pending(current) ? -EINTR : 0;
 }
 
@@ -479,12 +449,6 @@ struct page *__page_cache_alloc(gfp_t gfp)
 EXPORT_SYMBOL(__page_cache_alloc);
 #endif
 
-static int __sleep_on_page_lock(void *word)
-{
-	io_schedule();
-	return 0;
-}
-
 /*
  * In order to wait for pages to become available there must be
  * waitqueues associated with pages. By using a hash table of
@@ -512,7 +476,7 @@ void wait_on_page_bit(struct page *page, int bit_nr)
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
 	if (test_bit(bit_nr, &page->flags))
-		__wait_on_bit(page_waitqueue(page), &wait, sync_page,
+		__wait_on_bit(page_waitqueue(page), &wait, sleep_on_page,
 							TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
@@ -576,17 +540,12 @@ EXPORT_SYMBOL(end_page_writeback);
 /**
  * __lock_page - get a lock on the page, assuming we need to sleep to get it
  * @page: the page to lock
- *
- * Ugly. Running sync_page() in state TASK_UNINTERRUPTIBLE is scary.  If some
- * random driver's requestfn sets TASK_RUNNING, we could busywait.  However
- * chances are that on the second loop, the block layer's plug list is empty,
- * so sync_page() will then return in state TASK_UNINTERRUPTIBLE.
  */
 void __lock_page(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
+	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
 							TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_page);
@@ -596,24 +555,10 @@ int __lock_page_killable(struct page *page)
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
 	return __wait_on_bit_lock(page_waitqueue(page), &wait,
-					sync_page_killable, TASK_KILLABLE);
+					sleep_on_page_killable, TASK_KILLABLE);
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
-/**
- * __lock_page_nosync - get a lock on the page, without calling sync_page()
- * @page: the page to lock
- *
- * Variant of lock_page that does not require the caller to hold a reference
- * on the page's mapping.
- */
-void __lock_page_nosync(struct page *page)
-{
-	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
-	__wait_on_bit_lock(page_waitqueue(page), &wait, __sleep_on_page_lock,
-							TASK_UNINTERRUPTIBLE);
-}
-
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 			 unsigned int flags)
 {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 548fbd7..9566685 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -995,7 +995,7 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 			 * Check "just unpoisoned", "filter hit", and
 			 * "race with other subpage."
 			 */
-			lock_page_nosync(hpage);
+			lock_page(hpage);
 			if (!PageHWPoison(hpage)
 			    || (hwpoison_filter(p) && TestClearPageHWPoison(p))
 			    || (p != hpage && TestSetPageHWPoison(hpage))) {
@@ -1042,7 +1042,7 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 	 * It's very difficult to mess with pages currently under IO
 	 * and in many cases impossible, so we just avoid it here.
 	 */
-	lock_page_nosync(hpage);
+	lock_page(hpage);
 
 	/*
 	 * unpoison always clear PG_hwpoison inside page lock
@@ -1185,7 +1185,7 @@ int unpoison_memory(unsigned long pfn)
 		return 0;
 	}
 
-	lock_page_nosync(page);
+	lock_page(page);
 	/*
 	 * This test is racy because PG_hwpoison is set outside of page lock.
 	 * That's acceptable because that won't trigger kernel panic. Instead,
diff --git a/mm/nommu.c b/mm/nommu.c
index f59e142..fb6cbd6 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1842,10 +1842,6 @@ int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
 }
 EXPORT_SYMBOL(remap_vmalloc_range);
 
-void swap_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
-{
-}
-
 unsigned long arch_get_unmapped_area(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long pgoff, unsigned long flags)
 {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2cb01f6..cc0ede1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1239,7 +1239,7 @@ int set_page_dirty_lock(struct page *page)
 {
 	int ret;
 
-	lock_page_nosync(page);
+	lock_page(page);
 	ret = set_page_dirty(page);
 	unlock_page(page);
 	return ret;
diff --git a/mm/readahead.c b/mm/readahead.c
index 77506a2..cbddc3e 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -554,17 +554,5 @@ page_cache_async_readahead(struct address_space *mapping,
 
 	/* do read-ahead */
 	ondemand_readahead(mapping, ra, filp, true, offset, req_size);
-
-#ifdef CONFIG_BLOCK
-	/*
-	 * Normally the current page is !uptodate and lock_page() will be
-	 * immediately called to implicitly unplug the device. However this
-	 * is not always true for RAID conifgurations, where data arrives
-	 * not strictly in their submission order. In this case we need to
-	 * explicitly kick off the IO.
-	 */
-	if (PageUptodate(page))
-		blk_run_backing_dev(mapping->backing_dev_info, NULL);
-#endif
 }
 EXPORT_SYMBOL_GPL(page_cache_async_readahead);
diff --git a/mm/shmem.c b/mm/shmem.c
index 5ee67c9..24d23f5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -224,7 +224,6 @@ static const struct vm_operations_struct shmem_vm_ops;
 static struct backing_dev_info shmem_backing_dev_info  __read_mostly = {
 	.ra_pages	= 0,	/* No readahead */
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED,
-	.unplug_io_fn	= default_unplug_io_fn,
 };
 
 static LIST_HEAD(shmem_swaplist);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 5c8cfab..4668046 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -24,12 +24,10 @@
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
- * vmscan's shrink_page_list, to make sync_page look nicer, and to allow
- * future use of radix_tree tags in the swap cache.
+ * vmscan's shrink_page_list.
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
-	.sync_page	= block_sync_page,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.migratepage	= migrate_page,
 };
@@ -37,7 +35,6 @@ static const struct address_space_operations swap_aops = {
 static struct backing_dev_info swap_backing_dev_info = {
 	.name		= "swap",
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED,
-	.unplug_io_fn	= swap_unplug_io_fn,
 };
 
 struct address_space swapper_space = {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 07a458d..7ceea78 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -95,39 +95,6 @@ __try_to_reclaim_swap(struct swap_info_struct *si, unsigned long offset)
 }
 
 /*
- * We need this because the bdev->unplug_fn can sleep and we cannot
- * hold swap_lock while calling the unplug_fn. And swap_lock
- * cannot be turned into a mutex.
- */
-static DECLARE_RWSEM(swap_unplug_sem);
-
-void swap_unplug_io_fn(struct backing_dev_info *unused_bdi, struct page *page)
-{
-	swp_entry_t entry;
-
-	down_read(&swap_unplug_sem);
-	entry.val = page_private(page);
-	if (PageSwapCache(page)) {
-		struct block_device *bdev = swap_info[swp_type(entry)]->bdev;
-		struct backing_dev_info *bdi;
-
-		/*
-		 * If the page is removed from swapcache from under us (with a
-		 * racy try_to_unuse/swapoff) we need an additional reference
-		 * count to avoid reading garbage from page_private(page) above.
-		 * If the WARN_ON triggers during a swapoff it maybe the race
-		 * condition and it's harmless. However if it triggers without
-		 * swapoff it signals a problem.
-		 */
-		WARN_ON(page_count(page) <= 1);
-
-		bdi = bdev->bd_inode->i_mapping->backing_dev_info;
-		blk_run_backing_dev(bdi, page);
-	}
-	up_read(&swap_unplug_sem);
-}
-
-/*
  * swapon tell device that all the old swap contents can be discarded,
  * to allow the swap device to optimize its wear-levelling.
  */
@@ -1643,10 +1610,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		goto out_dput;
 	}
 
-	/* wait for any unplug function to finish */
-	down_write(&swap_unplug_sem);
-	up_write(&swap_unplug_sem);
-
 	destroy_swap_extents(p);
 	if (p->flags & SWP_CONTINUED)
 		free_swap_count_continuations(p);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 47a5096..e204456 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -359,7 +359,7 @@ static int may_write_to_queue(struct backing_dev_info *bdi,
 static void handle_write_error(struct address_space *mapping,
 				struct page *page, int error)
 {
-	lock_page_nosync(page);
+	lock_page(page);
 	if (page_mapping(page) == mapping)
 		mapping_set_error(mapping, error);
 	unlock_page(page);
-- 
1.7.3.2.146.gca209


^ permalink raw reply	[flat|nested] 152+ messages in thread

* [PATCH 06/10] block: kill request allocation batching
  2011-01-22  1:17 [PATCH 0/10] On-stack explicit block queue plugging Jens Axboe
                   ` (4 preceding siblings ...)
  2011-01-22  1:17 ` [PATCH 05/10] block: remove per-queue plugging Jens Axboe
@ 2011-01-22  1:17 ` Jens Axboe
  2011-01-22  9:31   ` Christoph Hellwig
  2011-01-22  1:17 ` [PATCH 07/10] fs: make generic file read/write functions plug Jens Axboe
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-01-22  1:17 UTC (permalink / raw)
  To: jaxboe, linux-kernel; +Cc: hch

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 block/blk-core.c       |   81 ++---------------------------------------------
 block/blk-settings.c   |    1 -
 block/blk.h            |    6 ---
 include/linux/blkdev.h |    1 -
 4 files changed, 4 insertions(+), 85 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7ab6620..54b5987 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -593,40 +593,6 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
 	return rq;
 }
 
-/*
- * ioc_batching returns true if the ioc is a valid batching request and
- * should be given priority access to a request.
- */
-static inline int ioc_batching(struct request_queue *q, struct io_context *ioc)
-{
-	if (!ioc)
-		return 0;
-
-	/*
-	 * Make sure the process is able to allocate at least 1 request
-	 * even if the batch times out, otherwise we could theoretically
-	 * lose wakeups.
-	 */
-	return ioc->nr_batch_requests == q->nr_batching ||
-		(ioc->nr_batch_requests > 0
-		&& time_before(jiffies, ioc->last_waited + BLK_BATCH_TIME));
-}
-
-/*
- * ioc_set_batching sets ioc to be a new "batcher" if it is not one. This
- * will cause the process to be a "batcher" on all queues in the system. This
- * is the behaviour we want though - once it gets a wakeup it should be given
- * a nice run.
- */
-static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
-{
-	if (!ioc || ioc_batching(q, ioc))
-		return;
-
-	ioc->nr_batch_requests = q->nr_batching;
-	ioc->last_waited = jiffies;
-}
-
 static void __freed_request(struct request_queue *q, int sync)
 {
 	struct request_list *rl = &q->rq;
@@ -670,7 +636,6 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 {
 	struct request *rq = NULL;
 	struct request_list *rl = &q->rq;
-	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
 
@@ -679,30 +644,11 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		goto rq_starved;
 
 	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
-		if (rl->count[is_sync]+1 >= q->nr_requests) {
-			ioc = current_io_context(GFP_ATOMIC, q->node);
-			/*
-			 * The queue will fill after this allocation, so set
-			 * it as full, and mark this process as "batching".
-			 * This process will be allowed to complete a batch of
-			 * requests, others will be blocked.
-			 */
-			if (!blk_queue_full(q, is_sync)) {
-				ioc_set_batching(q, ioc);
-				blk_set_queue_full(q, is_sync);
-			} else {
-				if (may_queue != ELV_MQUEUE_MUST
-						&& !ioc_batching(q, ioc)) {
-					/*
-					 * The queue is full and the allocating
-					 * process is not a "batcher", and not
-					 * exempted by the IO scheduler
-					 */
-					goto out;
-				}
-			}
-		}
 		blk_set_queue_congested(q, is_sync);
+
+		if (rl->count[is_sync]+1 >= q->nr_requests)
+			if (may_queue != ELV_MQUEUE_MUST)
+				goto out;
 	}
 
 	/*
@@ -750,15 +696,6 @@ rq_starved:
 		goto out;
 	}
 
-	/*
-	 * ioc may be NULL here, and ioc_batching will be false. That's
-	 * OK, if the queue is under the request limit then requests need
-	 * not count toward the nr_batch_requests limit. There will always
-	 * be some limit enforced by BLK_BATCH_TIME.
-	 */
-	if (ioc_batching(q, ioc))
-		ioc->nr_batch_requests--;
-
 	trace_block_getrq(q, bio, rw_flags & 1);
 out:
 	return rq;
@@ -779,7 +716,6 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 	rq = get_request(q, rw_flags, bio, GFP_NOIO);
 	while (!rq) {
 		DEFINE_WAIT(wait);
-		struct io_context *ioc;
 		struct request_list *rl = &q->rq;
 
 		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
@@ -790,15 +726,6 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 		spin_unlock_irq(q->queue_lock);
 		io_schedule();
 
-		/*
-		 * After sleeping, we become a "batching" process and
-		 * will be able to allocate at least one request, and
-		 * up to a big batch of them for a small period time.
-		 * See ioc_batching, ioc_set_batching
-		 */
-		ioc = current_io_context(GFP_NOIO, q->node);
-		ioc_set_batching(q, ioc);
-
 		spin_lock_irq(q->queue_lock);
 		finish_wait(&rl->wait[is_sync], &wait);
 
diff --git a/block/blk-settings.c b/block/blk-settings.c
index c8d6892..35420f2 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -162,7 +162,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	q->make_request_fn = mfn;
 	blk_queue_dma_alignment(q, 511);
 	blk_queue_congestion_threshold(q);
-	q->nr_batching = BLK_BATCH_REQ;
 
 	blk_set_default_limits(&q->limits);
 	blk_queue_max_hw_sectors(q, BLK_SAFE_MAX_SECTORS);
diff --git a/block/blk.h b/block/blk.h
index 2c3d2e7..6db2e75 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -1,12 +1,6 @@
 #ifndef BLK_INTERNAL_H
 #define BLK_INTERNAL_H
 
-/* Amount of time in which a process may batch requests */
-#define BLK_BATCH_TIME	(HZ/50UL)
-
-/* Number of requests a "batching" process may submit */
-#define BLK_BATCH_REQ	32
-
 extern struct kmem_cache *blk_requestq_cachep;
 extern struct kobj_type blk_queue_ktype;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index dfb6ffd..ebaf9d6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -326,7 +326,6 @@ struct request_queue
 	unsigned long		nr_requests;	/* Max # of requests */
 	unsigned int		nr_congestion_on;
 	unsigned int		nr_congestion_off;
-	unsigned int		nr_batching;
 
 	void			*dma_drain_buffer;
 	unsigned int		dma_drain_size;
-- 
1.7.3.2.146.gca209


^ permalink raw reply	[flat|nested] 152+ messages in thread

* [PATCH 07/10] fs: make generic file read/write functions plug
  2011-01-22  1:17 [PATCH 0/10] On-stack explicit block queue plugging Jens Axboe
                   ` (5 preceding siblings ...)
  2011-01-22  1:17 ` [PATCH 06/10] block: kill request allocation batching Jens Axboe
@ 2011-01-22  1:17 ` Jens Axboe
  2011-01-24  3:57   ` Dave Chinner
  2011-03-04  4:09   ` Vivek Goyal
  2011-01-22  1:17 ` [PATCH 08/10] read-ahead: use plugging Jens Axboe
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 152+ messages in thread
From: Jens Axboe @ 2011-01-22  1:17 UTC (permalink / raw)
  To: jaxboe, linux-kernel; +Cc: hch

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 mm/filemap.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 380776c..f9a29c8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1243,12 +1243,15 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 	unsigned long seg = 0;
 	size_t count;
 	loff_t *ppos = &iocb->ki_pos;
+	struct blk_plug plug;
 
 	count = 0;
 	retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
 	if (retval)
 		return retval;
 
+	blk_start_plug(&plug);
+
 	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
 	if (filp->f_flags & O_DIRECT) {
 		loff_t size;
@@ -1321,6 +1324,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 			break;
 	}
 out:
+	blk_finish_plug(&plug);
 	return retval;
 }
 EXPORT_SYMBOL(generic_file_aio_read);
@@ -2432,11 +2436,13 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
+	struct blk_plug plug;
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
 
 	mutex_lock(&inode->i_mutex);
+	blk_start_plug(&plug);
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
@@ -2447,6 +2453,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 		if (err < 0 && ret > 0)
 			ret = err;
 	}
+	blk_finish_plug(&plug);
 	return ret;
 }
 EXPORT_SYMBOL(generic_file_aio_write);
-- 
1.7.3.2.146.gca209


^ permalink raw reply	[flat|nested] 152+ messages in thread
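
The shape these patches introduce — open a plug on the stack, queue IO against it, flush everything in one batch at the end — can be sketched in user space. All names below (struct plug, plug_start(), plug_add(), plug_finish()) are invented stand-ins for struct blk_plug, blk_start_plug() and blk_finish_plug(); this only illustrates the batching idea, not the kernel implementation:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical user-space analogue of on-stack plugging: requests
 * accumulate in a structure that lives on the caller's stack, and
 * the whole batch is "submitted" once when the plug is finished.
 */
#define PLUG_MAX 16

struct plug {
	int reqs[PLUG_MAX];	/* queued requests (sectors, here) */
	int nr;
};

static void plug_start(struct plug *p)
{
	p->nr = 0;
}

static void plug_add(struct plug *p, int sector)
{
	/* Queue locally instead of taking the (contended) device
	 * queue lock once per request. */
	if (p->nr < PLUG_MAX)
		p->reqs[p->nr++] = sector;
}

/* Push the whole batch to the "device" at once; returns batch size. */
static int plug_finish(struct plug *p)
{
	int i, submitted = p->nr;

	for (i = 0; i < p->nr; i++)
		printf("submit sector %d\n", p->reqs[i]);
	p->nr = 0;
	return submitted;
}
```

The point of the on-stack structure is that per-request work stays private to the submitter; the shared device queue is touched once per batch rather than once per request.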

* [PATCH 08/10] read-ahead: use plugging
  2011-01-22  1:17 [PATCH 0/10] On-stack explicit block queue plugging Jens Axboe
                   ` (6 preceding siblings ...)
  2011-01-22  1:17 ` [PATCH 07/10] fs: make generic file read/write functions plug Jens Axboe
@ 2011-01-22  1:17 ` Jens Axboe
  2011-01-22  1:17 ` [PATCH 09/10] fs: make mpage read/write_pages() plug Jens Axboe
  2011-01-22  1:17 ` [PATCH 10/10] fs: make aio plug Jens Axboe
  9 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-01-22  1:17 UTC (permalink / raw)
  To: jaxboe, linux-kernel; +Cc: hch

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 mm/readahead.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index cbddc3e..2c0cc48 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -109,9 +109,12 @@ EXPORT_SYMBOL(read_cache_pages);
 static int read_pages(struct address_space *mapping, struct file *filp,
 		struct list_head *pages, unsigned nr_pages)
 {
+	struct blk_plug plug;
 	unsigned page_idx;
 	int ret;
 
+	blk_start_plug(&plug);
+
 	if (mapping->a_ops->readpages) {
 		ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
 		/* Clean up the remaining pages */
@@ -129,7 +132,10 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 		page_cache_release(page);
 	}
 	ret = 0;
+
 out:
+	blk_finish_plug(&plug);
+
 	return ret;
 }
 
-- 
1.7.3.2.146.gca209


^ permalink raw reply	[flat|nested] 152+ messages in thread

* [PATCH 09/10] fs: make mpage read/write_pages() plug
  2011-01-22  1:17 [PATCH 0/10] On-stack explicit block queue plugging Jens Axboe
                   ` (7 preceding siblings ...)
  2011-01-22  1:17 ` [PATCH 08/10] read-ahead: use plugging Jens Axboe
@ 2011-01-22  1:17 ` Jens Axboe
  2011-01-22  1:17 ` [PATCH 10/10] fs: make aio plug Jens Axboe
  9 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-01-22  1:17 UTC (permalink / raw)
  To: jaxboe, linux-kernel; +Cc: hch

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 fs/mpage.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index d78455a..0afc809 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -364,6 +364,9 @@ mpage_readpages(struct address_space *mapping, struct list_head *pages,
 	sector_t last_block_in_bio = 0;
 	struct buffer_head map_bh;
 	unsigned long first_logical_block = 0;
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);
 
 	map_bh.b_state = 0;
 	map_bh.b_size = 0;
@@ -385,6 +388,7 @@ mpage_readpages(struct address_space *mapping, struct list_head *pages,
 	BUG_ON(!list_empty(pages));
 	if (bio)
 		mpage_bio_submit(READ, bio);
+	blk_finish_plug(&plug);
 	return 0;
 }
 EXPORT_SYMBOL(mpage_readpages);
@@ -666,8 +670,11 @@ int
 mpage_writepages(struct address_space *mapping,
 		struct writeback_control *wbc, get_block_t get_block)
 {
+	struct blk_plug plug;
 	int ret;
 
+	blk_start_plug(&plug);
+
 	if (!get_block)
 		ret = generic_writepages(mapping, wbc);
 	else {
@@ -682,6 +689,7 @@ mpage_writepages(struct address_space *mapping,
 		if (mpd.bio)
 			mpage_bio_submit(WRITE, mpd.bio);
 	}
+	blk_finish_plug(&plug);
 	return ret;
 }
 EXPORT_SYMBOL(mpage_writepages);
-- 
1.7.3.2.146.gca209


^ permalink raw reply	[flat|nested] 152+ messages in thread

* [PATCH 10/10] fs: make aio plug
  2011-01-22  1:17 [PATCH 0/10] On-stack explicit block queue plugging Jens Axboe
                   ` (8 preceding siblings ...)
  2011-01-22  1:17 ` [PATCH 09/10] fs: make mpage read/write_pages() plug Jens Axboe
@ 2011-01-22  1:17 ` Jens Axboe
  2011-01-24 17:59   ` Jeff Moyer
  9 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-01-22  1:17 UTC (permalink / raw)
  To: jaxboe, linux-kernel; +Cc: hch, Shaohua Li

From: Shaohua Li <shaohua.li@intel.com>

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 fs/aio.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index c5ea494..1476bed 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1660,6 +1660,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
 	long ret = 0;
 	int i;
 	struct hlist_head batch_hash[AIO_BATCH_HASH_SIZE] = { { 0, }, };
+	struct blk_plug plug;
 
 	if (unlikely(nr < 0))
 		return -EINVAL;
@@ -1676,6 +1677,8 @@ long do_io_submit(aio_context_t ctx_id, long nr,
 		return -EINVAL;
 	}
 
+	blk_start_plug(&plug);
+
 	/*
 	 * AKPM: should this return a partial result if some of the IOs were
 	 * successfully submitted?
@@ -1698,6 +1701,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
 		if (ret)
 			break;
 	}
+	blk_finish_plug(&plug);
 	aio_batch_free(batch_hash);
 
 	put_ioctx(ctx);
-- 
1.7.3.2.146.gca209


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 02/10] ide-cd: convert to blk_delay_queue() for a short pause
  2011-01-22  1:17 ` [PATCH 02/10] ide-cd: convert to blk_delay_queue() for a short pause Jens Axboe
@ 2011-01-22  1:19   ` David Miller
  0 siblings, 0 replies; 152+ messages in thread
From: David Miller @ 2011-01-22  1:19 UTC (permalink / raw)
  To: jaxboe; +Cc: linux-kernel, hch

From: Jens Axboe <jaxboe@fusionio.com>
Date: Sat, 22 Jan 2011 01:17:21 +0000

> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-01-22  1:17 ` [PATCH 05/10] block: remove per-queue plugging Jens Axboe
@ 2011-01-22  1:31   ` Nick Piggin
  2011-03-03 21:23   ` Mike Snitzer
  2011-03-04  4:00   ` Vivek Goyal
  2 siblings, 0 replies; 152+ messages in thread
From: Nick Piggin @ 2011-01-22  1:31 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch

On Sat, Jan 22, 2011 at 12:17 PM, Jens Axboe <jaxboe@fusionio.com> wrote:

> -/**
> - * __lock_page_nosync - get a lock on the page, without calling sync_page()
> - * @page: the page to lock
> - *
> - * Variant of lock_page that does not require the caller to hold a reference
> - * on the page's mapping.
> - */
> -void __lock_page_nosync(struct page *page)
> -{
> -       DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
> -       __wait_on_bit_lock(page_waitqueue(page), &wait, __sleep_on_page_lock,
> -                                                       TASK_UNINTERRUPTIBLE);
> -}

RIP to this guy, won't be missed :)

Nice work.

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 06/10] block: kill request allocation batching
  2011-01-22  1:17 ` [PATCH 06/10] block: kill request allocation batching Jens Axboe
@ 2011-01-22  9:31   ` Christoph Hellwig
  2011-01-24 19:09     ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Christoph Hellwig @ 2011-01-22  9:31 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch

On Sat, Jan 22, 2011 at 01:17:25AM +0000, Jens Axboe wrote:
> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

A little more explanation of this surely wouldn't hurt.


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 07/10] fs: make generic file read/write functions plug
  2011-01-22  1:17 ` [PATCH 07/10] fs: make generic file read/write functions plug Jens Axboe
@ 2011-01-24  3:57   ` Dave Chinner
  2011-01-24 19:11     ` Jens Axboe
  2011-03-04  4:09   ` Vivek Goyal
  1 sibling, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2011-01-24  3:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch

On Sat, Jan 22, 2011 at 01:17:26AM +0000, Jens Axboe wrote:
> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> ---
>  mm/filemap.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
.....
> @@ -2432,11 +2436,13 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file->f_mapping->host;
> +	struct blk_plug plug;
>  	ssize_t ret;
>  
>  	BUG_ON(iocb->ki_pos != pos);
>  
>  	mutex_lock(&inode->i_mutex);
> +	blk_start_plug(&plug);
>  	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
>  	mutex_unlock(&inode->i_mutex);
>  
> @@ -2447,6 +2453,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
>  		if (err < 0 && ret > 0)
>  			ret = err;
>  	}
> +	blk_finish_plug(&plug);
>  	return ret;
>  }
>  EXPORT_SYMBOL(generic_file_aio_write);

Why do you want to plug all writes? For non-synchronous buffered
writes we won't be doing any IO, so why would we want to plug and
unplug in that case? Shouldn't the plug/unplug be placed in
.writepage for the buffered writeback case (which would handle sync
writes, too)?

Also, what is the impact of not plugging here? You change
generic_file_aio_write, but filesystems like XFS supply their own
.aio_write method and hence would need some kind of change, too?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 10/10] fs: make aio plug
  2011-01-22  1:17 ` [PATCH 10/10] fs: make aio plug Jens Axboe
@ 2011-01-24 17:59   ` Jeff Moyer
  2011-01-24 19:09     ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Jeff Moyer @ 2011-01-24 17:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch, Shaohua Li

Jens Axboe <jaxboe@fusionio.com> writes:

> From: Shaohua Li <shaohua.li@intel.com>
>
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> ---
>  fs/aio.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index c5ea494..1476bed 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1660,6 +1660,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>  	long ret = 0;
>  	int i;
>  	struct hlist_head batch_hash[AIO_BATCH_HASH_SIZE] = { { 0, }, };
> +	struct blk_plug plug;
>  
>  	if (unlikely(nr < 0))
>  		return -EINVAL;
> @@ -1676,6 +1677,8 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>  		return -EINVAL;
>  	}
>  
> +	blk_start_plug(&plug);
> +
>  	/*
>  	 * AKPM: should this return a partial result if some of the IOs were
>  	 * successfully submitted?
> @@ -1698,6 +1701,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>  		if (ret)
>  			break;
>  	}
> +	blk_finish_plug(&plug);
>  	aio_batch_free(batch_hash);

I'm pretty sure you want blk_finish_plug to run after aio_batch_free.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 10/10] fs: make aio plug
  2011-01-24 17:59   ` Jeff Moyer
@ 2011-01-24 19:09     ` Jens Axboe
  2011-01-24 19:15       ` Jeff Moyer
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-01-24 19:09 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, hch, Shaohua Li

On 2011-01-24 18:59, Jeff Moyer wrote:
> Jens Axboe <jaxboe@fusionio.com> writes:
> 
>> From: Shaohua Li <shaohua.li@intel.com>
>>
>> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
>> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
>> ---
>>  fs/aio.c |    4 ++++
>>  1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/aio.c b/fs/aio.c
>> index c5ea494..1476bed 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -1660,6 +1660,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>>  	long ret = 0;
>>  	int i;
>>  	struct hlist_head batch_hash[AIO_BATCH_HASH_SIZE] = { { 0, }, };
>> +	struct blk_plug plug;
>>  
>>  	if (unlikely(nr < 0))
>>  		return -EINVAL;
>> @@ -1676,6 +1677,8 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>>  		return -EINVAL;
>>  	}
>>  
>> +	blk_start_plug(&plug);
>> +
>>  	/*
>>  	 * AKPM: should this return a partial result if some of the IOs were
>>  	 * successfully submitted?
>> @@ -1698,6 +1701,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>>  		if (ret)
>>  			break;
>>  	}
>> +	blk_finish_plug(&plug);
>>  	aio_batch_free(batch_hash);
> 
> I'm pretty sure you want blk_finish_plug to run after aio_batch_free.

You mean to cover the iput()? That's not a bad idea, if that ends up
writing it out. I did a few read tests here and confirmed it's sending
down batches of whatever number is submitted.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 06/10] block: kill request allocation batching
  2011-01-22  9:31   ` Christoph Hellwig
@ 2011-01-24 19:09     ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-01-24 19:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel

On 2011-01-22 10:31, Christoph Hellwig wrote:
> On Sat, Jan 22, 2011 at 01:17:25AM +0000, Jens Axboe wrote:
>> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> 
> A little more explanation of this surely wouldn't hurt.

That goes for pretty much all of the patches in this series. I'll update
them and add a proper description.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 07/10] fs: make generic file read/write functions plug
  2011-01-24  3:57   ` Dave Chinner
@ 2011-01-24 19:11     ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-01-24 19:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, hch

On 2011-01-24 04:57, Dave Chinner wrote:
> On Sat, Jan 22, 2011 at 01:17:26AM +0000, Jens Axboe wrote:
>> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
>> ---
>>  mm/filemap.c |    7 +++++++
>>  1 files changed, 7 insertions(+), 0 deletions(-)
>>
> .....
>> @@ -2432,11 +2436,13 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
>>  {
>>  	struct file *file = iocb->ki_filp;
>>  	struct inode *inode = file->f_mapping->host;
>> +	struct blk_plug plug;
>>  	ssize_t ret;
>>  
>>  	BUG_ON(iocb->ki_pos != pos);
>>  
>>  	mutex_lock(&inode->i_mutex);
>> +	blk_start_plug(&plug);
>>  	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
>>  	mutex_unlock(&inode->i_mutex);
>>  
>> @@ -2447,6 +2453,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
>>  		if (err < 0 && ret > 0)
>>  			ret = err;
>>  	}
>> +	blk_finish_plug(&plug);
>>  	return ret;
>>  }
>>  EXPORT_SYMBOL(generic_file_aio_write);
> 
> Why do you want to plug all writes? For non-synchronous buffered
> writes we won't be doing any IO, so why woul dwe want to plug and
> unplug in that case? Shouldn't the plug/unplug be places in
> .writepage for the buffered writeback case (which would handle sync
> writes, too)?

Good point, it probably should be placed a bit more clever.

> Also, what is the impact of not plugging here? You change
> generic_file_aio_write, but filesystems like XFS supply their own
> .aio_write method and hence woul dneed some kind of change, too?

Generally, performance tests need to be run and the appropriate places
to plug need to be found. It could potentially cause lower queue depth
on the device side, or less merging of ios if we miss one of the heavy
IO spots.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 10/10] fs: make aio plug
  2011-01-24 19:09     ` Jens Axboe
@ 2011-01-24 19:15       ` Jeff Moyer
  2011-01-24 19:22         ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Jeff Moyer @ 2011-01-24 19:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch, Shaohua Li

Jens Axboe <jaxboe@fusionio.com> writes:

>>> @@ -1698,6 +1701,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>>>  		if (ret)
>>>  			break;
>>>  	}
>>> +	blk_finish_plug(&plug);
>>>  	aio_batch_free(batch_hash);
>> 
>> I'm pretty sure you want blk_finish_plug to run after aio_batch_free.
>
> You mean to cover the iput()? That's not a bad idea, if that ends up
> writing it out. I did a few read tests here and confirmed it's sending
> down batches of whatever number is submitted.

No, I actually didn't make it all the way through 5/10, so I didn't
realize you got rid of the blk_run_address_space.  I agree with the TODO
item to get rid of the aio batching, since it's now taken care of with
the on-stack plugging.

As for the iput, I don't think you will get the final iput here, since
you've just scheduled I/O against the file.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 10/10] fs: make aio plug
  2011-01-24 19:15       ` Jeff Moyer
@ 2011-01-24 19:22         ` Jens Axboe
  2011-01-24 19:29           ` Jeff Moyer
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-01-24 19:22 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, hch, Shaohua Li

On 2011-01-24 20:15, Jeff Moyer wrote:
> Jens Axboe <jaxboe@fusionio.com> writes:
> 
>>>> @@ -1698,6 +1701,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>>>>  		if (ret)
>>>>  			break;
>>>>  	}
>>>> +	blk_finish_plug(&plug);
>>>>  	aio_batch_free(batch_hash);
>>>
>>> I'm pretty sure you want blk_finish_plug to run after aio_batch_free.
>>
>> You mean to cover the iput()? That's not a bad idea, if that ends up
>> writing it out. I did a few read tests here and confirmed it's sending
>> down batches of whatever number is submitted.
> 
> No, I actually didn't make it all the way through 5/10, so I didn't
> realize you got rid of the blk_run_address_space.  I agree with the TODO
> item to get rid of the aio batching, since it's now taken care of with
> the on-stack plugging.

Oh, so that was the whole point of the series :-)

> As for the iput, I don't think you will get the final iput here, since
> you've just scheduled I/O against the file.

Good point, so the original comment is moot - it won't make a difference.
Still, may not be a bad idea to do the freeing first. I was sort-of
hoping to be able to kill the batching in fs/aio.c completely, though...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread
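
The ordering being debated here — whether cleanup that may itself submit IO should run before or after the plug is closed — can be sketched with invented user-space stand-ins (plug_start(), submit_io(), plug_finish(), batch_free() below are hypothetical names, not the kernel APIs):

```c
#include <stddef.h>

/* Hypothetical sketch: IO submitted while a plug is open is queued
 * and dispatched as one batch; IO submitted without a plug kicks
 * the device queue immediately. If the cleanup path can trigger IO
 * (e.g. a final iput), running it inside the plug window lets that
 * IO join the batch instead of costing a separate queue kick.
 */
static int dispatches;		/* times the device queue was kicked */

struct plug {
	int nr;			/* requests queued under this plug */
};

static void plug_start(struct plug *p)
{
	p->nr = 0;
}

static void submit_io(struct plug *p)
{
	if (p)
		p->nr++;	/* plugged: just queue it */
	else
		dispatches++;	/* unplugged: dispatched on its own */
}

static void plug_finish(struct plug *p)
{
	if (p->nr)
		dispatches++;	/* whole batch, one queue kick */
	p->nr = 0;
}

/* Cleanup that may itself submit IO, like aio_batch_free(). */
static void batch_free(struct plug *p)
{
	submit_io(p);
}
```

Running the free before finishing the plug keeps everything to a single dispatch; running it after costs an extra one, which is the ordering Jeff is pointing at.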

* Re: [PATCH 10/10] fs: make aio plug
  2011-01-24 19:22         ` Jens Axboe
@ 2011-01-24 19:29           ` Jeff Moyer
  2011-01-24 19:31             ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Jeff Moyer @ 2011-01-24 19:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch, Shaohua Li

Jens Axboe <jaxboe@fusionio.com> writes:

> On 2011-01-24 20:15, Jeff Moyer wrote:
>> Jens Axboe <jaxboe@fusionio.com> writes:
>> 
>>>>> @@ -1698,6 +1701,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>>>>>  		if (ret)
>>>>>  			break;
>>>>>  	}
>>>>> +	blk_finish_plug(&plug);
>>>>>  	aio_batch_free(batch_hash);
>>>>
>>>> I'm pretty sure you want blk_finish_plug to run after aio_batch_free.
>>>
>>> You mean to cover the iput()? That's not a bad idea, if that ends up
>>> writing it out. I did a few read tests here and confirmed it's sending
>>> down batches of whatever number is submitted.
>> 
>> No, I actually didn't make it all the way through 5/10, so I didn't
>> realize you got rid of the blk_run_address_space.  I agree with the TODO
>> item to get rid of the aio batching, since it's now taken care of with
>> the on-stack plugging.
>
> Oh, so that was the whole point of the series :-)

I've said dumber things, I'm sure.  =)

>> As for the iput, I don't think you will get the final iput here, since
>> you've just scheduled I/O against the file.
>
> Good point, so the original comment is moot - it wont make a difference.
> Still, may not be a bad idea to do the freeing first. I was sort-of
> hoping to be able to kill the batching in fs/aio.c completely, though...

Yes, that's what I meant above when I said I agreed with the TODO item.
Go ahead and nuke it.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 10/10] fs: make aio plug
  2011-01-24 19:29           ` Jeff Moyer
@ 2011-01-24 19:31             ` Jens Axboe
  2011-01-24 19:38               ` Jeff Moyer
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-01-24 19:31 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, hch, Shaohua Li

On 2011-01-24 20:29, Jeff Moyer wrote:
> Jens Axboe <jaxboe@fusionio.com> writes:
> 
>> On 2011-01-24 20:15, Jeff Moyer wrote:
>>> Jens Axboe <jaxboe@fusionio.com> writes:
>>>
>>>>>> @@ -1698,6 +1701,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>>>>>>  		if (ret)
>>>>>>  			break;
>>>>>>  	}
>>>>>> +	blk_finish_plug(&plug);
>>>>>>  	aio_batch_free(batch_hash);
>>>>>
>>>>> I'm pretty sure you want blk_finish_plug to run after aio_batch_free.
>>>>
>>>> You mean to cover the iput()? That's not a bad idea, if that ends up
>>>> writing it out. I did a few read tests here and confirmed it's sending
>>>> down batches of whatever number is submitted.
>>>
>>> No, I actually didn't make it all the way through 5/10, so I didn't
>>> realize you got rid of the blk_run_address_space.  I agree with the TODO
>>> item to get rid of the aio batching, since it's now taken care of with
>>> the on-stack plugging.
>>
>> Oh, so that was the whole point of the series :-)
> 
> I've said dumber things, I'm sure.  =)

;-)

>>> As for the iput, I don't think you will get the final iput here, since
>>> you've just scheduled I/O against the file.
>>
>> Good point, so the original comment is moot - it won't make a difference.
>> Still, may not be a bad idea to do the freeing first. I was sort-of
>> hoping to be able to kill the batching in fs/aio.c completely, though...
> 
> Yes, that's what I meant above when I said I agreed with the TODO item.
> Go ahead and nuke it.

Will do, if you promise to help test it :-)

-- 
Jens Axboe



* Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-01-22  1:17 ` [PATCH 04/10] block: initial patch for on-stack per-task plugging Jens Axboe
@ 2011-01-24 19:36   ` Jeff Moyer
  2011-01-24 21:23     ` Jens Axboe
  2011-03-10 16:54   ` Vivek Goyal
  2011-03-16  8:18   ` Shaohua Li
  2 siblings, 1 reply; 152+ messages in thread
From: Jeff Moyer @ 2011-01-24 19:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch

Jens Axboe <jaxboe@fusionio.com> writes:

This looks mostly good.  I just have a couple of questions, listed below.

> +/*
> + * Attempts to merge with the plugged list in the current process. Returns
> + * true if merge was successful, otherwise false.
> + */
> +static bool check_plug_merge(struct task_struct *tsk, struct request_queue *q,
> +			     struct bio *bio)
> +{

Would a better name for this function be attempt_plug_merge?

> +	plug = current->plug;
> +	if (plug && !sync) {
> +		if (!plug->should_sort && !list_empty(&plug->list)) {
> +			struct request *__rq;
> +
> +			__rq = list_entry_rq(plug->list.prev);
> +			if (__rq->q != q)
> +				plug->should_sort = 1;

[snip]

> +static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
> +{
> +	struct request *rqa = container_of(a, struct request, queuelist);
> +	struct request *rqb = container_of(b, struct request, queuelist);
> +
> +	return !(rqa->q == rqb->q);
> +}


> +static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
> +{

[snip]

> +	if (plug->should_sort)
> +		list_sort(NULL, &plug->list, plug_rq_cmp);

The other way to do this is to just keep track of which queues you need
to run after exhausting the plug list.  Is it safe to assume you've done
things this way to keep each request queue's data structures cache hot
while working on it?

> +static inline void blk_flush_plug(struct task_struct *tsk)
> +{
> +	struct blk_plug *plug = tsk->plug;
> +
> +	if (unlikely(plug))
> +		__blk_flush_plug(tsk, plug);
> +}

Why is that unlikely?

Cheers,
Jeff


* Re: [PATCH 10/10] fs: make aio plug
  2011-01-24 19:31             ` Jens Axboe
@ 2011-01-24 19:38               ` Jeff Moyer
  0 siblings, 0 replies; 152+ messages in thread
From: Jeff Moyer @ 2011-01-24 19:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch, Shaohua Li

Jens Axboe <jaxboe@fusionio.com> writes:

>>>> As for the iput, I don't think you will get the final iput here, since
>>>> you've just scheduled I/O against the file.
>>>
>>> Good point, so the original comment is moot - it won't make a difference.
>>> Still, may not be a bad idea to do the freeing first. I was sort-of
>>> hoping to be able to kill the batching in fs/aio.c completely, though...
>> 
>> Yes, that's what I meant above when I said I agreed with the TODO item.
>> Go ahead and nuke it.
>
> Will do, if you promise to help test it :-)

Yeah, I'll sign up to do some testing.

Cheers,
Jeff


* Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-01-24 19:36   ` Jeff Moyer
@ 2011-01-24 21:23     ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-01-24 21:23 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, hch

On 2011-01-24 20:36, Jeff Moyer wrote:
> Jens Axboe <jaxboe@fusionio.com> writes:
> 
> This looks mostly good.  I just have a couple of questions, listed below.
> 
>> +/*
>> + * Attempts to merge with the plugged list in the current process. Returns
>> + * true if merge was successful, otherwise false.
>> + */
>> +static bool check_plug_merge(struct task_struct *tsk, struct request_queue *q,
>> +			     struct bio *bio)
>> +{
> 
> Would a better name for this function be attempt_plug_merge?

Most likely :-). I'll change it.

>> +	plug = current->plug;
>> +	if (plug && !sync) {
>> +		if (!plug->should_sort && !list_empty(&plug->list)) {
>> +			struct request *__rq;
>> +
>> +			__rq = list_entry_rq(plug->list.prev);
>> +			if (__rq->q != q)
>> +				plug->should_sort = 1;
> 
> [snip]
> 
>> +static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
>> +{
>> +	struct request *rqa = container_of(a, struct request, queuelist);
>> +	struct request *rqb = container_of(b, struct request, queuelist);
>> +
>> +	return !(rqa->q == rqb->q);
>> +}
> 
> 
>> +static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
>> +{
> 
> [snip]
> 
>> +	if (plug->should_sort)
>> +		list_sort(NULL, &plug->list, plug_rq_cmp);
> 
> The other way to do this is to just keep track of which queues you need
> to run after exhausting the plug list.  Is it safe to assume you've done
> things this way to keep each request queue's data structures cache hot
> while working on it?

But then you get into memory problems as well, as the number of
different queues could (potentially) be huge. In reality they will not
be, but it's something that has to be handled. And if you track those
queues, that still means you have to grab each queue lock twice instead
of just once.

There are probably areas where the double lock approach may be faster
than spending cycles on a sort, but in practice I think the sort will be
faster. It's something we can play with, though.

>> +static inline void blk_flush_plug(struct task_struct *tsk)
>> +{
>> +	struct blk_plug *plug = tsk->plug;
>> +
>> +	if (unlikely(plug))
>> +		__blk_flush_plug(tsk, plug);
>> +}
> 
> Why is that unlikely?

Main caller is the CPU scheduler, when someone is scheduled out. So the
logic is that unless you're very IO intensive, you'll be more likely to
go to sleep without anything waiting on the plug list than not. This
puts it out-of-line in the CPU scheduler, to keep the cost to a minimum.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-01-22  1:17 ` [PATCH 05/10] block: remove per-queue plugging Jens Axboe
  2011-01-22  1:31   ` Nick Piggin
@ 2011-03-03 21:23   ` Mike Snitzer
  2011-03-03 21:27     ` Mike Snitzer
                       ` (2 more replies)
  2011-03-04  4:00   ` Vivek Goyal
  2 siblings, 3 replies; 152+ messages in thread
From: Mike Snitzer @ 2011-03-03 21:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch

> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 54b123d..c0a07aa 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -59,7 +59,6 @@ static struct request *blk_flush_complete_seq(struct request_queue *q,
>  static void blk_flush_complete_seq_end_io(struct request_queue *q,
>                                          unsigned seq, int error)
>  {
> -       bool was_empty = elv_queue_empty(q);
>        struct request *next_rq;
>
>        next_rq = blk_flush_complete_seq(q, seq, error);
> @@ -68,7 +67,7 @@ static void blk_flush_complete_seq_end_io(struct request_queue *q,
>         * Moving a request silently to empty queue_head may stall the
>         * queue.  Kick the queue in those cases.
>         */
> -       if (was_empty && next_rq)
> +       if (next_rq)
>                __blk_run_queue(q);
>  }
>
...
> diff --git a/block/elevator.c b/block/elevator.c
> index a9fe237..d5d17a4 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -619,8 +619,6 @@ void elv_quiesce_end(struct request_queue *q)
...
> -int elv_queue_empty(struct request_queue *q)
> -{
> -       struct elevator_queue *e = q->elevator;
> -
> -       if (!list_empty(&q->queue_head))
> -               return 0;
> -
> -       if (e->ops->elevator_queue_empty_fn)
> -               return e->ops->elevator_queue_empty_fn(q);
> -
> -       return 1;
> -}
> -EXPORT_SYMBOL(elv_queue_empty);
> -

Your latest 'for-2.6.39/stack-unplug' rebase (commit 7703acb01e)
misses removing a call to elv_queue_empty() in
block/blk-flush.c:flush_data_end_io()

  CC      block/blk-flush.o
block/blk-flush.c: In function ‘flush_data_end_io’:
block/blk-flush.c:266: error: implicit declaration of function ‘elv_queue_empty’


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-03 21:23   ` Mike Snitzer
@ 2011-03-03 21:27     ` Mike Snitzer
  2011-03-03 22:13     ` Mike Snitzer
  2011-03-08 12:15     ` Jens Axboe
  2 siblings, 0 replies; 152+ messages in thread
From: Mike Snitzer @ 2011-03-03 21:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch

On Thu, Mar 03 2011 at  4:23pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> > diff --git a/block/blk-flush.c b/block/blk-flush.c
> > index 54b123d..c0a07aa 100644
> > --- a/block/blk-flush.c
> > +++ b/block/blk-flush.c
> > @@ -59,7 +59,6 @@ static struct request *blk_flush_complete_seq(struct request_queue *q,
> >  static void blk_flush_complete_seq_end_io(struct request_queue *q,
> >                                          unsigned seq, int error)
> >  {
> > -       bool was_empty = elv_queue_empty(q);
> >        struct request *next_rq;
> >
> >        next_rq = blk_flush_complete_seq(q, seq, error);
> > @@ -68,7 +67,7 @@ static void blk_flush_complete_seq_end_io(struct request_queue *q,
> >         * Moving a request silently to empty queue_head may stall the
> >         * queue.  Kick the queue in those cases.
> >         */
> > -       if (was_empty && next_rq)
> > +       if (next_rq)
> >                __blk_run_queue(q);
> >  }
> >
> ...
> > diff --git a/block/elevator.c b/block/elevator.c
> > index a9fe237..d5d17a4 100644
> > --- a/block/elevator.c
> > +++ b/block/elevator.c
> > @@ -619,8 +619,6 @@ void elv_quiesce_end(struct request_queue *q)
> ...
> > -int elv_queue_empty(struct request_queue *q)
> > -{
> > -       struct elevator_queue *e = q->elevator;
> > -
> > -       if (!list_empty(&q->queue_head))
> > -               return 0;
> > -
> > -       if (e->ops->elevator_queue_empty_fn)
> > -               return e->ops->elevator_queue_empty_fn(q);
> > -
> > -       return 1;
> > -}
> > -EXPORT_SYMBOL(elv_queue_empty);
> > -
> 
> Your latest 'for-2.6.39/stack-unplug' rebase (commit 7703acb01e)
> misses removing a call to elv_queue_empty() in
> block/blk-flush.c:flush_data_end_io()
> 
>   CC      block/blk-flush.o
> block/blk-flush.c: In function ‘flush_data_end_io’:
> block/blk-flush.c:266: error: implicit declaration of function ‘elv_queue_empty’

This allows me to compile:

diff --git a/block/blk-flush.c b/block/blk-flush.c
index de5ae6e..671fa9d 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -263,10 +263,9 @@ static bool blk_kick_flush(struct request_queue *q)
 static void flush_data_end_io(struct request *rq, int error)
 {
 	struct request_queue *q = rq->q;
-	bool was_empty = elv_queue_empty(q);
 
 	/* after populating an empty queue, kick it to avoid stall */
-	if (blk_flush_complete_seq(rq, REQ_FSEQ_DATA, error) && was_empty)
+	if (blk_flush_complete_seq(rq, REQ_FSEQ_DATA, error))
 		__blk_run_queue(q);
 }
 


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-03 21:23   ` Mike Snitzer
  2011-03-03 21:27     ` Mike Snitzer
@ 2011-03-03 22:13     ` Mike Snitzer
  2011-03-04 13:02       ` Shaohua Li
  2011-03-08 12:16       ` Jens Axboe
  2011-03-08 12:15     ` Jens Axboe
  2 siblings, 2 replies; 152+ messages in thread
From: Mike Snitzer @ 2011-03-03 22:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch

I'm now hitting a lockdep issue while running a 'for-2.6.39/stack-plug'
kernel with an fsync-heavy workload against a request-based mpath
device (the kernel ultimately goes down in flames; I've yet to look at
the crashdump I took).

Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.38-rc6-snitm+ (root@rhel6) (gcc version 4.4.5 20110116 (Red Hat 4.4.5-5) (GCC) ) #2 SMP Thu Mar 3 16:32:23 EST 2011
Command line: ro root=UUID=e0236db2-5a38-4d48-8bf5-55675671dee6 console=ttyS0 rhgb quiet SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us rd_plytheme=charge crashkernel=auto
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
 BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007fffd000 (usable)
 BIOS-e820: 000000007fffd000 - 0000000080000000 (reserved)
 BIOS-e820: 00000000fffbc000 - 0000000100000000 (reserved)
NX (Execute Disable) protection: active
DMI 2.4 present.
DMI: Bochs Bochs, BIOS Bochs 01/01/2007
e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
No AGP bridge found
last_pfn = 0x7fffd max_arch_pfn = 0x400000000
MTRR default type: write-back
MTRR fixed ranges enabled:
  00000-9FFFF write-back
  A0000-BFFFF uncachable
  C0000-FFFFF write-protect
MTRR variable ranges enabled:
  0 base 00E0000000 mask FFE0000000 uncachable
  1 disabled
  2 disabled
  3 disabled
  4 disabled
  5 disabled
  6 disabled
  7 disabled
PAT not supported by CPU.
found SMP MP-table at [ffff8800000f7fd0] f7fd0
initial memory mapped : 0 - 20000000
init_memory_mapping: 0000000000000000-000000007fffd000
 0000000000 - 007fe00000 page 2M
 007fe00000 - 007fffd000 page 4k
kernel direct mapping tables up to 7fffd000 @ 1fffc000-20000000
RAMDISK: 37b50000 - 37ff0000
crashkernel: memory value expected
ACPI: RSDP 00000000000f7f80 00014 (v00 BOCHS )
ACPI: RSDT 000000007fffde10 00034 (v01 BOCHS  BXPCRSDT 00000001 BXPC 00000001)
ACPI: FACP 000000007ffffe40 00074 (v01 BOCHS  BXPCFACP 00000001 BXPC 00000001)
ACPI: DSDT 000000007fffdfd0 01E22 (v01   BXPC   BXDSDT 00000001 INTL 20090123)
ACPI: FACS 000000007ffffe00 00040
ACPI: SSDT 000000007fffdf80 00044 (v01 BOCHS  BXPCSSDT 00000001 BXPC 00000001)
ACPI: APIC 000000007fffde90 0007A (v01 BOCHS  BXPCAPIC 00000001 BXPC 00000001)
ACPI: HPET 000000007fffde50 00038 (v01 BOCHS  BXPCHPET 00000001 BXPC 00000001)
ACPI: Local APIC address 0xfee00000
No NUMA configuration found
Faking a node at 0000000000000000-000000007fffd000
Initmem setup node 0 0000000000000000-000000007fffd000
  NODE_DATA [000000007ffe9000 - 000000007fffcfff]
kvm-clock: Using msrs 12 and 11
kvm-clock: cpu 0, msr 0:1875141, boot clock
 [ffffea0000000000-ffffea0001bfffff] PMD -> [ffff88007d600000-ffff88007f1fffff] on node 0
Zone PFN ranges:
  DMA      0x00000010 -> 0x00001000
  DMA32    0x00001000 -> 0x00100000
  Normal   empty
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
    0: 0x00000010 -> 0x0000009f
    0: 0x00000100 -> 0x0007fffd
On node 0 totalpages: 524172
  DMA zone: 56 pages used for memmap
  DMA zone: 2 pages reserved
  DMA zone: 3925 pages, LIFO batch:0
  DMA32 zone: 7112 pages used for memmap
  DMA32 zone: 513077 pages, LIFO batch:31
ACPI: PM-Timer IO Port: 0xb008
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ5 used by override.
ACPI: IRQ9 used by override.
ACPI: IRQ10 used by override.
ACPI: IRQ11 used by override.
Using ACPI (MADT) for SMP configuration information
ACPI: HPET id: 0x8086a201 base: 0xfed00000
SMP: Allowing 2 CPUs, 0 hotplug CPUs
nr_irqs_gsi: 40
Allocating PCI resources starting at 80000000 (gap: 80000000:7ffbc000)
Booting paravirtualized kernel on KVM
setup_percpu: NR_CPUS:4 nr_cpumask_bits:4 nr_cpu_ids:2 nr_node_ids:1
PERCPU: Embedded 474 pages/cpu @ffff88007f200000 s1912768 r8192 d20544 u2097152
pcpu-alloc: s1912768 r8192 d20544 u2097152 alloc=1*2097152
pcpu-alloc: [0] 0 [0] 1 
kvm-clock: cpu 0, msr 0:7f3d2141, primary cpu clock
Built 1 zonelists in Node order, mobility grouping on.  Total pages: 517002
Policy zone: DMA32
Kernel command line: ro root=UUID=e0236db2-5a38-4d48-8bf5-55675671dee6 console=ttyS0 rhgb quiet SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us rd_plytheme=charge crashkernel=auto
PID hash table entries: 4096 (order: 3, 32768 bytes)
Checking aperture...
No AGP bridge found
Memory: 2037496k/2097140k available (3571k kernel code, 452k absent, 59192k reserved, 3219k data, 3504k init)
Hierarchical RCU implementation.
	RCU-based detection of stalled CPUs is disabled.
NR_IRQS:4352 nr_irqs:512 16
Console: colour VGA+ 80x25
console [ttyS0] enabled
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBCLASSES:  8
... MAX_LOCK_DEPTH:          48
... MAX_LOCKDEP_KEYS:        8191
... CLASSHASH_SIZE:          4096
... MAX_LOCKDEP_ENTRIES:     16384
... MAX_LOCKDEP_CHAINS:      32768
... CHAINHASH_SIZE:          16384
 memory used by lock dependency info: 6367 kB
 per task-struct memory footprint: 2688 bytes
ODEBUG: 11 of 11 active objects replaced
ODEBUG: selftest passed
hpet clockevent registered
Detected 1995.090 MHz processor.
Calibrating delay loop (skipped) preset value.. 3990.18 BogoMIPS (lpj=1995090)
pid_max: default: 32768 minimum: 301
Security Framework initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Mount-cache hash table entries: 256
Initializing cgroup subsys ns
ns_cgroup deprecated: consider using the 'clone_children' flag without the ns_cgroup.
Initializing cgroup subsys cpuacct
Initializing cgroup subsys devices
Initializing cgroup subsys freezer
Initializing cgroup subsys net_cls
mce: CPU supports 10 MCE banks
ACPI: Core revision 20110112
ftrace: allocating 16994 entries in 67 pages
Setting APIC routing to flat
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel QEMU Virtual CPU version 0.12.5 stepping 03
Performance Events: unsupported p6 CPU model 2 no PMU driver, software events only.
lockdep: fixing up alternatives.
Booting Node   0, Processors  #1 Ok.
kvm-clock: cpu 1, msr 0:7f5d2141, secondary cpu clock
Brought up 2 CPUs
Total of 2 processors activated (7980.36 BogoMIPS).
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using configuration type 1 for base access
mtrr: your CPUs had inconsistent variable MTRR settings
mtrr: your CPUs had inconsistent MTRRdefType settings
mtrr: probably your BIOS does not setup all CPUs.
mtrr: corrected configuration.
bio: create slab <bio-0> at 0
ACPI: EC: Look up EC in DSDT
ACPI: Interpreter enabled
ACPI: (supports S0 S5)
ACPI: Using IOAPIC for interrupt routing
ACPI: No dock devices found.
PCI: Ignoring host bridge windows from ACPI; if necessary, use "pci=use_crs" and report a bug
ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
pci_root PNP0A03:00: host bridge window [io  0x0000-0x0cf7] (ignored)
pci_root PNP0A03:00: host bridge window [io  0x0d00-0xffff] (ignored)
pci_root PNP0A03:00: host bridge window [mem 0x000a0000-0x000bffff] (ignored)
pci_root PNP0A03:00: host bridge window [mem 0xe0000000-0xfebfffff] (ignored)
pci 0000:00:00.0: [8086:1237] type 0 class 0x000600
pci 0000:00:01.0: [8086:7000] type 0 class 0x000601
pci 0000:00:01.1: [8086:7010] type 0 class 0x000101
pci 0000:00:01.1: reg 20: [io  0xc000-0xc00f]
pci 0000:00:01.2: [8086:7020] type 0 class 0x000c03
pci 0000:00:01.2: reg 20: [io  0xc020-0xc03f]
pci 0000:00:01.3: [8086:7113] type 0 class 0x000680
pci 0000:00:01.3: quirk: [io  0xb000-0xb03f] claimed by PIIX4 ACPI
pci 0000:00:01.3: quirk: [io  0xb100-0xb10f] claimed by PIIX4 SMB
pci 0000:00:02.0: [1013:00b8] type 0 class 0x000300
pci 0000:00:02.0: reg 10: [mem 0xf0000000-0xf1ffffff pref]
pci 0000:00:02.0: reg 14: [mem 0xf2000000-0xf2000fff]
pci 0000:00:03.0: [1af4:1002] type 0 class 0x000500
pci 0000:00:03.0: reg 10: [io  0xc040-0xc05f]
pci 0000:00:04.0: [1af4:1001] type 0 class 0x000100
pci 0000:00:04.0: reg 10: [io  0xc080-0xc0bf]
pci 0000:00:05.0: [1af4:1001] type 0 class 0x000100
pci 0000:00:05.0: reg 10: [io  0xc0c0-0xc0ff]
pci 0000:00:06.0: [1af4:1000] type 0 class 0x000200
pci 0000:00:06.0: reg 10: [io  0xc100-0xc11f]
pci 0000:00:06.0: reg 14: [mem 0xf2001000-0xf2001fff]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *10 11)
ACPI: PCI Interrupt Link [LNKB] (IRQs 5 *10 11)
ACPI: PCI Interrupt Link [LNKC] (IRQs 5 10 *11)
ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11)
vgaarb: device added: PCI:0000:00:02.0,decodes=io+mem,owns=io+mem,locks=none
vgaarb: loaded
SCSI subsystem initialized
libata version 3.00 loaded.
PCI: Using ACPI for IRQ routing
PCI: pci_cache_line_size set to 64 bytes
reserve RAM buffer: 000000000009f400 - 000000000009ffff 
reserve RAM buffer: 000000007fffd000 - 000000007fffffff 
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
HPET: 3 timers in total, 0 timers will be used for per-cpu timer
hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
hpet0: 3 comparators, 64-bit 100.000000 MHz counter
Switching to clocksource kvm-clock
Switched to NOHz mode on CPU #0
Switched to NOHz mode on CPU #1
pnp: PnP ACPI init
ACPI: bus type pnp registered
pnp 00:00: [bus 00-ff]
pnp 00:00: [io  0x0cf8-0x0cff]
pnp 00:00: [io  0x0000-0x0cf7 window]
pnp 00:00: [io  0x0d00-0xffff window]
pnp 00:00: [mem 0x000a0000-0x000bffff window]
pnp 00:00: [mem 0xe0000000-0xfebfffff window]
pnp 00:00: Plug and Play ACPI device, IDs PNP0a03 (active)
pnp 00:01: [io  0x0070-0x0071]
pnp 00:01: [irq 8]
pnp 00:01: [io  0x0072-0x0077]
pnp 00:01: Plug and Play ACPI device, IDs PNP0b00 (active)
pnp 00:02: [io  0x0060]
pnp 00:02: [io  0x0064]
pnp 00:02: [irq 1]
pnp 00:02: Plug and Play ACPI device, IDs PNP0303 (active)
pnp 00:03: [irq 12]
pnp 00:03: Plug and Play ACPI device, IDs PNP0f13 (active)
pnp 00:04: [io  0x03f2-0x03f5]
pnp 00:04: [io  0x03f7]
pnp 00:04: [irq 6]
pnp 00:04: [dma 2]
pnp 00:04: Plug and Play ACPI device, IDs PNP0700 (active)
pnp 00:05: [mem 0xfed00000-0xfed003ff]
pnp 00:05: Plug and Play ACPI device, IDs PNP0103 (active)
pnp: PnP ACPI: found 6 devices
ACPI: ACPI bus type pnp unregistered
pci_bus 0000:00: resource 0 [io  0x0000-0xffff]
pci_bus 0000:00: resource 1 [mem 0x00000000-0xffffffffff]
NET: Registered protocol family 2
IP route cache hash table entries: 65536 (order: 7, 524288 bytes)
TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 10, 5242880 bytes)
TCP: Hash tables configured (established 262144 bind 65536)
TCP reno registered
UDP hash table entries: 1024 (order: 5, 196608 bytes)
UDP-Lite hash table entries: 1024 (order: 5, 196608 bytes)
NET: Registered protocol family 1
pci 0000:00:00.0: Limiting direct PCI/PCI transfers
pci 0000:00:01.0: PIIX3: Enabling Passive Release
pci 0000:00:01.0: Activating ISA DMA hang workarounds
pci 0000:00:02.0: Boot video device
PCI: CLS 0 bytes, default 64
Trying to unpack rootfs image as initramfs...
Freeing initrd memory: 4736k freed
DMA-API: preallocated 32768 debug entries
DMA-API: debugging enabled by kernel config
audit: initializing netlink socket (disabled)
type=2000 audit(1299188678.444:1): initialized
HugeTLB registered 2 MB page size, pre-allocated 0 pages
VFS: Disk quotas dquot_6.5.2
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
msgmni has been set to 3988
SELinux:  Registering netfilter hooks
cryptomgr_test used greatest stack depth: 6496 bytes left
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
io scheduler noop registered
io scheduler deadline registered (default)
io scheduler cfq registered
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
pciehp: PCI Express Hot Plug Controller Driver version: 0.4
acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
acpiphp: Slot [1] registered
acpiphp: Slot [2] registered
acpiphp: Slot [3] registered
acpiphp: Slot [4] registered
acpiphp: Slot [5] registered
acpiphp: Slot [6] registered
acpiphp: Slot [7] registered
acpiphp: Slot [8] registered
acpiphp: Slot [9] registered
acpiphp: Slot [10] registered
acpiphp: Slot [11] registered
acpiphp: Slot [12] registered
acpiphp: Slot [13] registered
acpiphp: Slot [14] registered
acpiphp: Slot [15] registered
acpiphp: Slot [16] registered
acpiphp: Slot [17] registered
acpiphp: Slot [18] registered
acpiphp: Slot [19] registered
acpiphp: Slot [20] registered
acpiphp: Slot [21] registered
acpiphp: Slot [22] registered
acpiphp: Slot [23] registered
acpiphp: Slot [24] registered
acpiphp: Slot [25] registered
acpiphp: Slot [26] registered
acpiphp: Slot [27] registered
acpiphp: Slot [28] registered
acpiphp: Slot [29] registered
acpiphp: Slot [30] registered
acpiphp: Slot [31] registered
input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
ACPI: Power Button [PWRF]
ACPI: acpi_idle registered with cpuidle
Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
Non-volatile memory driver v1.3
Linux agpgart interface v0.103
brd: module loaded
loop: module loaded
ata_piix 0000:00:01.1: version 2.13
ata_piix 0000:00:01.1: setting latency timer to 64
scsi0 : ata_piix
scsi1 : ata_piix
ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc000 irq 14
ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc008 irq 15
i8042: PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mousedev: PS/2 mouse device common for all mice
input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input1
rtc_cmos 00:01: rtc core: registered rtc_cmos as rtc0
rtc0: alarms up to one day, 114 bytes nvram, hpet irqs
cpuidle: using governor ladder
cpuidle: using governor menu
nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
ip_tables: (C) 2000-2006 Netfilter Core Team
TCP cubic registered
NET: Registered protocol family 17
registered taskstats version 1
IMA: No TPM chip found, activating TPM-bypass!
rtc_cmos 00:01: setting system clock to 2011-03-03 21:44:38 UTC (1299188678)
Freeing unused kernel memory: 3504k freed
Write protecting the kernel read-only data: 6144k
Freeing unused kernel memory: 508k freed
Freeing unused kernel memory: 164k freed
mknod used greatest stack depth: 5296 bytes left
modprobe used greatest stack depth: 5080 bytes left
mknod used greatest stack depth: 4792 bytes left
input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input2
dracut: dracut-004-35.el6
udev: starting version 147
udevd (70): /proc/70/oom_adj is deprecated, please use /proc/70/oom_score_adj instead.
dracut: Starting plymouth daemon
Refined TSC clocksource calibration: 1994.951 MHz.
ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
virtio-pci 0000:00:03.0: PCI INT A -> Link[LNKC] -> GSI 11 (level, high) -> IRQ 11
virtio-pci 0000:00:03.0: setting latency timer to 64
ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 10
virtio-pci 0000:00:04.0: PCI INT A -> Link[LNKD] -> GSI 10 (level, high) -> IRQ 10
virtio-pci 0000:00:04.0: setting latency timer to 64
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10
virtio-pci 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10
virtio-pci 0000:00:05.0: setting latency timer to 64
ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 11
virtio-pci 0000:00:06.0: PCI INT A -> Link[LNKB] -> GSI 11 (level, high) -> IRQ 11
virtio-pci 0000:00:06.0: setting latency timer to 64
modprobe used greatest stack depth: 4768 bytes left
 vda: vda1 vda2 vda3
 vdb: unknown partition table
modprobe used greatest stack depth: 4672 bytes left
EXT3-fs: barriers not enabled
kjournald starting.  Commit interval 5 seconds
EXT3-fs (vda3): mounted filesystem with ordered data mode
dracut: Remounting /dev/disk/by-uuid/e0236db2-5a38-4d48-8bf5-55675671dee6 with -o barrier=1,ro
kjournald starting.  Commit interval 5 seconds
EXT3-fs (vda3): mounted filesystem with ordered data mode
dracut: Mounted root filesystem /dev/vda3
dracut: Loading SELinux policy
SELinux:  Disabled at runtime.
SELinux:  Unregistering netfilter hooks
type=1404 audit(1299188681.051:2): selinux=0 auid=4294967295 ses=4294967295
load_policy used greatest stack depth: 3664 bytes left
dracut: /sbin/load_policy: Can't load policy: No such file or directory
dracut: Switching root
readahead: starting
udev: starting version 147
ip used greatest stack depth: 3592 bytes left
piix4_smbus 0000:00:01.3: SMBus Host Controller at 0xb100, revision 0
virtio-pci 0000:00:06.0: irq 40 for MSI/MSI-X
virtio-pci 0000:00:06.0: irq 41 for MSI/MSI-X
virtio-pci 0000:00:06.0: irq 42 for MSI/MSI-X
device-mapper: uevent: version 1.0.3
device-mapper: ioctl: 4.19.1-ioctl (2011-01-07) initialised: dm-devel@redhat.com
device-mapper: multipath: version 1.2.0 loaded
EXT3-fs (vda3): using internal journal
kjournald starting.  Commit interval 5 seconds
EXT3-fs (vda1): using internal journal
EXT3-fs (vda1): mounted filesystem with ordered data mode
Adding 524284k swap on /dev/vda2.  Priority:-1 extents:1 across:524284k 
Loading iSCSI transport class v2.0-870.
iscsi: registered transport (tcp)
RPC: Registered udp transport module.
RPC: Registered tcp transport module.
RPC: Registered tcp NFSv4.1 backchannel transport module.
scsi2 : iSCSI Initiator over TCP/IP
scsi3 : iSCSI Initiator over TCP/IP
scsi4 : iSCSI Initiator over TCP/IP
scsi5 : iSCSI Initiator over TCP/IP
scsi 2:0:0:0: Direct-Access     NETAPP   LUN              8010 PQ: 0 ANSI: 5
sd 2:0:0:0: Attached scsi generic sg0 type 0
scsi 4:0:0:0: Direct-Access     NETAPP   LUN              8010 PQ: 0 ANSI: 5
scsi 3:0:0:0: Direct-Access     NETAPP   LUN              8010 PQ: 0 ANSI: 5
scsi 5:0:0:0: Direct-Access     NETAPP   LUN              8010 PQ: 0 ANSI: 5
sd 2:0:0:0: [sda] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Mode Sense: bd 00 00 08
sd 5:0:0:0: Attached scsi generic sg1 type 0
sd 3:0:0:0: Attached scsi generic sg2 type 0
sd 4:0:0:0: Attached scsi generic sg3 type 0
sd 5:0:0:0: [sdb] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
sd 2:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 3:0:0:0: [sdc] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
sd 4:0:0:0: [sdd] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
sd 5:0:0:0: [sdb] Write Protect is off
sd 5:0:0:0: [sdb] Mode Sense: bd 00 00 08
sd 5:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 3:0:0:0: [sdc] Write Protect is off
sd 3:0:0:0: [sdc] Mode Sense: bd 00 00 08
sd 4:0:0:0: [sdd] Write Protect is off
sd 4:0:0:0: [sdd] Mode Sense: bd 00 00 08
sd 3:0:0:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 4:0:0:0: [sdd] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2
 sdb: sdb1 sdb2
sd 2:0:0:0: [sda] Attached SCSI disk
 sdc: sdc1 sdc2
 sdd: sdd1 sdd2
sd 5:0:0:0: [sdb] Attached SCSI disk
sd 3:0:0:0: [sdc] Attached SCSI disk
sd 4:0:0:0: [sdd] Attached SCSI disk
sd 2:0:0:0: alua: supports implicit TPGS
sd 2:0:0:0: alua: port group 1100 rel port 83ea
sd 2:0:0:0: alua: port group 1100 state A supports TolUsNA
sd 5:0:0:0: alua: supports implicit TPGS
sd 5:0:0:0: alua: port group 1100 rel port 83e9
sd 5:0:0:0: alua: port group 1100 state A supports TolUsNA
sd 3:0:0:0: alua: supports implicit TPGS
sd 3:0:0:0: alua: port group 1100 rel port 83e8
sd 3:0:0:0: alua: port group 1100 state A supports TolUsNA
sd 4:0:0:0: alua: supports implicit TPGS
sd 4:0:0:0: alua: port group 1100 rel port 83eb
sd 4:0:0:0: alua: port group 1100 state A supports TolUsNA
alua: device handler registered
device-mapper: multipath round-robin: version 1.0.0 loaded
sd 5:0:0:0: alua: port group 1100 state A supports TolUsNA
sd 5:0:0:0: alua: port group 1100 state A supports TolUsNA
sd 5:0:0:0: alua: port group 1100 state A supports TolUsNA
sd 5:0:0:0: alua: port group 1100 state A supports TolUsNA
 sdb:
EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
scp used greatest stack depth: 3360 bytes left
vi used greatest stack depth: 3184 bytes left

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.38-rc6-snitm+ #2
-------------------------------------------------------
ffsb/3110 is trying to acquire lock:
 (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff811b4c4d>] flush_plug_list+0xbc/0x135

but task is already holding lock:
 (&rq->lock){-.-.-.}, at: [<ffffffff8137132f>] schedule+0x16a/0x725

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&rq->lock){-.-.-.}:
       [<ffffffff810731eb>] lock_acquire+0xe3/0x110
       [<ffffffff81373773>] _raw_spin_lock+0x36/0x69
       [<ffffffff810348f0>] task_rq_lock+0x51/0x83
       [<ffffffff810402f2>] try_to_wake_up+0x34/0x220
       [<ffffffff810404f0>] default_wake_function+0x12/0x14
       [<ffffffff81030136>] __wake_up_common+0x4e/0x84
       [<ffffffff810345a1>] complete+0x3f/0x52
       [<ffffffff811b9e58>] blk_end_sync_rq+0x34/0x38
       [<ffffffff811b6279>] blk_finish_request+0x1f5/0x224
       [<ffffffff811b62e8>] __blk_end_request_all+0x40/0x49
       [<ffffffffa00165c3>] blk_done+0x92/0xe7 [virtio_blk]
       [<ffffffffa0007382>] vring_interrupt+0x68/0x71 [virtio_ring]
       [<ffffffffa000e416>] vp_vring_interrupt+0x5b/0x97 [virtio_pci]
       [<ffffffffa000e497>] vp_interrupt+0x45/0x4a [virtio_pci]
       [<ffffffff81097a80>] handle_IRQ_event+0x57/0x127
       [<ffffffff81099bfe>] handle_fasteoi_irq+0x96/0xd9
       [<ffffffff8100511b>] handle_irq+0x88/0x91
       [<ffffffff8137ab8d>] do_IRQ+0x4d/0xb4
       [<ffffffff81374453>] ret_from_intr+0x0/0x1a
       [<ffffffff811d4cfe>] __debug_object_init+0x33a/0x377
       [<ffffffff811d4d52>] debug_object_init_on_stack+0x17/0x19
       [<ffffffff8105195c>] init_timer_on_stack_key+0x26/0x3e
       [<ffffffff81371d33>] schedule_timeout+0xa7/0xfe
       [<ffffffff81371b14>] wait_for_common+0xd7/0x135
       [<ffffffff81371c0b>] wait_for_completion_timeout+0x13/0x15
       [<ffffffff811b9fdc>] blk_execute_rq+0xe9/0x12d
       [<ffffffffa001609b>] virtblk_serial_show+0x9b/0xdb [virtio_blk]
       [<ffffffff81266104>] dev_attr_show+0x27/0x4e
       [<ffffffff81159471>] sysfs_read_file+0xbd/0x16b
       [<ffffffff811001ac>] vfs_read+0xae/0x10a
       [<ffffffff811002d1>] sys_read+0x4d/0x77
       [<ffffffff81002b82>] system_call_fastpath+0x16/0x1b

-> #1 (key#28){-.-...}:
       [<ffffffff810731eb>] lock_acquire+0xe3/0x110
       [<ffffffff813738f7>] _raw_spin_lock_irqsave+0x4e/0x88
       [<ffffffff81034583>] complete+0x21/0x52
       [<ffffffff811b9e58>] blk_end_sync_rq+0x34/0x38
       [<ffffffff811b6279>] blk_finish_request+0x1f5/0x224
       [<ffffffff811b6588>] blk_end_bidi_request+0x42/0x5d
       [<ffffffff811b65df>] blk_end_request+0x10/0x12
       [<ffffffff8127c17b>] scsi_io_completion+0x1b0/0x424
       [<ffffffff81275512>] scsi_finish_command+0xe9/0xf2
       [<ffffffff8127c503>] scsi_softirq_done+0xff/0x108
       [<ffffffff811bab18>] blk_done_softirq+0x84/0x98
       [<ffffffff8104a117>] __do_softirq+0xe2/0x1d3
       [<ffffffff81003b1c>] call_softirq+0x1c/0x28
       [<ffffffff8100503b>] do_softirq+0x4b/0xa3
       [<ffffffff81049e71>] irq_exit+0x4a/0x8c
       [<ffffffff8137abdd>] do_IRQ+0x9d/0xb4
       [<ffffffff81374453>] ret_from_intr+0x0/0x1a
       [<ffffffff8137377b>] _raw_spin_lock+0x3e/0x69
       [<ffffffff810e9bdc>] __page_lock_anon_vma+0x65/0x9d
       [<ffffffff810e9c35>] try_to_unmap_anon+0x21/0xdb
       [<ffffffff810e9d1a>] try_to_munlock+0x2b/0x39
       [<ffffffff810e3ca6>] munlock_vma_page+0x45/0x7f
       [<ffffffff810e1e63>] do_wp_page+0x536/0x580
       [<ffffffff810e28b9>] handle_pte_fault+0x6af/0x6e8
       [<ffffffff810e29cc>] handle_mm_fault+0xda/0xed
       [<ffffffff81377768>] do_page_fault+0x3b4/0x3d6
       [<ffffffff81374725>] page_fault+0x25/0x30

-> #0 (&(&q->__queue_lock)->rlock){..-...}:
       [<ffffffff81072e14>] __lock_acquire+0xa32/0xd26
       [<ffffffff810731eb>] lock_acquire+0xe3/0x110
       [<ffffffff81373773>] _raw_spin_lock+0x36/0x69
       [<ffffffff811b4c4d>] flush_plug_list+0xbc/0x135
       [<ffffffff811b4ce0>] __blk_flush_plug+0x1a/0x3a
       [<ffffffff81371471>] schedule+0x2ac/0x725
       [<ffffffffa00fef16>] start_this_handle+0x3be/0x4b1 [jbd2]
       [<ffffffffa00ff1fc>] jbd2__journal_start+0xc2/0xf6 [jbd2]
       [<ffffffffa00ff243>] jbd2_journal_start+0x13/0x15 [jbd2]
       [<ffffffffa013823c>] ext4_journal_start_sb+0xe1/0x116 [ext4]
       [<ffffffffa012748d>] ext4_da_writepages+0x27c/0x517 [ext4]
       [<ffffffff810cd298>] do_writepages+0x24/0x30
       [<ffffffff8111e625>] writeback_single_inode+0xaf/0x1d0
       [<ffffffff8111eb88>] writeback_sb_inodes+0xab/0x134
       [<ffffffff8111f542>] writeback_inodes_wb+0x12b/0x13d
       [<ffffffff810cc920>] balance_dirty_pages_ratelimited_nr+0x2be/0x3d8
       [<ffffffff810c456c>] generic_file_buffered_write+0x1ff/0x267
       [<ffffffff810c593f>] __generic_file_aio_write+0x245/0x27a
       [<ffffffff810c59d9>] generic_file_aio_write+0x65/0xbc
       [<ffffffffa011dd57>] ext4_file_write+0x1f5/0x256 [ext4]
       [<ffffffff810ff5b1>] do_sync_write+0xcb/0x108
       [<ffffffff810fffaf>] vfs_write+0xb1/0x10d
       [<ffffffff811000d4>] sys_write+0x4d/0x77
       [<ffffffff81002b82>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

3 locks held by ffsb/3110:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff810c59bd>] generic_file_aio_write+0x49/0xbc
 #1:  (&type->s_umount_key#36){.+.+..}, at: [<ffffffff8111f4e5>] writeback_inodes_wb+0xce/0x13d
 #2:  (&rq->lock){-.-.-.}, at: [<ffffffff8137132f>] schedule+0x16a/0x725

stack backtrace:
Pid: 3110, comm: ffsb Not tainted 2.6.38-rc6-snitm+ #2
Call Trace:
 [<ffffffff810714fa>] ? print_circular_bug+0xae/0xbc
 [<ffffffff81072e14>] ? __lock_acquire+0xa32/0xd26
 [<ffffffff810731eb>] ? lock_acquire+0xe3/0x110
 [<ffffffff811b4c4d>] ? flush_plug_list+0xbc/0x135
 [<ffffffff81373773>] ? _raw_spin_lock+0x36/0x69
 [<ffffffff811b4c4d>] ? flush_plug_list+0xbc/0x135
 [<ffffffff811b4c4d>] ? flush_plug_list+0xbc/0x135
 [<ffffffff811b4ce0>] ? __blk_flush_plug+0x1a/0x3a
 [<ffffffff81371471>] ? schedule+0x2ac/0x725
 [<ffffffff810700f3>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffffa00fef16>] ? start_this_handle+0x3be/0x4b1 [jbd2]
 [<ffffffff8106001a>] ? autoremove_wake_function+0x0/0x3d
 [<ffffffffa00ff1fc>] ? jbd2__journal_start+0xc2/0xf6 [jbd2]
 [<ffffffffa00ff243>] ? jbd2_journal_start+0x13/0x15 [jbd2]
 [<ffffffffa013823c>] ? ext4_journal_start_sb+0xe1/0x116 [ext4]
 [<ffffffffa0120d2f>] ? ext4_meta_trans_blocks+0x67/0xb8 [ext4]
 [<ffffffffa012748d>] ? ext4_da_writepages+0x27c/0x517 [ext4]
 [<ffffffff810658fd>] ? sched_clock_local+0x1c/0x82
 [<ffffffff810cd298>] ? do_writepages+0x24/0x30
 [<ffffffff8111e625>] ? writeback_single_inode+0xaf/0x1d0
 [<ffffffff8111eb88>] ? writeback_sb_inodes+0xab/0x134
 [<ffffffff8111f542>] ? writeback_inodes_wb+0x12b/0x13d
 [<ffffffff810cc920>] ? balance_dirty_pages_ratelimited_nr+0x2be/0x3d8
 [<ffffffff810c412d>] ? iov_iter_copy_from_user_atomic+0x81/0xf1
 [<ffffffff810c456c>] ? generic_file_buffered_write+0x1ff/0x267
 [<ffffffff81048adf>] ? current_fs_time+0x27/0x2e
 [<ffffffff810c593f>] ? __generic_file_aio_write+0x245/0x27a
 [<ffffffff810658fd>] ? sched_clock_local+0x1c/0x82
 [<ffffffff810c59d9>] ? generic_file_aio_write+0x65/0xbc
 [<ffffffffa011dd57>] ? ext4_file_write+0x1f5/0x256 [ext4]
 [<ffffffff81070983>] ? mark_lock+0x2d/0x22d
 [<ffffffff8107279e>] ? __lock_acquire+0x3bc/0xd26
 [<ffffffff810ff5b1>] ? do_sync_write+0xcb/0x108
 [<ffffffff810700f3>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff81065a72>] ? local_clock+0x41/0x5a
 [<ffffffff8118e62f>] ? security_file_permission+0x2e/0x33
 [<ffffffff810fffaf>] ? vfs_write+0xb1/0x10d
 [<ffffffff81100724>] ? fget_light+0x57/0xf0
 [<ffffffff81070e61>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff811000d4>] ? sys_write+0x4d/0x77
 [<ffffffff81002b82>] ? system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-01-22  1:17 ` [PATCH 05/10] block: remove per-queue plugging Jens Axboe
  2011-01-22  1:31   ` Nick Piggin
  2011-03-03 21:23   ` Mike Snitzer
@ 2011-03-04  4:00   ` Vivek Goyal
  2011-03-08 12:24     ` Jens Axboe
  2 siblings, 1 reply; 152+ messages in thread
From: Vivek Goyal @ 2011-03-04  4:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jaxboe, linux-kernel, hch, NeilBrown

On Sat, Jan 22, 2011 at 01:17:24AM +0000, Jens Axboe wrote:

[..]
>  mm/page-writeback.c                |    2 +-
>  mm/readahead.c                     |   12 ---
>  mm/shmem.c                         |    1 -
>  mm/swap_state.c                    |    5 +-
>  mm/swapfile.c                      |   37 --------
>  mm/vmscan.c                        |    2 +-
>  118 files changed, 153 insertions(+), 1248 deletions(-)

block/blk-throttle.c also uses blk_unplug(); we need to get rid of that
as well.

[..]
> @@ -632,8 +630,6 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
>  		 * don't force unplug of the queue for that case.
>  		 * Clear unplug_it and fall through.
>  		 */

The above comments now seem redundant.

> -		unplug_it = 0;
> -
>  	case ELEVATOR_INSERT_FRONT:
>  		rq->cmd_flags |= REQ_SOFTBARRIER;
>  		list_add(&rq->queuelist, &q->queue_head);

[..]
>  /*
> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index b9e1e15..5ef136c 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -394,7 +394,7 @@ static void raid_unplug(struct dm_target_callbacks *cb)
>  {
>  	struct raid_set *rs = container_of(cb, struct raid_set, callbacks);
>  
> -	md_raid5_unplug_device(rs->md.private);
> +	md_raid5_kick_device(rs->md.private);

With all the unplug logic gone, I think we can get rid of the
blk_sync_queue() call from md. It looks like md was syncing the queue
just to make sure that unplug_fn was not called again; now that that
logic is gone, the call should be redundant.

We can probably also get rid of some of the queue_lock-taking instances
in the md code. NeilBrown recently put in the following patch, which
takes the queue lock only around the plug functions. Now that queue
plugging is gone, that locking should no longer be required.

commit da9cf5050a2e3dbc3cf26a8d908482eb4485ed49
Author: NeilBrown <neilb@suse.de>
Date:   Mon Feb 21 18:25:57 2011 +1100

    md: avoid spinlock problem in blk_throtl_exit

Thanks
Vivek


* Re: [PATCH 07/10] fs: make generic file read/write functions plug
  2011-01-22  1:17 ` [PATCH 07/10] fs: make generic file read/write functions plug Jens Axboe
  2011-01-24  3:57   ` Dave Chinner
@ 2011-03-04  4:09   ` Vivek Goyal
  2011-03-04 13:22     ` Jens Axboe
  1 sibling, 1 reply; 152+ messages in thread
From: Vivek Goyal @ 2011-03-04  4:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jaxboe, linux-kernel, hch

On Sat, Jan 22, 2011 at 01:17:26AM +0000, Jens Axboe wrote:
> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> ---
>  mm/filemap.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 380776c..f9a29c8 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1243,12 +1243,15 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
>  	unsigned long seg = 0;
>  	size_t count;
>  	loff_t *ppos = &iocb->ki_pos;
> +	struct blk_plug plug;
>  
>  	count = 0;
>  	retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
>  	if (retval)
>  		return retval;
>  
> +	blk_start_plug(&plug);
> +

Jens,

IIUC, read requests will be considered SYNC, and it looks like
__make_request() will dispatch all SYNC requests immediately. If that's
the case, then the blk_plug mechanism is not needed for the read path?

Thanks
Vivek


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-03 22:13     ` Mike Snitzer
@ 2011-03-04 13:02       ` Shaohua Li
  2011-03-04 13:20         ` Jens Axboe
  2011-03-04 21:43         ` Mike Snitzer
  2011-03-08 12:16       ` Jens Axboe
  1 sibling, 2 replies; 152+ messages in thread
From: Shaohua Li @ 2011-03-04 13:02 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-kernel, hch

[-- Attachment #1: Type: text/plain, Size: 878 bytes --]

2011/3/4 Mike Snitzer <snitzer@redhat.com>:
> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
> kernel, when I try an fsync heavy workload to a request-based mpath
> device (the kernel ultimately goes down in flames, I've yet to look at
> the crashdump I took)
>
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.38-rc6-snitm+ #2
> -------------------------------------------------------
> ffsb/3110 is trying to acquire lock:
>  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff811b4c4d>] flush_plug_list+0xbc/0x135
>
> but task is already holding lock:
>  (&rq->lock){-.-.-.}, at: [<ffffffff8137132f>] schedule+0x16a/0x725
>
> which lock already depends on the new lock.
I hit this too. Can you check if the attached debug patch fixes it?

Thanks,
Shaohua

[-- Attachment #2: stack-plug-dbg.patch --]
[-- Type: text/x-patch, Size: 1463 bytes --]

diff --git a/block/blk-core.c b/block/blk-core.c
index 4984b46..4924aa0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1100,6 +1100,7 @@ static bool attempt_plug_merge(struct task_struct *tsk, struct request_queue *q,
 	struct request *rq;
 	bool ret = false;
 
+	preempt_disable();
 	plug = tsk->plug;
 	if (!plug)
 		goto out;
@@ -1122,6 +1123,7 @@ static bool attempt_plug_merge(struct task_struct *tsk, struct request_queue *q,
 		}
 	}
 out:
+	preempt_enable();
 	return ret;
 }
 
diff --git a/kernel/sched.c b/kernel/sched.c
index e806446..7b4b2f9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3938,6 +3938,14 @@ asmlinkage void __sched schedule(void)
 	struct rq *rq;
 	int cpu;
 
+	/*
+	 * If this task has IO plugged, make sure it gets flushed out to the
+	 * devices before we go to sleep. Must be called before below lock,
+	 * otherwise there is deadlock
+	 */
+	blk_flush_plug(current);
+	BUG_ON(current->plug && !list_empty(&current->plug->list));
+
 need_resched:
 	preempt_disable();
 	cpu = smp_processor_id();
@@ -3973,14 +3981,6 @@ need_resched_nonpreemptible:
 				if (to_wakeup)
 					try_to_wake_up_local(to_wakeup);
 			}
-			/*
-			 * If this task has IO plugged, make sure it
-			 * gets flushed out to the devices before we go
-			 * to sleep
-			 */
-			blk_flush_plug(prev);
-			BUG_ON(prev->plug && !list_empty(&prev->plug->list));
-
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 		}
 		switch_count = &prev->nvcsw;


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-04 13:02       ` Shaohua Li
@ 2011-03-04 13:20         ` Jens Axboe
  2011-03-04 21:43         ` Mike Snitzer
  1 sibling, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-03-04 13:20 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Mike Snitzer, linux-kernel, hch

On 2011-03-04 14:02, Shaohua Li wrote:
> 2011/3/4 Mike Snitzer <snitzer@redhat.com>:
>> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
>> kernel, when I try an fsync heavy workload to a request-based mpath
>> device (the kernel ultimately goes down in flames, I've yet to look at
>> the crashdump I took)
>>
>>
>> =======================================================
>> [ INFO: possible circular locking dependency detected ]
>> 2.6.38-rc6-snitm+ #2
>> -------------------------------------------------------
>> ffsb/3110 is trying to acquire lock:
>>  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff811b4c4d>] flush_plug_list+0xbc/0x135
>>
>> but task is already holding lock:
>>  (&rq->lock){-.-.-.}, at: [<ffffffff8137132f>] schedule+0x16a/0x725
>>
>> which lock already depends on the new lock.
> I hit this too. Can you check if attached debug patch fixes it?

I'll take a look at this. It would be really nice if we could move the
plug flush outside of the runqueue lock.

-- 
Jens Axboe



* Re: [PATCH 07/10] fs: make generic file read/write functions plug
  2011-03-04  4:09   ` Vivek Goyal
@ 2011-03-04 13:22     ` Jens Axboe
  2011-03-04 13:25       ` hch
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-04 13:22 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, hch

On 2011-03-04 05:09, Vivek Goyal wrote:
> On Sat, Jan 22, 2011 at 01:17:26AM +0000, Jens Axboe wrote:
>> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
>> ---
>>  mm/filemap.c |    7 +++++++
>>  1 files changed, 7 insertions(+), 0 deletions(-)
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 380776c..f9a29c8 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1243,12 +1243,15 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
>>  	unsigned long seg = 0;
>>  	size_t count;
>>  	loff_t *ppos = &iocb->ki_pos;
>> +	struct blk_plug plug;
>>  
>>  	count = 0;
>>  	retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
>>  	if (retval)
>>  		return retval;
>>  
>> +	blk_start_plug(&plug);
>> +
> 
> Jens,
> 
> IIUC, read requests will be considered SYNC and it looks like that
> __make_request() will dispatch all the SYNC requests immediately. If
> that's the case then for read path blk_plug mechanism is not required?

Good catch, we need to modify that logic. If the task is currently
plugged, it should not dispatch until blk_finish_plug() is called.
Essentially, SYNC will no longer control dispatch. That will allow us to
clean up that logic, too.

-- 
Jens Axboe



* Re: [PATCH 07/10] fs: make generic file read/write functions plug
  2011-03-04 13:22     ` Jens Axboe
@ 2011-03-04 13:25       ` hch
  2011-03-04 13:40         ` Jens Axboe
  2011-03-08 12:38         ` Jens Axboe
  0 siblings, 2 replies; 152+ messages in thread
From: hch @ 2011-03-04 13:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vivek Goyal, linux-kernel, hch

On Fri, Mar 04, 2011 at 02:22:21PM +0100, Jens Axboe wrote:
> Good catch, we need to modify that logic. If the task is currently
> plugged, it should not dispatch until blk_finish_plug() is called.
> Essentially SYNC will not control dispatch. Will allow us to clean up
> that logic, too.

<broken-record-mode>

Time to use the opportunity to sort out what the various bio/request
flags mean.

REQ_UNPLUG should simply go away with the explicit stack plugging.
What's left for REQ_SYNC?  It'll control if the request goes into the
sync bucket and some cfq tweaks.  We should clearly document what it
does.

REQ_META?  Maybe we should finally agree what it does and decide if it
should be used consistently.  Especially the priority over REQ_SYNC in
cfq still looks somewhat odd, as does the totally inconsistent use.

</broken-record-mode>


* Re: [PATCH 07/10] fs: make generic file read/write functions plug
  2011-03-04 13:25       ` hch
@ 2011-03-04 13:40         ` Jens Axboe
  2011-03-04 14:08           ` hch
  2011-03-08 12:38         ` Jens Axboe
  1 sibling, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-04 13:40 UTC (permalink / raw)
  To: hch; +Cc: Vivek Goyal, linux-kernel

On 2011-03-04 14:25, hch@infradead.org wrote:
> On Fri, Mar 04, 2011 at 02:22:21PM +0100, Jens Axboe wrote:
>> Good catch, we need to modify that logic. If the task is currently
>> plugged, it should not dispatch until blk_finish_plug() is called.
>> Essentially SYNC will not control dispatch. Will allow us to clean up
>> that logic, too.
> 
> <broken-record-mode>
> 
> Time to use the opportunity to sort out what the various bio/request
> flags mean.
> 
> REQ_UNPLUG should simply go away with the explicit stack plugging.
> What's left for REQ_SYNC?  It'll control if the request goes into the
> sync bucket and some cfq tweaks.  We should clearly document what it
> does.

Yes, REQ_UNPLUG goes away, it has no meaning anymore since the plugging
is explicitly done by the caller.

With REQ_SYNC, let's make the sync/async distinction apply to both
reads and writes. Right now reads are inherently sync, while writes
sometimes are (O_DIRECT). So let's stop making it murkier by mixing up
READ and SYNC.

> REQ_META?  Maybe we should finally agree what it does and decide if it
> should be used consistently.  Especially the priority over REQ_SYNC in
> cfq still looks somewhat odd, as does the totally inconsistent use.

For me it was primarily a blktrace hint, but yes it does have prio boost
properties in CFQ as well. I'm inclined to let those stay the way they
are. I'm not sure we can properly call it anything more than a hint that
these IOs have slightly higher priority; at least I would not want to
lock the IO scheduler into something more concrete than that.

-- 
Jens Axboe



* Re: [PATCH 07/10] fs: make generic file read/write functions plug
  2011-03-04 13:40         ` Jens Axboe
@ 2011-03-04 14:08           ` hch
  2011-03-04 22:07             ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: hch @ 2011-03-04 14:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: hch, Vivek Goyal, linux-kernel

On Fri, Mar 04, 2011 at 02:40:09PM +0100, Jens Axboe wrote:
> > REQ_META?  Maybe we should finally agree what it does and decide if it
> > should be used consistently.  Especially the priority over REQ_SYNC in
> > cfq still looks somewhat odd, as does the totally inconsistent use.
> 
> For me it was primarily a blktrace hint, but yes it does have prio boost
> properties in CFQ as well. I'm inclined to let those stay the way they
> are. Not sure we can properly call it anything outside of a hint that
> these IO have slightly higher priority, at least I would not want to
> lock the IO scheduler into something more concrete than that.

The problem is that these two meanings are inherently conflicting.
Metadata updates do not have to be synchronous or have any kind of
priority.  In fact for XFS they normally aren't, and I'd be surprised if
the same isn't true for other journalled filesystems.

Priority metadata goes into the log, which for XFS is always written as
a FLUSH+FUA bio.  Writeback of metadata happens asynchronously in the
background, and only becomes a priority if the journal is full and we'll
need to make space available.

So giving plain REQ_META a priority boost makes it impossible to use it
for the blktrace annotation use case.  One could only apply the boost
for the REQ_SYNC + REQ_META combination, but even that seems rather
hackish to me.  I'd really love to see numbers where the additional
boost of REQ_META over REQ_SYNC makes any difference.



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-04 13:02       ` Shaohua Li
  2011-03-04 13:20         ` Jens Axboe
@ 2011-03-04 21:43         ` Mike Snitzer
  2011-03-04 21:50           ` Jens Axboe
  1 sibling, 1 reply; 152+ messages in thread
From: Mike Snitzer @ 2011-03-04 21:43 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Jens Axboe, linux-kernel, hch

On Fri, Mar 04 2011 at  8:02am -0500,
Shaohua Li <shli@kernel.org> wrote:

> 2011/3/4 Mike Snitzer <snitzer@redhat.com>:
> > I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
> > kernel, when I try an fsync heavy workload to a request-based mpath
> > device (the kernel ultimately goes down in flames, I've yet to look at
> > the crashdump I took)
> >
> >
> > =======================================================
> > [ INFO: possible circular locking dependency detected ]
> > 2.6.38-rc6-snitm+ #2
> > -------------------------------------------------------
> > ffsb/3110 is trying to acquire lock:
> >  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff811b4c4d>] flush_plug_list+0xbc/0x135
> >
> > but task is already holding lock:
> >  (&rq->lock){-.-.-.}, at: [<ffffffff8137132f>] schedule+0x16a/0x725
> >
> > which lock already depends on the new lock.
> I hit this too. Can you check if attached debug patch fixes it?

Fixes it for me.

Thanks,
Mike


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-04 21:43         ` Mike Snitzer
@ 2011-03-04 21:50           ` Jens Axboe
  2011-03-04 22:27             ` Mike Snitzer
  2011-03-07  0:54             ` Shaohua Li
  0 siblings, 2 replies; 152+ messages in thread
From: Jens Axboe @ 2011-03-04 21:50 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Shaohua Li, linux-kernel, hch

On 2011-03-04 22:43, Mike Snitzer wrote:
> On Fri, Mar 04 2011 at  8:02am -0500,
> Shaohua Li <shli@kernel.org> wrote:
> 
>> 2011/3/4 Mike Snitzer <snitzer@redhat.com>:
>>> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
>>> kernel, when I try an fsync heavy workload to a request-based mpath
>>> device (the kernel ultimately goes down in flames, I've yet to look at
>>> the crashdump I took)
>>>
>>>
>>> =======================================================
>>> [ INFO: possible circular locking dependency detected ]
>>> 2.6.38-rc6-snitm+ #2
>>> -------------------------------------------------------
>>> ffsb/3110 is trying to acquire lock:
>>>  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff811b4c4d>] flush_plug_list+0xbc/0x135
>>>
>>> but task is already holding lock:
>>>  (&rq->lock){-.-.-.}, at: [<ffffffff8137132f>] schedule+0x16a/0x725
>>>
>>> which lock already depends on the new lock.
>> I hit this too. Can you check if attached debug patch fixes it?
> 
> Fixes it for me.

The preempt bit in block/ should not be needed. Can you check whether
it's the moving of the flush in sched.c that does the trick?

The problem with the current spot is that it's under the runqueue lock.
The problem with the modified variant is that we flush even if the task
is not going to sleep. We really just want to flush when it is going to
move out of the runqueue, but we want to do that outside of the runqueue
lock as well.

-- 
Jens Axboe



* Re: [PATCH 07/10] fs: make generic file read/write functions plug
  2011-03-04 14:08           ` hch
@ 2011-03-04 22:07             ` Jens Axboe
  2011-03-04 23:12               ` hch
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-04 22:07 UTC (permalink / raw)
  To: hch; +Cc: Vivek Goyal, linux-kernel

On 2011-03-04 15:08, hch@infradead.org wrote:
> On Fri, Mar 04, 2011 at 02:40:09PM +0100, Jens Axboe wrote:
>>> REQ_META?  Maybe we should finally agree what it does and decide if it
>>> should be used consistently.  Especially the priority over REQ_SYNC in
>>> cfq still looks somewhat odd, as does the totally inconsistent use.
>>
>> For me it was primarily a blktrace hint, but yes it does have prio boost
>> properties in CFQ as well. I'm inclined to let those stay the way they
>> are. Not sure we can properly call it anything outside of a hint that
>> these IO have slightly higher priority, at least I would not want to
>> lock the IO scheduler into something more concrete than that.
> 
> The problem is that these two meanings are inherently conflicting.
> Metadata updates do not have to be synchronous or have any kind of
> priority.  In fact for XFS they normally aren't, and I'd be surprised if
> the same isn't true for other journalled filesystems.
> 
> Priority metadata goes into the log, which for XFS is always written as
> a FLUSH+FUA bio.  Writeback of metadata happens asynchronously in the
> background, and only becomes a priority if the journal is full and we'll
> need to make space available.
> 
> So giving plain REQ_META a priority boost makes it impossible to use it
> for the blktrace annotation use case.  One could only apply the boost
> for the REQ_SYNC + REQ_META combination, but even that seems rather
> hackish to me.  I'd really love to see numbers where the additional
> boost of REQ_META over REQ_SYNC makes any difference.

Seems only gfs2 actually sets the flag now. So how about we remove the
prio boost for meta data and just retain it as an information attribute?

If there's a need for these slight boosts, it is probably best
signalled explicitly by the caller.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-04 21:50           ` Jens Axboe
@ 2011-03-04 22:27             ` Mike Snitzer
  2011-03-05 20:54               ` Jens Axboe
  2011-03-07  0:54             ` Shaohua Li
  1 sibling, 1 reply; 152+ messages in thread
From: Mike Snitzer @ 2011-03-04 22:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Shaohua Li, linux-kernel, hch

On Fri, Mar 04 2011 at  4:50pm -0500,
Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2011-03-04 22:43, Mike Snitzer wrote:
> > On Fri, Mar 04 2011 at  8:02am -0500,
> > Shaohua Li <shli@kernel.org> wrote:
> > 
> >> 2011/3/4 Mike Snitzer <snitzer@redhat.com>:
> >>> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
> >>> kernel, when I try an fsync heavy workload to a request-based mpath
> >>> device (the kernel ultimately goes down in flames, I've yet to look at
> >>> the crashdump I took)
> >>>
> >>>
> >>> =======================================================
> >>> [ INFO: possible circular locking dependency detected ]
> >>> 2.6.38-rc6-snitm+ #2
> >>> -------------------------------------------------------
> >>> ffsb/3110 is trying to acquire lock:
> >>>  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff811b4c4d>] flush_plug_list+0xbc/0x135
> >>>
> >>> but task is already holding lock:
> >>>  (&rq->lock){-.-.-.}, at: [<ffffffff8137132f>] schedule+0x16a/0x725
> >>>
> >>> which lock already depends on the new lock.
> >> I hit this too. Can you check if attached debug patch fixes it?
> > 
> > Fixes it for me.
> 
> The preempt bit in block/ should not be needed. Can you check whether
> it's the moving of the flush in sched.c that does the trick?

It works if I leave out the blk-core.c preempt change too.

> The problem with the current spot is that it's under the runqueue lock.
> The problem with the modified variant is that we flush even if the task
> is not going to sleep. We really just want to flush when it is going to
> move out of the runqueue, but we want to do that outside of the runqueue
> lock as well.

OK. So we still need a proper fix for this issue.


* Re: [PATCH 07/10] fs: make generic file read/write functions plug
  2011-03-04 22:07             ` Jens Axboe
@ 2011-03-04 23:12               ` hch
  0 siblings, 0 replies; 152+ messages in thread
From: hch @ 2011-03-04 23:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: hch, Vivek Goyal, linux-kernel

On Fri, Mar 04, 2011 at 11:07:51PM +0100, Jens Axboe wrote:
> Seems only gfs2 actually sets the flag now. So how about we remove the
> prio boost for meta data and just retain it as an information attribute?

ext3/4 also use it for inodes and directories.  XFS has code to use it,
but it's currently unreachable.


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-04 22:27             ` Mike Snitzer
@ 2011-03-05 20:54               ` Jens Axboe
  2011-03-07 10:23                 ` Peter Zijlstra
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-05 20:54 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Shaohua Li, linux-kernel, hch, Ingo Molnar, Peter Zijlstra

On 2011-03-04 23:27, Mike Snitzer wrote:
> On Fri, Mar 04 2011 at  4:50pm -0500,
> Jens Axboe <jaxboe@fusionio.com> wrote:
> 
>> On 2011-03-04 22:43, Mike Snitzer wrote:
>>> On Fri, Mar 04 2011 at  8:02am -0500,
>>> Shaohua Li <shli@kernel.org> wrote:
>>>
>>>> 2011/3/4 Mike Snitzer <snitzer@redhat.com>:
>>>>> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
>>>>> kernel, when I try an fsync heavy workload to a request-based mpath
>>>>> device (the kernel ultimately goes down in flames, I've yet to look at
>>>>> the crashdump I took)
>>>>>
>>>>>
>>>>> =======================================================
>>>>> [ INFO: possible circular locking dependency detected ]
>>>>> 2.6.38-rc6-snitm+ #2
>>>>> -------------------------------------------------------
>>>>> ffsb/3110 is trying to acquire lock:
>>>>>  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff811b4c4d>] flush_plug_list+0xbc/0x135
>>>>>
>>>>> but task is already holding lock:
>>>>>  (&rq->lock){-.-.-.}, at: [<ffffffff8137132f>] schedule+0x16a/0x725
>>>>>
>>>>> which lock already depends on the new lock.
>>>> I hit this too. Can you check if attached debug patch fixes it?
>>>
>>> Fixes it for me.
>>
>> The preempt bit in block/ should not be needed. Can you check whether
>> it's the moving of the flush in sched.c that does the trick?
> 
> It works if I leave out the blk-core.c preempt change too.
> 
>> The problem with the current spot is that it's under the runqueue lock.
>> The problem with the modified variant is that we flush even if the task
>> is not going to sleep. We really just want to flush when it is going to
>> move out of the runqueue, but we want to do that outside of the runqueue
>> lock as well.
> 
> OK. So we still need a proper fix for this issue.

Apparently so. Peter/Ingo, please shoot this one down in flames.
Summary:

- Need a way to trigger this flushing when a task is going to sleep
- It's currently done right before calling deactivate_task(). We know
  the task is going to sleep here, but it's also under the runqueue
  lock. Not good.
- In the new location, it's not completely clear to me whether we can
  safely deref 'prev' or not. The usage of prev_state would seem to
  indicate that we cannot, and as far as I can tell, prev could at this
  point already potentially be running on another CPU.

Help? Peter, we talked about this in Tokyo in September. Initial
suggestion was to use preempt notifiers, which we can't because:

- runqueue lock is also held
- It's not unconditionally available, depends on config.

diff --git a/kernel/sched.c b/kernel/sched.c
index e806446..8581ad3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2826,6 +2826,14 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
 	finish_lock_switch(rq, prev);
 
+	/*
+	 * If this task has IO plugged, make sure it
+	 * gets flushed out to the devices before we go
+	 * to sleep
+	 */
+	if (prev_state != TASK_RUNNING)
+		blk_flush_plug(prev);
+
 	fire_sched_in_preempt_notifiers(current);
 	if (mm)
 		mmdrop(mm);
@@ -3973,14 +3981,6 @@ need_resched_nonpreemptible:
 				if (to_wakeup)
 					try_to_wake_up_local(to_wakeup);
 			}
-			/*
-			 * If this task has IO plugged, make sure it
-			 * gets flushed out to the devices before we go
-			 * to sleep
-			 */
-			blk_flush_plug(prev);
-			BUG_ON(prev->plug && !list_empty(&prev->plug->list));
-
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 		}
 		switch_count = &prev->nvcsw;
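
For readers following along, the effect the relocation is after (flush only
when the task actually goes to sleep, not on every context switch) can be
modeled in plain userspace C. This is just an illustrative sketch; the struct
and function names below are invented for the model and are not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

#define TASK_RUNNING		0
#define TASK_UNINTERRUPTIBLE	2

/* Toy model: requests are batched in a plug that lives on the
 * submitter's stack and only reach the "device" on a flush. */
struct blk_plug_model {
	int pending;	/* requests queued in the on-stack plug */
	int *device;	/* counter standing in for the device queue */
};

struct task_model {
	int state;
	struct blk_plug_model *plug;	/* NULL if no plug is active */
};

static void blk_flush_plug_model(struct task_model *t)
{
	if (t->plug && t->plug->pending) {
		*t->plug->device += t->plug->pending;
		t->plug->pending = 0;
	}
}

/* The scheduler hook from the patch above: flush only when the
 * previous task is really going to sleep. */
static void sched_flush_on_sleep(struct task_model *prev, int prev_state)
{
	if (prev_state != TASK_RUNNING)
		blk_flush_plug_model(prev);
}
```

A task that is merely preempted (prev_state == TASK_RUNNING) keeps its IO
plugged; only a real sleep pushes the batch out to the device.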

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-04 21:50           ` Jens Axboe
  2011-03-04 22:27             ` Mike Snitzer
@ 2011-03-07  0:54             ` Shaohua Li
  2011-03-07  8:07               ` Jens Axboe
  1 sibling, 1 reply; 152+ messages in thread
From: Shaohua Li @ 2011-03-07  0:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, linux-kernel, hch

2011/3/5 Jens Axboe <jaxboe@fusionio.com>:
> On 2011-03-04 22:43, Mike Snitzer wrote:
>> On Fri, Mar 04 2011 at  8:02am -0500,
>> Shaohua Li <shli@kernel.org> wrote:
>>
>>> 2011/3/4 Mike Snitzer <snitzer@redhat.com>:
>>>> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
>>>> kernel, when I try an fsync heavy workload to a request-based mpath
>>>> device (the kernel ultimately goes down in flames, I've yet to look at
>>>> the crashdump I took)
>>>>
>>>>
>>>> =======================================================
>>>> [ INFO: possible circular locking dependency detected ]
>>>> 2.6.38-rc6-snitm+ #2
>>>> -------------------------------------------------------
>>>> ffsb/3110 is trying to acquire lock:
>>>>  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff811b4c4d>] flush_plug_list+0xbc/0x135
>>>>
>>>> but task is already holding lock:
>>>>  (&rq->lock){-.-.-.}, at: [<ffffffff8137132f>] schedule+0x16a/0x725
>>>>
>>>> which lock already depends on the new lock.
>>> I hit this too. Can you check if attached debug patch fixes it?
>>
>> Fixes it for me.
>
> The preempt bit in block/ should not be needed. Can you check whether
> it's the moving of the flush in sched.c that does the trick?
Yes, it's not related to the lockdep issue, but I think we still need
it. If we get preempted in the middle of attempt_plug_merge() and the
flush runs then, we might hit an incomplete request->biotail list. Am I
missing anything?

Thanks,
Shaohua


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-07  0:54             ` Shaohua Li
@ 2011-03-07  8:07               ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-03-07  8:07 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Mike Snitzer, linux-kernel, hch

On 2011-03-07 01:54, Shaohua Li wrote:
> 2011/3/5 Jens Axboe <jaxboe@fusionio.com>:
>> On 2011-03-04 22:43, Mike Snitzer wrote:
>>> On Fri, Mar 04 2011 at  8:02am -0500,
>>> Shaohua Li <shli@kernel.org> wrote:
>>>
>>>> 2011/3/4 Mike Snitzer <snitzer@redhat.com>:
>>>>> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
>>>>> kernel, when I try an fsync heavy workload to a request-based mpath
>>>>> device (the kernel ultimately goes down in flames, I've yet to look at
>>>>> the crashdump I took)
>>>>>
>>>>>
>>>>> =======================================================
>>>>> [ INFO: possible circular locking dependency detected ]
>>>>> 2.6.38-rc6-snitm+ #2
>>>>> -------------------------------------------------------
>>>>> ffsb/3110 is trying to acquire lock:
>>>>>  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff811b4c4d>] flush_plug_list+0xbc/0x135
>>>>>
>>>>> but task is already holding lock:
>>>>>  (&rq->lock){-.-.-.}, at: [<ffffffff8137132f>] schedule+0x16a/0x725
>>>>>
>>>>> which lock already depends on the new lock.
>>>> I hit this too. Can you check if attached debug patch fixes it?
>>>
>>> Fixes it for me.
>>
>> The preempt bit in block/ should not be needed. Can you check whether
>> it's the moving of the flush in sched.c that does the trick?
> yes, it's not related to the lockdep issue. but I think we still need
> it. if there is a preempt  between attempt_plub_merge(), we do queue
> flush, then we might hit an incomplete list of request->biotail. Am I
> missing anything?

Ah, so it is needed with the other fix you proposed, since we do flush
on preempt then. If we only do the flush on going to sleep, then we
don't need that preemption disable in that section.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-05 20:54               ` Jens Axboe
@ 2011-03-07 10:23                 ` Peter Zijlstra
  2011-03-07 19:43                   ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Peter Zijlstra @ 2011-03-07 10:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, Shaohua Li, linux-kernel, hch, Ingo Molnar

On Sat, 2011-03-05 at 21:54 +0100, Jens Axboe wrote:
> 
> Apparently so. Peter/Ingo, please shoot this one down in flames.
> Summary:
> 
> - Need a way to trigger this flushing when a task is going to sleep
> - It's currently done right before calling deactivate_task(). We know
>   the task is going to sleep here, but it's also under the runqueue
>   lock. Not good.
> - In the new location, it's not completely clear to me whether we can
>   safely deref 'prev' or not. The usage of prev_state would seem to
>   indicate that we cannot, and as far as I can tell, prev could at this
>   point already potentially be running on another CPU.
> 
> Help? Peter, we talked about this in Tokyo in September. Initial
> suggestion was to use preempt notifiers, which we can't because:
> 
> - runqueue lock is also held
> - It's not unconditionally available, depends on config.
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index e806446..8581ad3 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2826,6 +2826,14 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>  #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
>         finish_lock_switch(rq, prev);
>  
> +       /*
> +        * If this task has IO plugged, make sure it
> +        * gets flushed out to the devices before we go
> +        * to sleep
> +        */
> +       if (prev_state != TASK_RUNNING)
> +               blk_flush_plug(prev);
> +
>         fire_sched_in_preempt_notifiers(current);
>         if (mm)
>                 mmdrop(mm);
> @@ -3973,14 +3981,6 @@ need_resched_nonpreemptible:
>                                 if (to_wakeup)
>                                         try_to_wake_up_local(to_wakeup);
>                         }
> -                       /*
> -                        * If this task has IO plugged, make sure it
> -                        * gets flushed out to the devices before we go
> -                        * to sleep
> -                        */
> -                       blk_flush_plug(prev);
> -                       BUG_ON(prev->plug && !list_empty(&prev->plug->list));
> -
>                         deactivate_task(rq, prev, DEQUEUE_SLEEP);
>                 }
>                 switch_count = &prev->nvcsw;
> 

Right, so your new location is still under rq->lock for a number of
architectures (including x86). finish_lock_switch() doesn't actually
release the lock unless __ARCH_WANT_INTERRUPTS_ON_CTXSW ||
__ARCH_WANT_UNLOCKED_CTXSW (the former implies the latter since rq->lock
is IRQ-safe).

If you want a safe place to drop rq->lock (but keep in mind to keep IRQs
disabled there) and use prev, do something like the below. Both
pre_schedule() and idle_balance() can already drop the rq->lock, so doing
it once more is quite all-right ;-)

Note that once you drop rq->lock prev->state can change to TASK_RUNNING
again so don't re-check that.

---
 kernel/sched.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 655164e..99c5637 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4120,8 +4120,12 @@ need_resched_nonpreemptible:
 		switch_count = &prev->nvcsw;
 	}
 
+	if (prev->state != TASK_RUNNING) {
+		raw_spin_unlock(&rq->lock);
+		blk_flush_plug(prev);
+		raw_spin_lock(&rq->lock);
+	}
 	pre_schedule(rq, prev);
-
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
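
The shape of that change (drop the lock, do the expensive work, retake it)
can be sketched in userspace C. The "lock" here is a plain flag rather than
rq->lock, and every name below is made up for the illustration:

```c
#include <assert.h>

/* Toy runqueue lock: a flag is enough to show the invariant that
 * the flush must run with the lock *not* held. */
static int rq_lock_held;
static int flushed_while_locked = -1;	/* lock state observed at flush time */

static void rq_lock_model(void)   { rq_lock_held = 1; }
static void rq_unlock_model(void) { rq_lock_held = 0; }

static void blk_flush_plug_model(void)
{
	flushed_while_locked = rq_lock_held;
}

/* Mirrors the hunk above: unlock around the flush, relock afterwards,
 * then carry on with the pre_schedule()/idle_balance() equivalents. */
static void schedule_slowpath(int task_is_sleeping)
{
	rq_lock_model();
	if (task_is_sleeping) {
		rq_unlock_model();
		blk_flush_plug_model();	/* may take queue locks itself */
		rq_lock_model();
	}
	/* pre_schedule() / idle_balance() would run here, still locked */
	rq_unlock_model();
}
```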
 



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-07 10:23                 ` Peter Zijlstra
@ 2011-03-07 19:43                   ` Jens Axboe
  2011-03-07 20:41                     ` Peter Zijlstra
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-07 19:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Snitzer, Shaohua Li, linux-kernel, hch, Ingo Molnar

On 2011-03-07 11:23, Peter Zijlstra wrote:
> On Sat, 2011-03-05 at 21:54 +0100, Jens Axboe wrote:
>>
>> Apparently so. Peter/Ingo, please shoot this one down in flames.
>> Summary:
>>
>> - Need a way to trigger this flushing when a task is going to sleep
>> - It's currently done right before calling deactivate_task(). We know
>>   the task is going to sleep here, but it's also under the runqueue
>>   lock. Not good.
>> - In the new location, it's not completely clear to me whether we can
>>   safely deref 'prev' or not. The usage of prev_state would seem to
>>   indicate that we cannot, and as far as I can tell, prev could at this
>>   point already potentially be running on another CPU.
>>
>> Help? Peter, we talked about this in Tokyo in September. Initial
>> suggestion was to use preempt notifiers, which we can't because:
>>
>> - runqueue lock is also held
>> - It's not unconditionally available, depends on config.
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index e806446..8581ad3 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -2826,6 +2826,14 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>>  #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
>>         finish_lock_switch(rq, prev);
>>  
>> +       /*
>> +        * If this task has IO plugged, make sure it
>> +        * gets flushed out to the devices before we go
>> +        * to sleep
>> +        */
>> +       if (prev_state != TASK_RUNNING)
>> +               blk_flush_plug(prev);
>> +
>>         fire_sched_in_preempt_notifiers(current);
>>         if (mm)
>>                 mmdrop(mm);
>> @@ -3973,14 +3981,6 @@ need_resched_nonpreemptible:
>>                                 if (to_wakeup)
>>                                         try_to_wake_up_local(to_wakeup);
>>                         }
>> -                       /*
>> -                        * If this task has IO plugged, make sure it
>> -                        * gets flushed out to the devices before we go
>> -                        * to sleep
>> -                        */
>> -                       blk_flush_plug(prev);
>> -                       BUG_ON(prev->plug && !list_empty(&prev->plug->list));
>> -
>>                         deactivate_task(rq, prev, DEQUEUE_SLEEP);
>>                 }
>>                 switch_count = &prev->nvcsw;
>>
> 
> Right, so your new location is still under rq->lock for a number of
> architectures (including x86). finish_lock_switch() doesn't actually
> release the lock unless __ARCH_WANT_INTERRUPTS_ON_CTXSW ||
> __ARCH_WANT_UNLOCKED_CTXSW (the former implies the latter since rq->lock
> is IRQ-safe).

Ah, thanks for that.

> If you want a safe place to drop rq->lock (but keep in mind to keep IRQs
> disabled there) and use prev, do something like the below. Both
> pre_schedule() and idle_balance() can already drop the rq->lock do doing
> it once more is quite all-right ;-)
> 
> Note that once you drop rq->lock prev->state can change to TASK_RUNNING
> again so don't re-check that.

So that's a problem. If I end up flushing this structure that sits on
the stack of the process, I cannot have it running on another CPU at
that time.

I need the process to be in such a state that it will not get scheduled
on another CPU before this has completed.

Is that even possible? If not, then I think the best solution is to
flush on preempt as well, and hence move it up a bit like Shaohua
posted. This is also how it was originally done, but I wanted to avoid
that if at all possible.


-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-07 19:43                   ` Jens Axboe
@ 2011-03-07 20:41                     ` Peter Zijlstra
  2011-03-07 20:46                       ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Peter Zijlstra @ 2011-03-07 20:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, Shaohua Li, linux-kernel, hch, Ingo Molnar

On Mon, 2011-03-07 at 20:43 +0100, Jens Axboe wrote:
> On 2011-03-07 11:23, Peter Zijlstra wrote:
> > On Sat, 2011-03-05 at 21:54 +0100, Jens Axboe wrote:
> >>
> >> Apparently so. Peter/Ingo, please shoot this one down in flames.
> >> Summary:
> >>
> >> - Need a way to trigger this flushing when a task is going to sleep
> >> - It's currently done right before calling deactivate_task(). We know
> >>   the task is going to sleep here, but it's also under the runqueue
> >>   lock. Not good.
> >> - In the new location, it's not completely clear to me whether we can
> >>   safely deref 'prev' or not. The usage of prev_state would seem to
> >>   indicate that we cannot, and as far as I can tell, prev could at this
> >>   point already potentially be running on another CPU.
> >>
> >> Help? Peter, we talked about this in Tokyo in September. Initial
> >> suggestion was to use preempt notifiers, which we can't because:
> >>
> >> - runqueue lock is also held
> >> - It's not unconditionally available, depends on config.
> >>
> >> diff --git a/kernel/sched.c b/kernel/sched.c
> >> index e806446..8581ad3 100644
> >> --- a/kernel/sched.c
> >> +++ b/kernel/sched.c
> >> @@ -2826,6 +2826,14 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> >>  #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
> >>         finish_lock_switch(rq, prev);
> >>  
> >> +       /*
> >> +        * If this task has IO plugged, make sure it
> >> +        * gets flushed out to the devices before we go
> >> +        * to sleep
> >> +        */
> >> +       if (prev_state != TASK_RUNNING)
> >> +               blk_flush_plug(prev);
> >> +
> >>         fire_sched_in_preempt_notifiers(current);
> >>         if (mm)
> >>                 mmdrop(mm);
> >> @@ -3973,14 +3981,6 @@ need_resched_nonpreemptible:
> >>                                 if (to_wakeup)
> >>                                         try_to_wake_up_local(to_wakeup);
> >>                         }
> >> -                       /*
> >> -                        * If this task has IO plugged, make sure it
> >> -                        * gets flushed out to the devices before we go
> >> -                        * to sleep
> >> -                        */
> >> -                       blk_flush_plug(prev);
> >> -                       BUG_ON(prev->plug && !list_empty(&prev->plug->list));
> >> -
> >>                         deactivate_task(rq, prev, DEQUEUE_SLEEP);
> >>                 }
> >>                 switch_count = &prev->nvcsw;
> >>
> > 
> > Right, so your new location is still under rq->lock for a number of
> > architectures (including x86). finish_lock_switch() doesn't actually
> > release the lock unless __ARCH_WANT_INTERRUPTS_ON_CTXSW ||
> > __ARCH_WANT_UNLOCKED_CTXSW (the former implies the latter since rq->lock
> > is IRQ-safe).
> 
> Ah, thanks for that.
> 
> > If you want a safe place to drop rq->lock (but keep in mind to keep IRQs
> > disabled there) and use prev, do something like the below. Both
> > pre_schedule() and idle_balance() can already drop the rq->lock do doing
> > it once more is quite all-right ;-)
> > 
> > Note that once you drop rq->lock prev->state can change to TASK_RUNNING
> > again so don't re-check that.
> 
> So that's a problem. If I end up flushing this structure that sits on
> the stack of the process, I cannot have it running on another CPU at
> that time.
> 
> I need the process to be in such a state that it will not get scheduled
> on another CPU before this has completed.
> 
> Is that even possible? 

Yes. If prev is flipped back to TASK_RUNNING it will still stay on
that cpu; it will not migrate until the cpu that schedules it away (the
cpu you're on) has flipped rq->curr, and that happens way after this
point. So you're good to go, just don't rely on ->state once you
release rq->lock.


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-07 20:41                     ` Peter Zijlstra
@ 2011-03-07 20:46                       ` Jens Axboe
  2011-03-08  9:38                         ` Peter Zijlstra
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-07 20:46 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Snitzer, Shaohua Li, linux-kernel, hch, Ingo Molnar

On 2011-03-07 21:41, Peter Zijlstra wrote:
> On Mon, 2011-03-07 at 20:43 +0100, Jens Axboe wrote:
>> On 2011-03-07 11:23, Peter Zijlstra wrote:
>>> On Sat, 2011-03-05 at 21:54 +0100, Jens Axboe wrote:
>>>>
>>>> Apparently so. Peter/Ingo, please shoot this one down in flames.
>>>> Summary:
>>>>
>>>> - Need a way to trigger this flushing when a task is going to sleep
>>>> - It's currently done right before calling deactivate_task(). We know
>>>>   the task is going to sleep here, but it's also under the runqueue
>>>>   lock. Not good.
>>>> - In the new location, it's not completely clear to me whether we can
>>>>   safely deref 'prev' or not. The usage of prev_state would seem to
>>>>   indicate that we cannot, and as far as I can tell, prev could at this
>>>>   point already potentially be running on another CPU.
>>>>
>>>> Help? Peter, we talked about this in Tokyo in September. Initial
>>>> suggestion was to use preempt notifiers, which we can't because:
>>>>
>>>> - runqueue lock is also held
>>>> - It's not unconditionally available, depends on config.
>>>>
>>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>>> index e806446..8581ad3 100644
>>>> --- a/kernel/sched.c
>>>> +++ b/kernel/sched.c
>>>> @@ -2826,6 +2826,14 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>>>>  #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
>>>>         finish_lock_switch(rq, prev);
>>>>  
>>>> +       /*
>>>> +        * If this task has IO plugged, make sure it
>>>> +        * gets flushed out to the devices before we go
>>>> +        * to sleep
>>>> +        */
>>>> +       if (prev_state != TASK_RUNNING)
>>>> +               blk_flush_plug(prev);
>>>> +
>>>>         fire_sched_in_preempt_notifiers(current);
>>>>         if (mm)
>>>>                 mmdrop(mm);
>>>> @@ -3973,14 +3981,6 @@ need_resched_nonpreemptible:
>>>>                                 if (to_wakeup)
>>>>                                         try_to_wake_up_local(to_wakeup);
>>>>                         }
>>>> -                       /*
>>>> -                        * If this task has IO plugged, make sure it
>>>> -                        * gets flushed out to the devices before we go
>>>> -                        * to sleep
>>>> -                        */
>>>> -                       blk_flush_plug(prev);
>>>> -                       BUG_ON(prev->plug && !list_empty(&prev->plug->list));
>>>> -
>>>>                         deactivate_task(rq, prev, DEQUEUE_SLEEP);
>>>>                 }
>>>>                 switch_count = &prev->nvcsw;
>>>>
>>>
>>> Right, so your new location is still under rq->lock for a number of
>>> architectures (including x86). finish_lock_switch() doesn't actually
>>> release the lock unless __ARCH_WANT_INTERRUPTS_ON_CTXSW ||
>>> __ARCH_WANT_UNLOCKED_CTXSW (the former implies the latter since rq->lock
>>> is IRQ-safe).
>>
>> Ah, thanks for that.
>>
>>> If you want a safe place to drop rq->lock (but keep in mind to keep IRQs
>>> disabled there) and use prev, do something like the below. Both
>>> pre_schedule() and idle_balance() can already drop the rq->lock do doing
>>> it once more is quite all-right ;-)
>>>
>>> Note that once you drop rq->lock prev->state can change to TASK_RUNNING
>>> again so don't re-check that.
>>
>> So that's a problem. If I end up flushing this structure that sits on
>> the stack of the process, I cannot have it running on another CPU at
>> that time.
>>
>> I need the process to be in such a state that it will not get scheduled
>> on another CPU before this has completed.
>>
>> Is that even possible? 
> 
> Yes, if prev will be flipped back to TASK_RUNNING it will still stay on
> that cpu, it will not migrate until the cpu that schedules it away (the
> cpu you're on) will have flipped rq->curr, and that happens way after
> this point. So you're good to go, just don't rely on ->state once you
> release rq->lock.

Great, that'll work for me! Your patch should work as-is, then. Thanks
Peter.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-07 20:46                       ` Jens Axboe
@ 2011-03-08  9:38                         ` Peter Zijlstra
  2011-03-08  9:41                           ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Peter Zijlstra @ 2011-03-08  9:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, Shaohua Li, linux-kernel, hch, Ingo Molnar

On Mon, 2011-03-07 at 21:46 +0100, Jens Axboe wrote:
> 
> Great, that'll work for me! Your patch should work as-is, then. Thanks
> Peter.

Well I think it would be good to write it like:

  if (prev->state != TASK_RUNNING && blkneeds_flush(prev)) {
    raw_spin_unlock(&rq->lock);
    blk_flush_plug(prev);
    raw_spin_lock(&rq->lock);
  }

To avoid flipping that lock when we don't have to.
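
In the same userspace-model spirit, the point of that extra test is that a
cheap blkneeds_flush()-style check runs before any lock flipping, so the
common no-plugged-IO case never pays for the unlock/relock dance. A
hypothetical sketch (names invented here, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>

struct plug_model { int pending; };

static int lock_flips;	/* counts unlock/flush/relock round trips */

/* Cheap check: does this task have anything plugged at all? */
static int blk_needs_flush_model(const struct plug_model *p)
{
	return p != NULL && p->pending != 0;
}

static void maybe_flush(struct plug_model *p, int going_to_sleep)
{
	/* Only pay for the lock dance when there is plugged IO. */
	if (going_to_sleep && blk_needs_flush_model(p)) {
		lock_flips++;	/* unlock + flush + relock happens only here */
		p->pending = 0;
	}
}
```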


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-08  9:38                         ` Peter Zijlstra
@ 2011-03-08  9:41                           ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-03-08  9:41 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Snitzer, Shaohua Li, linux-kernel, hch, Ingo Molnar

On 2011-03-08 10:38, Peter Zijlstra wrote:
> On Mon, 2011-03-07 at 21:46 +0100, Jens Axboe wrote:
>>
>> Great, that'll work for me! Your patch should work as-is, then. Thanks
>> Peter.
> 
> Well I think it would be good to write it like:
> 
>   if (prev->state != TASK_RUNNING && blkneeds_flush(prev)) {
>     raw_spin_unlock(&rq->lock);
>     blk_flush_plug(prev);
>     raw_spin_lock(&rq->lock);
>   }
> 
> To avoid flipping that lock when we don't have to.

Yes, good point. In any case the need to flush will be an unlikely
event, so saving the lock/unlock dance for when we really need it is a
good optimization.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-03 21:23   ` Mike Snitzer
  2011-03-03 21:27     ` Mike Snitzer
  2011-03-03 22:13     ` Mike Snitzer
@ 2011-03-08 12:15     ` Jens Axboe
  2 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-03-08 12:15 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-kernel, hch

On 2011-03-03 22:23, Mike Snitzer wrote:
>> diff --git a/block/blk-flush.c b/block/blk-flush.c
>> index 54b123d..c0a07aa 100644
>> --- a/block/blk-flush.c
>> +++ b/block/blk-flush.c
>> @@ -59,7 +59,6 @@ static struct request *blk_flush_complete_seq(struct request_queue *q,
>>  static void blk_flush_complete_seq_end_io(struct request_queue *q,
>>                                          unsigned seq, int error)
>>  {
>> -       bool was_empty = elv_queue_empty(q);
>>        struct request *next_rq;
>>
>>        next_rq = blk_flush_complete_seq(q, seq, error);
>> @@ -68,7 +67,7 @@ static void blk_flush_complete_seq_end_io(struct request_queue *q,
>>         * Moving a request silently to empty queue_head may stall the
>>         * queue.  Kick the queue in those cases.
>>         */
>> -       if (was_empty && next_rq)
>> +       if (next_rq)
>>                __blk_run_queue(q);
>>  }
>>
> ...
>> diff --git a/block/elevator.c b/block/elevator.c
>> index a9fe237..d5d17a4 100644
>> --- a/block/elevator.c
>> +++ b/block/elevator.c
>> @@ -619,8 +619,6 @@ void elv_quiesce_end(struct request_queue *q)
> ...
>> -int elv_queue_empty(struct request_queue *q)
>> -{
>> -       struct elevator_queue *e = q->elevator;
>> -
>> -       if (!list_empty(&q->queue_head))
>> -               return 0;
>> -
>> -       if (e->ops->elevator_queue_empty_fn)
>> -               return e->ops->elevator_queue_empty_fn(q);
>> -
>> -       return 1;
>> -}
>> -EXPORT_SYMBOL(elv_queue_empty);
>> -
> 
> Your latest 'for-2.6.39/stack-unplug' rebase (commit 7703acb01e)
> misses removing a call to elv_queue_empty() in
> block/blk-flush.c:flush_data_end_io()
> 
>   CC      block/blk-flush.o
> block/blk-flush.c: In function ‘flush_data_end_io’:
> block/blk-flush.c:266: error: implicit declaration of function ‘elv_queue_empty’

Thanks, also fixed now.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-03 22:13     ` Mike Snitzer
  2011-03-04 13:02       ` Shaohua Li
@ 2011-03-08 12:16       ` Jens Axboe
  2011-03-08 20:21         ` Mike Snitzer
  1 sibling, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-08 12:16 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-kernel, hch

On 2011-03-03 23:13, Mike Snitzer wrote:
> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
> kernel, when I try an fsync heavy workload to a request-based mpath
> device (the kernel ultimately goes down in flames, I've yet to look at
> the crashdump I took)

Mike, can you re-run with the current stack-plug branch? I've fixed the
!CONFIG_BLOCK and rebase issues, and also added a change for this flush
on schedule event. It's run outside of the runqueue lock now, so
hopefully that should solve this one.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-04  4:00   ` Vivek Goyal
@ 2011-03-08 12:24     ` Jens Axboe
  2011-03-08 22:10       ` blk-throttle: Use blk_plug in throttle code (Was: Re: [PATCH 05/10] block: remove per-queue plugging) Vivek Goyal
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-08 12:24 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, hch, NeilBrown

On 2011-03-04 05:00, Vivek Goyal wrote:
> On Sat, Jan 22, 2011 at 01:17:24AM +0000, Jens Axboe wrote:
> 
> [..]
>>  mm/page-writeback.c                |    2 +-
>>  mm/readahead.c                     |   12 ---
>>  mm/shmem.c                         |    1 -
>>  mm/swap_state.c                    |    5 +-
>>  mm/swapfile.c                      |   37 --------
>>  mm/vmscan.c                        |    2 +-
>>  118 files changed, 153 insertions(+), 1248 deletions(-)
> 
> block/blk-throttle.c also uses blk_unplug(). We need to get rid of that
> also.

Done.

> [..]
>> @@ -632,8 +630,6 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
>>  		 * don't force unplug of the queue for that case.
>>  		 * Clear unplug_it and fall through.
>>  		 */
> 
> Above comments now seem to be redundant.

Killed.

>> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
>> index b9e1e15..5ef136c 100644
>> --- a/drivers/md/dm-raid.c
>> +++ b/drivers/md/dm-raid.c
>> @@ -394,7 +394,7 @@ static void raid_unplug(struct dm_target_callbacks *cb)
>>  {
>>  	struct raid_set *rs = container_of(cb, struct raid_set, callbacks);
>>  
>> -	md_raid5_unplug_device(rs->md.private);
>> +	md_raid5_kick_device(rs->md.private);
> 
> With all the unplug logic gone, I think we can get rid of blk_sync_queue()
> call from md. It looks like md was syncing the queue just to make sure
> that unplug_fn is not called again. Now all that logic is gone so it
> should be redundant.
> 
> Also we can probably get rid of some queue_lock taking instances in
> md code. NeilBrown recently put following patch in, which is taking
> queue lock only around plug functions. Now queue plugging gone,
> I guess it should not be required.

Agreed on both counts. I'll leave that out for this version, though.

-- 
Jens Axboe



* Re: [PATCH 07/10] fs: make generic file read/write functions plug
  2011-03-04 13:25       ` hch
  2011-03-04 13:40         ` Jens Axboe
@ 2011-03-08 12:38         ` Jens Axboe
  2011-03-09 10:38           ` hch
  1 sibling, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-08 12:38 UTC (permalink / raw)
  To: hch; +Cc: Vivek Goyal, linux-kernel

On 2011-03-04 14:25, hch@infradead.org wrote:
> REQ_UNPLUG should simply go away with the explicit stack plugging.
> What's left for REQ_SYNC?  It'll control if the request goes into the
> sync bucket and some cfq tweaks.  We should clearly document what it
> does.

How about the below? It gets rid of REQ_UNPLUG.

diff --git a/block/blk-core.c b/block/blk-core.c
index 82a4589..7e9715a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1290,7 +1290,7 @@ get_rq:
 	}
 
 	plug = current->plug;
-	if (plug && !sync) {
+	if (plug) {
 		if (!plug->should_sort && !list_empty(&plug->list)) {
 			struct request *__rq;
 
diff --git a/drivers/block/drbd/drbd_actlog.c b/drivers/block/drbd/drbd_actlog.c
index 2096628..aca3024 100644
--- a/drivers/block/drbd/drbd_actlog.c
+++ b/drivers/block/drbd/drbd_actlog.c
@@ -80,7 +80,7 @@ static int _drbd_md_sync_page_io(struct drbd_conf *mdev,
 
 	if ((rw & WRITE) && !test_bit(MD_NO_FUA, &mdev->flags))
 		rw |= REQ_FUA;
-	rw |= REQ_UNPLUG | REQ_SYNC;
+	rw |= REQ_SYNC;
 
 	bio = bio_alloc(GFP_NOIO, 1);
 	bio->bi_bdev = bdev->md_bdev;
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 0b5718e..b0bd27d 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -377,7 +377,7 @@ union p_header {
 #define DP_HARDBARRIER	      1 /* depricated */
 #define DP_RW_SYNC	      2 /* equals REQ_SYNC    */
 #define DP_MAY_SET_IN_SYNC    4
-#define DP_UNPLUG             8 /* equals REQ_UNPLUG  */
+#define DP_UNPLUG             8 /* not used anymore   */
 #define DP_FUA               16 /* equals REQ_FUA     */
 #define DP_FLUSH             32 /* equals REQ_FLUSH   */
 #define DP_DISCARD           64 /* equals REQ_DISCARD */
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 6049cb8..8a43ce0 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2477,12 +2477,11 @@ static u32 bio_flags_to_wire(struct drbd_conf *mdev, unsigned long bi_rw)
 {
 	if (mdev->agreed_pro_version >= 95)
 		return  (bi_rw & REQ_SYNC ? DP_RW_SYNC : 0) |
-			(bi_rw & REQ_UNPLUG ? DP_UNPLUG : 0) |
 			(bi_rw & REQ_FUA ? DP_FUA : 0) |
 			(bi_rw & REQ_FLUSH ? DP_FLUSH : 0) |
 			(bi_rw & REQ_DISCARD ? DP_DISCARD : 0);
 	else
-		return bi_rw & (REQ_SYNC | REQ_UNPLUG) ? DP_RW_SYNC : 0;
+		return bi_rw & REQ_SYNC ? DP_RW_SYNC : 0;
 }
 
 /* Used to send write requests
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 84132f8..8e68be9 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1100,8 +1100,6 @@ next_bio:
 	/* > e->sector, unless this is the first bio */
 	bio->bi_sector = sector;
 	bio->bi_bdev = mdev->ldev->backing_bdev;
-	/* we special case some flags in the multi-bio case, see below
-	 * (REQ_UNPLUG) */
 	bio->bi_rw = rw;
 	bio->bi_private = e;
 	bio->bi_end_io = drbd_endio_sec;
@@ -1130,10 +1128,6 @@ next_bio:
 		bios = bios->bi_next;
 		bio->bi_next = NULL;
 
-		/* strip off REQ_UNPLUG unless it is the last bio */
-		if (bios)
-			bio->bi_rw &= ~REQ_UNPLUG;
-
 		drbd_generic_make_request(mdev, fault_type, bio);
 	} while (bios);
 	return 0;
@@ -1621,12 +1615,11 @@ static unsigned long write_flags_to_bio(struct drbd_conf *mdev, u32 dpf)
 {
 	if (mdev->agreed_pro_version >= 95)
 		return  (dpf & DP_RW_SYNC ? REQ_SYNC : 0) |
-			(dpf & DP_UNPLUG ? REQ_UNPLUG : 0) |
 			(dpf & DP_FUA ? REQ_FUA : 0) |
 			(dpf & DP_FLUSH ? REQ_FUA : 0) |
 			(dpf & DP_DISCARD ? REQ_DISCARD : 0);
 	else
-		return dpf & DP_RW_SYNC ? (REQ_SYNC | REQ_UNPLUG) : 0;
+		return dpf & DP_RW_SYNC ? REQ_SYNC : 0;
 }
 
 /* mirrored write */
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 54bfc27..ca203cb 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -347,7 +347,7 @@ static void write_page(struct bitmap *bitmap, struct page *page, int wait)
 			atomic_inc(&bitmap->pending_writes);
 			set_buffer_locked(bh);
 			set_buffer_mapped(bh);
-			submit_bh(WRITE | REQ_UNPLUG | REQ_SYNC, bh);
+			submit_bh(WRITE | REQ_SYNC, bh);
 			bh = bh->b_this_page;
 		}
 
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 136d4f7..76a5af0 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -352,7 +352,7 @@ static void dispatch_io(int rw, unsigned int num_regions,
 	BUG_ON(num_regions > DM_IO_MAX_REGIONS);
 
 	if (sync)
-		rw |= REQ_SYNC | REQ_UNPLUG;
+		rw |= REQ_SYNC;
 
 	/*
 	 * For multiple regions we need to be careful to rewind
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index e8429ce..2c880e9 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -363,11 +363,8 @@ static int run_io_job(struct kcopyd_job *job)
 
 	if (job->rw == READ)
 		r = dm_io(&io_req, 1, &job->source, NULL);
-	else {
-		if (job->num_dests > 1)
-			io_req.bi_rw |= REQ_UNPLUG;
+	else
 		r = dm_io(&io_req, job->num_dests, job->dests, NULL);
-	}
 
 	return r;
 }
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ca0d79c..28f9c1e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -777,8 +777,7 @@ void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
 	bio->bi_end_io = super_written;
 
 	atomic_inc(&mddev->pending_writes);
-	submit_bio(REQ_WRITE | REQ_SYNC | REQ_UNPLUG | REQ_FLUSH | REQ_FUA,
-		   bio);
+	submit_bio(REQ_WRITE | REQ_SYNC | REQ_FLUSH | REQ_FUA, bio);
 }
 
 void md_super_wait(mddev_t *mddev)
@@ -806,7 +805,7 @@ int sync_page_io(mdk_rdev_t *rdev, sector_t sector, int size,
 	struct completion event;
 	int ret;
 
-	rw |= REQ_SYNC | REQ_UNPLUG;
+	rw |= REQ_SYNC;
 
 	bio->bi_bdev = (metadata_op && rdev->meta_bdev) ?
 		rdev->meta_bdev : rdev->bdev;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 92ac519..b76f7cd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2182,7 +2182,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 	unsigned long nr_written = 0;
 
 	if (wbc->sync_mode == WB_SYNC_ALL)
-		write_flags = WRITE_SYNC_PLUG;
+		write_flags = WRITE_SYNC;
 	else
 		write_flags = WRITE;
 
diff --git a/fs/buffer.c b/fs/buffer.c
index f903f2e..19ae76a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -767,7 +767,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 				 * still in flight on potentially older
 				 * contents.
 				 */
-				write_dirty_buffer(bh, WRITE_SYNC_PLUG);
+				write_dirty_buffer(bh, WRITE_SYNC);
 
 				/*
 				 * Kick off IO for the previous mapping. Note
@@ -1602,14 +1602,11 @@ EXPORT_SYMBOL(unmap_underlying_metadata);
  * prevents this contention from occurring.
  *
  * If block_write_full_page() is called with wbc->sync_mode ==
- * WB_SYNC_ALL, the writes are posted using WRITE_SYNC_PLUG; this
- * causes the writes to be flagged as synchronous writes, but the
- * block device queue will NOT be unplugged, since usually many pages
- * will be pushed to the out before the higher-level caller actually
- * waits for the writes to be completed.  The various wait functions,
- * such as wait_on_writeback_range() will ultimately call sync_page()
- * which will ultimately call blk_run_backing_dev(), which will end up
- * unplugging the device queue.
+ * WB_SYNC_ALL, the writes are posted using WRITE_SYNC; this
+ * causes the writes to be flagged as synchronous writes.
+ * The various wait functions, such as wait_on_writeback_range() will
+ * ultimately call sync_page() which will ultimately call
+ * blk_run_backing_dev(), which will end up unplugging the device queue.
  */
 static int __block_write_full_page(struct inode *inode, struct page *page,
 			get_block_t *get_block, struct writeback_control *wbc,
@@ -1622,7 +1619,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	const unsigned blocksize = 1 << inode->i_blkbits;
 	int nr_underway = 0;
 	int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
-			WRITE_SYNC_PLUG : WRITE);
+			WRITE_SYNC : WRITE);
 
 	BUG_ON(!PageLocked(page));
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index df709b3..4260831 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1173,7 +1173,7 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	struct dio *dio;
 
 	if (rw & WRITE)
-		rw = WRITE_ODIRECT_PLUG;
+		rw = WRITE_ODIRECT;
 
 	if (bdev)
 		bdev_blkbits = blksize_bits(bdev_logical_block_size(bdev));
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 955cc30..e2cd90e 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -310,8 +310,7 @@ static int io_submit_init(struct ext4_io_submit *io,
 	io_end->offset = (page->index << PAGE_CACHE_SHIFT) + bh_offset(bh);
 
 	io->io_bio = bio;
-	io->io_op = (wbc->sync_mode == WB_SYNC_ALL ?
-			WRITE_SYNC_PLUG : WRITE);
+	io->io_op = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
 	io->io_next_block = bh->b_blocknr;
 	return 0;
 }
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index eb01f35..7f1c112 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -121,7 +121,7 @@ __acquires(&sdp->sd_log_lock)
 			lock_buffer(bh);
 			if (test_clear_buffer_dirty(bh)) {
 				bh->b_end_io = end_buffer_write_sync;
-				submit_bh(WRITE_SYNC_PLUG, bh);
+				submit_bh(WRITE_SYNC, bh);
 			} else {
 				unlock_buffer(bh);
 				brelse(bh);
@@ -647,7 +647,7 @@ static void gfs2_ordered_write(struct gfs2_sbd *sdp)
 		lock_buffer(bh);
 		if (buffer_mapped(bh) && test_clear_buffer_dirty(bh)) {
 			bh->b_end_io = end_buffer_write_sync;
-			submit_bh(WRITE_SYNC_PLUG, bh);
+			submit_bh(WRITE_SYNC, bh);
 		} else {
 			unlock_buffer(bh);
 			brelse(bh);
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index bf33f82..48b545a 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -200,7 +200,7 @@ static void buf_lo_before_commit(struct gfs2_sbd *sdp)
 		}
 
 		gfs2_log_unlock(sdp);
-		submit_bh(WRITE_SYNC_PLUG, bh);
+		submit_bh(WRITE_SYNC, bh);
 		gfs2_log_lock(sdp);
 
 		n = 0;
@@ -210,7 +210,7 @@ static void buf_lo_before_commit(struct gfs2_sbd *sdp)
 			gfs2_log_unlock(sdp);
 			lock_buffer(bd2->bd_bh);
 			bh = gfs2_log_fake_buf(sdp, bd2->bd_bh);
-			submit_bh(WRITE_SYNC_PLUG, bh);
+			submit_bh(WRITE_SYNC, bh);
 			gfs2_log_lock(sdp);
 			if (++n >= num)
 				break;
@@ -352,7 +352,7 @@ static void revoke_lo_before_commit(struct gfs2_sbd *sdp)
 		sdp->sd_log_num_revoke--;
 
 		if (offset + sizeof(u64) > sdp->sd_sb.sb_bsize) {
-			submit_bh(WRITE_SYNC_PLUG, bh);
+			submit_bh(WRITE_SYNC, bh);
 
 			bh = gfs2_log_get_buf(sdp);
 			mh = (struct gfs2_meta_header *)bh->b_data;
@@ -369,7 +369,7 @@ static void revoke_lo_before_commit(struct gfs2_sbd *sdp)
 	}
 	gfs2_assert_withdraw(sdp, !sdp->sd_log_num_revoke);
 
-	submit_bh(WRITE_SYNC_PLUG, bh);
+	submit_bh(WRITE_SYNC, bh);
 }
 
 static void revoke_lo_before_scan(struct gfs2_jdesc *jd,
@@ -571,7 +571,7 @@ static void gfs2_write_blocks(struct gfs2_sbd *sdp, struct buffer_head *bh,
 	ptr = bh_log_ptr(bh);
 	
 	get_bh(bh);
-	submit_bh(WRITE_SYNC_PLUG, bh);
+	submit_bh(WRITE_SYNC, bh);
 	gfs2_log_lock(sdp);
 	while(!list_empty(list)) {
 		bd = list_entry(list->next, struct gfs2_bufdata, bd_le.le_list);
@@ -597,7 +597,7 @@ static void gfs2_write_blocks(struct gfs2_sbd *sdp, struct buffer_head *bh,
 		} else {
 			bh1 = gfs2_log_fake_buf(sdp, bd->bd_bh);
 		}
-		submit_bh(WRITE_SYNC_PLUG, bh1);
+		submit_bh(WRITE_SYNC, bh1);
 		gfs2_log_lock(sdp);
 		ptr += 2;
 	}
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index a566331..867b713 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -37,7 +37,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
 	struct buffer_head *bh, *head;
 	int nr_underway = 0;
 	int write_op = REQ_META |
-		(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC_PLUG : WRITE);
+		(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!page_has_buffers(page));
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 34a4861..66be299 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -333,7 +333,7 @@ void journal_commit_transaction(journal_t *journal)
 	 * instead we rely on sync_buffer() doing the unplug for us.
 	 */
 	if (commit_transaction->t_synchronous_commit)
-		write_op = WRITE_SYNC_PLUG;
+		write_op = WRITE_SYNC;
 	spin_lock(&commit_transaction->t_handle_lock);
 	while (commit_transaction->t_updates) {
 		DEFINE_WAIT(wait);
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index f3ad159..3da1cc4 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -137,9 +137,9 @@ static int journal_submit_commit_record(journal_t *journal,
 	if (journal->j_flags & JBD2_BARRIER &&
 	    !JBD2_HAS_INCOMPAT_FEATURE(journal,
 				       JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT))
-		ret = submit_bh(WRITE_SYNC_PLUG | WRITE_FLUSH_FUA, bh);
+		ret = submit_bh(WRITE_SYNC | WRITE_FLUSH_FUA, bh);
 	else
-		ret = submit_bh(WRITE_SYNC_PLUG, bh);
+		ret = submit_bh(WRITE_SYNC, bh);
 
 	*cbh = bh;
 	return ret;
@@ -369,7 +369,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	 * instead we rely on sync_buffer() doing the unplug for us.
 	 */
 	if (commit_transaction->t_synchronous_commit)
-		write_op = WRITE_SYNC_PLUG;
+		write_op = WRITE_SYNC;
 	trace_jbd2_commit_locking(journal, commit_transaction);
 	stats.run.rs_wait = commit_transaction->t_max_wait;
 	stats.run.rs_locked = jiffies;
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 0f83e93..2853ff2 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -509,7 +509,7 @@ static int nilfs_segbuf_write(struct nilfs_segment_buffer *segbuf,
 		 * Last BIO is always sent through the following
 		 * submission.
 		 */
-		rw |= REQ_SYNC | REQ_UNPLUG;
+		rw |= REQ_SYNC;
 		res = nilfs_segbuf_submit_bio(segbuf, &wi, rw);
 	}
 
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 83c1c20..6bbb0ee 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -413,8 +413,7 @@ xfs_submit_ioend_bio(
 	if (xfs_ioend_new_eof(ioend))
 		xfs_mark_inode_dirty(XFS_I(ioend->io_inode));
 
-	submit_bio(wbc->sync_mode == WB_SYNC_ALL ?
-		   WRITE_SYNC_PLUG : WRITE, bio);
+	submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);
 }
 
 STATIC struct bio *
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 16b2864..be50d9e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -128,7 +128,6 @@ enum rq_flag_bits {
 	__REQ_NOIDLE,		/* don't anticipate more IO after this one */
 
 	/* bio only flags */
-	__REQ_UNPLUG,		/* unplug the immediately after submission */
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
 	__REQ_THROTTLED,	/* This bio has already been subjected to
 				 * throttling rules. Don't do it again. */
@@ -172,7 +171,6 @@ enum rq_flag_bits {
 	 REQ_NOIDLE | REQ_FLUSH | REQ_FUA)
 #define REQ_CLONE_MASK		REQ_COMMON_MASK
 
-#define REQ_UNPLUG		(1 << __REQ_UNPLUG)
 #define REQ_RAHEAD		(1 << __REQ_RAHEAD)
 #define REQ_THROTTLED		(1 << __REQ_THROTTLED)
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9f2cf69..543e226 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -135,16 +135,10 @@ struct inodes_stat_t {
  *			block layer could (in theory) choose to ignore this
  *			request if it runs into resource problems.
  * WRITE		A normal async write. Device will be plugged.
- * WRITE_SYNC_PLUG	Synchronous write. Identical to WRITE, but passes down
+ * WRITE_SYNC		Synchronous write. Identical to WRITE, but passes down
  *			the hint that someone will be waiting on this IO
- *			shortly. The device must still be unplugged explicitly,
- *			WRITE_SYNC_PLUG does not do this as we could be
- *			submitting more writes before we actually wait on any
- *			of them.
- * WRITE_SYNC		Like WRITE_SYNC_PLUG, but also unplugs the device
- *			immediately after submission. The write equivalent
- *			of READ_SYNC.
- * WRITE_ODIRECT_PLUG	Special case write for O_DIRECT only.
+ *			shortly. The write equivalent of READ_SYNC.
+ * WRITE_ODIRECT	Special case write for O_DIRECT only.
  * WRITE_FLUSH		Like WRITE_SYNC but with preceding cache flush.
  * WRITE_FUA		Like WRITE_SYNC but data is guaranteed to be on
  *			non-volatile media on completion.
@@ -160,18 +154,14 @@ struct inodes_stat_t {
 #define WRITE			RW_MASK
 #define READA			RWA_MASK
 
-#define READ_SYNC		(READ | REQ_SYNC | REQ_UNPLUG)
+#define READ_SYNC		(READ | REQ_SYNC)
 #define READ_META		(READ | REQ_META)
-#define WRITE_SYNC_PLUG		(WRITE | REQ_SYNC | REQ_NOIDLE)
-#define WRITE_SYNC		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG)
-#define WRITE_ODIRECT_PLUG	(WRITE | REQ_SYNC)
+#define WRITE_SYNC		(WRITE | REQ_SYNC | REQ_NOIDLE)
+#define WRITE_ODIRECT		(WRITE | REQ_SYNC)
 #define WRITE_META		(WRITE | REQ_META)
-#define WRITE_FLUSH		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
-				 REQ_FLUSH)
-#define WRITE_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
-				 REQ_FUA)
-#define WRITE_FLUSH_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
-				 REQ_FLUSH | REQ_FUA)
+#define WRITE_FLUSH		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH)
+#define WRITE_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FUA)
+#define WRITE_FLUSH_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH | REQ_FUA)
 
 #define SEL_IN		1
 #define SEL_OUT		2
diff --git a/kernel/power/block_io.c b/kernel/power/block_io.c
index 83bbc7c..d09dd10 100644
--- a/kernel/power/block_io.c
+++ b/kernel/power/block_io.c
@@ -28,7 +28,7 @@
 static int submit(int rw, struct block_device *bdev, sector_t sector,
 		struct page *page, struct bio **bio_chain)
 {
-	const int bio_rw = rw | REQ_SYNC | REQ_UNPLUG;
+	const int bio_rw = rw | REQ_SYNC;
 	struct bio *bio;
 
 	bio = bio_alloc(__GFP_WAIT | __GFP_HIGH, 1);
diff --git a/mm/page_io.c b/mm/page_io.c
index 2dee975..dc76b4d 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -106,7 +106,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		goto out;
 	}
 	if (wbc->sync_mode == WB_SYNC_ALL)
-		rw |= REQ_SYNC | REQ_UNPLUG;
+		rw |= REQ_SYNC;
 	count_vm_event(PSWPOUT);
 	set_page_writeback(page);
 	unlock_page(page);

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-08 12:16       ` Jens Axboe
@ 2011-03-08 20:21         ` Mike Snitzer
  2011-03-08 20:27           ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Mike Snitzer @ 2011-03-08 20:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch

On Tue, Mar 08 2011 at  7:16am -0500,
Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2011-03-03 23:13, Mike Snitzer wrote:
> > I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
> > kernel, when I try an fsync heavy workload to a request-based mpath
> > device (the kernel ultimately goes down in flames, I've yet to look at
> > the crashdump I took)
> 
> Mike, can you re-run with the current stack-plug branch? I've fixed the
> !CONFIG_BLOCK and rebase issues, and also added a change for this flush
> on schedule event. It's run outside of the runqueue lock now, so
> hopefully that should solve this one.

Works for me, thanks.

Mike


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-08 20:21         ` Mike Snitzer
@ 2011-03-08 20:27           ` Jens Axboe
  2011-03-08 21:36             ` Jeff Moyer
  2011-03-08 22:05             ` Mike Snitzer
  0 siblings, 2 replies; 152+ messages in thread
From: Jens Axboe @ 2011-03-08 20:27 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-kernel, hch

On 2011-03-08 21:21, Mike Snitzer wrote:
> On Tue, Mar 08 2011 at  7:16am -0500,
> Jens Axboe <jaxboe@fusionio.com> wrote:
> 
>> On 2011-03-03 23:13, Mike Snitzer wrote:
>>> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
>>> kernel, when I try an fsync heavy workload to a request-based mpath
>>> device (the kernel ultimately goes down in flames, I've yet to look at
>>> the crashdump I took)
>>
>> Mike, can you re-run with the current stack-plug branch? I've fixed the
>> !CONFIG_BLOCK and rebase issues, and also added a change for this flush
>> on schedule event. It's run outside of the runqueue lock now, so
>> hopefully that should solve this one.
> 
> Works for me, thanks.

Super, thanks! Out of curiosity, did you use dm/md?

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-08 20:27           ` Jens Axboe
@ 2011-03-08 21:36             ` Jeff Moyer
  2011-03-09  7:25               ` Jens Axboe
  2011-03-08 22:05             ` Mike Snitzer
  1 sibling, 1 reply; 152+ messages in thread
From: Jeff Moyer @ 2011-03-08 21:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, linux-kernel, hch

Jens Axboe <jaxboe@fusionio.com> writes:

> On 2011-03-08 21:21, Mike Snitzer wrote:
>> On Tue, Mar 08 2011 at  7:16am -0500,
>> Jens Axboe <jaxboe@fusionio.com> wrote:
>> 
>>> On 2011-03-03 23:13, Mike Snitzer wrote:
>>>> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
>>>> kernel, when I try an fsync heavy workload to a request-based mpath
>>>> device (the kernel ultimately goes down in flames, I've yet to look at
>>>> the crashdump I took)
>>>
>>> Mike, can you re-run with the current stack-plug branch? I've fixed the
>>> !CONFIG_BLOCK and rebase issues, and also added a change for this flush
>>> on schedule event. It's run outside of the runqueue lock now, so
>>> hopefully that should solve this one.
>> 
>> Works for me, thanks.
>
> Super, thanks! Out of curiosity, did you use dm/md?

mm/memory-failure.c: In function 'hwpoison_user_mappings':
mm/memory-failure.c:948: error: implicit declaration of function 'lock_page_nosync'

You missed a conversion of lock_page_nosync -> lock_page.

Cheers,
Jeff


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-08 20:27           ` Jens Axboe
  2011-03-08 21:36             ` Jeff Moyer
@ 2011-03-08 22:05             ` Mike Snitzer
  2011-03-10  0:58               ` Mike Snitzer
  2011-03-17 15:51               ` Mike Snitzer
  1 sibling, 2 replies; 152+ messages in thread
From: Mike Snitzer @ 2011-03-08 22:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch

On Tue, Mar 08 2011 at  3:27pm -0500,
Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2011-03-08 21:21, Mike Snitzer wrote:
> > On Tue, Mar 08 2011 at  7:16am -0500,
> > Jens Axboe <jaxboe@fusionio.com> wrote:
> > 
> >> On 2011-03-03 23:13, Mike Snitzer wrote:
> >>> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
> >>> kernel, when I try an fsync heavy workload to a request-based mpath
> >>> device (the kernel ultimately goes down in flames, I've yet to look at
> >>> the crashdump I took)
> >>
> >> Mike, can you re-run with the current stack-plug branch? I've fixed the
> >> !CONFIG_BLOCK and rebase issues, and also added a change for this flush
> >> on schedule event. It's run outside of the runqueue lock now, so
> >> hopefully that should solve this one.
> > 
> > Works for me, thanks.
> 
> Super, thanks! Out of curiosity, did you use dm/md?

Yes, I've been using a request-based DM multipath device.


* blk-throttle: Use blk_plug in throttle code (Was: Re: [PATCH 05/10] block: remove per-queue plugging)
  2011-03-08 12:24     ` Jens Axboe
@ 2011-03-08 22:10       ` Vivek Goyal
  2011-03-09  7:26         ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Vivek Goyal @ 2011-03-08 22:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch, NeilBrown

On Tue, Mar 08, 2011 at 01:24:01PM +0100, Jens Axboe wrote:
> On 2011-03-04 05:00, Vivek Goyal wrote:
> > On Sat, Jan 22, 2011 at 01:17:24AM +0000, Jens Axboe wrote:
> > 
> > [..]
> >>  mm/page-writeback.c                |    2 +-
> >>  mm/readahead.c                     |   12 ---
> >>  mm/shmem.c                         |    1 -
> >>  mm/swap_state.c                    |    5 +-
> >>  mm/swapfile.c                      |   37 --------
> >>  mm/vmscan.c                        |    2 +-
> >>  118 files changed, 153 insertions(+), 1248 deletions(-)
> > 
> > block/blk-throttle.c also uses blk_unplug(). We need to get rid of that
> > also.
> 
> Done.

Thanks Jens. Looking at the usage of blk_plug, I think it makes sense to
make use of it in throttle dispatch also. Here is the patch.

blk-throttle: Use blk_plug in throttle dispatch

Use plug in throttle dispatch also as we are dispatching a bunch of
bios in throttle context and some of them might merge.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-throttle.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6-block/block/blk-throttle.c
===================================================================
--- linux-2.6-block.orig/block/blk-throttle.c	2011-03-08 16:43:14.000000000 -0500
+++ linux-2.6-block/block/blk-throttle.c	2011-03-08 16:59:44.359620804 -0500
@@ -770,6 +770,7 @@ static int throtl_dispatch(struct reques
 	unsigned int nr_disp = 0;
 	struct bio_list bio_list_on_stack;
 	struct bio *bio;
+	struct blk_plug plug;
 
 	spin_lock_irq(q->queue_lock);
 
@@ -798,8 +799,10 @@ out:
 	 * immediate dispatch
 	 */
 	if (nr_disp) {
+		blk_start_plug(&plug);
 		while((bio = bio_list_pop(&bio_list_on_stack)))
 			generic_make_request(bio);
+		blk_finish_plug(&plug);
 	}
 	return nr_disp;
 }


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-08 21:36             ` Jeff Moyer
@ 2011-03-09  7:25               ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-03-09  7:25 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Mike Snitzer, linux-kernel, hch

On 2011-03-08 22:36, Jeff Moyer wrote:
> Jens Axboe <jaxboe@fusionio.com> writes:
> 
>> On 2011-03-08 21:21, Mike Snitzer wrote:
>>> On Tue, Mar 08 2011 at  7:16am -0500,
>>> Jens Axboe <jaxboe@fusionio.com> wrote:
>>>
>>>> On 2011-03-03 23:13, Mike Snitzer wrote:
>>>>> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
>>>>> kernel, when I try an fsync heavy workload to a request-based mpath
>>>>> device (the kernel ultimately goes down in flames, I've yet to look at
>>>>> the crashdump I took)
>>>>
>>>> Mike, can you re-run with the current stack-plug branch? I've fixed the
>>>> !CONFIG_BLOCK and rebase issues, and also added a change for this flush
>>>> on schedule event. It's run outside of the runqueue lock now, so
>>>> hopefully that should solve this one.
>>>
>>> Works for me, thanks.
>>
>> Super, thanks! Out of curiosity, did you use dm/md?
> 
> mm/memory-failure.c: In function 'hwpoison_user_mappings':
> mm/memory-failure.c:948: error: implicit declaration of function 'lock_page_nosync'
> 
> You missed a conversion of lock_page_nosync -> lock_page.

Thanks Jeff, I guess I should run a full modconfig/yesconfig build again
just to check that everything is still up to date.

-- 
Jens Axboe



* Re: blk-throttle: Use blk_plug in throttle code (Was: Re: [PATCH 05/10]  block: remove per-queue plugging)
  2011-03-08 22:10       ` blk-throttle: Use blk_plug in throttle code (Was: Re: [PATCH 05/10] block: remove per-queue plugging) Vivek Goyal
@ 2011-03-09  7:26         ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-03-09  7:26 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, hch, NeilBrown

On 2011-03-08 23:10, Vivek Goyal wrote:
> On Tue, Mar 08, 2011 at 01:24:01PM +0100, Jens Axboe wrote:
>> On 2011-03-04 05:00, Vivek Goyal wrote:
>>> On Sat, Jan 22, 2011 at 01:17:24AM +0000, Jens Axboe wrote:
>>>
>>> [..]
>>>>  mm/page-writeback.c                |    2 +-
>>>>  mm/readahead.c                     |   12 ---
>>>>  mm/shmem.c                         |    1 -
>>>>  mm/swap_state.c                    |    5 +-
>>>>  mm/swapfile.c                      |   37 --------
>>>>  mm/vmscan.c                        |    2 +-
>>>>  118 files changed, 153 insertions(+), 1248 deletions(-)
>>>
>>> block/blk-throttle.c also uses blk_unplug(). We need to get rid of that
>>> also.
>>
>> Done.
> 
> Thanks Jens. Looking at the usage of blk_plug, I think it makes sense to
> make use of it in throttle dispatch also. Here is the patch.

Yep, it definitely does. Thanks, I'll apply this.

-- 
Jens Axboe



* Re: [PATCH 07/10] fs: make generic file read/write functions plug
  2011-03-08 12:38         ` Jens Axboe
@ 2011-03-09 10:38           ` hch
  2011-03-09 10:52             ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: hch @ 2011-03-09 10:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: hch, Vivek Goyal, linux-kernel

On Tue, Mar 08, 2011 at 01:38:08PM +0100, Jens Axboe wrote:
> -#define DP_UNPLUG             8 /* equals REQ_UNPLUG  */
> +#define DP_UNPLUG             8 /* not used anymore   */

The way this is used might need some more review.  It seems like DRBD is
trying to propagate unplug requests over the wire, which is
functionality we lose now.

> + * The various wait functions, such as wait_on_writeback_range() will
> + * ultimately call sync_page() which will ultimately call
> + * blk_run_backing_dev(), which will end up unplugging the device queue.

This comment describes code that doesn't exist anymore on your branch.



* Re: [PATCH 07/10] fs: make generic file read/write functions plug
  2011-03-09 10:38           ` hch
@ 2011-03-09 10:52             ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-03-09 10:52 UTC (permalink / raw)
  To: hch; +Cc: Vivek Goyal, linux-kernel

On 2011-03-09 11:38, hch@infradead.org wrote:
> On Tue, Mar 08, 2011 at 01:38:08PM +0100, Jens Axboe wrote:
>> -#define DP_UNPLUG             8 /* equals REQ_UNPLUG  */
>> +#define DP_UNPLUG             8 /* not used anymore   */
> 
> The way this is used might need some more review.  It seems like DRBD is
> trying to propagate unplug requests over the wire, which is
> functionality we lose now.

But that use case is pretty questionable. The receiving end should just
do its own plugging, if it's beneficial.

>> + * The various wait functions, such as wait_on_writeback_range() will
>> + * ultimately call sync_page() which will ultimately call
>> + * blk_run_backing_dev(), which will end up unplugging the device queue.
> 
> This comment describes code that doesn't exist anymore on your branch.

Thanks, I'll kill those.


-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-08 22:05             ` Mike Snitzer
@ 2011-03-10  0:58               ` Mike Snitzer
  2011-04-05  3:05                 ` NeilBrown
  2011-03-17 15:51               ` Mike Snitzer
  1 sibling, 1 reply; 152+ messages in thread
From: Mike Snitzer @ 2011-03-10  0:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch, dm-devel, neilb

On Tue, Mar 08 2011 at  5:05pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Mar 08 2011 at  3:27pm -0500,
> Jens Axboe <jaxboe@fusionio.com> wrote:
> 
> > On 2011-03-08 21:21, Mike Snitzer wrote:
> > > On Tue, Mar 08 2011 at  7:16am -0500,
> > > Jens Axboe <jaxboe@fusionio.com> wrote:
> > > 
> > >> On 2011-03-03 23:13, Mike Snitzer wrote:
> > >>> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
> > >>> kernel, when I try an fsync heavy workload to a request-based mpath
> > >>> device (the kernel ultimately goes down in flames, I've yet to look at
> > >>> the crashdump I took)
> > >>
> > >> Mike, can you re-run with the current stack-plug branch? I've fixed the
> > >> !CONFIG_BLOCK and rebase issues, and also added a change for this flush
> > >> on schedule event. It's run outside of the runqueue lock now, so
> > >> hopefully that should solve this one.
> > > 
> > > Works for me, thanks.
> > 
> > Super, thanks! Out of curiosity, did you use dm/md?
> 
> Yes, I've been using a request-based DM multipath device.

Hi Jens,

I just got to reviewing your onstack plugging DM changes (I looked at
the core block layer changes for additional context and also had a brief
look at MD).

I need to put more time into reviewing all this code, but one thing
that is immediately apparent is that after these changes DM only has one
on-stack plug/unplug -- in drivers/md/dm-kcopyd.c:do_work()

You've removed a considerable amount of implicit plug/explicit unplug
code from DM (and obviously elsewhere but I have my DM hat on ;).

First question: is relying on higher-level (aio, fs, read-ahead)
explicit plugging/unplugging sufficient?  Seems odd to not have the
control/need to unplug the DM device upon resume (after a suspend).

(this naive question/concern stems from me needing to understand the
core block layer's onstack plugging changes better)

(but if those higher-level explicit onstack plug changes make all this
code removal possible shouldn't those commits come before changing
underlying block drivers like DM, MD, etc?)

I noticed that drivers/md/dm-raid1.c:do_mirror() seems to follow the
same pattern as drivers/md/dm-kcopyd.c:do_work(), so rather than
removing dm_table_unplug_all(), shouldn't it be replaced with a
blk_start_plug/blk_finish_plug pair?
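To illustrate, the replacement I have in mind would roughly mirror the
do_work() pattern (sketch only; the do_mirror() internals shown here are
abbreviated and partly assumed, not the actual code on the branch):

```c
static void do_mirror(struct work_struct *work)
{
	struct mirror_set *ms = container_of(work, struct mirror_set,
					     kmirrord_work);
	struct blk_plug plug;

	blk_start_plug(&plug);	/* queue the bios in the on-stack plug */
	/* ... existing do_reads()/do_writes() processing, elided ... */
	blk_finish_plug(&plug);	/* submit everything in one batch */
}
```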

Also, in your MD changes, you removed all calls to md_unplug() but
didn't remove md_unplug().  Seems it should be removed along with the
'plug' member of 'struct mddev_t'?  Neil?

Thanks,
Mike

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-01-22  1:17 ` [PATCH 04/10] block: initial patch for on-stack per-task plugging Jens Axboe
  2011-01-24 19:36   ` Jeff Moyer
@ 2011-03-10 16:54   ` Vivek Goyal
  2011-03-10 19:32     ` Jens Axboe
  2011-03-16  8:18   ` Shaohua Li
  2 siblings, 1 reply; 152+ messages in thread
From: Vivek Goyal @ 2011-03-10 16:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jaxboe, linux-kernel, hch, Mike Snitzer

On Sat, Jan 22, 2011 at 01:17:23AM +0000, Jens Axboe wrote:

[..]
> -/*
> - * Only disabling plugging for non-rotational devices if it does tagging
> - * as well, otherwise we do need the proper merging
> - */
> -static inline bool queue_should_plug(struct request_queue *q)
> -{
> -	return !(blk_queue_nonrot(q) && blk_queue_tagged(q));
> -}
> -

Jens,

While discussing stack plugging with Mike Snitzer, it occurred to us
that in the new code we seem to plug even if the underlying device is an
SSD with NCQ. Should we maintain the old behavior of not plugging for
NCQ SSDs?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-03-10 16:54   ` Vivek Goyal
@ 2011-03-10 19:32     ` Jens Axboe
  2011-03-10 19:46       ` Vivek Goyal
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-10 19:32 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, linux-kernel, hch, Mike Snitzer

On 2011-03-10 17:54, Vivek Goyal wrote:
> On Sat, Jan 22, 2011 at 01:17:23AM +0000, Jens Axboe wrote:
> 
> [..]
>> -/*
>> - * Only disabling plugging for non-rotational devices if it does tagging
>> - * as well, otherwise we do need the proper merging
>> - */
>> -static inline bool queue_should_plug(struct request_queue *q)
>> -{
>> -	return !(blk_queue_nonrot(q) && blk_queue_tagged(q));
>> -}
>> -
> 
> Jens,
> 
> While discussing stack plugging with Mike Snitzer, it occurred to us
> that in the new code we seem to plug even if the underlying device is an
> SSD with NCQ. Should we maintain the old behavior of not plugging for
> NCQ SSDs?

The main reason plugging was previously turned off for SSDs was that it
ended up hammering on the queue lock a lot, so it was disabled to speed
them up.

The new plugging scheme is faster than hitting the queue directly, so
now plugging is in fact a good idea. Plus, even for high performance
SSDs, things like merging are still beneficial.

So yes, it's on now, and on purpose.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-03-10 19:32     ` Jens Axboe
@ 2011-03-10 19:46       ` Vivek Goyal
  0 siblings, 0 replies; 152+ messages in thread
From: Vivek Goyal @ 2011-03-10 19:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jens Axboe, linux-kernel, hch, Mike Snitzer

On Thu, Mar 10, 2011 at 08:32:05PM +0100, Jens Axboe wrote:
> On 2011-03-10 17:54, Vivek Goyal wrote:
> > On Sat, Jan 22, 2011 at 01:17:23AM +0000, Jens Axboe wrote:
> > 
> > [..]
> >> -/*
> >> - * Only disabling plugging for non-rotational devices if it does tagging
> >> - * as well, otherwise we do need the proper merging
> >> - */
> >> -static inline bool queue_should_plug(struct request_queue *q)
> >> -{
> >> -	return !(blk_queue_nonrot(q) && blk_queue_tagged(q));
> >> -}
> >> -
> > 
> > Jens,
> > 
> > While discussing stack plugging with Mike Snitzer, it occurred to us
> > that in the new code we seem to plug even if the underlying device is an
> > SSD with NCQ. Should we maintain the old behavior of not plugging for
> > NCQ SSDs?
> 
> The main reason plugging was previously turned off for SSDs was that it
> ended up hammering on the queue lock a lot, so it was disabled to speed
> them up.
> 
> The new plugging scheme is faster than hitting the queue directly, so
> now plugging is in fact a good idea. Plus, even for high performance
> SSDs, things like merging are still beneficial.
> 
> So yes, it's on now, and on purpose.

Ok. Thanks for the explanation. That helps.

Vivek

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-01-22  1:17 ` [PATCH 04/10] block: initial patch for on-stack per-task plugging Jens Axboe
  2011-01-24 19:36   ` Jeff Moyer
  2011-03-10 16:54   ` Vivek Goyal
@ 2011-03-16  8:18   ` Shaohua Li
  2011-03-16 17:31     ` Vivek Goyal
  2011-03-17  9:39     ` Jens Axboe
  2 siblings, 2 replies; 152+ messages in thread
From: Shaohua Li @ 2011-03-16  8:18 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch, Vivek Goyal, jmoyer, shaohua.li

2011/1/22 Jens Axboe <jaxboe@fusionio.com>:
> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> ---
>  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
>  block/elevator.c          |    6 +-
>  include/linux/blk_types.h |    2 +
>  include/linux/blkdev.h    |   30 ++++
>  include/linux/elevator.h  |    1 +
>  include/linux/sched.h     |    6 +
>  kernel/exit.c             |    1 +
>  kernel/fork.c             |    3 +
>  kernel/sched.c            |   11 ++-
>  9 files changed, 317 insertions(+), 100 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 960f12c..42dbfcc 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -27,6 +27,7 @@
>  #include <linux/writeback.h>
>  #include <linux/task_io_accounting_ops.h>
>  #include <linux/fault-inject.h>
> +#include <linux/list_sort.h>
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/block.h>
> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
>
>        q = container_of(work, struct request_queue, delay_work.work);
>        spin_lock_irq(q->queue_lock);
> -       q->request_fn(q);
> +       __blk_run_queue(q);
>        spin_unlock_irq(q->queue_lock);
>  }
Hi Jens,
I have some questions about the per-task plugging. The request list is
per-task, and each task delivers its requests when it finishes a plug
flush or schedules. But when one CPU delivers its requests to the global
queue, other CPUs don't know about it. This seems problematic. For
example:
1. get_request_wait() can only flush the current task's request list;
other CPUs/tasks might still have a lot of requests which haven't been
sent to the request_queue. Your ioc-rq-alloc branch is for this, right?
Will it be pushed to 2.6.39 too? I'm wondering if we should limit the
per-task queue length: if enough requests accumulate there, we force a
flush plug.
2. Some APIs like blk_delay_work(), which call __blk_run_queue(), might
not work, because other CPUs might not have dispatched their requests to
the request queue. So __blk_run_queue() will eventually find no
requests, which might stall devices.
Since one CPU doesn't know about other CPUs' request lists, I'm
wondering if there are other similar issues.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-03-16  8:18   ` Shaohua Li
@ 2011-03-16 17:31     ` Vivek Goyal
  2011-03-17  1:00       ` Shaohua Li
  2011-03-17  9:39     ` Jens Axboe
  1 sibling, 1 reply; 152+ messages in thread
From: Vivek Goyal @ 2011-03-16 17:31 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Jens Axboe, linux-kernel, hch, jmoyer, shaohua.li

On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> 2011/1/22 Jens Axboe <jaxboe@fusionio.com>:
> > Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> > ---
> >  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
> >  block/elevator.c          |    6 +-
> >  include/linux/blk_types.h |    2 +
> >  include/linux/blkdev.h    |   30 ++++
> >  include/linux/elevator.h  |    1 +
> >  include/linux/sched.h     |    6 +
> >  kernel/exit.c             |    1 +
> >  kernel/fork.c             |    3 +
> >  kernel/sched.c            |   11 ++-
> >  9 files changed, 317 insertions(+), 100 deletions(-)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 960f12c..42dbfcc 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -27,6 +27,7 @@
> >  #include <linux/writeback.h>
> >  #include <linux/task_io_accounting_ops.h>
> >  #include <linux/fault-inject.h>
> > +#include <linux/list_sort.h>
> >
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/block.h>
> > @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> >
> >        q = container_of(work, struct request_queue, delay_work.work);
> >        spin_lock_irq(q->queue_lock);
> > -       q->request_fn(q);
> > +       __blk_run_queue(q);
> >        spin_unlock_irq(q->queue_lock);
> >  }
> Hi Jens,
> I have some questions about the per-task plugging. The request list is
> per-task, and each task delivers its requests when it finishes a plug
> flush or schedules. But when one CPU delivers its requests to the global
> queue, other CPUs don't know about it. This seems problematic. For
> example:
> 1. get_request_wait() can only flush the current task's request list;
> other CPUs/tasks might still have a lot of requests which haven't been
> sent to the request_queue.

But very soon these requests will be sent to the request queue, as soon
as the task is either scheduled out or explicitly flushes the plug. So
we might wait a bit longer, but that might not matter in general, I
guess.

> Your ioc-rq-alloc branch is for this, right? Will it be pushed to
> 2.6.39 too? I'm wondering if we should limit the per-task queue length:
> if enough requests accumulate there, we force a flush plug.

That's the idea Jens had. But then came the question of maintaining
data structures per task per disk, which makes it complicated.

Even if we move the accounting out of the request queue and do it at,
say, the bdi, ideally we would have to do per-task per-bdi accounting.

Jens seemed to be suggesting that flusher threads are generally the
main culprit for submitting large amounts of IO. They are already
per-bdi, so probably just maintain a per-task limit for flusher threads.

I am not sure what happens to the direct reclaim path, AIO deep queue
paths, etc.
  
> 2. Some APIs like blk_delay_work(), which call __blk_run_queue(), might
> not work, because other CPUs might not have dispatched their requests to
> the request queue. So __blk_run_queue() will eventually find no
> requests, which might stall devices.
> Since one CPU doesn't know about other CPUs' request lists, I'm
> wondering if there are other similar issues.

So again, in this case, if the queue is empty at the time of
__blk_run_queue(), we will probably just experience a little more delay
than intended until some task flushes. But it should not stall the
system?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-03-16 17:31     ` Vivek Goyal
@ 2011-03-17  1:00       ` Shaohua Li
  2011-03-17  3:19         ` Shaohua Li
  2011-03-17  9:43         ` Jens Axboe
  0 siblings, 2 replies; 152+ messages in thread
From: Shaohua Li @ 2011-03-17  1:00 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, linux-kernel, hch, jmoyer

On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> > 2011/1/22 Jens Axboe <jaxboe@fusionio.com>:
> > > Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> > > ---
> > >  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
> > >  block/elevator.c          |    6 +-
> > >  include/linux/blk_types.h |    2 +
> > >  include/linux/blkdev.h    |   30 ++++
> > >  include/linux/elevator.h  |    1 +
> > >  include/linux/sched.h     |    6 +
> > >  kernel/exit.c             |    1 +
> > >  kernel/fork.c             |    3 +
> > >  kernel/sched.c            |   11 ++-
> > >  9 files changed, 317 insertions(+), 100 deletions(-)
> > >
> > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > index 960f12c..42dbfcc 100644
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -27,6 +27,7 @@
> > >  #include <linux/writeback.h>
> > >  #include <linux/task_io_accounting_ops.h>
> > >  #include <linux/fault-inject.h>
> > > +#include <linux/list_sort.h>
> > >
> > >  #define CREATE_TRACE_POINTS
> > >  #include <trace/events/block.h>
> > > @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> > >
> > >        q = container_of(work, struct request_queue, delay_work.work);
> > >        spin_lock_irq(q->queue_lock);
> > > -       q->request_fn(q);
> > > +       __blk_run_queue(q);
> > >        spin_unlock_irq(q->queue_lock);
> > >  }
> > Hi Jens,
> > I have some questions about the per-task plugging. The request list is
> > per-task, and each task delivers its requests when it finishes a plug
> > flush or schedules. But when one CPU delivers its requests to the global
> > queue, other CPUs don't know about it. This seems problematic. For
> > example:
> > 1. get_request_wait() can only flush the current task's request list;
> > other CPUs/tasks might still have a lot of requests which haven't been
> > sent to the request_queue.
> 
> But very soon these requests will be sent to the request queue, as soon
> as the task is either scheduled out or explicitly flushes the plug. So
> we might wait a bit longer, but that might not matter in general, I
> guess.
Yes, I understand there is just a bit of delay. I don't know how severe
it is, but this could still be a problem, especially for fast storage or
random I/O. My current tests show a slight regression (3% or so) with
Jens's for-2.6.39/core branch. I'm still checking if it's caused by the
per-task plug, but it is highly suspected.

> > Your ioc-rq-alloc branch is for this, right? Will it be pushed to
> > 2.6.39 too? I'm wondering if we should limit the per-task queue length:
> > if enough requests accumulate there, we force a flush plug.
> 
> That's the idea Jens had. But then came the question of maintaining
> data structures per task per disk, which makes it complicated.
> 
> Even if we move the accounting out of the request queue and do it at,
> say, the bdi, ideally we would have to do per-task per-bdi accounting.
> 
> Jens seemed to be suggesting that flusher threads are generally the
> main culprit for submitting large amounts of IO. They are already
> per-bdi, so probably just maintain a per-task limit for flusher threads.
Yep, the flusher is the main spot in my mind. We need to call flush
plug more often for the flusher thread.

> I am not sure what happens to the direct reclaim path, AIO deep queue
> paths, etc.
The direct reclaim path could build a deep write queue too. It uses
.writepage, and currently there is no flush plug there. Maybe we need to
add a flush plug in shrink_inactive_list() too.
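Something roughly like this (sketch only; the exact placement in
mm/vmscan.c and the surrounding code are guesses on my part):

```c
/* in mm/vmscan.c:shrink_inactive_list(), roughly */
struct blk_plug plug;

blk_start_plug(&plug);
/* ... existing loop that isolates pages and calls shrink_page_list(),
 * whose pageout() -> .writepage submissions would now get batched ... */
blk_finish_plug(&plug);	/* kick the batched writeback IO */
```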

> > 2. Some APIs like blk_delay_work(), which call __blk_run_queue(), might
> > not work, because other CPUs might not have dispatched their requests to
> > the request queue. So __blk_run_queue() will eventually find no
> > requests, which might stall devices.
> > Since one CPU doesn't know about other CPUs' request lists, I'm
> > wondering if there are other similar issues.
> 
> So again, in this case, if the queue is empty at the time of
> __blk_run_queue(), we will probably just experience a little more delay
> than intended until some task flushes. But it should not stall the
> system?
It does not stall the system, but the device stalls for a little while.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-03-17  1:00       ` Shaohua Li
@ 2011-03-17  3:19         ` Shaohua Li
  2011-03-17  9:44           ` Jens Axboe
  2011-03-17  9:43         ` Jens Axboe
  1 sibling, 1 reply; 152+ messages in thread
From: Shaohua Li @ 2011-03-17  3:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch, jmoyer, Vivek Goyal

On Thu, 2011-03-17 at 09:00 +0800, Shaohua Li wrote:
> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> > On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> > > 2011/1/22 Jens Axboe <jaxboe@fusionio.com>:
> > > > Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> > > > ---
> > > >  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
> > > >  block/elevator.c          |    6 +-
> > > >  include/linux/blk_types.h |    2 +
> > > >  include/linux/blkdev.h    |   30 ++++
> > > >  include/linux/elevator.h  |    1 +
> > > >  include/linux/sched.h     |    6 +
> > > >  kernel/exit.c             |    1 +
> > > >  kernel/fork.c             |    3 +
> > > >  kernel/sched.c            |   11 ++-
> > > >  9 files changed, 317 insertions(+), 100 deletions(-)
> > > >
> > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > index 960f12c..42dbfcc 100644
> > > > --- a/block/blk-core.c
> > > > +++ b/block/blk-core.c
> > > > @@ -27,6 +27,7 @@
> > > >  #include <linux/writeback.h>
> > > >  #include <linux/task_io_accounting_ops.h>
> > > >  #include <linux/fault-inject.h>
> > > > +#include <linux/list_sort.h>
> > > >
> > > >  #define CREATE_TRACE_POINTS
> > > >  #include <trace/events/block.h>
> > > > @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> > > >
> > > >        q = container_of(work, struct request_queue, delay_work.work);
> > > >        spin_lock_irq(q->queue_lock);
> > > > -       q->request_fn(q);
> > > > +       __blk_run_queue(q);
> > > >        spin_unlock_irq(q->queue_lock);
> > > >  }
> > > Hi Jens,
> > > I have some questions about the per-task plugging. The request list is
> > > per-task, and each task delivers its requests when it finishes a plug
> > > flush or schedules. But when one CPU delivers its requests to the global
> > > queue, other CPUs don't know about it. This seems problematic. For
> > > example:
> > > 1. get_request_wait() can only flush the current task's request list;
> > > other CPUs/tasks might still have a lot of requests which haven't been
> > > sent to the request_queue.
> > 
> > But very soon these requests will be sent to the request queue, as soon
> > as the task is either scheduled out or explicitly flushes the plug. So
> > we might wait a bit longer, but that might not matter in general, I
> > guess.
> Yes, I understand there is just a bit of delay. I don't know how severe
> it is, but this could still be a problem, especially for fast storage or
> random I/O. My current tests show a slight regression (3% or so) with
> Jens's for-2.6.39/core branch. I'm still checking if it's caused by the
> per-task plug, but it is highly suspected.
> 
> > > Your ioc-rq-alloc branch is for this, right? Will it be pushed to
> > > 2.6.39 too? I'm wondering if we should limit the per-task queue length:
> > > if enough requests accumulate there, we force a flush plug.
> > 
> > That's the idea Jens had. But then came the question of maintaining
> > data structures per task per disk, which makes it complicated.
> > 
> > Even if we move the accounting out of the request queue and do it at,
> > say, the bdi, ideally we would have to do per-task per-bdi accounting.
> > 
> > Jens seemed to be suggesting that flusher threads are generally the
> > main culprit for submitting large amounts of IO. They are already
> > per-bdi, so probably just maintain a per-task limit for flusher threads.
> Yep, the flusher is the main spot in my mind. We need to call flush
> plug more often for the flusher thread.
> 
> > I am not sure what happens to the direct reclaim path, AIO deep queue
> > paths, etc.
> The direct reclaim path could build a deep write queue too. It uses
> .writepage, and currently there is no flush plug there. Maybe we need to
> add a flush plug in shrink_inactive_list() too.
> 
> > > 2. Some APIs like blk_delay_work(), which call __blk_run_queue(), might
> > > not work, because other CPUs might not have dispatched their requests to
> > > the request queue. So __blk_run_queue() will eventually find no
> > > requests, which might stall devices.
> > > Since one CPU doesn't know about other CPUs' request lists, I'm
> > > wondering if there are other similar issues.
> > 
> > So again, in this case, if the queue is empty at the time of
> > __blk_run_queue(), we will probably just experience a little more delay
> > than intended until some task flushes. But it should not stall the
> > system?
> It does not stall the system, but the device stalls for a little while.
Jens,
I need the patch below to recover an ffsb fsync workload, which shows
about a 30% regression with stack plugging.
I guess the reason is that WRITE_SYNC_PLUG doesn't work now, so if a
context doesn't have a blk_plug, we lose the previous plugging (request
merging). This suggests that all the places where we used
WRITE_SYNC_PLUG before (for example, kjournald) should have a blk_plug
context.

Thanks,
Shaohua


diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index cc0ede1..24b7ac2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1039,11 +1039,17 @@ static int __writepage(struct page *page, struct writeback_control *wbc,
 int generic_writepages(struct address_space *mapping,
 		       struct writeback_control *wbc)
 {
+	struct blk_plug plug;
+	int ret;
+
 	/* deal with chardevs and other special file */
 	if (!mapping->a_ops->writepage)
 		return 0;
 
-	return write_cache_pages(mapping, wbc, __writepage, mapping);
+	blk_start_plug(&plug);
+	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
+	blk_finish_plug(&plug);
+	return ret;
 }
 
 EXPORT_SYMBOL(generic_writepages);



^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task  plugging
  2011-03-16  8:18   ` Shaohua Li
  2011-03-16 17:31     ` Vivek Goyal
@ 2011-03-17  9:39     ` Jens Axboe
  1 sibling, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-03-17  9:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, hch, Vivek Goyal, jmoyer, shaohua.li

On 2011-03-16 09:18, Shaohua Li wrote:
> 2011/1/22 Jens Axboe <jaxboe@fusionio.com>:
>> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
>> ---
>>  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
>>  block/elevator.c          |    6 +-
>>  include/linux/blk_types.h |    2 +
>>  include/linux/blkdev.h    |   30 ++++
>>  include/linux/elevator.h  |    1 +
>>  include/linux/sched.h     |    6 +
>>  kernel/exit.c             |    1 +
>>  kernel/fork.c             |    3 +
>>  kernel/sched.c            |   11 ++-
>>  9 files changed, 317 insertions(+), 100 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 960f12c..42dbfcc 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -27,6 +27,7 @@
>>  #include <linux/writeback.h>
>>  #include <linux/task_io_accounting_ops.h>
>>  #include <linux/fault-inject.h>
>> +#include <linux/list_sort.h>
>>
>>  #define CREATE_TRACE_POINTS
>>  #include <trace/events/block.h>
>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
>>
>>        q = container_of(work, struct request_queue, delay_work.work);
>>        spin_lock_irq(q->queue_lock);
>> -       q->request_fn(q);
>> +       __blk_run_queue(q);
>>        spin_unlock_irq(q->queue_lock);
>>  }
> Hi Jens,
> I have some questions about the per-task plugging. The request list is
> per-task, and each task delivers its requests when it finishes a plug
> flush or schedules. But when one CPU delivers its requests to the global
> queue, other CPUs don't know about it. This seems problematic. For
> example:
> 1. get_request_wait() can only flush the current task's request list;
> other CPUs/tasks might still have a lot of requests which haven't been
> sent to the request_queue. Your ioc-rq-alloc branch is for this, right?
> Will it be pushed to 2.6.39 too? I'm wondering if we should limit the
> per-task queue length: if enough requests accumulate there, we force a
> flush plug.

Any task plug is by definition short-lived, since it only persists
while someone is submitting IO or if the task ends up blocking. It's not
like the current code, where a plug can persist for some time.

I don't plan on submitting ioc-rq-alloc for 2.6.39; it needs more
work. I think we'll end up dropping the limits completely and just
ensuring that the flusher thread doesn't push out too much.

> 2. Some APIs like blk_delay_work(), which call __blk_run_queue(), might
> not work, because other CPUs might not have dispatched their requests to
> the request queue. So __blk_run_queue() will eventually find no
> requests, which might stall devices.
> Since one CPU doesn't know about other CPUs' request lists, I'm
> wondering if there are other similar issues.

If you call blk_run_queue(), it's to kick off something that you
submitted (and that should already be on the queue). So I don't think
this is an issue.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task  plugging
  2011-03-17  1:00       ` Shaohua Li
  2011-03-17  3:19         ` Shaohua Li
@ 2011-03-17  9:43         ` Jens Axboe
  2011-03-18  6:36           ` Shaohua Li
  1 sibling, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-17  9:43 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Vivek Goyal, linux-kernel, hch, jmoyer

On 2011-03-17 02:00, Shaohua Li wrote:
> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
>>> 2011/1/22 Jens Axboe <jaxboe@fusionio.com>:
>>>> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
>>>> ---
>>>>  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
>>>>  block/elevator.c          |    6 +-
>>>>  include/linux/blk_types.h |    2 +
>>>>  include/linux/blkdev.h    |   30 ++++
>>>>  include/linux/elevator.h  |    1 +
>>>>  include/linux/sched.h     |    6 +
>>>>  kernel/exit.c             |    1 +
>>>>  kernel/fork.c             |    3 +
>>>>  kernel/sched.c            |   11 ++-
>>>>  9 files changed, 317 insertions(+), 100 deletions(-)
>>>>
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 960f12c..42dbfcc 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -27,6 +27,7 @@
>>>>  #include <linux/writeback.h>
>>>>  #include <linux/task_io_accounting_ops.h>
>>>>  #include <linux/fault-inject.h>
>>>> +#include <linux/list_sort.h>
>>>>
>>>>  #define CREATE_TRACE_POINTS
>>>>  #include <trace/events/block.h>
>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
>>>>
>>>>        q = container_of(work, struct request_queue, delay_work.work);
>>>>        spin_lock_irq(q->queue_lock);
>>>> -       q->request_fn(q);
>>>> +       __blk_run_queue(q);
>>>>        spin_unlock_irq(q->queue_lock);
>>>>  }
>>> Hi Jens,
>>> I have some questions about the per-task plugging. The request list is
>>> per-task, and each task delivers its requests when it finishes a plug
>>> flush or schedules. But when one CPU delivers its requests to the global
>>> queue, other CPUs don't know about it. This seems problematic. For
>>> example:
>>> 1. get_request_wait() can only flush the current task's request list;
>>> other CPUs/tasks might still have a lot of requests which haven't been
>>> sent to the request_queue.
>>
>> But very soon these requests will be sent to the request queue, as soon
>> as the task is either scheduled out or explicitly flushes the plug. So
>> we might wait a bit longer, but that might not matter in general, I
>> guess.
> Yes, I understand there is just a bit of delay. I don't know how severe
> it is, but this could still be a problem, especially for fast storage or
> random I/O. My current tests show a slight regression (3% or so) with
> Jens's for-2.6.39/core branch. I'm still checking if it's caused by the
> per-task plug, but it is highly suspected.

To check this particular case, you can always just bump the request
limit. What test is showing a slowdown? Like the one that Vivek
discovered, we are going to be adding plugs in more places. I didn't go
crazy with those; I wanted to get the infrastructure sane and stable
first.

>>
>> Jens seemed to be suggesting that flusher threads are generally the
>> main culprit for submitting large amounts of IO. They are already
>> per-bdi, so probably just maintain a per-task limit for flusher threads.
> Yep, the flusher is the main spot in my mind. We need to call flush
> plug more often for the flusher thread.
> 
>> I am not sure what happens to the direct reclaim path, AIO deep queue
>> paths, etc.
> The direct reclaim path could build a deep write queue too. It uses
> .writepage, and currently there is no flush plug there. Maybe we need to
> add a flush plug in shrink_inactive_list() too.

If you find and locate these spots, I'd very much appreciate a patch too
:-)

>>> 2. Some APIs like blk_delay_work(), which call __blk_run_queue(), might
>>> not work, because other CPUs might not have dispatched their requests to
>>> the request queue. So __blk_run_queue() will eventually find no
>>> requests, which might stall devices.
>>> Since one CPU doesn't know about other CPUs' request lists, I'm
>>> wondering if there are other similar issues.
>>
>> So again, in this case, if the queue is empty at the time of
>> __blk_run_queue(), we will probably just experience a little more delay
>> than intended until some task flushes. But it should not stall the
>> system?
> It does not stall the system, but the device stalls for a little while.

It's not a problem. Say you use blk_delay_work(); that is to delay
something that is already on the queue. Any task plug should be
unrelated. For the request starvation issue, it would be a problem if we
had the plug persist across schedules. But the time frame that a
per-task plug lives for is very short, just the IO submission. Flushing
those plugs would be detrimental to the problem you want to solve, which
is ensuring that those IOs finish faster so that we can allocate more.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task  plugging
  2011-03-17  3:19         ` Shaohua Li
@ 2011-03-17  9:44           ` Jens Axboe
  2011-03-18  1:55             ` Shaohua Li
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-17  9:44 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, hch, jmoyer, Vivek Goyal

On 2011-03-17 04:19, Shaohua Li wrote:
> On Thu, 2011-03-17 at 09:00 +0800, Shaohua Li wrote:
>> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
>>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
>>>> 2011/1/22 Jens Axboe <jaxboe@fusionio.com>:
>>>>> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
>>>>> ---
>>>>>  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
>>>>>  block/elevator.c          |    6 +-
>>>>>  include/linux/blk_types.h |    2 +
>>>>>  include/linux/blkdev.h    |   30 ++++
>>>>>  include/linux/elevator.h  |    1 +
>>>>>  include/linux/sched.h     |    6 +
>>>>>  kernel/exit.c             |    1 +
>>>>>  kernel/fork.c             |    3 +
>>>>>  kernel/sched.c            |   11 ++-
>>>>>  9 files changed, 317 insertions(+), 100 deletions(-)
>>>>>
>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>> index 960f12c..42dbfcc 100644
>>>>> --- a/block/blk-core.c
>>>>> +++ b/block/blk-core.c
>>>>> @@ -27,6 +27,7 @@
>>>>>  #include <linux/writeback.h>
>>>>>  #include <linux/task_io_accounting_ops.h>
>>>>>  #include <linux/fault-inject.h>
>>>>> +#include <linux/list_sort.h>
>>>>>
>>>>>  #define CREATE_TRACE_POINTS
>>>>>  #include <trace/events/block.h>
>>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
>>>>>
>>>>>        q = container_of(work, struct request_queue, delay_work.work);
>>>>>        spin_lock_irq(q->queue_lock);
>>>>> -       q->request_fn(q);
>>>>> +       __blk_run_queue(q);
>>>>>        spin_unlock_irq(q->queue_lock);
>>>>>  }
>>>> Hi Jens,
>>>> I have some questions about the per-task plugging. The request
>>>> list is per-task, and each task delivers its requests at flush finish
>>>> or at schedule. But when one CPU delivers requests to the global queue,
>>>> the other CPUs don't know. This seems to be a problem. For example:
>>>> 1. get_request_wait() can only flush the current task's request list;
>>>> other cpus/tasks might still have a lot of requests which aren't sent
>>>> to the request_queue.
>>>
>>> But these requests will be sent to the request queue very soon, as soon
>>> as the task is either scheduled out or explicitly flushes the plug? So we
>>> might wait a bit longer, but that might not matter in general, I guess.
>> Yes, I understand there is just a bit of delay. I don't know how severe
>> it is, but this could still be a problem, especially for fast storage or
>> random I/O. My current tests show a slight regression (3% or so) with
>> Jens's for-2.6.39/core branch. I'm still checking whether it's caused by
>> the per-task plug, but the per-task plug is highly suspected.
>>
>>>> Your ioc-rq-alloc branch is for this, right? Will it
>>>> be pushed to 2.6.39 too? I'm wondering if we should limit the per-task
>>>> queue length. If there are enough requests there, we force a flush of
>>>> the plug.
>>>
>>> That's the idea Jens had. But then came the question of maintaining
>>> data structures per task per disk. That makes it complicated.
>>>
>>> Even if we move the accounting out of the request queue and do it, say,
>>> at the bdi, ideally we would have to do per-task, per-bdi accounting.
>>>
>>> Jens seemed to be suggesting that generally flusher threads are the
>>> main culprit for submitting large amounts of IO. They are already per
>>> bdi. So probably just maintain a per task limit for flusher threads.
>> Yep, the flusher is the main spot in my mind. We need to call flush plug
>> more often for the flusher thread.
>>
>>> I am not sure what happens to the direct reclaim path, AIO deep-queue
>>> paths, etc.
>> The direct reclaim path could build a deep write queue too. It
>> uses .writepage, and currently there is no flush plug there. Maybe we
>> need to add a flush plug in shrink_inactive_list too.
>>
>>>> 2. Some APIs like blk_delay_work, which call __blk_run_queue(), might
>>>> not work, because other CPUs might not have dispatched their requests to
>>>> the request queue. So __blk_run_queue will eventually find no requests,
>>>> which might stall devices.
>>>> Since one cpu doesn't know other cpus' request lists, I'm wondering if
>>>> there are other similar issues.
>>>
>>> So again in this case, if the queue is empty at the time of __blk_run_queue(),
>>> then we will probably just experience a little more delay than intended
>>> till some task flushes. But it should not stall the system?
>> It does not stall the system, but the device stalls for a little while.
> Jens,
> I need the patch below to recover an ffsb fsync workload, which has about
> a 30% regression with stack plug.
> I guess the reason is that WRITE_SYNC_PLUG doesn't work now, so if a context
> hasn't got a blk_plug, we lose the previous plugging (request merging). This
> suggests all the places where we used WRITE_SYNC_PLUG before (for example,
> kjournald) should have a blk_plug context.

Good point, those should be auto-converted. I'll take this patch and
double check the others. Thanks!

Does it remove that performance regression completely?

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-08 22:05             ` Mike Snitzer
  2011-03-10  0:58               ` Mike Snitzer
@ 2011-03-17 15:51               ` Mike Snitzer
  2011-03-17 18:31                 ` Jens Axboe
  1 sibling, 1 reply; 152+ messages in thread
From: Mike Snitzer @ 2011-03-17 15:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch

On Tue, Mar 08 2011 at  5:05pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Mar 08 2011 at  3:27pm -0500,
> Jens Axboe <jaxboe@fusionio.com> wrote:
> 
> > On 2011-03-08 21:21, Mike Snitzer wrote:
> > > On Tue, Mar 08 2011 at  7:16am -0500,
> > > Jens Axboe <jaxboe@fusionio.com> wrote:
> > > 
> > >> On 2011-03-03 23:13, Mike Snitzer wrote:
> > >>> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
> > >>> kernel, when I try an fsync heavy workload to a request-based mpath
> > >>> device (the kernel ultimately goes down in flames, I've yet to look at
> > >>> the crashdump I took)
> > >>
> > >> Mike, can you re-run with the current stack-plug branch? I've fixed the
> > >> !CONFIG_BLOCK and rebase issues, and also added a change for this flush
> > >> on schedule event. It's run outside of the runqueue lock now, so
> > >> hopefully that should solve this one.
> > > 
> > > Works for me, thanks.
> > 
> > Super, thanks! Out of curiosity, did you use dm/md?
> 
> Yes, I've been using a request-based DM multipath device.


Against the latest 'for-2.6.39/core', I just ran that same fsync-heavy
workload against XFS (on top of a DM multipath volume).  ffsb induced the
following hangs (with a ripple effect causing NetworkManager to get hung
up on this data-only XFS volume, etc.):

XFS mounting filesystem dm-0
Ending clean XFS mount for filesystem: dm-0
mount used greatest stack depth: 3296 bytes left
ffsb used greatest stack depth: 2592 bytes left
INFO: task kswapd0:23 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0         D ffff880037b8f6e0  3656    23      2 0x00000000
 ffff880037b8f6d0 0000000000000046 ffff880037b8f630 ffffffff8107012f
 ffff880037b8e010 ffff880037b8ffd8 00000000001d21c0 ffff880037b90600
 ffff880037b90998 ffff880037b90990 00000000001d21c0 00000000001d21c0
Call Trace:
 [<ffffffff8107012f>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffffa013e958>] xlog_wait+0x60/0x78 [xfs]
 [<ffffffff810404de>] ? default_wake_function+0x0/0x14
 [<ffffffff81373c5f>] ? _raw_spin_lock+0x62/0x69
 [<ffffffffa013f874>] xlog_state_get_iclog_space+0x9e/0x22c [xfs]
 [<ffffffffa013fb73>] xlog_write+0x171/0x4ae [xfs]
 [<ffffffffa0150df7>] ? kmem_alloc+0x69/0xb1 [xfs]
 [<ffffffff810fad18>] ? __kmalloc+0x14e/0x160
 [<ffffffffa013ff04>] xfs_log_write+0x54/0x7e [xfs]
 [<ffffffffa014b5b0>] xfs_trans_commit_iclog+0x195/0x2d8 [xfs]
 [<ffffffff81070ece>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffffa014b7bc>] _xfs_trans_commit+0xc9/0x206 [xfs]
 [<ffffffffa0138a18>] xfs_itruncate_finish+0x1fd/0x2bd [xfs]
 [<ffffffffa014f202>] xfs_free_eofblocks+0x1ac/0x1f1 [xfs]
 [<ffffffffa014f707>] xfs_inactive+0x108/0x3a6 [xfs]
 [<ffffffff8106ff27>] ? lockdep_init_map+0xa6/0x11b
 [<ffffffffa015a87f>] xfs_fs_evict_inode+0xf6/0xfe [xfs]
 [<ffffffff81114766>] evict+0x24/0x8c
 [<ffffffff811147ff>] dispose_list+0x31/0xaf
 [<ffffffff81114e92>] shrink_icache_memory+0x1e5/0x215
 [<ffffffff810d1e14>] shrink_slab+0xe0/0x164
 [<ffffffff810d3e5b>] kswapd+0x5e7/0x9dc
 [<ffffffff810d3874>] ? kswapd+0x0/0x9dc
 [<ffffffff8105fb7c>] kthread+0xa0/0xa8
 [<ffffffff81070e9d>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff81003a24>] kernel_thread_helper+0x4/0x10
 [<ffffffff813749d4>] ? restore_args+0x0/0x30
 [<ffffffff8105fadc>] ? kthread+0x0/0xa8
 [<ffffffff81003a20>] ? kernel_thread_helper+0x0/0x10
4 locks held by kswapd0/23:
 #0:  (shrinker_rwsem){++++..}, at: [<ffffffff810d1d71>] shrink_slab+0x3d/0x164
 #1:  (iprune_sem){++++.-}, at: [<ffffffff81114cf7>] shrink_icache_memory+0x4a/0x215
 #2:  (xfs_iolock_reclaimable){+.+.-.}, at: [<ffffffffa013615d>] xfs_ilock+0x30/0xb9 [xfs]
 #3:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
INFO: task NetworkManager:958 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
NetworkManager  D ffff88007a481288  3312   958      1 0x00000000
 ffff88007a481278 0000000000000046 ffff88007a4811d8 ffffffff8107012f
 ffff88007a480010 ffff88007a481fd8 00000000001d21c0 ffff88007b4f0f80
 ffff88007b4f1318 ffff88007b4f1310 00000000001d21c0 00000000001d21c0
Call Trace:
 [<ffffffff8107012f>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffffa013e958>] xlog_wait+0x60/0x78 [xfs]
 [<ffffffff810404de>] ? default_wake_function+0x0/0x14
 [<ffffffff81373c5f>] ? _raw_spin_lock+0x62/0x69
 [<ffffffffa013f874>] xlog_state_get_iclog_space+0x9e/0x22c [xfs]
 [<ffffffffa013fb73>] xlog_write+0x171/0x4ae [xfs]
 [<ffffffffa0150df7>] ? kmem_alloc+0x69/0xb1 [xfs]
 [<ffffffff810fad18>] ? __kmalloc+0x14e/0x160
 [<ffffffffa013ff04>] xfs_log_write+0x54/0x7e [xfs]
 [<ffffffffa014b5b0>] xfs_trans_commit_iclog+0x195/0x2d8 [xfs]
 [<ffffffff81070c11>] ? mark_held_locks+0x52/0x70
 [<ffffffff810fa9f7>] ? kmem_cache_alloc+0xd1/0x145
 [<ffffffff81070e9d>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff81070ece>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffffa014b7bc>] _xfs_trans_commit+0xc9/0x206 [xfs]
 [<ffffffffa011c5fa>] xfs_bmap_finish+0x87/0x16a [xfs]
 [<ffffffffa01389b9>] xfs_itruncate_finish+0x19e/0x2bd [xfs]
 [<ffffffffa014f202>] xfs_free_eofblocks+0x1ac/0x1f1 [xfs]
 [<ffffffffa014f707>] xfs_inactive+0x108/0x3a6 [xfs]
 [<ffffffff8106ff27>] ? lockdep_init_map+0xa6/0x11b
 [<ffffffffa015a87f>] xfs_fs_evict_inode+0xf6/0xfe [xfs]
 [<ffffffff81114766>] evict+0x24/0x8c
 [<ffffffff811147ff>] dispose_list+0x31/0xaf
 [<ffffffff81114e92>] shrink_icache_memory+0x1e5/0x215
 [<ffffffff810d1e14>] shrink_slab+0xe0/0x164
 [<ffffffff810d3282>] try_to_free_pages+0x27f/0x495
 [<ffffffff810cb3fc>] __alloc_pages_nodemask+0x4e3/0x767
 [<ffffffff810700a3>] ? trace_hardirqs_off_caller+0x1f/0x9e
 [<ffffffff810f5b14>] alloc_pages_current+0xa7/0xca
 [<ffffffff810c515c>] __page_cache_alloc+0x85/0x8c
 [<ffffffff810cd420>] __do_page_cache_readahead+0xdb/0x1df
 [<ffffffff810cd545>] ra_submit+0x21/0x25
 [<ffffffff810c66e3>] filemap_fault+0x176/0x396
 [<ffffffff81070ece>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff810e15e7>] __do_fault+0x54/0x354
 [<ffffffff810709bf>] ? mark_lock+0x2d/0x22d
 [<ffffffff810e2493>] handle_pte_fault+0x2cf/0x6e8
 [<ffffffff810e004e>] ? __pte_alloc+0xc3/0xd0
 [<ffffffff810e2986>] handle_mm_fault+0xda/0xed
 [<ffffffff81377c28>] do_page_fault+0x3b4/0x3d6
 [<ffffffff8118e6de>] ? fsnotify_perm+0x69/0x75
 [<ffffffff8118e74b>] ? security_file_permission+0x2e/0x33
 [<ffffffff813739e6>] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [<ffffffff81374be5>] page_fault+0x25/0x30
5 locks held by NetworkManager/958:
 #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff81377a36>] do_page_fault+0x1c2/0x3d6
 #1:  (shrinker_rwsem){++++..}, at: [<ffffffff810d1d71>] shrink_slab+0x3d/0x164
 #2:  (iprune_sem){++++.-}, at: [<ffffffff81114cf7>] shrink_icache_memory+0x4a/0x215
 #3:  (xfs_iolock_reclaimable){+.+.-.}, at: [<ffffffffa013615d>] xfs_ilock+0x30/0xb9 [xfs]
 #4:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
INFO: task xfssyncd/dm-0:1346 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
xfssyncd/dm-0   D ffff880072cb1a20  4824  1346      2 0x00000000
 ffff880072cb1a10 0000000000000046 ffff880072cb1970 ffffffff8107012f
 ffff880072cb0010 ffff880072cb1fd8 00000000001d21c0 ffff88007b22ca00
 ffff88007b22cd98 ffff88007b22cd90 00000000001d21c0 00000000001d21c0
Call Trace:
 [<ffffffff8107012f>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffffa013e958>] xlog_wait+0x60/0x78 [xfs]
 [<ffffffff810404de>] ? default_wake_function+0x0/0x14
 [<ffffffff81373c5f>] ? _raw_spin_lock+0x62/0x69
 [<ffffffffa013f874>] xlog_state_get_iclog_space+0x9e/0x22c [xfs]
 [<ffffffffa013fb73>] xlog_write+0x171/0x4ae [xfs]
 [<ffffffff81065905>] ? sched_clock_local+0x1c/0x82
 [<ffffffff810700a3>] ? trace_hardirqs_off_caller+0x1f/0x9e
 [<ffffffff810709bf>] ? mark_lock+0x2d/0x22d
 [<ffffffffa013ff04>] xfs_log_write+0x54/0x7e [xfs]
 [<ffffffffa014b5b0>] xfs_trans_commit_iclog+0x195/0x2d8 [xfs]
 [<ffffffff81070ece>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffffa0150cdd>] ? kmem_zone_alloc+0x69/0xb1 [xfs]
 [<ffffffffa014af66>] ? xfs_trans_add_item+0x50/0x5c [xfs]
 [<ffffffffa014b7bc>] _xfs_trans_commit+0xc9/0x206 [xfs]
 [<ffffffffa0133289>] xfs_fs_log_dummy+0x76/0x7d [xfs]
 [<ffffffffa015cd3d>] xfs_sync_worker+0x37/0x6f [xfs]
 [<ffffffffa015ccb0>] xfssyncd+0x15b/0x1b1 [xfs]
 [<ffffffffa015cb55>] ? xfssyncd+0x0/0x1b1 [xfs]
 [<ffffffff8105fb7c>] kthread+0xa0/0xa8
 [<ffffffff81070e9d>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff81003a24>] kernel_thread_helper+0x4/0x10
 [<ffffffff813749d4>] ? restore_args+0x0/0x30
 [<ffffffff8105fadc>] ? kthread+0x0/0xa8
 [<ffffffff81003a20>] ? kernel_thread_helper+0x0/0x10
no locks held by xfssyncd/dm-0/1346.
INFO: task ffsb:1355 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ffsb            D 000000010002503a  3648  1355   1322 0x00000000
 ffff88007baffae8 0000000000000046 ffff88007baffa48 ffffffff00000000
 ffff88007bafe010 ffff88007bafffd8 00000000001d21c0 ffff880071df4680
 ffff880071df4a18 ffff880071df4a10 00000000001d21c0 00000000001d21c0
Call Trace:
 [<ffffffffa013e958>] xlog_wait+0x60/0x78 [xfs]
 [<ffffffff810404de>] ? default_wake_function+0x0/0x14
 [<ffffffff81373c5f>] ? _raw_spin_lock+0x62/0x69
 [<ffffffffa013f874>] xlog_state_get_iclog_space+0x9e/0x22c [xfs]
 [<ffffffffa013fb73>] xlog_write+0x171/0x4ae [xfs]
 [<ffffffff810727da>] ? __lock_acquire+0x3bc/0xd26
 [<ffffffff81065a7a>] ? local_clock+0x41/0x5a
 [<ffffffff81024167>] ? pvclock_clocksource_read+0x4b/0xb4
 [<ffffffffa013ff04>] xfs_log_write+0x54/0x7e [xfs]
 [<ffffffff81065905>] ? sched_clock_local+0x1c/0x82
 [<ffffffffa014b5b0>] xfs_trans_commit_iclog+0x195/0x2d8 [xfs]
 [<ffffffff81070c11>] ? mark_held_locks+0x52/0x70
 [<ffffffff810fa9f7>] ? kmem_cache_alloc+0xd1/0x145
 [<ffffffff81070e9d>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff81070ece>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffffa0150cdd>] ? kmem_zone_alloc+0x69/0xb1 [xfs]
 [<ffffffffa014b7bc>] _xfs_trans_commit+0xc9/0x206 [xfs]
 [<ffffffffa01566f8>] xfs_file_fsync+0x166/0x1e6 [xfs]
 [<ffffffff81122a8b>] vfs_fsync_range+0x54/0x7c
 [<ffffffff81122b15>] vfs_fsync+0x1c/0x1e
 [<ffffffff81122b45>] do_fsync+0x2e/0x43
 [<ffffffff81122b81>] sys_fsync+0x10/0x14
 [<ffffffff81002b82>] system_call_fastpath+0x16/0x1b
2 locks held by ffsb/1355:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
INFO: task ffsb:1364 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ffsb            D 0000000100024ff6  3776  1364   1322 0x00000000
 ffff880027d25ae8 0000000000000046 ffff880027d25a48 ffffffff00000000
 ffff880027d24010 ffff880027d25fd8 00000000001d21c0 ffff88002839c8c0
 ffff88002839cc58 ffff88002839cc50 00000000001d21c0 00000000001d21c0
Call Trace:
 [<ffffffffa013e958>] xlog_wait+0x60/0x78 [xfs]
 [<ffffffff810404de>] ? default_wake_function+0x0/0x14
 [<ffffffff81373c5f>] ? _raw_spin_lock+0x62/0x69
 [<ffffffffa013f874>] xlog_state_get_iclog_space+0x9e/0x22c [xfs]
 [<ffffffffa013fb73>] xlog_write+0x171/0x4ae [xfs]
 [<ffffffffa0150df7>] ? kmem_alloc+0x69/0xb1 [xfs]
 [<ffffffff810fad18>] ? __kmalloc+0x14e/0x160
 [<ffffffffa013ff04>] xfs_log_write+0x54/0x7e [xfs]
 [<ffffffffa014b5b0>] xfs_trans_commit_iclog+0x195/0x2d8 [xfs]
 [<ffffffff81070c11>] ? mark_held_locks+0x52/0x70
 [<ffffffff810fa9f7>] ? kmem_cache_alloc+0xd1/0x145
 [<ffffffff81070e9d>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff81070ece>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffffa0150cdd>] ? kmem_zone_alloc+0x69/0xb1 [xfs]
 [<ffffffffa014b7bc>] _xfs_trans_commit+0xc9/0x206 [xfs]
 [<ffffffffa01566f8>] xfs_file_fsync+0x166/0x1e6 [xfs]
 [<ffffffff81122a8b>] vfs_fsync_range+0x54/0x7c
 [<ffffffff81122b15>] vfs_fsync+0x1c/0x1e
 [<ffffffff81122b45>] do_fsync+0x2e/0x43
 [<ffffffff81122b81>] sys_fsync+0x10/0x14
 [<ffffffff81002b82>] system_call_fastpath+0x16/0x1b
2 locks held by ffsb/1364:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]

(and many more ffsb processes hung similar to the 2 above)

I just attempted a git command against the root volume, it hung:

git             D 0000000100252022  3440  1471   1461 0x00000004
 ffff88003ad611d8 0000000000000046 ffff88003ad61138 ffffffff00000000
 ffff88003ad60010 ffff88003ad61fd8 00000000001d21c0 ffff88003b498d40
 ffff88003b4990d8 ffff88003b4990d0 00000000001d21c0 00000000001d21c0
Call Trace:
 [<ffffffffa013e958>] xlog_wait+0x60/0x78 [xfs]
 [<ffffffff810404de>] ? default_wake_function+0x0/0x14
 [<ffffffff81373c5f>] ? _raw_spin_lock+0x62/0x69
 [<ffffffffa013f874>] xlog_state_get_iclog_space+0x9e/0x22c [xfs]
 [<ffffffffa013fb73>] xlog_write+0x171/0x4ae [xfs]
 [<ffffffffa0150df7>] ? kmem_alloc+0x69/0xb1 [xfs]
 [<ffffffff810fad18>] ? __kmalloc+0x14e/0x160
 [<ffffffffa013ff04>] xfs_log_write+0x54/0x7e [xfs]
 [<ffffffffa014b5b0>] xfs_trans_commit_iclog+0x195/0x2d8 [xfs]
 [<ffffffff81070c11>] ? mark_held_locks+0x52/0x70
 [<ffffffff810fa9f7>] ? kmem_cache_alloc+0xd1/0x145
 [<ffffffff81070e9d>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff81070ece>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffffa014b7bc>] _xfs_trans_commit+0xc9/0x206 [xfs]
 [<ffffffffa011c5fa>] xfs_bmap_finish+0x87/0x16a [xfs]
 [<ffffffffa01389b9>] xfs_itruncate_finish+0x19e/0x2bd [xfs]
 [<ffffffffa014f202>] xfs_free_eofblocks+0x1ac/0x1f1 [xfs]
 [<ffffffffa014f707>] xfs_inactive+0x108/0x3a6 [xfs]
 [<ffffffff8106ff27>] ? lockdep_init_map+0xa6/0x11b
 [<ffffffffa015a87f>] xfs_fs_evict_inode+0xf6/0xfe [xfs]
 [<ffffffff81114766>] evict+0x24/0x8c
 [<ffffffff811147ff>] dispose_list+0x31/0xaf
 [<ffffffff81114e92>] shrink_icache_memory+0x1e5/0x215
 [<ffffffff810d1e14>] shrink_slab+0xe0/0x164
 [<ffffffff810d3282>] try_to_free_pages+0x27f/0x495
 [<ffffffff810cb3fc>] __alloc_pages_nodemask+0x4e3/0x767
 [<ffffffff810700a3>] ? trace_hardirqs_off_caller+0x1f/0x9e
 [<ffffffff810f5b14>] alloc_pages_current+0xa7/0xca
 [<ffffffff810c515c>] __page_cache_alloc+0x85/0x8c
 [<ffffffff810cd420>] __do_page_cache_readahead+0xdb/0x1df
 [<ffffffff8106fa35>] ? lock_release_holdtime+0x2c/0xd7
 [<ffffffff810cd545>] ra_submit+0x21/0x25
 [<ffffffff810cd92c>] ondemand_readahead+0x1e3/0x1f6
 [<ffffffff810cd9b8>] page_cache_async_readahead+0x79/0x82
 [<ffffffff810c6633>] filemap_fault+0xc6/0x396
 [<ffffffff81070ece>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff810e15e7>] __do_fault+0x54/0x354
 [<ffffffff810709bf>] ? mark_lock+0x2d/0x22d
 [<ffffffff810e2493>] handle_pte_fault+0x2cf/0x6e8
 [<ffffffff810e004e>] ? __pte_alloc+0xc3/0xd0
 [<ffffffff810e2986>] handle_mm_fault+0xda/0xed
 [<ffffffff81377c28>] do_page_fault+0x3b4/0x3d6
 [<ffffffff810700a3>] ? trace_hardirqs_off_caller+0x1f/0x9e
 [<ffffffff8107012f>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff81065a7a>] ? local_clock+0x41/0x5a
 [<ffffffff813739e6>] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [<ffffffff81374be5>] page_fault+0x25/0x30

And here is the summary of all the locks (via sysrq-t):

Showing all locks held in the system:
2 locks held by kworker/0:1/10:
 #0:  (xfsdatad){++++..}, at: [<ffffffff81059fba>] process_one_work+0x18a/0x37f
 #1:  ((&ioend->io_work)){+.+...}, at: [<ffffffff81059fba>] process_one_work+0x18a/0x37f
4 locks held by kswapd0/23:
 #0:  (shrinker_rwsem){++++..}, at: [<ffffffff810d1d71>] shrink_slab+0x3d/0x164
 #1:  (iprune_sem){++++.-}, at: [<ffffffff81114cf7>] shrink_icache_memory+0x4a/0x215
 #2:  (xfs_iolock_reclaimable){+.+.-.}, at: [<ffffffffa013615d>] xfs_ilock+0x30/0xb9 [xfs]
 #3:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
1 lock held by multipathd/659:
 #0:  (&u->readlock){+.+.+.}, at: [<ffffffff813533dd>] unix_dgram_recvmsg+0x5a/0x27f
5 locks held by NetworkManager/958:
 #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff81377a36>] do_page_fault+0x1c2/0x3d6
 #1:  (shrinker_rwsem){++++..}, at: [<ffffffff810d1d71>] shrink_slab+0x3d/0x164
 #2:  (iprune_sem){++++.-}, at: [<ffffffff81114cf7>] shrink_icache_memory+0x4a/0x215
 #3:  (xfs_iolock_reclaimable){+.+.-.}, at: [<ffffffffa013615d>] xfs_ilock+0x30/0xb9 [xfs]
 #4:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
1 lock held by agetty/1099:
 #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff8123c132>] n_tty_read+0x284/0x7ba
1 lock held by mingetty/1101:
 #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff8123c132>] n_tty_read+0x284/0x7ba
1 lock held by mingetty/1103:
 #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff8123c132>] n_tty_read+0x284/0x7ba
1 lock held by mingetty/1105:
 #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff8123c132>] n_tty_read+0x284/0x7ba
1 lock held by mingetty/1107:
 #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff8123c132>] n_tty_read+0x284/0x7ba
1 lock held by mingetty/1109:
 #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff8123c132>] n_tty_read+0x284/0x7ba
1 lock held by mingetty/1111:
 #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff8123c132>] n_tty_read+0x284/0x7ba
1 lock held by bash/1313:
 #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff8123c132>] n_tty_read+0x284/0x7ba
2 locks held by ffsb/1355:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1358:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
1 lock held by ffsb/1359:
 #0:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1362:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1364:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1365:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1366:
 #0:  (xfs_iolock_active){++++.+}, at: [<ffffffffa0136079>] xfs_ilock_nowait+0x2b/0xdf [xfs]
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1367:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
1 lock held by ffsb/1368:
 #0:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1371:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1372:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1373:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1374:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1375:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1376:
 #0:  (xfs_iolock_active){++++.+}, at: [<ffffffffa0136079>] xfs_ilock_nowait+0x2b/0xdf [xfs]
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1377:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1378:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1380:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1381:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1383:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
4 locks held by ffsb/1384:
 #0:  (&sb->s_type->i_mutex_key#13/1){+.+.+.}, at: [<ffffffff8110b9ba>] do_unlinkat+0x67/0x165
 #1:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81109eaf>] vfs_unlink+0x4f/0xcc
 #2:  (&(&ip->i_lock)->mr_lock/2){+.+...}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
 #3:  (&(&ip->i_lock)->mr_lock/3){+.+...}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1385:
 #0:  (xfs_iolock_active){++++.+}, at: [<ffffffffa0136079>] xfs_ilock_nowait+0x2b/0xdf [xfs]
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1386:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1387:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1388:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: 
2 locks held by ffsb/1389:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
1 lock held by ffsb/1390:
 #0:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1391:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1392:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1393:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
1 lock held by ffsb/1394:
 #0:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1395:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1396:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1397:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1398:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1399:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1400:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1402:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1403:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
1 lock held by ffsb/1404:
 #0:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1405:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1406:
 #0:  (xfs_iolock_active){++++.+}, at: [<ffffffffa0136079>] xfs_ilock_nowait+0x2b/0xdf [xfs]
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1407:
 #0:  (xfs_iolock_active){++++.+}, at: [<ffffffffa0136079>] xfs_ilock_nowait+0x2b/0xdf [xfs]
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1409:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1410:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1411:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1412:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1413:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1414:
 #0:  (xfs_iolock_active){++++.+}, at: [<ffffffffa0136079>] xfs_ilock_nowait+0x2b/0xdf [xfs]
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1416:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by ffsb/1417:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff81122a7e>] vfs_fsync_range+0x47/0x7c
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
3 locks held by ffsb/1418:
 #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff8110a688>] do_last+0xb8/0x2f9
 #1:  (&(&ip->i_lock)->mr_lock/1){+.+.+.}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
 #2:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa01360a7>] xfs_ilock_nowait+0x59/0xdf [xfs]
2 locks held by flush-253:0/1350:
 #0:  (&type->s_umount_key#24){.+.+.+}, at: [<ffffffff8111f509>] writeback_inodes_wb+0xce/0x13d
 #1:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
5 locks held by git/1471:
 #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff81377a36>] do_page_fault+0x1c2/0x3d6
 #1:  (shrinker_rwsem){++++..}, at: [<ffffffff810d1d71>] shrink_slab+0x3d/0x164
 #2:  (iprune_sem){++++.-}, at: [<ffffffff81114cf7>] shrink_icache_memory+0x4a/0x215
 #3:  (xfs_iolock_reclaimable){+.+.-.}, at: [<ffffffffa013615d>] xfs_ilock+0x30/0xb9 [xfs]
 #4:  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffffa0136190>] xfs_ilock+0x63/0xb9 [xfs]
2 locks held by bash/1472:
 #0:  (sysrq_key_table_lock){......}, at: [<ffffffff81242275>] __handle_sysrq+0x28/0x15c
 #1:  (tasklist_lock){.+.+..}, at: [<ffffffff8107062c>] debug_show_all_locks+0x52/0x19b

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-17 15:51               ` Mike Snitzer
@ 2011-03-17 18:31                 ` Jens Axboe
  2011-03-17 18:46                   ` Mike Snitzer
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-17 18:31 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-kernel, hch

On 2011-03-17 16:51, Mike Snitzer wrote:
> On Tue, Mar 08 2011 at  5:05pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
>> On Tue, Mar 08 2011 at  3:27pm -0500,
>> Jens Axboe <jaxboe@fusionio.com> wrote:
>>
>>> On 2011-03-08 21:21, Mike Snitzer wrote:
>>>> On Tue, Mar 08 2011 at  7:16am -0500,
>>>> Jens Axboe <jaxboe@fusionio.com> wrote:
>>>>
>>>>> On 2011-03-03 23:13, Mike Snitzer wrote:
>>>>>> I'm now hitting a lockdep issue while running a 'for-2.6.39/stack-plug'
>>>>>> kernel, when I try an fsync-heavy workload on a request-based mpath
>>>>>> device (the kernel ultimately goes down in flames; I've yet to look at
>>>>>> the crashdump I took)
>>>>>
>>>>> Mike, can you re-run with the current stack-plug branch? I've fixed the
>>>>> !CONFIG_BLOCK and rebase issues, and also added a change for this flush
>>>>> on schedule event. It's run outside of the runqueue lock now, so
>>>>> hopefully that should solve this one.
>>>>
>>>> Works for me, thanks.
>>>
>>> Super, thanks! Out of curiosity, did you use dm/md?
>>
>> Yes, I've been using a request-based DM multipath device.
> 
> 
> Against the latest 'for-2.6.39/core', I just ran that same fsync-heavy
> workload against XFS (on top of a DM multipath volume).  ffsb induced the
> following hangs (with a ripple effect causing NetworkManager to get hung up on
> this data-only XFS volume, etc.):

Ugh. Care to send the recipe for how to reproduce this? Essentially it
just looks like IO got stuck.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-17 18:31                 ` Jens Axboe
@ 2011-03-17 18:46                   ` Mike Snitzer
  2011-03-18  9:15                     ` hch
  0 siblings, 1 reply; 152+ messages in thread
From: Mike Snitzer @ 2011-03-17 18:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch

[-- Attachment #1: Type: text/plain, Size: 4976 bytes --]

On Thu, Mar 17 2011 at  2:31pm -0400,
Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2011-03-17 16:51, Mike Snitzer wrote:
> > On Tue, Mar 08 2011 at  5:05pm -0500,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> > 
> >> On Tue, Mar 08 2011 at  3:27pm -0500,
> >> Jens Axboe <jaxboe@fusionio.com> wrote:
> >>
> >>> On 2011-03-08 21:21, Mike Snitzer wrote:
> >>>> On Tue, Mar 08 2011 at  7:16am -0500,
> >>>> Jens Axboe <jaxboe@fusionio.com> wrote:
> >>>>
> >>>>> On 2011-03-03 23:13, Mike Snitzer wrote:
> >>>>>> I'm now hitting a lockdep issue while running a 'for-2.6.39/stack-plug'
> >>>>>> kernel, when I try an fsync-heavy workload on a request-based mpath
> >>>>>> device (the kernel ultimately goes down in flames; I've yet to look at
> >>>>>> the crashdump I took)
> >>>>>
> >>>>> Mike, can you re-run with the current stack-plug branch? I've fixed the
> >>>>> !CONFIG_BLOCK and rebase issues, and also added a change for this flush
> >>>>> on schedule event. It's run outside of the runqueue lock now, so
> >>>>> hopefully that should solve this one.
> >>>>
> >>>> Works for me, thanks.
> >>>
> >>> Super, thanks! Out of curiosity, did you use dm/md?
> >>
> >> Yes, I've been using a request-based DM multipath device.
> > 
> > 
> > Against the latest 'for-2.6.39/core', I just ran that same fsync-heavy
> > workload against XFS (on top of a DM multipath volume).  ffsb induced the
> > following hangs (with a ripple effect causing NetworkManager to get hung up on
> > this data-only XFS volume, etc.):
> 
> Ugh. Care to send the recipe for how to reproduce this? Essentially it
> just looks like IO got stuck.

Here is the sequence to reproduce with the attached fsync-happy.ffsb
(I've been running the following in a KVM guest):

<create multipath device>
mkfs.xfs /dev/mapper/mpathb
mount /dev/mapper/mpathb /mnt/test
./ffsb fsync-happy.ffsb

And I just verified that the deadlock does _not_ seem to occur without
DM multipath -- by directly using an underlying SCSI device instead.

So multipath is exposing this somehow (could just be changing timing?).

Mike

P.S. Though I did get this lockdep warning when unmounting the xfs
filesystem:

=================================
[ INFO: inconsistent lock state ]
2.6.38-rc6-snitm+ #8
---------------------------------
inconsistent {IN-RECLAIM_FS-R} -> {RECLAIM_FS-ON-W} usage.
umount/1524 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (iprune_sem){+++++-}, at: [<ffffffff81114a22>] evict_inodes+0x2f/0x107
{IN-RECLAIM_FS-R} state was registered at:
  [<ffffffff810727c2>] __lock_acquire+0x3a4/0xd26
  [<ffffffff81073227>] lock_acquire+0xe3/0x110
  [<ffffffff81372fa2>] down_read+0x51/0x96
  [<ffffffff81114d57>] shrink_icache_memory+0x4a/0x215
  [<ffffffff810d1e48>] shrink_slab+0xe0/0x164
  [<ffffffff810d3e8f>] kswapd+0x5e7/0x9dc
  [<ffffffff8105fb7c>] kthread+0xa0/0xa8
  [<ffffffff81003a24>] kernel_thread_helper+0x4/0x10
irq event stamp: 73433
hardirqs last  enabled at (73433): [<ffffffff81070ffe>] debug_check_no_locks_freed+0x12e/0x145
hardirqs last disabled at (73432): [<ffffffff81070f13>] debug_check_no_locks_freed+0x43/0x145
softirqs last  enabled at (72996): [<ffffffff8104a1f1>] __do_softirq+0x1b4/0x1d3
softirqs last disabled at (72991): [<ffffffff81003b1c>] call_softirq+0x1c/0x28

other info that might help us debug this:
2 locks held by umount/1524:
 #0:  (&type->s_umount_key#24){++++++}, at: [<ffffffff81102a27>] deactivate_super+0x3d/0x4a
 #1:  (iprune_sem){+++++-}, at: [<ffffffff81114a22>] evict_inodes+0x2f/0x107

stack backtrace:
Pid: 1524, comm: umount Not tainted 2.6.38-rc6-snitm+ #8
Call Trace:
 [<ffffffff8107097f>] ? valid_state+0x17e/0x191
 [<ffffffff810712e8>] ? check_usage_backwards+0x0/0x81
 [<ffffffff81070ae4>] ? mark_lock+0x152/0x22d
 [<ffffffff81070c11>] ? mark_held_locks+0x52/0x70
 [<ffffffff81070cc8>] ? lockdep_trace_alloc+0x99/0xbb
 [<ffffffff810fa98a>] ? kmem_cache_alloc+0x30/0x145
 [<ffffffffa014dcdd>] ? kmem_zone_alloc+0x69/0xb1 [xfs]
 [<ffffffffa014dd39>] ? kmem_zone_zalloc+0x14/0x35 [xfs]
 [<ffffffffa0147ed9>] ? _xfs_trans_alloc+0x27/0x64 [xfs]
 [<ffffffffa0148c97>] ? xfs_trans_alloc+0x9f/0xac [xfs]
 [<ffffffff810643b7>] ? up_read+0x23/0x3c
 [<ffffffffa0133000>] ? xfs_iunlock+0x7e/0xbc [xfs]
 [<ffffffffa014c140>] ? xfs_free_eofblocks+0xea/0x1f1 [xfs]
 [<ffffffffa014c707>] ? xfs_inactive+0x108/0x3a6 [xfs]
 [<ffffffff8106ff27>] ? lockdep_init_map+0xa6/0x11b
 [<ffffffffa015787f>] ? xfs_fs_evict_inode+0xf6/0xfe [xfs]
 [<ffffffff811147c6>] ? evict+0x24/0x8c
 [<ffffffff8111485f>] ? dispose_list+0x31/0xaf
 [<ffffffff81114ae3>] ? evict_inodes+0xf0/0x107
 [<ffffffff81101660>] ? generic_shutdown_super+0x5c/0xdf
 [<ffffffff8110170a>] ? kill_block_super+0x27/0x69
 [<ffffffff81101d89>] ? deactivate_locked_super+0x26/0x4b
 [<ffffffff81102a2f>] ? deactivate_super+0x45/0x4a
 [<ffffffff81118b87>] ? mntput_no_expire+0x105/0x10e
 [<ffffffff81119db6>] ? sys_umount+0x2d9/0x304
 [<ffffffff81070e9d>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff81002b82>] ? system_call_fastpath+0x16/0x1b

[-- Attachment #2: fsync-happy.ffsb --]
[-- Type: text/plain, Size: 1642 bytes --]

# Mail server simulation.

time=600
alignio=1
directio=0
#directio=%DIRECTIO%
#directio=0
#callout=/bin/bash
#callout=/usr/local/src/ffsb-6.0-rc2/ltc_tests/enable_lockstat.sh
#callout=/usr/local/src/ffsb-6.0-rc2/ltc_tests/osync.sh

[filesystem0]
	location=/mnt/test
	num_files=100000
	num_dirs=1000

	reuse=1
	# File sizes range from 1kB to 1MB.
	size_weight 1KB 10
	size_weight 2KB 15
	size_weight 4KB 16
	size_weight 8KB 16
	size_weight 16KB 15
	size_weight 32KB 10
	size_weight 64KB 8
	size_weight 128KB 4
	size_weight 256KB 3
	size_weight 512KB 2
	size_weight 1MB 1

	create_blocksize=1048576
[end0]

[threadgroup0]
	num_threads=64

	readall_weight=4
	create_fsync_weight=2
	delete_weight=1

	append_weight		= 1
	append_fsync_weight	= 1
	stat_weight		= 1
#	write_weight		= 1
#	write_fsync_weight	= 1
#	read_weight		= 1
	create_weight		= 1
	writeall_weight		= 1
	writeall_fsync_weight	= 1
	open_close_weight	= 1


	write_size=64KB
	write_blocksize=512KB

	read_size=64KB
	read_blocksize=512KB

	[stats]
		enable_stats=1
		enable_range=1

		msec_range    0.00      0.01
		msec_range    0.01      0.02
		msec_range    0.02      0.05
		msec_range    0.05      0.10
		msec_range    0.10      0.20
		msec_range    0.20      0.50
		msec_range    0.50      1.00
		msec_range    1.00      2.00
		msec_range    2.00      5.00
		msec_range    5.00     10.00
		msec_range   10.00     20.00
		msec_range   20.00     50.00
		msec_range   50.00    100.00
		msec_range  100.00    200.00
		msec_range  200.00    500.00
		msec_range  500.00   1000.00
		msec_range 1000.00   2000.00
		msec_range 2000.00   5000.00
		msec_range 5000.00  10000.00
	[end]
[end0]

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-03-17  9:44           ` Jens Axboe
@ 2011-03-18  1:55             ` Shaohua Li
  0 siblings, 0 replies; 152+ messages in thread
From: Shaohua Li @ 2011-03-18  1:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, hch, jmoyer, Vivek Goyal

On Thu, 2011-03-17 at 17:44 +0800, Jens Axboe wrote:
> On 2011-03-17 04:19, Shaohua Li wrote:
> > On Thu, 2011-03-17 at 09:00 +0800, Shaohua Li wrote:
> >> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> >>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> >>>> 2011/1/22 Jens Axboe <jaxboe@fusionio.com>:
> >>>>> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> >>>>> ---
> >>>>>  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
> >>>>>  block/elevator.c          |    6 +-
> >>>>>  include/linux/blk_types.h |    2 +
> >>>>>  include/linux/blkdev.h    |   30 ++++
> >>>>>  include/linux/elevator.h  |    1 +
> >>>>>  include/linux/sched.h     |    6 +
> >>>>>  kernel/exit.c             |    1 +
> >>>>>  kernel/fork.c             |    3 +
> >>>>>  kernel/sched.c            |   11 ++-
> >>>>>  9 files changed, 317 insertions(+), 100 deletions(-)
> >>>>>
> >>>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>>> index 960f12c..42dbfcc 100644
> >>>>> --- a/block/blk-core.c
> >>>>> +++ b/block/blk-core.c
> >>>>> @@ -27,6 +27,7 @@
> >>>>>  #include <linux/writeback.h>
> >>>>>  #include <linux/task_io_accounting_ops.h>
> >>>>>  #include <linux/fault-inject.h>
> >>>>> +#include <linux/list_sort.h>
> >>>>>
> >>>>>  #define CREATE_TRACE_POINTS
> >>>>>  #include <trace/events/block.h>
> >>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> >>>>>
> >>>>>        q = container_of(work, struct request_queue, delay_work.work);
> >>>>>        spin_lock_irq(q->queue_lock);
> >>>>> -       q->request_fn(q);
> >>>>> +       __blk_run_queue(q);
> >>>>>        spin_unlock_irq(q->queue_lock);
> >>>>>  }
> >>>> Hi Jens,
> >>>> I have some questions about the per-task plugging. Since the request
> >>>> list is per-task, each task delivers its requests when it flushes the
> >>>> plug or schedules. But when one CPU delivers its requests to the global
> >>>> queue, the other CPUs don't know. This seems problematic. For example:
> >>>> 1. get_request_wait() can only flush the current task's request list;
> >>>> other CPUs/tasks might still have a lot of requests, which aren't sent
> >>>> to the request_queue.
> >>>
> >>> But these requests will be sent to the request queue very soon, as soon
> >>> as the task is either scheduled out or explicitly flushes the plug. So
> >>> we might wait a bit longer, but that might not matter in general, I guess.
> >> Yes, I understand there is just a bit of delay. I don't know how severe
> >> it is, but this could still be a problem, especially for fast storage or
> >> random I/O. My current tests show a slight regression (3% or so) with
> >> Jens's for-2.6.39/core branch. I'm still checking whether it's caused by
> >> the per-task plug, but the per-task plug is highly suspect.
> >>
> >>>> Your ioc-rq-alloc branch is for this, right? Will it
> >>>> be pushed to 2.6.39 too? I'm wondering if we should limit the per-task
> >>>> queue length. If there are enough requests there, we force a flush of
> >>>> the plug.
> >>>
> >>> That's the idea Jens had. But then came the question of maintaining
> >>> data structures per task per disk. That makes it complicated.
> >>>
> >>> Even if we move the accounting out of the request queue and do it, say,
> >>> at the bdi, ideally we would have to do per-task per-bdi accounting.
> >>>
> >>> Jens seemed to be suggesting that flusher threads are generally the
> >>> main culprit for submitting large amounts of IO. They are already per
> >>> bdi. So probably just maintain a per-task limit for flusher threads.
> >> Yep, the flusher is the main spot in my mind. We need to call the flush
> >> plug more often for the flusher thread.
> >>
> >>> I am not sure what happens to the direct reclaim path, AIO deep queue
> >>> paths, etc.
> >> The direct reclaim path could build a deep write queue too. It
> >> uses .writepage, and currently there is no flush plug there. Maybe we
> >> need to add a flush plug in shrink_inactive_list too.
> >>
> >>>> 2. Some APIs like blk_delay_work, which call __blk_run_queue(), might
> >>>> not work, because other CPUs might not have dispatched their requests
> >>>> to the request queue. So __blk_run_queue will eventually find no
> >>>> requests, which might stall devices.
> >>>> Since one CPU doesn't know other CPUs' request lists, I'm wondering if
> >>>> there are other similar issues.
> >>>
> >>> So again, in this case, if the queue is empty at the time of
> >>> __blk_run_queue(), then we will probably just experience a little more
> >>> delay than intended until some task flushes. But it should not stall
> >>> the system?
> >> It does not stall the system, but the device stalls for a little while.
> > Jens,
> > I need the patch below to recover an ffsb fsync workload, which shows
> > about a 30% regression with stack plug.
> > I guess the reason is that WRITE_SYNC_PLUG doesn't work now, so if a
> > context doesn't have a blk_plug, we lose the previous plugging (request
> > merging). This suggests that all the places where we used WRITE_SYNC_PLUG
> > before (for example, kjournald) should have a blk_plug context.
> 
> Good point, those should be auto-converted. I'll take this patch and
> double check the others. Thanks!
> 
> Does it remove that performance regression completely?
Yes, it removes the regression completely on my side.
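
As background for the WRITE_SYNC_PLUG point: with the on-stack scheme, a
submitter that wants its IO batched has to open an explicit plug context
itself. The sketch below only illustrates the shape of that API as discussed
in this series; submit_pending_bios() is a hypothetical placeholder, not a
real kernel function.

```c
/*
 * Hedged sketch of an explicit on-stack plug context, the kind of
 * context that call sites previously relying on WRITE_SYNC_PLUG
 * (e.g. kjournald) would need.  submit_pending_bios() stands in for
 * whatever loop actually builds and submits the bios.
 */
struct blk_plug plug;

blk_start_plug(&plug);		/* subsequent IO queues up on the plug list */
submit_pending_bios();		/* e.g. repeated submit_bio() calls */
blk_finish_plug(&plug);		/* flush the batched requests to the queue */
```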


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-03-17  9:43         ` Jens Axboe
@ 2011-03-18  6:36           ` Shaohua Li
  2011-03-18 12:54             ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Shaohua Li @ 2011-03-18  6:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vivek Goyal, linux-kernel, hch, jmoyer

On Thu, 2011-03-17 at 17:43 +0800, Jens Axboe wrote:
> On 2011-03-17 02:00, Shaohua Li wrote:
> > On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> >> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> >>> 2011/1/22 Jens Axboe <jaxboe@fusionio.com>:
> >>>> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> >>>> ---
> >>>>  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
> >>>>  block/elevator.c          |    6 +-
> >>>>  include/linux/blk_types.h |    2 +
> >>>>  include/linux/blkdev.h    |   30 ++++
> >>>>  include/linux/elevator.h  |    1 +
> >>>>  include/linux/sched.h     |    6 +
> >>>>  kernel/exit.c             |    1 +
> >>>>  kernel/fork.c             |    3 +
> >>>>  kernel/sched.c            |   11 ++-
> >>>>  9 files changed, 317 insertions(+), 100 deletions(-)
> >>>>
> >>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>> index 960f12c..42dbfcc 100644
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -27,6 +27,7 @@
> >>>>  #include <linux/writeback.h>
> >>>>  #include <linux/task_io_accounting_ops.h>
> >>>>  #include <linux/fault-inject.h>
> >>>> +#include <linux/list_sort.h>
> >>>>
> >>>>  #define CREATE_TRACE_POINTS
> >>>>  #include <trace/events/block.h>
> >>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> >>>>
> >>>>        q = container_of(work, struct request_queue, delay_work.work);
> >>>>        spin_lock_irq(q->queue_lock);
> >>>> -       q->request_fn(q);
> >>>> +       __blk_run_queue(q);
> >>>>        spin_unlock_irq(q->queue_lock);
> >>>>  }
> >>> Hi Jens,
> >>> I have some questions about the per-task plugging. Since the request
> >>> list is per-task, each task delivers its requests when it flushes the
> >>> plug or schedules. But when one CPU delivers its requests to the global
> >>> queue, the other CPUs don't know. This seems problematic. For example:
> >>> 1. get_request_wait() can only flush the current task's request list;
> >>> other CPUs/tasks might still have a lot of requests, which aren't sent
> >>> to the request_queue.
> >>
> >> But these requests will be sent to the request queue very soon, as soon
> >> as the task is either scheduled out or explicitly flushes the plug. So
> >> we might wait a bit longer, but that might not matter in general, I guess.
> > Yes, I understand there is just a bit of delay. I don't know how severe
> > it is, but this could still be a problem, especially for fast storage or
> > random I/O. My current tests show a slight regression (3% or so) with
> > Jens's for-2.6.39/core branch. I'm still checking whether it's caused by
> > the per-task plug, but the per-task plug is highly suspect.
> 
> To check this particular case, you can always just bump the request
> limit. What test is showing a slowdown? 
This is a simple multi-threaded sequential read. The issue seems to be
request-merge related (not verified yet). Merging drops by about 60% with
stack plug, going by fio-reported data. From traces, without stack plug,
requests from different threads get merged. But with it, such merging is
impossible because flush_plug doesn't check for merges; I think we need
to add that again.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-17 18:46                   ` Mike Snitzer
@ 2011-03-18  9:15                     ` hch
  0 siblings, 0 replies; 152+ messages in thread
From: hch @ 2011-03-18  9:15 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-kernel, hch

> p.s. though I did get this lockdep warning when unmounting the xfs
> filesystem:

This is fixed by commit bab1d9444d9a147f1dc3478dd06c16f490227f3e

	"prune back iprune_sem"

which hit mainline this week.


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task  plugging
  2011-03-18  6:36           ` Shaohua Li
@ 2011-03-18 12:54             ` Jens Axboe
  2011-03-18 13:52               ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-18 12:54 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Vivek Goyal, linux-kernel, hch, jmoyer

On 2011-03-18 07:36, Shaohua Li wrote:
> On Thu, 2011-03-17 at 17:43 +0800, Jens Axboe wrote:
>> On 2011-03-17 02:00, Shaohua Li wrote:
>>> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
>>>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
>>>>> 2011/1/22 Jens Axboe <jaxboe@fusionio.com>:
>>>>>> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
>>>>>> ---
>>>>>>  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
>>>>>>  block/elevator.c          |    6 +-
>>>>>>  include/linux/blk_types.h |    2 +
>>>>>>  include/linux/blkdev.h    |   30 ++++
>>>>>>  include/linux/elevator.h  |    1 +
>>>>>>  include/linux/sched.h     |    6 +
>>>>>>  kernel/exit.c             |    1 +
>>>>>>  kernel/fork.c             |    3 +
>>>>>>  kernel/sched.c            |   11 ++-
>>>>>>  9 files changed, 317 insertions(+), 100 deletions(-)
>>>>>>
>>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>>> index 960f12c..42dbfcc 100644
>>>>>> --- a/block/blk-core.c
>>>>>> +++ b/block/blk-core.c
>>>>>> @@ -27,6 +27,7 @@
>>>>>>  #include <linux/writeback.h>
>>>>>>  #include <linux/task_io_accounting_ops.h>
>>>>>>  #include <linux/fault-inject.h>
>>>>>> +#include <linux/list_sort.h>
>>>>>>
>>>>>>  #define CREATE_TRACE_POINTS
>>>>>>  #include <trace/events/block.h>
>>>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
>>>>>>
>>>>>>        q = container_of(work, struct request_queue, delay_work.work);
>>>>>>        spin_lock_irq(q->queue_lock);
>>>>>> -       q->request_fn(q);
>>>>>> +       __blk_run_queue(q);
>>>>>>        spin_unlock_irq(q->queue_lock);
>>>>>>  }
>>>>> Hi Jens,
>>>>> I have some questions about the per-task plugging. Since the request
>>>>> list is per-task, each task delivers its requests when it flushes the
>>>>> plug or schedules. But when one CPU delivers its requests to the global
>>>>> queue, the other CPUs don't know. This seems problematic. For example:
>>>>> 1. get_request_wait() can only flush the current task's request list;
>>>>> other CPUs/tasks might still have a lot of requests, which aren't sent
>>>>> to the request_queue.
>>>>
>>>> But these requests will be sent to the request queue very soon, as soon
>>>> as the task is either scheduled out or explicitly flushes the plug. So
>>>> we might wait a bit longer, but that might not matter in general, I guess.
>>> Yes, I understand there is just a bit of delay. I don't know how severe
>>> it is, but this could still be a problem, especially for fast storage or
>>> random I/O. My current tests show a slight regression (3% or so) with
>>> Jens's for-2.6.39/core branch. I'm still checking whether it's caused by
>>> the per-task plug, but the per-task plug is highly suspect.
>>
>> To check this particular case, you can always just bump the request
>> limit. What test is showing a slowdown? 
> This is a simple multi-threaded sequential read. The issue seems to be
> request-merge related (not verified yet). Merging drops by about 60% with
> stack plug, going by fio-reported data. From traces, without stack plug,
> requests from different threads get merged. But with it, such merging is
> impossible because flush_plug doesn't check for merges; I think we need
> to add that again.

What we could try is to have the plug flush insert use
ELEVATOR_INSERT_SORT_MERGE and have it look up potential backmerges.

Here's a quick hack that does that; I have not tested it at all.

diff --git a/block/blk-core.c b/block/blk-core.c
index e1fcf7a..5256932 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2685,7 +2685,7 @@ static void flush_plug_list(struct blk_plug *plug)
 		/*
 		 * rq is already accounted, so use raw insert
 		 */
-		__elv_add_request(q, rq, ELEVATOR_INSERT_SORT);
+		__elv_add_request(q, rq, ELEVATOR_INSERT_SORT_MERGE);
 	}
 
 	if (q) {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index ea85e20..cfcc37c 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -465,3 +465,9 @@ int attempt_front_merge(struct request_queue *q, struct request *rq)
 
 	return 0;
 }
+
+int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
+			  struct request *next)
+{
+	return attempt_merge(q, rq, next);
+}
diff --git a/block/blk.h b/block/blk.h
index 49d21af..c8db371 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -103,6 +103,8 @@ int ll_front_merge_fn(struct request_queue *q, struct request *req,
 		      struct bio *bio);
 int attempt_back_merge(struct request_queue *q, struct request *rq);
 int attempt_front_merge(struct request_queue *q, struct request *rq);
+int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
+				struct request *next);
 void blk_recalc_rq_segments(struct request *rq);
 void blk_rq_set_mixed_merge(struct request *rq);
 
diff --git a/block/elevator.c b/block/elevator.c
index 542ce82..f493e18 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -521,6 +521,33 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	return ELEVATOR_NO_MERGE;
 }
 
+/*
+ * Returns true if we merged, false otherwise
+ */
+static bool elv_attempt_insert_merge(struct request_queue *q,
+				     struct request *rq)
+{
+	struct request *__rq;
+
+	if (blk_queue_nomerges(q) || blk_queue_noxmerges(q))
+		return false;
+
+	/*
+	 * First try one-hit cache.
+	 */
+	if (q->last_merge && blk_attempt_req_merge(q, rq, q->last_merge))
+		return true;
+
+	/*
+	 * See if our hash lookup can find a potential backmerge.
+	 */
+	__rq = elv_rqhash_find(q, blk_rq_pos(rq));
+	if (__rq && blk_attempt_req_merge(q, rq, __rq))
+		return true;
+
+	return false;
+}
+
 void elv_merged_request(struct request_queue *q, struct request *rq, int type)
 {
 	struct elevator_queue *e = q->elevator;
@@ -647,6 +674,9 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 		__blk_run_queue(q, false);
 		break;
 
+	case ELEVATOR_INSERT_SORT_MERGE:
+		if (elv_attempt_insert_merge(q, rq))
+			break;
 	case ELEVATOR_INSERT_SORT:
 		BUG_ON(rq->cmd_type != REQ_TYPE_FS &&
 		       !(rq->cmd_flags & REQ_DISCARD));
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index ec6f72b..d93efcc445 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -166,6 +166,7 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
 #define ELEVATOR_INSERT_SORT	3
 #define ELEVATOR_INSERT_REQUEUE	4
 #define ELEVATOR_INSERT_FLUSH	5
+#define ELEVATOR_INSERT_SORT_MERGE	6
 
 /*
  * return values from elevator_may_queue_fn


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task  plugging
  2011-03-18 12:54             ` Jens Axboe
@ 2011-03-18 13:52               ` Jens Axboe
  2011-03-21  6:52                 ` Shaohua Li
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-18 13:52 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Vivek Goyal, linux-kernel, hch, jmoyer

On 2011-03-18 13:54, Jens Axboe wrote:
> On 2011-03-18 07:36, Shaohua Li wrote:
>> On Thu, 2011-03-17 at 17:43 +0800, Jens Axboe wrote:
>>> On 2011-03-17 02:00, Shaohua Li wrote:
>>>> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
>>>>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
>>>>>> 2011/1/22 Jens Axboe <jaxboe@fusionio.com>:
>>>>>>> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
>>>>>>> ---
>>>>>>>  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
>>>>>>>  block/elevator.c          |    6 +-
>>>>>>>  include/linux/blk_types.h |    2 +
>>>>>>>  include/linux/blkdev.h    |   30 ++++
>>>>>>>  include/linux/elevator.h  |    1 +
>>>>>>>  include/linux/sched.h     |    6 +
>>>>>>>  kernel/exit.c             |    1 +
>>>>>>>  kernel/fork.c             |    3 +
>>>>>>>  kernel/sched.c            |   11 ++-
>>>>>>>  9 files changed, 317 insertions(+), 100 deletions(-)
>>>>>>>
>>>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>>>> index 960f12c..42dbfcc 100644
>>>>>>> --- a/block/blk-core.c
>>>>>>> +++ b/block/blk-core.c
>>>>>>> @@ -27,6 +27,7 @@
>>>>>>>  #include <linux/writeback.h>
>>>>>>>  #include <linux/task_io_accounting_ops.h>
>>>>>>>  #include <linux/fault-inject.h>
>>>>>>> +#include <linux/list_sort.h>
>>>>>>>
>>>>>>>  #define CREATE_TRACE_POINTS
>>>>>>>  #include <trace/events/block.h>
>>>>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
>>>>>>>
>>>>>>>        q = container_of(work, struct request_queue, delay_work.work);
>>>>>>>        spin_lock_irq(q->queue_lock);
>>>>>>> -       q->request_fn(q);
>>>>>>> +       __blk_run_queue(q);
>>>>>>>        spin_unlock_irq(q->queue_lock);
>>>>>>>  }
>>>>>> Hi Jens,
>>>>>> I have some questions about the per-task plugging. Since the request
>>>>>> list is per-task, each task delivers its requests when it flushes the
>>>>>> plug or schedules. But when one CPU delivers its requests to the global
>>>>>> queue, the other CPUs don't know. This seems problematic. For example:
>>>>>> 1. get_request_wait() can only flush the current task's request list;
>>>>>> other CPUs/tasks might still have a lot of requests, which aren't sent
>>>>>> to the request_queue.
>>>>>
>>>>> But these requests will be sent to the request queue very soon, as soon
>>>>> as the task is either scheduled out or explicitly flushes the plug. So
>>>>> we might wait a bit longer, but that might not matter in general, I guess.
>>>> Yes, I understand there is just a bit of delay. I don't know how severe
>>>> it is, but this could still be a problem, especially for fast storage or
>>>> random I/O. My current tests show a slight regression (3% or so) with
>>>> Jens's for-2.6.39/core branch. I'm still checking whether it's caused by
>>>> the per-task plug, but the per-task plug is highly suspect.
>>>
>>> To check this particular case, you can always just bump the request
>>> limit. What test is showing a slowdown? 
>> This is a simple multi-threaded sequential read. The issue seems to be
>> request-merge related (not verified yet). Merging drops by about 60% with
>> stack plug, going by fio-reported data. From traces, without stack plug,
>> requests from different threads get merged. But with it, such merging is
>> impossible because flush_plug doesn't check for merges; I think we need
>> to add that again.
> 
> What we could try is have the plug flush insert be
> ELEVATOR_INSERT_SORT_MERGE and have it lookup potential backmerges.
> 
> Here's a quick hack that does that, I have not tested it at all.

Gave it a quick test spin; as suspected, it had a few issues. This one
seems to work. Can you toss it through that workload and see if it fares
better?

diff --git a/block/blk-core.c b/block/blk-core.c
index e1fcf7a..e1b29e7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -55,7 +55,7 @@ struct kmem_cache *blk_requestq_cachep;
  */
 static struct workqueue_struct *kblockd_workqueue;
 
-static void drive_stat_acct(struct request *rq, int new_io)
+void drive_stat_acct(struct request *rq, int new_io)
 {
 	struct hd_struct *part;
 	int rw = rq_data_dir(rq);
@@ -2685,7 +2685,7 @@ static void flush_plug_list(struct blk_plug *plug)
 		/*
 		 * rq is already accounted, so use raw insert
 		 */
-		__elv_add_request(q, rq, ELEVATOR_INSERT_SORT);
+		__elv_add_request(q, rq, ELEVATOR_INSERT_SORT_MERGE);
 	}
 
 	if (q) {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index ea85e20..27a7926 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -429,12 +429,14 @@ static int attempt_merge(struct request_queue *q, struct request *req,
 
 	req->__data_len += blk_rq_bytes(next);
 
-	elv_merge_requests(q, req, next);
+	if (next->cmd_flags & REQ_SORTED) {
+		elv_merge_requests(q, req, next);
 
-	/*
-	 * 'next' is going away, so update stats accordingly
-	 */
-	blk_account_io_merge(next);
+		/*
+		 * 'next' is going away, so update stats accordingly
+		 */
+		blk_account_io_merge(next);
+	}
 
 	req->ioprio = ioprio_best(req->ioprio, next->ioprio);
 	if (blk_rq_cpu_valid(next))
@@ -465,3 +467,15 @@ int attempt_front_merge(struct request_queue *q, struct request *rq)
 
 	return 0;
 }
+
+int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
+			  struct request *next)
+{
+	int ret;
+
+	ret = attempt_merge(q, rq, next);
+	if (ret)
+		drive_stat_acct(rq, 0);
+
+	return ret;
+}
diff --git a/block/blk.h b/block/blk.h
index 49d21af..5b8ecbf 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -103,6 +103,9 @@ int ll_front_merge_fn(struct request_queue *q, struct request *req,
 		      struct bio *bio);
 int attempt_back_merge(struct request_queue *q, struct request *rq);
 int attempt_front_merge(struct request_queue *q, struct request *rq);
+int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
+				struct request *next);
+void drive_stat_acct(struct request *rq, int new_io);
 void blk_recalc_rq_segments(struct request *rq);
 void blk_rq_set_mixed_merge(struct request *rq);
 
diff --git a/block/elevator.c b/block/elevator.c
index 542ce82..88bdf81 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -521,6 +521,33 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	return ELEVATOR_NO_MERGE;
 }
 
+/*
+ * Returns true if we merged, false otherwise
+ */
+static bool elv_attempt_insert_merge(struct request_queue *q,
+				     struct request *rq)
+{
+	struct request *__rq;
+
+	if (blk_queue_nomerges(q) || blk_queue_noxmerges(q))
+		return false;
+
+	/*
+	 * First try one-hit cache.
+	 */
+	if (q->last_merge && blk_attempt_req_merge(q, q->last_merge, rq))
+		return true;
+
+	/*
+	 * See if our hash lookup can find a potential backmerge.
+	 */
+	__rq = elv_rqhash_find(q, blk_rq_pos(rq));
+	if (__rq && blk_attempt_req_merge(q, __rq, rq))
+		return true;
+
+	return false;
+}
+
 void elv_merged_request(struct request_queue *q, struct request *rq, int type)
 {
 	struct elevator_queue *e = q->elevator;
@@ -647,6 +674,9 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 		__blk_run_queue(q, false);
 		break;
 
+	case ELEVATOR_INSERT_SORT_MERGE:
+		if (elv_attempt_insert_merge(q, rq))
+			break;
 	case ELEVATOR_INSERT_SORT:
 		BUG_ON(rq->cmd_type != REQ_TYPE_FS &&
 		       !(rq->cmd_flags & REQ_DISCARD));
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index ec6f72b..d93efcc 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -166,6 +166,7 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
 #define ELEVATOR_INSERT_SORT	3
 #define ELEVATOR_INSERT_REQUEUE	4
 #define ELEVATOR_INSERT_FLUSH	5
+#define ELEVATOR_INSERT_SORT_MERGE	6
 
 /*
  * return values from elevator_may_queue_fn

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-03-18 13:52               ` Jens Axboe
@ 2011-03-21  6:52                 ` Shaohua Li
  2011-03-21  9:20                   ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Shaohua Li @ 2011-03-21  6:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vivek Goyal, linux-kernel, hch, jmoyer

On Fri, 2011-03-18 at 21:52 +0800, Jens Axboe wrote:
> On 2011-03-18 13:54, Jens Axboe wrote:
> > On 2011-03-18 07:36, Shaohua Li wrote:
> >> On Thu, 2011-03-17 at 17:43 +0800, Jens Axboe wrote:
> >>> On 2011-03-17 02:00, Shaohua Li wrote:
> >>>> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> >>>>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> >>>>>> 2011/1/22 Jens Axboe <jaxboe@fusionio.com>:
> >>>>>>> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> >>>>>>> ---
> >>>>>>>  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
> >>>>>>>  block/elevator.c          |    6 +-
> >>>>>>>  include/linux/blk_types.h |    2 +
> >>>>>>>  include/linux/blkdev.h    |   30 ++++
> >>>>>>>  include/linux/elevator.h  |    1 +
> >>>>>>>  include/linux/sched.h     |    6 +
> >>>>>>>  kernel/exit.c             |    1 +
> >>>>>>>  kernel/fork.c             |    3 +
> >>>>>>>  kernel/sched.c            |   11 ++-
> >>>>>>>  9 files changed, 317 insertions(+), 100 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>>>>> index 960f12c..42dbfcc 100644
> >>>>>>> --- a/block/blk-core.c
> >>>>>>> +++ b/block/blk-core.c
> >>>>>>> @@ -27,6 +27,7 @@
> >>>>>>>  #include <linux/writeback.h>
> >>>>>>>  #include <linux/task_io_accounting_ops.h>
> >>>>>>>  #include <linux/fault-inject.h>
> >>>>>>> +#include <linux/list_sort.h>
> >>>>>>>
> >>>>>>>  #define CREATE_TRACE_POINTS
> >>>>>>>  #include <trace/events/block.h>
> >>>>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> >>>>>>>
> >>>>>>>        q = container_of(work, struct request_queue, delay_work.work);
> >>>>>>>        spin_lock_irq(q->queue_lock);
> >>>>>>> -       q->request_fn(q);
> >>>>>>> +       __blk_run_queue(q);
> >>>>>>>        spin_unlock_irq(q->queue_lock);
> >>>>>>>  }
> >>>>>> Hi Jens,
> >>>>>> I have some questions about the per-task plugging. Since the request
> >>>>>> list is per-task, and each task delivers its requests at finish flush
> >>>>>> or schedule. But when one cpu delivers requests to global queue, other
> >>>>>> cpus don't know. This seems to have problem. For example:
> >>>>>> 1. get_request_wait() can only flush current task's request list,
> >>>>>> other cpus/tasks might still have a lot of requests, which aren't sent
> >>>>>> to request_queue.
> >>>>>
> >>>>> But very soon these requests will be sent to request queue as soon task
> >>>>> is either scheduled out or task explicitly flushes the plug? So we might
> >>>>> wait a bit longer but that might not matter in general, i guess. 
> >>>> Yes, I understand there is just a bit delay. I don't know how severe it
> >>>> is, but this still could be a problem, especially for fast storage or
> >>>> random I/O. My current tests show slight regression (3% or so) with
> >>>> Jens's for 2.6.39/core branch. I'm still checking if it's caused by the
> >>>> per-task plug, but the per-task plug is highly suspected.
> >>>
> >>> To check this particular case, you can always just bump the request
> >>> limit. What test is showing a slowdown? 
> >> this is a simple multi-threaded seq read. The issue tends to be request
> >> merge related (not verified yet). The merge reduces about 60% with stack
> >> plug from fio reported data. From trace, without stack plug, requests
> >> from different threads get merged. But with it, such merge is impossible
> >> because flush_plug doesn't check merge, I thought we need add it again.
> > 
> > What we could try is have the plug flush insert be
> > ELEVATOR_INSERT_SORT_MERGE and have it lookup potential backmerges.
> > 
> > Here's a quick hack that does that, I have not tested it at all.
> 
> Gave it a quick test spin, as suspected it had a few issues. This one
> seems to work. Can you toss it through that workload and see if it fares
> better?
yes, this fully fixes the regression I saw. But I have an accounting
issue:
1. The merged request is already accounted when it's added into the plug
list.
2. drive_stat_acct() is called without any protection in
__make_request(), so there is a race in the in_flight accounting. The
race dates from when stack plugging was added, so it isn't caused by
this patch.
Below is the extra patch I need for the test.

---
 block/blk-merge.c     |   12 +++++-------
 block/elevator.c      |    9 ++++++---
 drivers/md/dm.c       |    7 ++++---
 fs/partitions/check.c |    3 ++-
 include/linux/genhd.h |   12 ++++++------
 5 files changed, 23 insertions(+), 20 deletions(-)

Index: linux-2.6/block/blk-merge.c
===================================================================
--- linux-2.6.orig/block/blk-merge.c
+++ linux-2.6/block/blk-merge.c
@@ -429,14 +429,12 @@ static int attempt_merge(struct request_
 
 	req->__data_len += blk_rq_bytes(next);
 
-	if (next->cmd_flags & REQ_SORTED) {
-		elv_merge_requests(q, req, next);
+	elv_merge_requests(q, req, next);
 
-		/*
-		 * 'next' is going away, so update stats accordingly
-		 */
-		blk_account_io_merge(next);
-	}
+	/*
+	 * 'next' is going away, so update stats accordingly
+	 */
+	blk_account_io_merge(next);
 
 	req->ioprio = ioprio_best(req->ioprio, next->ioprio);
 	if (blk_rq_cpu_valid(next))
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c
+++ linux-2.6/block/elevator.c
@@ -566,13 +566,16 @@ void elv_merge_requests(struct request_q
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_merge_req_fn)
+	if ((next->cmd_flags & REQ_SORTED) && e->ops->elevator_merge_req_fn)
 		e->ops->elevator_merge_req_fn(q, rq, next);
 
 	elv_rqhash_reposition(q, rq);
-	elv_rqhash_del(q, next);
 
-	q->nr_sorted--;
+	if (next->cmd_flags & REQ_SORTED) {
+		elv_rqhash_del(q, next);
+		q->nr_sorted--;
+	}
+
 	q->last_merge = rq;
 }
 
Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c
+++ linux-2.6/drivers/md/dm.c
@@ -477,7 +477,8 @@ static void start_io_acct(struct dm_io *
 	cpu = part_stat_lock();
 	part_round_stats(cpu, &dm_disk(md)->part0);
 	part_stat_unlock();
-	dm_disk(md)->part0.in_flight[rw] = atomic_inc_return(&md->pending[rw]);
+	atomic_set(&dm_disk(md)->part0.in_flight[rw],
+		atomic_inc_return(&md->pending[rw]));
 }
 
 static void end_io_acct(struct dm_io *io)
@@ -497,8 +498,8 @@ static void end_io_acct(struct dm_io *io
 	 * After this is decremented the bio must not be touched if it is
 	 * a flush.
 	 */
-	dm_disk(md)->part0.in_flight[rw] = pending =
-		atomic_dec_return(&md->pending[rw]);
+	pending = atomic_dec_return(&md->pending[rw]);
+	atomic_set(&dm_disk(md)->part0.in_flight[rw], pending);
 	pending += atomic_read(&md->pending[rw^0x1]);
 
 	/* nudge anyone waiting on suspend queue */
Index: linux-2.6/fs/partitions/check.c
===================================================================
--- linux-2.6.orig/fs/partitions/check.c
+++ linux-2.6/fs/partitions/check.c
@@ -290,7 +290,8 @@ ssize_t part_inflight_show(struct device
 {
 	struct hd_struct *p = dev_to_part(dev);
 
-	return sprintf(buf, "%8u %8u\n", p->in_flight[0], p->in_flight[1]);
+	return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
+		atomic_read(&p->in_flight[1]));
 }
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
Index: linux-2.6/include/linux/genhd.h
===================================================================
--- linux-2.6.orig/include/linux/genhd.h
+++ linux-2.6/include/linux/genhd.h
@@ -109,7 +109,7 @@ struct hd_struct {
 	int make_it_fail;
 #endif
 	unsigned long stamp;
-	int in_flight[2];
+	atomic_t in_flight[2];
 #ifdef	CONFIG_SMP
 	struct disk_stats __percpu *dkstats;
 #else
@@ -370,21 +370,21 @@ static inline void free_part_stats(struc
 
 static inline void part_inc_in_flight(struct hd_struct *part, int rw)
 {
-	part->in_flight[rw]++;
+	atomic_inc(&part->in_flight[rw]);
 	if (part->partno)
-		part_to_disk(part)->part0.in_flight[rw]++;
+		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
 }
 
 static inline void part_dec_in_flight(struct hd_struct *part, int rw)
 {
-	part->in_flight[rw]--;
+	atomic_dec(&part->in_flight[rw]);
 	if (part->partno)
-		part_to_disk(part)->part0.in_flight[rw]--;
+		atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
 }
 
 static inline int part_in_flight(struct hd_struct *part)
 {
-	return part->in_flight[0] + part->in_flight[1];
+	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
 }
 
 static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)



^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task  plugging
  2011-03-21  6:52                 ` Shaohua Li
@ 2011-03-21  9:20                   ` Jens Axboe
  2011-03-22  0:32                     ` Shaohua Li
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-03-21  9:20 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Vivek Goyal, linux-kernel, hch, jmoyer

On 2011-03-21 07:52, Shaohua Li wrote:
>> Gave it a quick test spin, as suspected it had a few issues. This one
>> seems to work. Can you toss it through that workload and see if it fares
>> better?
> yes, this fully restores the regression I saw. But I have accounting
> issue:

Great!

> 1. The merged request is already accounted when it's added into plug
> list

Good catch. I've updated the patch and merged it now, integrating this
accounting fix.

> 2. drive_stat_acct() is called without any protection in
> __make_request(). So there is race for in_flight accounting. The race
> exists after stack plug is added, so not because of this issue.
> Below is the extra patch I need to do the test.

Looks fine. Can I add your signed-off-by to this patch? I'll merge it as
a separate fix.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
  2011-03-21  9:20                   ` Jens Axboe
@ 2011-03-22  0:32                     ` Shaohua Li
  2011-03-22  7:36                       ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Shaohua Li @ 2011-03-22  0:32 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vivek Goyal, linux-kernel, hch, jmoyer

On Mon, 2011-03-21 at 17:20 +0800, Jens Axboe wrote:
> On 2011-03-21 07:52, Shaohua Li wrote:
> >> Gave it a quick test spin, as suspected it had a few issues. This one
> >> seems to work. Can you toss it through that workload and see if it fares
> >> better?
> > yes, this fully restores the regression I saw. But I have accounting
> > issue:
> 
> Great!
> 
> > 1. The merged request is already accounted when it's added into plug
> > list
> 
> Good catch. I've updated the patch and merged it now, integrating this
> accounting fix.
> 
> > 2. drive_stat_acct() is called without any protection in
> > __make_request(). So there is race for in_flight accounting. The race
> > exists after stack plug is added, so not because of this issue.
> > Below is the extra patch I need to do the test.
> 
> Looks fine. Can I add your signed-off-by to this patch? I'll merge it as
> a separate fix.
Sure.
Signed-off-by: Shaohua Li<shaohua.li@intel.com>


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/10] block: initial patch for on-stack per-task  plugging
  2011-03-22  0:32                     ` Shaohua Li
@ 2011-03-22  7:36                       ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-03-22  7:36 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Vivek Goyal, linux-kernel, hch, jmoyer

On 2011-03-22 01:32, Shaohua Li wrote:
> On Mon, 2011-03-21 at 17:20 +0800, Jens Axboe wrote:
>> On 2011-03-21 07:52, Shaohua Li wrote:
>>>> Gave it a quick test spin, as suspected it had a few issues. This one
>>>> seems to work. Can you toss it through that workload and see if it fares
>>>> better?
>>> yes, this fully restores the regression I saw. But I have accounting
>>> issue:
>>
>> Great!
>>
>>> 1. The merged request is already accounted when it's added into plug
>>> list
>>
>> Good catch. I've updated the patch and merged it now, integrating this
>> accounting fix.
>>
>>> 2. drive_stat_acct() is called without any protection in
>>> __make_request(). So there is race for in_flight accounting. The race
>>> exists after stack plug is added, so not because of this issue.
>>> Below is the extra patch I need to do the test.
>>
>> Looks fine. Can I add your signed-off-by to this patch? I'll merge it as
>> a separate fix.
> Sure.
> Signed-off-by: Shaohua Li<shaohua.li@intel.com>

Thanks, patch has been added now.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-03-10  0:58               ` Mike Snitzer
@ 2011-04-05  3:05                 ` NeilBrown
  2011-04-11  4:50                   ` NeilBrown
  0 siblings, 1 reply; 152+ messages in thread
From: NeilBrown @ 2011-04-05  3:05 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-kernel, hch, dm-devel

On Wed, 9 Mar 2011 19:58:10 -0500 Mike Snitzer <snitzer@redhat.com> wrote:

> Also, in your MD changes, you removed all calls to md_unplug() but
> didn't remove md_unplug().  Seems it should be removed along with the
> 'plug' member of 'struct mddev_t'?  Neil?

I've been distracted by other things and only just managed to have a look at
this.

The new plugging code seems to completely ignore the needs of stacked devices
- or at least my needs in md.

For RAID1 with a write-intent-bitmap, I queue all write requests and then on
an unplug I update the write-intent-bitmap to mark all the relevant blocks
and then release the writes.

With the new code there is no way for an unplug event to wake up the raid1d
thread to start the writeout - I haven't tested it but I suspect it will just
hang.

Similarly for RAID5 I gather write bios (long before they become 'struct
request' which is what the plugging code understands) and on an unplug event
I release the writes - hopefully with enough bios per stripe so that we don't
need to pre-read.

Possibly the simplest fix would be to have a second list_head in 'struct
blk_plug' which contained callbacks (a function pointer and a list_head in
a struct which is passed as an arg to the function!).
blk_finish_plug could then walk the list and call the callbacks.
It would be quite easy to hook into that.


I suspect I also need to add blk_start_plug/blk_finish_plug around the loop
in raid1d/raid5d/raid10d, but that is pretty straight forward.

Am I missing something important?
Is there a better way to get an unplug event to md?

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-05  3:05                 ` NeilBrown
@ 2011-04-11  4:50                   ` NeilBrown
  2011-04-11  9:19                     ` Jens Axboe
  2011-04-11 16:59                     ` hch
  0 siblings, 2 replies; 152+ messages in thread
From: NeilBrown @ 2011-04-11  4:50 UTC (permalink / raw)
  To: Mike Snitzer, Jens Axboe; +Cc: linux-kernel, hch, dm-devel, linux-raid

On Tue, 5 Apr 2011 13:05:41 +1000 NeilBrown <neilb@suse.de> wrote:

> On Wed, 9 Mar 2011 19:58:10 -0500 Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > Also, in your MD changes, you removed all calls to md_unplug() but
> > didn't remove md_unplug().  Seems it should be removed along with the
> > 'plug' member of 'struct mddev_t'?  Neil?
> 
> I've been distracted by other things and only just managed to have a look at
> this.
> 
> The new plugging code seems to completely ignore the needs of stacked devices
> - or at least my needs in md.
> 
> For RAID1 with a write-intent-bitmap, I queue all write requests and then on
> an unplug I update the write-intent-bitmap to mark all the relevant blocks
> and then release the writes.
> 
> With the new code there is no way for an unplug event to wake up the raid1d
> thread to start the writeout - I haven't tested it but I suspect it will just
> hang.
> 
> Similarly for RAID5 I gather write bios (long before they become 'struct
> request' which is what the plugging code understands) and on an unplug event
> I release the writes - hopefully with enough bios per stripe so that we don't
> need to pre-read.
> 
> Possibly the simplest fix would be to have a second list_head in 'struct
> blk_plug' which contained callbacks (a function pointer a list_head in a
> struct which is passed as an arg to the function!).
> blk_finish_plug could then walk the list and call the call-backs.
> It would be quite easy to hook into that.

I've implemented this and it seems to work.
Jens:  could you please review and hopefully ack the patch below, and let
me know if you will submit it or should I?

My testing of this, combined with some other patches which make various md
personalities use it, turns up a bug somewhere.

The symptoms are crashes in various places in blk-core and sometimes
elevator.c; list_sort() appears fairly often in the stack traces, but not
always.

This patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 273d60b..903ce8d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug)
 	struct request_queue *q;
 	unsigned long flags;
 	struct request *rq;
+	struct list_head head;
 
 	BUG_ON(plug->magic != PLUG_MAGIC);
 
 	if (list_empty(&plug->list))
 		return;
+	list_add(&head, &plug->list);
+	list_del_init(&plug->list);
 
 	if (plug->should_sort)
-		list_sort(NULL, &plug->list, plug_rq_cmp);
+		list_sort(NULL, &head, plug_rq_cmp);
+	plug->should_sort = 0;
 
 	q = NULL;
 	local_irq_save(flags);
-	while (!list_empty(&plug->list)) {
-		rq = list_entry_rq(plug->list.next);
+	while (!list_empty(&head)) {
+		rq = list_entry_rq(head.next);
 		list_del_init(&rq->queuelist);
 		BUG_ON(!(rq->cmd_flags & REQ_ON_PLUG));
 		BUG_ON(!rq->q);


makes the symptom go away.  It simply moves the plug list onto a separate
list head before sorting and processing it.
My test was simply writing to a RAID1 with dd:
  while true; do dd if=/dev/zero of=/dev/md0 bs=4k; done

Obviously all writes go to two devices so the plug list will always need
sorting.

The only explanation I can come up with is that, very occasionally,
schedule() on two separate CPUs calls blk_flush_plug() for the same task.
I don't understand the scheduler nearly well enough to know if or how that
can happen. However, with this patch in place I can write to a RAID1
constantly for half an hour, while without it the write rarely lasts 3
minutes.

If you want to reproduce my experiment, you can pull from
  git://neil.brown.name/md plug-test
to get my patches for plugging in md (which are not quite ready for
submission but seem to work), create a RAID1 using e.g.
   mdadm -C /dev/md0 --level=1 --raid-disks=2 /dev/device1 /dev/device2
   while true; do dd if=/dev/zero of=/dev/md0 bs=4K ; done


Thanks,
NeilBrown



>From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Thu, 7 Apr 2011 13:16:59 +1000
Subject: [PATCH] Enhance new plugging support to support general callbacks.

md/raid requires an unplug callback, but as it does not use
requests the current code cannot provide one.

So allow arbitrary callbacks to be attached to the blk_plug.

Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 block/blk-core.c       |   13 +++++++++++++
 include/linux/blkdev.h |    7 ++++++-
 2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 725091d..273d60b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug)
 
 	plug->magic = PLUG_MAGIC;
 	INIT_LIST_HEAD(&plug->list);
+	INIT_LIST_HEAD(&plug->cb_list);
 	plug->should_sort = 0;
 
 	/*
@@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug)
 	local_irq_restore(flags);
 }
 
+static void flush_plug_callbacks(struct blk_plug *plug)
+{
+	while (!list_empty(&plug->cb_list)) {
+		struct blk_plug_cb *cb = list_first_entry(&plug->cb_list,
+							  struct blk_plug_cb,
+							  list);
+		list_del(&cb->list);
+		cb->callback(cb);
+	}
+}
+
 static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
 {
 	flush_plug_list(plug);
+	flush_plug_callbacks(plug);
 
 	if (plug == tsk->plug)
 		tsk->plug = NULL;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32176cc..3e5e604 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *);
 struct blk_plug {
 	unsigned long magic;
 	struct list_head list;
+	struct list_head cb_list;
 	unsigned int should_sort;
 };
+struct blk_plug_cb {
+	struct list_head list;
+	void (*callback)(struct blk_plug_cb *);
+};
 
 extern void blk_start_plug(struct blk_plug *);
 extern void blk_finish_plug(struct blk_plug *);
@@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 {
 	struct blk_plug *plug = tsk->plug;
 
-	return plug && !list_empty(&plug->list);
+	return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
 }
 
 /*
-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11  4:50                   ` NeilBrown
@ 2011-04-11  9:19                     ` Jens Axboe
  2011-04-11 10:59                       ` NeilBrown
  2011-04-11 16:59                     ` hch
  1 sibling, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-04-11  9:19 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On 2011-04-11 06:50, NeilBrown wrote:
> On Tue, 5 Apr 2011 13:05:41 +1000 NeilBrown <neilb@suse.de> wrote:
> 
>> On Wed, 9 Mar 2011 19:58:10 -0500 Mike Snitzer <snitzer@redhat.com> wrote:
>>
>>> Also, in your MD changes, you removed all calls to md_unplug() but
>>> didn't remove md_unplug().  Seems it should be removed along with the
>>> 'plug' member of 'struct mddev_t'?  Neil?
>>
>> I've been distracted by other things and only just managed to have a look at
>> this.
>>
>> The new plugging code seems to completely ignore the needs of stacked devices
>> - or at least my needs in md.
>>
>> For RAID1 with a write-intent-bitmap, I queue all write requests and then on
>> an unplug I update the write-intent-bitmap to mark all the relevant blocks
>> and then release the writes.
>>
>> With the new code there is no way for an unplug event to wake up the raid1d
>> thread to start the writeout - I haven't tested it but I suspect it will just
>> hang.
>>
>> Similarly for RAID5 I gather write bios (long before they become 'struct
>> request' which is what the plugging code understands) and on an unplug event
>> I release the writes - hopefully with enough bios per stripe so that we don't
>> need to pre-read.
>>
>> Possibly the simplest fix would be to have a second list_head in 'struct
>> blk_plug' which contained callbacks (a function pointer a list_head in a
>> struct which is passed as an arg to the function!).
>> blk_finish_plug could then walk the list and call the call-backs.
>> It would be quite easy to hook into that.
> 
> I've implemented this and it seems to work.
> Jens:  could you please review and hopefully ack the patch below, and let
> me know if you will submit it or should I?
> 
> My testing of this combined with some other patches which cause various md
> personalities to use it shows up a bug somewhere.
> 
> The symptoms are crashes in various places in blk-core and sometimes
> elevator.c
> list_sort occurs fairly often included in the stack but not always.
> 
> This patch
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 273d60b..903ce8d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug)
>  	struct request_queue *q;
>  	unsigned long flags;
>  	struct request *rq;
> +	struct list_head head;
>  
>  	BUG_ON(plug->magic != PLUG_MAGIC);
>  
>  	if (list_empty(&plug->list))
>  		return;
> +	list_add(&head, &plug->list);
> +	list_del_init(&plug->list);
>  
>  	if (plug->should_sort)
> -		list_sort(NULL, &plug->list, plug_rq_cmp);
> +		list_sort(NULL, &head, plug_rq_cmp);
> +	plug->should_sort = 0;
>  
>  	q = NULL;
>  	local_irq_save(flags);
> -	while (!list_empty(&plug->list)) {
> -		rq = list_entry_rq(plug->list.next);
> +	while (!list_empty(&head)) {
> +		rq = list_entry_rq(head.next);
>  		list_del_init(&rq->queuelist);
>  		BUG_ON(!(rq->cmd_flags & REQ_ON_PLUG));
>  		BUG_ON(!rq->q);
> 
> 
> makes the symptom go away.  It simply moves the plug list onto a separate
> list head before sorting and processing it.
> My test was simply writing to a RAID1 with dd:
>   while true; do dd if=/dev/zero of=/dev/md0 size=4k; done
> 
> Obviously all writes go to two devices so the plug list will always need
> sorting.
> 
> The only explanation I can come up with is that very occasionally schedule on
> 2 separate cpus calls blk_flush_plug for the same task.  I don't understand
> the scheduler nearly well enough to know if or how that can happen.
> However with this patch in place I can write to a RAID1 constantly for half
> an hour, and without it, the write rarely lasts for 3 minutes.

Or perhaps if the request_fn blocks, that would be problematic. So the
patch is likely a good idea even for that case.

I'll merge it, changing it to list_splice_init() as I think that would
be more clear.

> From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@suse.de>
> Date: Thu, 7 Apr 2011 13:16:59 +1000
> Subject: [PATCH] Enhance new plugging support to support general callbacks.
> 
> md/raid requires an unplug callback, but as it does not uses
> requests the current code cannot provide one.
> 
> So allow arbitrary callbacks to be attached to the blk_plug.
> 
> Cc: Jens Axboe <jaxboe@fusionio.com>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  block/blk-core.c       |   13 +++++++++++++
>  include/linux/blkdev.h |    7 ++++++-
>  2 files changed, 19 insertions(+), 1 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 725091d..273d60b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug)
>  
>  	plug->magic = PLUG_MAGIC;
>  	INIT_LIST_HEAD(&plug->list);
> +	INIT_LIST_HEAD(&plug->cb_list);
>  	plug->should_sort = 0;
>  
>  	/*
> @@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug)
>  	local_irq_restore(flags);
>  }
>  
> +static void flush_plug_callbacks(struct blk_plug *plug)
> +{
> +	while (!list_empty(&plug->cb_list)) {
> +		struct blk_plug_cb *cb = list_first_entry(&plug->cb_list,
> +							  struct blk_plug_cb,
> +							  list);
> +		list_del(&cb->list);
> +		cb->callback(cb);
> +	}
> +}
> +
>  static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
>  {
>  	flush_plug_list(plug);
> +	flush_plug_callbacks(plug);
>  
>  	if (plug == tsk->plug)
>  		tsk->plug = NULL;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 32176cc..3e5e604 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *);
>  struct blk_plug {
>  	unsigned long magic;
>  	struct list_head list;
> +	struct list_head cb_list;
>  	unsigned int should_sort;
>  };
> +struct blk_plug_cb {
> +	struct list_head list;
> +	void (*callback)(struct blk_plug_cb *);
> +};
>  
>  extern void blk_start_plug(struct blk_plug *);
>  extern void blk_finish_plug(struct blk_plug *);
> @@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
>  {
>  	struct blk_plug *plug = tsk->plug;
>  
> -	return plug && !list_empty(&plug->list);
> +	return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
>  }
>  
>  /*

Maybe I'm missing something, but why do you need those callbacks? If
it's to use plugging yourself, perhaps we can just ensure that those
don't get assigned in the task - so it would have to be used with care.

It's not that I disagree with these callbacks, I just want to ensure I
understand why you need them.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11  9:19                     ` Jens Axboe
@ 2011-04-11 10:59                       ` NeilBrown
  2011-04-11 11:04                         ` Jens Axboe
  2011-04-11 11:55                         ` NeilBrown
  0 siblings, 2 replies; 152+ messages in thread
From: NeilBrown @ 2011-04-11 10:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2011-04-11 06:50, NeilBrown wrote:

> > The only explanation I can come up with is that very occasionally schedule on
> > 2 separate cpus calls blk_flush_plug for the same task.  I don't understand
> > the scheduler nearly well enough to know if or how that can happen.
> > However with this patch in place I can write to a RAID1 constantly for half
> > an hour, and without it, the write rarely lasts for 3 minutes.
> 
> Or perhaps if the request_fn blocks, that would be problematic. So the
> patch is likely a good idea even for that case.
> 
> I'll merge it, changing it to list_splice_init() as I think that would
> be more clear.

OK - though I'm not 100% sure the patch fixes the problem - just that it
hides the symptom for me.
I might try instrumenting the code a bit more and see if I can find exactly
where it is re-entering flush_plug_list - as that seems to be what is
happening.

And yeah - list_split_init is probably better.  I just never remember exactly
what list_split means and have to look it up every time, where as
list_add/list_del are very clear to me.


> 
> > From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001
> > From: NeilBrown <neilb@suse.de>
> > Date: Thu, 7 Apr 2011 13:16:59 +1000
> > Subject: [PATCH] Enhance new plugging support to support general callbacks.
> > 
> > md/raid requires an unplug callback, but as it does not use
> > requests, the current code cannot provide one.
> > 
> > So allow arbitrary callbacks to be attached to the blk_plug.
> > 
> > Cc: Jens Axboe <jaxboe@fusionio.com>
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  block/blk-core.c       |   13 +++++++++++++
> >  include/linux/blkdev.h |    7 ++++++-
> >  2 files changed, 19 insertions(+), 1 deletions(-)
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 725091d..273d60b 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug)
> >  
> >  	plug->magic = PLUG_MAGIC;
> >  	INIT_LIST_HEAD(&plug->list);
> > +	INIT_LIST_HEAD(&plug->cb_list);
> >  	plug->should_sort = 0;
> >  
> >  	/*
> > @@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug)
> >  	local_irq_restore(flags);
> >  }
> >  
> > +static void flush_plug_callbacks(struct blk_plug *plug)
> > +{
> > +	while (!list_empty(&plug->cb_list)) {
> > +		struct blk_plug_cb *cb = list_first_entry(&plug->cb_list,
> > +							  struct blk_plug_cb,
> > +							  list);
> > +		list_del(&cb->list);
> > +		cb->callback(cb);
> > +	}
> > +}
> > +
> >  static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
> >  {
> >  	flush_plug_list(plug);
> > +	flush_plug_callbacks(plug);
> >  
> >  	if (plug == tsk->plug)
> >  		tsk->plug = NULL;
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 32176cc..3e5e604 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *);
> >  struct blk_plug {
> >  	unsigned long magic;
> >  	struct list_head list;
> > +	struct list_head cb_list;
> >  	unsigned int should_sort;
> >  };
> > +struct blk_plug_cb {
> > +	struct list_head list;
> > +	void (*callback)(struct blk_plug_cb *);
> > +};
> >  
> >  extern void blk_start_plug(struct blk_plug *);
> >  extern void blk_finish_plug(struct blk_plug *);
> > @@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
> >  {
> >  	struct blk_plug *plug = tsk->plug;
> >  
> > -	return plug && !list_empty(&plug->list);
> > +	return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
> >  }
> >  
> >  /*
> 
> Maybe I'm missing something, but why do you need those callbacks? If
> it's to use plugging yourself, perhaps we can just ensure that those
> don't get assigned in the task - so it would have to be used with care.
> 
> It's not that I disagree with these callbacks, I just want to ensure I
> understand why you need them.
> 

I'm sure one of us is missing something (probably both) but I'm not sure what.

The callback is central.

It is simply to use plugging in md.
Just like blk-core, md will notice that a blk_plug is active and will put
requests aside.  I then need something to call in to md when blk_finish_plug
is called so that put-aside requests can be released.
As md can be built as a module, that call must be a call-back of some sort.
blk-core doesn't need to register blk_plug_flush because that is never in a
module, so it can be called directly.  But the md equivalent could be in a
module, so I need to be able to register a call back.

Does that help? 

Thanks,
NeilBrown




* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 10:59                       ` NeilBrown
@ 2011-04-11 11:04                         ` Jens Axboe
  2011-04-11 11:26                           ` NeilBrown
  2011-04-11 11:55                         ` NeilBrown
  1 sibling, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-04-11 11:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On 2011-04-11 12:59, NeilBrown wrote:
> On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
> 
>> On 2011-04-11 06:50, NeilBrown wrote:
> 
>>> The only explanation I can come up with is that very occasionally schedule on
>>> 2 separate cpus calls blk_flush_plug for the same task.  I don't understand
>>> the scheduler nearly well enough to know if or how that can happen.
>>> However with this patch in place I can write to a RAID1 constantly for half
>>> an hour, and without it, the write rarely lasts for 3 minutes.
>>
>> Or perhaps if the request_fn blocks, that would be problematic. So the
>> patch is likely a good idea even for that case.
>>
>> I'll merge it, changing it to list_splice_init() as I think that would
>> be more clear.
> 
> OK - though I'm not 100% sure the patch fixes the problem - just that it
> hides the symptom for me.
> I might try instrumenting the code a bit more and see if I can find exactly
> where it is re-entering flush_plug_list - as that seems to be what is
> happening.

It's definitely a good thing to add, to avoid the list fudging on
schedule. Whether it's your exact problem, I can't tell.

> And yeah - list_split_init is probably better.  I just never remember exactly
> what list_split means and have to look it up every time, where as
> list_add/list_del are very clear to me.

splice, no split :-)

>>> From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001
>>> From: NeilBrown <neilb@suse.de>
>>> Date: Thu, 7 Apr 2011 13:16:59 +1000
>>> Subject: [PATCH] Enhance new plugging support to support general callbacks.
>>>
>>> md/raid requires an unplug callback, but as it does not use
>>> requests, the current code cannot provide one.
>>>
>>> So allow arbitrary callbacks to be attached to the blk_plug.
>>>
>>> Cc: Jens Axboe <jaxboe@fusionio.com>
>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>> ---
>>>  block/blk-core.c       |   13 +++++++++++++
>>>  include/linux/blkdev.h |    7 ++++++-
>>>  2 files changed, 19 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 725091d..273d60b 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug)
>>>  
>>>  	plug->magic = PLUG_MAGIC;
>>>  	INIT_LIST_HEAD(&plug->list);
>>> +	INIT_LIST_HEAD(&plug->cb_list);
>>>  	plug->should_sort = 0;
>>>  
>>>  	/*
>>> @@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug)
>>>  	local_irq_restore(flags);
>>>  }
>>>  
>>> +static void flush_plug_callbacks(struct blk_plug *plug)
>>> +{
>>> +	while (!list_empty(&plug->cb_list)) {
>>> +		struct blk_plug_cb *cb = list_first_entry(&plug->cb_list,
>>> +							  struct blk_plug_cb,
>>> +							  list);
>>> +		list_del(&cb->list);
>>> +		cb->callback(cb);
>>> +	}
>>> +}
>>> +
>>>  static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
>>>  {
>>>  	flush_plug_list(plug);
>>> +	flush_plug_callbacks(plug);
>>>  
>>>  	if (plug == tsk->plug)
>>>  		tsk->plug = NULL;
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index 32176cc..3e5e604 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *);
>>>  struct blk_plug {
>>>  	unsigned long magic;
>>>  	struct list_head list;
>>> +	struct list_head cb_list;
>>>  	unsigned int should_sort;
>>>  };
>>> +struct blk_plug_cb {
>>> +	struct list_head list;
>>> +	void (*callback)(struct blk_plug_cb *);
>>> +};
>>>  
>>>  extern void blk_start_plug(struct blk_plug *);
>>>  extern void blk_finish_plug(struct blk_plug *);
>>> @@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
>>>  {
>>>  	struct blk_plug *plug = tsk->plug;
>>>  
>>> -	return plug && !list_empty(&plug->list);
>>> +	return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
>>>  }
>>>  
>>>  /*
>>
>> Maybe I'm missing something, but why do you need those callbacks? If
>> it's to use plugging yourself, perhaps we can just ensure that those
>> don't get assigned in the task - so it would have to be used with care.
>>
>> It's not that I disagree with these callbacks, I just want to ensure I
>> understand why you need them.
>>
> 
> I'm sure one of us is missing something (probably both) but I'm not
> sure what.
> 
> The callback is central.
> 
> It is simply to use plugging in md.
> Just like blk-core, md will notice that a blk_plug is active and will put
> requests aside.  I then need something to call in to md when blk_finish_plug

But this is done in __make_request(), so md devices should not be
affected at all. This is the part of your explanation that I do not
connect with the code.

If md itself is putting things on the plug list, why is it doing that?

> is called so that put-aside requests can be released.
> As md can be built as a module, that call must be a call-back of some sort.
> blk-core doesn't need to register blk_plug_flush because that is never in a
> module, so it can be called directly.  But the md equivalent could be in a
> module, so I need to be able to register a call back.
> 
> Does that help? 

Not really. Is the problem that _you_ would like to stash things aside,
not the fact that __make_request() puts things on a task plug list?

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 11:04                         ` Jens Axboe
@ 2011-04-11 11:26                           ` NeilBrown
  2011-04-11 11:37                             ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: NeilBrown @ 2011-04-11 11:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On Mon, 11 Apr 2011 13:04:26 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

> > 
> > I'm sure one of us is missing something (probably both) but I'm not
> > sure what.
> > 
> > The callback is central.
> > 
> > It is simply to use plugging in md.
> > Just like blk-core, md will notice that a blk_plug is active and will put
> > requests aside.  I then need something to call in to md when blk_finish_plug
> 
> But this is done in __make_request(), so md devices should not be
> affected at all. This is the part of your explanation that I do not
> connect with the code.
> 
> If md itself is putting things on the plug list, why is it doing that?

Yes.  Exactly.  md itself wants to put things aside on some list.
e.g. in RAID1 when using a write-intent bitmap I want to gather as many write
requests as possible so I can update the bits for all of them at once.
So when a plug is in effect I just queue the bios somewhere and record the
bits that need to be set.
Then when the unplug happens I write out the bitmap updates in a single write
and when that completes, I write out the data (to all devices).

Also in RAID5 it is good if I can wait for lots of write requests to arrive
before committing any of them to increase the possibility of getting a
full-stripe write.

Previously I used ->unplug_fn to release the queued requests.  Now that has
gone I need a different way to register a callback when an unplug happens.

> 
> > is called so that put-aside requests can be released.
> > As md can be built as a module, that call must be a call-back of some sort.
> > blk-core doesn't need to register blk_plug_flush because that is never in a
> > module, so it can be called directly.  But the md equivalent could be in a
> > module, so I need to be able to register a call back.
> > 
> > Does that help? 
> 
> Not really. Is the problem that _you_ would like to stash things aside,
> not the fact that __make_request() puts things on a task plug list?
> 

Yes, exactly.  I (in md) want to stash things aside.

(I don't actually put the stashed things on the blk_plug, though it might
make sense to do that later in some cases - I'm not sure.  Currently I stash
things in my own internal lists and just need a call back to say "ok, flush
those lists now").

Thanks,
NeilBrown



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 11:26                           ` NeilBrown
@ 2011-04-11 11:37                             ` Jens Axboe
  2011-04-11 12:05                               ` NeilBrown
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-04-11 11:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On 2011-04-11 13:26, NeilBrown wrote:
> On Mon, 11 Apr 2011 13:04:26 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
> 
>>>
>>> I'm sure one of us is missing something (probably both) but I'm not
>>> sure what.
>>>
>>> The callback is central.
>>>
>>> It is simply to use plugging in md.
>>> Just like blk-core, md will notice that a blk_plug is active and will put
>>> requests aside.  I then need something to call in to md when blk_finish_plug
>>
>> But this is done in __make_request(), so md devices should not be
>> affected at all. This is the part of your explanation that I do not
>> connect with the code.
>>
>> If md itself is putting things on the plug list, why is it doing that?
> 
> Yes.  Exactly.  md itself wants to put things aside on some list.
> e.g. in RAID1 when using a write-intent bitmap I want to gather as many write
> requests as possible so I can update the bits for all of them at once.
> So when a plug is in effect I just queue the bios somewhere and record the
> bits that need to be set.
> Then when the unplug happens I write out the bitmap updates in a single write
> and when that completes, I write out the data (to all devices).
> 
> Also in RAID5 it is good if I can wait for lots of write requests to arrive
> before committing any of them to increase the possibility of getting a
> full-stripe write.
> 
> Previously I used ->unplug_fn to release the queued requests.  Now that has
> gone I need a different way to register a callback when an unplug happens.

Ah, so this is what I was hinting at. But why use the task->plug for
that? Seems a bit counter intuitive. Why can't you just store these
internally?

> 
>>
>>> is called so that put-aside requests can be released.
>>> As md can be built as a module, that call must be a call-back of some sort.
>>> blk-core doesn't need to register blk_plug_flush because that is never in a
>>> module, so it can be called directly.  But the md equivalent could be in a
>>> module, so I need to be able to register a call back.
>>>
>>> Does that help? 
>>
>> Not really. Is the problem that _you_ would like to stash things aside,
>> not the fact that __make_request() puts things on a task plug list?
>>
> 
> Yes, exactly.  I (in md) want to stash things aside.
> 
> (I don't actually put the stashed things on the blk_plug, though it might
> make sense to do that later in some cases - I'm not sure.  Currently I stash
> things in my own internal lists and just need a call back to say "ok, flush
> those lists now").

So we are making some progress... The thing I then don't understand is
why you want to make it associated with the plug? Seems you don't have
any scheduling restrictions, in which case just storing them in md
seems like a much better option.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 10:59                       ` NeilBrown
  2011-04-11 11:04                         ` Jens Axboe
@ 2011-04-11 11:55                         ` NeilBrown
  2011-04-11 12:12                           ` Jens Axboe
  1 sibling, 1 reply; 152+ messages in thread
From: NeilBrown @ 2011-04-11 11:55 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On Mon, 11 Apr 2011 20:59:28 +1000 NeilBrown <neilb@suse.de> wrote:

> On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
> 
> > On 2011-04-11 06:50, NeilBrown wrote:
> 
> > > The only explanation I can come up with is that very occasionally schedule on
> > > 2 separate cpus calls blk_flush_plug for the same task.  I don't understand
> > > the scheduler nearly well enough to know if or how that can happen.
> > > However with this patch in place I can write to a RAID1 constantly for half
> > > an hour, and without it, the write rarely lasts for 3 minutes.
> > 
> > Or perhaps if the request_fn blocks, that would be problematic. So the
> > patch is likely a good idea even for that case.
> > 
> > I'll merge it, changing it to list_splice_init() as I think that would
> > be more clear.
> 
> OK - though I'm not 100% the patch fixes the problem - just that it hides the
> symptom for me.
> I might try instrumenting the code a bit more and see if I can find exactly
> where it is re-entering flush_plug_list - as that seems to be what is
> happening.

OK, I found how it re-enters.

The request_fn doesn't exactly block, but when scsi_request_fn calls
spin_unlock_irq, this calls preempt_enable which can call schedule, which is
a recursive call.

The patch I provided will stop that from recursing again as the blk_plug.list
will be empty.

So it is almost what you suggested, however the request_fn doesn't block, it
just enables preemption.


So the comment I would put at the top of that patch would be something like:


From: NeilBrown <neilb@suse.de>

As request_fn called by __blk_run_queue is allowed to 'schedule()' (after
dropping the queue lock of course), it is possible to get a recursive call:

 schedule -> blk_flush_plug -> __blk_finish_plug -> flush_plug_list
      -> __blk_run_queue -> request_fn -> schedule

We must make sure that the second schedule does not call into blk_flush_plug
again.  So instead of leaving the list of requests on blk_plug->list, move
them to a separate list leaving blk_plug->list empty.

Signed-off-by: NeilBrown <neilb@suse.de>

Thanks,
NeilBrown


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 11:37                             ` Jens Axboe
@ 2011-04-11 12:05                               ` NeilBrown
  2011-04-11 12:11                                 ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: NeilBrown @ 2011-04-11 12:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On Mon, 11 Apr 2011 13:37:20 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2011-04-11 13:26, NeilBrown wrote:
> > On Mon, 11 Apr 2011 13:04:26 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
> > 
> >>>
> >>> I'm sure one of us is missing something (probably both) but I'm not
> >>> sure what.
> >>>
> >>> The callback is central.
> >>>
> >>> It is simply to use plugging in md.
> >>> Just like blk-core, md will notice that a blk_plug is active and will put
> >>> requests aside.  I then need something to call in to md when blk_finish_plug
> >>
> >> But this is done in __make_request(), so md devices should not be
> >> affected at all. This is the part of your explanation that I do not
> >> connect with the code.
> >>
> >> If md itself is putting things on the plug list, why is it doing that?
> > 
> > Yes.  Exactly.  md itself wants to put things aside on some list.
> > e.g. in RAID1 when using a write-intent bitmap I want to gather as many write
> > requests as possible so I can update the bits for all of them at once.
> > So when a plug is in effect I just queue the bios somewhere and record the
> > bits that need to be set.
> > Then when the unplug happens I write out the bitmap updates in a single write
> > and when that completes, I write out the data (to all devices).
> > 
> > Also in RAID5 it is good if I can wait for lots of write requests to arrive
> > before committing any of them to increase the possibility of getting a
> > full-stripe write.
> > 
> > Previously I used ->unplug_fn to release the queued requests.  Now that has
> > gone I need a different way to register a callback when an unplug happens.
> 
> Ah, so this is what I was hinting at. But why use the task->plug for
> that? Seems a bit counter intuitive. Why can't you just store these
> internally?
> 
> > 
> >>
> >>> is called so that put-aside requests can be released.
> >>> As md can be built as a module, that call must be a call-back of some sort.
> >>> blk-core doesn't need to register blk_plug_flush because that is never in a
> >>> module, so it can be called directly.  But the md equivalent could be in a
> >>> module, so I need to be able to register a call back.
> >>>
> >>> Does that help? 
> >>
> >> Not really. Is the problem that _you_ would like to stash things aside,
> >> not the fact that __make_request() puts things on a task plug list?
> >>
> > 
> > Yes, exactly.  I (in md) want to stash things aside.
> > 
> > (I don't actually put the stashed things on the blk_plug, though it might
> > make sense to do that later in some cases - I'm not sure.  Currently I stash
> > things in my own internal lists and just need a call back to say "ok, flush
> > those lists now").
> 
> So we are making some progress... The thing I then don't understand is
> why you want to make it associated with the plug? Seems you don't have
> any scheduling restrictions, in which case just storing them in md
> seems like a much better option.
> 

Yes.  But I need to know when to release the requests that I have stored.
I need to know when ->write_pages or ->read_pages or whatever has finished
submitting a pile of pages so that I can start processing the request that I
have put aside.  So I need a callback from blk_finish_plug.

(and I also need to know if a thread that was plugging schedules for the same
reason that you do).

NeilBrown




* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 12:05                               ` NeilBrown
@ 2011-04-11 12:11                                 ` Jens Axboe
  2011-04-11 12:36                                   ` NeilBrown
                                                     ` (2 more replies)
  0 siblings, 3 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-11 12:11 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On 2011-04-11 14:05, NeilBrown wrote:
> On Mon, 11 Apr 2011 13:37:20 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
> 
>> On 2011-04-11 13:26, NeilBrown wrote:
>>> On Mon, 11 Apr 2011 13:04:26 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
>>>
>>>>>
>>>>> I'm sure one of us is missing something (probably both) but I'm not
>>>>> sure what.
>>>>>
>>>>> The callback is central.
>>>>>
>>>>> It is simply to use plugging in md.
>>>>> Just like blk-core, md will notice that a blk_plug is active and will put
>>>>> requests aside.  I then need something to call in to md when blk_finish_plug
>>>>
>>>> But this is done in __make_request(), so md devices should not be
>>>> affected at all. This is the part of your explanation that I do not
>>>> connect with the code.
>>>>
>>>> If md itself is putting things on the plug list, why is it doing that?
>>>
>>> Yes.  Exactly.  md itself wants to put things aside on some list.
>>> e.g. in RAID1 when using a write-intent bitmap I want to gather as many write
>>> requests as possible so I can update the bits for all of them at once.
>>> So when a plug is in effect I just queue the bios somewhere and record the
>>> bits that need to be set.
>>> Then when the unplug happens I write out the bitmap updates in a single write
>>> and when that completes, I write out the data (to all devices).
>>>
>>> Also in RAID5 it is good if I can wait for lots of write requests to arrive
>>> before committing any of them to increase the possibility of getting a
>>> full-stripe write.
>>>
>>> Previously I used ->unplug_fn to release the queued requests.  Now that has
>>> gone I need a different way to register a callback when an unplug happens.
>>
>> Ah, so this is what I was hinting at. But why use the task->plug for
>> that? Seems a bit counter intuitive. Why can't you just store these
>> internally?
>>
>>>
>>>>
>>>>> is called so that put-aside requests can be released.
>>>>> As md can be built as a module, that call must be a call-back of some sort.
>>>>> blk-core doesn't need to register blk_plug_flush because that is never in a
>>>>> module, so it can be called directly.  But the md equivalent could be in a
>>>>> module, so I need to be able to register a call back.
>>>>>
>>>>> Does that help? 
>>>>
>>>> Not really. Is the problem that _you_ would like to stash things aside,
>>>> not the fact that __make_request() puts things on a task plug list?
>>>>
>>>
>>> Yes, exactly.  I (in md) want to stash things aside.
>>>
>>> (I don't actually put the stashed things on the blk_plug, though it might
>>> make sense to do that later in some cases - I'm not sure.  Currently I stash
>>> things in my own internal lists and just need a call back to say "ok, flush
>>> those lists now").
>>
>> So we are making some progress... The thing I then don't understand is
>> why you want to make it associated with the plug? Seems you don't have
>> any scheduling restrictions, in which case just storing them in md
>> seems like a much better option.
>>
> 
> Yes.  But I need to know when to release the requests that I have stored.
> I need to know when ->write_pages or ->read_pages or whatever has finished
> submitting a pile of pages so that I can start processing the request that I
> have put aside.  So I need a callback from blk_finish_plug.

OK fair enough, I'll add your callback patch.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 11:55                         ` NeilBrown
@ 2011-04-11 12:12                           ` Jens Axboe
  2011-04-11 22:58                             ` hch
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-04-11 12:12 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On 2011-04-11 13:55, NeilBrown wrote:
> On Mon, 11 Apr 2011 20:59:28 +1000 NeilBrown <neilb@suse.de> wrote:
> 
>> On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
>>
>>> On 2011-04-11 06:50, NeilBrown wrote:
>>
>>>> The only explanation I can come up with is that very occasionally schedule on
>>>> 2 separate cpus calls blk_flush_plug for the same task.  I don't understand
>>>> the scheduler nearly well enough to know if or how that can happen.
>>>> However with this patch in place I can write to a RAID1 constantly for half
>>>> an hour, and without it, the write rarely lasts for 3 minutes.
>>>
>>> Or perhaps if the request_fn blocks, that would be problematic. So the
>>> patch is likely a good idea even for that case.
>>>
>>> I'll merge it, changing it to list_splice_init() as I think that would
>>> be more clear.
>>
>> OK - though I'm not 100% sure the patch fixes the problem - just that it
>> hides the symptom for me.
>> I might try instrumenting the code a bit more and see if I can find exactly
>> where it is re-entering flush_plug_list - as that seems to be what is
>> happening.
> 
> OK, I found how it re-enters.
> 
> The request_fn doesn't exactly block, but when scsi_request_fn calls
> spin_unlock_irq, this calls preempt_enable which can call schedule, which is
> a recursive call.
> 
> The patch I provided will stop that from recursing again as the blk_plug.list
> will be empty.
> 
> So it is almost what you suggested, however the request_fn doesn't block, it
> just enables preemption.
> 
> 
> So the comment I would put at the top of that patch would be something like:

Ah, so it was pretty close. That does explain it. I've already queued up
the patch, I'll amend the commit message.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 12:11                                 ` Jens Axboe
@ 2011-04-11 12:36                                   ` NeilBrown
  2011-04-11 12:48                                     ` Jens Axboe
  2011-04-15  4:26                                   ` hch
  2011-04-17 22:19                                   ` NeilBrown
  2 siblings, 1 reply; 152+ messages in thread
From: NeilBrown @ 2011-04-11 12:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

> > Yes.  But I need to know when to release the requests that I have stored.
> > I need to know when ->write_pages or ->read_pages or whatever has finished
> > submitting a pile of pages so that I can start processing the request that I
> > have put aside.  So I need a callback from blk_finish_plug.
> 
> OK fair enough, I'll add your callback patch.
> 

Thanks.  I'll queue up my md fixes to follow it once it gets to -linus.

NeilBrown

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 12:36                                   ` NeilBrown
@ 2011-04-11 12:48                                     ` Jens Axboe
  2011-04-12  1:12                                       ` hch
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-04-11 12:48 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On 2011-04-11 14:36, NeilBrown wrote:
> On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
> 
>>> Yes.  But I need to know when to release the requests that I have stored.
>>> I need to know when ->write_pages or ->read_pages or whatever has finished
>>> submitting a pile of pages so that I can start processing the request that I
>>> have put aside.  So I need a callback from blk_finish_plug.
>>
>> OK fair enough, I'll add your callback patch.
>>
> 
> Thanks.  I'll queue up my md fixes to follow it once it gets to -linus.

Great, once you do that and XFS kills the blk_flush_plug() calls too,
then we can remove that export and make it internal only.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11  4:50                   ` NeilBrown
  2011-04-11  9:19                     ` Jens Axboe
@ 2011-04-11 16:59                     ` hch
  2011-04-11 21:14                       ` NeilBrown
  1 sibling, 1 reply; 152+ messages in thread
From: hch @ 2011-04-11 16:59 UTC (permalink / raw)
  To: NeilBrown
  Cc: Mike Snitzer, Jens Axboe, linux-kernel, hch, dm-devel, linux-raid

On Mon, Apr 11, 2011 at 02:50:22PM +1000, NeilBrown wrote:
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 273d60b..903ce8d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug)
>  	struct request_queue *q;
>  	unsigned long flags;
>  	struct request *rq;
> +	struct list_head head;
>  
>  	BUG_ON(plug->magic != PLUG_MAGIC);
>  
>  	if (list_empty(&plug->list))
>  		return;
> +	list_add(&head, &plug->list);
> +	list_del_init(&plug->list);
>  
>  	if (plug->should_sort)
> -		list_sort(NULL, &plug->list, plug_rq_cmp);
> +		list_sort(NULL, &head, plug_rq_cmp);
> +	plug->should_sort = 0;

As Jens mentioned this should be list_splice_init.  But looking over
flush_plug_list the code there seems strange to me.

What does the local_irq_save in flush_plug_list protect?  Why don't
we need it over the list_sort?  And do we still need it when first
splicing the list to a local one?

It's one of these cases where I'd really like to see more comments
explaining why the code is doing what it's doing.


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 16:59                     ` hch
@ 2011-04-11 21:14                       ` NeilBrown
  2011-04-11 22:59                         ` hch
  2011-04-12  6:18                         ` Jens Axboe
  0 siblings, 2 replies; 152+ messages in thread
From: NeilBrown @ 2011-04-11 21:14 UTC (permalink / raw)
  To: hch; +Cc: Mike Snitzer, Jens Axboe, linux-kernel, dm-devel, linux-raid

On Mon, 11 Apr 2011 12:59:23 -0400 "hch@infradead.org" <hch@infradead.org>
wrote:

> On Mon, Apr 11, 2011 at 02:50:22PM +1000, NeilBrown wrote:
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 273d60b..903ce8d 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug)
> >  	struct request_queue *q;
> >  	unsigned long flags;
> >  	struct request *rq;
> > +	struct list_head head;
> >  
> >  	BUG_ON(plug->magic != PLUG_MAGIC);
> >  
> >  	if (list_empty(&plug->list))
> >  		return;
> > +	list_add(&head, &plug->list);
> > +	list_del_init(&plug->list);
> >  
> >  	if (plug->should_sort)
> > -		list_sort(NULL, &plug->list, plug_rq_cmp);
> > +		list_sort(NULL, &head, plug_rq_cmp);
> > +	plug->should_sort = 0;
> 
> As Jens mentioned this should be list_splice_init.  But looking over
> flush_plug_list the code there seems strange to me.
> 
> What does the local_irq_save in flush_plug_list protect?  Why don't
> we need it over the list_sort?  And do we still need it when first
> splicing the list to a local one?
> 
> It's one of these cases where I'd really like to see more comments
> explaining why the code is doing what it's doing.

My understanding of that was that the calling requirement of
__elv_add_request is that the queue spinlock is held and that interrupts are
disabled.
So rather than possibly enabling and disabling interrupts several times as
different queues are handled, the code just disables interrupts once, and
then just takes the spinlock once for each different queue.

The whole point of the change to plugging was to take locks less often.
Disabling interrupts less often is presumably an analogous goal.

Though I agree that a comment would help.

	q = NULL;
+	/* Disable interrupts just once rather than using spin_lock_irq/spin_unlock_irq
	 * variants
	 */
	local_irq_save(flags);


assuming my analysis is correct.

NeilBrown


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 12:12                           ` Jens Axboe
@ 2011-04-11 22:58                             ` hch
  2011-04-12  6:20                               ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: hch @ 2011-04-11 22:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: NeilBrown, Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

Looking at the patch
(http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=761e433f3de6fb8e369af9e5c08beb86286d023f)

I'm not sure it's an optimal design.  The flush callback really
is a per-queue thing.  Why isn't it a function pointer in the request
queue when doing the blk_run_queue call once we're done with a given
queue before moving on to the next one?

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 21:14                       ` NeilBrown
@ 2011-04-11 22:59                         ` hch
  2011-04-12  6:18                         ` Jens Axboe
  1 sibling, 0 replies; 152+ messages in thread
From: hch @ 2011-04-11 22:59 UTC (permalink / raw)
  To: NeilBrown
  Cc: hch, Mike Snitzer, Jens Axboe, linux-kernel, dm-devel, linux-raid

On Tue, Apr 12, 2011 at 07:14:28AM +1000, NeilBrown wrote:
> 
> My understanding of that was that the calling requirement of
> __elv_add_request is that the queue spinlock is held and that interrupts are
> disabled.
> So rather than possibly enabling and disabling interrupts several times as
> different queues are handled, the code just disables interrupts once, and
> then just takes the spinlock once for each different queue.
> 
> The whole point of the change to plugging was to take locks less often.
> Disabling interrupts less often is presumably an analogous goal.
> 
> Though I agree that a comment would help.
> 
> 	q = NULL;
> +	/* Disable interrupts just once rather than using spin_lock_irq/spin_unlock_irq
> 	 * variants
> 	 */
> 	local_irq_save(flags);
> 
> 
> assuming my analysis is correct.

Your explanation does make sense to me now that you explain it.  I
didn't even think of that variant before.


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 12:48                                     ` Jens Axboe
@ 2011-04-12  1:12                                       ` hch
  2011-04-12  8:36                                         ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: hch @ 2011-04-12  1:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: NeilBrown, Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> Great, once you do that and XFS kills the blk_flush_plug() calls too,
> then we can remove that export and make it internal only.

Linus pulled the tree, so they are gone now.  Btw, there's still some
bits in the area that confuse me:

 - what's the point of the queue_sync_plugs?  It has a lot of comments
   that seem to pre-date the onstack plugging, but except for that
   it's a trivial wrapper around blk_flush_plug, with an argument
   that is not used.
 - is there a good reason for the existence of __blk_flush_plug?  You'd
   get one additional instruction in the inlined version of
   blk_flush_plug when opencoding, but avoid the need for chained
   function calls.
 - Why is having a plug in blk_flush_plug marked unlikely?  Note that
   unlikely is the static branch prediction hint to mark the case
   extremely unlikely and is even used for hot/cold partitioning.  But
   when we call it we usually check beforehand if we actually have
   plugs, so it's actually likely to happen.
 - what is the point of blk_finish_plug?  All callers have
   the plug on stack, and there's no good reason for adding the NULL
   check.  Note that blk_start_plug doesn't have the NULL check either.
 - Why does __blk_flush_plug call __blk_finish_plug which might clear
   tsk->plug, just to set it back after the call? When manually inlining
   __blk_finish_plug into __blk_flush_plug it looks like:

void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
{
	flush_plug_list(plug);
	if (plug == tsk->plug)
		tsk->plug = NULL;
	tsk->plug = plug;
}

   it would seem much smarter to just call flush_plug_list directly.
   In fact it seems like the tsk->plug is not necessary at all and
   all remaining __blk_flush_plug callers could be replaced with
   flush_plug_list.

 - and of course the remaining issue of why io_schedule needs an
   explicit blk_flush_plug when schedule() already does one in
   case it actually needs to schedule.


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 21:14                       ` NeilBrown
  2011-04-11 22:59                         ` hch
@ 2011-04-12  6:18                         ` Jens Axboe
  1 sibling, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-12  6:18 UTC (permalink / raw)
  To: NeilBrown; +Cc: hch, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On 2011-04-11 23:14, NeilBrown wrote:
> On Mon, 11 Apr 2011 12:59:23 -0400 "hch@infradead.org" <hch@infradead.org>
> wrote:
> 
>> On Mon, Apr 11, 2011 at 02:50:22PM +1000, NeilBrown wrote:
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 273d60b..903ce8d 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug)
>>>  	struct request_queue *q;
>>>  	unsigned long flags;
>>>  	struct request *rq;
>>> +	struct list_head head;
>>>  
>>>  	BUG_ON(plug->magic != PLUG_MAGIC);
>>>  
>>>  	if (list_empty(&plug->list))
>>>  		return;
>>> +	list_add(&head, &plug->list);
>>> +	list_del_init(&plug->list);
>>>  
>>>  	if (plug->should_sort)
>>> -		list_sort(NULL, &plug->list, plug_rq_cmp);
>>> +		list_sort(NULL, &head, plug_rq_cmp);
>>> +	plug->should_sort = 0;
>>
>> As Jens mentioned this should be list_splice_init.  But looking over
>> flush_plug_list the code there seems strange to me.
>>
>> What does the local_irq_save in flush_plug_list protect?  Why don't
>> we need it over the list_sort?  And do we still need it when first
>> splicing the list to a local one?
>>
>> It's one of these cases where I'd really like to see more comments
>> explaining why the code is doing what it's doing.
> 
> My understanding of that was that the calling requirement of
> __elv_add_request is that the queue spinlock is held and that interrupts are
> disabled.
> So rather than possibly enabling and disabling interrupts several times as
> different queues are handled, the code just disables interrupts once, and
> then just takes the spinlock once for each different queue.
> 
> The whole point of the change to plugging was to take locks less often.
> Disabling interrupts less often is presumably an analogous goal.
> 
> Though I agree that a comment would help.
> 
> 	q = NULL;
> +	/* Disable interrupts just once rather than using spin_lock_irq/spin_unlock_irq
> 	 * variants
> 	 */
> 	local_irq_save(flags);
> 
> 
> assuming my analysis is correct.

Yep that is correct, it's to avoid juggling irq on and off for multiple
queues. I will put a comment there.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 22:58                             ` hch
@ 2011-04-12  6:20                               ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-12  6:20 UTC (permalink / raw)
  To: hch; +Cc: NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On 2011-04-12 00:58, hch@infradead.org wrote:
> Looking at the patch
> (http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=761e433f3de6fb8e369af9e5c08beb86286d023f)
> 
> I'm not sure it's an optimal design.  The flush callback really
> is a per-queue thing.  Why isn't it a function pointer in the request
> queue when doing the blk_run_queue call once we're done with a given
> queue before moving on to the next one?

I was thinking about this yesterday as well; the design didn't quite
feel right. Additionally, the user must now track this state too, and
whether he's plugged on that task or not.

I'll rewrite this.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12  1:12                                       ` hch
@ 2011-04-12  8:36                                         ` Jens Axboe
  2011-04-12 12:22                                           ` Dave Chinner
  2011-04-12 16:50                                           ` hch
  0 siblings, 2 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-12  8:36 UTC (permalink / raw)
  To: hch; +Cc: NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On 2011-04-12 03:12, hch@infradead.org wrote:
> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>> then we can remove that export and make it internal only.
> 
> Linus pulled the tree, so they are gone now.  Btw, there's still some
> bits in the area that confuse me:

Great!

>  - what's the point of the queue_sync_plugs?  It has a lot of comments
>    that seem to pre-date the onstack plugging, but except for that
>    it's a trivial wrapper around blk_flush_plug, with an argument
>    that is not used.

There's really no point to it anymore. Its existence was due to the
older revision that had to track write requests for serializing around
a barrier. I'll kill it, since we don't do that anymore.

>  - is there a good reason for the existence of __blk_flush_plug?  You'd
>    get one additional instruction in the inlined version of
>    blk_flush_plug when opencoding, but avoid the need for chained
>    function calls.
>  - Why is having a plug in blk_flush_plug marked unlikely?  Note that
>    unlikely is the static branch prediction hint to mark the case
>    extremely unlikely and is even used for hot/cold partitioning.  But
>    when we call it we usually check beforehand if we actually have
>    plugs, so it's actually likely to happen.

The existence and out-of-line version are for the schedule() hook. It should be
an unlikely event to schedule with a plug held; normally the plug should
have been explicitly unplugged before that happens.

>  - what is the point of blk_finish_plug?  All callers have
>    the plug on stack, and there's no good reason for adding the NULL
>    check.  Note that blk_start_plug doesn't have the NULL check either.

That one can probably go, I need to double check that part since some
things changed.

>  - Why does __blk_flush_plug call __blk_finish_plug which might clear
>    tsk->plug, just to set it back after the call? When manually inlining
>    __blk_finish_plug into __blk_flush_plug it looks like:
> 
> void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
> {
> 	flush_plug_list(plug);
> 	if (plug == tsk->plug)
> 		tsk->plug = NULL;
> 	tsk->plug = plug;
> }
> 
>    it would seem much smarter to just call flush_plug_list directly.
>    In fact it seems like the tsk->plug is not necessary at all and
>    all remaining __blk_flush_plug callers could be replaced with
>    flush_plug_list.

It depends on whether this was an explicit unplug (eg
blk_finish_plug()), or whether it was an implicit event (eg on
schedule()). If we do it on schedule(), then we retain the plug after
the flush. Otherwise we clear it.

>  - and of course the remaining issue of why io_schedule needs an
>    explicit blk_flush_plug when schedule() already does one in
>    case it actually needs to schedule.

Already answered in other email.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12  8:36                                         ` Jens Axboe
@ 2011-04-12 12:22                                           ` Dave Chinner
  2011-04-12 12:28                                             ` Jens Axboe
  2011-04-12 16:50                                           ` hch
  1 sibling, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2011-04-12 12:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: hch, NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> On 2011-04-12 03:12, hch@infradead.org wrote:
> > On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >> Great, once you do that and XFS kills the blk_flush_plug() calls too,
> >> then we can remove that export and make it internal only.
> > 
> > Linus pulled the tree, so they are gone now.  Btw, there's still some
> > bits in the area that confuse me:
> 
> Great!
> 
> >  - what's the point of the queue_sync_plugs?  It has a lot of comments
> >    that seem to pre-date the onstack plugging, but except for that
> >    it's a trivial wrapper around blk_flush_plug, with an argument
> >    that is not used.
> 
> There's really no point to it anymore. Its existence was due to the
> older revision that had to track write requests for serializing around
> a barrier. I'll kill it, since we don't do that anymore.
> 
> >  - is there a good reason for the existence of __blk_flush_plug?  You'd
> >    get one additional instruction in the inlined version of
> >    blk_flush_plug when opencoding, but avoid the need for chained
> >    function calls.
> >  - Why is having a plug in blk_flush_plug marked unlikely?  Note that
> >    unlikely is the static branch prediction hint to mark the case
> >    extremely unlikely and is even used for hot/cold partitioning.  But
> >    when we call it we usually check beforehand if we actually have
> >    plugs, so it's actually likely to happen.
> 
> The existence and out-of-line version are for the schedule() hook. It should be
> an unlikely event to schedule with a plug held; normally the plug should
> have been explicitly unplugged before that happens.

Though if it does, haven't you just added a significant amount of
depth to the worst case stack usage? I'm seeing this sort of thing
from io_schedule():

        Depth    Size   Location    (40 entries)
        -----    ----   --------
  0)     4256      16   mempool_alloc_slab+0x15/0x20
  1)     4240     144   mempool_alloc+0x63/0x160
  2)     4096      16   scsi_sg_alloc+0x4c/0x60
  3)     4080     112   __sg_alloc_table+0x66/0x140
  4)     3968      32   scsi_init_sgtable+0x33/0x90
  5)     3936      48   scsi_init_io+0x31/0xc0
  6)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
  7)     3856     112   sd_prep_fn+0x150/0xa90
  8)     3744      48   blk_peek_request+0x6a/0x1f0
  9)     3696      96   scsi_request_fn+0x60/0x510
 10)     3600      32   __blk_run_queue+0x57/0x100
 11)     3568      80   flush_plug_list+0x133/0x1d0
 12)     3488      32   __blk_flush_plug+0x24/0x50
 13)     3456      32   io_schedule+0x79/0x80

(This is from a page fault on ext3 that is doing page cache
readahead and blocking on a locked buffer.)

I've seen traces where mempool_alloc_slab enters direct reclaim
which adds another 1.5k of stack usage to this path. So I'm
extremely concerned that you've just reduced the stack available to
every thread by at least 2.5k of space...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 12:22                                           ` Dave Chinner
@ 2011-04-12 12:28                                             ` Jens Axboe
  2011-04-12 12:41                                               ` Dave Chinner
  2011-04-12 13:40                                               ` Dave Chinner
  0 siblings, 2 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-12 12:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hch, NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On 2011-04-12 14:22, Dave Chinner wrote:
> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
>> On 2011-04-12 03:12, hch@infradead.org wrote:
>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>>>> then we can remove that export and make it internal only.
>>>
>>> Linus pulled the tree, so they are gone now.  Btw, there's still some
>>> bits in the area that confuse me:
>>
>> Great!
>>
>>>  - what's the point of the queue_sync_plugs?  It has a lot of comments
>>>    that seem to pre-date the onstack plugging, but except for that
>>>    it's a trivial wrapper around blk_flush_plug, with an argument
>>>    that is not used.
>>
>> There's really no point to it anymore. Its existence was due to the
>> older revision that had to track write requests for serializing around
>> a barrier. I'll kill it, since we don't do that anymore.
>>
>>>  - is there a good reason for the existence of __blk_flush_plug?  You'd
>>>    get one additional instruction in the inlined version of
>>>    blk_flush_plug when opencoding, but avoid the need for chained
>>>    function calls.
>>>  - Why is having a plug in blk_flush_plug marked unlikely?  Note that
>>>    unlikely is the static branch prediction hint to mark the case
>>>    extremely unlikely and is even used for hot/cold partitioning.  But
>>>    when we call it we usually check beforehand if we actually have
>>>    plugs, so it's actually likely to happen.
>>
>> The existence and out-of-line version are for the schedule() hook. It should be
>> an unlikely event to schedule with a plug held; normally the plug should
>> have been explicitly unplugged before that happens.
> 
> Though if it does, haven't you just added a significant amount of
> depth to the worst case stack usage? I'm seeing this sort of thing
> from io_schedule():
> 
>         Depth    Size   Location    (40 entries)
>         -----    ----   --------
>   0)     4256      16   mempool_alloc_slab+0x15/0x20
>   1)     4240     144   mempool_alloc+0x63/0x160
>   2)     4096      16   scsi_sg_alloc+0x4c/0x60
>   3)     4080     112   __sg_alloc_table+0x66/0x140
>   4)     3968      32   scsi_init_sgtable+0x33/0x90
>   5)     3936      48   scsi_init_io+0x31/0xc0
>   6)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
>   7)     3856     112   sd_prep_fn+0x150/0xa90
>   8)     3744      48   blk_peek_request+0x6a/0x1f0
>   9)     3696      96   scsi_request_fn+0x60/0x510
>  10)     3600      32   __blk_run_queue+0x57/0x100
>  11)     3568      80   flush_plug_list+0x133/0x1d0
>  12)     3488      32   __blk_flush_plug+0x24/0x50
>  13)     3456      32   io_schedule+0x79/0x80
> 
> (This is from a page fault on ext3 that is doing page cache
> readahead and blocking on a locked buffer.)
> 
> I've seen traces where mempool_alloc_slab enters direct reclaim
> which adds another 1.5k of stack usage to this path. So I'm
> extremely concerned that you've just reduced the stack available to
> every thread by at least 2.5k of space...

Yeah, that does not look great. If this turns out to be problematic, we
can turn the queue runs for the unlikely case into out-of-line runs from
kblockd.

But this really isn't that new - you could already enter the IO dispatch
path when doing IO (when submitting it). So we had better be able to
handle that.

If it's a problem from the schedule()/io_schedule() path, then let's
ensure that those are truly unlikely events so we can punt them to
kblockd.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 12:28                                             ` Jens Axboe
@ 2011-04-12 12:41                                               ` Dave Chinner
  2011-04-12 12:58                                                 ` Jens Axboe
  2011-04-12 13:40                                               ` Dave Chinner
  1 sibling, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2011-04-12 12:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: hch, NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
> On 2011-04-12 14:22, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> >> On 2011-04-12 03:12, hch@infradead.org wrote:
> >>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
> >>>> then we can remove that export and make it internal only.
> >>>
> >>> Linus pulled the tree, so they are gone now.  Btw, there's still some
> >>> bits in the area that confuse me:
> >>
> >> Great!
> >>
> >>>  - what's the point of the queue_sync_plugs?  It has a lot of comments
> >>>    that seem to pre-date the onstack plugging, but except for that
> >>>    it's a trivial wrapper around blk_flush_plug, with an argument
> >>>    that is not used.
> >>
> >> There's really no point to it anymore. Its existence was due to the
> >> older revision that had to track write requests for serializing around
> >> a barrier. I'll kill it, since we don't do that anymore.
> >>
> >>>  - is there a good reason for the existence of __blk_flush_plug?  You'd
> >>>    get one additional instruction in the inlined version of
> >>>    blk_flush_plug when opencoding, but avoid the need for chained
> >>>    function calls.
> >>>  - Why is having a plug in blk_flush_plug marked unlikely?  Note that
> >>>    unlikely is the static branch prediction hint to mark the case
> >>>    extremely unlikely and is even used for hot/cold partitioning.  But
> >>>    when we call it we usually check beforehand if we actually have
> >>>    plugs, so it's actually likely to happen.
> >>
> >> The existence and out-of-line version are for the schedule() hook. It should be
> >> an unlikely event to schedule with a plug held; normally the plug should
> >> have been explicitly unplugged before that happens.
> > 
> > Though if it does, haven't you just added a significant amount of
> > depth to the worst case stack usage? I'm seeing this sort of thing
> > from io_schedule():
> > 
> >         Depth    Size   Location    (40 entries)
> >         -----    ----   --------
> >   0)     4256      16   mempool_alloc_slab+0x15/0x20
> >   1)     4240     144   mempool_alloc+0x63/0x160
> >   2)     4096      16   scsi_sg_alloc+0x4c/0x60
> >   3)     4080     112   __sg_alloc_table+0x66/0x140
> >   4)     3968      32   scsi_init_sgtable+0x33/0x90
> >   5)     3936      48   scsi_init_io+0x31/0xc0
> >   6)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
> >   7)     3856     112   sd_prep_fn+0x150/0xa90
> >   8)     3744      48   blk_peek_request+0x6a/0x1f0
> >   9)     3696      96   scsi_request_fn+0x60/0x510
> >  10)     3600      32   __blk_run_queue+0x57/0x100
> >  11)     3568      80   flush_plug_list+0x133/0x1d0
> >  12)     3488      32   __blk_flush_plug+0x24/0x50
> >  13)     3456      32   io_schedule+0x79/0x80
> > 
> > (This is from a page fault on ext3 that is doing page cache
> > readahead and blocking on a locked buffer.)
> > 
> > I've seen traces where mempool_alloc_slab enters direct reclaim
> > which adds another 1.5k of stack usage to this path. So I'm
> > extremely concerned that you've just reduced the stack available to
> > every thread by at least 2.5k of space...
> 
> Yeah, that does not look great. If this turns out to be problematic, we
> can turn the queue runs from the unlikely case into out-of-line from
> kblockd.
> 
> But this really isn't that new, you could enter the IO dispatch path
> when doing IO already (when submitting it). So we better be able to
> handle that.

The problem I see is that IO is submitted at a point where there's plenty
of stack available, which would previously have been fine. Now, however,
it hits the plug, and then later on, after the thread has consumed a lot
more stack, it, say, waits for a completion. We then schedule, the queue
gets unplugged, and we add the IO stack usage at a place where there isn't
much space available.

So effectively we are moving the places where stack is consumed around,
and it's completely unpredictable where that stack usage is going to land
now.

> If it's a problem from the schedule()/io_schedule() path, then
> let's ensure that those are truly unlikely events so we can punt
> them to kblockd.

Rather than wait for an explosion to be reported before doing this,
why not just punt unplugs to kblockd unconditionally?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 12:41                                               ` Dave Chinner
@ 2011-04-12 12:58                                                 ` Jens Axboe
  2011-04-12 13:31                                                   ` Dave Chinner
  2011-04-12 16:44                                                   ` hch
  0 siblings, 2 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-12 12:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hch, NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On 2011-04-12 14:41, Dave Chinner wrote:
> On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
>> On 2011-04-12 14:22, Dave Chinner wrote:
>>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
>>>> On 2011-04-12 03:12, hch@infradead.org wrote:
>>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>>>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>>>>>> then we can remove that export and make it internal only.
>>>>>
>>>>> Linus pulled the tree, so they are gone now.  Btw, there's still some
>>>>> bits in the area that confuse me:
>>>>
>>>> Great!
>>>>
>>>>>  - what's the point of the queue_sync_plugs?  It has a lot of comment
>>>>>    that seem to pre-data the onstack plugging, but except for that
>>>>>    it's trivial wrapper around blk_flush_plug, with an argument
>>>>>    that is not used.
>>>>
>>>> There's really no point to it anymore. It's existance was due to the
>>>> older revision that had to track write requests for serializaing around
>>>> a barrier. I'll kill it, since we don't do that anymore.
>>>>
>>>>>  - is there a good reason for the existance of __blk_flush_plug?  You'd
>>>>>    get one additional instruction in the inlined version of
>>>>>    blk_flush_plug when opencoding, but avoid the need for chained
>>>>>    function calls.
>>>>>  - Why is having a plug in blk_flush_plug marked unlikely?  Note that
>>>>>    unlikely is the static branch prediction hint to mark the case
>>>>>    extremely unlikely and is even used for hot/cold partitioning.  But
>>>>>    when we call it we usually check beforehand if we actually have
>>>>>    plugs, so it's actually likely to happen.
>>>>
>>>> The existence and the out-of-line version are for the scheduler() hook. It should be
>>>> an unlikely event to schedule with a plug held; normally the plug should
>>>> have been explicitly unplugged before that happens.
>>>
>>> Though if it does, haven't you just added a significant amount of
>>> depth to the worst case stack usage? I'm seeing this sort of thing
>>> from io_schedule():
>>>
>>>         Depth    Size   Location    (40 entries)
>>>         -----    ----   --------
>>>   0)     4256      16   mempool_alloc_slab+0x15/0x20
>>>   1)     4240     144   mempool_alloc+0x63/0x160
>>>   2)     4096      16   scsi_sg_alloc+0x4c/0x60
>>>   3)     4080     112   __sg_alloc_table+0x66/0x140
>>>   4)     3968      32   scsi_init_sgtable+0x33/0x90
>>>   5)     3936      48   scsi_init_io+0x31/0xc0
>>>   6)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
>>>   7)     3856     112   sd_prep_fn+0x150/0xa90
>>>   8)     3744      48   blk_peek_request+0x6a/0x1f0
>>>   9)     3696      96   scsi_request_fn+0x60/0x510
>>>  10)     3600      32   __blk_run_queue+0x57/0x100
>>>  11)     3568      80   flush_plug_list+0x133/0x1d0
>>>  12)     3488      32   __blk_flush_plug+0x24/0x50
>>>  13)     3456      32   io_schedule+0x79/0x80
>>>
>>> (This is from a page fault on ext3 that is doing page cache
>>> readahead and blocking on a locked buffer.)
>>>
>>> I've seen traces where mempool_alloc_slab enters direct reclaim
>>> which adds another 1.5k of stack usage to this path. So I'm
>>> extremely concerned that you've just reduced the stack available to
>>> every thread by at least 2.5k of space...
>>
>> Yeah, that does not look great. If this turns out to be problematic, we
>> can turn the queue runs from the unlikely case into out-of-line from
>> kblockd.
>>
>> But this really isn't that new, you could enter the IO dispatch path
>> when doing IO already (when submitting it). So we better be able to
>> handle that.
> 
> The problem I see is that IO is submitted when there's plenty of
> stack available, which would have previously been fine. However, now it
> hits the plug, and then later on after the thread consumes a lot
> more stack it, say, waits for a completion. We then schedule, it
> unplugs the queue and we add the IO stack to a place where there
> isn't much space available.
>
> So effectively we are moving the places where stack is consumed
> about, and it's completely unpredictable where that stack is going to
> land now.

Isn't that example fairly contrived? If we ended up doing the IO
dispatch before, then the only difference now is the stack usage of
schedule() itself. Apart from that, as far as I can tell, there should
not be much difference.

 
>> If it's a problem from the schedule()/io_schedule() path, then
>> let's ensure that those are truly unlikely events so we can punt
>> them to kblockd.
> 
> Rather than wait for an explosion to be reported before doing this,
> why not just punt unplugs to kblockd unconditionally?

Supposedly it's faster to do it inline rather than punt the dispatch.
But that may actually not be true, if you have multiple plugs going (and
thus multiple contenders for the queue lock on dispatch). So let's play
it safe and punt to kblockd; we can always revisit this later.

diff --git a/block/blk-core.c b/block/blk-core.c
index c6eaa1f..36b1a75 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2665,7 +2665,7 @@ static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
 static void queue_unplugged(struct request_queue *q, unsigned int depth)
 {
 	trace_block_unplug_io(q, depth);
-	__blk_run_queue(q, false);
+	__blk_run_queue(q, true);
 
 	if (q->unplugged_fn)
 		q->unplugged_fn(q);


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 12:58                                                 ` Jens Axboe
@ 2011-04-12 13:31                                                   ` Dave Chinner
  2011-04-12 13:45                                                     ` Jens Axboe
  2011-04-12 16:58                                                     ` hch
  2011-04-12 16:44                                                   ` hch
  1 sibling, 2 replies; 152+ messages in thread
From: Dave Chinner @ 2011-04-12 13:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: hch, NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
> On 2011-04-12 14:41, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
> >> On 2011-04-12 14:22, Dave Chinner wrote:
> >>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> >>>> On 2011-04-12 03:12, hch@infradead.org wrote:
> >>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >>>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
> >>>>>> then we can remove that export and make it internal only.
> >>>>>
> >>>>> Linus pulled the tree, so they are gone now.  Btw, there's still some
> >>>>> bits in the area that confuse me:
> >>>>
> >>>> Great!
> >>>>
> >>>>>  - what's the point of the queue_sync_plugs?  It has a lot of comments
> >>>>>    that seem to pre-date the onstack plugging, but except for that
> >>>>>    it's a trivial wrapper around blk_flush_plug, with an argument
> >>>>>    that is not used.
> >>>>
> >>>> There's really no point to it anymore. Its existence was due to the
> >>>> older revision that had to track write requests for serializing around
> >>>> a barrier. I'll kill it, since we don't do that anymore.
> >>>>
> >>>>>  - is there a good reason for the existence of __blk_flush_plug?  You'd
> >>>>>    get one additional instruction in the inlined version of
> >>>>>    blk_flush_plug when opencoding, but avoid the need for chained
> >>>>>    function calls.
> >>>>>  - Why is having a plug in blk_flush_plug marked unlikely?  Note that
> >>>>>    unlikely is the static branch prediction hint to mark the case
> >>>>>    extremely unlikely and is even used for hot/cold partitioning.  But
> >>>>>    when we call it we usually check beforehand if we actually have
> >>>>>    plugs, so it's actually likely to happen.
> >>>>
> >>>> The existence and the out-of-line version are for the scheduler() hook. It should be
> >>>> an unlikely event to schedule with a plug held; normally the plug should
> >>>> have been explicitly unplugged before that happens.
> >>>
> >>> Though if it does, haven't you just added a significant amount of
> >>> depth to the worst case stack usage? I'm seeing this sort of thing
> >>> from io_schedule():
> >>>
> >>>         Depth    Size   Location    (40 entries)
> >>>         -----    ----   --------
> >>>   0)     4256      16   mempool_alloc_slab+0x15/0x20
> >>>   1)     4240     144   mempool_alloc+0x63/0x160
> >>>   2)     4096      16   scsi_sg_alloc+0x4c/0x60
> >>>   3)     4080     112   __sg_alloc_table+0x66/0x140
> >>>   4)     3968      32   scsi_init_sgtable+0x33/0x90
> >>>   5)     3936      48   scsi_init_io+0x31/0xc0
> >>>   6)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
> >>>   7)     3856     112   sd_prep_fn+0x150/0xa90
> >>>   8)     3744      48   blk_peek_request+0x6a/0x1f0
> >>>   9)     3696      96   scsi_request_fn+0x60/0x510
> >>>  10)     3600      32   __blk_run_queue+0x57/0x100
> >>>  11)     3568      80   flush_plug_list+0x133/0x1d0
> >>>  12)     3488      32   __blk_flush_plug+0x24/0x50
> >>>  13)     3456      32   io_schedule+0x79/0x80
> >>>
> >>> (This is from a page fault on ext3 that is doing page cache
> >>> readahead and blocking on a locked buffer.)
> >>>
> >>> I've seen traces where mempool_alloc_slab enters direct reclaim
> >>> which adds another 1.5k of stack usage to this path. So I'm
> >>> extremely concerned that you've just reduced the stack available to
> >>> every thread by at least 2.5k of space...
> >>
> >> Yeah, that does not look great. If this turns out to be problematic, we
> >> can turn the queue runs from the unlikely case into out-of-line from
> >> kblockd.
> >>
> >> But this really isn't that new, you could enter the IO dispatch path
> >> when doing IO already (when submitting it). So we better be able to
> >> handle that.
> > 
> > The problem I see is that IO is submitted when there's plenty of
> > stack available, which would have previously been fine. However, now it
> > hits the plug, and then later on after the thread consumes a lot
> > more stack it, say, waits for a completion. We then schedule, it
> > unplugs the queue and we add the IO stack to a place where there
> > isn't much space available.
> >
> > So effectively we are moving the places where stack is consumed
> > about, and it's completely unpredictable where that stack is going to
> > land now.
> 
> Isn't that example fairly contrived?

I don't think so. e.g. in the XFS allocation path we do btree block
readahead, then go do the real work. The real work can end up with a
deeper stack before blocking on locks or completions unrelated to
the readahead, leading to schedule() being called and an unplug
being issued at that point.  You might think it contrived, but if
you can't provide a guarantee that it can't happen, then I have to
assume it will happen.

My concern is that we're already under stack space stress in the
writeback path, so anything that has the potential to increase it
significantly is a major worry from my point of view...

> If we ended up doing the IO
> dispatch before, then the only difference now is the stack usage of
> schedule() itself. Apart from that, as far as I can tell, there should
> not be much difference.

There's a difference between IO submission and IO dispatch. IO
submission is submit_bio thru to the plug; IO dispatch is from the
plug down to the disk. If they happen at the same place, there's no
problem. If IO dispatch is moved to schedule() via a plug....

> >> If it's a problem from the schedule()/io_schedule() path, then
>> let's ensure that those are truly unlikely events so we can punt
> >> them to kblockd.
> > 
> > Rather than wait for an explosion to be reported before doing this,
> > why not just punt unplugs to kblockd unconditionally?
> 
> Supposedly it's faster to do it inline rather than punt the dispatch.
> But that may actually not be true, if you have multiple plugs going (and
> thus multiple contenders for the queue lock on dispatch). So let's play
> it safe and punt to kblockd; we can always revisit this later.

It's always best to play it safe when it comes to other people's
data....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 12:28                                             ` Jens Axboe
  2011-04-12 12:41                                               ` Dave Chinner
@ 2011-04-12 13:40                                               ` Dave Chinner
  2011-04-12 13:48                                                 ` Jens Axboe
  1 sibling, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2011-04-12 13:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: hch, NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
> On 2011-04-12 14:22, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> >> On 2011-04-12 03:12, hch@infradead.org wrote:
> >>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >>>    function calls.
> >>>  - Why is having a plug in blk_flush_plug marked unlikely?  Note that
> >>>    unlikely is the static branch prediction hint to mark the case
> >>>    extremely unlikely and is even used for hot/cold partitioning.  But
> >>>    when we call it we usually check beforehand if we actually have
> >>>    plugs, so it's actually likely to happen.
> >>
> >> The existence and the out-of-line version are for the scheduler() hook. It should be
> >> an unlikely event to schedule with a plug held; normally the plug should
> >> have been explicitly unplugged before that happens.
> > 
> > Though if it does, haven't you just added a significant amount of
> > depth to the worst case stack usage? I'm seeing this sort of thing
> > from io_schedule():
> > 
> >         Depth    Size   Location    (40 entries)
> >         -----    ----   --------
> >   0)     4256      16   mempool_alloc_slab+0x15/0x20
> >   1)     4240     144   mempool_alloc+0x63/0x160
> >   2)     4096      16   scsi_sg_alloc+0x4c/0x60
> >   3)     4080     112   __sg_alloc_table+0x66/0x140
> >   4)     3968      32   scsi_init_sgtable+0x33/0x90
> >   5)     3936      48   scsi_init_io+0x31/0xc0
> >   6)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
> >   7)     3856     112   sd_prep_fn+0x150/0xa90
> >   8)     3744      48   blk_peek_request+0x6a/0x1f0
> >   9)     3696      96   scsi_request_fn+0x60/0x510
> >  10)     3600      32   __blk_run_queue+0x57/0x100
> >  11)     3568      80   flush_plug_list+0x133/0x1d0
> >  12)     3488      32   __blk_flush_plug+0x24/0x50
> >  13)     3456      32   io_schedule+0x79/0x80
> > 
> > (This is from a page fault on ext3 that is doing page cache
> > readahead and blocking on a locked buffer.)

FYI, the next step in the allocation chain adds >900 bytes to that
stack:

$ cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (47 entries)
        -----    ----   --------
  0)     5176      40   zone_statistics+0xad/0xc0
  1)     5136     288   get_page_from_freelist+0x2cf/0x840
  2)     4848     304   __alloc_pages_nodemask+0x121/0x930
  3)     4544      48   kmem_getpages+0x62/0x160
  4)     4496      96   cache_grow+0x308/0x330
  5)     4400      80   cache_alloc_refill+0x21c/0x260
  6)     4320      64   kmem_cache_alloc+0x1b7/0x1e0
  7)     4256      16   mempool_alloc_slab+0x15/0x20
  8)     4240     144   mempool_alloc+0x63/0x160
  9)     4096      16   scsi_sg_alloc+0x4c/0x60
 10)     4080     112   __sg_alloc_table+0x66/0x140
 11)     3968      32   scsi_init_sgtable+0x33/0x90
 12)     3936      48   scsi_init_io+0x31/0xc0
 13)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
 14)     3856     112   sd_prep_fn+0x150/0xa90
 15)     3744      48   blk_peek_request+0x6a/0x1f0
 16)     3696      96   scsi_request_fn+0x60/0x510
 17)     3600      32   __blk_run_queue+0x57/0x100
 18)     3568      80   flush_plug_list+0x133/0x1d0
 19)     3488      32   __blk_flush_plug+0x24/0x50
 20)     3456      32   io_schedule+0x79/0x80

That's close to 1800 bytes now, and that's not entering the reclaim
path. If I get one deeper than that, I'll be sure to post it. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 13:31                                                   ` Dave Chinner
@ 2011-04-12 13:45                                                     ` Jens Axboe
  2011-04-12 14:34                                                       ` Dave Chinner
  2011-04-12 16:58                                                     ` hch
  1 sibling, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-04-12 13:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hch, NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On 2011-04-12 15:31, Dave Chinner wrote:
> On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
>> On 2011-04-12 14:41, Dave Chinner wrote:
>>> On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
>>>> On 2011-04-12 14:22, Dave Chinner wrote:
>>>>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
>>>>>> On 2011-04-12 03:12, hch@infradead.org wrote:
>>>>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>>>>>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>>>>>>>> then we can remove that export and make it internal only.
>>>>>>>
>>>>>>> Linus pulled the tree, so they are gone now.  Btw, there's still some
>>>>>>> bits in the area that confuse me:
>>>>>>
>>>>>> Great!
>>>>>>
>>>>>>>  - what's the point of the queue_sync_plugs?  It has a lot of comments
>>>>>>>    that seem to pre-date the onstack plugging, but except for that
>>>>>>>    it's a trivial wrapper around blk_flush_plug, with an argument
>>>>>>>    that is not used.
>>>>>>
>>>>>> There's really no point to it anymore. Its existence was due to the
>>>>>> older revision that had to track write requests for serializing around
>>>>>> a barrier. I'll kill it, since we don't do that anymore.
>>>>>>
>>>>>>>  - is there a good reason for the existence of __blk_flush_plug?  You'd
>>>>>>>    get one additional instruction in the inlined version of
>>>>>>>    blk_flush_plug when opencoding, but avoid the need for chained
>>>>>>>    function calls.
>>>>>>>  - Why is having a plug in blk_flush_plug marked unlikely?  Note that
>>>>>>>    unlikely is the static branch prediction hint to mark the case
>>>>>>>    extremely unlikely and is even used for hot/cold partitioning.  But
>>>>>>>    when we call it we usually check beforehand if we actually have
>>>>>>>    plugs, so it's actually likely to happen.
>>>>>>
>>>>>> The existence and the out-of-line version are for the scheduler() hook. It should be
>>>>>> an unlikely event to schedule with a plug held; normally the plug should
>>>>>> have been explicitly unplugged before that happens.
>>>>>
>>>>> Though if it does, haven't you just added a significant amount of
>>>>> depth to the worst case stack usage? I'm seeing this sort of thing
>>>>> from io_schedule():
>>>>>
>>>>>         Depth    Size   Location    (40 entries)
>>>>>         -----    ----   --------
>>>>>   0)     4256      16   mempool_alloc_slab+0x15/0x20
>>>>>   1)     4240     144   mempool_alloc+0x63/0x160
>>>>>   2)     4096      16   scsi_sg_alloc+0x4c/0x60
>>>>>   3)     4080     112   __sg_alloc_table+0x66/0x140
>>>>>   4)     3968      32   scsi_init_sgtable+0x33/0x90
>>>>>   5)     3936      48   scsi_init_io+0x31/0xc0
>>>>>   6)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
>>>>>   7)     3856     112   sd_prep_fn+0x150/0xa90
>>>>>   8)     3744      48   blk_peek_request+0x6a/0x1f0
>>>>>   9)     3696      96   scsi_request_fn+0x60/0x510
>>>>>  10)     3600      32   __blk_run_queue+0x57/0x100
>>>>>  11)     3568      80   flush_plug_list+0x133/0x1d0
>>>>>  12)     3488      32   __blk_flush_plug+0x24/0x50
>>>>>  13)     3456      32   io_schedule+0x79/0x80
>>>>>
>>>>> (This is from a page fault on ext3 that is doing page cache
>>>>> readahead and blocking on a locked buffer.)
>>>>>
>>>>> I've seen traces where mempool_alloc_slab enters direct reclaim
>>>>> which adds another 1.5k of stack usage to this path. So I'm
>>>>> extremely concerned that you've just reduced the stack available to
>>>>> every thread by at least 2.5k of space...
>>>>
>>>> Yeah, that does not look great. If this turns out to be problematic, we
>>>> can turn the queue runs from the unlikely case into out-of-line from
>>>> kblockd.
>>>>
>>>> But this really isn't that new, you could enter the IO dispatch path
>>>> when doing IO already (when submitting it). So we better be able to
>>>> handle that.
>>>
>>> The problem I see is that IO is submitted when there's plenty of
>>> stack available, which would have previously been fine. However, now it
>>> hits the plug, and then later on after the thread consumes a lot
>>> more stack it, say, waits for a completion. We then schedule, it
>>> unplugs the queue and we add the IO stack to a place where there
>>> isn't much space available.
>>>
>>> So effectively we are moving the places where stack is consumed
>>> about, and it's completely unpredictable where that stack is going to
>>> land now.
>>
>> Isn't that example fairly contrived?
> 
> I don't think so. e.g. in the XFS allocation path we do btree block
> readahead, then go do the real work. The real work can end up with a
> deeper stack before blocking on locks or completions unrelated to
> the readahead, leading to schedule() being called and an unplug
> being issued at that point.  You might think it contrived, but if
> you can't provide a guarantee that it can't happen, then I have to
> assume it will happen.

If you ended up in lock_page() somewhere along the way, the path would
have been pretty much the same as it is now:

lock_page()
        __lock_page()
                __wait_on_bit_lock()
                        sync_page()
                                aops->sync_page();
                                        block_sync_page()
                                                __blk_run_backing_dev()

and the dispatch follows after that. If your schedules are only due to,
say, blocking on a mutex, then yes it'll be different. But is that
really the case?

I bet that worst-case stack usage is exactly the same as before, and
that's the only metric we really care about.

> My concern is that we're already under stack space stress in the
> writeback path, so anything that has the potential to increase it
> significantly is a major worry from my point of view...

I agree on writeback being a worry, and that's why I made the change
(since it makes sense for other reasons, too). I just don't think we are
worse off than before.

>> If we ended up doing the IO
>> dispatch before, then the only difference now is the stack usage of
>> schedule() itself. Apart from that, as far as I can tell, there should
>> not be much difference.
> 
> There's a difference between IO submission and IO dispatch. IO
> submission is submit_bio thru to the plug; IO dispatch is from the
> plug down to the disk. If they happen at the same place, there's no
> problem. If IO dispatch is moved to schedule() via a plug....

The IO submission can easily and non-deterministically turn into an IO
dispatch, so there's no real difference for the submitter. That was the
case before. With the explicit plug now, you _know_ that the IO
submission is only that and doesn't include IO dispatch. Not until you
schedule() or call blk_finish_plug(), both of which are events that you
can control.

>>>> If it's a problem from the schedule()/io_schedule() path, then
>>>> let's ensure that those are truly unlikely events so we can punt
>>>> them to kblockd.
>>>
>>> Rather than wait for an explosion to be reported before doing this,
>>> why not just punt unplugs to kblockd unconditionally?
>>
>> Supposedly it's faster to do it inline rather than punt the dispatch.
>> But that may actually not be true, if you have multiple plugs going (and
>> thus multiple contenders for the queue lock on dispatch). So let's play
>> it safe and punt to kblockd; we can always revisit this later.
> 
> It's always best to play it safe when it comes to other people's
> data....

Certainly, but so far I see no real evidence that this is in fact any
safer.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 13:40                                               ` Dave Chinner
@ 2011-04-12 13:48                                                 ` Jens Axboe
  2011-04-12 23:35                                                   ` Dave Chinner
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-04-12 13:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hch, NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On 2011-04-12 15:40, Dave Chinner wrote:
> On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
>> On 2011-04-12 14:22, Dave Chinner wrote:
>>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
>>>> On 2011-04-12 03:12, hch@infradead.org wrote:
>>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>>>>>    function calls.
>>>>>  - Why is having a plug in blk_flush_plug marked unlikely?  Note that
>>>>>    unlikely is the static branch prediction hint to mark the case
>>>>>    extremly unlikely and is even used for hot/cold partitioning.  But
>>>>>    when we call it we usually check beforehand if we actually have
>>>>>    plugs, so it's actually likely to happen.
>>>>
>>>> The existence and the out-of-line version are for the scheduler() hook. It should be
>>>> an unlikely event to schedule with a plug held; normally the plug should
>>>> have been explicitly unplugged before that happens.
>>>
>>> Though if it does, haven't you just added a significant amount of
>>> depth to the worst case stack usage? I'm seeing this sort of thing
>>> from io_schedule():
>>>
>>>         Depth    Size   Location    (40 entries)
>>>         -----    ----   --------
>>>   0)     4256      16   mempool_alloc_slab+0x15/0x20
>>>   1)     4240     144   mempool_alloc+0x63/0x160
>>>   2)     4096      16   scsi_sg_alloc+0x4c/0x60
>>>   3)     4080     112   __sg_alloc_table+0x66/0x140
>>>   4)     3968      32   scsi_init_sgtable+0x33/0x90
>>>   5)     3936      48   scsi_init_io+0x31/0xc0
>>>   6)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
>>>   7)     3856     112   sd_prep_fn+0x150/0xa90
>>>   8)     3744      48   blk_peek_request+0x6a/0x1f0
>>>   9)     3696      96   scsi_request_fn+0x60/0x510
>>>  10)     3600      32   __blk_run_queue+0x57/0x100
>>>  11)     3568      80   flush_plug_list+0x133/0x1d0
>>>  12)     3488      32   __blk_flush_plug+0x24/0x50
>>>  13)     3456      32   io_schedule+0x79/0x80
>>>
>>> (This is from a page fault on ext3 that is doing page cache
>>> readahead and blocking on a locked buffer.)
> 
> FYI, the next step in the allocation chain adds >900 bytes to that
> stack:
> 
> $ cat /sys/kernel/debug/tracing/stack_trace
>         Depth    Size   Location    (47 entries)
>         -----    ----   --------
>   0)     5176      40   zone_statistics+0xad/0xc0
>   1)     5136     288   get_page_from_freelist+0x2cf/0x840
>   2)     4848     304   __alloc_pages_nodemask+0x121/0x930
>   3)     4544      48   kmem_getpages+0x62/0x160
>   4)     4496      96   cache_grow+0x308/0x330
>   5)     4400      80   cache_alloc_refill+0x21c/0x260
>   6)     4320      64   kmem_cache_alloc+0x1b7/0x1e0
>   7)     4256      16   mempool_alloc_slab+0x15/0x20
>   8)     4240     144   mempool_alloc+0x63/0x160
>   9)     4096      16   scsi_sg_alloc+0x4c/0x60
>  10)     4080     112   __sg_alloc_table+0x66/0x140
>  11)     3968      32   scsi_init_sgtable+0x33/0x90
>  12)     3936      48   scsi_init_io+0x31/0xc0
>  13)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
>  14)     3856     112   sd_prep_fn+0x150/0xa90
>  15)     3744      48   blk_peek_request+0x6a/0x1f0
>  16)     3696      96   scsi_request_fn+0x60/0x510
>  17)     3600      32   __blk_run_queue+0x57/0x100
>  18)     3568      80   flush_plug_list+0x133/0x1d0
>  19)     3488      32   __blk_flush_plug+0x24/0x50
>  20)     3456      32   io_schedule+0x79/0x80
> 
> That's close to 1800 bytes now, and that's not entering the reclaim
> path. If I get one deeper than that, I'll be sure to post it. :)

Do you have traces from 2.6.38, or are you just doing them now?

The path you quote above should not go into reclaim; it's a GFP_ATOMIC
allocation.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 13:45                                                     ` Jens Axboe
@ 2011-04-12 14:34                                                       ` Dave Chinner
  2011-04-12 21:08                                                         ` NeilBrown
  0 siblings, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2011-04-12 14:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: hch, NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On Tue, Apr 12, 2011 at 03:45:52PM +0200, Jens Axboe wrote:
> On 2011-04-12 15:31, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
> >> On 2011-04-12 14:41, Dave Chinner wrote:
> >> Isn't that example fairly contrived?
> > 
> > I don't think so. e.g. in the XFS allocation path we do btree block
> > readahead, then go do the real work. The real work can end up with a
> > deeper stack before blocking on locks or completions unrelated to
> > the readahead, leading to schedule() being called and an unplug
> > being issued at that point.  You might think it contrived, but if
> > you can't provide a guarantee that it can't happen then I have to
> > assume it will happen.
> 
> If you ended up in lock_page() somewhere along the way, the path would
> have been pretty much the same as it is now:
> 
> lock_page()
>         __lock_page()
>                 __wait_on_bit_lock()
>                         sync_page()
>                                 aops->sync_page();
>                                         block_sync_page()
>                                                 __blk_run_backing_dev()
> 
> and the dispatch follows after that. If your schedules are only due to,
> say, blocking on a mutex, then yes it'll be different. But is that
> really the case?

XFS metadata IO does not use the page cache anymore, so it won't take
that path - no page locks are taken during read or write. Even
before that change, contending on page locks was extremely rare, as
XFS uses the buffer container for synchronisation.

AFAICT, we have nothing that will cause plugs to be flushed until
scheduling occurs. In many cases it will be at the same points as
before (the explicit flushes XFS had), but there are going to be new
ones....

Like this:

  0)     5360      40   zone_statistics+0xad/0xc0
  1)     5320     288   get_page_from_freelist+0x2cf/0x840
  2)     5032     304   __alloc_pages_nodemask+0x121/0x930
  3)     4728      48   kmem_getpages+0x62/0x160
  4)     4680      96   cache_grow+0x308/0x330
  5)     4584      80   cache_alloc_refill+0x21c/0x260
  6)     4504      16   __kmalloc+0x230/0x240
  7)     4488     176   virtqueue_add_buf_gfp+0x1f9/0x3e0
  8)     4312     144   do_virtblk_request+0x1f3/0x400
  9)     4168      32   __blk_run_queue+0x57/0x100
 10)     4136      80   flush_plug_list+0x133/0x1d0
 11)     4056      32   __blk_flush_plug+0x24/0x50
 12)     4024     160   schedule+0x867/0x9f0
 13)     3864     208   schedule_timeout+0x1f5/0x2c0
 14)     3656     144   wait_for_common+0xe7/0x190
 15)     3512      16   wait_for_completion+0x1d/0x20
 16)     3496      48   xfs_buf_iowait+0x36/0xb0
 17)     3448      32   _xfs_buf_read+0x98/0xa0
 18)     3416      48   xfs_buf_read+0xa2/0x100
 19)     3368      80   xfs_trans_read_buf+0x1db/0x680
......

This path adds roughly 500 bytes to the previous case of
immediate dispatch of the IO down through _xfs_buf_read()...

> I bet that worst case stack usage is exactly the same as before, and
> that's the only metric we really care about.

I've already demonstrated much worse stack usage with ext3 through
the page fault path via io_schedule(). io_schedule() never used to
dispatch IO, and now it does. Similarly, there are changes and
increases in XFS stack usage like the above. IMO, worst-case stack
usage is definitely increased by these changes.

> > My concern is that we're already under stack space stress in the
> > writeback path, so anything that has the potential to increase it
> > significantly is a major worry from my point of view...
> 
> I agree on writeback being a worry, and that's why I made the change
> (since it makes sense for other reasons, too). I just don't think we are
> worse of than before.

We certainly are.

Hmmm, I just noticed a new cumulative stack usage path through
direct reclaim - via congestion_wait() -> io_schedule()....

> >> If we ended up doing the IO
> >> dispatch before, then the only difference now is the stack usage of
> >> schedule() itself. Apart from that, as far as I can tell, there should
> >> not be much difference.
> > 
> > There's a difference between IO submission and IO dispatch. IO
> > submission is submit_bio thru to the plug; IO dispatch is from the
> > plug down to the disk. If they happen at the same place, there's no
> > problem. If IO dispatch is moved to schedule() via a plug....
> 
> The IO submission can easily and non-deterministically turn into an IO
> dispatch, so there's no real difference for the submitter. That was the
> case before. With the explicit plug now, you _know_ that the IO
> submission is only that and doesn't include IO dispatch.

You're violently agreeing with me that you've changed where the IO
dispatch path is run from. ;)

> Not until you
> schedule() or call blk_finish_plug(), both of which are events that you
> can control.

Well, not really - now taking any sleeping lock or waiting on
anything can trigger a plug flush where previously you had to
explicitly issue them. I'm not saying what we had is better, just
that there are implicit flushes with your changes that are
inherently uncontrollable...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 12:58                                                 ` Jens Axboe
  2011-04-12 13:31                                                   ` Dave Chinner
@ 2011-04-12 16:44                                                   ` hch
  2011-04-12 16:49                                                     ` Jens Axboe
  1 sibling, 1 reply; 152+ messages in thread
From: hch @ 2011-04-12 16:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dave Chinner, hch, NeilBrown, Mike Snitzer, linux-kernel,
	dm-devel, linux-raid

On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
> Supposedly it's faster to do it inline rather than punt the dispatch.
> But that may actually not be true, if you have multiple plugs going (and
> thus multiple contenders for the queue lock on dispatch). So let's play
> it safe and punt to kblockd, we can always revisit this later.

Note that this can be optimized further by adding a new helper that just
queues up work on kblockd without taking the queue lock, e.g. adding a
new

void blk_run_queue_async(struct request_queue *q)
{
	if (likely(!blk_queue_stopped(q)))
		queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
}

And replacing all

	__blk_run_queue(q, true);

callers with that, at which point they won't need the queuelock any
more.



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 16:44                                                   ` hch
@ 2011-04-12 16:49                                                     ` Jens Axboe
  2011-04-12 16:54                                                       ` hch
  0 siblings, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-04-12 16:49 UTC (permalink / raw)
  To: hch
  Cc: Dave Chinner, NeilBrown, Mike Snitzer, linux-kernel, dm-devel,
	linux-raid

On 2011-04-12 18:44, hch@infradead.org wrote:
> On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
>> Supposedly it's faster to do it inline rather than punt the dispatch.
>> But that may actually not be true, if you have multiple plugs going (and
>> thus multiple contenders for the queue lock on dispatch). So let's play
>> it safe and punt to kblockd, we can always revisit this later.
> 
> Note that this can be optimized further by adding a new helper that just
> queues up work on kblockd without taking the queue lock, e.g. adding a
> new
> 
> void blk_run_queue_async(struct request_queue *q)
> {
> 	if (likely(!blk_queue_stopped(q)))
> 		queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
> }
> 
> And replacing all
> 
> 	__blk_run_queue(q, true);
> 
> callers with that, at which point they won't need the queuelock any
> more.

I realize that; in fact it's already safe as long as you pass in 'true'
for __blk_run_queue(). Earlier I had rewritten it to move the running
out, which makes the trick a little difficult. This afternoon I also
tested it and saw no noticeable difference, but I'll probably just do it
anyway as it makes sense.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12  8:36                                         ` Jens Axboe
  2011-04-12 12:22                                           ` Dave Chinner
@ 2011-04-12 16:50                                           ` hch
  1 sibling, 0 replies; 152+ messages in thread
From: hch @ 2011-04-12 16:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> The existence and out-of-line call is for the schedule() hook. It should be
> an unlikely event to schedule with a plug held, normally the plug should
> have been explicitly unplugged before that happens.

I still don't think unlikely() is the right thing to do.  The static
branch prediction hints cause a real, massive slowdown if the branch is
taken.  For things like this that happen during normal operation you're
much better off leaving it to the dynamic branch prediction in the CPU.
And I don't think it's all that unlikely - e.g. for all the metadata
during readpages/writepages, schedule/io_schedule will be the unplugging
point right now.  I'll see if I can run an I/O workload with Steve's
likely/unlikely profiling turned on.

> > void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
> > {
> > 	flush_plug_list(plug);
> > 	if (plug == tsk->plug)
> > 		tsk->plug = NULL;
> > 	tsk->plug = plug;
> > }
> > 
> >    it would seem much smarter to just call flush_plug_list directly.
> >    In fact it seems like the tsk->plug is not necessary at all and
> >    all remaining __blk_flush_plug callers could be replaced with
> >    flush_plug_list.
> 
> It depends on whether this was an explicit unplug (eg
> blk_finish_plug()), or whether it was an implicit event (eg on
> schedule()). If we do it on schedule(), then we retain the plug after
> the flush. Otherwise we clear it.

blk_finish_plug doesn't go through this code path.

This is an untested patch how the area should look to me:


diff --git a/block/blk-core.c b/block/blk-core.c
index 90f22cc..6fa5ba1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2668,7 +2668,7 @@ static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
 	return !(rqa->q <= rqb->q);
 }
 
-static void flush_plug_list(struct blk_plug *plug)
+void blk_flush_plug_list(struct blk_plug *plug)
 {
 	struct request_queue *q;
 	unsigned long flags;
@@ -2716,29 +2716,16 @@ static void flush_plug_list(struct blk_plug *plug)
 	BUG_ON(!list_empty(&plug->list));
 	local_irq_restore(flags);
 }
-
-static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
-{
-	flush_plug_list(plug);
-
-	if (plug == tsk->plug)
-		tsk->plug = NULL;
-}
+EXPORT_SYMBOL_GPL(blk_flush_plug_list);
 
 void blk_finish_plug(struct blk_plug *plug)
 {
-	if (plug)
-		__blk_finish_plug(current, plug);
+	blk_flush_plug_list(plug);
+	if (plug == current->plug)
+		current->plug = NULL;
 }
 EXPORT_SYMBOL(blk_finish_plug);
 
-void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
-{
-	__blk_finish_plug(tsk, plug);
-	tsk->plug = plug;
-}
-EXPORT_SYMBOL(__blk_flush_plug);
-
 int __init blk_dev_init(void)
 {
 	BUILD_BUG_ON(__REQ_NR_BITS > 8 *
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32176cc..fa6a4e1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -862,14 +862,14 @@ struct blk_plug {
 
 extern void blk_start_plug(struct blk_plug *);
 extern void blk_finish_plug(struct blk_plug *);
-extern void __blk_flush_plug(struct task_struct *, struct blk_plug *);
+extern void blk_flush_plug_list(struct blk_plug *);
 
 static inline void blk_flush_plug(struct task_struct *tsk)
 {
 	struct blk_plug *plug = tsk->plug;
 
-	if (unlikely(plug))
-		__blk_flush_plug(tsk, plug);
+	if (plug)
+		blk_flush_plug_list(plug);
 }
 
 static inline bool blk_needs_flush_plug(struct task_struct *tsk)


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 16:49                                                     ` Jens Axboe
@ 2011-04-12 16:54                                                       ` hch
  2011-04-12 17:24                                                         ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: hch @ 2011-04-12 16:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: hch, Dave Chinner, NeilBrown, Mike Snitzer, linux-kernel,
	dm-devel, linux-raid

On Tue, Apr 12, 2011 at 06:49:53PM +0200, Jens Axboe wrote:
> I realize that; in fact it's already safe as long as you pass in 'true'
> for __blk_run_queue(). Earlier I had rewritten it to move the running
> out, which makes the trick a little difficult. This afternoon I also
> tested it and saw no noticeable difference, but I'll probably just do it
> anyway as it makes sense.

We still need the lock for __elv_add_request, so we'll need to keep the
logic anyway.  But splitting out the "just queue to kblockd" case from
__blk_run_queue and giving the latter a sane prototype still sounds
like a good idea to me.

Btw, now that we don't call the request_fn directly any more and thus
can't block, can the unplugging be moved into the preempt notifiers?


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 13:31                                                   ` Dave Chinner
  2011-04-12 13:45                                                     ` Jens Axboe
@ 2011-04-12 16:58                                                     ` hch
  2011-04-12 17:29                                                       ` Jens Axboe
  1 sibling, 1 reply; 152+ messages in thread
From: hch @ 2011-04-12 16:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, hch, NeilBrown, Mike Snitzer, linux-kernel, dm-devel,
	linux-raid

On Tue, Apr 12, 2011 at 11:31:17PM +1000, Dave Chinner wrote:
> I don't think so. e.g. in the XFS allocation path we do btree block
> readahead, then go do the real work. The real work can end up with a
> deeper stack before blocking on locks or completions unrelated to
> the readahead, leading to schedule() being called and an unplug
> being issued at that point.  You might think it contrived, but if
> you can't provide a guarantee that it can't happen then I have to
> assume it will happen.

In addition to the stack issue, which is a killer on its own, this also
has latency implications.  Before, we could submit a synchronous
metadata read request inside readpage or writepage and kick it off to
the disk immediately, while now it won't get submitted until we block
the next time, i.e. after we have done some more work that could have
overlapped with the I/O running in the background.  With the kblockd
offload not only have we spent more time, but at the point where we
finally kick it we also need another context switch.  It seems like we
really need to go through the filesystems and explicitly flush the
plugging queue for such cases.  In fact a bio flag marking things as
synchronous metadata reads would help, but then again we need to clean
up our existing bio flags first..



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 16:54                                                       ` hch
@ 2011-04-12 17:24                                                         ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-12 17:24 UTC (permalink / raw)
  To: hch
  Cc: Dave Chinner, NeilBrown, Mike Snitzer, linux-kernel, dm-devel,
	linux-raid

On 2011-04-12 18:54, hch@infradead.org wrote:
> On Tue, Apr 12, 2011 at 06:49:53PM +0200, Jens Axboe wrote:
>> I realize that; in fact it's already safe as long as you pass in 'true'
>> for __blk_run_queue(). Earlier I had rewritten it to move the running
>> out, which makes the trick a little difficult. This afternoon I also
>> tested it and saw no noticeable difference, but I'll probably just do it
>> anyway as it makes sense.
> 
> We still need the lock for __elv_add_request, so we'll need to keep the
> logic anyway.  But splitting out the "just queue to kblockd" case from
> __blk_run_queue and giving the latter a sane prototype still sounds
> like a good idea to me.
> 
> Btw, now that we don't call the request_fn directly any more and thus
> can't block, can the unplugging be moved into the preempt notifiers?

That was only partly the reason; there's still the need to notice
preempt (as opposed to schedule) and the runqueue lock problem. And if
we allow preempt, then we need to disable preemption around all the
plug logic.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 16:58                                                     ` hch
@ 2011-04-12 17:29                                                       ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-12 17:29 UTC (permalink / raw)
  To: hch
  Cc: Dave Chinner, NeilBrown, Mike Snitzer, linux-kernel, dm-devel,
	linux-raid

On 2011-04-12 18:58, hch@infradead.org wrote:
> On Tue, Apr 12, 2011 at 11:31:17PM +1000, Dave Chinner wrote:
>> I don't think so. e.g. in the XFS allocation path we do btree block
>> readahead, then go do the real work. The real work can end up with a
>> deeper stack before blocking on locks or completions unrelated to
>> the readahead, leading to schedule() being called and an unplug
>> being issued at that point.  You might think it contrived, but if
>> you can't provide a guarantee that it can't happen then I have to
>> assume it will happen.
> 
> In addition to the stack issue, which is a killer on its own, this also
> has latency implications.  Before, we could submit a synchronous
> metadata read request inside readpage or writepage and kick it off to
> the disk immediately, while now it won't get submitted until we block
> the next time, i.e. after we have done some more work that could have
> overlapped with the I/O running in the background.  With the kblockd
> offload not only have we spent more time, but at the point where we
> finally kick it we also need another context switch.  It seems like we
> really need to go through the filesystems and explicitly flush the
> plugging queue for such cases.  In fact a bio flag marking things as
> synchronous metadata reads would help, but then again we need to clean
> up our existing bio flags first..

I think it would be a good idea to audit the SYNC cases, and if feasible
let those retain the 'immediate kick off' logic. If not, have some way
to signal it at least. Essentially, allow some fine grained control of
what goes into the plug and what does not.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 14:34                                                       ` Dave Chinner
@ 2011-04-12 21:08                                                         ` NeilBrown
  2011-04-13  2:23                                                           ` Linus Torvalds
  0 siblings, 1 reply; 152+ messages in thread
From: NeilBrown @ 2011-04-12 21:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, hch, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On Wed, 13 Apr 2011 00:34:52 +1000 Dave Chinner <david@fromorbit.com> wrote:

> On Tue, Apr 12, 2011 at 03:45:52PM +0200, Jens Axboe wrote:
> Not until you
> > schedule() or call blk_finish_plug(), both of which are events that you
> > can control.
> 
> Well, not really - now taking any sleeping lock or waiting on
> anything can trigger a plug flush where previously you had to
> explicitly issue them. I'm not saying what we had is better, just
> that there are implicit flushes with your changes that are
> inherently uncontrollable...

It's not just sleeping locks - if preempt is enabled a schedule can happen at
any time - at any depth.  I've seen a spin_unlock do it.

NeilBrown


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 13:48                                                 ` Jens Axboe
@ 2011-04-12 23:35                                                   ` Dave Chinner
  0 siblings, 0 replies; 152+ messages in thread
From: Dave Chinner @ 2011-04-12 23:35 UTC (permalink / raw)
  To: Jens Axboe
  Cc: hch, NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On Tue, Apr 12, 2011 at 03:48:10PM +0200, Jens Axboe wrote:
> On 2011-04-12 15:40, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
> >> On 2011-04-12 14:22, Dave Chinner wrote:
> >>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> >>>> On 2011-04-12 03:12, hch@infradead.org wrote:
> >>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >>>>>    function calls.
> >>>>>  - Why is having a plug in blk_flush_plug marked unlikely?  Note that
> >>>>>    unlikely is the static branch prediction hint to mark the case
> >>>>>    extremly unlikely and is even used for hot/cold partitioning.  But
> >>>>>    when we call it we usually check beforehand if we actually have
> >>>>>    plugs, so it's actually likely to happen.
> >>>>
> >>>> The existence and out-of-line call is for the schedule() hook. It should be
> >>>> an unlikely event to schedule with a plug held, normally the plug should
> >>>> have been explicitly unplugged before that happens.
> >>>
> >>> Though if it does, haven't you just added a significant amount of
> >>> depth to the worst case stack usage? I'm seeing this sort of thing
> >>> from io_schedule():
> >>>
> >>>         Depth    Size   Location    (40 entries)
> >>>         -----    ----   --------
> >>>   0)     4256      16   mempool_alloc_slab+0x15/0x20
> >>>   1)     4240     144   mempool_alloc+0x63/0x160
> >>>   2)     4096      16   scsi_sg_alloc+0x4c/0x60
> >>>   3)     4080     112   __sg_alloc_table+0x66/0x140
> >>>   4)     3968      32   scsi_init_sgtable+0x33/0x90
> >>>   5)     3936      48   scsi_init_io+0x31/0xc0
> >>>   6)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
> >>>   7)     3856     112   sd_prep_fn+0x150/0xa90
> >>>   8)     3744      48   blk_peek_request+0x6a/0x1f0
> >>>   9)     3696      96   scsi_request_fn+0x60/0x510
> >>>  10)     3600      32   __blk_run_queue+0x57/0x100
> >>>  11)     3568      80   flush_plug_list+0x133/0x1d0
> >>>  12)     3488      32   __blk_flush_plug+0x24/0x50
> >>>  13)     3456      32   io_schedule+0x79/0x80
> >>>
> >>> (This is from a page fault on ext3 that is doing page cache
> >>> readahead and blocking on a locked buffer.)
> > 
> > FYI, the next step in the allocation chain adds >900 bytes to that
> > stack:
> > 
> > $ cat /sys/kernel/debug/tracing/stack_trace
> >         Depth    Size   Location    (47 entries)
> >         -----    ----   --------
> >   0)     5176      40   zone_statistics+0xad/0xc0
> >   1)     5136     288   get_page_from_freelist+0x2cf/0x840
> >   2)     4848     304   __alloc_pages_nodemask+0x121/0x930
> >   3)     4544      48   kmem_getpages+0x62/0x160
> >   4)     4496      96   cache_grow+0x308/0x330
> >   5)     4400      80   cache_alloc_refill+0x21c/0x260
> >   6)     4320      64   kmem_cache_alloc+0x1b7/0x1e0
> >   7)     4256      16   mempool_alloc_slab+0x15/0x20
> >   8)     4240     144   mempool_alloc+0x63/0x160
> >   9)     4096      16   scsi_sg_alloc+0x4c/0x60
> >  10)     4080     112   __sg_alloc_table+0x66/0x140
> >  11)     3968      32   scsi_init_sgtable+0x33/0x90
> >  12)     3936      48   scsi_init_io+0x31/0xc0
> >  13)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
> >  14)     3856     112   sd_prep_fn+0x150/0xa90
> >  15)     3744      48   blk_peek_request+0x6a/0x1f0
> >  16)     3696      96   scsi_request_fn+0x60/0x510
> >  17)     3600      32   __blk_run_queue+0x57/0x100
> >  18)     3568      80   flush_plug_list+0x133/0x1d0
> >  19)     3488      32   __blk_flush_plug+0x24/0x50
> >  20)     3456      32   io_schedule+0x79/0x80
> > 
> > That's close to 1800 bytes now, and that's not entering the reclaim
> > path. If i get one deeper than that, I'll be sure to post it. :)
> 
> Do you have traces from 2.6.38, or are you just doing them now?

I do stack checks like this all the time. I generally don't keep
them around, just pay attention to the path and depth. ext3 is used
for / on my test VMs, and has never shown up as the worst case stack
usage when running xfstests. As of the block plugging code, this
trace is the top stack user for the first ~130 tests, and often for
the entire test run on XFS....

> The path you quote above should not go into reclaim, it's a GFP_ATOMIC
> allocation.

Right. I'm still trying to produce a trace that shows more stack
usage in the block layer. It's random chance as to what pops up most
of the time. However, some of the stacks that are showing up in
2.6.39 are quite different from any I've ever seen before...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 21:08                                                         ` NeilBrown
@ 2011-04-13  2:23                                                           ` Linus Torvalds
  2011-04-13 11:12                                                             ` Peter Zijlstra
  0 siblings, 1 reply; 152+ messages in thread
From: Linus Torvalds @ 2011-04-13  2:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Dave Chinner, Jens Axboe, hch, Mike Snitzer, linux-kernel,
	dm-devel, linux-raid


On Tue, Apr 12, 2011 at 2:08 PM, NeilBrown <neilb@suse.de> wrote:
> On Wed, 13 Apr 2011 00:34:52 +1000 Dave Chinner <david@fromorbit.com> wrote:
>>
>> Well, not really - now taking any sleeping lock or waiting on
>> anything can trigger a plug flush where previously you had to
>> explicitly issue them. I'm not saying what we had is better, just
>> that there are implicit flushes with your changes that are
>> inherently uncontrollable...
>
> It's not just sleeping locks - if preempt is enabled a schedule can happen at
> any time - at any depth.  I've seen a spin_unlock do it.

Hmm. I don't think we should flush IO in the preemption path. That
smells wrong on many levels, just one of them being the "any time, any
depth".

It also sounds really wrong from an IO pattern standpoint. The process
is actually still running, and the IO flushing _already_ does the
"only if it's going to sleep" test, but it actually does it _wrong_.
The "current->state" check doesn't make sense for a preemption event,
because it's not actually going to sleep there.

So a patch like the attached (UNTESTED!) sounds like the right thing to do.

Whether it makes any difference for any MD issues, who knows.. But
considering that the unplugging already used to test for "prev->state
!= TASK_RUNNING", this is absolutely the right thing to do - that old
test was just broken.

                                   Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1000 bytes --]

 kernel/sched.c |   20 ++++++++++----------
 1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 48013633d792..a187c3fe027b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4111,20 +4111,20 @@ need_resched:
 					try_to_wake_up_local(to_wakeup);
 			}
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
+
+			/*
+			 * If we are going to sleep and we have plugged IO queued, make
+			 * sure to submit it to avoid deadlocks.
+			 */
+			if (blk_needs_flush_plug(prev)) {
+				raw_spin_unlock(&rq->lock);
+				blk_flush_plug(prev);
+				raw_spin_lock(&rq->lock);
+			}
 		}
 		switch_count = &prev->nvcsw;
 	}
 
-	/*
-	 * If we are going to sleep and we have plugged IO queued, make
-	 * sure to submit it to avoid deadlocks.
-	 */
-	if (prev->state != TASK_RUNNING && blk_needs_flush_plug(prev)) {
-		raw_spin_unlock(&rq->lock);
-		blk_flush_plug(prev);
-		raw_spin_lock(&rq->lock);
-	}
-
 	pre_schedule(rq, prev);
 
 	if (unlikely(!rq->nr_running))


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-13  2:23                                                           ` Linus Torvalds
@ 2011-04-13 11:12                                                             ` Peter Zijlstra
  2011-04-13 11:23                                                               ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: Peter Zijlstra @ 2011-04-13 11:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: NeilBrown, Dave Chinner, Jens Axboe, hch, Mike Snitzer,
	linux-kernel, dm-devel, linux-raid

On Tue, 2011-04-12 at 19:23 -0700, Linus Torvalds wrote:
>  kernel/sched.c |   20 ++++++++++----------
>  1 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 48013633d792..a187c3fe027b 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4111,20 +4111,20 @@ need_resched:
>                                         try_to_wake_up_local(to_wakeup);
>                         }
>                         deactivate_task(rq, prev, DEQUEUE_SLEEP);
> +
> +                       /*
> +                        * If we are going to sleep and we have plugged IO queued, make
> +                        * sure to submit it to avoid deadlocks.
> +                        */
> +                       if (blk_needs_flush_plug(prev)) {
> +                               raw_spin_unlock(&rq->lock);
> +                               blk_flush_plug(prev);
> +                               raw_spin_lock(&rq->lock);
> +                       }
>                 }
>                 switch_count = &prev->nvcsw;
>         }
>  
> -       /*
> -        * If we are going to sleep and we have plugged IO queued, make
> -        * sure to submit it to avoid deadlocks.
> -        */
> -       if (prev->state != TASK_RUNNING && blk_needs_flush_plug(prev)) {
> -               raw_spin_unlock(&rq->lock);
> -               blk_flush_plug(prev);
> -               raw_spin_lock(&rq->lock);
> -       }
> -
>         pre_schedule(rq, prev);
>  
>         if (unlikely(!rq->nr_running)) 

Right, that cures the preemption problem. The reason I suggested placing
it where it was is that I'd like to keep all things that release
rq->lock in the middle of schedule() in one place, but I guess we can
cure that with some extra comments.





* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-13 11:12                                                             ` Peter Zijlstra
@ 2011-04-13 11:23                                                               ` Jens Axboe
  2011-04-13 11:41                                                                 ` Peter Zijlstra
  2011-04-13 15:13                                                                 ` Linus Torvalds
  0 siblings, 2 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-13 11:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, NeilBrown, Dave Chinner, hch, Mike Snitzer,
	linux-kernel, dm-devel, linux-raid

On 2011-04-13 13:12, Peter Zijlstra wrote:
> On Tue, 2011-04-12 at 19:23 -0700, Linus Torvalds wrote:
>>  kernel/sched.c |   20 ++++++++++----------
>>  1 files changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 48013633d792..a187c3fe027b 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -4111,20 +4111,20 @@ need_resched:
>>                                         try_to_wake_up_local(to_wakeup);
>>                         }
>>                         deactivate_task(rq, prev, DEQUEUE_SLEEP);
>> +
>> +                       /*
>> +                        * If we are going to sleep and we have plugged IO queued, make
>> +                        * sure to submit it to avoid deadlocks.
>> +                        */
>> +                       if (blk_needs_flush_plug(prev)) {
>> +                               raw_spin_unlock(&rq->lock);
>> +                               blk_flush_plug(prev);
>> +                               raw_spin_lock(&rq->lock);
>> +                       }
>>                 }
>>                 switch_count = &prev->nvcsw;
>>         }
>>  
>> -       /*
>> -        * If we are going to sleep and we have plugged IO queued, make
>> -        * sure to submit it to avoid deadlocks.
>> -        */
>> -       if (prev->state != TASK_RUNNING && blk_needs_flush_plug(prev)) {
>> -               raw_spin_unlock(&rq->lock);
>> -               blk_flush_plug(prev);
>> -               raw_spin_lock(&rq->lock);
>> -       }
>> -
>>         pre_schedule(rq, prev);
>>  
>>         if (unlikely(!rq->nr_running)) 
> 
> Right, that cures the preemption problem. The reason I suggested placing
> it where it was is that I'd like to keep all things that release
> rq->lock in the middle of schedule() in one place, but I guess we can
> cure that with some extra comments.

We definitely only want to do it on going to sleep, not preempt events.
So if you are fine with this change, then let's please do that.

Linus, I've got a few other things queued up in the area, I'll add this
and send them off soon. Or feel free to add this one yourself, since you
already did it.


-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-13 11:23                                                               ` Jens Axboe
@ 2011-04-13 11:41                                                                 ` Peter Zijlstra
  2011-04-13 15:13                                                                 ` Linus Torvalds
  1 sibling, 0 replies; 152+ messages in thread
From: Peter Zijlstra @ 2011-04-13 11:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, NeilBrown, Dave Chinner, hch, Mike Snitzer,
	linux-kernel, dm-devel, linux-raid

On Wed, 2011-04-13 at 13:23 +0200, Jens Axboe wrote:
> We definitely only want to do it on going to sleep, not preempt events.
> So if you are fine with this change, then lets please do that.

Here's the Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>, that goes
with it ;-)

> Linus, I've got a few other things queued up in the area, I'll add this
> and send them off soon. Or feel free to add this one yourself, since you
> already did it. 

Right, please send it onwards or have Linus commit it himself and I'll
cook up a patch clarifying the rq->lock'ing mess around there.



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-13 11:23                                                               ` Jens Axboe
  2011-04-13 11:41                                                                 ` Peter Zijlstra
@ 2011-04-13 15:13                                                                 ` Linus Torvalds
  2011-04-13 17:35                                                                   ` Jens Axboe
  1 sibling, 1 reply; 152+ messages in thread
From: Linus Torvalds @ 2011-04-13 15:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Peter Zijlstra, NeilBrown, Dave Chinner, hch, Mike Snitzer,
	linux-kernel, dm-devel, linux-raid

On Wed, Apr 13, 2011 at 4:23 AM, Jens Axboe <jaxboe@fusionio.com> wrote:
>
> Linus, I've got a few other things queued up in the area, I'll add this
> and send them off soon. Or feel free to add this one yourself, since you
> already did it.

Ok, I committed it with Peter's and your acks.

And if you already put it in your git tree too, git will merge it.

                    Linus


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-13 15:13                                                                 ` Linus Torvalds
@ 2011-04-13 17:35                                                                   ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-13 17:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, NeilBrown, Dave Chinner, hch, Mike Snitzer,
	linux-kernel, dm-devel, linux-raid

On 2011-04-13 17:13, Linus Torvalds wrote:
> On Wed, Apr 13, 2011 at 4:23 AM, Jens Axboe <jaxboe@fusionio.com> wrote:
>>
>> Linus, I've got a few other things queued up in the area, I'll add this
>> and send them off soon. Or feel free to add this one yourself, since you
>> already did it.
> 
> Ok, I committed it with Peter's and your acks.

Great, thanks.

> And if you already put it in your git tree too, git will merge it.

I did not; I had a feeling you'd merge this one.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 12:11                                 ` Jens Axboe
  2011-04-11 12:36                                   ` NeilBrown
@ 2011-04-15  4:26                                   ` hch
  2011-04-15  6:34                                     ` Jens Axboe
  2011-04-17 22:19                                   ` NeilBrown
  2 siblings, 1 reply; 152+ messages in thread
From: hch @ 2011-04-15  4:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: NeilBrown, Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

Btw, "block: move queue run on unplug to kblockd" currently moves
the __blk_run_queue call to kblockd unconditionally.  But
I'm not sure that's correct - if we do an explicit blk_finish_plug
there's no point in forcing the context switch.


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-15  4:26                                   ` hch
@ 2011-04-15  6:34                                     ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-15  6:34 UTC (permalink / raw)
  To: hch; +Cc: NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On 2011-04-15 06:26, hch@infradead.org wrote:
> Btw, "block: move queue run on unplug to kblockd" currently moves
> the __blk_run_queue call to kblockd unconditionally.  But
> I'm not sure that's correct - if we do an explicit blk_finish_plug
> there's no point in forcing the context switch.

It's correct, but yes, it's not optimal for the explicit unplug. Well, I
think it really depends - for the single sync case, it's not ideal to
punt to kblockd. But if you have a bunch of threads doing IO, you
probably DO want to punt it to kblockd to avoid too many threads
hammering on the queue lock at the same time. It would need testing to be
sure; the below would be a way to accomplish that.


diff --git a/block/blk-core.c b/block/blk-core.c
index b598fa7..995e995 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2662,16 +2662,16 @@ static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
 	return !(rqa->q <= rqb->q);
 }
 
-static void queue_unplugged(struct request_queue *q, unsigned int depth)
+static void queue_unplugged(struct request_queue *q, unsigned int depth, bool run_from_wq)
 {
 	trace_block_unplug_io(q, depth);
-	__blk_run_queue(q, true);
+	__blk_run_queue(q, run_from_wq);
 
 	if (q->unplugged_fn)
 		q->unplugged_fn(q);
 }
 
-void blk_flush_plug_list(struct blk_plug *plug)
+void blk_flush_plug_list(struct blk_plug *plug, bool run_from_wq)
 {
 	struct request_queue *q;
 	unsigned long flags;
@@ -2706,7 +2706,7 @@ void blk_flush_plug_list(struct blk_plug *plug)
 		BUG_ON(!rq->q);
 		if (rq->q != q) {
 			if (q) {
-				queue_unplugged(q, depth);
+				queue_unplugged(q, depth, run_from_wq);
 				spin_unlock(q->queue_lock);
 			}
 			q = rq->q;
@@ -2727,7 +2727,7 @@ void blk_flush_plug_list(struct blk_plug *plug)
 	}
 
 	if (q) {
-		queue_unplugged(q, depth);
+		queue_unplugged(q, depth, run_from_wq);
 		spin_unlock(q->queue_lock);
 	}
 
@@ -2737,7 +2737,7 @@ EXPORT_SYMBOL(blk_flush_plug_list);
 
 void blk_finish_plug(struct blk_plug *plug)
 {
-	blk_flush_plug_list(plug);
+	blk_flush_plug_list(plug, false);
 
 	if (plug == current->plug)
 		current->plug = NULL;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ffe48ff..1c76506 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -865,14 +865,14 @@ struct blk_plug {
 
 extern void blk_start_plug(struct blk_plug *);
 extern void blk_finish_plug(struct blk_plug *);
-extern void blk_flush_plug_list(struct blk_plug *);
+extern void blk_flush_plug_list(struct blk_plug *, bool);
 
 static inline void blk_flush_plug(struct task_struct *tsk)
 {
 	struct blk_plug *plug = tsk->plug;
 
 	if (plug)
-		blk_flush_plug_list(plug);
+		blk_flush_plug_list(plug, true);
 }
 
 static inline bool blk_needs_flush_plug(struct task_struct *tsk)


-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 12:11                                 ` Jens Axboe
  2011-04-11 12:36                                   ` NeilBrown
  2011-04-15  4:26                                   ` hch
@ 2011-04-17 22:19                                   ` NeilBrown
  2011-04-18  4:19                                     ` NeilBrown
  2011-04-18  6:38                                     ` Jens Axboe
  2 siblings, 2 replies; 152+ messages in thread
From: NeilBrown @ 2011-04-17 22:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

> > Yes.  But I need to know when to release the requests that I have stored.
> > I need to know when ->write_pages or ->read_pages or whatever has finished
> > submitting a pile of pages so that I can start processing the request that I
> > have put aside.  So I need a callback from blk_finish_plug.
> 
> OK fair enough, I'll add your callback patch.
> 

But you didn't, did you?  You added a completely different patch which is
completely pointless.
If you don't like my patch, I would really prefer you said so rather than
silently replacing it with something completely different (and broken).

I'll try to explain again.

md does not use __make_request.  At all.
md does not use 'struct request'.  At all.

The 'list' in 'struct blk_plug' is a list of 'struct request'.

Therefore md cannot put anything useful on the list in 'struct blk_plug'.

So when blk_flush_plug_list calls queue_unplugged() on a queue that belonged
to a request found on the blk_plug list, that queue cannot possibly ever be
for an 'md' device (because no 'struct request' ever belongs to an md device,
because md does not use 'struct request').

So your patch (commit f75664570d8b) doesn't help MD at all.

For md, I need to attach something to blk_plug which somehow identifies an md
device, so that blk_finish_plug can get to that device and let it unplug.
The most sensible thing to have is a completely generic callback.  That way
different block devices (which choose not to use __make_request) can attach
different sorts of things to blk_plug.

So can we please have my original patch applied? (Revised version using
list_splice_init included below).

Or if not, a clear explanation of why not?

Thanks,
NeilBrown

>From 6a2aa888b855fd298c174bcee130cf43db0b3f7b Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Mon, 18 Apr 2011 08:15:45 +1000
Subject: [PATCH] Enhance new plugging support to support general callbacks.

md/raid requires an unplug callback, but as it does not use
requests, the current code cannot provide one.

So allow arbitrary callbacks to be attached to the blk_plug.

Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 block/blk-core.c       |   20 ++++++++++++++++++++
 include/linux/blkdev.h |    7 ++++++-
 2 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 78b7b0c..c2b8006 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2638,6 +2638,7 @@ void blk_start_plug(struct blk_plug *plug)
 
 	plug->magic = PLUG_MAGIC;
 	INIT_LIST_HEAD(&plug->list);
+	INIT_LIST_HEAD(&plug->cb_list);
 	plug->should_sort = 0;
 
 	/*
@@ -2742,9 +2743,28 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 }
 EXPORT_SYMBOL(blk_flush_plug_list);
 
+static void flush_plug_callbacks(struct blk_plug *plug)
+{
+	LIST_HEAD(callbacks);
+
+	if (list_empty(&plug->cb_list))
+		return;
+
+	list_splice_init(&plug->cb_list, &callbacks);
+
+	while (!list_empty(&callbacks)) {
+		struct blk_plug_cb *cb = list_first_entry(&callbacks,
+							  struct blk_plug_cb,
+							  list);
+		list_del(&cb->list);
+		cb->callback(cb);
+	}
+}
+
 void blk_finish_plug(struct blk_plug *plug)
 {
 	blk_flush_plug_list(plug, false);
+	flush_plug_callbacks(plug);
 
 	if (plug == current->plug)
 		current->plug = NULL;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ec0357d..f3f7879 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -860,8 +860,13 @@ extern void blk_put_queue(struct request_queue *);
 struct blk_plug {
 	unsigned long magic;
 	struct list_head list;
+	struct list_head cb_list;
 	unsigned int should_sort;
 };
+struct blk_plug_cb {
+	struct list_head list;
+	void (*callback)(struct blk_plug_cb *);
+};
 
 extern void blk_start_plug(struct blk_plug *);
 extern void blk_finish_plug(struct blk_plug *);
@@ -887,7 +892,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 {
 	struct blk_plug *plug = tsk->plug;
 
-	return plug && !list_empty(&plug->list);
+	return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
 }
 
 /*
-- 
1.7.3.4



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-17 22:19                                   ` NeilBrown
@ 2011-04-18  4:19                                     ` NeilBrown
  2011-04-18  6:38                                     ` Jens Axboe
  1 sibling, 0 replies; 152+ messages in thread
From: NeilBrown @ 2011-04-18  4:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On Mon, 18 Apr 2011 08:19:22 +1000 NeilBrown <neilb@suse.de> wrote:

> So can we please have my original patch applied? (Revised version using
> list_splice_init included below).

I hadn't adjusted that one properly for the recent code shuffling.
This one is actually tested...

Thanks,
NeilBrown

>From 325b1c12b6165002022bd7b599f95c0331491cb3 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Mon, 18 Apr 2011 14:06:05 +1000
Subject: [PATCH] Enhance new plugging support to support general callbacks.

md/raid requires an unplug callback, but as it does not use
requests, the current code cannot provide one.

So allow arbitrary callbacks to be attached to the blk_plug.

Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 block/blk-core.c       |   20 ++++++++++++++++++++
 include/linux/blkdev.h |    7 ++++++-
 2 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 78b7b0c..77edf05 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2638,6 +2638,7 @@ void blk_start_plug(struct blk_plug *plug)
 
 	plug->magic = PLUG_MAGIC;
 	INIT_LIST_HEAD(&plug->list);
+	INIT_LIST_HEAD(&plug->cb_list);
 	plug->should_sort = 0;
 
 	/*
@@ -2678,6 +2679,24 @@ static void queue_unplugged(struct request_queue *q, unsigned int depth,
 		q->unplugged_fn(q);
 }
 
+static void flush_plug_callbacks(struct blk_plug *plug)
+{
+	LIST_HEAD(callbacks);
+
+	if (list_empty(&plug->cb_list))
+		return;
+
+	list_splice_init(&plug->cb_list, &callbacks);
+
+	while (!list_empty(&callbacks)) {
+		struct blk_plug_cb *cb = list_first_entry(&callbacks,
+							  struct blk_plug_cb,
+							  list);
+		list_del(&cb->list);
+		cb->callback(cb);
+	}
+}
+
 void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 {
 	struct request_queue *q;
@@ -2688,6 +2707,7 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 
 	BUG_ON(plug->magic != PLUG_MAGIC);
 
+	flush_plug_callbacks(plug);
 	if (list_empty(&plug->list))
 		return;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ec0357d..f3f7879 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -860,8 +860,13 @@ extern void blk_put_queue(struct request_queue *);
 struct blk_plug {
 	unsigned long magic;
 	struct list_head list;
+	struct list_head cb_list;
 	unsigned int should_sort;
 };
+struct blk_plug_cb {
+	struct list_head list;
+	void (*callback)(struct blk_plug_cb *);
+};
 
 extern void blk_start_plug(struct blk_plug *);
 extern void blk_finish_plug(struct blk_plug *);
@@ -887,7 +892,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 {
 	struct blk_plug *plug = tsk->plug;
 
-	return plug && !list_empty(&plug->list);
+	return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
 }
 
 /*
-- 
1.7.3.4



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-17 22:19                                   ` NeilBrown
  2011-04-18  4:19                                     ` NeilBrown
@ 2011-04-18  6:38                                     ` Jens Axboe
  2011-04-18  7:25                                       ` NeilBrown
  1 sibling, 1 reply; 152+ messages in thread
From: Jens Axboe @ 2011-04-18  6:38 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On 2011-04-18 00:19, NeilBrown wrote:
> On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
> 
>>> Yes.  But I need to know when to release the requests that I have stored.
>>> I need to know when ->write_pages or ->read_pages or whatever has finished
>>> submitting a pile of pages so that I can start processing the request that I
>>> have put aside.  So I need a callback from blk_finish_plug.
>>
>> OK fair enough, I'll add your callback patch.
>>
> 
> But you didn't, did you?  You added a completely different patch which is
> completely pointless.
> If you don't like my patch, I would really prefer you said so rather than
> silently replacing it with something completely different (and broken).

First of all, you were CC'ed on all that discussion, yet didn't speak up
until now. This was last week. Secondly, please change your tone.

> I'll try to explain again.
> 
> md does not use __make_request.  At all.
> md does not use 'struct request'.  At all.
> 
> The 'list' in 'struct blk_plug' is a list of 'struct request'.

I'm well aware of these facts, but thanks for bringing it up.

> Therefore md cannot put anything useful on the list in 'struct blk_plug'.
> 
> So when blk_flush_plug_list calls queue_unplugged() on a queue that belonged
> to a request found on the blk_plug list, that queue cannot possibly ever be
> for an 'md' device (because no 'struct request' ever belongs to an md device,
> because md does not use 'struct request').
> 
> So your patch (commit f75664570d8b) doesn't help MD at all.
> 
> For md, I need to attach something to blk_plug which somehow identifies an md
> device, so that blk_finish_plug can get to that device and let it unplug.
> The most sensible thing to have is a completely generic callback.  That way
> different block devices (which choose not to use __make_request) can attach
> different sorts of things to blk_plug.
> 
> So can we please have my original patch applied? (Revised version using
> list_splice_init included below).
> 
> Or if not, a clear explanation of why not?

So correct me if I'm wrong here, but the _only_ real difference between
this patch and the current code in the tree is the checking of the
callback list indicating a need to flush the callbacks. And that's
definitely an oversight. It should be functionally equivalent if md
would just flag this need to get a callback: instead of queueing a
callback on the list, just set plug->need_unplug from md, and have
blk_needs_flush_plug() do:

        return plug && (!list_empty(&plug->list) || plug->need_unplug);

instead. Something like the below, completely untested.


diff --git a/block/blk-core.c b/block/blk-core.c
index 78b7b0c..e1f5635 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1305,12 +1305,12 @@ get_rq:
 		 */
 		if (list_empty(&plug->list))
 			trace_block_plug(q);
-		else if (!plug->should_sort) {
+		else if (!(plug->flags & BLK_PLUG_F_SORT)) {
 			struct request *__rq;
 
 			__rq = list_entry_rq(plug->list.prev);
 			if (__rq->q != q)
-				plug->should_sort = 1;
+				plug->flags |= BLK_PLUG_F_SORT;
 		}
 		/*
 		 * Debug flag, kill later
@@ -2638,7 +2638,7 @@ void blk_start_plug(struct blk_plug *plug)
 
 	plug->magic = PLUG_MAGIC;
 	INIT_LIST_HEAD(&plug->list);
-	plug->should_sort = 0;
+	plug->flags = 0;
 
 	/*
 	 * If this is a nested plug, don't actually assign it. It will be
@@ -2693,9 +2693,9 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 
 	list_splice_init(&plug->list, &list);
 
-	if (plug->should_sort) {
+	if (plug->flags & BLK_PLUG_F_SORT) {
 		list_sort(NULL, &list, plug_rq_cmp);
-		plug->should_sort = 0;
+		plug->flags &= ~BLK_PLUG_F_SORT;
 	}
 
 	q = NULL;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ec0357d..1a0b76b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -860,7 +860,12 @@ extern void blk_put_queue(struct request_queue *);
 struct blk_plug {
 	unsigned long magic;
 	struct list_head list;
-	unsigned int should_sort;
+	unsigned int flags;
+};
+
+enum {
+	BLK_PLUG_F_SORT		= 1,
+	BLK_PLUG_F_NEED_UNPLUG	= 2,
 };
 
 extern void blk_start_plug(struct blk_plug *);
@@ -887,7 +892,8 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 {
 	struct blk_plug *plug = tsk->plug;
 
-	return plug && !list_empty(&plug->list);
+	return plug && (!list_empty(&plug->list) ||
+			(plug->flags & BLK_PLUG_F_NEED_UNPLUG));
 }
 
 /*

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-18  6:38                                     ` Jens Axboe
@ 2011-04-18  7:25                                       ` NeilBrown
  2011-04-18  8:10                                         ` Jens Axboe
  0 siblings, 1 reply; 152+ messages in thread
From: NeilBrown @ 2011-04-18  7:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On Mon, 18 Apr 2011 08:38:24 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2011-04-18 00:19, NeilBrown wrote:
> > On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
> > 
> >>> Yes.  But I need to know when to release the requests that I have stored.
> >>> I need to know when ->write_pages or ->read_pages or whatever has finished
> >>> submitting a pile of pages so that I can start processing the request that I
> >>> have put aside.  So I need a callback from blk_finish_plug.
> >>
> >> OK fair enough, I'll add your callback patch.
> >>
> > 
> > But you didn't, did you?  You added a completely different patch which is
> > completely pointless.
> > If you don't like my patch, I would really prefer you said so rather than
> > silently replacing it with something completely different (and broken).
> 
> First of all, you were CC'ed on all that discussion, yet didn't speak up
> until now. This was last week. Secondly, please change your tone.

Yes, I was CC'ed on a discussion.  In that discussion it was never mentioned
that you had completely changed the patch I sent you, and it never contained
the new patch in-line for review.  Nothing that was discussed was
particularly relevant to md's needs, so there was nothing to speak up about.

Yes - there were 'git pull' requests and I could have done a pull myself to
review the code, but there seemed to be no urgency because you had already
agreed to apply my patch.
When I did finally pull the patches (after all the other issues had settled
down and I had time to finish off the RAID side) I found ... what I found.

I apologise for my tone, but I was very frustrated.

> 
> > I'll try to explain again.
> > 
> > md does not use __make_request.  At all.
> > md does not use 'struct request'.  At all.
> > 
> > The 'list' in 'struct blk_plug' is a list of 'struct request'.
> 
> I'm well aware of these facts, but thanks for bringing it up.
> 
> > Therefore md cannot put anything useful on the list in 'struct blk_plug'.
> > 
> > So when blk_flush_plug_list calls queue_unplugged() on a queue that belonged
> > to a request found on the blk_plug list, that queue cannot possibly ever be
> > for an 'md' device (because no 'struct request' ever belongs to an md device,
> > because md does not use 'struct request').
> > 
> > So your patch (commit f75664570d8b) doesn't help MD at all.
> > 
> > For md, I need to attach something to blk_plug which somehow identifies an md
> > device, so that blk_finish_plug can get to that device and let it unplug.
> > The most sensible thing to have is a completely generic callback.  That way
> > different block devices (which choose not to use __make_request) can attach
> > different sorts of things to blk_plug.
> > 
> > So can we please have my original patch applied? (Revised version using
> > list_splice_init included below).
> > 
> > Or if not, a clear explanation of why not?
> 
> So correct me if I'm wrong here, but the _only_ real difference between
> this patch and the current code in the tree is the checking of the
> callback list indicating a need to flush the callbacks. And that's
> definitely an oversight. It should be functionally equivalent if md
> would just flag this need to get a callback: instead of queueing a
> callback on the list, just set plug->need_unplug from md, and have
> blk_needs_flush_plug() do:
> 
>         return plug && (!list_empty(&plug->list) || plug->need_unplug);
> 
> instead. Something like the below, completely untested.
> 

No, that is not the only real difference.

The real difference is that in the current code, md has no way to register
anything with a blk_plug because you can only register a 'struct request' on a
blk_plug, and md doesn't make any use of 'struct request'.

As I said in the Email you quote above:

> > Therefore md cannot put anything useful on the list in 'struct blk_plug'.

That is the heart of the problem.

NeilBrown




* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-18  7:25                                       ` NeilBrown
@ 2011-04-18  8:10                                         ` Jens Axboe
  2011-04-18  8:33                                           ` NeilBrown
  2011-04-18  9:19                                           ` hch
  0 siblings, 2 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-18  8:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On 2011-04-18 09:25, NeilBrown wrote:
> On Mon, 18 Apr 2011 08:38:24 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
> 
>> On 2011-04-18 00:19, NeilBrown wrote:
>>> On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
>>>
>>>>> Yes.  But I need to know when to release the requests that I have stored.
>>>>> I need to know when ->write_pages or ->read_pages or whatever has finished
>>>>> submitting a pile of pages so that I can start processing the request that I
>>>>> have put aside.  So I need a callback from blk_finish_plug.
>>>>
>>>> OK fair enough, I'll add your callback patch.
>>>>
>>>
>>> But you didn't, did you?  You added a completely different patch which is
>>> completely pointless.
>>> If you don't like my patch, I would really prefer you said so rather than
>>> silently replacing it with something completely different (and broken).
>>
>> First of all, you were CC'ed on all that discussion, yet didn't speak up
>> until now. This was last week. Secondly, please change your tone.
> 
> Yes, I was CC'ed on a discussion.  In that discussion it was never mentioned
> that you had completely changed the patch I sent you, and it never contained
> the new patch in-line for review.  Nothing that was discussed was
> particularly relevant to md's needs, so there was nothing to speak up about.
> 
> Yes - there were 'git pull' requests and I could have done a pull myself to
> review the code, but there seemed to be no urgency because you had already
> agreed to apply my patch.
> When I did finally pull the patches (after all the other issues had settled
> down and I had time to finish off the RAID side) I found ... what I found.
> 
> I apologise for my tone, but I was very frustrated.
> 
>>
>>> I'll try to explain again.
>>>
>>> md does not use __make_request.  At all.
>>> md does not use 'struct request'.  At all.
>>>
>>> The 'list' in 'struct blk_plug' is a list of 'struct request'.
>>
>> I'm well aware of these facts, but thanks for bringing it up.
>>
>>> Therefore md cannot put anything useful on the list in 'struct blk_plug'.
>>>
>>> So when blk_flush_plug_list calls queue_unplugged() on a queue that belonged
>>> to a request found on the blk_plug list, that queue cannot possibly ever be
>>> for an 'md' device (because no 'struct request' ever belongs to an md device,
>>> because md does not use 'struct request').
>>>
>>> So your patch (commit f75664570d8b) doesn't help MD at all.
>>>
>>> For md, I need to attach something to blk_plug which somehow identifies an md
>>> device, so that blk_finish_plug can get to that device and let it unplug.
>>> The most sensible thing to have is a completely generic callback.  That way
>>> different block devices (which choose not to use __make_request) can attach
>>> different sorts of things to blk_plug.
>>>
>>> So can we please have my original patch applied? (Revised version using
>>> list_splice_init included below).
>>>
>>> Or if not, a clear explanation of why not?
>>
>> So correct me if I'm wrong here, but the _only_ real difference between
>> this patch and the current code in the tree is the checking of the
>> callback list indicating a need to flush the callbacks. And that's
>> definitely an oversight. It should be functionally equivalent if md
>> would just flag this need to get a callback: instead of queueing a
>> callback on the list, just set plug->need_unplug from md, and have
>> blk_needs_flush_plug() do:
>>
>>         return plug && (!list_empty(&plug->list) || plug->need_unplug);
>>
>> instead. Something like the below, completely untested.
>>
> 
> No, that is not the only real difference.
> 
> The real difference is that in the current code, md has no way to register
> anything with a blk_plug because you can only register a 'struct request' on a
> blk_plug, and md doesn't make any use of 'struct request'.
> 
> As I said in the Email you quote above:
> 
>>> Therefore md cannot put anything useful on the list in 'struct blk_plug'.
> 
> That is the heart of the problem.

Hmm, I don't really see a way to avoid the list in that case. You really
do need some way to queue items; a single callback, flag, or pointer
will not suffice.

I've added the patch and removed the (now) useless ->unplugged_fn
callback. I suggest you base your md changes on top of my for-linus
branch and tell me when you are confident it looks good, then I'll pull
in your MD changes and submit them later today.

OK with you?

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-18  8:10                                         ` Jens Axboe
@ 2011-04-18  8:33                                           ` NeilBrown
  2011-04-18  8:42                                             ` Jens Axboe
                                                               ` (2 more replies)
  2011-04-18  9:19                                           ` hch
  1 sibling, 3 replies; 152+ messages in thread
From: NeilBrown @ 2011-04-18  8:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid



[[NOTE to dm-devel people - one of the patches here removes some
 now-unused code from dm-raid.c plus a declaration from device-mapper.h ]]


On Mon, 18 Apr 2011 10:10:18 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2011-04-18 09:25, NeilBrown wrote:

> >>> Therefore md cannot put anything useful on the list in 'struct blk_plug'.
> > 
> > That is the heart of the problem.
> 
> Hmm, I don't really see a way to avoid the list in that case. You really
> do need some way to queue items; a single callback, flag, or pointer
> will not suffice.
> 
> I've added the patch and removed the (now) useless ->unplugged_fn
> callback. I suggest you base your md changes on top of my for-linus
> branch and tell me when you are confident it looks good, then I'll pull
> in your MD changes and submit them later today.
> 
> OK with you?
> 

Yes, that's perfect.  Thanks.

All of my plugging-related patches are now in a 'for-jens' branch:

The following changes since commit 99e22598e9a8e0a996d69c8c0f6b7027cb57720a:

  block: drop queue lock before calling __blk_run_queue() for kblockd punt (2011-04-18 09:59:55 +0200)

are available in the git repository at:
  git://neil.brown.name/md for-jens

NeilBrown (6):
      md: use new plugging interface for RAID IO.
      md/dm - remove remains of plug_fn callback.
      md - remove old plugging code.
      md: provide generic support for handling unplug callbacks.
      md: incorporate new plugging into raid5.
      md: fix up raid1/raid10 unplugging.

 drivers/md/dm-raid.c          |    8 ----
 drivers/md/md.c               |   87 +++++++++++++++++++++-------------------
 drivers/md/md.h               |   26 ++----------
 drivers/md/raid1.c            |   29 +++++++-------
 drivers/md/raid10.c           |   27 ++++++-------
 drivers/md/raid5.c            |   61 ++++++++++++----------------
 drivers/md/raid5.h            |    2 -
 include/linux/device-mapper.h |    1 -
 8 files changed, 103 insertions(+), 138 deletions(-)


Thanks,
NeilBrown



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-18  8:33                                           ` NeilBrown
@ 2011-04-18  8:42                                             ` Jens Axboe
  2011-04-18 21:23                                             ` hch
  2011-04-18 21:30                                             ` hch
  2 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-18  8:42 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

On 2011-04-18 10:33, NeilBrown wrote:
> 
> 
> [[NOTE to dm-devel people - one of the patches here removes some
>  now-unused code from dm-raid.c plus a declaration from device-mapper.h ]]
> 
> 
> On Mon, 18 Apr 2011 10:10:18 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
> 
>> On 2011-04-18 09:25, NeilBrown wrote:
> 
>>>>> Therefore md cannot put anything useful on the list in 'struct blk_plug'.
>>>
>>> That is the heart of the problem.
>>
>> Hmm, I don't really see a way to avoid the list in that case. You really
>> do need some way to queue items; a single callback, flag, or pointer
>> will not suffice.
>>
>> I've added the patch and removed the (now) useless ->unplugged_fn
>> callback. I suggest you base your md changes on top of my for-linus
>> branch and tell me when you are confident it looks good, then I'll pull
>> in your MD changes and submit them later today.
>>
>> OK with you?
>>
> 
> Yes, that's perfect.  Thanks.
> 
> All of my plugging-related patches are now in a 'for-jens' branch:
> 
> The following changes since commit 99e22598e9a8e0a996d69c8c0f6b7027cb57720a:
> 
>   block: drop queue lock before calling __blk_run_queue() for kblockd punt (2011-04-18 09:59:55 +0200)
> 
> are available in the git repository at:
>   git://neil.brown.name/md for-jens
> 
> NeilBrown (6):
>       md: use new plugging interface for RAID IO.
>       md/dm - remove remains of plug_fn callback.
>       md - remove old plugging code.
>       md: provide generic support for handling unplug callbacks.
>       md: incorporate new plugging into raid5.
>       md: fix up raid1/raid10 unplugging.
> 
>  drivers/md/dm-raid.c          |    8 ----
>  drivers/md/md.c               |   87 +++++++++++++++++++++-------------------
>  drivers/md/md.h               |   26 ++----------
>  drivers/md/raid1.c            |   29 +++++++-------
>  drivers/md/raid10.c           |   27 ++++++-------
>  drivers/md/raid5.c            |   61 ++++++++++++----------------
>  drivers/md/raid5.h            |    2 -
>  include/linux/device-mapper.h |    1 -
>  8 files changed, 103 insertions(+), 138 deletions(-)

Great, thanks a lot Neil! It's pulled in now, will send the request to
Linus today.

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-18  8:10                                         ` Jens Axboe
  2011-04-18  8:33                                           ` NeilBrown
@ 2011-04-18  9:19                                           ` hch
  2011-04-18  9:40                                             ` [dm-devel] " Hannes Reinecke
  2011-04-18  9:46                                             ` Jens Axboe
  1 sibling, 2 replies; 152+ messages in thread
From: hch @ 2011-04-18  9:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: NeilBrown, Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

Btw, I'm really starting to wonder if the request level is the right place
to do this on-stack plugging.  Wouldn't it be better to just plug
bios in the on-stack queue?  That way we could also stop doing the
special case merging when adding to the plug list, and leave all the
merging / I/O schedule logic in the __make_request path.  Probably
not .39 material, but worth a prototype?

Also what this discussion brought up is that the block layer data
structures are highly confusing.  Using a small subset of the
request_queue also for make_request based drivers just doesn't make
sense.  It seems like we should try to migrate the required state
to struct gendisk, and submit I/O through a block_device_ops.submit
method, leaving the request_queue as an internal abstraction for
the request based drivers.



* Re: [dm-devel] [PATCH 05/10] block: remove per-queue plugging
  2011-04-18  9:19                                           ` hch
@ 2011-04-18  9:40                                             ` Hannes Reinecke
  2011-04-18  9:47                                               ` Jens Axboe
  2011-04-18  9:46                                             ` Jens Axboe
  1 sibling, 1 reply; 152+ messages in thread
From: Hannes Reinecke @ 2011-04-18  9:40 UTC (permalink / raw)
  To: device-mapper development
  Cc: hch, Jens Axboe, linux-raid, Mike Snitzer, linux-kernel,
	Alasdair G Kergon

On 04/18/2011 11:19 AM, hch@infradead.org wrote:
> Btw, I'm really starting to wonder if the request level is the right place
> to do this on-stack plugging.  Wouldn't it be better to just plug
> bios in the on-stack queue?  That way we could also stop doing the
> special case merging when adding to the plug list, and leave all the
> merging / I/O schedule logic in the __make_request path.  Probably
> not .39 material, but worth a prototype?
>
> Also what this discussion brought up is that the block layer data
> structures are highly confusing.  Using a small subset of the
> request_queue also for make_request based drivers just doesn't make
> sense.  It seems like we should try to migrate the required state
> to struct gendisk, and submit I/O through a block_device_ops.submit
> method, leaving the request_queue as an internal abstraction for
> the request based drivers.
>
Good point.
It would also help us with the device-mapper redesign agk and myself 
discussed at LSF. Having a block_device_ops.submit function would
allow us to remap the actual request queue generically; and we would 
even be able to address more than one request queue, which sounds 
awfully similar to what Jens is trying to do ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-18  9:19                                           ` hch
  2011-04-18  9:40                                             ` [dm-devel] " Hannes Reinecke
@ 2011-04-18  9:46                                             ` Jens Axboe
  1 sibling, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-18  9:46 UTC (permalink / raw)
  To: hch; +Cc: NeilBrown, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On 2011-04-18 11:19, hch@infradead.org wrote:
> Btw, I'm really starting to wonder if the request level is the right place
> to do this on-stack plugging.  Wouldn't it be better to just plug
> bios in the on-stack queue?  That way we could also stop doing the
> special case merging when adding to the plug list, and leave all the
> merging / I/O schedule logic in the __make_request path.  Probably
> not .39 material, but worth a prototype?
> 
> Also what this discussion brought up is that the block layer data
> structures are highly confusing.  Using a small subset of the
> request_queue also for make_request based drivers just doesn't make
> sense.  It seems like we should try to migrate the required state
> to struct gendisk, and submit I/O through a block_device_ops.submit
> method, leaving the request_queue as an internal abstraction for
> the request based drivers.

Partially agree. I've never really liked the two methods we have, where
the light version was originally meant for stacked devices but
gets used elsewhere now too. It also causes IO scheduling problems, and
then you get things like request based dm to work around that.

But the idea is really to move towards more localized private queueing;
the multiqueue setup will apply well there too. I'm trying to flesh out
the design of that, and ideally it would be nice to unify the different
bits we have now.

But agree on pulling the stacked bits into some lower part, like the
gendisk. It would clean that up nicely.


-- 
Jens Axboe



* Re: [dm-devel] [PATCH 05/10] block: remove per-queue plugging
  2011-04-18  9:40                                             ` [dm-devel] " Hannes Reinecke
@ 2011-04-18  9:47                                               ` Jens Axboe
  0 siblings, 0 replies; 152+ messages in thread
From: Jens Axboe @ 2011-04-18  9:47 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: device-mapper development, hch, linux-raid, Mike Snitzer,
	linux-kernel, Alasdair G Kergon

On 2011-04-18 11:40, Hannes Reinecke wrote:
> On 04/18/2011 11:19 AM, hch@infradead.org wrote:
>> Btw, I'm really starting to wonder if the request level is the right place
>> to do this on-stack plugging.  Wouldn't it be better to just plug
>> bios in the on-stack queue?  That way we could also stop doing the
>> special case merging when adding to the plug list, and leave all the
>> merging / I/O schedule logic in the __make_request path.  Probably
>> not .39 material, but worth a prototype?
>>
>> Also what this discussion brought up is that the block layer data
>> structures are highly confusing.  Using a small subset of the
>> request_queue also for make_request based drivers just doesn't make
>> sense.  It seems like we should try to migrate the required state
>> to struct gendisk, and submit I/O through a block_device_ops.submit
>> method, leaving the request_queue as an internal abstraction for
>> the request based drivers.
>>
> Good point.
> It would also help us we the device-mapper redesign agk and myself 
> discussed at LSF. Having a block_device_ops.submit function would
> allow us remap the actual request queue generically; and we would 
> even be able to address more than one request queue, which sounds 
> awfully similar to what Jens is trying to do ...

The multiqueue bits would still have one request_queue, but multiple
queueing structures (I called those blk_queue_ctx, iirc).

-- 
Jens Axboe



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-18  8:33                                           ` NeilBrown
  2011-04-18  8:42                                             ` Jens Axboe
@ 2011-04-18 21:23                                             ` hch
  2011-04-22 15:39                                               ` hch
  2011-04-18 21:30                                             ` hch
  2 siblings, 1 reply; 152+ messages in thread
From: hch @ 2011-04-18 21:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

> NeilBrown (6):
>       md: use new plugging interface for RAID IO.
>       md/dm - remove remains of plug_fn callback.
>       md - remove old plugging code.
>       md: provide generic support for handling unplug callbacks.
>       md: incorporate new plugging into raid5.
>       md: fix up raid1/raid10 unplugging.

Looking over more of the unplugging leftovers, is there a reason to
keep the unplug_work bits in CFQ?  They seem to rather counter the
current scheme (and it is the last user of kblockd outside of
blk-core.c).



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-18  8:33                                           ` NeilBrown
  2011-04-18  8:42                                             ` Jens Axboe
  2011-04-18 21:23                                             ` hch
@ 2011-04-18 21:30                                             ` hch
  2011-04-18 22:38                                               ` NeilBrown
  2 siblings, 1 reply; 152+ messages in thread
From: hch @ 2011-04-18 21:30 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, Mike Snitzer, linux-kernel, hch, dm-devel, linux-raid

>       md: provide generic support for handling unplug callbacks.

This looks like some horribly ugly code to me.  The real fix is to do
the plugging in the block layers for bios instead of requests.  The
effect should be about the same, except that merging will become a
little easier as all bios will be on the list now when calling into
__make_request or its equivalent, and even better: if we extend the
list sort callback to also sort by the start block, it will actually
simplify the merge algorithm a lot as it only needs to do front merges
and no back merges for the on-stack merging.

In addition it should also allow for much more optimal queue_lock
roundtrips - we can keep it locked at the end of what's currently
__make_request to have it available for the next bio that's been
on the list.  If it either can be merged now that we have the lock
and/or we optimize get_request_wait not to sleep in the fast path
we could get down to a single queue_lock roundtrip for each unplug.



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-18 21:30                                             ` hch
@ 2011-04-18 22:38                                               ` NeilBrown
  2011-04-20 10:55                                                 ` hch
  0 siblings, 1 reply; 152+ messages in thread
From: NeilBrown @ 2011-04-18 22:38 UTC (permalink / raw)
  To: hch; +Cc: Jens Axboe, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On Mon, 18 Apr 2011 17:30:48 -0400 "hch@infradead.org" <hch@infradead.org>
wrote:

> >       md: provide generic support for handling unplug callbacks.
> 
> This looks like some horribly ugly code to me.  The real fix is to do
> the plugging in the block layers for bios instead of requests.  The
> effect should be about the same, except that merging will become a
> little easier as all bios will be on the list now when calling into
> __make_request or its equivalent, and even better: if we extend the
> list sort callback to also sort by the start block, it will actually
> simplify the merge algorithm a lot as it only needs to do front merges
> and no back merges for the on-stack merging.
> 
> In addition it should also allow for much more optimal queue_lock
> roundtrips - we can keep it locked at the end of what's currently
> __make_request to have it available for the next bio that's been
> on the list.  If it either can be merged now that we have the lock
> and/or we optimize get_request_wait not to sleep in the fast path
> we could get down to a single queue_lock roundtrip for each unplug.

Does the following match your thinking?  I'm trying to form a more
concrete understanding...

 - We change the ->make_request_fn interface so that it takes a list of
   bios rather than a single bio - linked on ->bi_next.
   These bios must all have the same ->bi_bdev.  They *might* be sorted
   by bi_sector (that needs to be decided).


 - generic_make_request currently queues bios if there is already an active
   request (this limits recursion).  We enhance this to also queue requests
   when code calls blk_start_plug.
   In effect, generic_make_request becomes:
        if (current->plug)
		blk_add_to_plug(current->plug, bio);
	else {
		struct blk_plug plug;
		blk_start_plug(&plug);
		__generic_make_request(bio);
		blk_finish_plug(&plug);
	}

 - __generic_make_request would sort the list of bios by bi_bdev (and maybe 
   bi_sector) and pass them along to the different ->make_request_fn
   functions.

   As there are likely to be only a few different bi_bdev values (often 1) but
   hopefully lots and lots of bios it might be more efficient to do a linear
   bucket sort based on bi_bdev, and only sort those buckets on bi_sector if
   required.

Then make_request_fn handlers can expect to get lots of bios at once, can
optimise their handling as seems appropriate, and not require any further
plugging.


Is that at all close to what you are thinking?

NeilBrown


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-18 22:38                                               ` NeilBrown
@ 2011-04-20 10:55                                                 ` hch
  0 siblings, 0 replies; 152+ messages in thread
From: hch @ 2011-04-20 10:55 UTC (permalink / raw)
  To: NeilBrown
  Cc: hch, Jens Axboe, Mike Snitzer, linux-kernel, dm-devel, linux-raid

On Tue, Apr 19, 2011 at 08:38:13AM +1000, NeilBrown wrote:
> Is that at all close to what you are thinking?

Yes, pretty much like that.



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-18 21:23                                             ` hch
@ 2011-04-22 15:39                                               ` hch
  2011-04-22 16:01                                                 ` Vivek Goyal
  0 siblings, 1 reply; 152+ messages in thread
From: hch @ 2011-04-22 15:39 UTC (permalink / raw)
  Cc: Jens Axboe, vgoyal, linux-kernel

On Mon, Apr 18, 2011 at 05:23:06PM -0400, hch@infradead.org wrote:
> > NeilBrown (6):
> >       md: use new plugging interface for RAID IO.
> >       md/dm - remove remains of plug_fn callback.
> >       md - remove old plugging code.
> >       md: provide generic support for handling unplug callbacks.
> >       md: incorporate new plugging into raid5.
> >       md: fix up raid1/raid10 unplugging.
> 
> Looking over more of the unplugging leftovers, is there a reason to
> keep the unplug_work bits in CFQ?  They seem to rather counter the
> current scheme (and it is the last user of kblockd outside of
> blk-core.c).

Jens, Vivek:

can you take a look at if cfq_schedule_dispatch is still needed in
new unplugging world order?  It's the only kblockd user outside the
block core that's still left, and it seems rather odd to me at least.



* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-22 15:39                                               ` hch
@ 2011-04-22 16:01                                                 ` Vivek Goyal
  2011-04-22 16:10                                                   ` Vivek Goyal
  0 siblings, 1 reply; 152+ messages in thread
From: Vivek Goyal @ 2011-04-22 16:01 UTC (permalink / raw)
  To: hch; +Cc: Jens Axboe, linux-kernel

On Fri, Apr 22, 2011 at 11:39:08AM -0400, hch@infradead.org wrote:
> On Mon, Apr 18, 2011 at 05:23:06PM -0400, hch@infradead.org wrote:
> > > NeilBrown (6):
> > >       md: use new plugging interface for RAID IO.
> > >       md/dm - remove remains of plug_fn callback.
> > >       md - remove old plugging code.
> > >       md: provide generic support for handling unplug callbacks.
> > >       md: incorporate new plugging into raid5.
> > >       md: fix up raid1/raid10 unplugging.
> > 
> > Looking over more of the unplugging leftovers, is there a reason to
> > keep the unplug_work bits in CFQ?  They seem to rather counter the
> > current scheme (and it is the last user of kblockd outside of
> > blk-core.c).
> 
> Jens, Vivek:
> 
> can you take a look at if cfq_schedule_dispatch is still needed in
> new unplugging world order?  It's the only kblockd user outside the
> block core that's still left, and it seems rather odd to me at least.

I guess cfq_schedule_dispatch() will still be required. One use case is
that CFQ might not dispatch requests to the driver even if it has one
(idling on a cfqq), and once the timer fires it still needs to be able to
kick the queue and dispatch requests.

To me this sounds independent of the plugging logic. Or am I missing something?

Thanks
Vivek


* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-22 16:01                                                 ` Vivek Goyal
@ 2011-04-22 16:10                                                   ` Vivek Goyal
  0 siblings, 0 replies; 152+ messages in thread
From: Vivek Goyal @ 2011-04-22 16:10 UTC (permalink / raw)
  To: hch; +Cc: Jens Axboe, linux-kernel

On Fri, Apr 22, 2011 at 12:01:10PM -0400, Vivek Goyal wrote:
> On Fri, Apr 22, 2011 at 11:39:08AM -0400, hch@infradead.org wrote:
> > On Mon, Apr 18, 2011 at 05:23:06PM -0400, hch@infradead.org wrote:
> > > > NeilBrown (6):
> > > >       md: use new plugging interface for RAID IO.
> > > >       md/dm - remove remains of plug_fn callback.
> > > >       md - remove old plugging code.
> > > >       md: provide generic support for handling unplug callbacks.
> > > >       md: incorporate new plugging into raid5.
> > > >       md: fix up raid1/raid10 unplugging.
> > > 
> > > Looking over more of the unplugging leftovers, is there a reason to
> > > keep the unplug_work bits in CFQ?  They seem to rather counter the
> > > current scheme (and it is the last user of kblockd outside of
> > > blk-core.c).
> > 
> > Jens, Vivek:
> > 
> > can you take a look at if cfq_schedule_dispatch is still needed in
> > new unplugging world order?  It's the only kblockd user outside the
> > block core that's still left, and it seems rather odd to me at least.
> 
> I guess cfq_schedule_dispatch() will still be required. One use case is
> that CFQ might not dispatch requests to the driver even if it has one
> (idling on a cfqq), and once the timer fires it still needs to be able to
> kick the queue and dispatch requests.
> 
> To me this sounds independent of the plugging logic. Or am I missing something?

I guess your question probably was whether we still need cfqd->unplug_work
and cfq_kick_queue(), and whether these can be replaced by the delayed
work mechanism. I would think that we should be able to. Will write a
patch and test it.

Thanks
Vivek


end of thread, other threads:[~2011-04-22 16:10 UTC | newest]

Thread overview: 152+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-22  1:17 [PATCH 0/10] On-stack explicit block queue plugging Jens Axboe
2011-01-22  1:17 ` [PATCH 01/10] block: add API for delaying work/request_fn a little bit Jens Axboe
2011-01-22  1:17 ` [PATCH 02/10] ide-cd: convert to blk_delay_queue() for a short pause Jens Axboe
2011-01-22  1:19   ` David Miller
2011-01-22  1:17 ` [PATCH 03/10] scsi: convert to blk_delay_queue() Jens Axboe
2011-01-22  1:17 ` [PATCH 04/10] block: initial patch for on-stack per-task plugging Jens Axboe
2011-01-24 19:36   ` Jeff Moyer
2011-01-24 21:23     ` Jens Axboe
2011-03-10 16:54   ` Vivek Goyal
2011-03-10 19:32     ` Jens Axboe
2011-03-10 19:46       ` Vivek Goyal
2011-03-16  8:18   ` Shaohua Li
2011-03-16 17:31     ` Vivek Goyal
2011-03-17  1:00       ` Shaohua Li
2011-03-17  3:19         ` Shaohua Li
2011-03-17  9:44           ` Jens Axboe
2011-03-18  1:55             ` Shaohua Li
2011-03-17  9:43         ` Jens Axboe
2011-03-18  6:36           ` Shaohua Li
2011-03-18 12:54             ` Jens Axboe
2011-03-18 13:52               ` Jens Axboe
2011-03-21  6:52                 ` Shaohua Li
2011-03-21  9:20                   ` Jens Axboe
2011-03-22  0:32                     ` Shaohua Li
2011-03-22  7:36                       ` Jens Axboe
2011-03-17  9:39     ` Jens Axboe
2011-01-22  1:17 ` [PATCH 05/10] block: remove per-queue plugging Jens Axboe
2011-01-22  1:31   ` Nick Piggin
2011-03-03 21:23   ` Mike Snitzer
2011-03-03 21:27     ` Mike Snitzer
2011-03-03 22:13     ` Mike Snitzer
2011-03-04 13:02       ` Shaohua Li
2011-03-04 13:20         ` Jens Axboe
2011-03-04 21:43         ` Mike Snitzer
2011-03-04 21:50           ` Jens Axboe
2011-03-04 22:27             ` Mike Snitzer
2011-03-05 20:54               ` Jens Axboe
2011-03-07 10:23                 ` Peter Zijlstra
2011-03-07 19:43                   ` Jens Axboe
2011-03-07 20:41                     ` Peter Zijlstra
2011-03-07 20:46                       ` Jens Axboe
2011-03-08  9:38                         ` Peter Zijlstra
2011-03-08  9:41                           ` Jens Axboe
2011-03-07  0:54             ` Shaohua Li
2011-03-07  8:07               ` Jens Axboe
2011-03-08 12:16       ` Jens Axboe
2011-03-08 20:21         ` Mike Snitzer
2011-03-08 20:27           ` Jens Axboe
2011-03-08 21:36             ` Jeff Moyer
2011-03-09  7:25               ` Jens Axboe
2011-03-08 22:05             ` Mike Snitzer
2011-03-10  0:58               ` Mike Snitzer
2011-04-05  3:05                 ` NeilBrown
2011-04-11  4:50                   ` NeilBrown
2011-04-11  9:19                     ` Jens Axboe
2011-04-11 10:59                       ` NeilBrown
2011-04-11 11:04                         ` Jens Axboe
2011-04-11 11:26                           ` NeilBrown
2011-04-11 11:37                             ` Jens Axboe
2011-04-11 12:05                               ` NeilBrown
2011-04-11 12:11                                 ` Jens Axboe
2011-04-11 12:36                                   ` NeilBrown
2011-04-11 12:48                                     ` Jens Axboe
2011-04-12  1:12                                       ` hch
2011-04-12  8:36                                         ` Jens Axboe
2011-04-12 12:22                                           ` Dave Chinner
2011-04-12 12:28                                             ` Jens Axboe
2011-04-12 12:41                                               ` Dave Chinner
2011-04-12 12:58                                                 ` Jens Axboe
2011-04-12 13:31                                                   ` Dave Chinner
2011-04-12 13:45                                                     ` Jens Axboe
2011-04-12 14:34                                                       ` Dave Chinner
2011-04-12 21:08                                                         ` NeilBrown
2011-04-13  2:23                                                           ` Linus Torvalds
2011-04-13 11:12                                                             ` Peter Zijlstra
2011-04-13 11:23                                                               ` Jens Axboe
2011-04-13 11:41                                                                 ` Peter Zijlstra
2011-04-13 15:13                                                                 ` Linus Torvalds
2011-04-13 17:35                                                                   ` Jens Axboe
2011-04-12 16:58                                                     ` hch
2011-04-12 17:29                                                       ` Jens Axboe
2011-04-12 16:44                                                   ` hch
2011-04-12 16:49                                                     ` Jens Axboe
2011-04-12 16:54                                                       ` hch
2011-04-12 17:24                                                         ` Jens Axboe
2011-04-12 13:40                                               ` Dave Chinner
2011-04-12 13:48                                                 ` Jens Axboe
2011-04-12 23:35                                                   ` Dave Chinner
2011-04-12 16:50                                           ` hch
2011-04-15  4:26                                   ` hch
2011-04-15  6:34                                     ` Jens Axboe
2011-04-17 22:19                                   ` NeilBrown
2011-04-18  4:19                                     ` NeilBrown
2011-04-18  6:38                                     ` Jens Axboe
2011-04-18  7:25                                       ` NeilBrown
2011-04-18  8:10                                         ` Jens Axboe
2011-04-18  8:33                                           ` NeilBrown
2011-04-18  8:42                                             ` Jens Axboe
2011-04-18 21:23                                             ` hch
2011-04-22 15:39                                               ` hch
2011-04-22 16:01                                                 ` Vivek Goyal
2011-04-22 16:10                                                   ` Vivek Goyal
2011-04-18 21:30                                             ` hch
2011-04-18 22:38                                               ` NeilBrown
2011-04-20 10:55                                                 ` hch
2011-04-18  9:19                                           ` hch
2011-04-18  9:40                                             ` [dm-devel] " Hannes Reinecke
2011-04-18  9:47                                               ` Jens Axboe
2011-04-18  9:46                                             ` Jens Axboe
2011-04-11 11:55                         ` NeilBrown
2011-04-11 12:12                           ` Jens Axboe
2011-04-11 22:58                             ` hch
2011-04-12  6:20                               ` Jens Axboe
2011-04-11 16:59                     ` hch
2011-04-11 21:14                       ` NeilBrown
2011-04-11 22:59                         ` hch
2011-04-12  6:18                         ` Jens Axboe
2011-03-17 15:51               ` Mike Snitzer
2011-03-17 18:31                 ` Jens Axboe
2011-03-17 18:46                   ` Mike Snitzer
2011-03-18  9:15                     ` hch
2011-03-08 12:15     ` Jens Axboe
2011-03-04  4:00   ` Vivek Goyal
2011-03-08 12:24     ` Jens Axboe
2011-03-08 22:10       ` blk-throttle: Use blk_plug in throttle code (Was: Re: [PATCH 05/10] block: remove per-queue plugging) Vivek Goyal
2011-03-09  7:26         ` Jens Axboe
2011-01-22  1:17 ` [PATCH 06/10] block: kill request allocation batching Jens Axboe
2011-01-22  9:31   ` Christoph Hellwig
2011-01-24 19:09     ` Jens Axboe
2011-01-22  1:17 ` [PATCH 07/10] fs: make generic file read/write functions plug Jens Axboe
2011-01-24  3:57   ` Dave Chinner
2011-01-24 19:11     ` Jens Axboe
2011-03-04  4:09   ` Vivek Goyal
2011-03-04 13:22     ` Jens Axboe
2011-03-04 13:25       ` hch
2011-03-04 13:40         ` Jens Axboe
2011-03-04 14:08           ` hch
2011-03-04 22:07             ` Jens Axboe
2011-03-04 23:12               ` hch
2011-03-08 12:38         ` Jens Axboe
2011-03-09 10:38           ` hch
2011-03-09 10:52             ` Jens Axboe
2011-01-22  1:17 ` [PATCH 08/10] read-ahead: use plugging Jens Axboe
2011-01-22  1:17 ` [PATCH 09/10] fs: make mpage read/write_pages() plug Jens Axboe
2011-01-22  1:17 ` [PATCH 10/10] fs: make aio plug Jens Axboe
2011-01-24 17:59   ` Jeff Moyer
2011-01-24 19:09     ` Jens Axboe
2011-01-24 19:15       ` Jeff Moyer
2011-01-24 19:22         ` Jens Axboe
2011-01-24 19:29           ` Jeff Moyer
2011-01-24 19:31             ` Jens Axboe
2011-01-24 19:38               ` Jeff Moyer
