LKML Archive on lore.kernel.org
* [rfc] direct IO submission and completion scalability issues
@ 2007-07-28  1:21 Siddha, Suresh B
  2007-07-30 18:20 ` Christoph Lameter
  2008-02-03  9:52 ` Nick Piggin
  0 siblings, 2 replies; 27+ messages in thread
From: Siddha, Suresh B @ 2007-07-28  1:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: arjan, mingo, npiggin, ak, jens.axboe, James.Bottomley, andrea,
	clameter, akpm, andrew.vasquez

We have been looking into Linux kernel direct IO scalability issues with
database workloads. Comments and suggestions on the experiments below are
welcome.

In the Linux kernel, direct IO requests are not batched at the block layer,
i.e., as a new request comes in, it gets submitted directly to the IO
controller on the same cpu where the request originates. The IO completion,
however, likely happens on a different cpu, the one processing interrupts.
This results in cacheline bouncing of some of the hot kernel cachelines
(timers, scsi cmds, slab, sched, etc.) and is becoming an important
scalability issue as the number of cpus and the distance between them
increase with multi-core and NUMA.
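
Roughly, the path today looks like this (a simplified sketch of the 2.6
call chain, just to show which layers run on which cpu; details vary by
driver):

	/* submitting cpu: wherever the application happens to run */
	submit_bio()
	  -> __make_request()		/* request allocation, plugging */
	    -> __generic_unplug_device()
	      -> q->request_fn()	/* scsi/driver queues the cmd to HW */

	/* completing cpu: whichever cpu takes the controller interrupt */
	driver irq handler
	  -> blk_complete_request()	/* raises BLOCK_SOFTIRQ */
	    -> q->softirq_done_fn()	/* scsi_softirq_done() */
	      -> scsi_finish_command()	/* ends the request, completes the IO */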

For controllers that support RIO/ZIO modes (like some qla2xxx), the IO
submission path on each cpu also checks whether there are any completed
IO commands in the response queue and triggers a softirq on the same cpu
to process the completed commands. This results in each logical cpu in the
system spending some time in softirq processing, which causes contention on
spinlocks and other data structures.

We are not sure when IO controllers with multiple request/response queues
will be available in the market. Once they are, we can dedicate each queue
pair to a group of cpus (or a node) and be done with this problem.

In the absence of such HW today, we looked into possible solutions for
these problems and did a couple of experiments.

In the first experiment, we removed the completed IO command processing
during IO submission. IO commands are then processed only on the cpu
receiving interrupts. This results in more interrupts (as we are no longer
doing any proactive processing), but we wanted to see if it is a win over
each cpu doing the softirq processing. This gave a 1.36% performance
improvement on an x86_64 MP system (16 logical cpus in total), and on a two
node ia64 platform (2 nodes, 8 cores, 16 threads) we got a 1.5% improvement
[please see observation #1 below].

Reference patch for this:

diff --git a/drivers/scsi/qla2xxx/qla_iocb.c b/drivers/scsi/qla2xxx/qla_iocb.c
index c5b3c61..357a497 100644
--- a/drivers/scsi/qla2xxx/qla_iocb.c
+++ b/drivers/scsi/qla2xxx/qla_iocb.c
@@ -414,11 +414,6 @@ qla2x00_start_scsi(srb_t *sp)
 	WRT_REG_WORD(ISP_REQ_Q_IN(ha, reg), ha->req_ring_index);
 	RD_REG_WORD_RELAXED(ISP_REQ_Q_IN(ha, reg));	/* PCI Posting. */
 
-	/* Manage unprocessed RIO/ZIO commands in response queue. */
-	if (ha->flags.process_response_queue &&
-	    ha->response_ring_ptr->signature != RESPONSE_PROCESSED)
-		qla2x00_process_response_queue(ha);
-
 	spin_unlock_irqrestore(&ha->hardware_lock, flags);
 	return (QLA_SUCCESS);
 
@@ -844,11 +839,6 @@ qla24xx_start_scsi(srb_t *sp)
 	WRT_REG_DWORD(&reg->req_q_in, ha->req_ring_index);
 	RD_REG_DWORD_RELAXED(&reg->req_q_in);		/* PCI Posting. */
 
-	/* Manage unprocessed RIO/ZIO commands in response queue. */
-	if (ha->flags.process_response_queue &&
-	    ha->response_ring_ptr->signature != RESPONSE_PROCESSED)
-		qla24xx_process_response_queue(ha);
-
 	spin_unlock_irqrestore(&ha->hardware_lock, flags);
 	return QLA_SUCCESS;
 
Observation #1: This experiment puts a heavy load on the cpu processing
interrupts. As such, equal distribution of task load by the scheduler didn't
give the expected performance improvement (cpus with no interrupts race to
idle and migrate some tasks during idle balancing, leading to some increase
in idle time as well as the costs associated with excessive task migration).
We tweaked our manual task binding so that cpus with no interrupts get
proportionally more load than cpus which process interrupts, and this gave
the nice performance boost mentioned above. Perhaps we need to make
scheduler load balancing aware of the irq load on each cpu.
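
One rough way to express that (purely illustrative; the helper below does
not exist in the stock scheduler) would be to discount a cpu's apparent
capacity by the fraction of time it recently spent in hardirq/softirq
context, so the balancer naturally places less task load there:

	/*
	 * Illustrative sketch only.  irq_time_fraction() is a made-up
	 * helper returning 0..SCHED_LOAD_SCALE; the stock scheduler has
	 * no such hook, which is exactly the gap being pointed out.
	 */
	static unsigned long effective_cpu_capacity(int cpu)
	{
		unsigned long capacity = SCHED_LOAD_SCALE;
		unsigned long irq = irq_time_fraction(cpu);	/* hypothetical */

		/* a cpu spending half its time in irqs looks half as big */
		return capacity - min(capacity, capacity * irq / SCHED_LOAD_SCALE);
	}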

The second experiment we did was migrating the IO submission to the
IO completion cpu. Instead of submitting the IO on the cpu where the
request arrived, the IO submission is migrated to the cpu that is
processing IO completions (interrupts). This minimizes the access to remote
cachelines (in the timer, slab and scsi layers). The IO submission request
is forwarded to the kblockd thread on the cpu receiving the interrupts. As
part of this, we also made the kblockd thread on each cpu the highest
priority thread, so that IO gets submitted on the interrupt cpu as soon as
possible, without any delay. On an x86_64 SMP platform with 16 cores, this
resulted in a 2% performance improvement, and a 3.3% improvement on the two
node ia64 platform.

A quick and dirty prototype patch (not meant for inclusion) for this IO
migration experiment is appended to this e-mail.

Observation #1 mentioned above is also applicable to this experiment. CPUs
processing interrupts will now have to handle the IO submission/processing
load as well.

Observation #2: This introduces some migration overhead during IO
submission. With the current prototype, every incoming IO request results in
an IPI and a context switch (to the kblockd thread) on the interrupt
processing cpu. This needs to be addressed, and the main challenge is an
efficient mechanism for doing this IO migration (how much batching to do,
and when to send the migrate request?), so that we neither delay the IO much
nor cause much overhead during migration.
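
A very rough sketch of one possible batching scheme (pending_submits below
is a made-up field, not part of the prototype patch at the end of this
mail): kick kblockd on the interrupt cpu only when a queue goes from having
no pending migration work to having some, so a burst of submissions shares
one IPI/wakeup instead of paying one per request:

	/*
	 * Hypothetical batching sketch.  The kblockd handler on the irq
	 * cpu would zero q->pending_submits before running ->request_fn(),
	 * so the next burst triggers exactly one new wakeup.
	 */
	static void queue_migrate_submission(request_queue_t *q)
	{
		unsigned long flags;
		int kick;

		spin_lock_irqsave(q->queue_lock, flags);
		kick = (q->pending_submits++ == 0);	/* first since last drain */
		spin_unlock_irqrestore(q->queue_lock, flags);

		if (kick)
			kblockd_schedule_work_on_cpu(&q->request_fn_work,
						     *q->submit_cpu);
	}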

The idea for the IO migration experiment came from an old experiment done
in EL3 days (linux-2.4.21-scsi-affine-queue.patch in the EL3 GA release,
pointed out by Arjan). Arjan noted that this patch had some perf issues
and was taken out in a later EL3 update release. Given that 2.6 has
progressed quite a bit since the 2.4 days, we are wondering if we can answer
this challenge easily with today's infrastructure.

Experiment 1 above can be easily incorporated into the Linux kernel (by
doing the proactive IO cmd completion processing only on the cpu processing
the interrupts). We still need to address the scheduler load balancing issue
(taking irq load into account), though.

Is there a simple and better way to migrate the IO request efficiently
(perhaps only for direct IO, and also based on IO load -- similar to what
was pursued in EL3)? Efficient IO migration would further improve the
performance numbers stated above.

The IO migration prototype (really dirty) patch follows:

diff -pNru linux-2.6.21-rc7/block/ll_rw_blk.c linux-batch-delay/block/ll_rw_blk.c
--- linux-2.6.21-rc7/block/ll_rw_blk.c	2007-05-22 18:22:02.000000000 -0700
+++ linux-batch-delay/block/ll_rw_blk.c	2007-06-19 11:56:54.000000000 -0700
@@ -177,6 +177,15 @@ void blk_queue_softirq_done(request_queu
 
 EXPORT_SYMBOL(blk_queue_softirq_done);
 
+static void blk_request_fn_work(struct work_struct *work)
+{
+	request_queue_t *q = container_of(work, request_queue_t, request_fn_work);
+
+	spin_lock_irq(q->queue_lock);
+	q->request_fn(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
 /**
  * blk_queue_make_request - define an alternate make_request function for a device
  * @q:  the request queue for the device to be affected
@@ -222,6 +231,7 @@ void blk_queue_make_request(request_queu
 	if (q->unplug_delay == 0)
 		q->unplug_delay = 1;
 
+	INIT_WORK(&q->request_fn_work, blk_request_fn_work);
 	INIT_WORK(&q->unplug_work, blk_unplug_work);
 
 	q->unplug_timer.function = blk_unplug_timeout;
@@ -1574,6 +1584,7 @@ int blk_remove_plug(request_queue_t *q)
 
 EXPORT_SYMBOL(blk_remove_plug);
 
+
 /*
  * remove the plug and let it rip..
  */
@@ -1585,7 +1596,11 @@ void __generic_unplug_device(request_que
 	if (!blk_remove_plug(q))
 		return;
 
-	q->request_fn(q);
+	if (q->cpu_binding && q->submit_cpu &&
+	    *q->submit_cpu != smp_processor_id())
+		kblockd_schedule_work_on_cpu(&q->request_fn_work, *q->submit_cpu);
+	else
+		q->request_fn(q);
 }
 EXPORT_SYMBOL(__generic_unplug_device);
 
@@ -1624,6 +1639,7 @@ static void blk_backing_dev_unplug(struc
 	}
 }
 
+
 static void blk_unplug_work(struct work_struct *work)
 {
 	request_queue_t *q = container_of(work, request_queue_t, unplug_work);
@@ -1641,7 +1657,10 @@ static void blk_unplug_timeout(unsigned 
 	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
 				q->rq.count[READ] + q->rq.count[WRITE]);
 
-	kblockd_schedule_work(&q->unplug_work);
+	if (!q->submit_cpu || !q->cpu_binding)
+		kblockd_schedule_work(&q->unplug_work);
+	else if (q->cpu_binding)
+		kblockd_schedule_work_on_cpu(&q->unplug_work, *q->submit_cpu);
 }
 
 /**
@@ -1737,7 +1756,10 @@ void blk_run_queue(struct request_queue 
 			clear_bit(QUEUE_FLAG_REENTER, &q->queue_flags);
 		} else {
 			blk_plug_device(q);
-			kblockd_schedule_work(&q->unplug_work);
+			if (q->cpu_binding && q->submit_cpu)
+				kblockd_schedule_work_on_cpu(&q->unplug_work, *q->submit_cpu);
+			else
+				kblockd_schedule_work(&q->unplug_work);
 		}
 	}
 
@@ -3627,6 +3649,11 @@ int kblockd_schedule_work(struct work_st
 	return queue_work(kblockd_workqueue, work);
 }
 
+int kblockd_schedule_work_on_cpu(struct work_struct *work, int cpu)
+{
+	return queue_work_on_cpu(kblockd_workqueue, work, cpu);
+}
+
 EXPORT_SYMBOL(kblockd_schedule_work);
 
 void kblockd_flush(void)
@@ -3813,6 +3840,22 @@ queue_var_store(unsigned long *var, cons
 	return count;
 }
 
+static ssize_t
+queue_cpu_binding_store(struct request_queue *q, const char *page, size_t count)
+{
+	sscanf(page, "%d", &q->cpu_binding);
+	return count;
+}
+
+static ssize_t queue_cpu_binding_show(struct request_queue *q, char *page)
+{
+	int count;
+	count = queue_var_show(q->cpu_binding, (page));
+	if (q->submit_cpu)
+		count += queue_var_show(*q->submit_cpu, (page + count));
+	return count;
+}
+
 static ssize_t queue_requests_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(q->nr_requests, (page));
@@ -3946,6 +3989,13 @@ static struct queue_sysfs_entry queue_ma
 	.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_cpu_binding_entry = {
+	.attr = {.name = "cpu_binding", .mode = S_IRUGO | S_IWUSR },
+ 	.show = queue_cpu_binding_show,
+ 	.store = queue_cpu_binding_store,
+};
+
+
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -3958,6 +4008,7 @@ static struct attribute *default_attrs[]
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
 	&queue_iosched_entry.attr,
+ 	&queue_cpu_binding_entry.attr,
 	NULL,
 };
 
diff -pNru linux-2.6.21-rc7/drivers/ata/libata-core.c linux-batch-delay/drivers/ata/libata-core.c
--- linux-2.6.21-rc7/drivers/ata/libata-core.c	2007-04-15 16:50:57.000000000 -0700
+++ linux-batch-delay/drivers/ata/libata-core.c	2007-06-19 11:37:29.000000000 -0700
@@ -5223,6 +5223,7 @@ irqreturn_t ata_interrupt (int irq, void
 		    !(ap->flags & ATA_FLAG_DISABLED)) {
 			struct ata_queued_cmd *qc;
 
+			ap->scsi_host->irq_cpu = smp_processor_id();
 			qc = ata_qc_from_tag(ap, ap->active_tag);
 			if (qc && (!(qc->tf.flags & ATA_TFLAG_POLLING)) &&
 			    (qc->flags & ATA_QCFLAG_ACTIVE))
diff -pNru linux-2.6.21-rc7/drivers/scsi/qla2xxx/qla_def.h linux-batch-delay/drivers/scsi/qla2xxx/qla_def.h
--- linux-2.6.21-rc7/drivers/scsi/qla2xxx/qla_def.h	2007-04-15 16:50:57.000000000 -0700
+++ linux-batch-delay/drivers/scsi/qla2xxx/qla_def.h	2007-06-19 11:37:29.000000000 -0700
@@ -2294,6 +2294,8 @@ typedef struct scsi_qla_host {
 	uint8_t rscn_in_ptr;
 	uint8_t rscn_out_ptr;
 
+	unsigned long  last_irq_cpu; /* cpu where we got our last irq */
+
 	/* SNS command interfaces. */
 	ms_iocb_entry_t		*ms_iocb;
 	dma_addr_t		ms_iocb_dma;
diff -pNru linux-2.6.21-rc7/drivers/scsi/qla2xxx/qla_isr.c linux-batch-delay/drivers/scsi/qla2xxx/qla_isr.c
--- linux-2.6.21-rc7/drivers/scsi/qla2xxx/qla_isr.c	2007-04-15 16:50:57.000000000 -0700
+++ linux-batch-delay/drivers/scsi/qla2xxx/qla_isr.c	2007-06-19 11:37:29.000000000 -0700
@@ -44,6 +44,7 @@ qla2100_intr_handler(int irq, void *dev_
 		return (IRQ_NONE);
 	}
 
+	ha->host->irq_cpu = smp_processor_id();
 	reg = &ha->iobase->isp;
 	status = 0;
 
@@ -121,6 +122,7 @@ qla2300_intr_handler(int irq, void *dev_
 		return (IRQ_NONE);
 	}
 
+	ha->host->irq_cpu = smp_processor_id();
 	reg = &ha->iobase->isp;
 	status = 0;
 
@@ -1437,6 +1439,7 @@ qla24xx_intr_handler(int irq, void *dev_
 		return IRQ_NONE;
 	}
 
+	ha->host->irq_cpu = smp_processor_id();
 	reg = &ha->iobase->isp24;
 	status = 0;
 
diff -pNru linux-2.6.21-rc7/drivers/scsi/scsi_scan.c linux-batch-delay/drivers/scsi/scsi_scan.c
--- linux-2.6.21-rc7/drivers/scsi/scsi_scan.c	2007-04-15 16:50:57.000000000 -0700
+++ linux-batch-delay/drivers/scsi/scsi_scan.c	2007-06-19 11:37:37.000000000 -0700
@@ -280,6 +280,8 @@ static struct scsi_device *scsi_alloc_sd
 	}
 
 	sdev->request_queue->queuedata = sdev;
+	if (sdev->host)
+		sdev->request_queue->submit_cpu = &sdev->host->irq_cpu;
 	scsi_adjust_queue_depth(sdev, 0, sdev->host->cmd_per_lun);
 
 	scsi_sysfs_device_initialize(sdev);
diff -pNru linux-2.6.21-rc7/include/linux/blkdev.h linux-batch-delay/include/linux/blkdev.h
--- linux-2.6.21-rc7/include/linux/blkdev.h	2007-05-29 17:02:00.000000000 -0700
+++ linux-batch-delay/include/linux/blkdev.h	2007-06-19 11:37:29.000000000 -0700
@@ -392,6 +392,7 @@ struct request_queue
 	int			unplug_thresh;	/* After this many requests */
 	unsigned long		unplug_delay;	/* After this many jiffies */
 	struct work_struct	unplug_work;
+	struct work_struct	request_fn_work;
 
 	struct backing_dev_info	backing_dev_info;
 
@@ -400,6 +401,8 @@ struct request_queue
 	 * ll_rw_blk doesn't touch it.
 	 */
 	void			*queuedata;
+	int			cpu_binding;
+	int			*submit_cpu;
 
 	/*
 	 * queue needs bounce pages for pages above this limit
@@ -853,6 +856,7 @@ static inline void put_dev_sector(Sector
 
 struct work_struct;
 int kblockd_schedule_work(struct work_struct *work);
+int kblockd_schedule_work_on_cpu(struct work_struct *work, int cpu);
 void kblockd_flush(void);
 
 #define MODULE_ALIAS_BLOCKDEV(major,minor) \
diff -pNru linux-2.6.21-rc7/include/linux/workqueue.h linux-batch-delay/include/linux/workqueue.h
--- linux-2.6.21-rc7/include/linux/workqueue.h	2007-05-29 17:02:00.000000000 -0700
+++ linux-batch-delay/include/linux/workqueue.h	2007-06-19 11:37:29.000000000 -0700
@@ -168,6 +168,7 @@ extern struct workqueue_struct *__create
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
 extern int FASTCALL(queue_work(struct workqueue_struct *wq, struct work_struct *work));
+extern int FASTCALL(queue_work_on_cpu(struct workqueue_struct *wq, struct work_struct *work, int cpu));
 extern int FASTCALL(queue_delayed_work(struct workqueue_struct *wq, struct delayed_work *work, unsigned long delay));
 extern int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 	struct delayed_work *work, unsigned long delay);
diff -pNru linux-2.6.21-rc7/include/scsi/scsi_host.h linux-batch-delay/include/scsi/scsi_host.h
--- linux-2.6.21-rc7/include/scsi/scsi_host.h	2007-04-15 16:50:57.000000000 -0700
+++ linux-batch-delay/include/scsi/scsi_host.h	2007-06-19 11:37:29.000000000 -0700
@@ -635,6 +635,7 @@ struct Scsi_Host {
 	unsigned char n_io_port;
 	unsigned char dma_channel;
 	unsigned int  irq;
+	unsigned int  irq_cpu;
 	
 
 	enum scsi_host_state shost_state;
diff -pNru linux-2.6.21-rc7/kernel/workqueue.c linux-batch-delay/kernel/workqueue.c
--- linux-2.6.21-rc7/kernel/workqueue.c	2007-06-19 13:13:26.000000000 -0700
+++ linux-batch-delay/kernel/workqueue.c	2007-06-19 11:37:29.000000000 -0700
@@ -218,6 +218,20 @@ int fastcall queue_work(struct workqueue
 }
 EXPORT_SYMBOL_GPL(queue_work);
 
+int fastcall queue_work_on_cpu(struct workqueue_struct *wq, struct work_struct *work,
+			       int cpu)
+{
+	int ret = 0;
+
+	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
+		BUG_ON(!list_empty(&work->entry));
+		__queue_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+		ret = 1;
+	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(queue_work_on_cpu);
+
 void delayed_work_timer_fn(unsigned long __data)
 {
 	struct delayed_work *dwork = (struct delayed_work *)__data;
@@ -351,11 +365,15 @@ static int worker_thread(void *__cwq)
 	DECLARE_WAITQUEUE(wait, current);
 	struct k_sigaction sa;
 	sigset_t blocked;
+ 	struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
 
 	if (!cwq->freezeable)
 		current->flags |= PF_NOFREEZE;
 
-	set_user_nice(current, -5);
+ 	if (!strncmp(cwq->wq->name, "kblockd", 7))
+ 		sched_setscheduler(current, SCHED_FIFO, &param);
+ 	else
+ 		set_user_nice(current, -5);
 
 	/* Block and flush all signals */
 	sigfillset(&blocked);


* Re: [rfc] direct IO submission and completion scalability issues
  2007-07-28  1:21 [rfc] direct IO submission and completion scalability issues Siddha, Suresh B
@ 2007-07-30 18:20 ` Christoph Lameter
  2007-07-30 20:35   ` Siddha, Suresh B
  2008-02-03  9:52 ` Nick Piggin
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Lameter @ 2007-07-30 18:20 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: linux-kernel, arjan, mingo, npiggin, ak, jens.axboe,
	James.Bottomley, andrea, akpm, andrew.vasquez

On Fri, 27 Jul 2007, Siddha, Suresh B wrote:

> We have been looking into the linux kernel direct IO scalability issues with
> database workloads. Comments and suggestions on our below experiments are
> welcome.

This was on an SMP system? These issues are much more pronounced on a NUMA 
system. There the locality of the device may be a prime issue.

> In the linux kernel, direct IO requests are not batched at the block layer.
> i.e, as a new request comes in, the request get directly submitted to the
> IO controller on the same cpu that the request originates. And the IO completion
> likely happens on a different cpu which is processing interrupts. This results
> in cacheline bouncing of some of the hot kernel cachelines (like timers, scsi
> cmds, slab, sched, etc) and is becoming an important scalability issue
> as the number of cpus and distance between them increase with multi-core
> and numa.

Yes. The issue is even worse if the submission comes from a remote node. 
F.e. If we have a system with a scsi controller on node 2. Now I/O 
submission on node 1 and completion on node 2. In that case the 
cacheline has to be transferred across the NUMA interlink.

However, you cannot avoid running the completion on the node where the 
device sits. The device has all sorts of control structures and if you 
would handle the completion on node 1 then it would have to transfer lots
of cachelines that contain device state to node 1.

I think it is better to leave things as is. Or have the I/O submission be 
relocated to the node of the device.
 
> Second experiment which we did was migrating the IO submission to the
> IO completion cpu. Instead of submitting the IO on the same cpu where the
> request arrived, in this experiment  the IO submission gets migrated to the
> cpu that is processing IO completions(interrupt). This will minimize the
> access to remote cachelines (that happens in timers, slab, scsi layers). The
> IO submission request is forwarded to the kblockd thread on the cpu receiving
> the interrupts. As part of this, we also made kblockd thread on each cpu as the
> highest priority thread, so that IO gets submitted as soon as possible on the
> interrupt cpu with out any delay. On x86_64 SMP platform with 16 cores, this
> resulted in 2% performance improvement and 3.3% improvement on two node ia64
> platform.

I think that is the right approach. This will also help in cases where I/O 
devices can only be accessed from a certain node (NUMA device address 
restrictions on some systems may not allow remote cacheline access!)

> Observation #2: This introduces some migration overhead during IO submission.
> With the current prototype, every incoming IO request results in an IPI and
> context switch(to kblockd thread) on the interrupt processing cpu.
> This issue needs to be addressed and main challenge to address is
> the efficient mechanism of doing this IO migration(how much batching to do and
> when to send the migrate request?), so that we don't delay the IO much and at
> the same point, don't cause much overhead during migration.

Right.



* Re: [rfc] direct IO submission and completion scalability issues
  2007-07-30 18:20 ` Christoph Lameter
@ 2007-07-30 20:35   ` Siddha, Suresh B
  2007-07-31  4:19     ` Nick Piggin
  0 siblings, 1 reply; 27+ messages in thread
From: Siddha, Suresh B @ 2007-07-30 20:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Siddha, Suresh B, linux-kernel, arjan, mingo, npiggin, ak,
	jens.axboe, James.Bottomley, andrea, akpm, andrew.vasquez

On Mon, Jul 30, 2007 at 11:20:04AM -0700, Christoph Lameter wrote:
> On Fri, 27 Jul 2007, Siddha, Suresh B wrote:
> 
> > We have been looking into the linux kernel direct IO scalability issues with
> > database workloads. Comments and suggestions on our below experiments are
> > welcome.
> 
> This was on an SMP system? These issues are much more pronounced on a NUMA 
> system. There the locality of the device may be a prime issue.

We are looking into both SMP (multi-core) and NUMA systems.

> Yes. The issue is even worse if the submission comes from a remote node. 
> F.e. If we have a system with a scsi controller on node 2. Now I/O 
> submission on node 1 and completion on node 2. In that case the 
> cacheline has to be transferred across the NUMA interlink.
> 
> However, you cannot avoid running the completion on the node where the 
> device sits. The device has all sorts of control structures and if you 
> would handle the completion on node 1 then it would have to transfer lots
> of cachelines that contain device state to node 1.

If the device is capable of multiple queues, then placement of some of the
control structures and irq balancing can be done based on how those queues
are distributed.

> I think it is better to leave things as is. Or have the I/O submission be 
> relocated to the node of the device.

In the absence of specialized controllers, it is best to keep the control
structures close to the device node and move the I/O submission to this node.

> > Second experiment which we did was migrating the IO submission to the
> > IO completion cpu. Instead of submitting the IO on the same cpu where the
> > request arrived, in this experiment  the IO submission gets migrated to the
> > cpu that is processing IO completions(interrupt). This will minimize the
> > access to remote cachelines (that happens in timers, slab, scsi layers). The
> > IO submission request is forwarded to the kblockd thread on the cpu receiving
> > the interrupts. As part of this, we also made kblockd thread on each cpu as the
> > highest priority thread, so that IO gets submitted as soon as possible on the
> > interrupt cpu with out any delay. On x86_64 SMP platform with 16 cores, this
> > resulted in 2% performance improvement and 3.3% improvement on two node ia64
> > platform.
> 
> I think that is the right approach. This will also help in cases where I/O 
> devices can only be accessed from a certain node (NUMA device address 
> restrictions on some systems may not allow remote cacheline access!)

Ok, there we have no other choice ;-)

> > Observation #2: This introduces some migration overhead during IO submission.
> > With the current prototype, every incoming IO request results in an IPI and
> > context switch(to kblockd thread) on the interrupt processing cpu.
> > This issue needs to be addressed and main challenge to address is
> > the efficient mechanism of doing this IO migration(how much batching to do and
> > when to send the migrate request?), so that we don't delay the IO much and at
> > the same point, don't cause much overhead during migration.
> 
> Right.

So any suggestions for making this clean and acceptable to everyone?

thanks,
suresh


* Re: [rfc] direct IO submission and completion scalability issues
  2007-07-30 20:35   ` Siddha, Suresh B
@ 2007-07-31  4:19     ` Nick Piggin
  2007-07-31 17:14       ` Siddha, Suresh B
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2007-07-31  4:19 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Christoph Lameter, linux-kernel, arjan, mingo, ak, jens.axboe,
	James.Bottomley, andrea, akpm, andrew.vasquez

On Mon, Jul 30, 2007 at 01:35:19PM -0700, Suresh B wrote:
> On Mon, Jul 30, 2007 at 11:20:04AM -0700, Christoph Lameter wrote:
> > On Fri, 27 Jul 2007, Siddha, Suresh B wrote:
> 
> > > Observation #2: This introduces some migration overhead during IO submission.
> > > With the current prototype, every incoming IO request results in an IPI and
> > > context switch(to kblockd thread) on the interrupt processing cpu.
> > > This issue needs to be addressed and main challenge to address is
> > > the efficient mechanism of doing this IO migration(how much batching to do and
> > > when to send the migrate request?), so that we don't delay the IO much and at
> > > the same point, don't cause much overhead during migration.
> > 
> > Right.
> 
> So any suggestions for making this clean and acceptable to everyone?

It is obviously a good idea to hand over the IO at the point which
requires the least number of cachelines to be moved, and I think doing
it in the block layer is right. Mostly you have to convince the block
and driver maintainers I guess.

The scheduler really should be made interrupt-load aware anyway, so I
don't have a problem with changing that; or scheduling kblockd at a
higher priority, but I don't know if SCHED_FIFO is a good idea. Couldn't
it be done in a softirq instead?

Latency for IO migration could be the most difficult problem to solve
really. You don't give many details of the workload, profiles, etc... I
hope this is for a real world test? Can the locking be improved in simpler
ways first?

Just some random questions...

It looks like the main source of cacheline bouncing you're eliminating
is from the initial starting of IO from an empty queue (ie. unplug).
From then on, the submission is driven by completion, right?

Why is the queue allowed to go empty in the first place in an IO critical
workload?

Are you loading up each CPU with as many disks as it can possibly handle
plus a few more? If so, is that realistic? (I honestly don't know).

You say that you'd like to do this for direct IO only, but if it is more
efficient, why not for buffered IO as well? (or is it not more efficient
for buffered IO? if not, why?)

AFAIKS, you'd still have significant queue_lock contention from other
CPUs inserting requests into the list? What IO scheduler are you using?
I assume noop... as a crazy experiment, what happens if you create per-cpu
request queues?



* Re: [rfc] direct IO submission and completion scalability issues
  2007-07-31  4:19     ` Nick Piggin
@ 2007-07-31 17:14       ` Siddha, Suresh B
  2007-08-01  0:41         ` Nick Piggin
  0 siblings, 1 reply; 27+ messages in thread
From: Siddha, Suresh B @ 2007-07-31 17:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, Christoph Lameter, linux-kernel, arjan, mingo,
	ak, jens.axboe, James.Bottomley, andrea, akpm, andrew.vasquez

On Tue, Jul 31, 2007 at 06:19:17AM +0200, Nick Piggin wrote:
> On Mon, Jul 30, 2007 at 01:35:19PM -0700, Suresh B wrote:
> > So any suggestions for making this clean and acceptable to everyone?
> 
> It is obviously a good idea to hand over the IO at the point which
> requires the least number of cachelines to be moved, and I think doing
> it in the block layer is right. Mostly you have to convince the block
> and driver maintainers I guess.

Yes. Implementation is the challenging part I guess.

> The scheduler really should be made interrupt-load aware anyway, so I
> don't have a problem with changing that; or scheduling kblockd at a
> higher priority, but I don't know if SCHED_FIFO is a good idea. Couldn't
> it be done in a softirq instead?

Yes, softirq context is one way. But we just didn't want to penalize the
running task by taking away some of its cpu time. With CFS micro accounting,
perhaps we can track irq/softirq time and avoid penalizing the running
task's cpu time.

> Latency for IO migration could be the most difficult problem to solve
> really. You don't give much details of the workload, profiles, etc... I
> hope this is for a real world test?

The improvement numbers quoted are from the OLTP database workload. We can
look into other workloads.

> Can the locking be improved in simpler ways first?
> 
> Just some random questions...
> 
> It looks like the main source of cacheline bouncing you're eliminating
> is from the initial starting of IO from an empty queue (ie. unplug).
> From then on, the submission is driven by completion, right?
> 
> Why is the queue allowed to go empty in the first place in an IO critical
> workload?

This workload is using direct IO and there is no batching at the block layer
for direct IO. IO is submitted to the HW as it arrives.

> Are you loading up each CPU with as many disks as it can possibly handle
> plus a few more? If so, is that realistic? (I honestly don't know).

There is 3-4% iowait time in the system, so the cpus are not 100% busy,
but there is quite a bit of direct IO going on.

> You say that you'd like to do this for direct IO only, but if it is more
> efficient, why not for buffered IO as well? (or is it not more efficient
> for buffered IO? if not, why?)

It is applicable for both direct IO and buffered IO, but the implementations
will differ. For example, in buffered IO we can set things up in such a way
that the block plug timeout function runs on the IO completion cpu.

> AFAIKS, you'd still have significant queue_lock contention from other
> CPUs inserting requests into the list?

Correct. We have more potential to explore. Current implementation
is very elementary.

> What IO scheduler are you using? I assume noop...

yes.

> as a crazy experiment, what happens if you create per-cpu request queues?

Or, in other words, each kblockd thread catering to multiple request queues
(perhaps one for each cpu, or one per group of cpus).

Softirq context, plus each kblockd thread handling multiple request queues,
should lead to further improvements.

thanks,
suresh


* Re: [rfc] direct IO submission and completion scalability issues
  2007-07-31 17:14       ` Siddha, Suresh B
@ 2007-08-01  0:41         ` Nick Piggin
  2007-08-01  0:55           ` Siddha, Suresh B
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2007-08-01  0:41 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Christoph Lameter, linux-kernel, arjan, mingo, ak, jens.axboe,
	James.Bottomley, andrea, akpm, andrew.vasquez

On Tue, Jul 31, 2007 at 10:14:03AM -0700, Suresh B wrote:
> On Tue, Jul 31, 2007 at 06:19:17AM +0200, Nick Piggin wrote:
> > On Mon, Jul 30, 2007 at 01:35:19PM -0700, Suresh B wrote:
> > > So any suggestions for making this clean and acceptable to everyone?
> > 
> > It is obviously a good idea to hand over the IO at the point which
> > requires the least number of cachelines to be moved, and I think doing
> > it in the block layer is right. Mostly you have to convince the block
> > and driver maintainers I guess.
> 
> Yes. Implementation is the challenging part I guess.
> 
> > The scheduler really should be made interrupt-load aware anyway, so I
> > don't have a problem with changing that; or scheduling kblockd at a
> > higher priority, but I don't know if SCHED_FIFO is a good idea. Couldn't
> > it be done in a softirq instead?
> 
> Yes, softirq context is one way. But just didn't want to penalize the running
> task by taking away some of its cpu time. With CFS micro accounting, perhaps
> we can track irq, softirq time and avoid penalizing the running task's cpu
> time.

But you "penalize" the running task in the completion handler as well
anyway. Doing this with a SCHED_FIFO task is sort of like doing interrupt
threading which AFAIK has not been accepted (yet).


> > Latency for IO migration could be the most difficult problem to solve
> > really. You don't give much details of the workload, profiles, etc... I
> > hope this is for a real world test?
> 
> Improvement numbers quoted are from the OLTP database workload. We can look
> into other workloads.
> 
> > Can the locking be improved in simpler ways first?
> > 
> > Just some random questions...
> > 
> > It looks like the main source of cacheline bouncing you're eliminating
> > is from the initial starting of IO from an empty queue (ie. unplug).
> > From then on, the submission is driven by completion, right?
> > 
> > Why is the queue allowed to go empty in the first place in an IO critical
> > workload?
> 
> This workload is using direct IO and there is no batching at the block layer
> for direct IO. IO is submitted to the HW as it arrives.

So you aren't putting concurrent requests into the queue? Sounds like
userspace should be improved.


> > Are you loading up each CPU with as many disks as it can possibly handle
> > plus a few more? If so, is that realistic? (I honestly don't know).
> 
> There is 3-4% iowait time in the system. So the cpu's are not 100% busy,
> but there is quite a bit of direct IO going on.
> 
> > You say that you'd like to do this for direct IO only, but if it is more
> > efficient, why not for buffered IO as well? (or is it not more efficient
> > for buffered IO? if not, why?)
> 
> It is applicable for both direct IO and buffered IO. But the implementations
> will differ. For example in buffered IO, we can setup in such a way that the
> block plug timeout function runs on the IO completion cpu.

It would be nice to be doing that anyway. But unplug via request submission
rather than timeout is fairly common in buffered loads too.


> > AFAIKS, you'd still have significant queue_lock contention from other
> > CPUs inserting requests into the list?
> 
> Correct. We have more potential to explore. Current implementation
> is very elementary.
> 
> > What IO scheduler are you using? I assume noop...
> 
> yes.
> 
> > as a crazy experiment, what happens if you create per-cpu request queues?
> 
> or in other words, each kblockd thread catering multiple request queues
> (perhaps one for each cpu or one for group of cpu's).
> 
> softirq context and each kblockd thread handling multiple request queues will
> lead to further improvements.


* Re: [rfc] direct IO submission and completion scalability issues
  2007-08-01  0:41         ` Nick Piggin
@ 2007-08-01  0:55           ` Siddha, Suresh B
  2007-08-01  1:24             ` Nick Piggin
  0 siblings, 1 reply; 27+ messages in thread
From: Siddha, Suresh B @ 2007-08-01  0:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, Christoph Lameter, linux-kernel, arjan, mingo,
	ak, jens.axboe, James.Bottomley, andrea, akpm, andrew.vasquez

On Wed, Aug 01, 2007 at 02:41:18AM +0200, Nick Piggin wrote:
> On Tue, Jul 31, 2007 at 10:14:03AM -0700, Suresh B wrote:
> > Yes, softirq context is one way. But just didn't want to penalize the running
> > task by taking away some of its cpu time. With CFS micro accounting, perhaps
> > we can track irq, softirq time and avoid penalizing the running task's cpu
> > time.
> 
> But you "penalize" the running task in the completion handler as well
> anyway.

Yes.

Ingo, in general with CFS micro accounting, we should be able to avoid
penalizing the running task by tracking irq/softirq time. Isn't it?

> Doing this with a SCHED_FIFO task is sort of like doing interrupt
> threading which AFAIK has not been accepted (yet).

I am not recommending SCHED_FIFO. I will take a look at softirq
infrastructure for this.

> > This workload is using direct IO and there is no batching at the block layer
> > for direct IO. IO is submitted to the HW as it arrives.
> 
> So you aren't putting concurrent requests into the queue? Sounds like
> userspace should be improved.

Nick, remember that there are hundreds of disks in this setup, and at any
instant there will be at most 1 or 2 requests per disk.

> > It is applicable for both direct IO and buffered IO. But the implementations
> > will differ. For example in buffered IO, we can setup in such a way that the
> > block plug timeout function runs on the IO completion cpu.
> 
> It would be nice to be doing that anyway. But unplug via request submission
> rather than timeout is fairly common in buffered loads too.

Ok. Currently the patch handles both direct and buffered IO. While making
improvements to this patch I will make sure that both the paths take
advantage of this.

thanks,
suresh


* Re: [rfc] direct IO submission and completion scalability issues
  2007-08-01  0:55           ` Siddha, Suresh B
@ 2007-08-01  1:24             ` Nick Piggin
  0 siblings, 0 replies; 27+ messages in thread
From: Nick Piggin @ 2007-08-01  1:24 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Christoph Lameter, linux-kernel, arjan, mingo, ak, jens.axboe,
	James.Bottomley, andrea, akpm, andrew.vasquez

On Tue, Jul 31, 2007 at 05:55:13PM -0700, Suresh B wrote:
> On Wed, Aug 01, 2007 at 02:41:18AM +0200, Nick Piggin wrote:
> > On Tue, Jul 31, 2007 at 10:14:03AM -0700, Suresh B wrote:
> > > task by taking away some of its cpu time. With CFS micro accounting, perhaps
> > > we can track irq, softirq time and avoid penalizing the running task's cpu
> > > time.
> > 
> > But you "penalize" the running task in the completion handler as well
> > anyway.
> 
> Yes.
> 
> Ingo, in general with CFS micro accounting, we should be able to avoid
> penalizing the running task by tracking irq/softirq time. Isn't it?
> 
> > Doing this with a SCHED_FIFO task is sort of like doing interrupt
> > threading which AFAIK has not been accepted (yet).
> 
> I am not recommending SCHED_FIFO. I will take a look at softirq
> infrastructure for this.

I think that would be a fine way to go.


> > > This workload is using direct IO and there is no batching at the block layer
> > > for direct IO. IO is submitted to the HW as it arrives.
> > 
> > So you aren't putting concurrent requests into the queue? Sounds like
> > userspace should be improved.
> 
> Nick remember that there are hundreds of disks in this setup and at
> an instance, there will be max 1 or 2 requests per disk.

Well, if there are 2 requests per disk, that's a good thing; you won't
need to unplug. If there is only 1, then as well as the plugging cost,
the hardware loses some ability to pipeline things effectively.

I'm not saying the kernel shouldn't be improved in the latter case, but
if you're looking for performance, it is nice to ensure you have at
least 2 requests. Presumably you're using some pretty well tuned db
software though, so I guess this is not always possible.

Do you have stats for these things (queue empty vs not empty events,
unplugs, etc)? 


> > > It is applicable for both direct IO and buffered IO. But the implementations
> > > will differ. For example in buffered IO, we can setup in such a way that the
> > > block plug timeout function runs on the IO completion cpu.
> > 
> > It would be nice to be doing that anyway. But unplug via request submission
> > rather than timeout is fairly common in buffered loads too.
> 
> Ok. Currently the patch handles both direct and buffered IO. While making
> improvements to this patch I will make sure that both the paths take
> advantage of this.

Sounds good!



* Re: [rfc] direct IO submission and completion scalability issues
  2007-07-28  1:21 [rfc] direct IO submission and completion scalability issues Siddha, Suresh B
  2007-07-30 18:20 ` Christoph Lameter
@ 2008-02-03  9:52 ` Nick Piggin
  2008-02-03 10:53   ` Pekka Enberg
                     ` (4 more replies)
  1 sibling, 5 replies; 27+ messages in thread
From: Nick Piggin @ 2008-02-03  9:52 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: linux-kernel, arjan, mingo, ak, jens.axboe, James.Bottomley,
	andrea, clameter, akpm, andrew.vasquez, willy, Zach Brown

On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> 
> Second experiment which we did was migrating the IO submission to the
> IO completion cpu. Instead of submitting the IO on the same cpu where the
> request arrived, in this experiment  the IO submission gets migrated to the
> cpu that is processing IO completions(interrupt). This will minimize the
> access to remote cachelines (that happens in timers, slab, scsi layers). The
> IO submission request is forwarded to the kblockd thread on the cpu receiving
> the interrupts. As part of this, we also made kblockd thread on each cpu as the
> highest priority thread, so that IO gets submitted as soon as possible on the
> interrupt cpu with out any delay. On x86_64 SMP platform with 16 cores, this
> resulted in 2% performance improvement and 3.3% improvement on two node ia64
> platform.
> 
> Quick and dirty prototype patch(not meant for inclusion) for this io migration
> experiment is appended to this e-mail.
> 
> Observation #1 mentioned above is also applicable to this experiment. CPU's
> processing interrupts will now have to cater IO submission/processing
> load aswell.
> 
> Observation #2: This introduces some migration overhead during IO submission.
> With the current prototype, every incoming IO request results in an IPI and
> context switch(to kblockd thread) on the interrupt processing cpu.
> This issue needs to be addressed and main challenge to address is
> the efficient mechanism of doing this IO migration(how much batching to do and
> when to send the migrate request?), so that we don't delay the IO much and at
> the same point, don't cause much overhead during migration.

Hi guys,

Just had another idea for how we might do this: migrate the completions out
to the submitting CPUs rather than migrating submission into the completing
CPU.

I've got a basic patch that passes some stress testing. It seems fairly
simple to do at the block layer, and the bulk of the patch involves
introducing a scalable smp_call_function for it.

Now it could be optimised more by batching up IPIs, optimising the call
function path, or even migrating the completion event at a different
level...

However, this is a first cut. It actually seems like it might be taking
slightly more CPU to process block IO (~0.2%)... however, this is on my
dual core system that shares an llc, which means that there are very few
cache benefits to the migration, but non-zero overhead. So on multisocket
systems hopefully it might get to positive territory.

---

Index: linux-2.6/arch/x86/kernel/smp_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp_64.c
+++ linux-2.6/arch/x86/kernel/smp_64.c
@@ -321,6 +321,99 @@ void unlock_ipi_call_lock(void)
 	spin_unlock_irq(&call_lock);
 }
 
+struct call_single_data {
+	struct list_head list;
+	void (*func) (void *info);
+	void *info;
+	int wait;
+};
+
+struct call_single_queue {
+	spinlock_t lock;
+	struct list_head list;
+};
+static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
+
+int __cpuinit init_smp_call(void)
+{
+	int i;
+
+	for_each_cpu_mask(i, cpu_possible_map) {
+		spin_lock_init(&per_cpu(call_single_queue, i).lock);
+		INIT_LIST_HEAD(&per_cpu(call_single_queue, i).list);
+	}
+	return 0;
+}
+core_initcall(init_smp_call);
+
+/*
+ * this function sends a 'generic call function' IPI to a single
+ * target CPU.
+ */
+int smp_call_function_fast(int cpu, void (*func)(void *), void *info,
+				    int wait)
+{
+	struct call_single_data *data;
+	struct call_single_queue *dst = &per_cpu(call_single_queue, cpu);
+	cpumask_t mask = cpumask_of_cpu(cpu);
+	int ipi;
+
+	data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC);
+	data->func = func;
+	data->info = info;
+	data->wait = wait;
+
+	spin_lock_irq(&dst->lock);
+	ipi = list_empty(&dst->list);
+	list_add_tail(&data->list, &dst->list);
+	spin_unlock_irq(&dst->lock);
+
+	if (ipi)
+		send_IPI_mask(mask, CALL_FUNCTION_SINGLE_VECTOR);
+
+	if (wait) {
+		/* Wait for response */
+		while (data->wait)
+			cpu_relax();
+		kfree(data);
+	}
+
+	return 0;
+}
+
+asmlinkage void smp_call_function_fast_interrupt(void)
+{
+	struct call_single_queue *q;
+	unsigned long flags;
+	LIST_HEAD(list);
+
+	ack_APIC_irq();
+
+	q = &__get_cpu_var(call_single_queue);
+	spin_lock_irqsave(&q->lock, flags);
+	list_replace_init(&q->list, &list);
+	spin_unlock_irqrestore(&q->lock, flags);
+
+	exit_idle();
+	irq_enter();
+	while (!list_empty(&list)) {
+		struct call_single_data *data;
+
+		data = list_entry(list.next, struct call_single_data, list);
+		list_del(&data->list);
+
+		data->func(data->info);
+		if (data->wait) {
+			smp_mb();
+			data->wait = 0;
+		} else {
+			kfree(data);
+		}
+	}
+	add_pda(irq_call_count, 1);
+	irq_exit();
+}
+
 /*
  * this function sends a 'generic call function' IPI to all other CPU
  * of the system defined in the mask.
Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c
+++ linux-2.6/block/blk-core.c
@@ -1604,6 +1604,13 @@ static int __end_that_request_first(stru
 	return 1;
 }
 
+static void blk_done_softirq_other(void *data)
+{
+	struct request *rq = data;
+
+	blk_complete_request(rq);
+}
+
 /*
  * splice the completion data to a local structure and hand off to
  * process_completion_queue() to complete the requests
@@ -1622,7 +1629,15 @@ static void blk_done_softirq(struct soft
 
 		rq = list_entry(local_list.next, struct request, donelist);
 		list_del_init(&rq->donelist);
-		rq->q->softirq_done_fn(rq);
+		if (rq->submission_cpu != smp_processor_id()) {
+			/*
+			 * Could batch up IPIs here, but we should measure how
+			 * often blk_done_softirq gets a large batch...
+			 */
+			smp_call_function_fast(rq->submission_cpu,
+						blk_done_softirq_other, rq, 0);
+		} else
+			rq->q->softirq_done_fn(rq);
 	}
 }
 
Index: linux-2.6/include/asm-x86/hw_irq_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/hw_irq_64.h
+++ linux-2.6/include/asm-x86/hw_irq_64.h
@@ -68,8 +68,7 @@
 #define ERROR_APIC_VECTOR	0xfe
 #define RESCHEDULE_VECTOR	0xfd
 #define CALL_FUNCTION_VECTOR	0xfc
-/* fb free - please don't readd KDB here because it's useless
-   (hint - think what a NMI bit does to a vector) */
+#define CALL_FUNCTION_SINGLE_VECTOR	0xfb
 #define THERMAL_APIC_VECTOR	0xfa
 #define THRESHOLD_APIC_VECTOR   0xf9
 /* f8 free */
@@ -102,6 +101,7 @@ void spurious_interrupt(void);
 void error_interrupt(void);
 void reschedule_interrupt(void);
 void call_function_interrupt(void);
+void call_function_fast_interrupt(void);
 void irq_move_cleanup_interrupt(void);
 void invalidate_interrupt0(void);
 void invalidate_interrupt1(void);
Index: linux-2.6/include/linux/smp.h
===================================================================
--- linux-2.6.orig/include/linux/smp.h
+++ linux-2.6/include/linux/smp.h
@@ -53,6 +53,7 @@ extern void smp_cpus_done(unsigned int m
  * Call a function on all other processors
  */
 int smp_call_function(void(*func)(void *info), void *info, int retry, int wait);
+int smp_call_function_fast(int cpuid, void(*func)(void *info), void *info, int wait);
 
 int smp_call_function_single(int cpuid, void (*func) (void *info), void *info,
 				int retry, int wait);
@@ -92,6 +93,11 @@ static inline int up_smp_call_function(v
 }
 #define smp_call_function(func, info, retry, wait) \
 			(up_smp_call_function(func, info))
+static inline int smp_call_function_fast(int cpuid, void(*func)(void *info), void *info, int wait)
+{
+	return 0;
+}
+
 #define on_each_cpu(func,info,retry,wait)	\
 	({					\
 		local_irq_disable();		\
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c
+++ linux-2.6/block/elevator.c
@@ -648,6 +648,8 @@ void elv_insert(struct request_queue *q,
 void __elv_add_request(struct request_queue *q, struct request *rq, int where,
 		       int plug)
 {
+	rq->submission_cpu = smp_processor_id();
+
 	if (q->ordcolor)
 		rq->cmd_flags |= REQ_ORDERED_COLOR;
 
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h
+++ linux-2.6/include/linux/blkdev.h
@@ -208,6 +208,8 @@ struct request {
 
 	int ref_count;
 
+	int submission_cpu;
+
 	/*
 	 * when request is used as a packet command carrier
 	 */
Index: linux-2.6/arch/x86/kernel/entry_64.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/entry_64.S
+++ linux-2.6/arch/x86/kernel/entry_64.S
@@ -696,6 +696,9 @@ END(invalidate_interrupt\num)
 ENTRY(call_function_interrupt)
 	apicinterrupt CALL_FUNCTION_VECTOR,smp_call_function_interrupt
 END(call_function_interrupt)
+ENTRY(call_function_fast_interrupt)
+	apicinterrupt CALL_FUNCTION_SINGLE_VECTOR,smp_call_function_fast_interrupt
+END(call_function_fast_interrupt)
 ENTRY(irq_move_cleanup_interrupt)
 	apicinterrupt IRQ_MOVE_CLEANUP_VECTOR,smp_irq_move_cleanup_interrupt
 END(irq_move_cleanup_interrupt)
Index: linux-2.6/arch/x86/kernel/i8259_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/i8259_64.c
+++ linux-2.6/arch/x86/kernel/i8259_64.c
@@ -493,6 +493,7 @@ void __init native_init_IRQ(void)
 
 	/* IPI for generic function call */
 	set_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
+	set_intr_gate(CALL_FUNCTION_SINGLE_VECTOR, call_function_fast_interrupt);
 
 	/* Low priority IPI to cleanup after moving an irq */
 	set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt);
Index: linux-2.6/include/asm-x86/mach-default/entry_arch.h
===================================================================
--- linux-2.6.orig/include/asm-x86/mach-default/entry_arch.h
+++ linux-2.6/include/asm-x86/mach-default/entry_arch.h
@@ -13,6 +13,7 @@
 BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
 BUILD_INTERRUPT(invalidate_interrupt,INVALIDATE_TLB_VECTOR)
 BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
+BUILD_INTERRUPT(call_function_fast_interrupt,CALL_FUNCTION_SINGLE_VECTOR)
 #endif
 
 /*


* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-03  9:52 ` Nick Piggin
@ 2008-02-03 10:53   ` Pekka Enberg
  2008-02-03 11:58     ` Nick Piggin
  2008-02-04  2:10   ` David Chinner
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: Pekka Enberg @ 2008-02-03 10:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, linux-kernel, arjan, mingo, ak, jens.axboe,
	James.Bottomley, andrea, clameter, akpm, andrew.vasquez, willy,
	Zach Brown

Hi Nick,

On Feb 3, 2008 11:52 AM, Nick Piggin <npiggin@suse.de> wrote:
> +asmlinkage void smp_call_function_fast_interrupt(void)
> +{

[snip]

> +       while (!list_empty(&list)) {
> +               struct call_single_data *data;
> +
> +               data = list_entry(list.next, struct call_single_data, list);
> +               list_del(&data->list);
> +
> +               data->func(data->info);
> +               if (data->wait) {
> +                       smp_mb();
> +                       data->wait = 0;

Why do we need smp_mb() here (maybe add a comment to keep
Andrew/checkpatch happy)?

                        Pekka


* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-03 10:53   ` Pekka Enberg
@ 2008-02-03 11:58     ` Nick Piggin
  0 siblings, 0 replies; 27+ messages in thread
From: Nick Piggin @ 2008-02-03 11:58 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Siddha, Suresh B, linux-kernel, arjan, mingo, ak, jens.axboe,
	James.Bottomley, andrea, clameter, akpm, andrew.vasquez, willy,
	Zach Brown

On Sun, Feb 03, 2008 at 12:53:02PM +0200, Pekka Enberg wrote:
> Hi Nick,
> 
> On Feb 3, 2008 11:52 AM, Nick Piggin <npiggin@suse.de> wrote:
> > +asmlinkage void smp_call_function_fast_interrupt(void)
> > +{
> 
> [snip]
> 
> > +       while (!list_empty(&list)) {
> > +               struct call_single_data *data;
> > +
> > +               data = list_entry(list.next, struct call_single_data, list);
> > +               list_del(&data->list);
> > +
> > +               data->func(data->info);
> > +               if (data->wait) {
> > +                       smp_mb();
> > +                       data->wait = 0;
> 
> Why do we need smp_mb() here (maybe add a comment to keep
> Andrew/checkpatch happy)?

Yeah, definitely... it's just a really basic RFC, but I should get
into the habit of just doing it anyway.

Thanks,
Nick


* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-03  9:52 ` Nick Piggin
  2008-02-03 10:53   ` Pekka Enberg
@ 2008-02-04  2:10   ` David Chinner
  2008-02-04  4:14     ` Arjan van de Ven
  2008-02-04 18:21     ` Zach Brown
  2008-02-04 10:12   ` Jens Axboe
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 27+ messages in thread
From: David Chinner @ 2008-02-04  2:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, linux-kernel, arjan, mingo, ak, jens.axboe,
	James.Bottomley, andrea, clameter, akpm, andrew.vasquez, willy,
	Zach Brown

On Sun, Feb 03, 2008 at 10:52:52AM +0100, Nick Piggin wrote:
> On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> > 
> > Second experiment which we did was migrating the IO submission to the
> > IO completion cpu. Instead of submitting the IO on the same cpu where the
> > request arrived, in this experiment  the IO submission gets migrated to the
> > cpu that is processing IO completions(interrupt). This will minimize the
> > access to remote cachelines (that happens in timers, slab, scsi layers). The
> > IO submission request is forwarded to the kblockd thread on the cpu receiving
> > the interrupts. As part of this, we also made kblockd thread on each cpu as the
> > highest priority thread, so that IO gets submitted as soon as possible on the
> > interrupt cpu with out any delay. On x86_64 SMP platform with 16 cores, this
> > resulted in 2% performance improvement and 3.3% improvement on two node ia64
> > platform.
> > 
> > Quick and dirty prototype patch(not meant for inclusion) for this io migration
> > experiment is appended to this e-mail.
> > 
> > Observation #1 mentioned above is also applicable to this experiment. CPU's
> > processing interrupts will now have to cater IO submission/processing
> > load aswell.
> > 
> > Observation #2: This introduces some migration overhead during IO submission.
> > With the current prototype, every incoming IO request results in an IPI and
> > context switch(to kblockd thread) on the interrupt processing cpu.
> > This issue needs to be addressed and main challenge to address is
> > the efficient mechanism of doing this IO migration(how much batching to do and
> > when to send the migrate request?), so that we don't delay the IO much and at
> > the same point, don't cause much overhead during migration.
> 
> Hi guys,
> 
> Just had another way we might do this. Migrate the completions out to
> the submitting CPUs rather than migrate submission into the completing
> CPU.

Hi Nick,

When Matthew was describing this work at an LCA presentation (not
sure whether you were at that presentation or not), Zach came up
with the idea that allowing the submitting application to control the
CPU on which the io completion processing occurs would be a good
approach to try.  That is, we submit a "completion cookie" with the
bio that indicates where we want completion to run, rather than
dictating that completion runs on the submission CPU.

The reasoning is that only the higher level context really knows
what is optimal, and that changes from application to application.
The "complete on the submission CPU" policy _may_ be more optimal
for database workloads, but it is definitely suboptimal for XFS and
transaction I/O completion handling because it simply drags a bunch
of global filesystem state around between all the CPUs running
completions. In that case, we really only want a single CPU to be
handling the completions.....

(Zach - please correct me if I've missed anything)

Looking at your patch - if you turn it around so that the
"submission CPU" field can be specified as the "completion cpu" then
I think the patch will expose the policy knobs needed to do the
above. Add the bio -> rq linkage to enable filesystems and DIO to
control the completion CPU field and we're almost done.... ;)
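
Roughly, the sort of plumbing I have in mind would look like this (all
of the field and helper names below are made up purely to illustrate
the shape of it, none of them are from your patch):

	/* purely illustrative, hypothetical names */
	struct bio {
		/* ... existing fields ... */
		int bi_completion_cpu;	/* -1: no preference, else desired CPU */
	};

	/* the submitter (filesystem, DIO, ...) states its policy up front */
	bio->bi_completion_cpu = preferred_cpu;
	submit_bio(rw, bio);

	/* the block layer copies the policy to the request when mapping the bio */
	rq->completion_cpu = bio->bi_completion_cpu;

	/* and the completion path steers the softirq accordingly */
	if (rq->completion_cpu >= 0 && cpu_online(rq->completion_cpu))
		raise_blk_softirq_on(rq->completion_cpu, rq);	/* hypothetical: IPI + remote softirq */
	else
		raise_softirq(BLOCK_SOFTIRQ);			/* complete locally, as today */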

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-04  2:10   ` David Chinner
@ 2008-02-04  4:14     ` Arjan van de Ven
  2008-02-04  4:40       ` David Chinner
  2008-02-04 18:21     ` Zach Brown
  1 sibling, 1 reply; 27+ messages in thread
From: Arjan van de Ven @ 2008-02-04  4:14 UTC (permalink / raw)
  To: David Chinner
  Cc: Nick Piggin, Siddha, Suresh B, linux-kernel, mingo, ak,
	jens.axboe, James.Bottomley, andrea, clameter, akpm,
	andrew.vasquez, willy, Zach Brown

David Chinner wrote:
> Hi Nick,
> 
> When Matthew was describing this work at an LCA presentation (not
> sure whether you were at that presentation or not), Zach came up
> with the idea that allowing the submitting application control the
> CPU that the io completion processing was occurring would be a good
> approach to try.  That is, we submit a "completion cookie" with the
> bio that indicates where we want completion to run, rather than
> dictating that completion runs on the submission CPU.
> 
> The reasoning is that only the higher level context really knows
> what is optimal, and that changes from application to application.

well.. kinda. One of the really hard parts of the submit/completion stuff is that
the slab/slob/slub/slib allocator ends up basically "cycling" memory through the system;
there's a sink of free memory on all the submission cpus and a source of free memory
on the completion cpu. I don't think applications are capable of working out what is
best in this scenario..
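
In code terms the flow is roughly this (the cache name is hypothetical,
it's just to show the pattern):

	/* submit side, running on CPU A: eats CPU A's local free objects */
	cmd = kmem_cache_alloc(scsi_cmd_cache, GFP_ATOMIC);

	/* completion side, typically a different CPU B: refills CPU B's lists */
	kmem_cache_free(scsi_cmd_cache, cmd);

so one set of CPUs constantly drains its per-cpu freelists while another
constantly refills its own, and the allocator has to keep shuffling
objects and pages between them behind the scenes.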



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-04  4:14     ` Arjan van de Ven
@ 2008-02-04  4:40       ` David Chinner
  2008-02-04 10:09         ` Nick Piggin
  0 siblings, 1 reply; 27+ messages in thread
From: David Chinner @ 2008-02-04  4:40 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: David Chinner, Nick Piggin, Siddha, Suresh B, linux-kernel,
	mingo, ak, jens.axboe, James.Bottomley, andrea, clameter, akpm,
	andrew.vasquez, willy, Zach Brown

On Sun, Feb 03, 2008 at 08:14:45PM -0800, Arjan van de Ven wrote:
> David Chinner wrote:
> >Hi Nick,
> >
> >When Matthew was describing this work at an LCA presentation (not
> >sure whether you were at that presentation or not), Zach came up
> >with the idea that allowing the submitting application control the
> >CPU that the io completion processing was occurring would be a good
> >approach to try.  That is, we submit a "completion cookie" with the
> >bio that indicates where we want completion to run, rather than
> >dictating that completion runs on the submission CPU.
> >
> >The reasoning is that only the higher level context really knows
> >what is optimal, and that changes from application to application.
> 
> well.. kinda. One of the really hard parts of the submit/completion stuff 
> is that
> the slab/slob/slub/slib allocator ends up basically "cycling" memory 
> through the system;
> there's a sink of free memory on all the submission cpus and a source of 
> free memory
> on the completion cpu. I don't think applications are capable of working 
> out what is
> best in this scenario..

Applications as in "anything that calls submit_bio()", i.e. direct I/O,
filesystems, etc. Not userspace, but in-kernel applications.

In XFS, simultaneous io completion on multiple CPUs can contribute greatly to
contention on global structures. By controlling where completions are
delivered, we can greatly reduce this contention, especially on large,
multipathed devices that deliver interrupts to multiple CPUs that may be far
distant from each other.  We have all the state and intelligence necessary
to control this sort of policy decision effectively.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-04  4:40       ` David Chinner
@ 2008-02-04 10:09         ` Nick Piggin
  2008-02-05  0:14           ` David Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2008-02-04 10:09 UTC (permalink / raw)
  To: David Chinner
  Cc: Arjan van de Ven, Siddha, Suresh B, linux-kernel, mingo, ak,
	jens.axboe, James.Bottomley, andrea, clameter, akpm,
	andrew.vasquez, willy, Zach Brown

On Mon, Feb 04, 2008 at 03:40:20PM +1100, David Chinner wrote:
> On Sun, Feb 03, 2008 at 08:14:45PM -0800, Arjan van de Ven wrote:
> > David Chinner wrote:
> > >Hi Nick,
> > >
> > >When Matthew was describing this work at an LCA presentation (not
> > >sure whether you were at that presentation or not), Zach came up
> > >with the idea that allowing the submitting application control the
> > >CPU that the io completion processing was occurring would be a good
> > >approach to try.  That is, we submit a "completion cookie" with the
> > >bio that indicates where we want completion to run, rather than
> > >dictating that completion runs on the submission CPU.
> > >
> > >The reasoning is that only the higher level context really knows
> > >what is optimal, and that changes from application to application.
> > 
> > well.. kinda. One of the really hard parts of the submit/completion stuff 
> > is that
> > the slab/slob/slub/slib allocator ends up basically "cycling" memory 
> > through the system;
> > there's a sink of free memory on all the submission cpus and a source of 
> > free memory
> > on the completion cpu. I don't think applications are capable of working 
> > out what is
> > best in this scenario..
> 
> Applications as in "anything that calls submit_bio()". i.e, direct I/O,
> filesystems, etc. i.e. not userspace but in-kernel applications.
> 
> In XFS, simultaneous io completion on multiple CPUs can contribute greatly to
> contention of global structures in XFS. By controlling where completions are
> delivered, we can greatly reduce this contention, especially on large,
> mulitpathed devices that deliver interrupts to multiple CPUs that may be far
> distant from each other.  We have all the state and intelligence necessary
> to control this sort policy decision effectively.....

Hi Dave,

Thanks for taking a look at the patch... yes it would be easy to turn
this bit of state into a more flexible cookie (eg. complete on submitter;
complete on interrupt; complete on CPUx/nodex etc.). Maybe we'll need
something that complex... I'm not sure, it would probably need more
fine tuning. That said, I just wanted to get this approach out there
early for rfc.
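
If we did go down that path, the cookie probably wouldn't need to be much
more than something like this (values invented for illustration, not part
of the posted patch):

	/* illustrative only */
	enum bio_completion_policy {
		BIO_COMPLETE_ANY,	/* no preference: complete on the irq CPU */
		BIO_COMPLETE_SUBMITTER,	/* migrate completion to the submitting CPU */
		BIO_COMPLETE_CPU,	/* complete on an explicitly nominated CPU... */
		BIO_COMPLETE_NODE,	/* ...or anywhere on a nominated node */
	};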

I guess both you and Arjan have points. For a _lot_ of things, completing
on the same CPU as the submitter (whether that means migrating submission,
as in the original patch in this thread, or migrating completion, as I do)
is a win: you get better behaviour in the slab and page allocators and
better locality and cache hotness of memory. For example, I guess in a
filesystem / pagecache heavy workload, you have to touch each struct page,
buffer head, fs private state, and also often have to wake the thread for
completion. Much of this data has just been touched at submit time, so
doing this on the same CPU is nice...

I'm surprised that the xfs global state bouncing would outweigh the
bouncing of all the per-page/block/bio/request/etc data that gets touched
during completion. We'll see.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-03  9:52 ` Nick Piggin
  2008-02-03 10:53   ` Pekka Enberg
  2008-02-04  2:10   ` David Chinner
@ 2008-02-04 10:12   ` Jens Axboe
  2008-02-04 10:31     ` Nick Piggin
  2008-02-04 10:30   ` Andi Kleen
  2008-02-04 21:47   ` Siddha, Suresh B
  4 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2008-02-04 10:12 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, linux-kernel, arjan, mingo, ak,
	James.Bottomley, andrea, clameter, akpm, andrew.vasquez, willy,
	Zach Brown

On Sun, Feb 03 2008, Nick Piggin wrote:
> On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> > 
> > Second experiment which we did was migrating the IO submission to the
> > IO completion cpu. Instead of submitting the IO on the same cpu where the
> > request arrived, in this experiment  the IO submission gets migrated to the
> > cpu that is processing IO completions(interrupt). This will minimize the
> > access to remote cachelines (that happens in timers, slab, scsi layers). The
> > IO submission request is forwarded to the kblockd thread on the cpu receiving
> > the interrupts. As part of this, we also made kblockd thread on each cpu as the
> > highest priority thread, so that IO gets submitted as soon as possible on the
> > interrupt cpu with out any delay. On x86_64 SMP platform with 16 cores, this
> > resulted in 2% performance improvement and 3.3% improvement on two node ia64
> > platform.
> > 
> > Quick and dirty prototype patch(not meant for inclusion) for this io migration
> > experiment is appended to this e-mail.
> > 
> > Observation #1 mentioned above is also applicable to this experiment. CPU's
> > processing interrupts will now have to cater IO submission/processing
> > load aswell.
> > 
> > Observation #2: This introduces some migration overhead during IO submission.
> > With the current prototype, every incoming IO request results in an IPI and
> > context switch(to kblockd thread) on the interrupt processing cpu.
> > This issue needs to be addressed and main challenge to address is
> > the efficient mechanism of doing this IO migration(how much batching to do and
> > when to send the migrate request?), so that we don't delay the IO much and at
> > the same point, don't cause much overhead during migration.
> 
> Hi guys,
> 
> Just had another way we might do this. Migrate the completions out to
> the submitting CPUs rather than migrate submission into the completing
> CPU.
> 
> I've got a basic patch that passes some stress testing. It seems fairly
> simple to do at the block layer, and the bulk of the patch involves
> introducing a scalable smp_call_function for it.
> 
> Now it could be optimised more by looking at batching up IPIs or
> optimising the call function path or even mirating the completion event
> at a different level...
> 
> However, this is a first cut. It actually seems like it might be taking
> slightly more CPU to process block IO (~0.2%)... however, this is on my
> dual core system that shares an llc, which means that there are very few
> cache benefits to the migration, but non-zero overhead. So on multisocket
> systems hopefully it might get to positive territory.

That's pretty funny, I did pretty much the exact same thing last week!
The primary difference between yours and mine is that I used a more
private interface to signal a softirq raise on another CPU, instead of
allocating call data and exposing a generic interface. That puts the
locking in blk-core instead, turning blk_cpu_done into a structure with
a lock and a list_head instead of just being a list head, and intercepts
at blk_complete_request() time instead of waiting for an already raised
softirq on that CPU.
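
In rough terms it looks something like this (a sketch from memory rather
than the actual diff; the submit_cpu field and the remote-raise helper are
placeholders):

	struct blk_cpu_done {
		spinlock_t		lock;
		struct list_head	list;
	};
	static DEFINE_PER_CPU(struct blk_cpu_done, blk_cpu_done);

	void blk_complete_request(struct request *rq)
	{
		struct blk_cpu_done *bcd;
		unsigned long flags;
		int cpu = rq->submit_cpu;	/* recorded when the request was queued */

		local_irq_save(flags);
		bcd = &per_cpu(blk_cpu_done, cpu);
		spin_lock(&bcd->lock);
		list_add_tail(&rq->donelist, &bcd->list);
		spin_unlock(&bcd->lock);
		local_irq_restore(flags);

		if (cpu == smp_processor_id())
			raise_softirq(BLOCK_SOFTIRQ);
		else
			raise_blk_softirq_on(cpu);	/* placeholder: signal the softirq remotely */
	}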

Didn't get around to any performance testing yet, though. Will try and
clean it up a bit and do that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-03  9:52 ` Nick Piggin
                     ` (2 preceding siblings ...)
  2008-02-04 10:12   ` Jens Axboe
@ 2008-02-04 10:30   ` Andi Kleen
  2008-02-04 21:47   ` Siddha, Suresh B
  4 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2008-02-04 10:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, linux-kernel, arjan, mingo, ak, jens.axboe,
	James.Bottomley, andrea, clameter, akpm, andrew.vasquez, willy,
	Zach Brown

> +	q = &__get_cpu_var(call_single_queue);
> +	spin_lock_irqsave(&q->lock, flags);
> +	list_replace_init(&q->list, &list);
> +	spin_unlock_irqrestore(&q->lock, flags);

I think you could do that locklessly if you use a data structure similar
to netchannels (essentially a fixed-size single buffer queue with atomic
exchange of the first/last pointers) rather than a list. That would avoid
at least one bounce for the lock and likely another one for the list
manipulation.
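
Something along these lines, say (only a sketch of the idea; it assumes a
power-of-two ring that never has more than RING_SIZE entries in flight,
and all of the names are invented):

	struct call_slot {
		void (*func)(void *info);
		void *info;
	};

	struct call_ring {
		struct call_slot *slots[RING_SIZE];	/* non-NULL == ready to consume */
		atomic_t head;				/* producers claim slots atomically */
		unsigned int tail;			/* only touched by the owning CPU */
	};

	/* producer, on any CPU: claim a slot, then publish the pointer */
	idx = (atomic_inc_return(&ring->head) - 1) & (RING_SIZE - 1);
	smp_wmb();				/* slot contents visible before publish */
	ring->slots[idx] = slot;

	/* consumer, on the owning CPU (e.g. from its softirq) */
	for (;;) {
		struct call_slot *s = xchg(&ring->slots[ring->tail & (RING_SIZE - 1)], NULL);
		if (!s)
			break;
		ring->tail++;
		s->func(s->info);
	}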

Also the right way would be to not add a second mechanism for this,
but fix the standard smp_call_function_single() to support it.

-Andi

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-04 10:12   ` Jens Axboe
@ 2008-02-04 10:31     ` Nick Piggin
  2008-02-04 10:33       ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2008-02-04 10:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Siddha, Suresh B, linux-kernel, arjan, mingo, ak,
	James.Bottomley, andrea, clameter, akpm, andrew.vasquez, willy,
	Zach Brown

On Mon, Feb 04, 2008 at 11:12:44AM +0100, Jens Axboe wrote:
> On Sun, Feb 03 2008, Nick Piggin wrote:
> > On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> > 
> > Hi guys,
> > 
> > Just had another way we might do this. Migrate the completions out to
> > the submitting CPUs rather than migrate submission into the completing
> > CPU.
> > 
> > I've got a basic patch that passes some stress testing. It seems fairly
> > simple to do at the block layer, and the bulk of the patch involves
> > introducing a scalable smp_call_function for it.
> > 
> > Now it could be optimised more by looking at batching up IPIs or
> > optimising the call function path or even mirating the completion event
> > at a different level...
> > 
> > However, this is a first cut. It actually seems like it might be taking
> > slightly more CPU to process block IO (~0.2%)... however, this is on my
> > dual core system that shares an llc, which means that there are very few
> > cache benefits to the migration, but non-zero overhead. So on multisocket
> > systems hopefully it might get to positive territory.
> 
> That's pretty funny, I did pretty much the exact same thing last week!

Oh nice ;)


> The primary difference between yours and mine is that I used a more
> private interface to signal a softirq raise on another CPU, instead of
> allocating call data and exposing a generic interface. That put the
> locking in blk-core instead, turning blk_cpu_done into a structure with
> a lock and list_head instead of just being a list head, and intercepted
> at blk_complete_request() time instead of waiting for an already raised
> softirq on that CPU.

Yeah I was looking at that... didn't really want to add the spinlock
overhead to the non-migration case. Anyway, I guess that sort of fine
implementation detail is going to have to be sorted out with
results.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-04 10:31     ` Nick Piggin
@ 2008-02-04 10:33       ` Jens Axboe
  2008-02-04 22:28         ` James Bottomley
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2008-02-04 10:33 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, linux-kernel, arjan, mingo, ak,
	James.Bottomley, andrea, clameter, akpm, andrew.vasquez, willy,
	Zach Brown

On Mon, Feb 04 2008, Nick Piggin wrote:
> On Mon, Feb 04, 2008 at 11:12:44AM +0100, Jens Axboe wrote:
> > On Sun, Feb 03 2008, Nick Piggin wrote:
> > > On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> > > 
> > > Hi guys,
> > > 
> > > Just had another way we might do this. Migrate the completions out to
> > > the submitting CPUs rather than migrate submission into the completing
> > > CPU.
> > > 
> > > I've got a basic patch that passes some stress testing. It seems fairly
> > > simple to do at the block layer, and the bulk of the patch involves
> > > introducing a scalable smp_call_function for it.
> > > 
> > > Now it could be optimised more by looking at batching up IPIs or
> > > optimising the call function path or even mirating the completion event
> > > at a different level...
> > > 
> > > However, this is a first cut. It actually seems like it might be taking
> > > slightly more CPU to process block IO (~0.2%)... however, this is on my
> > > dual core system that shares an llc, which means that there are very few
> > > cache benefits to the migration, but non-zero overhead. So on multisocket
> > > systems hopefully it might get to positive territory.
> > 
> > That's pretty funny, I did pretty much the exact same thing last week!
> 
> Oh nice ;)
> 
> 
> > The primary difference between yours and mine is that I used a more
> > private interface to signal a softirq raise on another CPU, instead of
> > allocating call data and exposing a generic interface. That put the
> > locking in blk-core instead, turning blk_cpu_done into a structure with
> > a lock and list_head instead of just being a list head, and intercepted
> > at blk_complete_request() time instead of waiting for an already raised
> > softirq on that CPU.
> 
> Yeah I was looking at that... didn't really want to add the spinlock
> overhead to the non-migration case. Anyway, I guess that sort of
> fine implementation details is going to have to be sorted out with
> results.

As Andi mentions, we can look into making that lockless. For the initial
implementation I didn't really care, just wanted something to play with
that would nicely allow me to control both the submit and complete side
of the affinity issue.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-04  2:10   ` David Chinner
  2008-02-04  4:14     ` Arjan van de Ven
@ 2008-02-04 18:21     ` Zach Brown
  2008-02-04 20:10       ` Jens Axboe
  1 sibling, 1 reply; 27+ messages in thread
From: Zach Brown @ 2008-02-04 18:21 UTC (permalink / raw)
  To: David Chinner
  Cc: Nick Piggin, Siddha, Suresh B, linux-kernel, arjan, mingo, ak,
	jens.axboe, James.Bottomley, andrea, clameter, akpm,
	andrew.vasquez, willy

[ ugh, still jet lagged. ]

> Hi Nick,
> 
> When Matthew was describing this work at an LCA presentation (not
> sure whether you were at that presentation or not), Zach came up
> with the idea that allowing the submitting application control the
> CPU that the io completion processing was occurring would be a good
> approach to try.  That is, we submit a "completion cookie" with the
> bio that indicates where we want completion to run, rather than
> dictating that completion runs on the submission CPU.
> 
> The reasoning is that only the higher level context really knows
> what is optimal, and that changes from application to application.
> The "complete on the submission CPU" policy _may_ be more optimal
> for database workloads, but it is definitely suboptimal for XFS and
> transaction I/O completion handling because it simply drags a bunch
> of global filesystem state around between all the CPUs running
> completions. In that case, we really only want a single CPU to be
> handling the completions.....
> 
> (Zach - please correct me if I've missed anything)

Yeah, I think Nick's patch (and Jens' approach, presumably) is just the
sort of thing we were hoping for when discussing this during Matthew's talk.

I was imagining the patch a little bit differently (per-cpu tasks, do a
wake_up from the driver instead of cpu nr testing up in blk, work
queues, whatever), but we know how to iron out these kinds of details ;).

> Looking at your patch - if you turn it around so that the
> "submission CPU" field can be specified as the "completion cpu" then
> I think the patch will expose the policy knobs needed to do the
> above.

Yeah, that seems pretty straightforward.

It occurs to me that we might need some logic for noticing that the
desired cpu has been hot-plugged away while the IO was in flight.
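
Probably nothing fancier than a fallback at completion time, along the
lines of (the field name is hypothetical, matching whatever the cookie
ends up being called):

	/* preferred CPU went away while the IO was in flight: complete locally */
	if (rq->completion_cpu < 0 || !cpu_online(rq->completion_cpu))
		rq->completion_cpu = smp_processor_id();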

- z

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-04 18:21     ` Zach Brown
@ 2008-02-04 20:10       ` Jens Axboe
  2008-02-04 21:45         ` Arjan van de Ven
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2008-02-04 20:10 UTC (permalink / raw)
  To: Zach Brown
  Cc: David Chinner, Nick Piggin, Siddha, Suresh B, linux-kernel,
	arjan, mingo, ak, James.Bottomley, andrea, clameter, akpm,
	andrew.vasquez, willy

On Mon, Feb 04 2008, Zach Brown wrote:
> [ ugh, still jet lagged. ]
> 
> > Hi Nick,
> > 
> > When Matthew was describing this work at an LCA presentation (not
> > sure whether you were at that presentation or not), Zach came up
> > with the idea that allowing the submitting application control the
> > CPU that the io completion processing was occurring would be a good
> > approach to try.  That is, we submit a "completion cookie" with the
> > bio that indicates where we want completion to run, rather than
> > dictating that completion runs on the submission CPU.
> > 
> > The reasoning is that only the higher level context really knows
> > what is optimal, and that changes from application to application.
> > The "complete on the submission CPU" policy _may_ be more optimal
> > for database workloads, but it is definitely suboptimal for XFS and
> > transaction I/O completion handling because it simply drags a bunch
> > of global filesystem state around between all the CPUs running
> > completions. In that case, we really only want a single CPU to be
> > handling the completions.....
> > 
> > (Zach - please correct me if I've missed anything)
> 
> Yeah, I think Nick's patch (and Jens' approach, presumably) is just the
> sort of thing we were hoping for when discussing this during Matthew's talk.
> 
> I was imagining the patch a little bit differently (per-cpu tasks, do a
> wake_up from the driver instead of cpu nr testing up in blk, work
> queues, whatever), but we know how to iron out these kinds of details ;).

per-cpu tasks/wq's might be better; it's a little awkward to jump
through hoops

> > Looking at your patch - if you turn it around so that the
> > "submission CPU" field can be specified as the "completion cpu" then
> > I think the patch will expose the policy knobs needed to do the
> > above.
> 
> Yeah, that seems pretty straight forward.
> 
> We might need some logic for noticing that the desired cpu has been
> hot-plugged away while the IO was in flight, it occurs to me.

The softirq completion stuff already handles cpus going away; at least
with my patch that works fine (with a dead flag added).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-04 20:10       ` Jens Axboe
@ 2008-02-04 21:45         ` Arjan van de Ven
  2008-02-05  8:24           ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Arjan van de Ven @ 2008-02-04 21:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Zach Brown, David Chinner, Nick Piggin, Siddha, Suresh B,
	linux-kernel, mingo, ak, James.Bottomley, andrea, clameter, akpm,
	andrew.vasquez, willy

Jens Axboe wrote:
>> I was imagining the patch a little bit differently (per-cpu tasks, do a
>> wake_up from the driver instead of cpu nr testing up in blk, work
>> queues, whatever), but we know how to iron out these kinds of details ;).
> 
> per-cpu tasks/wq's might be better, it's a little awkward to jump
> through hoops
> 

One caveat, btw: when the multiqueue storage hw becomes available for Linux,
we need to figure out how to deal with the preference thing, since there
honoring a "non-logical" preference would be quite expensive (it means
you can't make the local submit queues lockless etc etc). So before we go down
the road of having widespread APIs for this stuff, we need to make sure we're
not going to do something that's going to be really stupid 6 to 18 months down the road.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-03  9:52 ` Nick Piggin
                     ` (3 preceding siblings ...)
  2008-02-04 10:30   ` Andi Kleen
@ 2008-02-04 21:47   ` Siddha, Suresh B
  4 siblings, 0 replies; 27+ messages in thread
From: Siddha, Suresh B @ 2008-02-04 21:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, linux-kernel, arjan, mingo, ak, jens.axboe,
	James.Bottomley, andrea, clameter, akpm, andrew.vasquez, willy,
	Zach Brown

On Sun, Feb 03, 2008 at 10:52:52AM +0100, Nick Piggin wrote:
> Hi guys,
> 
> Just had another way we might do this. Migrate the completions out to
> the submitting CPUs rather than migrate submission into the completing
> CPU.

Hi Nick,

This was the first experiment I tried on a quad-core, four-package SMP
platform, and it didn't show much improvement in my prototype (my
prototype was migrating the softirq to the kblockd context of the
submitting CPU).

In the OLTP workload, quite a bit of activity happens below the block layer,
and by the time we come to the softirq, some damage is already done in
slab, scsi cmds, timers, etc. Last year's OLS paper
(http://ols.108.redhat.com/2007/Reprints/gough-Reprint.pdf)
shows the different cache lines that are contended in the kernel for the
OLTP workload.

Softirq migration should at least reduce the cacheline contention that
happens in the sched and AIO layers. I didn't spend much time on why my
softirq migration patch didn't help much (as I was chasing the bigger bird
of migrating IO submission to the completion CPU at that time). If this
solution has fewer side effects and is easily acceptable, then we can
analyze the softirq migration patch further and find out its potential.

While there is some potential with softirq migration, the full potential
can be exploited by doing IO submission and completion on the same CPU.

thanks,
suresh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-04 10:33       ` Jens Axboe
@ 2008-02-04 22:28         ` James Bottomley
  0 siblings, 0 replies; 27+ messages in thread
From: James Bottomley @ 2008-02-04 22:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Nick Piggin, Siddha, Suresh B, linux-kernel, arjan, mingo, ak,
	andrea, clameter, akpm, andrew.vasquez, willy, Zach Brown

On Mon, 2008-02-04 at 05:33 -0500, Jens Axboe wrote:
> As Andi mentions, we can look into making that lockless. For the initial
> implementation I didn't really care, just wanted something to play with
> that would nicely allow me to control both the submit and complete side
> of the affinity issue.

Sorry, late to the party ... it went to my steeleye address, not my
current one.

Could you try re-running the tests with a low queue depth (say around 8)
and the card interrupt bound to a single CPU?

The reason for asking you to do this is that it should emulate almost
precisely what you're looking for: the submit path will be picked up in
the SCSI softirq where the queue gets run, so you should find that all
submissions and returns happen on a single CPU, and everything gets cache
hot there.

James

p.s. if everyone could also update my email address to the
hansenpartnership one, the people at steeleye who monitor my old email
account would be grateful.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-04 10:09         ` Nick Piggin
@ 2008-02-05  0:14           ` David Chinner
  2008-02-08  7:50             ` Nick Piggin
  0 siblings, 1 reply; 27+ messages in thread
From: David Chinner @ 2008-02-05  0:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: David Chinner, Arjan van de Ven, Siddha, Suresh B, linux-kernel,
	mingo, ak, jens.axboe, James.Bottomley, andrea, clameter, akpm,
	andrew.vasquez, willy, Zach Brown

On Mon, Feb 04, 2008 at 11:09:59AM +0100, Nick Piggin wrote:
> You get better behaviour in the slab and page allocators and locality
> and cache hotness of memory. For example, I guess in a filesystem /
> pagecache heavy workload, you have to touch each struct page, buffer head,
> fs private state, and also often have to wake the thread for completion.
> Much of this data has just been touched at submit time, so doin this on
> the same CPU is nice...

[....]

> I'm surprised that the xfs global state bouncing would outweigh the
> bouncing of all the per-page/block/bio/request/etc data that gets touched
> during completion. We'll see.

per-page/block/bio/request/etc data is local to a single I/O. The only
penalty is a cacheline bounce for each of the structures from one
CPU to another.  That is, there is no global state modified by these
completions.

The real issue is metadata. The transaction log I/O completion
funnels through a state machine protected by a single lock, which
means completions on different CPUs pull that lock to all
completion CPUs. Given that the same lock is used during transaction
completion for other state transitions (in task context, not intr),
the more CPUs that are active at once, the worse the problem gets.

Then there's metadata I/O completion, which funnels through a larger
set of global locks in the transaction subsystem (e.g. the active
item list lock, the log reservation locks, the log state lock, etc.),
which once again means the more CPUs we have delivering I/O
completions, the worse the problem gets.
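
In other words, every completion ends up doing something of this shape,
no matter which CPU the interrupt landed on (names invented here for
illustration; the point is the single shared lock, not the details):

	/* log I/O completion: runs on whichever CPU took the interrupt */
	static void log_io_done(struct log *log)
	{
		spin_lock(&log->state_lock);	/* same lock also taken by
						 * transaction-side state changes
						 * running in task context */
		advance_log_state_machine(log);
		spin_unlock(&log->state_lock);
	}

With completions spread over N CPUs, that lock and the state it protects
ping-pong between all of them; funnelling completions onto one CPU keeps
them hot in a single cache.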

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-04 21:45         ` Arjan van de Ven
@ 2008-02-05  8:24           ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2008-02-05  8:24 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Zach Brown, David Chinner, Nick Piggin, Siddha, Suresh B,
	linux-kernel, mingo, ak, James.Bottomley, andrea, clameter, akpm,
	andrew.vasquez, willy

On Mon, Feb 04 2008, Arjan van de Ven wrote:
> Jens Axboe wrote:
> >>I was imagining the patch a little bit differently (per-cpu tasks, do a
> >>wake_up from the driver instead of cpu nr testing up in blk, work
> >>queues, whatever), but we know how to iron out these kinds of details ;).
> >
> >per-cpu tasks/wq's might be better, it's a little awkward to jump
> >through hoops
> >
> 
> one caveat btw; when the multiqueue storage hw becomes available for Linux,
> we need to figure out how to deal with the preference thing; since there
> honoring a "non-logical" preference would be quite expensive (it means

non-local?

> you can't make the local submit queues lockless etc etc), so before we
> go down the road of having widespread APIs for this stuff.. we need to
> make sure we're not going to do something that's going to be really
> stupid 6 to 18 months down the road.

As far as I'm concerned, so far this is just playing around with
affinity (and to some extent taking it too far, on purpose). For
instance, my current patch can move submissions and completions
independently, with a set mask or by 'binding' a request to a CPU. Most
of that doesn't make sense; 'complete on the same CPU, if possible'
makes sense and would fit fine with multi-queue hw.

Moving submissions at the block layer to a defined set of CPUs is a bit
silly imho; it's pretty costly, and it's a lot more sane to simply bind
the submitters instead. So if you can set irq affinity, then just make
the submitters follow that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [rfc] direct IO submission and completion scalability issues
  2008-02-05  0:14           ` David Chinner
@ 2008-02-08  7:50             ` Nick Piggin
  0 siblings, 0 replies; 27+ messages in thread
From: Nick Piggin @ 2008-02-08  7:50 UTC (permalink / raw)
  To: David Chinner
  Cc: Arjan van de Ven, Siddha, Suresh B, linux-kernel, mingo, ak,
	jens.axboe, James.Bottomley, andrea, clameter, akpm,
	andrew.vasquez, willy, Zach Brown

On Tue, Feb 05, 2008 at 11:14:19AM +1100, David Chinner wrote:
> On Mon, Feb 04, 2008 at 11:09:59AM +0100, Nick Piggin wrote:
> > You get better behaviour in the slab and page allocators and locality
> > and cache hotness of memory. For example, I guess in a filesystem /
> > pagecache heavy workload, you have to touch each struct page, buffer head,
> > fs private state, and also often have to wake the thread for completion.
> > Much of this data has just been touched at submit time, so doin this on
> > the same CPU is nice...
> 
> [....]
> 
> > I'm surprised that the xfs global state bouncing would outweigh the
> > bouncing of all the per-page/block/bio/request/etc data that gets touched
> > during completion. We'll see.
> 
> per-page/block.bio/request/etc is local to a single I/O. the only
> penalty is a cacheline bounce for each of the structures from one
> CPU to another.  That is, there is no global state modified by these
> completions.

Yeah, but it is going from _all_ submitting CPUs to the one completing
CPU. So you could bottleneck the interconnect at the completing CPU
just as much as if you had cachelines being pulled the other way (ie.
many CPUs trying to pull in a global cacheline).

 
> The real issue is metadata. The transaction log I/O completion
> funnels through a state machine protected by a single lock, which
> means completions on different CPUs pulls that lock to all
> completion CPUs. Given that the same lock is used during transaction
> completion for other state transitions (in task context, not intr),
> the more cpus active at once touches, the worse the problem gets.

OK, once you add locking (and not simply cacheline contention), then
the problem gets harder, I agree. But I think that if the submitting
side takes the same locks as log completion (e.g. maybe for starting a
new transaction), then it is not going to be a clear win either way,
and you'd have to measure it in the end.


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2008-02-08  7:50 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-28  1:21 [rfc] direct IO submission and completion scalability issues Siddha, Suresh B
2007-07-30 18:20 ` Christoph Lameter
2007-07-30 20:35   ` Siddha, Suresh B
2007-07-31  4:19     ` Nick Piggin
2007-07-31 17:14       ` Siddha, Suresh B
2007-08-01  0:41         ` Nick Piggin
2007-08-01  0:55           ` Siddha, Suresh B
2007-08-01  1:24             ` Nick Piggin
2008-02-03  9:52 ` Nick Piggin
2008-02-03 10:53   ` Pekka Enberg
2008-02-03 11:58     ` Nick Piggin
2008-02-04  2:10   ` David Chinner
2008-02-04  4:14     ` Arjan van de Ven
2008-02-04  4:40       ` David Chinner
2008-02-04 10:09         ` Nick Piggin
2008-02-05  0:14           ` David Chinner
2008-02-08  7:50             ` Nick Piggin
2008-02-04 18:21     ` Zach Brown
2008-02-04 20:10       ` Jens Axboe
2008-02-04 21:45         ` Arjan van de Ven
2008-02-05  8:24           ` Jens Axboe
2008-02-04 10:12   ` Jens Axboe
2008-02-04 10:31     ` Nick Piggin
2008-02-04 10:33       ` Jens Axboe
2008-02-04 22:28         ` James Bottomley
2008-02-04 10:30   ` Andi Kleen
2008-02-04 21:47   ` Siddha, Suresh B
