LKML Archive on lore.kernel.org
From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
To: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>,
	linux-kernel@vger.kernel.org, arjan@linux.intel.com,
	mingo@elte.hu, npiggin@suse.de, ak@suse.de,
	jens.axboe@oracle.com, James.Bottomley@SteelEye.com,
	andrea@suse.de, akpm@linux-foundation.org,
	andrew.vasquez@qlogic.com
Subject: Re: [rfc] direct IO submission and completion scalability issues
Date: Mon, 30 Jul 2007 13:35:19 -0700
Message-ID: <20070730203519.GD10033@linux-os.sc.intel.com>
In-Reply-To: <Pine.LNX.4.64.0707301114060.743@schroedinger.engr.sgi.com>

On Mon, Jul 30, 2007 at 11:20:04AM -0700, Christoph Lameter wrote:
> On Fri, 27 Jul 2007, Siddha, Suresh B wrote:
> 
> > We have been looking into the linux kernel direct IO scalability issues with
> > database workloads. Comments and suggestions on our below experiments are
> > welcome.
> 
> This was on an SMP system? These issues are much more pronounced on a NUMA 
> system. There the locality of the device may be a prime issue.

We are looking into both SMP (multi-core) and NUMA systems.

> Yes. The issue is even worse if the submission comes from a remote node. 
> For example, take a system with a scsi controller on node 2, with I/O 
> submission on node 1 and completion on node 2. In that case the 
> cacheline has to be transferred across the NUMA interlink.
> 
> However, you cannot avoid running the completion on the node where the 
> device sits. The device has all sorts of control structures and if you 
> would handle the completion on node 1 then it would have to transfer lots
> of cachelines that contain device state to node 1.

If the device supports multiple queues, then the placement of some of the
control structures and the irq balancing can be based on how those queues
are distributed.

> I think it is better to leave things as is. Or have the I/O submission be 
> relocated to the node of the device.

In the absence of specialized controllers, it is best to keep the control
structures close to the device node and move the I/O submission to this node.
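
As a rough illustration of that placement rule, here is a minimal user-space
sketch in Python, not kernel code. The topology table, the function names and
the node/cpu numbers are all hypothetical; it only shows the decision "stay
put if already local to the device, otherwise move to a cpu on the device's
node".

```python
from typing import Dict, List

# Hypothetical topology (assumption for illustration): NUMA node -> cpus on it.
NODE_CPUS: Dict[int, List[int]] = {1: [0, 1, 2, 3], 2: [4, 5, 6, 7]}


def node_of_cpu(cpu: int) -> int:
    """Return the node that owns `cpu` in the hypothetical topology."""
    for node, cpus in NODE_CPUS.items():
        if cpu in cpus:
            return node
    raise ValueError(f"unknown cpu {cpu}")


def pick_submission_cpu(request_cpu: int, device_node: int) -> int:
    """Choose the cpu on which to submit an IO request.

    A request that arrived on the device's node stays where it is; a remote
    request is moved to a cpu on the device's node.
    """
    if node_of_cpu(request_cpu) == device_node:
        return request_cpu
    cpus = NODE_CPUS[device_node]
    # Spread remote submitters over the device node's cpus.
    return cpus[request_cpu % len(cpus)]
```

With the controller on node 2, a request arriving on cpu 1 (node 1) is moved
to a node 2 cpu, while a request arriving on cpu 5 is submitted in place.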

> > The second experiment we did was migrating the IO submission to the
> > IO completion cpu. Instead of submitting the IO on the same cpu where the
> > request arrived, in this experiment the IO submission gets migrated to the
> > cpu that is processing IO completions (interrupt). This will minimize the
> > access to remote cachelines (that happens in timers, slab, scsi layers). The
> > IO submission request is forwarded to the kblockd thread on the cpu receiving
> > the interrupts. As part of this, we also made the kblockd thread on each cpu the
> > highest priority thread, so that IO gets submitted as soon as possible on the
> > interrupt cpu without any delay. On an x86_64 SMP platform with 16 cores, this
> > resulted in a 2% performance improvement, and in a 3.3% improvement on a
> > two-node ia64 platform.
> 
> I think that is the right approach. This will also help in cases where I/O 
> devices can only be accessed from a certain node (NUMA device address 
> restrictions on some systems may not allow remote cacheline access!)

Ok, there we have no other choice ;-)

> > Observation #2: This introduces some migration overhead during IO submission.
> > With the current prototype, every incoming IO request results in an IPI and
> > a context switch (to the kblockd thread) on the interrupt processing cpu.
> > This issue needs to be addressed, and the main challenge is finding an
> > efficient mechanism for this IO migration (how much batching to do and
> > when to send the migrate request?), so that we don't delay the IO much and, at
> > the same time, don't cause much overhead during migration.
> 
> Right.

So any suggestions for making this clean and acceptable to everyone?
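
To make the batching question concrete, here is a minimal user-space sketch
in Python of one possible policy; it is not what the prototype does. Requests
bound for the interrupt cpu are held until either a batch fills up or the
oldest request has waited too long, and each flush stands in for one IPI plus
one kblockd wakeup. The class name and both tunables (max_batch,
max_delay_us) are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MigrationBatcher:
    """Hold IO requests bound for the interrupt cpu and decide when to forward.

    One forward models one IPI plus one kblockd wakeup on the interrupt cpu.
    """
    max_batch: int = 8        # assumed tunable: flush once this many are pending
    max_delay_us: int = 50    # assumed tunable: flush once the oldest waited this long
    pending: List[str] = field(default_factory=list)
    oldest_us: Optional[int] = None
    forwards: int = 0         # number of simulated IPIs sent so far

    def submit(self, req: str, now_us: int) -> List[str]:
        """Queue `req`; return the batch forwarded by this call (may be empty)."""
        if not self.pending:
            self.oldest_us = now_us
        self.pending.append(req)
        return self._maybe_flush(now_us)

    def tick(self, now_us: int) -> List[str]:
        """Periodic check so that a lone request is not delayed indefinitely."""
        return self._maybe_flush(now_us)

    def _maybe_flush(self, now_us: int) -> List[str]:
        if not self.pending:
            return []
        full = len(self.pending) >= self.max_batch
        stale = now_us - self.oldest_us >= self.max_delay_us
        if not (full or stale):
            return []
        batch, self.pending = self.pending, []
        self.oldest_us = None
        self.forwards += 1
        return batch
```

A larger max_batch means fewer IPIs per IO but more added latency for the
first request in each batch; max_delay_us bounds that latency. Picking the
two values is exactly the open question above.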

thanks,
suresh
