LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>,
	Vladislav Bolkhovitin <vst@vlnb.net>,
	Bart Van Assche <bart.vanassche@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>,
	linux-scsi@vger.kernel.org, scst-devel@lists.sourceforge.net,
	linux-kernel@vger.kernel.org,
	Mike Christie <michaelc@cs.wisc.edu>
Subject: Re: Integration of SCST in the mainstream Linux kernel
Date: Mon, 04 Feb 2008 11:06:29 -0800	[thread overview]
Message-ID: <1202151989.11265.576.camel@haakon2.linux-iscsi.org> (raw)
In-Reply-To: <alpine.LFD.1.00.0802041018390.3034@hp.linux-foundation.org>

On Mon, 2008-02-04 at 10:29 -0800, Linus Torvalds wrote:
> 
> On Mon, 4 Feb 2008, James Bottomley wrote:
> > 
> > The way a user space solution should work is to schedule mmapped I/O
> > from the backing store and then send this mmapped region off for target
> > I/O.
> 
> mmap'ing may avoid the copy, but the overhead of a mmap operation is 
> quite often much *bigger* than the overhead of a copy operation.
> 
> Please do not advocate the use of mmap() as a way to avoid memory copies. 
> It's not realistic. Even if you can do it with a single "mmap()" system 
> call (which is not at all a given, considering that block devices can 
> easily be much larger than the available virtual memory space), the fact 
> is that page table games along with the fault (and even just TLB miss) 
> overhead is easily more than the cost of copying a page in a nice 
> streaming manner.
> 
> Yes, memory is "slow", but dammit, so is mmap().
> 
> > You also have to pull tricks with the mmap region in the case of writes 
> > to prevent useless data being read in from the backing store.  However, 
> > none of this involves data copies.
> 
> "data copies" is irrelevant. The only thing that matters is performance. 
> And if avoiding data copies is more costly (or even of a similar cost) 
> than the copies themselves would have been, there is absolutely no upside, 
> and only downsides due to extra complexity.
> 

The iSER spec (RFC-5046) quotes the following in the TCP case for direct
data placement:

"  Out-of-order TCP segments in the Traditional iSCSI model have to be
   stored and reassembled before the iSCSI protocol layer within an end
   node can place the data in the iSCSI buffers.  This reassembly is
   required because not every TCP segment is likely to contain an iSCSI
   header to enable its placement, and TCP itself does not have a
   built-in mechanism for signaling Upper Level Protocol (ULP) message
   boundaries to aid placement of out-of-order segments.  This TCP
   reassembly at high network speeds is quite counter-productive for the
   following reasons: wasted memory bandwidth in data copying, the need
   for reassembly memory, wasted CPU cycles in data copying, and the
   general store-and-forward latency from an application perspective."

While this does not have anything to do directly with the kernel vs. user discussion
for target mode storage engine, the scaling and latency case is easy enough
to make if we are talking about scaling TCP for 10 Gb/sec storage fabrics.

> If you want good performance for a service like this, you really generally 
> *do* need to in kernel space. You can play games in user space, but you're 
> fooling yourself if you think you can do as well as doing it in the 
> kernel. And you're *definitely* fooling yourself if you think mmap() 
> solves performance issues. "Zero-copy" does not equate to "fast". Memory 
> speeds may be slower that core CPU speeds, but not infinitely so!
> 

>From looking at this problem from a kernel space perspective for a
number of years, I would be inclined to believe this is true for
software and hardware data-path cases.  The benefits of moving various
control statemachines for something like say traditional iSCSI to
userspace has always been debateable.  The most obvious ones are things
like authentication, espically if something more complex than CHAP are
the obvious case for userspace.  However, I have thought recovery for
failures caused from communication path (iSCSI connections) or entire
nexuses (iSCSI sessions) failures was very problematic to expect to have
to potentially push down IOs state to userspace.

Keeping statemachines for protocol and/or fabric specific statemachines
(CSM-E and CSM-I from connection recovery in iSCSI and iSER are the
obvious ones) are the best canidates for residing in kernel space.

> (That said: there *are* alternatives to mmap, like "splice()", that really 
> do potentially solve some issues without the page table and TLB overheads. 
> But while splice() avoids the costs of paging, I strongly suspect it would 
> still have easily measurable latency issues. Switching between user and 
> kernel space multiple times is definitely not going to be free, although 
> it's probably not a huge issue if you have big enough requests).
> 

Most of the SCSI OS storage subsystems that I have worked with in the
context of iSCSI have used 256 * 512 byte setctor requests, which the
default traditional iSCSI PDU data payload (MRDSL) being 64k to hit the
sweet spot with crc32c checksum calculations.  I am assuming this is
going to be the case for other fabrics as well.

--nab



  parent reply	other threads:[~2008-02-04 19:07 UTC|newest]

Thread overview: 147+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-01-23 14:22 Bart Van Assche
2008-01-23 17:11 ` Vladislav Bolkhovitin
2008-01-29 20:42 ` James Bottomley
2008-01-29 21:31   ` Roland Dreier
2008-01-29 23:32     ` FUJITA Tomonori
2008-01-30  1:15       ` [Scst-devel] " Vu Pham
2008-01-30  8:38       ` Bart Van Assche
2008-01-30 10:56         ` FUJITA Tomonori
2008-01-30 11:40           ` Vladislav Bolkhovitin
2008-01-30 13:10           ` Bart Van Assche
2008-01-30 13:54             ` FUJITA Tomonori
2008-01-31  7:48               ` Bart Van Assche
2008-01-31 13:25           ` Nicholas A. Bellinger
2008-01-31 14:34             ` Bart Van Assche
2008-01-31 14:44               ` Nicholas A. Bellinger
2008-01-31 15:50               ` Vladislav Bolkhovitin
2008-01-31 16:25                 ` [Scst-devel] " Joe Landman
2008-01-31 17:08                   ` Bart Van Assche
2008-01-31 17:13                     ` Joe Landman
2008-01-31 18:12                     ` David Dillow
2008-02-01 11:50                       ` Vladislav Bolkhovitin
2008-02-01 11:50                     ` Vladislav Bolkhovitin
2008-02-01 12:25                       ` Vladislav Bolkhovitin
2008-01-31 17:14                 ` Nicholas A. Bellinger
2008-01-31 17:40                   ` Bart Van Assche
2008-01-31 18:15                     ` Nicholas A. Bellinger
2008-02-01  9:08                       ` Bart Van Assche
2008-02-01  8:11             ` Bart Van Assche
2008-02-01 10:39               ` Nicholas A. Bellinger
2008-02-01 11:04                 ` Bart Van Assche
2008-02-01 12:05                   ` Nicholas A. Bellinger
2008-02-01 13:25                     ` Bart Van Assche
2008-02-01 14:36                       ` Nicholas A. Bellinger
2008-01-30 16:34         ` James Bottomley
2008-01-30 16:50           ` Bart Van Assche
2008-02-02 15:32           ` Pete Wyckoff
2008-02-05 17:01         ` Erez Zilber
2008-02-06 12:16           ` Bart Van Assche
2008-02-06 16:45             ` Benny Halevy
2008-02-06 17:06             ` Roland Dreier
2008-02-18  9:43             ` Erez Zilber
2008-02-18 11:01               ` Bart Van Assche
2008-02-20  7:34                 ` Erez Zilber
2008-02-20  8:41                   ` Bart Van Assche
2008-01-30 11:18       ` Vladislav Bolkhovitin
2008-01-30  8:29   ` Bart Van Assche
2008-01-30 16:22     ` James Bottomley
2008-01-30 17:03       ` Bart Van Assche
2008-02-05  7:14       ` [Scst-devel] " Tomasz Chmielewski
2008-02-05 13:38         ` FUJITA Tomonori
2008-02-05 16:07           ` Tomasz Chmielewski
2008-02-05 16:21             ` Ming Zhang
2008-02-05 16:43             ` FUJITA Tomonori
2008-02-05 17:09           ` Matteo Tescione
2008-02-06  1:29             ` FUJITA Tomonori
2008-02-06  2:01               ` Nicholas A. Bellinger
2008-01-30 11:17   ` Vladislav Bolkhovitin
2008-02-04 12:27     ` Vladislav Bolkhovitin
2008-02-04 13:53       ` Bart Van Assche
2008-02-04 17:00         ` David Dillow
2008-02-04 17:08         ` Vladislav Bolkhovitin
2008-02-05 16:25         ` Bart Van Assche
2008-02-05 18:18           ` Linus Torvalds
2008-02-04 15:30       ` James Bottomley
2008-02-04 16:25         ` Vladislav Bolkhovitin
2008-02-04 17:06           ` James Bottomley
2008-02-04 17:16             ` Vladislav Bolkhovitin
2008-02-04 17:25               ` James Bottomley
2008-02-04 17:56                 ` Vladislav Bolkhovitin
2008-02-04 18:22                   ` James Bottomley
2008-02-04 18:38                     ` Vladislav Bolkhovitin
2008-02-04 18:54                       ` James Bottomley
2008-02-05 18:59                         ` Vladislav Bolkhovitin
2008-02-05 19:13                           ` James Bottomley
2008-02-06 18:07                             ` Vladislav Bolkhovitin
2008-02-07 13:13                             ` [Scst-devel] " Bart Van Assche
2008-02-07 13:45                               ` Vladislav Bolkhovitin
2008-02-07 22:51                                 ` david
2008-02-08 10:37                                   ` Vladislav Bolkhovitin
2008-02-09  7:40                                     ` david
2008-02-08 11:33                                   ` Nicholas A. Bellinger
2008-02-08 14:36                                     ` Vladislav Bolkhovitin
2008-02-08 23:53                                       ` Nicholas A. Bellinger
2008-02-15 15:02                                 ` Bart Van Assche
2008-02-07 15:38                               ` [Scst-devel] " Nicholas A. Bellinger
2008-02-07 20:37                                 ` Luben Tuikov
2008-02-08 10:32                                   ` Vladislav Bolkhovitin
2008-02-09  7:32                                     ` Luben Tuikov
2008-02-11 10:02                                       ` Vladislav Bolkhovitin
2008-02-08 11:53                                   ` [Scst-devel] " Nicholas A. Bellinger
2008-02-08 14:42                                     ` Vladislav Bolkhovitin
2008-02-09  0:00                                       ` Nicholas A. Bellinger
2008-02-04 18:29                 ` Linus Torvalds
2008-02-04 18:49                   ` James Bottomley
2008-02-04 19:06                   ` Nicholas A. Bellinger [this message]
2008-02-04 19:19                     ` Nicholas A. Bellinger
2008-02-04 19:44                     ` Linus Torvalds
2008-02-04 20:06                       ` [Scst-devel] " 4news
2008-02-04 20:24                       ` Nicholas A. Bellinger
2008-02-04 21:01                       ` J. Bruce Fields
2008-02-04 21:24                         ` Linus Torvalds
2008-02-04 22:00                           ` Nicholas A. Bellinger
2008-02-04 22:57                           ` Jeff Garzik
2008-02-04 23:45                             ` Linus Torvalds
2008-02-05  0:08                               ` Jeff Garzik
2008-02-05  1:20                                 ` Linus Torvalds
2008-02-05  8:38                             ` Bart Van Assche
2008-02-05 17:50                               ` Jeff Garzik
2008-02-06 10:22                                 ` Bart Van Assche
2008-02-06 14:21                                   ` Jeff Garzik
2008-02-05 13:05                             ` Olivier Galibert
2008-02-05 18:08                               ` Jeff Garzik
2008-02-05 19:01                           ` Vladislav Bolkhovitin
2008-02-04 22:43                       ` Alan Cox
2008-02-04 17:30                         ` Douglas Gilbert
2008-02-05  2:07                           ` [Scst-devel] " Chris Weiss
2008-02-05 14:19                             ` FUJITA Tomonori
2008-02-04 22:59                         ` Nicholas A. Bellinger
2008-02-04 23:00                         ` James Bottomley
2008-02-04 23:12                           ` Nicholas A. Bellinger
2008-02-04 23:16                             ` Nicholas A. Bellinger
2008-02-05 18:37                             ` James Bottomley
2008-02-04 23:04                         ` Jeff Garzik
2008-02-04 23:27                           ` Linus Torvalds
2008-02-05 19:01                           ` Vladislav Bolkhovitin
2008-02-05 19:12                             ` Jeff Garzik
2008-02-05 19:21                               ` Vladislav Bolkhovitin
2008-02-06  0:11                                 ` Nicholas A. Bellinger
2008-02-06  1:43                                   ` Nicholas A. Bellinger
2008-02-12 16:05                                   ` [Scst-devel] " Bart Van Assche
2008-02-13  3:44                                     ` Nicholas A. Bellinger
2008-02-13  6:18                                       ` CONFIG_SLUB and reproducable general protection faults on 2.6.2x Nicholas A. Bellinger
2008-02-13 16:37                                         ` Nicholas A. Bellinger
2008-02-06  0:17                               ` Integration of SCST in the mainstream Linux kernel Nicholas A. Bellinger
2008-02-06  0:48                             ` Nicholas A. Bellinger
2008-02-06  0:51                               ` Nicholas A. Bellinger
2008-02-05  0:07                         ` Matt Mackall
2008-02-05  0:24                           ` Linus Torvalds
2008-02-05  0:42                             ` Jeff Garzik
2008-02-05  0:45                             ` Matt Mackall
2008-02-05  4:43                             ` [Scst-devel] " Matteo Tescione
2008-02-05  5:07                               ` James Bottomley
2008-02-05 13:38                               ` FUJITA Tomonori
2008-02-05 19:00                       ` Vladislav Bolkhovitin
2008-02-05 17:10 ` Erez Zilber
2008-02-05 19:02   ` Bart Van Assche
2008-02-05 19:02   ` Vladislav Bolkhovitin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1202151989.11265.576.camel@haakon2.linux-iscsi.org \
    --to=nab@linux-iscsi.org \
    --cc=James.Bottomley@HansenPartnership.com \
    --cc=akpm@linux-foundation.org \
    --cc=bart.vanassche@gmail.com \
    --cc=fujita.tomonori@lab.ntt.co.jp \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=michaelc@cs.wisc.edu \
    --cc=scst-devel@lists.sourceforge.net \
    --cc=torvalds@linux-foundation.org \
    --cc=vst@vlnb.net \
    --subject='Re: Integration of SCST in the mainstream Linux kernel' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).