LKML Archive on lore.kernel.org
* Integration of SCST in the mainstream Linux kernel @ 2008-01-23 14:22 Bart Van Assche 2008-01-23 17:11 ` Vladislav Bolkhovitin ` (2 more replies) 0 siblings, 3 replies; 148+ messages in thread From: Bart Van Assche @ 2008-01-23 14:22 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton, Vladislav Bolkhovitin, James.Bottomley, FUJITA Tomonori Cc: linux-scsi, scst-devel, linux-kernel As you probably know there is a trend in enterprise computing towards networked storage. This is illustrated by the emergence during the past few years of standards like SRP (SCSI RDMA Protocol), iSCSI (Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different pieces of software are necessary to make networked storage possible: initiator software and target software. As far as I know there exist three different SCSI target implementations for Linux: - The iSCSI Enterprise Target Daemon (IETD, http://iscsitarget.sourceforge.net/); - The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/); - The Generic SCSI Target Middle Level for Linux project (SCST, http://scst.sourceforge.net/). Since I was wondering which SCSI target software would be best suited for an InfiniBand network, I started evaluating the STGT and SCST SCSI target implementations. Apparently the performance difference between STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks, but the SCST target software outperforms the STGT software on an InfiniBand network. See also the following thread for the details: http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel. About the design of the SCST software: while one of the goals of the STGT project was to keep the in-kernel code minimal, the SCST project implements the whole SCSI target in kernel space. SCST is implemented as a set of new kernel modules, only minimal changes to the existing kernel are necessary before the SCST kernel modules can be used. This is the same approach that will be followed in the very near future in the OpenSolaris kernel (see also http://opensolaris.org/os/project/comstar/). More information about the design of SCST can be found here: http://scst.sourceforge.net/doc/scst_pg.html. My impression is that both the STGT and SCST projects are well designed, well maintained and have a considerable user base. According to the SCST maintainer (Vladislav Bolkhovitin), SCST is superior to STGT with respect to features, performance, maturity, stability, and number of existing target drivers. Unfortunately the SCST kernel code lives outside the kernel tree, which makes SCST harder to use than STGT. As an SCST user, I would like to see the SCST kernel code integrated in the mainstream kernel because of its excellent performance on an InfiniBand network. Since the SCST project comprises about 14 KLOC, reviewing the SCST code will take considerable time. Who will do this reviewing work ? And with regard to the comments made by the reviewers: Vladislav, do you have the time to carry out the modifications requested by the reviewers ? I expect a.o. that reviewers will ask to move SCST's configuration pseudofiles from procfs to sysfs. Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-23 14:22 Integration of SCST in the mainstream Linux kernel Bart Van Assche @ 2008-01-23 17:11 ` Vladislav Bolkhovitin 2008-01-29 20:42 ` James Bottomley 2008-02-05 17:10 ` Erez Zilber 2 siblings, 0 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-01-23 17:11 UTC (permalink / raw) To: Bart Van Assche Cc: Linus Torvalds, Andrew Morton, James.Bottomley, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel Bart Van Assche wrote: > As you probably know there is a trend in enterprise computing towards > networked storage. This is illustrated by the emergence during the > past few years of standards like SRP (SCSI RDMA Protocol), iSCSI > (Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different > pieces of software are necessary to make networked storage possible: > initiator software and target software. As far as I know there exist > three different SCSI target implementations for Linux: > - The iSCSI Enterprise Target Daemon (IETD, > http://iscsitarget.sourceforge.net/); > - The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/); > - The Generic SCSI Target Middle Level for Linux project (SCST, > http://scst.sourceforge.net/). > Since I was wondering which SCSI target software would be best suited > for an InfiniBand network, I started evaluating the STGT and SCST SCSI > target implementations. Apparently the performance difference between > STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks, > but the SCST target software outperforms the STGT software on an > InfiniBand network. See also the following thread for the details: > http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel. > > About the design of the SCST software: while one of the goals of the > STGT project was to keep the in-kernel code minimal, the SCST project > implements the whole SCSI target in kernel space. SCST is implemented > as a set of new kernel modules, only minimal changes to the existing > kernel are necessary before the SCST kernel modules can be used. This > is the same approach that will be followed in the very near future in > the OpenSolaris kernel (see also > http://opensolaris.org/os/project/comstar/). More information about > the design of SCST can be found here: > http://scst.sourceforge.net/doc/scst_pg.html. > > My impression is that both the STGT and SCST projects are well > designed, well maintained and have a considerable user base. According > to the SCST maintainer (Vladislav Bolkhovitin), SCST is superior to > STGT with respect to features, performance, maturity, stability, and > number of existing target drivers. Unfortunately the SCST kernel code > lives outside the kernel tree, which makes SCST harder to use than > STGT. > > As an SCST user, I would like to see the SCST kernel code integrated > in the mainstream kernel because of its excellent performance on an > InfiniBand network. Since the SCST project comprises about 14 KLOC, > reviewing the SCST code will take considerable time. Who will do this > reviewing work ? And with regard to the comments made by the > reviewers: Vladislav, do you have the time to carry out the > modifications requested by the reviewers ? I expect a.o. that > reviewers will ask to move SCST's configuration pseudofiles from > procfs to sysfs. Sure, I do, although I personally don't see much sense in such move. > Bart Van Assche. 
^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-23 14:22 Integration of SCST in the mainstream Linux kernel Bart Van Assche 2008-01-23 17:11 ` Vladislav Bolkhovitin @ 2008-01-29 20:42 ` James Bottomley 2008-01-29 21:31 ` Roland Dreier ` (2 more replies) 2008-02-05 17:10 ` Erez Zilber 2 siblings, 3 replies; 148+ messages in thread From: James Bottomley @ 2008-01-29 20:42 UTC (permalink / raw) To: Bart Van Assche Cc: Linus Torvalds, Andrew Morton, Vladislav Bolkhovitin, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel On Wed, 2008-01-23 at 15:22 +0100, Bart Van Assche wrote: > As you probably know there is a trend in enterprise computing towards > networked storage. This is illustrated by the emergence during the > past few years of standards like SRP (SCSI RDMA Protocol), iSCSI > (Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different > pieces of software are necessary to make networked storage possible: > initiator software and target software. As far as I know there exist > three different SCSI target implementations for Linux: > - The iSCSI Enterprise Target Daemon (IETD, > http://iscsitarget.sourceforge.net/); > - The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/); > - The Generic SCSI Target Middle Level for Linux project (SCST, > http://scst.sourceforge.net/). > Since I was wondering which SCSI target software would be best suited > for an InfiniBand network, I started evaluating the STGT and SCST SCSI > target implementations. Apparently the performance difference between > STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks, > but the SCST target software outperforms the STGT software on an > InfiniBand network. See also the following thread for the details: > http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel. That doesn't seem to pull up a thread. However, I assume it's these figures: ............................................................................................. . . STGT read SCST read . STGT read SCST read . . . performance performance . performance performance . . . (0.5K, MB/s) (0.5K, MB/s) . (1 MB, MB/s) (1 MB, MB/s) . ............................................................................................. . Ethernet (1 Gb/s network) . 77 78 . 77 89 . . IPoIB (8 Gb/s network) . 163 185 . 201 239 . . iSER (8 Gb/s network) . 250 N/A . 360 N/A . . SRP (8 Gb/s network) . N/A 421 . N/A 683 . ............................................................................................. On the comparable figures, which only seem to be IPoIB they're showing a 13-18% variance, aren't they? Which isn't an incredible difference. > About the design of the SCST software: while one of the goals of the > STGT project was to keep the in-kernel code minimal, the SCST project > implements the whole SCSI target in kernel space. SCST is implemented > as a set of new kernel modules, only minimal changes to the existing > kernel are necessary before the SCST kernel modules can be used. This > is the same approach that will be followed in the very near future in > the OpenSolaris kernel (see also > http://opensolaris.org/os/project/comstar/). More information about > the design of SCST can be found here: > http://scst.sourceforge.net/doc/scst_pg.html. > > My impression is that both the STGT and SCST projects are well > designed, well maintained and have a considerable user base. 
According > to the SCST maintainer (Vladislav Bolkhovitin), SCST is superior to > STGT with respect to features, performance, maturity, stability, and > number of existing target drivers. Unfortunately the SCST kernel code > lives outside the kernel tree, which makes SCST harder to use than > STGT. > > As an SCST user, I would like to see the SCST kernel code integrated > in the mainstream kernel because of its excellent performance on an > InfiniBand network. Since the SCST project comprises about 14 KLOC, > reviewing the SCST code will take considerable time. Who will do this > reviewing work ? And with regard to the comments made by the > reviewers: Vladislav, do you have the time to carry out the > modifications requested by the reviewers ? I expect a.o. that > reviewers will ask to move SCST's configuration pseudofiles from > procfs to sysfs. The two target architectures perform essentially identical functions, so there's only really room for one in the kernel. Right at the moment, it's STGT. Problems in STGT come from the user<->kernel boundary which can be mitigated in a variety of ways. The fact that the figures are pretty much comparable on non IB networks shows this. I really need a whole lot more evidence than at worst a 20% performance difference on IB to pull one implementation out and replace it with another. Particularly as there's no real evidence that STGT can't be tweaked to recover the 20% even on IB. James ^ permalink raw reply [flat|nested] 148+ messages in thread
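A note on the arithmetic behind the variance quoted above, using the IPoIB row of the table (the only row with figures for both targets): (185 - 163) / 163 is roughly 13.5 % for the 0.5 KB reads and (239 - 201) / 201 is roughly 18.9 % for the 1 MB reads, i.e. approximately the 13-18 % spread mentioned, with the large-block case closer to 19 %.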
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-29 20:42 ` James Bottomley @ 2008-01-29 21:31 ` Roland Dreier 2008-01-29 23:32 ` FUJITA Tomonori 2008-01-30 8:29 ` Bart Van Assche 2008-01-30 11:17 ` Vladislav Bolkhovitin 2 siblings, 1 reply; 148+ messages in thread From: Roland Dreier @ 2008-01-29 21:31 UTC (permalink / raw) To: James Bottomley Cc: Bart Van Assche, Linus Torvalds, Andrew Morton, Vladislav Bolkhovitin, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel > . . STGT read SCST read . STGT read SCST read . > . . performance performance . performance performance . > . . (0.5K, MB/s) (0.5K, MB/s) . (1 MB, MB/s) (1 MB, MB/s) . > . iSER (8 Gb/s network) . 250 N/A . 360 N/A . > . SRP (8 Gb/s network) . N/A 421 . N/A 683 . > On the comparable figures, which only seem to be IPoIB they're showing a > 13-18% variance, aren't they? Which isn't an incredible difference. Maybe I'm all wet, but I think iSER vs. SRP should be roughly comparable. The exact formatting of various messages etc. is different but the data path using RDMA is pretty much identical. So the big difference between STGT iSER and SCST SRP hints at some big difference in the efficiency of the two implementations. - R. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-29 21:31 ` Roland Dreier @ 2008-01-29 23:32 ` FUJITA Tomonori 2008-01-30 1:15 ` [Scst-devel] " Vu Pham ` (2 more replies) 0 siblings, 3 replies; 148+ messages in thread From: FUJITA Tomonori @ 2008-01-29 23:32 UTC (permalink / raw) To: rdreier Cc: James.Bottomley, bart.vanassche, torvalds, akpm, vst, fujita.tomonori, linux-scsi, scst-devel, linux-kernel On Tue, 29 Jan 2008 13:31:52 -0800 Roland Dreier <rdreier@cisco.com> wrote: > > . . STGT read SCST read . STGT read SCST read . > > . . performance performance . performance performance . > > . . (0.5K, MB/s) (0.5K, MB/s) . (1 MB, MB/s) (1 MB, MB/s) . > > . iSER (8 Gb/s network) . 250 N/A . 360 N/A . > > . SRP (8 Gb/s network) . N/A 421 . N/A 683 . > > > On the comparable figures, which only seem to be IPoIB they're showing a > > 13-18% variance, aren't they? Which isn't an incredible difference. > > Maybe I'm all wet, but I think iSER vs. SRP should be roughly > comparable. The exact formatting of various messages etc. is > different but the data path using RDMA is pretty much identical. So > the big difference between STGT iSER and SCST SRP hints at some big > difference in the efficiency of the two implementations. iSER has parameters to limit the maximum size of RDMA (it needs to repeat RDMA with a poor configuration)? Anyway, here's the results from Robin Humble: iSER to 7G ramfs, x86_64, centos4.6, 2.6.22 kernels, git tgtd, initiator end booted with mem=512M, target with 8G ram direct i/o dd write/read 800/751 MB/s dd if=/dev/zero of=/dev/sdc bs=1M count=5000 oflag=direct dd of=/dev/null if=/dev/sdc bs=1M count=5000 iflag=direct http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg13502.html I think that STGT is pretty fast with the fast backing storage. I don't think that there is the notable perfornace difference between kernel-space and user-space SRP (or ISER) implementations about moving data between hosts. IB is expected to enable user-space applications to move data between hosts quickly (if not, what can IB provide us?). I think that the question is how fast user-space applications can do I/Os ccompared with I/Os in kernel space. STGT is eager for the advent of good asynchronous I/O and event notification interfances. One more possible optimization for STGT is zero-copy data transfer. STGT uses pre-registered buffers and move data between page cache and thsse buffers, and then does RDMA transfer. If we implement own caching mechanism to use pre-registered buffers directly with (AIO and O_DIRECT), then STGT can move data without data copies. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-01-29 23:32 ` FUJITA Tomonori @ 2008-01-30 1:15 ` Vu Pham 2008-01-30 8:38 ` Bart Van Assche 2008-01-30 11:18 ` Vladislav Bolkhovitin 2 siblings, 0 replies; 148+ messages in thread From: Vu Pham @ 2008-01-30 1:15 UTC (permalink / raw) To: FUJITA Tomonori Cc: rdreier, vst, linux-scsi, linux-kernel, James.Bottomley, scst-devel, akpm, torvalds FUJITA Tomonori wrote: > On Tue, 29 Jan 2008 13:31:52 -0800 > Roland Dreier <rdreier@cisco.com> wrote: > >> > . . STGT read SCST read . STGT read SCST read . >> > . . performance performance . performance performance . >> > . . (0.5K, MB/s) (0.5K, MB/s) . (1 MB, MB/s) (1 MB, MB/s) . >> > . iSER (8 Gb/s network) . 250 N/A . 360 N/A . >> > . SRP (8 Gb/s network) . N/A 421 . N/A 683 . >> >> > On the comparable figures, which only seem to be IPoIB they're showing a >> > 13-18% variance, aren't they? Which isn't an incredible difference. >> >> Maybe I'm all wet, but I think iSER vs. SRP should be roughly >> comparable. The exact formatting of various messages etc. is >> different but the data path using RDMA is pretty much identical. So >> the big difference between STGT iSER and SCST SRP hints at some big >> difference in the efficiency of the two implementations. > > iSER has parameters to limit the maximum size of RDMA (it needs to > repeat RDMA with a poor configuration)? > > > Anyway, here's the results from Robin Humble: > > iSER to 7G ramfs, x86_64, centos4.6, 2.6.22 kernels, git tgtd, > initiator end booted with mem=512M, target with 8G ram > > direct i/o dd > write/read 800/751 MB/s > dd if=/dev/zero of=/dev/sdc bs=1M count=5000 oflag=direct > dd of=/dev/null if=/dev/sdc bs=1M count=5000 iflag=direct > Both Robin (iser/stgt) and Bart (scst/srp) using ramfs Robin's numbers come from DDR IB HCAs Bart's numbers come from SDR IB HCAs: Results with /dev/ram0 configured as backing store on the target (buffered I/O): Read Write Read Write performance performance performance performance (0.5K, MB/s) (0.5K, MB/s) (1 MB, MB/s) (1 MB, MB/s) STGT + iSER 250 48 349 781 SCST + SRP 411 66 659 746 Results with /dev/ram0 configured as backing store on the target (direct I/O): Read Write Read Write performance performance performance performance (0.5K, MB/s) (0.5K, MB/s) (1 MB, MB/s) (1 MB, MB/s) STGT + iSER 7.9 9.8 589 647 SCST + SRP 12.3 9.7 811 794 http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg13514.html Here are my numbers with DDR IB HCAs, SCST/SRP 5G /dev/ram0 block_io mode, RHEL5 2.6.18-8.el5 direct i/o dd write/read 1100/895 MB/s dd if=/dev/zero of=/dev/sdc bs=1M count=5000 oflag=direct dd of=/dev/null if=/dev/sdc bs=1M count=5000 iflag=direct buffered i/o dd write/read 950/770 MB/s dd if=/dev/zero of=/dev/sdc bs=1M count=5000 dd of=/dev/null if=/dev/sdc bs=1M count=5000 So when using DDR IB hcas: stgt/iser scst/srp direct I/O 800/751 1100/895 buffered I/O 1109/350 950/770 -vu > http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg13502.html > > I think that STGT is pretty fast with the fast backing storage. > > > I don't think that there is the notable perfornace difference between > kernel-space and user-space SRP (or ISER) implementations about moving > data between hosts. IB is expected to enable user-space applications > to move data between hosts quickly (if not, what can IB provide us?). > > I think that the question is how fast user-space applications can do > I/Os ccompared with I/Os in kernel space. 
STGT is eager for the advent > of good asynchronous I/O and event notification interfances. > > > One more possible optimization for STGT is zero-copy data > transfer. STGT uses pre-registered buffers and move data between page > cache and thsse buffers, and then does RDMA transfer. If we implement > own caching mechanism to use pre-registered buffers directly with (AIO > and O_DIRECT), then STGT can move data without data copies. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-29 23:32 ` FUJITA Tomonori 2008-01-30 1:15 ` [Scst-devel] " Vu Pham @ 2008-01-30 8:38 ` Bart Van Assche 2008-01-30 10:56 ` FUJITA Tomonori ` (2 more replies) 2008-01-30 11:18 ` Vladislav Bolkhovitin 2 siblings, 3 replies; 148+ messages in thread From: Bart Van Assche @ 2008-01-30 8:38 UTC (permalink / raw) To: FUJITA Tomonori Cc: rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel On Jan 30, 2008 12:32 AM, FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> wrote: > > iSER has parameters to limit the maximum size of RDMA (it needs to > repeat RDMA with a poor configuration)? Please specify which parameters you are referring to. As you know I had already repeated my tests with ridiculously high values for the following iSER parameters: FirstBurstLength, MaxBurstLength and MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block size specified to dd). Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
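For readers who want to reproduce this kind of tuning: on an open-iscsi initiator the three session parameters mentioned above are normally raised per node record with iscsiadm, roughly as sketched below. The target name and portal are placeholders, and whatever is configured here is still subject to negotiation with the target at login, so the effective values should be checked in the session parameters afterwards.

  # raise the negotiated burst/segment limits to 16 MB (node record is illustrative)
  iscsiadm -m node -T iqn.2008-01.example:ramdisk -p 192.168.1.10 \
      -o update -n node.session.iscsi.FirstBurstLength -v 16777216
  iscsiadm -m node -T iqn.2008-01.example:ramdisk -p 192.168.1.10 \
      -o update -n node.session.iscsi.MaxBurstLength -v 16777216
  iscsiadm -m node -T iqn.2008-01.example:ramdisk -p 192.168.1.10 \
      -o update -n node.conn[0].iscsi.MaxRecvDataSegmentLength -v 16777216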
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-30 8:38 ` Bart Van Assche @ 2008-01-30 10:56 ` FUJITA Tomonori 2008-01-30 11:40 ` Vladislav Bolkhovitin ` (2 more replies) 2008-01-30 16:34 ` James Bottomley 2008-02-05 17:01 ` Erez Zilber 2 siblings, 3 replies; 148+ messages in thread From: FUJITA Tomonori @ 2008-01-30 10:56 UTC (permalink / raw) To: bart.vanassche Cc: fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel On Wed, 30 Jan 2008 09:38:04 +0100 "Bart Van Assche" <bart.vanassche@gmail.com> wrote: > On Jan 30, 2008 12:32 AM, FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> wrote: > > > > iSER has parameters to limit the maximum size of RDMA (it needs to > > repeat RDMA with a poor configuration)? > > Please specify which parameters you are referring to. As you know I Sorry, I can't say. I don't know much about iSER. But seems that Pete and Robin can get the better I/O performance - line speed ratio with STGT. The version of OpenIB might matters too. For example, Pete said that STGT reads loses about 100 MB/s for some transfer sizes for some transfer sizes due to the OpenIB version difference or other unclear reasons. http://article.gmane.org/gmane.linux.iscsi.tgt.devel/135 It's fair to say that it takes long time and need lots of knowledge to get the maximum performance of SAN, I think. I think that it would be easier to convince James with the detailed analysis (e.g. where does it take so long, like Pete did), not just 'dd' performance results. Pushing iSCSI target code into mainline failed four times: IET, SCST, STGT (doing I/Os in kernel in the past), and PyX's one (*1). iSCSI target code is huge. You said SCST comprises 14,000 lines, but it's not iSCSI target code. The SCSI engine code comprises 14,000 lines. You need another 10,000 lines for the iSCSI driver. Note that SCST's iSCSI driver provides only basic iSCSI features. PyX's iSCSI target code implemenents more iSCSI features (like MC/S, ERL2, etc) and comprises about 60,000 lines and it still lacks some features like iSER, bidi, etc. I think that it's reasonable to say that we need more than 'dd' results before pushing about possible more than 60,000 lines to mainline. (*1) http://linux-iscsi.org/ ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-30 10:56 ` FUJITA Tomonori @ 2008-01-30 11:40 ` Vladislav Bolkhovitin 2008-01-30 13:10 ` Bart Van Assche 2008-01-31 13:25 ` Nicholas A. Bellinger 2 siblings, 0 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-01-30 11:40 UTC (permalink / raw) To: FUJITA Tomonori Cc: bart.vanassche, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, linux-scsi, scst-devel, linux-kernel FUJITA Tomonori wrote: > On Wed, 30 Jan 2008 09:38:04 +0100 > "Bart Van Assche" <bart.vanassche@gmail.com> wrote: > > >>On Jan 30, 2008 12:32 AM, FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> wrote: >> >>>iSER has parameters to limit the maximum size of RDMA (it needs to >>>repeat RDMA with a poor configuration)? >> >>Please specify which parameters you are referring to. As you know I > > > Sorry, I can't say. I don't know much about iSER. But seems that Pete > and Robin can get the better I/O performance - line speed ratio with > STGT. > > The version of OpenIB might matters too. For example, Pete said that > STGT reads loses about 100 MB/s for some transfer sizes for some > transfer sizes due to the OpenIB version difference or other unclear > reasons. > > http://article.gmane.org/gmane.linux.iscsi.tgt.devel/135 > > It's fair to say that it takes long time and need lots of knowledge to > get the maximum performance of SAN, I think. > > I think that it would be easier to convince James with the detailed > analysis (e.g. where does it take so long, like Pete did), not just > 'dd' performance results. > > Pushing iSCSI target code into mainline failed four times: IET, SCST, > STGT (doing I/Os in kernel in the past), and PyX's one (*1). iSCSI > target code is huge. You said SCST comprises 14,000 lines, but it's > not iSCSI target code. The SCSI engine code comprises 14,000 > lines. You need another 10,000 lines for the iSCSI driver. Note that > SCST's iSCSI driver provides only basic iSCSI features. PyX's iSCSI > target code implemenents more iSCSI features (like MC/S, ERL2, etc) > and comprises about 60,000 lines and it still lacks some features like > iSER, bidi, etc. > > I think that it's reasonable to say that we need more than 'dd' > results before pushing about possible more than 60,000 lines to > mainline. Tomo, please stop counting in-kernel lines only (see http://lkml.org/lkml/2007/4/24/364). The amount of the overall project lines for the same feature set is a lot more important. > (*1) http://linux-iscsi.org/ > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-30 10:56 ` FUJITA Tomonori 2008-01-30 11:40 ` Vladislav Bolkhovitin @ 2008-01-30 13:10 ` Bart Van Assche 2008-01-30 13:54 ` FUJITA Tomonori 2008-01-31 13:25 ` Nicholas A. Bellinger 2 siblings, 1 reply; 148+ messages in thread From: Bart Van Assche @ 2008-01-30 13:10 UTC (permalink / raw) To: FUJITA Tomonori Cc: fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel On Jan 30, 2008 11:56 AM, FUJITA Tomonori <tomof@acm.org> wrote: > On Wed, 30 Jan 2008 09:38:04 +0100 > "Bart Van Assche" <bart.vanassche@gmail.com> wrote: > > > > Please specify which parameters you are referring to. As you know I > > Sorry, I can't say. I don't know much about iSER. But seems that Pete > and Robin can get the better I/O performance - line speed ratio with > STGT. Robin Humble was using a DDR InfiniBand network, while my tests were performed with an SDR InfiniBand network. Robin's results can't be directly compared to my results. Pete Wyckoff's results (http://www.osc.edu/~pw/papers/wyckoff-iser-snapi07-talk.pdf) are hard to interpret. I have asked Pete which of the numbers in his test can be compared with what I measured, but Pete did not reply. > The version of OpenIB might matters too. For example, Pete said that > STGT reads loses about 100 MB/s for some transfer sizes for some > transfer sizes due to the OpenIB version difference or other unclear > reasons. > > http://article.gmane.org/gmane.linux.iscsi.tgt.devel/135 Pete wrote about a degradation from 600 MB/s to 500 MB/s for reads with STGT+iSER. In my tests I measured 589 MB/s for reads (direct I/O), which matches with the better result obtained by Pete. Note: the InfiniBand kernel modules I used were those from the 2.6.22.9 kernel, not from the OFED distribution. Bart. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-30 13:10 ` Bart Van Assche @ 2008-01-30 13:54 ` FUJITA Tomonori 2008-01-31 7:48 ` Bart Van Assche 0 siblings, 1 reply; 148+ messages in thread From: FUJITA Tomonori @ 2008-01-30 13:54 UTC (permalink / raw) To: bart.vanassche Cc: tomof, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel, fujita.tomonori On Wed, 30 Jan 2008 14:10:47 +0100 "Bart Van Assche" <bart.vanassche@gmail.com> wrote: > On Jan 30, 2008 11:56 AM, FUJITA Tomonori <tomof@acm.org> wrote: > > On Wed, 30 Jan 2008 09:38:04 +0100 > > "Bart Van Assche" <bart.vanassche@gmail.com> wrote: > > > > > > Please specify which parameters you are referring to. As you know I > > > > Sorry, I can't say. I don't know much about iSER. But seems that Pete > > and Robin can get the better I/O performance - line speed ratio with > > STGT. > > Robin Humble was using a DDR InfiniBand network, while my tests were > performed with an SDR InfiniBand network. Robin's results can't be > directly compared to my results. I know that you use different hardware. I used 'ratio' word. BTW, you said the performance difference of dio READ is 38% but I think it's 27.3 %, though it's still large. > Pete Wyckoff's results > (http://www.osc.edu/~pw/papers/wyckoff-iser-snapi07-talk.pdf) are hard > to interpret. I have asked Pete which of the numbers in his test can > be compared with what I measured, but Pete did not reply. > > > The version of OpenIB might matters too. For example, Pete said that > > STGT reads loses about 100 MB/s for some transfer sizes for some > > transfer sizes due to the OpenIB version difference or other unclear > > reasons. > > > > http://article.gmane.org/gmane.linux.iscsi.tgt.devel/135 > > Pete wrote about a degradation from 600 MB/s to 500 MB/s for reads > with STGT+iSER. In my tests I measured 589 MB/s for reads (direct > I/O), which matches with the better result obtained by Pete. I don't know whether he used the same benchmark software, so I don't think that we can compare them. All I tried to say is that the OFED version might have a big effect on the performance. So you might need to find the best one. > Note: the InfiniBand kernel modules I used were those from the > 2.6.22.9 kernel, not from the OFED distribution. I'm talking about the target machine (I think that Pete was also talking about OFED on his target machine). STGT uses OFED libraries, I think. ^ permalink raw reply [flat|nested] 148+ messages in thread
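For clarity, the 38 % vs. 27.3 % disagreement above is most likely only a difference in baseline. Assuming the figures under discussion are the 1 MB direct-I/O read numbers quoted earlier in the thread (589 MB/s for STGT + iSER, 811 MB/s for SCST + SRP), the same gap can be expressed two ways:

  relative to the STGT result:  (811 - 589) / 589 is roughly 37.7 %
  relative to the SCST result:  (811 - 589) / 811 is roughly 27.4 %

so both quoted percentages are consistent with the same measurements.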
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-30 13:54 ` FUJITA Tomonori @ 2008-01-31 7:48 ` Bart Van Assche 0 siblings, 0 replies; 148+ messages in thread From: Bart Van Assche @ 2008-01-31 7:48 UTC (permalink / raw) To: FUJITA Tomonori Cc: fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel On Jan 30, 2008 2:54 PM, FUJITA Tomonori <tomof@acm.org> wrote: > On Wed, 30 Jan 2008 14:10:47 +0100 > "Bart Van Assche" <bart.vanassche@gmail.com> wrote: > > > On Jan 30, 2008 11:56 AM, FUJITA Tomonori <tomof@acm.org> wrote: > > > > > > Sorry, I can't say. I don't know much about iSER. But seems that Pete > > > and Robin can get the better I/O performance - line speed ratio with > > > STGT. > > > > Robin Humble was using a DDR InfiniBand network, while my tests were > > performed with an SDR InfiniBand network. Robin's results can't be > > directly compared to my results. > > I know that you use different hardware. I used 'ratio' word. Let's start by summarizing the relevant numbers from Robin's measurements and my own measurements. Maximum bandwidth of the underlying physical medium: 2000 MB/s for a DDR 4x InfiniBand network and 1000 MB/s for an SDR 4x InfiniBand network. Maximum bandwidth reported by the OFED ib_write_bw test program: 1473 MB/s for Robin's setup and 933 MB/s for my setup. These numbers match published ib_write_bw results (see e.g. figure 11 in http://www.usenix.org/events/usenix06/tech/full_papers/liu/liu_html/index.html or chapter 7 in http://www.voltaire.com/ftp/rocks/HCA-4X0_Linux_GridStack_4.3_Release_Notes_DOC-00171-A00.pdf). Throughput measured for communication via STGT + iSER to a remote RAM disk via direct I/O with dd: 800 MB/s for writing and 751 MB/s for reading in Robin's setup, and 647 MB/s for writing and 589 MB/s for reading in my setup. From this we can compute the ratio of I/O performance to ib_write_bw bandwidth: 54 % for writing and 51 % for reading in Robin's setup, and 69 % for writing and 63 % for reading in my setup. Or a slightly better utilization of the bandwidth in my setup than in Robin's setup. This is no surprise -- the faster a communication link is, the harder it is to use all of the available bandwidth. So why did you state that in Robin's tests the I/O performance to line speed ratio was better than in my tests ? Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
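For reference, the ib_write_bw figures mentioned above come from the perftest suite shipped with OFED; a typical invocation runs one copy as a server and points a second copy at it, roughly as sketched below. The host name and message size are placeholders, and option names can differ slightly between perftest releases.

  # on the target machine (server side)
  ib_write_bw

  # on the initiator machine, against the target's IPoIB address or hostname
  ib_write_bw -s 1048576 target-host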
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-30 10:56 ` FUJITA Tomonori 2008-01-30 11:40 ` Vladislav Bolkhovitin 2008-01-30 13:10 ` Bart Van Assche @ 2008-01-31 13:25 ` Nicholas A. Bellinger 2008-01-31 14:34 ` Bart Van Assche 2008-02-01 8:11 ` Bart Van Assche 2 siblings, 2 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-01-31 13:25 UTC (permalink / raw) To: FUJITA Tomonori Cc: bart.vanassche, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel Greetings all, On Wed, 2008-01-30 at 19:56 +0900, FUJITA Tomonori wrote: > On Wed, 30 Jan 2008 09:38:04 +0100 > "Bart Van Assche" <bart.vanassche@gmail.com> wrote: > > > On Jan 30, 2008 12:32 AM, FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> wrote: > > > > > > iSER has parameters to limit the maximum size of RDMA (it needs to > > > repeat RDMA with a poor configuration)? > > > > Please specify which parameters you are referring to. As you know I > > Sorry, I can't say. I don't know much about iSER. But seems that Pete > and Robin can get the better I/O performance - line speed ratio with > STGT. > > The version of OpenIB might matters too. For example, Pete said that > STGT reads loses about 100 MB/s for some transfer sizes for some > transfer sizes due to the OpenIB version difference or other unclear > reasons. > > http://article.gmane.org/gmane.linux.iscsi.tgt.devel/135 > > It's fair to say that it takes long time and need lots of knowledge to > get the maximum performance of SAN, I think. > > I think that it would be easier to convince James with the detailed > analysis (e.g. where does it take so long, like Pete did), not just > 'dd' performance results. > > Pushing iSCSI target code into mainline failed four times: IET, SCST, > STGT (doing I/Os in kernel in the past), and PyX's one (*1). iSCSI > target code is huge. You said SCST comprises 14,000 lines, but it's > not iSCSI target code. The SCSI engine code comprises 14,000 > lines. You need another 10,000 lines for the iSCSI driver. Note that > SCST's iSCSI driver provides only basic iSCSI features. PyX's iSCSI > target code implemenents more iSCSI features (like MC/S, ERL2, etc) > and comprises about 60,000 lines and it still lacks some features like > iSER, bidi, etc. > The PyX storage engine supports a scatterlist linked list algorithm that maps any sector count + sector size combination down to contiguous struct scatterlist arrays across (potentially) multiple Linux storage subsystems from a single CDB received on a initiator port. This design was a consequence of a requirement for running said engine on Linux v2.2 and v2.4 across non cache coherent systems (MIPS R5900-EE) using a single contiguous memory block mapped into struct buffer_head for PATA access, and struct scsi_cmnd access on USB storage. Note that this was before struct bio and struct scsi_request existed.. The PyX storage engine as it exists at Linux-iSCSI.org today can be thought of as a hybrid OSD processing engine, as it maps storage object memory across a number of tasks from a received command CDB. The ability to pass in pre allocated memory from an RDMA capable adapter, as well as allocated internally (ie: traditional iSCSI without open_iscsi's struct skbuff rx zero-copy) is inherient in the design of the storage engine. 
The lacking Bidi support can be attributed to lack of greater support (and hence user interest) in Bidi, but I am really glad to see this getting into the SCSI ML and STGT, and is certainly of interest in the long term. Another feature that is missing in the current engine is > 16 Byte CDBs, which I would imagine alot of vendor folks would like to see in Linux as well. This is pretty easy to add in iSCSI with an AHS and in the engine and storage subsystems. > I think that it's reasonable to say that we need more than 'dd' > results before pushing about possible more than 60,000 lines to > mainline. > > (*1) http://linux-iscsi.org/ > - The 60k lines of code also includes functionality (the SE mirroring comes to mind) that I do not plan to push towards mainline, along with other legacy bits so we can build on earlier v2.6 embedded platforms. The existing Target mode LIO-SE that provides linked list scatterlist mapping algorithm that is similar to what Jens and Rusty have been working on, and is under 14k lines including the switch(cdb[0]) + function pointer assignment to per CDB specific structure that is called potentially out-of-order in the RX side context of the CmdSN state machine in RFC-3720. The current SE is also lacking the very SCSI specific task management state machines that not a whole lot of iSCSI implementions implement properly, and seem to be minimal interest to users, and of moderate interest to vendors. Getting this implemented generically in SCSI, as opposed to an transport specific mechanisim would benefit the Linux SCSI target engine. The pSCSI (struct scsi_cmnd), iBlock (struct bio) and FILE (struct file) plugins together are a grand total of 3.5k lines using the v2.9 LIO-SE interface. Assuming we have a single preferred data and control patch for underlying physical and virtual block devices, this could also get smaller. A quick check of the code puts the traditional kernel level iSCSI statemachine at roughly 16k, which is pretty good for the complete state machine. Also, having iSER and traditional iSCSI share MC/S and ERL=2 common code will be of interest, as well as iSCSI login state machines, which are identical minus the extra iSER specific keys and requirement to transition from byte stream mode to RDMA accelerated mode. Since this particular code is located in a non-data path critical section, the kernel vs. user discussion is a wash. If we are talking about data path, yes, the relevance of DD tests in kernel designs are suspect :p. For those IB testers who are interested, perhaps having a look with disktest from the Linux Test Project would give a better comparision between the two implementations on a RDMA capable fabric like IB for best case performance. I think everyone is interested in seeing just how much data path overhead exists between userspace and kernel space in typical and heavy workloads, if if this overhead can be minimized to make userspace a better option for some of this very complex code. --nab > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-31 13:25 ` Nicholas A. Bellinger @ 2008-01-31 14:34 ` Bart Van Assche 2008-01-31 14:44 ` Nicholas A. Bellinger 2008-01-31 15:50 ` Vladislav Bolkhovitin 2008-02-01 8:11 ` Bart Van Assche 1 sibling, 2 replies; 148+ messages in thread From: Bart Van Assche @ 2008-01-31 14:34 UTC (permalink / raw) To: Nicholas A. Bellinger Cc: FUJITA Tomonori, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > Since this particular code is located in a non-data path critical > section, the kernel vs. user discussion is a wash. If we are talking > about data path, yes, the relevance of DD tests in kernel designs are > suspect :p. For those IB testers who are interested, perhaps having a > look with disktest from the Linux Test Project would give a better > comparision between the two implementations on a RDMA capable fabric > like IB for best case performance. I think everyone is interested in > seeing just how much data path overhead exists between userspace and > kernel space in typical and heavy workloads, if if this overhead can be > minimized to make userspace a better option for some of this very > complex code. I can run disktest on the same setups I ran dd on. This will take some time however. Disktest is new to me -- any hints with regard to suitable combinations of command line parameters are welcome. The most recent version I could find on http://ltp.sourceforge.net/ is ltp-20071231. Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-31 14:34 ` Bart Van Assche @ 2008-01-31 14:44 ` Nicholas A. Bellinger 2008-01-31 15:50 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-01-31 14:44 UTC (permalink / raw) To: Bart Van Assche Cc: FUJITA Tomonori, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel Hi Bart, On Thu, 2008-01-31 at 15:34 +0100, Bart Van Assche wrote: > On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > Since this particular code is located in a non-data path critical > > section, the kernel vs. user discussion is a wash. If we are talking > > about data path, yes, the relevance of DD tests in kernel designs are > > suspect :p. For those IB testers who are interested, perhaps having a > > look with disktest from the Linux Test Project would give a better > > comparision between the two implementations on a RDMA capable fabric > > like IB for best case performance. I think everyone is interested in > > seeing just how much data path overhead exists between userspace and > > kernel space in typical and heavy workloads, if if this overhead can be > > minimized to make userspace a better option for some of this very > > complex code. > > I can run disktest on the same setups I ran dd on. This will take some > time however. > > Disktest is new to me -- any hints with regard to suitable > combinations of command line parameters are welcome. The most recent > version I could find on http://ltp.sourceforge.net/ is ltp-20071231. > I posted some numbers with traditional iSCSI on Neterion Xframe I 10 Gb/sec with LRO back in 2005 with disktest on the 1st generation x86_64 hardware available at the time. These tests where designed to show the performance advantages of internexus multiplexing that is available within traditional iSCSI, as well as iSER. The disktest parameters that I used are listed in the following thread: https://www.redhat.com/archives/dm-devel/2005-April/msg00013.html --nab > Bart Van Assche. > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-31 14:34 ` Bart Van Assche 2008-01-31 14:44 ` Nicholas A. Bellinger @ 2008-01-31 15:50 ` Vladislav Bolkhovitin 2008-01-31 16:25 ` [Scst-devel] " Joe Landman 2008-01-31 17:14 ` Nicholas A. Bellinger 1 sibling, 2 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-01-31 15:50 UTC (permalink / raw) To: Bart Van Assche Cc: Nicholas A. Bellinger, FUJITA Tomonori, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, linux-scsi, scst-devel, linux-kernel Bart Van Assche wrote: > On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > >>Since this particular code is located in a non-data path critical >>section, the kernel vs. user discussion is a wash. If we are talking >>about data path, yes, the relevance of DD tests in kernel designs are >>suspect :p. For those IB testers who are interested, perhaps having a >>look with disktest from the Linux Test Project would give a better >>comparision between the two implementations on a RDMA capable fabric >>like IB for best case performance. I think everyone is interested in >>seeing just how much data path overhead exists between userspace and >>kernel space in typical and heavy workloads, if if this overhead can be >>minimized to make userspace a better option for some of this very >>complex code. > > I can run disktest on the same setups I ran dd on. This will take some > time however. Disktest was already referenced in the beginning of the performance comparison thread, but its results are not very interesting if we are going to find out, which implementation is more effective, because in the modes, in which usually people run this utility, it produces latency insensitive workload (multiple threads working in parallel). So, such multithreaded disktests results will be different between STGT and SCST only if STGT's implementation will get target CPU bound. If CPU on the target is powerful enough, even extra busy loops in the STGT or SCST hot path code will change nothing. Additionally, multithreaded disktest over RAM disk is a good example of a synthetic benchmark, which has almost no relation with real life workloads. But people like it, because it produces nice looking results. Actually, I don't know what kind of conclusions it is possible to make from disktest's results (maybe only how throughput gets bigger or slower with increasing number of threads?), it's a good stress test tool, but not more. > Disktest is new to me -- any hints with regard to suitable > combinations of command line parameters are welcome. The most recent > version I could find on http://ltp.sourceforge.net/ is ltp-20071231. > > Bart Van Assche. > - > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-01-31 15:50 ` Vladislav Bolkhovitin @ 2008-01-31 16:25 ` Joe Landman 2008-01-31 17:08 ` Bart Van Assche 2008-01-31 17:14 ` Nicholas A. Bellinger 1 sibling, 1 reply; 148+ messages in thread From: Joe Landman @ 2008-01-31 16:25 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Bart Van Assche, James.Bottomley, linux-scsi, rdreier, linux-kernel, Nicholas A. Bellinger, fujita.tomonori, scst-devel, akpm, FUJITA Tomonori, torvalds Vladislav Bolkhovitin wrote: > Bart Van Assche wrote: [...] >> I can run disktest on the same setups I ran dd on. This will take some >> time however. > > Disktest was already referenced in the beginning of the performance > comparison thread, but its results are not very interesting if we are > going to find out, which implementation is more effective, because in > the modes, in which usually people run this utility, it produces latency > insensitive workload (multiple threads working in parallel). So, such There are other issues with disktest, in that you can easily specify option combinations that generate apparently 5+ GB/s of IO, though actual traffic over the link to storage is very low. Caveat disktest emptor. > multithreaded disktests results will be different between STGT and SCST > only if STGT's implementation will get target CPU bound. If CPU on the > target is powerful enough, even extra busy loops in the STGT or SCST hot > path code will change nothing. > > Additionally, multithreaded disktest over RAM disk is a good example of > a synthetic benchmark, which has almost no relation with real life > workloads. But people like it, because it produces nice looking results. I agree. The backing store should be a disk for it to have meaning, though please note my caveat above. > > Actually, I don't know what kind of conclusions it is possible to make > from disktest's results (maybe only how throughput gets bigger or slower > with increasing number of threads?), it's a good stress test tool, but > not more. Unfortunately, I agree. Bonnie++, dd tests, and a few others seem to bear far closer to "real world" tests than disktest and iozone, the latter of which does more to test the speed of RAM cache and system call performance than actual IO. >> Disktest is new to me -- any hints with regard to suitable >> combinations of command line parameters are welcome. The most recent >> version I could find on http://ltp.sourceforge.net/ is ltp-20071231. >> >> Bart Van Assche. Here is what I have run: disktest -K 8 -B 256k -I F -N 20000000 -P A -w /big/file disktest -K 8 -B 64k -I F -N 20000000 -P A -w /big/file disktest -K 8 -B 1k -I B -N 2000000 -P A /dev/sdb2 and many others. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-01-31 16:25 ` [Scst-devel] " Joe Landman @ 2008-01-31 17:08 ` Bart Van Assche 2008-01-31 17:13 ` Joe Landman ` (2 more replies) 0 siblings, 3 replies; 148+ messages in thread From: Bart Van Assche @ 2008-01-31 17:08 UTC (permalink / raw) To: landman Cc: Vladislav Bolkhovitin, James.Bottomley, linux-scsi, rdreier, linux-kernel, Nicholas A. Bellinger, fujita.tomonori, scst-devel, akpm, FUJITA Tomonori, torvalds On Jan 31, 2008 5:25 PM, Joe Landman <landman@scalableinformatics.com> wrote: > Vladislav Bolkhovitin wrote: > > Actually, I don't know what kind of conclusions it is possible to make > > from disktest's results (maybe only how throughput gets bigger or slower > > with increasing number of threads?), it's a good stress test tool, but > > not more. > > Unfortunately, I agree. Bonnie++, dd tests, and a few others seem to > bear far closer to "real world" tests than disktest and iozone, the > latter of which does more to test the speed of RAM cache and system call > performance than actual IO. I have ran some tests with Bonnie++, but found out that on a fast network like IB the filesystem used for the test has a really big impact on the test results. If anyone has a suggestion for a better test than dd to compare the performance of SCSI storage protocols, please let it know. Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-01-31 17:08 ` Bart Van Assche @ 2008-01-31 17:13 ` Joe Landman 2008-01-31 18:12 ` David Dillow 2008-02-01 11:50 ` Vladislav Bolkhovitin 2 siblings, 0 replies; 148+ messages in thread From: Joe Landman @ 2008-01-31 17:13 UTC (permalink / raw) To: Bart Van Assche Cc: Vladislav Bolkhovitin, James.Bottomley, linux-scsi, rdreier, linux-kernel, Nicholas A. Bellinger, fujita.tomonori, scst-devel, akpm, FUJITA Tomonori, torvalds Bart Van Assche wrote: > I have ran some tests with Bonnie++, but found out that on a fast > network like IB the filesystem used for the test has a really big > impact on the test results. This is true of the file systems when physically directly connected to the unit as well. Some file systems are designed with high performance in mind, some are not. > If anyone has a suggestion for a better test than dd to compare the > performance of SCSI storage protocols, please let it know. Hmmm... if you care about the protocol side, I can't help. Our users are more concerned with the file system side, so this is where we focus our tuning attention. > > Bart Van Assche. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-01-31 17:08 ` Bart Van Assche 2008-01-31 17:13 ` Joe Landman @ 2008-01-31 18:12 ` David Dillow 2008-02-01 11:50 ` Vladislav Bolkhovitin 2008-02-01 11:50 ` Vladislav Bolkhovitin 2 siblings, 1 reply; 148+ messages in thread From: David Dillow @ 2008-01-31 18:12 UTC (permalink / raw) To: Bart Van Assche Cc: landman, Vladislav Bolkhovitin, James.Bottomley, linux-scsi, rdreier, linux-kernel, Nicholas A. Bellinger, fujita.tomonori, scst-devel, akpm, FUJITA Tomonori, torvalds On Thu, 2008-01-31 at 18:08 +0100, Bart Van Assche wrote: > If anyone has a suggestion for a better test than dd to compare the > performance of SCSI storage protocols, please let it know. xdd on /dev/sda, sdb, etc. using -dio to do direct IO seems to work decently, though it is hard (ie, impossible) to get a repeatable sequence of IO when using higher queue depths, as it uses threads to generate multiple requests. You may also look at sgpdd_survey from Lustre's iokit, but I've not done much with that -- it uses the sg devices to send lowlevel SCSI commands. I've been playing around with some benchmark code using libaio, but it's not in generally usable shape. xdd: http://www.ioperformance.com/products.htm Lustre IO Kit: http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-20-1.html -- Dave Dillow National Center for Computational Science Oak Ridge National Laboratory (865) 241-6602 office ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-01-31 18:12 ` David Dillow @ 2008-02-01 11:50 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-01 11:50 UTC (permalink / raw) To: David Dillow Cc: Bart Van Assche, landman, James.Bottomley, linux-scsi, rdreier, linux-kernel, Nicholas A. Bellinger, fujita.tomonori, scst-devel, akpm, FUJITA Tomonori, torvalds David Dillow wrote: > On Thu, 2008-01-31 at 18:08 +0100, Bart Van Assche wrote: > >>If anyone has a suggestion for a better test than dd to compare the >>performance of SCSI storage protocols, please let it know. > > > xdd on /dev/sda, sdb, etc. using -dio to do direct IO seems to work > decently, though it is hard (ie, impossible) to get a repeatable > sequence of IO when using higher queue depths, as it uses threads to > generate multiple requests. This utility seems to be a good one, but it's basically the same as disktest, although much more advanced. > You may also look at sgpdd_survey from Lustre's iokit, but I've not done > much with that -- it uses the sg devices to send lowlevel SCSI commands. Yes, it might be worth to try. Since fundamentally it's the same as O_DIRECT dd, but with a bit less overhead on the initiator side (hence less initiator side latency), most likely it will show ever bigger difference, than it is with dd. > I've been playing around with some benchmark code using libaio, but it's > not in generally usable shape. > > xdd: > http://www.ioperformance.com/products.htm > > Lustre IO Kit: > http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-20-1.html ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-01-31 17:08 ` Bart Van Assche 2008-01-31 17:13 ` Joe Landman 2008-01-31 18:12 ` David Dillow @ 2008-02-01 11:50 ` Vladislav Bolkhovitin 2008-02-01 12:25 ` Vladislav Bolkhovitin 2 siblings, 1 reply; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-01 11:50 UTC (permalink / raw) To: Bart Van Assche Cc: landman, fujita.tomonori, linux-scsi, rdreier, linux-kernel, Nicholas A. Bellinger, James.Bottomley, scst-devel, akpm, FUJITA Tomonori, torvalds Bart Van Assche wrote: > On Jan 31, 2008 5:25 PM, Joe Landman <landman@scalableinformatics.com> wrote: > >>Vladislav Bolkhovitin wrote: >> >>>Actually, I don't know what kind of conclusions it is possible to make >>>from disktest's results (maybe only how throughput gets bigger or slower >>>with increasing number of threads?), it's a good stress test tool, but >>>not more. >> >>Unfortunately, I agree. Bonnie++, dd tests, and a few others seem to >>bear far closer to "real world" tests than disktest and iozone, the >>latter of which does more to test the speed of RAM cache and system call >>performance than actual IO. > > > I have ran some tests with Bonnie++, but found out that on a fast > network like IB the filesystem used for the test has a really big > impact on the test results. > > If anyone has a suggestion for a better test than dd to compare the > performance of SCSI storage protocols, please let it know. I would suggest you to try something from real life, like: - Copying large file tree over a single or multiple IB links - Measure of some DB engine's TPC - etc. > Bart Van Assche. > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Microsoft > Defy all challenges. Microsoft(R) Visual Studio 2008. > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > _______________________________________________ > Scst-devel mailing list > Scst-devel@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/scst-devel > ^ permalink raw reply [flat|nested] 148+ messages in thread
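As a concrete example of the first suggestion above, a file-tree copy onto a filesystem created on the imported LUN can be timed from a cold cache; the device node, mount point and source tree below are placeholders:

  # one-time setup on the initiator (device node is illustrative)
  mkfs.xfs /dev/sdc
  mount /dev/sdc /mnt/lun

  # drop caches so the copy actually goes over the wire, then time a large tree copy
  sync
  echo 3 > /proc/sys/vm/drop_caches
  time cp -a /usr /mnt/lun/copy-test
  sync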
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-01 11:50 ` Vladislav Bolkhovitin @ 2008-02-01 12:25 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-01 12:25 UTC (permalink / raw) To: Bart Van Assche Cc: landman, fujita.tomonori, linux-scsi, rdreier, linux-kernel, Nicholas A. Bellinger, James.Bottomley, scst-devel, akpm, FUJITA Tomonori, torvalds Vladislav Bolkhovitin wrote: > Bart Van Assche wrote: > >> On Jan 31, 2008 5:25 PM, Joe Landman <landman@scalableinformatics.com> >> wrote: >> >>> Vladislav Bolkhovitin wrote: >>> >>>> Actually, I don't know what kind of conclusions it is possible to make >>>> from disktest's results (maybe only how throughput gets bigger or >>>> slower >>>> with increasing number of threads?), it's a good stress test tool, but >>>> not more. >>> >>> >>> Unfortunately, I agree. Bonnie++, dd tests, and a few others seem to >>> bear far closer to "real world" tests than disktest and iozone, the >>> latter of which does more to test the speed of RAM cache and system call >>> performance than actual IO. >> >> >> >> I have ran some tests with Bonnie++, but found out that on a fast >> network like IB the filesystem used for the test has a really big >> impact on the test results. >> >> If anyone has a suggestion for a better test than dd to compare the >> performance of SCSI storage protocols, please let it know. > > > I would suggest you to try something from real life, like: > > - Copying large file tree over a single or multiple IB links > > - Measure of some DB engine's TPC > > - etc. Forgot to mention. During those tests make sure that imported devices from both SCST and STGT report in the kernel log the same write cache and FUA capabilities, since they significantly affect initiator's behavior. Like: sd 4:0:0:5: [sdf] Write cache: enabled, read cache: enabled, supports DPO and FUA For SCST the fastest mode is NV_CACHE, refer to its README file for details. >> Bart Van Assche. >> >> ------------------------------------------------------------------------- >> This SF.net email is sponsored by: Microsoft >> Defy all challenges. Microsoft(R) Visual Studio 2008. >> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ >> _______________________________________________ >> Scst-devel mailing list >> Scst-devel@lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/scst-devel >> > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 148+ messages in thread
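A quick way to double-check those capabilities on the initiator is sketched below; the sysfs path assumes a reasonably recent 2.6 kernel with the usual sd driver:

  # Kernel log lines printed by sd when the imported device was scanned:
  dmesg | grep -i 'write cache'
  # The cache mode detected by sd is also visible through sysfs:
  cat /sys/class/scsi_disk/*/cache_type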
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-31 15:50 ` Vladislav Bolkhovitin 2008-01-31 16:25 ` [Scst-devel] " Joe Landman @ 2008-01-31 17:14 ` Nicholas A. Bellinger 2008-01-31 17:40 ` Bart Van Assche 1 sibling, 1 reply; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-01-31 17:14 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Bart Van Assche, FUJITA Tomonori, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, linux-scsi, scst-devel, linux-kernel On Thu, 2008-01-31 at 18:50 +0300, Vladislav Bolkhovitin wrote: > Bart Van Assche wrote: > > On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > > >>Since this particular code is located in a non-data path critical > >>section, the kernel vs. user discussion is a wash. If we are talking > >>about data path, yes, the relevance of DD tests in kernel designs are > >>suspect :p. For those IB testers who are interested, perhaps having a > >>look with disktest from the Linux Test Project would give a better > >>comparision between the two implementations on a RDMA capable fabric > >>like IB for best case performance. I think everyone is interested in > >>seeing just how much data path overhead exists between userspace and > >>kernel space in typical and heavy workloads, if if this overhead can be > >>minimized to make userspace a better option for some of this very > >>complex code. > > > > I can run disktest on the same setups I ran dd on. This will take some > > time however. > > Disktest was already referenced in the beginning of the performance > comparison thread, but its results are not very interesting if we are > going to find out, which implementation is more effective, because in > the modes, in which usually people run this utility, it produces latency > insensitive workload (multiple threads working in parallel). So, such > multithreaded disktests results will be different between STGT and SCST > only if STGT's implementation will get target CPU bound. If CPU on the > target is powerful enough, even extra busy loops in the STGT or SCST hot > path code will change nothing. > I think the really interesting numbers are the difference for bulk I/O between kernel and userspace on both traditional iSCSI and the RDMA enabled flavours. I have not been able to determine anything earth shattering from the current run of kernel vs. userspace tests, nor which method of implementation for iSER, SRP, and generic Storage Engine are 'more effective' for that case. Performance and latency to real storage would make alot more sense for the kernel vs. user case. Also workloads against software LVM and Linux MD block devices would be of interest as these would be some of the more typical deployments that would be in the field, and is what Linux-iSCSI.org uses for our production cluster storage today. Having implemented my own iSCSI and SCSI Target mode Storage Engine leads me to believe that putting logic in userspace is probably a good idea in the longterm. If this means putting the entire data IO path into userspace for Linux/iSCSI, then there needs to be a good reason why this will not not scale to multi-port 10 Gb/sec engines in traditional and RDMA mode if we need to take this codepath back into the kernel. The end goal is to have the most polished and complete storage engine and iSCSI stacks designs go upstream, which is something I think we can all agree on. 
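For anyone who wants to run the LVM/MD style workloads mentioned above instead of a ramdisk, a minimal backing-store setup on the target could look like this; device names, RAID level and sizes are placeholders only:

  # Assumed example: four spare disks on the target machine.
  mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
  pvcreate /dev/md0
  vgcreate tgtvg /dev/md0
  lvcreate -L 100G -n lun0 tgtvg
  # /dev/tgtvg/lun0 can then be exported by whichever target stack is under test.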
Also, with STGT being a pretty new design which has not undergone a lot of optimization, perhaps profiling both pieces of code against similar tests would give us a better idea of where userspace bottlenecks reside. The overhead involved with traditional iSCSI for bulk IO from kernel / userspace would also be a key concern for a much larger set of users, as iSER and SRP on IB have a pretty small userbase and will probably remain small for the near future. > Additionally, multithreaded disktest over RAM disk is a good example of > a synthetic benchmark, which has almost no relation with real life > workloads. But people like it, because it produces nice looking results. > Yes, people like to claim their stacks are the fastest with RAM disk benchmarks. But hooking up their fast network silicon to existing storage hardware and OS storage subsystems and software is where the real game is. > Actually, I don't know what kind of conclusions it is possible to make > from disktest's results (maybe only how throughput gets bigger or slower > with increasing number of threads?), it's a good stress test tool, > but > not more. > Being able to have a best case baseline with disktest for kernel vs. user would be of interest for both transport protocol and SCSI Target mode Storage Engine profiling. The first run of tests looked pretty bandwidth oriented, so disktest works well to determine maximum bandwidth. Disktest also is nice for getting reads from cache on hardware RAID controllers because disktest only generates requests with LBAs from 0 -> disktest BLOCKSIZE. --nab > > Disktest is new to me -- any hints with regard to suitable > > combinations of command line parameters are welcome. The most recent > > version I could find on http://ltp.sourceforge.net/ is ltp-20071231. > > > > Bart Van Assche. > > - > > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > ^ permalink raw reply [flat|nested] 148+ messages in thread
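Profiling both targets under an identical workload, as suggested above, could be done with oprofile along these lines; this is only a sketch and assumes a vmlinux image with symbols is available on the test machine:

  opcontrol --vmlinux=/boot/vmlinux-`uname -r`   # or --no-vmlinux for user space only
  opcontrol --start
  # ... run the same dd or disktest workload against the target here ...
  opcontrol --stop
  opreport --symbols | head -30                  # hottest kernel and userspace symbols
  opcontrol --shutdown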
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-31 17:14 ` Nicholas A. Bellinger @ 2008-01-31 17:40 ` Bart Van Assche 2008-01-31 18:15 ` Nicholas A. Bellinger 0 siblings, 1 reply; 148+ messages in thread From: Bart Van Assche @ 2008-01-31 17:40 UTC (permalink / raw) To: Nicholas A. Bellinger Cc: Vladislav Bolkhovitin, FUJITA Tomonori, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, linux-scsi, scst-devel, linux-kernel On Jan 31, 2008 6:14 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > Also, with STGT being a pretty new design which has not undergone alot > of optimization, perhaps profiling both pieces of code against similar > tests would give us a better idea of where userspace bottlenecks reside. > Also, the overhead involved with traditional iSCSI for bulk IO from > kernel / userspace would also be a key concern for a much larger set of > users, as iSER and SRP on IB is a pretty small userbase and will > probably remain small for the near future. Two important trends in data center technology are server consolidation and storage consolidation. A.o. every web hosting company can profit from a fast storage solution. I wouldn't call this a small user base. Regarding iSER and SRP on IB: InfiniBand is today the most economic solution for a fast storage network. I do not know which technology will be the most popular for storage consolidation within a few years -- this can be SRP, iSER, IPoIB + SDP, FCoE (Fibre Channel over Ethernet) or maybe yet another technology. No matter which technology becomes the most popular for storage applications, there will be a need for high-performance storage software. References: * Michael Feldman, Battle of the Network Fabrics, HPCwire, December 2006, http://www.hpcwire.com/hpc/1145060.html * NetApp, Reducing Data Center Power Consumption Through Efficient Storage, February 2007, http://www.netapp.com/ftp/wp-reducing-datacenter-power-consumption.pdf Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-31 17:40 ` Bart Van Assche @ 2008-01-31 18:15 ` Nicholas A. Bellinger 2008-02-01 9:08 ` Bart Van Assche 0 siblings, 1 reply; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-01-31 18:15 UTC (permalink / raw) To: Bart Van Assche Cc: Vladislav Bolkhovitin, FUJITA Tomonori, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, linux-scsi, scst-devel, linux-kernel On Thu, 2008-01-31 at 18:40 +0100, Bart Van Assche wrote: > On Jan 31, 2008 6:14 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > Also, with STGT being a pretty new design which has not undergone alot > > of optimization, perhaps profiling both pieces of code against similar > > tests would give us a better idea of where userspace bottlenecks reside. > > Also, the overhead involved with traditional iSCSI for bulk IO from > > kernel / userspace would also be a key concern for a much larger set of > > users, as iSER and SRP on IB is a pretty small userbase and will > > probably remain small for the near future. > > Two important trends in data center technology are server > consolidation and storage consolidation. A.o. every web hosting > company can profit from a fast storage solution. I wouldn't call this > a small user base. > > Regarding iSER and SRP on IB: InfiniBand is today the most economic > solution for a fast storage network. I do not know which technology > will be the most popular for storage consolidation within a few years > -- this can be SRP, iSER, IPoIB + SDP, FCoE (Fibre Channel over > Ethernet) or maybe yet another technology. No matter which technology > becomes the most popular for storage applications, there will be a > need for high-performance storage software. > I meant small referring to storage on IB fabrics, which has usually been confined to research and national lab settings, with some other vendors offering IB as an alternative storage fabric for those who [w,c]ould not wait for 10 Gb/sec copper Ethernet and Direct Data Placement to come online. These numbers are small compared to, say, traditional iSCSI, which is getting used all over the place these days in areas I won't bother listing here. As for the future, I am obviously cheering for IP storage fabrics, in particular 10 Gb/sec Ethernet and Direct Data Placement in concert with iSCSI Extensions for RDMA to give the data center a high performance, low latency transport that can do OS independent storage multiplexing and recovery across multiple independently developed implementations. Also, avoiding lock-in from non-interoperable storage transports (especially on the high end) that had plagued so many vendors in years past has become a real option in the past few years with an IETF defined block level storage protocol. We are actually going on four years since RFC-3720 was ratified (April 2004). Making the 'enterprise' ethernet switching equipment go from millisecond to nanosecond latency is a whole different story that goes beyond my area of expertise. I know there is one startup (Fulcrum Micro) that is working on this problem and seems to be making some good progress. --nab > References: > * Michael Feldman, Battle of the Network Fabrics, HPCwire, December > 2006, http://www.hpcwire.com/hpc/1145060.html > * NetApp, Reducing Data Center Power Consumption Through Efficient > Storage, February 2007, > http://www.netapp.com/ftp/wp-reducing-datacenter-power-consumption.pdf > > Bart Van Assche. > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-31 18:15 ` Nicholas A. Bellinger @ 2008-02-01 9:08 ` Bart Van Assche 0 siblings, 0 replies; 148+ messages in thread From: Bart Van Assche @ 2008-02-01 9:08 UTC (permalink / raw) To: Nicholas A. Bellinger Cc: Vladislav Bolkhovitin, FUJITA Tomonori, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, linux-scsi, scst-devel, linux-kernel On Jan 31, 2008 7:15 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > I meant small referring to storage on IB fabrics which has usually been > in the research and national lab settings, with some other vendors > offering IB as an alternative storage fabric for those who [w,c]ould not > wait for 10 Gb/sec copper Ethernet and Direct Data Placement to come > online. These types of numbers compared to say traditional iSCSI, that > is getting used all over the place these days in areas I won't bother > listing here. InfiniBand has several advantages over 10 Gbit/s Ethernet (the list below probably isn't complete): - Lower latency. Communication latency is not only determined by the latency of a switch. The whole InfiniBand protocol stack was designed with low latency in mind. Low latency is really important for database software that accesses storage over a network. - High-availability is implemented at the network layer. Suppose that a group of servers has dual-port network interfaces and is interconnected via a so-called dual star topology, With an InfiniBand network, failover in case of a single failure (link or switch) is handled without any operating system or application intervention. With Ethernet, failover in case of a single failure must be handled either by the operating system or by the application. - You do not have to use iSER or SRP to use the bandwidth of an InfiniBand network effectively. The SDP (Sockets Direct Protocol) makes it possible that applications benefit from RDMA by using the very classic IPv4 Berkeley sockets interface. An SDP implementation in software is already available today via OFED. iperf reports 470 MB/s on single-threaded tests and 975 MB/s for a performance test with two threads on an SDR 4x InfiniBand network. These tests were performed with the OFED 1.2.5.4 SDP implementation. It is possible that future SDP implementations will perform even better. (Note: I could not yet get iSCSI over SDP working.) We should leave the choice of networking technology open -- both Ethernet and InfiniBand have specific advantages. See also: InfiniBand Trade Association, InfiniBand Architecture Specification Release 1.2.1, http://www.infinibandta.org/specs/register/publicspec/ Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
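For reference, the usual way to run an unmodified sockets application such as iperf over SDP with OFED is to preload the SDP library; the library path below is an assumption about the local OFED installation and may differ:

  # Server side:
  LD_PRELOAD=/usr/lib64/libsdp.so iperf -s
  # Client side (two parallel streams, matching the two-thread number quoted above):
  LD_PRELOAD=/usr/lib64/libsdp.so iperf -c <server-ip> -P 2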
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-31 13:25 ` Nicholas A. Bellinger 2008-01-31 14:34 ` Bart Van Assche @ 2008-02-01 8:11 ` Bart Van Assche 2008-02-01 10:39 ` Nicholas A. Bellinger 1 sibling, 1 reply; 148+ messages in thread From: Bart Van Assche @ 2008-02-01 8:11 UTC (permalink / raw) To: Nicholas A. Bellinger Cc: FUJITA Tomonori, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > The PyX storage engine supports a scatterlist linked list algorithm that > ... Which parts of the PyX source code are licensed under the GPL and which parts are closed source ? A Google query for PyX + iSCSI showed information about licensing deals. Licensing deals can only be closed for software that is not entirely licensed under the GPL. Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-01 8:11 ` Bart Van Assche @ 2008-02-01 10:39 ` Nicholas A. Bellinger 2008-02-01 11:04 ` Bart Van Assche 0 siblings, 1 reply; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-01 10:39 UTC (permalink / raw) To: Bart Van Assche Cc: FUJITA Tomonori, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel, Mike Mazarick On Fri, 2008-02-01 at 09:11 +0100, Bart Van Assche wrote: > On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > > > The PyX storage engine supports a scatterlist linked list algorithm that > > ... > > Which parts of the PyX source code are licensed under the GPL and > which parts are closed source ? A Google query for PyX + iSCSI showed > information about licensing deals. Licensing deals can only be closed > for software that is not entirely licensed under the GPL. > I was using the name PyX to give an historical context to the discussion. :-) In more recent times, I have been using the name "LIO Target Stack" and "LIO Storage Engine" to refer to Traditional RFC-3720 Target statemachines, and SCSI Processing engine implementation respectively. The codebase has matured significantly from the original codebase, as the Linux SCSI, ATA and Block subsystems envolved from v2.2, v2.4, v2.5 and modern v2.6, the LIO stack has grown (and sometimes shrunk) along with the following requirement; To support all possible storage devices on all subsystems on any hardware platform that Linux could be made to boot. Interopt with other non Linux SCSI subsystems was also an issue early in development.. If you can imagine a Solaris SCSI subsystem asking for T10 EVPD WWN information from a Linux/iSCSI Target with pre libata SATA drivers, you can probably guess just how time was spent looking at packet captures to figure out to make OS dependent (ie: outernexus) multipath to play nice. Note that PyX Target Code for Linux v2.6 has been available in source and binary form for a diverse array of Linux devices and environments since September 2007. Right around this time, the Linux-iSCSI.org Storage and Virtualization stack went online for the first time using OCFS2, PVM, HVM, LVM, RAID6 and of course, traditional RFC-3720 on 10 Gb/sec and 1 Gb/sec fabric. There have also been world's first storage research work and prototypes that have been developed with the LIO code. Information on these topics is available from the homepage, and a few links deep there are older projects and information about features inherent to the LIO Target and Storage Engine. One of my items for the v2.9 codebase in 2008 is start picking apart the current code and determining which pieces should be sent upstream for review. I have also been spending alot of time recently looking at the other available open source storage transport and processing stacks and seeing how Linux/iSCSI, and other projects can benefit from our large pool of people, knowledge, and code. Speaking of the LIO Target and SE code, it today runs the production services for Linux-iSCSI.org and it's storage and virtualization clusters on x86_64. It also also provides a base for next generation and forward looking projects that exist (or soon to exist :-) within the Linux/iSCSI ecosystem. 
Lots of time and resources have been put into the codebase, and the result is a real, live, working RFC-3720 stack that supports the optional features that give iSCSI (and hence iSER, into which they were also designed) the flexibility and transparency to operate as the original designers intended. Many thanks for your most valuable time, --nab > Bart Van Assche. > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-01 10:39 ` Nicholas A. Bellinger @ 2008-02-01 11:04 ` Bart Van Assche 2008-02-01 12:05 ` Nicholas A. Bellinger 0 siblings, 1 reply; 148+ messages in thread From: Bart Van Assche @ 2008-02-01 11:04 UTC (permalink / raw) To: Nicholas A. Bellinger Cc: FUJITA Tomonori, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel, Mike Mazarick On Feb 1, 2008 11:39 AM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > On Fri, 2008-02-01 at 09:11 +0100, Bart Van Assche wrote: > > On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > > > > > The PyX storage engine supports a scatterlist linked list algorithm that > > > ... > > > > Which parts of the PyX source code are licensed under the GPL and > > which parts are closed source ? A Google query for PyX + iSCSI showed > > information about licensing deals. Licensing deals can only be closed > > for software that is not entirely licensed under the GPL. > > > > I was using the name PyX to give an historical context to the > discussion. ... Regarding the PyX Target Code: I have found a link via which I can download a free 30-day demo. This means that a company is earning money via this target code and that the source code is not licensed under the GPL. This is fine, but it also means that today the PyX target code is not a candidate for inclusion in the Linux kernel, and that it is unlikely that all of the PyX target code (kernelspace + userspace) will be made available under GPL soon. Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-01 11:04 ` Bart Van Assche @ 2008-02-01 12:05 ` Nicholas A. Bellinger 2008-02-01 13:25 ` Bart Van Assche 0 siblings, 1 reply; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-01 12:05 UTC (permalink / raw) To: Bart Van Assche Cc: FUJITA Tomonori, fujita.tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel, Mike Mazarick On Fri, 2008-02-01 at 12:04 +0100, Bart Van Assche wrote: > On Feb 1, 2008 11:39 AM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > > > On Fri, 2008-02-01 at 09:11 +0100, Bart Van Assche wrote: > > > On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > > > > > > > The PyX storage engine supports a scatterlist linked list algorithm that > > > > ... > > > > > > Which parts of the PyX source code are licensed under the GPL and > > > which parts are closed source ? A Google query for PyX + iSCSI showed > > > information about licensing deals. Licensing deals can only be closed > > > for software that is not entirely licensed under the GPL. > > > > > > > I was using the name PyX to give an historical context to the > > discussion. ... > > Regarding the PyX Target Code: I have found a link via which I can > download a free 30-day demo. This means that a company is earning > money via this target code and that the source code is not licensed > under the GPL. This is fine, but it also means that today the PyX > target code is not a candidate for inclusion in the Linux kernel, and > that it is unlikely that all of the PyX target code (kernelspace + > userspace) will be made available under GPL soon. > All of the kernel and C userspace code is open source and available from linux-iscsi.org and licensed under the GPL. There is the BSD licensed code from userspace (iSNS), as well as ISCSI and SCSI MIBs. As for what pieces of code will be going upstream (for kernel and/or userspace), LIO Target state machines and SE algoritims are definately some of the best examples of GPL code for production IP storage fabric and has gained maturity from people and resources applied to it in a number of respects. The LIO stack presents a number of possible options to get the diverse amount of hardware and software to work. Completely dismissing the available code is certainly a waste, and there are still significant amounts of functionality related to real-time administration, RFC-3720 MC/S and ERL=2 and generic SE functionality OS storage subsystems that only exist in LIO and our assoicated projects. A one obvious example is the LIO-VM project, which brings LIO active-active transport recovery and other Linux storage functionality to Vmware and Qemu images that can provide target mode IP storage fabric on x86 non-linux based hosts. A first of its kind in the Linux/iSCSI universe. Anyways, lets get back to the technical discussion. --nab > Bart Van Assche. > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-01 12:05 ` Nicholas A. Bellinger @ 2008-02-01 13:25 ` Bart Van Assche 2008-02-01 14:36 ` Nicholas A. Bellinger 0 siblings, 1 reply; 148+ messages in thread From: Bart Van Assche @ 2008-02-01 13:25 UTC (permalink / raw) To: Nicholas A. Bellinger, FUJITA Tomonori, fujita.tomonori Cc: Roland Dreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel, Mike Mazarick On Feb 1, 2008 1:05 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > All of the kernel and C userspace code is open source and available from > linux-iscsi.org and licensed under the GPL. I found a statement on a web page that the ERL2 implementation is not included in the GPL version (http://zaal.org/iscsi/index.html). The above implies that this statement is incorrect. Tomo, are you the maintainer of this web page ? I'll try to measure the performance of the LIO Target Stack on the same setup on which I ran the other performance tests. Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-01 13:25 ` Bart Van Assche @ 2008-02-01 14:36 ` Nicholas A. Bellinger 0 siblings, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-01 14:36 UTC (permalink / raw) To: Bart Van Assche Cc: FUJITA Tomonori, fujita.tomonori, Roland Dreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel, Mike Mazarick On Fri, 2008-02-01 at 14:25 +0100, Bart Van Assche wrote: > On Feb 1, 2008 1:05 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > > > All of the kernel and C userspace code is open source and available from > > linux-iscsi.org and licensed under the GPL. > > I found a statement on a web page that the ERL2 implementation is not > included in the GPL version (http://zaal.org/iscsi/index.html). The > above implies that this statement is incorrect. Tomo, are you the > maintainer of this web page ? > This was mentioned in the context of the Core-iSCSI Initiator module. > I'll try to measure the performance of the LIO Target Stack on the > same setup on which I ran the other performance tests. > Great! --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-30 8:38 ` Bart Van Assche 2008-01-30 10:56 ` FUJITA Tomonori @ 2008-01-30 16:34 ` James Bottomley 2008-01-30 16:50 ` Bart Van Assche 2008-02-02 15:32 ` Pete Wyckoff 2008-02-05 17:01 ` Erez Zilber 2 siblings, 2 replies; 148+ messages in thread From: James Bottomley @ 2008-01-30 16:34 UTC (permalink / raw) To: Bart Van Assche Cc: FUJITA Tomonori, rdreier, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel On Wed, 2008-01-30 at 09:38 +0100, Bart Van Assche wrote: > On Jan 30, 2008 12:32 AM, FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> wrote: > > > > iSER has parameters to limit the maximum size of RDMA (it needs to > > repeat RDMA with a poor configuration)? > > Please specify which parameters you are referring to. As you know I > had already repeated my tests with ridiculously high values for the > following iSER parameters: FirstBurstLength, MaxBurstLength and > MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block > size specified to dd). The 1 MB block size is a bit of a red herring. Unless you've specifically increased max_sectors and are using an sg_chain converted driver, on x86 the maximum possible transfer accumulation is 0.5 MB. I certainly don't rule out that increasing the transfer size up from 0.5 MB might be the way to improve STGT efficiency, since at a 1 GB/s theoretical peak that's roughly 2000 context switches per second; however, it doesn't look like you've done anything that will overcome the block layer limitations. James ^ permalink raw reply [flat|nested] 148+ messages in thread
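For readers who want to experiment with the transfer-size limit James mentions, the per-queue limit is visible and tunable through sysfs on recent 2.6 kernels; this is only a sketch, and whether the low-level driver and fabric actually accept larger requests is a separate question:

  # Hardware limit and current software limit for the imported disk:
  cat /sys/block/sdc/queue/max_hw_sectors_kb
  cat /sys/block/sdc/queue/max_sectors_kb
  # Ask the block layer to build larger requests (bounded by the hardware limit):
  echo 1024 > /sys/block/sdc/queue/max_sectors_kb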
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-30 16:34 ` James Bottomley @ 2008-01-30 16:50 ` Bart Van Assche 2008-02-02 15:32 ` Pete Wyckoff 1 sibling, 0 replies; 148+ messages in thread From: Bart Van Assche @ 2008-01-30 16:50 UTC (permalink / raw) To: James Bottomley Cc: FUJITA Tomonori, rdreier, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel On Jan 30, 2008 5:34 PM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > On Wed, 2008-01-30 at 09:38 +0100, Bart Van Assche wrote: > > On Jan 30, 2008 12:32 AM, FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> wrote: > > > > > > iSER has parameters to limit the maximum size of RDMA (it needs to > > > repeat RDMA with a poor configuration)? > > > > Please specify which parameters you are referring to. As you know I > > had already repeated my tests with ridiculously high values for the > > following iSER parameters: FirstBurstLength, MaxBurstLength and > > MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block > > size specified to dd). > > the 1Mb block size is a bit of a red herring. Unless you've > specifically increased the max_sector_size and are using an sg_chain > converted driver, on x86 the maximum possible transfer accumulation is > 0.5MB. I did not publish the results, but I have also done tests with other block sizes. The other sizes I tested were between 0.1MB and 10MB. The performance difference for these other sizes compared to a block size of 1MB was small (smaller than the variance between individual tests results). Bart. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-30 16:34 ` James Bottomley 2008-01-30 16:50 ` Bart Van Assche @ 2008-02-02 15:32 ` Pete Wyckoff 1 sibling, 0 replies; 148+ messages in thread From: Pete Wyckoff @ 2008-02-02 15:32 UTC (permalink / raw) To: James Bottomley Cc: Bart Van Assche, FUJITA Tomonori, rdreier, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel James.Bottomley@HansenPartnership.com wrote on Wed, 30 Jan 2008 10:34 -0600: > On Wed, 2008-01-30 at 09:38 +0100, Bart Van Assche wrote: > > On Jan 30, 2008 12:32 AM, FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> wrote: > > > > > > iSER has parameters to limit the maximum size of RDMA (it needs to > > > repeat RDMA with a poor configuration)? > > > > Please specify which parameters you are referring to. As you know I > > had already repeated my tests with ridiculously high values for the > > following iSER parameters: FirstBurstLength, MaxBurstLength and > > MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block > > size specified to dd). > > the 1Mb block size is a bit of a red herring. Unless you've > specifically increased the max_sector_size and are using an sg_chain > converted driver, on x86 the maximum possible transfer accumulation is > 0.5MB. > > I certainly don't rule out that increasing the transfer size up from > 0.5MB might be the way to improve STGT efficiency, since at an 1GB/s > theoretical peak, that's roughly 2000 context switches per I/O; however, > It doesn't look like you've done anything that will overcome the block > layer limitations. The MRDSL parameter has no effect on iSER, as the RFC describes. How to transfer data to satisfy a command is solely up to the target. So you would need both big requests from the client, then look at how the target will send the data. I've only used 512 kB for the RDMA transfer size from the target, as it matches the default client size and was enough to get good performance out of my IB gear and minimizes resource consumption on the target. It's currently hard-coded as a #define. There is no provision in the protocol for the client to dictate the value. If others want to spend some effort trying to tune stgt for iSER, there are a fair number of comments in the code, including a big one that explains this RDMA transfer size issue. And I'll answer informed questions as I can. But I'm not particularly interested in arguing about which implementation is best, or trying to interpret bandwidth comparison numbers from poorly designed tests. It takes work to understand these issues. -- Pete ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-30 8:38 ` Bart Van Assche 2008-01-30 10:56 ` FUJITA Tomonori 2008-01-30 16:34 ` James Bottomley @ 2008-02-05 17:01 ` Erez Zilber 2008-02-06 12:16 ` Bart Van Assche 2 siblings, 1 reply; 148+ messages in thread From: Erez Zilber @ 2008-02-05 17:01 UTC (permalink / raw) To: Bart Van Assche Cc: FUJITA Tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel Bart Van Assche wrote: > On Jan 30, 2008 12:32 AM, FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> wrote: > >> iSER has parameters to limit the maximum size of RDMA (it needs to >> repeat RDMA with a poor configuration)? >> > > Please specify which parameters you are referring to. As you know I > had already repeated my tests with ridiculously high values for the > following iSER parameters: FirstBurstLength, MaxBurstLength and > MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block > size specified to dd). > > Using such large values for FirstBurstLength will give you poor performance numbers for WRITE commands (with iSER). FirstBurstLength means how much data should you send as unsolicited data (i.e. without RDMA). It means that your WRITE commands were sent without RDMA. Erez ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 17:01 ` Erez Zilber @ 2008-02-06 12:16 ` Bart Van Assche 2008-02-06 16:45 ` Benny Halevy ` (2 more replies) 0 siblings, 3 replies; 148+ messages in thread From: Bart Van Assche @ 2008-02-06 12:16 UTC (permalink / raw) To: Erez Zilber Cc: FUJITA Tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel On Feb 5, 2008 6:01 PM, Erez Zilber <erezz@voltaire.com> wrote: > > Using such large values for FirstBurstLength will give you poor > performance numbers for WRITE commands (with iSER). FirstBurstLength > means how much data should you send as unsolicited data (i.e. without > RDMA). It means that your WRITE commands were sent without RDMA. Sorry, but I'm afraid you got this wrong. When the iSER transport is used instead of TCP, all data is sent via RDMA, including unsolicited data. If you have look at the iSER implementation in the Linux kernel (source files under drivers/infiniband/ulp/iser), you will see that all data is transferred via RDMA and not via TCP/IP. Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-06 12:16 ` Bart Van Assche @ 2008-02-06 16:45 ` Benny Halevy 2008-02-06 17:06 ` Roland Dreier 2008-02-18 9:43 ` Erez Zilber 2 siblings, 0 replies; 148+ messages in thread From: Benny Halevy @ 2008-02-06 16:45 UTC (permalink / raw) To: Bart Van Assche Cc: Erez Zilber, FUJITA Tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel On Feb. 06, 2008, 14:16 +0200, "Bart Van Assche" <bart.vanassche@gmail.com> wrote: > On Feb 5, 2008 6:01 PM, Erez Zilber <erezz@voltaire.com> wrote: >> Using such large values for FirstBurstLength will give you poor >> performance numbers for WRITE commands (with iSER). FirstBurstLength >> means how much data should you send as unsolicited data (i.e. without >> RDMA). It means that your WRITE commands were sent without RDMA. > > Sorry, but I'm afraid you got this wrong. When the iSER transport is > used instead of TCP, all data is sent via RDMA, including unsolicited > data. If you have look at the iSER implementation in the Linux kernel > (source files under drivers/infiniband/ulp/iser), you will see that > all data is transferred via RDMA and not via TCP/IP. Regardless of what the current implementation is, the behavior you (Bart) describe seems to disagree with http://www.ietf.org/rfc/rfc5046.txt. Benny > > Bart Van Assche. > - ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-06 12:16 ` Bart Van Assche 2008-02-06 16:45 ` Benny Halevy @ 2008-02-06 17:06 ` Roland Dreier 2008-02-18 9:43 ` Erez Zilber 2 siblings, 0 replies; 148+ messages in thread From: Roland Dreier @ 2008-02-06 17:06 UTC (permalink / raw) To: Bart Van Assche Cc: Erez Zilber, FUJITA Tomonori, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel > Sorry, but I'm afraid you got this wrong. When the iSER transport is > used instead of TCP, all data is sent via RDMA, including unsolicited > data. If you have look at the iSER implementation in the Linux kernel > (source files under drivers/infiniband/ulp/iser), you will see that > all data is transferred via RDMA and not via TCP/IP. I think the confusion here is caused by a slight misuse of the term "RDMA". It is true that all data is always transported over an InfiniBand connection when iSER is used, but not all such transfers are one-sided RDMA operations; some data can be transferred using send/receive operations. - R. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-06 12:16 ` Bart Van Assche 2008-02-06 16:45 ` Benny Halevy 2008-02-06 17:06 ` Roland Dreier @ 2008-02-18 9:43 ` Erez Zilber 2008-02-18 11:01 ` Bart Van Assche 2 siblings, 1 reply; 148+ messages in thread From: Erez Zilber @ 2008-02-18 9:43 UTC (permalink / raw) To: Bart Van Assche Cc: FUJITA Tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel Bart Van Assche wrote: > On Feb 5, 2008 6:01 PM, Erez Zilber <erezz@voltaire.com> wrote: > >> Using such large values for FirstBurstLength will give you poor >> performance numbers for WRITE commands (with iSER). FirstBurstLength >> means how much data should you send as unsolicited data (i.e. without >> RDMA). It means that your WRITE commands were sent without RDMA. >> > > Sorry, but I'm afraid you got this wrong. When the iSER transport is > used instead of TCP, all data is sent via RDMA, including unsolicited > data. If you have look at the iSER implementation in the Linux kernel > (source files under drivers/infiniband/ulp/iser), you will see that > all data is transferred via RDMA and not via TCP/IP. > > When you execute WRITE commands with iSCSI, it works like this: EDTL (Expected data length) - the data length of your command FirstBurstLength - the length of data that will be sent as unsolicited data (i.e. as immediate data with the SCSI command and as unsolicited data-out PDUs) If you use a high value for FirstBurstLength, all (or most) of your data will be sent as unsolicited data-out PDUs. These PDUs don't use the RDMA engine, so you miss the advantage of IB. If you use a lower value for FirstBurstLength, EDTL - FirstBurstLength bytes will be sent as solicited data-out PDUs. With iSER, solicited data-out PDUs are RDMA operations. I hope that I'm more clear now. Erez ^ permalink raw reply [flat|nested] 148+ messages in thread
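For completeness, with the open-iscsi initiator these negotiation parameters are set per node, either in iscsid.conf or with iscsiadm; the target name, portal and values below are placeholders, the target must accept the offered values, and lowering FirstBurstLength as described above moves more of each WRITE into solicited (RDMA) transfers:

  iscsiadm -m node -T iqn.2008-02.example:store0 -p 192.168.1.10 \
           -o update -n node.session.iscsi.FirstBurstLength -v 65536
  iscsiadm -m node -T iqn.2008-02.example:store0 -p 192.168.1.10 \
           -o update -n node.session.iscsi.MaxBurstLength -v 16776192
  # After logging in again, the values actually negotiated can be inspected with:
  iscsiadm -m session -P 3 | grep -i burst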
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-18 9:43 ` Erez Zilber @ 2008-02-18 11:01 ` Bart Van Assche 2008-02-20 7:34 ` Erez Zilber 0 siblings, 1 reply; 148+ messages in thread From: Bart Van Assche @ 2008-02-18 11:01 UTC (permalink / raw) To: Erez Zilber Cc: FUJITA Tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel On Feb 18, 2008 10:43 AM, Erez Zilber <erezz@voltaire.com> wrote: > If you use a high value for FirstBurstLength, all (or most) of your data > will be sent as unsolicited data-out PDUs. These PDUs don't use the RDMA > engine, so you miss the advantage of IB. Hello Erez, Did you notice the e-mail Roland Dreier wrote on Februari 6, 2008 ? This is what Roland wrote: > I think the confusion here is caused by a slight misuse of the term > "RDMA". It is true that all data is always transported over an > InfiniBand connection when iSER is used, but not all such transfers > are one-sided RDMA operations; some data can be transferred using > send/receive operations. Or: data sent during the first burst is not transferred via one-sided remote memory reads or writes but via two-sided send/receive operations. At least on my setup, these operations are as fast as one-sided remote memory reads or writes. As an example, I obtained the following numbers on my setup (SDR 4x network); ib_write_bw: 933 MB/s. ib_read_bw: 905 MB/s. ib_send_bw: 931 MB/s. Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
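For readers unfamiliar with these micro-benchmarks: they come from the OFED perftest package and are run pairwise, roughly as follows (the host name is a placeholder; message size and iteration options are left at their defaults here):

  # On the server node:
  ib_send_bw
  # On the client node, pointing at the server:
  ib_send_bw <server-hostname>
  # The same pattern applies to ib_read_bw and ib_write_bw.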
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-18 11:01 ` Bart Van Assche @ 2008-02-20 7:34 ` Erez Zilber 2008-02-20 8:41 ` Bart Van Assche 0 siblings, 1 reply; 148+ messages in thread From: Erez Zilber @ 2008-02-20 7:34 UTC (permalink / raw) To: Bart Van Assche Cc: FUJITA Tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel Bart Van Assche wrote: > On Feb 18, 2008 10:43 AM, Erez Zilber <erezz@voltaire.com> wrote: > >> If you use a high value for FirstBurstLength, all (or most) of your data >> will be sent as unsolicited data-out PDUs. These PDUs don't use the RDMA >> engine, so you miss the advantage of IB. >> > > Hello Erez, > > Did you notice the e-mail Roland Dreier wrote on Februari 6, 2008 ? > This is what Roland wrote: > >> I think the confusion here is caused by a slight misuse of the term >> "RDMA". It is true that all data is always transported over an >> InfiniBand connection when iSER is used, but not all such transfers >> are one-sided RDMA operations; some data can be transferred using >> send/receive operations. >> > > Yes, I saw that. I tried to give an explanation with more details. > Or: data sent during the first burst is not transferred via one-sided > remote memory reads or writes but via two-sided send/receive > operations. At least on my setup, these operations are as fast as > one-sided remote memory reads or writes. As an example, I obtained the > following numbers on my setup (SDR 4x network); > ib_write_bw: 933 MB/s. > ib_read_bw: 905 MB/s. > ib_send_bw: 931 MB/s. > > According to these numbers one can think that you don't need RDMA at all, just send iSCSI PDUs over IB. The benchmarks that you use are synthetic IB benchmarks that are not equivalent to iSCSI over iSER. They just send IB packets. I'm not surprised that you got more or less the same performance because, AFAIK, ib_send_bw doesn't copy data (unlike iSCSI that has to copy data that is sent/received without RDMA). When you use RDMA with iSCSI (i.e. iSER), you don't need to create iSCSI PDUs and process them. The CPU is not busy as it is with iSCSI over TCP because no data copies are required. Another advantage is that you don't need header/data digest because the IB HW does that. Erez ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-20 7:34 ` Erez Zilber @ 2008-02-20 8:41 ` Bart Van Assche 0 siblings, 0 replies; 148+ messages in thread From: Bart Van Assche @ 2008-02-20 8:41 UTC (permalink / raw) To: Erez Zilber Cc: FUJITA Tomonori, rdreier, James.Bottomley, torvalds, akpm, vst, linux-scsi, scst-devel, linux-kernel On Feb 20, 2008 8:34 AM, Erez Zilber <erezz@voltaire.com> wrote: > Bart Van Assche wrote: > > Or: data sent during the first burst is not transferred via one-sided > > remote memory reads or writes but via two-sided send/receive > > operations. At least on my setup, these operations are as fast as > > one-sided remote memory reads or writes. As an example, I obtained the > > following numbers on my setup (SDR 4x network); > > ib_write_bw: 933 MB/s. > > ib_read_bw: 905 MB/s. > > ib_send_bw: 931 MB/s. > > According to these numbers one can think that you don't need RDMA at > all, just send iSCSI PDUs over IB. Sorry, but you are misinterpreting what I wrote. > The benchmarks that you use are > synthetic IB benchmarks that are not equivalent to iSCSI over iSER. They > just send IB packets. I'm not surprised that you got more or less the > same performance because, AFAIK, ib_send_bw doesn't copy data (unlike > iSCSI that has to copy data that is sent/received without RDMA). I agree that ib_write_bw / ib_read_bw / ib_send_bw performance results are not equivalent to iSCSI over iSER. The reason that I included these performance results was to illustrate that two-sided data transfers over IB are about as fast as one-sided data transfers. > When you use RDMA with iSCSI (i.e. iSER), you don't need to create iSCSI > PDUs and process them. The CPU is not busy as it is with iSCSI over TCP > because no data copies are required. Another advantage is that you don't > need header/data digest because the IB HW does that. As far as I know, when using iSER, the FirstBurstLength bytes of data are sent via two-sided data transfers, and there is no CPU intervention required to transfer the data itself over the IB network. Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-29 23:32 ` FUJITA Tomonori 2008-01-30 1:15 ` [Scst-devel] " Vu Pham 2008-01-30 8:38 ` Bart Van Assche @ 2008-01-30 11:18 ` Vladislav Bolkhovitin 2 siblings, 0 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-01-30 11:18 UTC (permalink / raw) To: FUJITA Tomonori Cc: rdreier, James.Bottomley, bart.vanassche, torvalds, akpm, linux-scsi, scst-devel, linux-kernel FUJITA Tomonori wrote: > On Tue, 29 Jan 2008 13:31:52 -0800 > Roland Dreier <rdreier@cisco.com> wrote: > > >> > . . STGT read SCST read . STGT read SCST read . >> > . . performance performance . performance performance . >> > . . (0.5K, MB/s) (0.5K, MB/s) . (1 MB, MB/s) (1 MB, MB/s) . >> > . iSER (8 Gb/s network) . 250 N/A . 360 N/A . >> > . SRP (8 Gb/s network) . N/A 421 . N/A 683 . >> >> > On the comparable figures, which only seem to be IPoIB they're showing a >> > 13-18% variance, aren't they? Which isn't an incredible difference. >> >>Maybe I'm all wet, but I think iSER vs. SRP should be roughly >>comparable. The exact formatting of various messages etc. is >>different but the data path using RDMA is pretty much identical. So >>the big difference between STGT iSER and SCST SRP hints at some big >>difference in the efficiency of the two implementations. > > > iSER has parameters to limit the maximum size of RDMA (it needs to > repeat RDMA with a poor configuration)? > > > Anyway, here's the results from Robin Humble: > > iSER to 7G ramfs, x86_64, centos4.6, 2.6.22 kernels, git tgtd, > initiator end booted with mem=512M, target with 8G ram > > direct i/o dd > write/read 800/751 MB/s > dd if=/dev/zero of=/dev/sdc bs=1M count=5000 oflag=direct > dd of=/dev/null if=/dev/sdc bs=1M count=5000 iflag=direct > > http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg13502.html > > I think that STGT is pretty fast with the fast backing storage. How fast SCST will be on the same hardware? > I don't think that there is the notable perfornace difference between > kernel-space and user-space SRP (or ISER) implementations about moving > data between hosts. IB is expected to enable user-space applications > to move data between hosts quickly (if not, what can IB provide us?). > > I think that the question is how fast user-space applications can do > I/Os ccompared with I/Os in kernel space. STGT is eager for the advent > of good asynchronous I/O and event notification interfances. > > One more possible optimization for STGT is zero-copy data > transfer. STGT uses pre-registered buffers and move data between page > cache and thsse buffers, and then does RDMA transfer. If we implement > own caching mechanism to use pre-registered buffers directly with (AIO > and O_DIRECT), then STGT can move data without data copies. Great! So, you are going to duplicate Linux page cache in the user space. You will continue keeping the in-kernel code as small as possible and its mainteinership effort as low as possible by the cost that the user space part's code size and complexity (and, hence, its mainteinership effort) will rocket to the sky. Apparently, this doesn't look like a good design decision. > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-29 20:42 ` James Bottomley 2008-01-29 21:31 ` Roland Dreier @ 2008-01-30 8:29 ` Bart Van Assche 2008-01-30 16:22 ` James Bottomley 2008-01-30 11:17 ` Vladislav Bolkhovitin 2 siblings, 1 reply; 148+ messages in thread From: Bart Van Assche @ 2008-01-30 8:29 UTC (permalink / raw) To: James Bottomley Cc: Linus Torvalds, Andrew Morton, Vladislav Bolkhovitin, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel On Jan 29, 2008 9:42 PM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > As an SCST user, I would like to see the SCST kernel code integrated > > in the mainstream kernel because of its excellent performance on an > > InfiniBand network. Since the SCST project comprises about 14 KLOC, > > reviewing the SCST code will take considerable time. Who will do this > > reviewing work ? And with regard to the comments made by the > > reviewers: Vladislav, do you have the time to carry out the > > modifications requested by the reviewers ? I expect a.o. that > > reviewers will ask to move SCST's configuration pseudofiles from > > procfs to sysfs. > > The two target architectures perform essentially identical functions, so > there's only really room for one in the kernel. Right at the moment, > it's STGT. Problems in STGT come from the user<->kernel boundary which > can be mitigated in a variety of ways. The fact that the figures are > pretty much comparable on non IB networks shows this. Are you saying that users who need an efficient iSCSI implementation should switch to OpenSolaris ? The OpenSolaris COMSTAR project involves the migration of the existing OpenSolaris iSCSI target daemon from userspace to their kernel. The OpenSolaris developers are spending time on this because they expect a significant performance improvement. > I really need a whole lot more evidence than at worst a 20% performance > difference on IB to pull one implementation out and replace it with > another. Particularly as there's no real evidence that STGT can't be > tweaked to recover the 20% even on IB. My measurements on a 1 GB/s InfiniBand network have shown that the current SCST implementation is able to read data via direct I/O at a rate of 811 MB/s (via SRP) and that the current STGT implementation is able to transfer data at a rate of 589 MB/s (via iSER). That's a performance difference of 38%. And even more importantly, the I/O latency of SCST is significantly lower than that of STGT. This is very important for database workloads -- the I/O pattern caused by database software is close to random I/O, and database software needs low latency I/O in order to run efficiently. In the thread with the title "Performance of SCST versus STGT" on the SCST-devel / STGT-devel mailing lists not only the raw performance numbers were discussed but also which further performance improvements are possible. It became clear that the SCST performance can be improved further by implementing a well known optimization (zero-copy I/O). Fujita Tomonori explained in the same thread that it is possible to improve the performance of STGT further, but that this would require a lot of effort (implementing asynchronous I/O in the kernel and also implementing a new caching mechanism using pre-registered buffers). See also: http://sourceforge.net/mailarchive/forum.php?forum_name=scst-devel&viewmonth=200801&viewday=17 Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-30 8:29 ` Bart Van Assche @ 2008-01-30 16:22 ` James Bottomley 2008-01-30 17:03 ` Bart Van Assche 2008-02-05 7:14 ` [Scst-devel] " Tomasz Chmielewski 0 siblings, 2 replies; 148+ messages in thread From: James Bottomley @ 2008-01-30 16:22 UTC (permalink / raw) To: Bart Van Assche Cc: Linus Torvalds, Andrew Morton, Vladislav Bolkhovitin, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel On Wed, 2008-01-30 at 09:29 +0100, Bart Van Assche wrote: > On Jan 29, 2008 9:42 PM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > > As an SCST user, I would like to see the SCST kernel code integrated > > > in the mainstream kernel because of its excellent performance on an > > > InfiniBand network. Since the SCST project comprises about 14 KLOC, > > > reviewing the SCST code will take considerable time. Who will do this > > > reviewing work ? And with regard to the comments made by the > > > reviewers: Vladislav, do you have the time to carry out the > > > modifications requested by the reviewers ? I expect a.o. that > > > reviewers will ask to move SCST's configuration pseudofiles from > > > procfs to sysfs. > > > > The two target architectures perform essentially identical functions, so > > there's only really room for one in the kernel. Right at the moment, > > it's STGT. Problems in STGT come from the user<->kernel boundary which > > can be mitigated in a variety of ways. The fact that the figures are > > pretty much comparable on non IB networks shows this. > > Are you saying that users who need an efficient iSCSI implementation > should switch to OpenSolaris ? I'd certainly say that's a totally unsupported conclusion. > The OpenSolaris COMSTAR project involves > the migration of the existing OpenSolaris iSCSI target daemon from > userspace to their kernel. The OpenSolaris developers are > spending time on this because they expect a significant performance > improvement. Just because Solaris takes a particular design decision doesn't automatically make it the right course of action. Microsoft once pulled huge gobs of the C library and their windowing system into the kernel in the name of efficiency. It proved not only to be less efficient, but also to degrade their security model. Deciding what lives in userspace and what should be in the kernel lies at the very heart of architectural decisions. However, the argument that "it should be in the kernel because that would make it faster" is pretty much a discredited one. To prevail on that argument, you have to demonstrate that there's no way to enable user space to do the same thing at the same speed. Further, it was the same argument used the last time around when the STGT vs SCST investigation was done. Your own results on non-IB networks show that both architectures perform at the same speed. That tends to support the conclusion that there's something specific about IB that needs to be tweaked or improved for STGT to get it to perform correctly. Furthermore, if you have already decided before testing that SCST is right and that STGT is wrong based on the architectures, it isn't exactly going to increase my confidence in your measurement methodology claiming to show this, now is it? > > I really need a whole lot more evidence than at worst a 20% performance > > difference on IB to pull one implementation out and replace it with > > another. Particularly as there's no real evidence that STGT can't be > > tweaked to recover the 20% even on IB. 
> > My measurements on a 1 GB/s InfiniBand network have shown that the current > SCST implementation is able to read data via direct I/O at a rate of 811 MB/s > (via SRP) and that the current STGT implementation is able to transfer data at a > rate of 589 MB/s (via iSER). That's a performance difference of 38%. > > And even more important, the I/O latency of SCST is significantly > lower than that > of STGT. This is very important for database workloads -- the I/O pattern caused > by database software is close to random I/O, and database software needs low > latency I/O in order to run efficiently. > > In the thread with the title "Performance of SCST versus STGT" on the > SCST-devel / > STGT-devel mailing lists not only the raw performance numbers were discussed but > also which further performance improvements are possible. It became clear that > the SCST performance can be improved further by implementing a well known > optimization (zero-copy I/O). Fujita Tomonori explained in the same > thread that it is > possible to improve the performance of STGT further, but that this would require > a lot of effort (implementing asynchronous I/O in the kernel and also > implementing > a new caching mechanism using pre-registered buffers). These are both features being independently worked on, are they not? Even if they weren't, the combination of the size of SCST in kernel plus the problem of having to find a migration path for the current STGT users still looks to me to involve the greater amount of work. James ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-30 16:22 ` James Bottomley @ 2008-01-30 17:03 ` Bart Van Assche 2008-02-05 7:14 ` [Scst-devel] " Tomasz Chmielewski 1 sibling, 0 replies; 148+ messages in thread From: Bart Van Assche @ 2008-01-30 17:03 UTC (permalink / raw) To: James Bottomley Cc: Linus Torvalds, Andrew Morton, Vladislav Bolkhovitin, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel On Jan 30, 2008 5:22 PM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > ... > Deciding what lives in userspace and what should be in the kernel lies > at the very heart of architectural decisions. However, the argument > that "it should be in the kernel because that would make it faster" is > pretty much a discredited one. To prevail on that argument, you have to > demonstrate that there's no way to enable user space to do the same > thing at the same speed. Further, it was the same argument used the > last time around when the STGT vs SCST investigation was done. Your own > results on non-IB networks show that both architectures perform at the > same speed. That tends to support the conclusion that there's something > specific about IB that needs to be tweaked or improved for STGT to get > it to perform correctly. You should know that given two different implementations in software of the same communication protocol, differences in latency and throughput become more visible as the network latency gets lower and the throughput gets higher. That's why conclusions can only be drawn from the InfiniBand numbers, and not from the 1 Gbit/s Ethernet numbers. Assuming that there is something specific in STGT with regard to InfiniBand is speculation. > Furthermore, if you have already decided before testing that SCST is > right and that STGT is wrong based on the architectures, it isn't > exactly going to increase my confidence in your measurement methodology > claiming to show this, now is it? I did not draw any conclusions from the architecture -- the only data I based my conclusions on were my own performance measurements. > ... > These are both features being independently worked on, are they not? > Even if they weren't, the combination of the size of SCST in kernel plus > the problem of having to find a migration path for the current STGT > users still looks to me to involve the greater amount of work. My proposal was to have both the SCST kernel code and the STGT kernel code in the mainstream Linux kernel. This would make it easier for current STGT users to evaluate SCST. It's too early to choose one of the two projects -- this choice can be made later on. Bart. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-01-30 16:22 ` James Bottomley 2008-01-30 17:03 ` Bart Van Assche @ 2008-02-05 7:14 ` Tomasz Chmielewski 2008-02-05 13:38 ` FUJITA Tomonori 1 sibling, 1 reply; 148+ messages in thread From: Tomasz Chmielewski @ 2008-02-05 7:14 UTC (permalink / raw) To: James Bottomley Cc: Bart Van Assche, Vladislav Bolkhovitin, linux-scsi, linux-kernel, FUJITA Tomonori, scst-devel, Andrew Morton, Linus Torvalds James Bottomley schrieb: > These are both features being independently worked on, are they not? > Even if they weren't, the combination of the size of SCST in kernel plus > the problem of having to find a migration path for the current STGT > users still looks to me to involve the greater amount of work. I don't want to be mean, but does anyone actually use STGT in production? Seriously? In the latest development version of STGT, it's only possible to stop the tgtd target daemon using KILL / 9 signal - which also means all iSCSI initiator connections are corrupted when tgtd target daemon is started again (kernel upgrade, target daemon upgrade, server reboot etc.). Imagine you have to reboot all your NFS clients when you reboot your NFS server. Not only that - your data is probably corrupted, or at least the filesystem deserves checking... -- Tomasz Chmielewski http://wpkg.org ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-05 7:14 ` [Scst-devel] " Tomasz Chmielewski @ 2008-02-05 13:38 ` FUJITA Tomonori 2008-02-05 16:07 ` Tomasz Chmielewski 2008-02-05 17:09 ` Matteo Tescione 0 siblings, 2 replies; 148+ messages in thread From: FUJITA Tomonori @ 2008-02-05 13:38 UTC (permalink / raw) To: mangoo Cc: James.Bottomley, bart.vanassche, vst, linux-scsi, linux-kernel, fujita.tomonori, scst-devel, akpm, torvalds, fujita.tomonori On Tue, 05 Feb 2008 08:14:01 +0100 Tomasz Chmielewski <mangoo@wpkg.org> wrote: > James Bottomley schrieb: > > > These are both features being independently worked on, are they not? > > Even if they weren't, the combination of the size of SCST in kernel plus > > the problem of having to find a migration path for the current STGT > > users still looks to me to involve the greater amount of work. > > I don't want to be mean, but does anyone actually use STGT in > production? Seriously? > > In the latest development version of STGT, it's only possible to stop > the tgtd target daemon using KILL / 9 signal - which also means all > iSCSI initiator connections are corrupted when tgtd target daemon is > started again (kernel upgrade, target daemon upgrade, server reboot etc.). I don't know what "iSCSI initiator connections are corrupted" mean. But if you reboot a server, how can an iSCSI target implementation keep iSCSI tcp connections? > Imagine you have to reboot all your NFS clients when you reboot your NFS > server. Not only that - your data is probably corrupted, or at least the > filesystem deserves checking... ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-05 13:38 ` FUJITA Tomonori @ 2008-02-05 16:07 ` Tomasz Chmielewski 2008-02-05 16:21 ` Ming Zhang 2008-02-05 16:43 ` FUJITA Tomonori 2008-02-05 17:09 ` Matteo Tescione 1 sibling, 2 replies; 148+ messages in thread From: Tomasz Chmielewski @ 2008-02-05 16:07 UTC (permalink / raw) To: FUJITA Tomonori Cc: James.Bottomley, bart.vanassche, vst, linux-scsi, linux-kernel, fujita.tomonori, scst-devel, akpm, torvalds, stgt-devel FUJITA Tomonori schrieb: > On Tue, 05 Feb 2008 08:14:01 +0100 > Tomasz Chmielewski <mangoo@wpkg.org> wrote: > >> James Bottomley schrieb: >> >>> These are both features being independently worked on, are they not? >>> Even if they weren't, the combination of the size of SCST in kernel plus >>> the problem of having to find a migration path for the current STGT >>> users still looks to me to involve the greater amount of work. >> I don't want to be mean, but does anyone actually use STGT in >> production? Seriously? >> >> In the latest development version of STGT, it's only possible to stop >> the tgtd target daemon using KILL / 9 signal - which also means all >> iSCSI initiator connections are corrupted when tgtd target daemon is >> started again (kernel upgrade, target daemon upgrade, server reboot etc.). > > I don't know what "iSCSI initiator connections are corrupted" > mean. But if you reboot a server, how can an iSCSI target > implementation keep iSCSI tcp connections? The problem with tgtd is that you can't start it (configured) in an "atomic" way. Usually, one will start tgtd and it's configuration in a script (I replaced some parameters with "..." to make it shorter and more readable): tgtd tgtadm --op new ... tgtadm --lld iscsi --op new ... However, this won't work - tgtd goes immediately in the background as it is still starting, and the first tgtadm commands will fail: # bash -x tgtd-start + tgtd + tgtadm --op new --mode target ... tgtadm: can't connect to the tgt daemon, Connection refused tgtadm: can't send the request to the tgt daemon, Transport endpoint is not connected + tgtadm --lld iscsi --op new --mode account ... tgtadm: can't connect to the tgt daemon, Connection refused tgtadm: can't send the request to the tgt daemon, Transport endpoint is not connected + tgtadm --lld iscsi --op bind --mode account --tid 1 ... tgtadm: can't find the target + tgtadm --op new --mode logicalunit --tid 1 --lun 1 ... tgtadm: can't find the target + tgtadm --op bind --mode target --tid 1 -I ALL tgtadm: can't find the target + tgtadm --op new --mode target --tid 2 ... + tgtadm --op new --mode logicalunit --tid 2 --lun 1 ... + tgtadm --op bind --mode target --tid 2 -I ALL OK, if tgtd takes longer to start, perhaps it's a good idea to sleep a second right after tgtd? tgtd sleep 1 tgtadm --op new ... tgtadm --lld iscsi --op new ... No, it is not a good idea - if tgtd listens on port 3260 *and* is unconfigured yet, any reconnecting initiator will fail, like below: end_request: I/O error, dev sdb, sector 7045192 Buffer I/O error on device sdb, logical block 880649 lost page write due to I/O error on sdb Aborting journal on device sdb. ext3_abort called. 
EXT3-fs error (device sdb): ext3_journal_start_sb: Detected aborted journal Remounting filesystem read-only end_request: I/O error, dev sdb, sector 7045880 Buffer I/O error on device sdb, logical block 880735 lost page write due to I/O error on sdb end_request: I/O error, dev sdb, sector 6728 Buffer I/O error on device sdb, logical block 841 lost page write due to I/O error on sdb end_request: I/O error, dev sdb, sector 7045192 Buffer I/O error on device sdb, logical block 880649 lost page write due to I/O error on sdb end_request: I/O error, dev sdb, sector 7045880 Buffer I/O error on device sdb, logical block 880735 lost page write due to I/O error on sdb __journal_remove_journal_head: freeing b_frozen_data __journal_remove_journal_head: freeing b_frozen_data Ouch. So the only way to start/restart tgtd reliably is to do hacks which are needed with yet another iSCSI kernel implementation (IET): use iptables. iptables <block iSCSI traffic> tgtd sleep 1 tgtadm --op new ... tgtadm --lld iscsi --op new ... iptables <unblock iSCSI traffic> A bit ugly, isn't it? Having to tinker with a firewall in order to start a daemon is by no means a sign of a well-tested and mature project. That's why I asked how many people use stgt in a production environment - James was worried about a potential migration path for current users. -- Tomasz Chmielewski http://wpkg.org ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-05 16:07 ` Tomasz Chmielewski @ 2008-02-05 16:21 ` Ming Zhang 2008-02-05 16:43 ` FUJITA Tomonori 1 sibling, 0 replies; 148+ messages in thread From: Ming Zhang @ 2008-02-05 16:21 UTC (permalink / raw) To: Tomasz Chmielewski Cc: FUJITA Tomonori, vst, linux-scsi, linux-kernel, James.Bottomley, scst-devel, stgt-devel, akpm, torvalds, fujita.tomonori On Tue, 2008-02-05 at 17:07 +0100, Tomasz Chmielewski wrote: > FUJITA Tomonori schrieb: > > On Tue, 05 Feb 2008 08:14:01 +0100 > > Tomasz Chmielewski <mangoo@wpkg.org> wrote: > > > >> James Bottomley schrieb: > >> > >>> These are both features being independently worked on, are they not? > >>> Even if they weren't, the combination of the size of SCST in kernel plus > >>> the problem of having to find a migration path for the current STGT > >>> users still looks to me to involve the greater amount of work. > >> I don't want to be mean, but does anyone actually use STGT in > >> production? Seriously? > >> > >> In the latest development version of STGT, it's only possible to stop > >> the tgtd target daemon using KILL / 9 signal - which also means all > >> iSCSI initiator connections are corrupted when tgtd target daemon is > >> started again (kernel upgrade, target daemon upgrade, server reboot etc.). > > > > I don't know what "iSCSI initiator connections are corrupted" > > mean. But if you reboot a server, how can an iSCSI target > > implementation keep iSCSI tcp connections? > > The problem with tgtd is that you can't start it (configured) in an > "atomic" way. > Usually, one will start tgtd and it's configuration in a script (I > replaced some parameters with "..." to make it shorter and more readable): > > > tgtd > tgtadm --op new ... > tgtadm --lld iscsi --op new ... > > > However, this won't work - tgtd goes immediately in the background as it > is still starting, and the first tgtadm commands will fail: this should be a easy fix. start tgtd, get port setup ready in forked process, then signal its parent that ready to quit. or set port ready in parent, fork and pass to daemon. > > # bash -x tgtd-start > + tgtd > + tgtadm --op new --mode target ... > tgtadm: can't connect to the tgt daemon, Connection refused > tgtadm: can't send the request to the tgt daemon, Transport endpoint is > not connected > + tgtadm --lld iscsi --op new --mode account ... > tgtadm: can't connect to the tgt daemon, Connection refused > tgtadm: can't send the request to the tgt daemon, Transport endpoint is > not connected > + tgtadm --lld iscsi --op bind --mode account --tid 1 ... > tgtadm: can't find the target > + tgtadm --op new --mode logicalunit --tid 1 --lun 1 ... > tgtadm: can't find the target > + tgtadm --op bind --mode target --tid 1 -I ALL > tgtadm: can't find the target > + tgtadm --op new --mode target --tid 2 ... > + tgtadm --op new --mode logicalunit --tid 2 --lun 1 ... > + tgtadm --op bind --mode target --tid 2 -I ALL > > > OK, if tgtd takes longer to start, perhaps it's a good idea to sleep a > second right after tgtd? > > tgtd > sleep 1 > tgtadm --op new ... > tgtadm --lld iscsi --op new ... > > > No, it is not a good idea - if tgtd listens on port 3260 *and* is > unconfigured yet, any reconnecting initiator will fail, like below: this is another easy fix. tgtd started with unconfigured status and then a tgtadm can configure it and turn it into ready status. those are really minor usability issue. 
( i know it is painful for user, i agree) the major problem here is to discuss in architectural wise, which one is better... linux kernel should have one implementation that is good from foundation... > > end_request: I/O error, dev sdb, sector 7045192 > Buffer I/O error on device sdb, logical block 880649 > lost page write due to I/O error on sdb > Aborting journal on device sdb. > ext3_abort called. > EXT3-fs error (device sdb): ext3_journal_start_sb: Detected aborted journal > Remounting filesystem read-only > end_request: I/O error, dev sdb, sector 7045880 > Buffer I/O error on device sdb, logical block 880735 > lost page write due to I/O error on sdb > end_request: I/O error, dev sdb, sector 6728 > Buffer I/O error on device sdb, logical block 841 > lost page write due to I/O error on sdb > end_request: I/O error, dev sdb, sector 7045192 > Buffer I/O error on device sdb, logical block 880649 > lost page write due to I/O error on sdb > end_request: I/O error, dev sdb, sector 7045880 > Buffer I/O error on device sdb, logical block 880735 > lost page write due to I/O error on sdb > __journal_remove_journal_head: freeing b_frozen_data > __journal_remove_journal_head: freeing b_frozen_data > > > Ouch. > > So the only way to start/restart tgtd reliably is to do hacks which are > needed with yet another iSCSI kernel implementation (IET): use iptables. > > iptables <block iSCSI traffic> > tgtd > sleep 1 > tgtadm --op new ... > tgtadm --lld iscsi --op new ... > iptables <unblock iSCSI traffic> > > > A bit ugly, isn't it? > Having to tinker with a firewall in order to start a daemon is by no > means a sign of a well-tested and mature project. > > That's why I asked how many people use stgt in a production environment > - James was worried about a potential migration path for current users. > > > > -- > Tomasz Chmielewski > http://wpkg.org > > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Microsoft > Defy all challenges. Microsoft(R) Visual Studio 2008. > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > _______________________________________________ > Scst-devel mailing list > Scst-devel@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/scst-devel -- Ming Zhang @#$%^ purging memory... (*!% http://blackmagic02881.wordpress.com/ http://www.linkedin.com/in/blackmagic02881 -------------------------------------------- ^ permalink raw reply [flat|nested] 148+ messages in thread
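Ming Zhang's point above -- have the forked tgtd child report back to its parent only once the management port is set up -- maps onto a standard daemonization pattern. Below is a minimal sketch of that pattern, assuming a placeholder setup_mgmt_socket() rather than tgtd's real initialization code; it is illustrative only, not tgtd source.

/* Minimal sketch of "daemonize, but only report success once ready".
 * setup_mgmt_socket() is a placeholder, not a tgtd internal. */
#include <stdio.h>
#include <unistd.h>

static int setup_mgmt_socket(void)
{
        /* placeholder: bind/listen on the management socket here */
        return 0;                       /* 0 on success, -1 on failure */
}

int main(void)
{
        int pfd[2];
        char status;

        if (pipe(pfd) < 0)
                return 1;

        switch (fork()) {
        case -1:
                return 1;
        case 0:                         /* child: the real daemon */
                close(pfd[0]);
                setsid();
                status = (setup_mgmt_socket() == 0) ? 0 : 1;
                /* tell the parent whether initialization worked */
                write(pfd[1], &status, 1);
                close(pfd[1]);
                if (status)
                        _exit(1);
                for (;;)                /* placeholder for the event loop */
                        pause();
        default:                        /* parent: wait for the child's verdict */
                close(pfd[1]);
                if (read(pfd[0], &status, 1) != 1 || status != 0) {
                        fprintf(stderr, "daemon failed to initialize\n");
                        return 1;
                }
                return 0;               /* safe to run tgtadm now */
        }
}

With something along these lines the exit status of the startup command itself guarantees that tgtadm can connect, so neither the "sleep 1" nor the iptables workaround would be needed.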
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-05 16:07 ` Tomasz Chmielewski 2008-02-05 16:21 ` Ming Zhang @ 2008-02-05 16:43 ` FUJITA Tomonori 1 sibling, 0 replies; 148+ messages in thread From: FUJITA Tomonori @ 2008-02-05 16:43 UTC (permalink / raw) To: mangoo Cc: tomof, James.Bottomley, bart.vanassche, vst, linux-scsi, linux-kernel, fujita.tomonori, scst-devel, akpm, torvalds, stgt-devel, fujita.tomonori On Tue, 05 Feb 2008 17:07:07 +0100 Tomasz Chmielewski <mangoo@wpkg.org> wrote: > FUJITA Tomonori schrieb: > > On Tue, 05 Feb 2008 08:14:01 +0100 > > Tomasz Chmielewski <mangoo@wpkg.org> wrote: > > > >> James Bottomley schrieb: > >> > >>> These are both features being independently worked on, are they not? > >>> Even if they weren't, the combination of the size of SCST in kernel plus > >>> the problem of having to find a migration path for the current STGT > >>> users still looks to me to involve the greater amount of work. > >> I don't want to be mean, but does anyone actually use STGT in > >> production? Seriously? > >> > >> In the latest development version of STGT, it's only possible to stop > >> the tgtd target daemon using KILL / 9 signal - which also means all > >> iSCSI initiator connections are corrupted when tgtd target daemon is > >> started again (kernel upgrade, target daemon upgrade, server reboot etc.). > > > > I don't know what "iSCSI initiator connections are corrupted" > > mean. But if you reboot a server, how can an iSCSI target > > implementation keep iSCSI tcp connections? > > The problem with tgtd is that you can't start it (configured) in an > "atomic" way. > Usually, one will start tgtd and it's configuration in a script (I > replaced some parameters with "..." to make it shorter and more readable): Thanks for the details. So the way to stop the daemon is not related with your problem. It's easily fixable. Can you start a new thread about this on stgt-devel mailing list? When we agree on the interface to start the daemon, I'll implement it. > tgtd > tgtadm --op new ... > tgtadm --lld iscsi --op new ... (snip) > So the only way to start/restart tgtd reliably is to do hacks which are > needed with yet another iSCSI kernel implementation (IET): use iptables. > > iptables <block iSCSI traffic> > tgtd > sleep 1 > tgtadm --op new ... > tgtadm --lld iscsi --op new ... > iptables <unblock iSCSI traffic> > > > A bit ugly, isn't it? > Having to tinker with a firewall in order to start a daemon is by no > means a sign of a well-tested and mature project. > > That's why I asked how many people use stgt in a production environment > - James was worried about a potential migration path for current users. I don't know how many people use stgt in a production environment but I'm not sure that this problem prevents many people from using it in a production environment. You want to reboot a server running target devices while initiators connect to it. Rebooting the target server behind the initiators seldom works. System adminstorators in my workplace reboot storage devices once a year and tell us to shut down the initiator machines that use them before that. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-05 13:38 ` FUJITA Tomonori 2008-02-05 16:07 ` Tomasz Chmielewski @ 2008-02-05 17:09 ` Matteo Tescione 2008-02-06 1:29 ` FUJITA Tomonori 1 sibling, 1 reply; 148+ messages in thread From: Matteo Tescione @ 2008-02-05 17:09 UTC (permalink / raw) To: FUJITA Tomonori, mangoo Cc: vst, linux-scsi, linux-kernel, James.Bottomley, scst-devel, akpm, torvalds, fujita.tomonori On 5-02-2008 14:38, "FUJITA Tomonori" <tomof@acm.org> wrote: > On Tue, 05 Feb 2008 08:14:01 +0100 > Tomasz Chmielewski <mangoo@wpkg.org> wrote: > >> James Bottomley schrieb: >> >>> These are both features being independently worked on, are they not? >>> Even if they weren't, the combination of the size of SCST in kernel plus >>> the problem of having to find a migration path for the current STGT >>> users still looks to me to involve the greater amount of work. >> >> I don't want to be mean, but does anyone actually use STGT in >> production? Seriously? >> >> In the latest development version of STGT, it's only possible to stop >> the tgtd target daemon using KILL / 9 signal - which also means all >> iSCSI initiator connections are corrupted when tgtd target daemon is >> started again (kernel upgrade, target daemon upgrade, server reboot etc.). > > I don't know what "iSCSI initiator connections are corrupted" > mean. But if you reboot a server, how can an iSCSI target > implementation keep iSCSI tcp connections? > > >> Imagine you have to reboot all your NFS clients when you reboot your NFS >> server. Not only that - your data is probably corrupted, or at least the >> filesystem deserves checking... Don't know if matters, but in my setup (iscsi on top of drbd+heartbeat) rebooting the primary server doesn't affect my iscsi traffic, SCST correctly manages stop/crash, by sending unit attention to clients on reconnect. Drbd+heartbeat correctly manages those things too. Still from an end-user POV, i was able to reboot/survive a crash only with SCST, IETD still has reconnect problems and STGT are even worst. Regards, --matteo ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-05 17:09 ` Matteo Tescione @ 2008-02-06 1:29 ` FUJITA Tomonori 2008-02-06 2:01 ` Nicholas A. Bellinger 0 siblings, 1 reply; 148+ messages in thread From: FUJITA Tomonori @ 2008-02-06 1:29 UTC (permalink / raw) To: matteo Cc: tomof, mangoo, vst, linux-scsi, linux-kernel, James.Bottomley, scst-devel, akpm, torvalds, fujita.tomonori On Tue, 05 Feb 2008 18:09:15 +0100 Matteo Tescione <matteo@rmnet.it> wrote: > On 5-02-2008 14:38, "FUJITA Tomonori" <tomof@acm.org> wrote: > > > On Tue, 05 Feb 2008 08:14:01 +0100 > > Tomasz Chmielewski <mangoo@wpkg.org> wrote: > > > >> James Bottomley schrieb: > >> > >>> These are both features being independently worked on, are they not? > >>> Even if they weren't, the combination of the size of SCST in kernel plus > >>> the problem of having to find a migration path for the current STGT > >>> users still looks to me to involve the greater amount of work. > >> > >> I don't want to be mean, but does anyone actually use STGT in > >> production? Seriously? > >> > >> In the latest development version of STGT, it's only possible to stop > >> the tgtd target daemon using KILL / 9 signal - which also means all > >> iSCSI initiator connections are corrupted when tgtd target daemon is > >> started again (kernel upgrade, target daemon upgrade, server reboot etc.). > > > > I don't know what "iSCSI initiator connections are corrupted" > > mean. But if you reboot a server, how can an iSCSI target > > implementation keep iSCSI tcp connections? > > > > > >> Imagine you have to reboot all your NFS clients when you reboot your NFS > >> server. Not only that - your data is probably corrupted, or at least the > >> filesystem deserves checking... > > Don't know if matters, but in my setup (iscsi on top of drbd+heartbeat) > rebooting the primary server doesn't affect my iscsi traffic, SCST correctly > manages stop/crash, by sending unit attention to clients on reconnect. > Drbd+heartbeat correctly manages those things too. > Still from an end-user POV, i was able to reboot/survive a crash only with > SCST, IETD still has reconnect problems and STGT are even worst. Please tell us on stgt-devel mailing list if you see problems. We will try to fix them. Thanks, ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-06 1:29 ` FUJITA Tomonori @ 2008-02-06 2:01 ` Nicholas A. Bellinger 0 siblings, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-06 2:01 UTC (permalink / raw) To: FUJITA Tomonori Cc: matteo, tomof, mangoo, vst, linux-scsi, linux-kernel, James.Bottomley, scst-devel, akpm, torvalds On Wed, 2008-02-06 at 10:29 +0900, FUJITA Tomonori wrote: > On Tue, 05 Feb 2008 18:09:15 +0100 > Matteo Tescione <matteo@rmnet.it> wrote: > > > On 5-02-2008 14:38, "FUJITA Tomonori" <tomof@acm.org> wrote: > > > > > On Tue, 05 Feb 2008 08:14:01 +0100 > > > Tomasz Chmielewski <mangoo@wpkg.org> wrote: > > > > > >> James Bottomley schrieb: > > >> > > >>> These are both features being independently worked on, are they not? > > >>> Even if they weren't, the combination of the size of SCST in kernel plus > > >>> the problem of having to find a migration path for the current STGT > > >>> users still looks to me to involve the greater amount of work. > > >> > > >> I don't want to be mean, but does anyone actually use STGT in > > >> production? Seriously? > > >> > > >> In the latest development version of STGT, it's only possible to stop > > >> the tgtd target daemon using KILL / 9 signal - which also means all > > >> iSCSI initiator connections are corrupted when tgtd target daemon is > > >> started again (kernel upgrade, target daemon upgrade, server reboot etc.). > > > > > > I don't know what "iSCSI initiator connections are corrupted" > > > mean. But if you reboot a server, how can an iSCSI target > > > implementation keep iSCSI tcp connections? > > > > > > > > >> Imagine you have to reboot all your NFS clients when you reboot your NFS > > >> server. Not only that - your data is probably corrupted, or at least the > > >> filesystem deserves checking... > > The TCP connection will drop, remember that the TCP connection state for one side has completely vanished. Depending on iSCSI/iSER ErrorRecoveryLevel that is set, this will mean: 1) Session Recovery, ERL=0 - Restarting the entire nexus and all connections across all of the possible subnets or comm-links. All outstanding un-StatSN acknowledged commands will be returned back to the SCSI subsystem with RETRY status. Once a single connection has been reestablished to start the nexus, the CDBs will be resent. 2) Connection Recovery, ERL=2 - CDBs from the failed connection(s) will be retried (nothing changes in the PDU) to fill the iSCSI CmdSN ordering gap, or be explictly retried with TMR TASK_REASSIGN for ones already acknowledged by the ExpCmdSN that are returned to the initiator in response packets or by way of unsolicited NopINs. > > Don't know if matters, but in my setup (iscsi on top of drbd+heartbeat) > > rebooting the primary server doesn't affect my iscsi traffic, SCST correctly > > manages stop/crash, by sending unit attention to clients on reconnect. > > Drbd+heartbeat correctly manages those things too. > > Still from an end-user POV, i was able to reboot/survive a crash only with > > SCST, IETD still has reconnect problems and STGT are even worst. > > Please tell us on stgt-devel mailing list if you see problems. We will > try to fix them. > FYI, the LIO code also supports rmmoding iscsi_target_mod while at full 10 Gb/sec speed. I think it should be a requirement to be able to control per initiator, per portal group, per LUN, per device, per HBA in the design without restarting any other objects. 
--nab > Thanks, > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 148+ messages in thread
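For readers who have not memorized RFC 3720, the ErrorRecoveryLevel behaviour Nicholas describes can be summarized roughly as below; the snippet is only a restatement of the standard's terminology, not code taken from LIO, SCST or STGT.

/* Rough summary of iSCSI ErrorRecoveryLevel semantics (RFC 3720).
 * Purely illustrative. */
#include <stdio.h>

enum erl { ERL0 = 0, ERL1 = 1, ERL2 = 2 };

static const char *recovery_after_connection_loss(enum erl level)
{
        switch (level) {
        case ERL0:
                /* Session recovery: restart the whole nexus; CDBs not yet
                 * acknowledged via StatSN go back to the SCSI layer for retry. */
                return "session recovery (restart nexus, retry outstanding CDBs)";
        case ERL1:
                /* Digest failure recovery: retransmission of individual PDUs. */
                return "digest failure recovery (PDU retransmission)";
        case ERL2:
                /* Connection recovery: reinstate the connection and retry or
                 * reassign tasks (TMF TASK REASSIGN) without dropping the session. */
                return "connection recovery (retry CDBs / task reassignment)";
        }
        return "unknown";
}

int main(void)
{
        for (int l = ERL0; l <= ERL2; l++)
                printf("ERL=%d: %s\n", l, recovery_after_connection_loss((enum erl)l));
        return 0;
}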
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-29 20:42 ` James Bottomley 2008-01-29 21:31 ` Roland Dreier 2008-01-30 8:29 ` Bart Van Assche @ 2008-01-30 11:17 ` Vladislav Bolkhovitin 2008-02-04 12:27 ` Vladislav Bolkhovitin 2 siblings, 1 reply; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-01-30 11:17 UTC (permalink / raw) To: James Bottomley Cc: Bart Van Assche, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel James Bottomley wrote: > The two target architectures perform essentially identical functions, so > there's only really room for one in the kernel. Right at the moment, > it's STGT. Problems in STGT come from the user<->kernel boundary which > can be mitigated in a variety of ways. The fact that the figures are > pretty much comparable on non IB networks shows this. > > I really need a whole lot more evidence than at worst a 20% performance > difference on IB to pull one implementation out and replace it with > another. Particularly as there's no real evidence that STGT can't be > tweaked to recover the 20% even on IB. James, Although the performance difference between STGT and SCST is apparent, this isn't the only point why SCST is better. I've already written about it many times in various mailing lists, but let me summarize it one more time here. As you know, almost all kernel parts can be done in user space, including all the drivers, networking, I/O management with block/SCSI initiator subsystem and disk cache manager. But does it mean that currently Linux kernel is bad and all the above should be (re)done in user space instead? I believe, not. Linux isn't a microkernel for very pragmatic reasons: simplicity and performance. So, additional important point why SCST is better is simplicity. For SCSI target, especially with hardware target card, data are came from kernel and eventually served by kernel, which does actual I/O or getting/putting data from/to cache. Dividing requests processing between user and kernel space creates unnecessary interface layer(s) and effectively makes the requests processing job distributed with all its complexity and reliability problems. From my point of view, having such distribution, where user space is master side and kernel is slave is rather wrong, because: 1. It makes kernel depend from user program, which services it and provides for it its routines, while the regular paradigm is the opposite: kernel services user space applications. As a direct consequence from it that there is no real protection for the kernel from faults in the STGT core code without excessive effort, which, no surprise, wasn't currently done and, seems, is never going to be done. So, on practice debugging and developing under STGT isn't easier, than if the whole code was in the kernel space, but, actually, harder (see below why). 2. It requires new complicated interface between kernel and user spaces that creates additional maintenance and debugging headaches, which don't exist for kernel only code. Linus Torvalds some time ago perfectly described why it is bad, see http://lkml.org/lkml/2007/4/24/451, http://lkml.org/lkml/2006/7/1/41 and http://lkml.org/lkml/2007/4/24/364. 3. It makes for SCSI target impossible to use (at least, on a simple and sane way) many effective optimizations: zero-copy cached I/O, more control over read-ahead, device queue unplugging-plugging, etc. 
One example of already implemented such features is zero-copy network data transmission, done in simple 260 lines put_page_callback patch. This optimization is especially important for the user space gate (scst_user module), see below for details. The whole point that development for kernel is harder, than for user space, is totally nonsense nowadays. It's different, yes, in some ways more limited, yes, but not harder. For ones who need gdb (I for many years - don't) kernel has kgdb, plus it also has many not available for user space or more limited there debug facilities like lockdep, lockup detection, oprofile, etc. (I don't mention wider choice of more effectively implemented synchronization primitives and not only them). For people who need complicated target devices emulation, like, e.g., in case of VTL (Virtual Tape Library), where there is a need to operate with large mmap'ed memory areas, SCST provides gateway to the user space (scst_user module), but, in contrast with STGT, it's done in regular "kernel - master, user application - slave" paradigm, so it's reliable and no fault in user space device emulator can break kernel and other user space applications. Plus, since SCSI target state machine and memory management are in the kernel, it's very effective and allows only one kernel-user space switch per SCSI command. Also, I should note here, that in the current state STGT in many aspects doesn't fully conform SCSI specifications, especially in area of management events, like Unit Attentions generation and processing, and it doesn't look like somebody cares about it. At the same time, SCST pays big attention to fully conform SCSI specifications, because price of non-conformance is a possible user's data corruption. Returning to performance, modern SCSI transports, e.g. InfiniBand, have as low link latency as 1(!) microsecond. For comparison, the inter-thread context switch time on a modern system is about the same, syscall time - about 0.1 microsecond. So, only ten empty syscalls or one context switch add the same latency as the link. Even 1Gbps Ethernet has less, than 100 microseconds of round-trip latency. You, probably, know, that QLogic Fibre Channel target driver for SCST allows commands being executed either directly from soft IRQ, or from the corresponding thread. There is a steady 5-7% difference in IOPS between those modes on 512 bytes reads on nullio using 4Gbps link. So, a single additional inter-kernel-thread context switch costs 5-7% of IOPS. Another source of additional unavoidable with the user space approach latency is data copy to/from cache. With the fully kernel space approach, cache can be used directly, so no extra copy will be needed. We can estimate how much latency the data copying adds. On the modern systems memory copy throughput is less than 2GB/s, so on 20Gbps InfiniBand link it almost doubles data transfer latency. So, putting code in the user space you should accept the extra latency it adds. Many, if not most, real-life workloads more or less latency, not throughput, bound, so there shouldn't be surprise that single stream "dd if=/dev/sdX of=/dev/null" on initiator gives too low values. Such "benchmark" isn't less important and practical, than all the multithreaded latency insensitive benchmarks, which people like running, because it does essentially the same as most Linux processes do when they read data from files. 
You may object me that the target's backstorage device(s) latency is a lot more, than 1 microsecond, but that is relevant only if data are read/written from/to the actual backstorage media, not from the cache, even from the backstorage device's cache. Nothing prevents target from having 8 or even 64GB of cache, so most even random accesses could be served by it. This is especially important for sync writes. Thus, why SCST is better: 1. It is more simple, because it's monolithic, so all its components are in one place and communicate using direct function calls. Hence, it is smaller, faster, more reliable and maintainable. Currently it's bigger, than STGT, just because it supports more features, see (2). 2. It supports more features: 1 to many pass-through support with all necessary for it functionality, including support for non-disk SCSI devices, like tapes, SGV cache, BLOCKIO, where requests converted to bio's and directly sent to block level (this mode is effective for random mostly workloads with data set size >> memory size on the target), etc. 3. It has better performance and going to have it even better. SCST only now enters in the phase, where it starts exploiting all advantages of being in the kernel. Particularly, zero-copy cached I/O is currently being implemented. 4. It provides safer and more effective interface to emulate target devices in the user space via scst_user module. 5. It much more confirms to SCSI specifications (see above). Vlad ^ permalink raw reply [flat|nested] 148+ messages in thread
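The "extra copy almost doubles the latency" argument above is simple arithmetic and can be checked with the numbers quoted in the mail (about 2 GB/s memory-copy throughput against a 20 Gbit/s link). The sketch below only reproduces that back-of-the-envelope calculation for a hypothetical 64 KB transfer; it is not a measurement.

/* Back-of-the-envelope estimate: wire time vs. one extra data copy.
 * Figures are the ones quoted in the thread, purely illustrative. */
#include <stdio.h>

int main(void)
{
        double transfer_bytes   = 64.0 * 1024;  /* hypothetical 64 KB I/O */
        double link_bytes_per_s = 20e9 / 8;     /* 20 Gbit/s ~= 2.5 GB/s  */
        double copy_bytes_per_s = 2e9;          /* ~2 GB/s memory copy    */

        double wire_us = transfer_bytes / link_bytes_per_s * 1e6;
        double copy_us = transfer_bytes / copy_bytes_per_s * 1e6;

        printf("wire time alone : %.1f us\n", wire_us);
        printf("with extra copy : %.1f us (%.2fx the wire time)\n",
               wire_us + copy_us, (wire_us + copy_us) / wire_us);
        return 0;
}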
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-30 11:17 ` Vladislav Bolkhovitin @ 2008-02-04 12:27 ` Vladislav Bolkhovitin 2008-02-04 13:53 ` Bart Van Assche 2008-02-04 15:30 ` James Bottomley 0 siblings, 2 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-04 12:27 UTC (permalink / raw) To: James Bottomley Cc: Bart Van Assche, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel Vladislav Bolkhovitin wrote: > James Bottomley wrote: > >> The two target architectures perform essentially identical functions, so >> there's only really room for one in the kernel. Right at the moment, >> it's STGT. Problems in STGT come from the user<->kernel boundary which >> can be mitigated in a variety of ways. The fact that the figures are >> pretty much comparable on non IB networks shows this. >> >> I really need a whole lot more evidence than at worst a 20% performance >> difference on IB to pull one implementation out and replace it with >> another. Particularly as there's no real evidence that STGT can't be >> tweaked to recover the 20% even on IB. > > > James, > > Although the performance difference between STGT and SCST is apparent, > this isn't the only point why SCST is better. I've already written about > it many times in various mailing lists, but let me summarize it one more > time here. > > As you know, almost all kernel parts can be done in user space, > including all the drivers, networking, I/O management with block/SCSI > initiator subsystem and disk cache manager. But does it mean that > currently Linux kernel is bad and all the above should be (re)done in > user space instead? I believe, not. Linux isn't a microkernel for very > pragmatic reasons: simplicity and performance. So, additional important > point why SCST is better is simplicity. > > For SCSI target, especially with hardware target card, data are came > from kernel and eventually served by kernel, which does actual I/O or > getting/putting data from/to cache. Dividing requests processing between > user and kernel space creates unnecessary interface layer(s) and > effectively makes the requests processing job distributed with all its > complexity and reliability problems. From my point of view, having such > distribution, where user space is master side and kernel is slave is > rather wrong, because: > > 1. It makes kernel depend from user program, which services it and > provides for it its routines, while the regular paradigm is the > opposite: kernel services user space applications. As a direct > consequence from it that there is no real protection for the kernel from > faults in the STGT core code without excessive effort, which, no > surprise, wasn't currently done and, seems, is never going to be done. > So, on practice debugging and developing under STGT isn't easier, than > if the whole code was in the kernel space, but, actually, harder (see > below why). > > 2. It requires new complicated interface between kernel and user spaces > that creates additional maintenance and debugging headaches, which don't > exist for kernel only code. Linus Torvalds some time ago perfectly > described why it is bad, see http://lkml.org/lkml/2007/4/24/451, > http://lkml.org/lkml/2006/7/1/41 and http://lkml.org/lkml/2007/4/24/364. > > 3. It makes for SCSI target impossible to use (at least, on a simple and > sane way) many effective optimizations: zero-copy cached I/O, more > control over read-ahead, device queue unplugging-plugging, etc. 
One > example of already implemented such features is zero-copy network data > transmission, done in simple 260 lines put_page_callback patch. This > optimization is especially important for the user space gate (scst_user > module), see below for details. > > The whole point that development for kernel is harder, than for user > space, is totally nonsense nowadays. It's different, yes, in some ways > more limited, yes, but not harder. For ones who need gdb (I for many > years - don't) kernel has kgdb, plus it also has many not available for > user space or more limited there debug facilities like lockdep, lockup > detection, oprofile, etc. (I don't mention wider choice of more > effectively implemented synchronization primitives and not only them). > > For people who need complicated target devices emulation, like, e.g., in > case of VTL (Virtual Tape Library), where there is a need to operate > with large mmap'ed memory areas, SCST provides gateway to the user space > (scst_user module), but, in contrast with STGT, it's done in regular > "kernel - master, user application - slave" paradigm, so it's reliable > and no fault in user space device emulator can break kernel and other > user space applications. Plus, since SCSI target state machine and > memory management are in the kernel, it's very effective and allows only > one kernel-user space switch per SCSI command. > > Also, I should note here, that in the current state STGT in many aspects > doesn't fully conform SCSI specifications, especially in area of > management events, like Unit Attentions generation and processing, and > it doesn't look like somebody cares about it. At the same time, SCST > pays big attention to fully conform SCSI specifications, because price > of non-conformance is a possible user's data corruption. > > Returning to performance, modern SCSI transports, e.g. InfiniBand, have > as low link latency as 1(!) microsecond. For comparison, the > inter-thread context switch time on a modern system is about the same, > syscall time - about 0.1 microsecond. So, only ten empty syscalls or one > context switch add the same latency as the link. Even 1Gbps Ethernet has > less, than 100 microseconds of round-trip latency. > > You, probably, know, that QLogic Fibre Channel target driver for SCST > allows commands being executed either directly from soft IRQ, or from > the corresponding thread. There is a steady 5-7% difference in IOPS > between those modes on 512 bytes reads on nullio using 4Gbps link. So, a > single additional inter-kernel-thread context switch costs 5-7% of IOPS. > > Another source of additional unavoidable with the user space approach > latency is data copy to/from cache. With the fully kernel space > approach, cache can be used directly, so no extra copy will be needed. > We can estimate how much latency the data copying adds. On the modern > systems memory copy throughput is less than 2GB/s, so on 20Gbps > InfiniBand link it almost doubles data transfer latency. > > So, putting code in the user space you should accept the extra latency > it adds. Many, if not most, real-life workloads more or less latency, > not throughput, bound, so there shouldn't be surprise that single stream > "dd if=/dev/sdX of=/dev/null" on initiator gives too low values. Such > "benchmark" isn't less important and practical, than all the > multithreaded latency insensitive benchmarks, which people like running, > because it does essentially the same as most Linux processes do when > they read data from files. 
> > You may object me that the target's backstorage device(s) latency is a > lot more, than 1 microsecond, but that is relevant only if data are > read/written from/to the actual backstorage media, not from the cache, > even from the backstorage device's cache. Nothing prevents target from > having 8 or even 64GB of cache, so most even random accesses could be > served by it. This is especially important for sync writes. > > Thus, why SCST is better: > > 1. It is more simple, because it's monolithic, so all its components are > in one place and communicate using direct function calls. Hence, it is > smaller, faster, more reliable and maintainable. Currently it's bigger, > than STGT, just because it supports more features, see (2). > > 2. It supports more features: 1 to many pass-through support with all > necessary for it functionality, including support for non-disk SCSI > devices, like tapes, SGV cache, BLOCKIO, where requests converted to > bio's and directly sent to block level (this mode is effective for > random mostly workloads with data set size >> memory size on the > target), etc. > > 3. It has better performance and going to have it even better. SCST only > now enters in the phase, where it starts exploiting all advantages of > being in the kernel. Particularly, zero-copy cached I/O is currently > being implemented. > > 4. It provides safer and more effective interface to emulate target > devices in the user space via scst_user module. > > 5. It much more confirms to SCSI specifications (see above). So, James, what is your opinion on the above? Or the overall SCSI target project simplicity doesn't matter much for you and you think it's fine to duplicate Linux page cache in the user space to keep the in-kernel part of the project as small as possible? Vlad ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 12:27 ` Vladislav Bolkhovitin @ 2008-02-04 13:53 ` Bart Van Assche 2008-02-04 17:00 ` David Dillow ` (2 more replies) 2008-02-04 15:30 ` James Bottomley 1 sibling, 3 replies; 148+ messages in thread From: Bart Van Assche @ 2008-02-04 13:53 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: James Bottomley, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel, Nicholas A. Bellinger On Feb 4, 2008 1:27 PM, Vladislav Bolkhovitin <vst@vlnb.net> wrote: > > So, James, what is your opinion on the above? Or the overall SCSI target > project simplicity doesn't matter much for you and you think it's fine > to duplicate Linux page cache in the user space to keep the in-kernel > part of the project as small as possible? It's too early to draw conclusions about performance. I'm currently performing more measurements, and the results are not easy to interpret. My plan is to measure the following: * Setup: target with RAM disk of 2 GB as backing storage. * Throughput reported by dd and xdd (direct I/O). * Transfers with dd/xdd in units of 1 KB to 1 GB (the smallest transfer size that can be specified to xdd is 1 KB). * Target SCSI software to be tested: IETD iSCSI via IPoIB, STGT iSCSI via IPoIB, STGT iSER, SCST iSCSI via IPoIB, SCST SRP, LIO iSCSI via IPoIB. The reason I chose dd/xdd for these tests is that I want to measure the performance of the communication protocols, and that I am assuming that this performance can be modeled by the following formula: (transfer time in s) = (transfer setup latency in s) + (transfer size in MB) / (bandwidth in MB/s). Measuring the time needed for transfers with varying block size allows to compute the constants in the above formula via linear regression. One difficulty I already encountered is that the performance of the Linux IPoIB implementation varies a lot under high load (http://bugzilla.kernel.org/show_bug.cgi?id=9883). Another issue I have to look further into is that dd and xdd report different results for very large block sizes (> 1 MB). Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
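Since the plan is to recover the two constants of that model by linear regression, a small self-contained least-squares fit is all that is needed. The sketch below shows the idea with made-up (size, time) pairs; they are placeholders, not Bart's actual dd/xdd timings.

/* Least-squares fit of: time = latency + size / bandwidth.
 * The data points are invented placeholders; real input would
 * come from dd/xdd timings at various block sizes. */
#include <stdio.h>

int main(void)
{
        /* x = transfer size in bytes, y = elapsed time in seconds */
        double x[] = { 4096, 16384, 65536, 262144, 1048576 };
        double y[] = { 7.1e-5, 9.0e-5, 1.7e-4, 4.6e-4, 1.65e-3 };
        int n = sizeof(x) / sizeof(x[0]);

        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
                sx  += x[i];        sy  += y[i];
                sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }

        double slope     = (n * sxy - sx * sy) / (n * sxx - sx * sx); /* s/byte */
        double intercept = (sy - slope * sx) / n;                     /* setup latency */

        printf("setup latency: %.1f us\n", intercept * 1e6);
        printf("bandwidth    : %.1f MB/s\n", 1.0 / slope / 1e6);
        return 0;
}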
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 13:53 ` Bart Van Assche @ 2008-02-04 17:00 ` David Dillow 2008-02-04 17:08 ` Vladislav Bolkhovitin 2008-02-05 16:25 ` Bart Van Assche 2 siblings, 0 replies; 148+ messages in thread From: David Dillow @ 2008-02-04 17:00 UTC (permalink / raw) To: Bart Van Assche Cc: Vladislav Bolkhovitin, James Bottomley, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel, Nicholas A. Bellinger On Mon, 2008-02-04 at 14:53 +0100, Bart Van Assche wrote: > Another issue I have to look further into is that dd and xdd report > different results for very large block sizes (> 1 MB). Be aware that xdd reports 1 MB as 1000000, not 1048576. Though, it looks like dd is the same, so that's probably not helpful. Also, make sure you're passing {i,o}flag=direct to dd if you're using -dio in xdd to be sure you are comparing apples to apples. -- Dave Dillow National Center for Computational Science Oak Ridge National Laboratory (865) 241-6602 office ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 13:53 ` Bart Van Assche 2008-02-04 17:00 ` David Dillow @ 2008-02-04 17:08 ` Vladislav Bolkhovitin 2008-02-05 16:25 ` Bart Van Assche 2 siblings, 0 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-04 17:08 UTC (permalink / raw) To: Bart Van Assche Cc: James Bottomley, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel, Nicholas A. Bellinger Bart Van Assche wrote: > On Feb 4, 2008 1:27 PM, Vladislav Bolkhovitin <vst@vlnb.net> wrote: > >>So, James, what is your opinion on the above? Or the overall SCSI target >>project simplicity doesn't matter much for you and you think it's fine >>to duplicate Linux page cache in the user space to keep the in-kernel >>part of the project as small as possible? > > > It's too early to draw conclusions about performance. I'm currently > performing more measurements, and the results are not easy to > interpret. My plan is to measure the following: > * Setup: target with RAM disk of 2 GB as backing storage. > * Throughput reported by dd and xdd (direct I/O). > * Transfers with dd/xdd in units of 1 KB to 1 GB (the smallest > transfer size that can be specified to xdd is 1 KB). > * Target SCSI software to be tested: IETD iSCSI via IPoIB, STGT iSCSI > via IPoIB, STGT iSER, SCST iSCSI via IPoIB, SCST SRP, LIO iSCSI via > IPoIB. > > The reason I chose dd/xdd for these tests is that I want to measure > the performance of the communication protocols, and that I am assuming > that this performance can be modeled by the following formula: > (transfer time in s) = (transfer setup latency in s) + (transfer size > in MB) / (bandwidth in MB/s). It isn't fully correct, you forgot about link latency. More correct one is: (transfer time) = (transfer setup latency on both initiator and target, consisting from software processing time, including memory copy, if necessary, and PCI setup/transfer time) + (transfer size)/(bandwidth) + (link latency to deliver request for READs or status for WRITES) + (2*(link latency) to deliver R2T/XFER_READY request in case of WRITEs, if necessary (e.g. iSER for small transfers might not need it, but SRP most likely always needs it)). Also you should note that it's correct only in case of single threaded workloads with one outstanding command at time. For other workloads it depends from how well they manage to keep the "link" full in interval from (transfer size)/(transfer time) to bandwidth. > Measuring the time needed for transfers > with varying block size allows to compute the constants in the above > formula via linear regression. Unfortunately, it isn't so easy, see above. > One difficulty I already encountered is that the performance of the > Linux IPoIB implementation varies a lot under high load > (http://bugzilla.kernel.org/show_bug.cgi?id=9883). > > Another issue I have to look further into is that dd and xdd report > different results for very large block sizes (> 1 MB). Look at /proc/scsi_tgt/sgv (for SCST) and you will see, which transfer sizes are actually used. Initiators don't like sending big requests and often split them on smaller ones. Look at this message as well, it might be helpful: http://lkml.org/lkml/2007/5/16/223 > Bart Van Assche. > - > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 148+ messages in thread
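The corrected model above is easy to restate as code. The helper below only encodes the terms listed in the mail (setup latency, size/bandwidth, one link latency for the request or status, plus an extra round trip for writes that need R2T/XFER_READY); the parameters in the example are hypothetical, not measured values.

/* Transfer-time model from the discussion above, for a single
 * outstanding command. All numbers illustrative. */
#include <stdio.h>

static double xfer_time(double setup_s, double size_bytes,
                        double bw_bytes_per_s, double link_lat_s,
                        int is_write, int needs_r2t)
{
        double t = setup_s + size_bytes / bw_bytes_per_s + link_lat_s;
        if (is_write && needs_r2t)
                t += 2.0 * link_lat_s;  /* round trip for R2T/XFER_READY */
        return t;
}

int main(void)
{
        /* hypothetical parameters: 40 us setup, 600 MB/s, 1 us link latency */
        double setup = 40e-6, bw = 600e6, lat = 1e-6;

        printf("4 KB read : %.1f us\n", xfer_time(setup, 4096, bw, lat, 0, 0) * 1e6);
        printf("4 KB write: %.1f us\n", xfer_time(setup, 4096, bw, lat, 1, 1) * 1e6);
        return 0;
}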
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 13:53 ` Bart Van Assche 2008-02-04 17:00 ` David Dillow 2008-02-04 17:08 ` Vladislav Bolkhovitin @ 2008-02-05 16:25 ` Bart Van Assche 2008-02-05 18:18 ` Linus Torvalds 2 siblings, 1 reply; 148+ messages in thread From: Bart Van Assche @ 2008-02-05 16:25 UTC (permalink / raw) To: James Bottomley, FUJITA Tomonori, Vladislav Bolkhovitin, Vu Pham Cc: Linus Torvalds, Andrew Morton, linux-scsi, scst-devel, linux-kernel, Nicholas A. Bellinger

Regarding the performance tests I promised to perform: although until now I only have been able to run two tests (STGT + iSER versus SCST + SRP), the results are interesting. I will run the remaining test cases during the next days.

About the test setup: dd and xdd were used to transfer 2 GB of data between an initiator system and a target system via direct I/O over an SDR InfiniBand network (1GB/s). The block size varied between 512 bytes and 1 GB, but was always a power of two.

Expected results:
* The measurement results are consistent with the numbers I published earlier.
* During data transfers all data is transferred in blocks between 4 KB and 32 KB in size (according to the SCST statistics).
* For small and medium block sizes (<= 32 KB) transfer times can be modeled very well by the following formula: (transfer time) = (setup latency) + (bytes transferred)/(bandwidth). The correlation numbers are very close to one.
* The latency and bandwidth parameters depend on the test tool (dd versus xdd), on the kind of test performed (reading versus writing), on the SCSI target and on the communication protocol.
* When using RDMA (iSER or SRP), SCST has a lower latency and higher bandwidth than STGT (results from linear regression for block sizes <= 32 KB):

Test                    Latency(us)  Bandwidth (MB/s)  Correlation
STGT+iSER, read, dd     64           560               0.999995
STGT+iSER, read, xdd    65           556               0.999994
STGT+iSER, write, dd    53           394               0.999971
STGT+iSER, write, xdd   54           445               0.999959
SCST+SRP, read, dd      39           657               0.999983
SCST+SRP, read, xdd     41           668               0.999987
SCST+SRP, write, dd     52           449               0.999962
SCST+SRP, write, xdd    52           516               0.999977

Results that I did not expect:
* A block transfer size of 1 MB is not enough to measure the maximal throughput. The maximal throughput is only reached at much higher block sizes (about 10 MB for SCST + SRP and about 100 MB for STGT + iSER).
* There is one case where dd and xdd results are inconsistent: when reading via SCST + SRP and for block sizes of about 1 MB.
* For block sizes > 64 KB the measurements differ from the model. This is probably because all initiator-target transfers happen in blocks of 32 KB or less.

For the details and some graphs, see also http://software.qlayer.com/display/iSCSI/Measurements . Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 16:25 ` Bart Van Assche @ 2008-02-05 18:18 ` Linus Torvalds 0 siblings, 0 replies; 148+ messages in thread From: Linus Torvalds @ 2008-02-05 18:18 UTC (permalink / raw) To: Bart Van Assche Cc: James Bottomley, FUJITA Tomonori, Vladislav Bolkhovitin, Vu Pham, Andrew Morton, linux-scsi, scst-devel, linux-kernel, Nicholas A. Bellinger On Tue, 5 Feb 2008, Bart Van Assche wrote: > > Results that I did not expect: > * A block transfer size of 1 MB is not enough to measure the maximal > throughput. The maximal throughput is only reached at much higher > block sizes (about 10 MB for SCST + SRP and about 100 MB for STGT + > iSER). Block transfer sizes over about 64kB are totally irrelevant for 99% of all people. Don't even bother testing anything more. Yes, bigger transfers happen, but a lot of common loads have *smaller* transfers than 64kB. So benchmarks that try to find "theoretical throughput" by just making big transfers should just be banned. They give numbers, yes, but the numbers are pointless. Linus ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 12:27 ` Vladislav Bolkhovitin 2008-02-04 13:53 ` Bart Van Assche @ 2008-02-04 15:30 ` James Bottomley 2008-02-04 16:25 ` Vladislav Bolkhovitin 1 sibling, 1 reply; 148+ messages in thread From: James Bottomley @ 2008-02-04 15:30 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Bart Van Assche, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel On Mon, 2008-02-04 at 15:27 +0300, Vladislav Bolkhovitin wrote: > Vladislav Bolkhovitin wrote: > So, James, what is your opinion on the above? Or the overall SCSI target > project simplicity doesn't matter much for you and you think it's fine > to duplicate Linux page cache in the user space to keep the in-kernel > part of the project as small as possible? The answers were pretty much contained here http://marc.info/?l=linux-scsi&m=120164008302435 and here: http://marc.info/?l=linux-scsi&m=120171067107293 Weren't they? James ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 15:30 ` James Bottomley @ 2008-02-04 16:25 ` Vladislav Bolkhovitin 2008-02-04 17:06 ` James Bottomley 0 siblings, 1 reply; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-04 16:25 UTC (permalink / raw) To: James Bottomley Cc: Bart Van Assche, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel James Bottomley wrote: >>Vladislav Bolkhovitin wrote: >>So, James, what is your opinion on the above? Or the overall SCSI target >>project simplicity doesn't matter much for you and you think it's fine >>to duplicate Linux page cache in the user space to keep the in-kernel >>part of the project as small as possible? > > > The answers were pretty much contained here > > http://marc.info/?l=linux-scsi&m=120164008302435 > > and here: > > http://marc.info/?l=linux-scsi&m=120171067107293 > > Weren't they? No, sorry, it doesn't look so for me. They are about performance, but I'm asking about the overall project's architecture, namely about one part of it: simplicity. Particularly, what do you think about duplicating Linux page cache in the user space to have zero-copy cached I/O? Or can you suggest another architectural solution for that problem in the STGT's approach? Vlad ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 16:25 ` Vladislav Bolkhovitin @ 2008-02-04 17:06 ` James Bottomley 2008-02-04 17:16 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 148+ messages in thread From: James Bottomley @ 2008-02-04 17:06 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Bart Van Assche, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel On Mon, 2008-02-04 at 19:25 +0300, Vladislav Bolkhovitin wrote: > James Bottomley wrote: > >>Vladislav Bolkhovitin wrote: > >>So, James, what is your opinion on the above? Or the overall SCSI target > >>project simplicity doesn't matter much for you and you think it's fine > >>to duplicate Linux page cache in the user space to keep the in-kernel > >>part of the project as small as possible? > > > > > > The answers were pretty much contained here > > > > http://marc.info/?l=linux-scsi&m=120164008302435 > > > > and here: > > > > http://marc.info/?l=linux-scsi&m=120171067107293 > > > > Weren't they? > > No, sorry, it doesn't look so for me. They are about performance, but > I'm asking about the overall project's architecture, namely about one > part of it: simplicity. Particularly, what do you think about > duplicating Linux page cache in the user space to have zero-copy cached > I/O? Or can you suggest another architectural solution for that problem > in the STGT's approach? Isn't that an advantage of a user space solution? It simply uses the backing store of whatever device supplies the data. That means it takes advantage of the existing mechanisms for caching. James ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 17:06 ` James Bottomley @ 2008-02-04 17:16 ` Vladislav Bolkhovitin 2008-02-04 17:25 ` James Bottomley 0 siblings, 1 reply; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-04 17:16 UTC (permalink / raw) To: James Bottomley Cc: Bart Van Assche, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel James Bottomley wrote: >>>>So, James, what is your opinion on the above? Or the overall SCSI target >>>>project simplicity doesn't matter much for you and you think it's fine >>>>to duplicate Linux page cache in the user space to keep the in-kernel >>>>part of the project as small as possible? >>> >>> >>>The answers were pretty much contained here >>> >>>http://marc.info/?l=linux-scsi&m=120164008302435 >>> >>>and here: >>> >>>http://marc.info/?l=linux-scsi&m=120171067107293 >>> >>>Weren't they? >> >>No, sorry, it doesn't look so for me. They are about performance, but >>I'm asking about the overall project's architecture, namely about one >>part of it: simplicity. Particularly, what do you think about >>duplicating Linux page cache in the user space to have zero-copy cached >>I/O? Or can you suggest another architectural solution for that problem >>in the STGT's approach? > > > Isn't that an advantage of a user space solution? It simply uses the > backing store of whatever device supplies the data. That means it takes > advantage of the existing mechanisms for caching. No, please reread this thread, especially this message: http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of the advantages of the kernel space implementation. The user space implementation has to have data copied between the cache and user space buffer, but the kernel space one can use pages in the cache directly, without extra copy. Vlad ^ permalink raw reply [flat|nested] 148+ messages in thread
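To make the copy under discussion concrete, here is a minimal sketch of the user-space READ path; serve_read_cmd() and its arguments are illustrative assumptions, not STGT code. The pread() is the copy out of the page cache that an in-kernel target can avoid by handing the cache pages to the transport directly.

/*
 * Minimal sketch (not STGT code) of the user-space READ path being
 * discussed: pread() copies the blocks from the page cache into a
 * process buffer, and only then can the buffer be handed to the
 * transport.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* all names and arguments here are illustrative assumptions */
int serve_read_cmd(int backing_fd, int sock, off_t lba_off, size_t len,
		   char *buf /* per-command bounce buffer */)
{
	/* copy #1: page cache -> user-space buffer */
	ssize_t got = pread(backing_fd, buf, len, lba_off);
	if (got != (ssize_t)len)
		return -1;

	/* hand the buffer to the transport (which may copy again) */
	return send(sock, buf, len, 0) == (ssize_t)len ? 0 : -1;
}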
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 17:16 ` Vladislav Bolkhovitin @ 2008-02-04 17:25 ` James Bottomley 2008-02-04 17:56 ` Vladislav Bolkhovitin 2008-02-04 18:29 ` Linus Torvalds 0 siblings, 2 replies; 148+ messages in thread From: James Bottomley @ 2008-02-04 17:25 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Bart Van Assche, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote: > James Bottomley wrote: > >>>>So, James, what is your opinion on the above? Or the overall SCSI target > >>>>project simplicity doesn't matter much for you and you think it's fine > >>>>to duplicate Linux page cache in the user space to keep the in-kernel > >>>>part of the project as small as possible? > >>> > >>> > >>>The answers were pretty much contained here > >>> > >>>http://marc.info/?l=linux-scsi&m=120164008302435 > >>> > >>>and here: > >>> > >>>http://marc.info/?l=linux-scsi&m=120171067107293 > >>> > >>>Weren't they? > >> > >>No, sorry, it doesn't look so for me. They are about performance, but > >>I'm asking about the overall project's architecture, namely about one > >>part of it: simplicity. Particularly, what do you think about > >>duplicating Linux page cache in the user space to have zero-copy cached > >>I/O? Or can you suggest another architectural solution for that problem > >>in the STGT's approach? > > > > > > Isn't that an advantage of a user space solution? It simply uses the > > backing store of whatever device supplies the data. That means it takes > > advantage of the existing mechanisms for caching. > > No, please reread this thread, especially this message: > http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of > the advantages of the kernel space implementation. The user space > implementation has to have data copied between the cache and user space > buffer, but the kernel space one can use pages in the cache directly, > without extra copy. Well, you've said it thrice (the bellman cried) but that doesn't make it true. The way a user space solution should work is to schedule mmapped I/O from the backing store and then send this mmapped region off for target I/O. For reads, the page gather will ensure that the pages are up to date from the backing store to the cache before sending the I/O out. For writes, You actually have to do a msync on the region to get the data secured to the backing store. You also have to pull tricks with the mmap region in the case of writes to prevent useless data being read in from the backing store. However, none of this involves data copies. James ^ permalink raw reply [flat|nested] 148+ messages in thread
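A rough user-space illustration of the write-direction sequence described above (map the backing store, let the transport deposit the data into the mapping, msync() to secure it) might look like the sketch below. handle_write_cmd(), the socket argument and the page-aligned offset are assumptions for the example, not actual STGT or SCST code.

/*
 * Rough user-space illustration (not STGT or SCST code) of the
 * mmap/msync write path: map the backing store, let the transport
 * write the initiator's data directly into the mapping, then msync()
 * to secure it to the backing store.
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

int handle_write_cmd(int backing_fd, int sock, off_t lba_off, size_t len)
{
	/* lba_off is assumed to be a multiple of the page size */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 backing_fd, lba_off);
	if (buf == MAP_FAILED)
		return -1;

	/* the transport writes the received data straight into the mapping */
	ssize_t got = recv(sock, buf, len, MSG_WAITALL);
	if (got != (ssize_t)len) {
		munmap(buf, len);
		return -1;
	}

	/* secure the dirty pages to the backing store before sending status */
	int ret = msync(buf, len, MS_SYNC);
	munmap(buf, len);
	return ret;
}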
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 17:25 ` James Bottomley @ 2008-02-04 17:56 ` Vladislav Bolkhovitin 2008-02-04 18:22 ` James Bottomley 2008-02-04 18:29 ` Linus Torvalds 1 sibling, 1 reply; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-04 17:56 UTC (permalink / raw) To: James Bottomley Cc: Bart Van Assche, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel James Bottomley wrote: > On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote: > >>James Bottomley wrote: >> >>>>>>So, James, what is your opinion on the above? Or the overall SCSI target >>>>>>project simplicity doesn't matter much for you and you think it's fine >>>>>>to duplicate Linux page cache in the user space to keep the in-kernel >>>>>>part of the project as small as possible? >>>>> >>>>> >>>>>The answers were pretty much contained here >>>>> >>>>>http://marc.info/?l=linux-scsi&m=120164008302435 >>>>> >>>>>and here: >>>>> >>>>>http://marc.info/?l=linux-scsi&m=120171067107293 >>>>> >>>>>Weren't they? >>>> >>>>No, sorry, it doesn't look so for me. They are about performance, but >>>>I'm asking about the overall project's architecture, namely about one >>>>part of it: simplicity. Particularly, what do you think about >>>>duplicating Linux page cache in the user space to have zero-copy cached >>>>I/O? Or can you suggest another architectural solution for that problem >>>>in the STGT's approach? >>> >>> >>>Isn't that an advantage of a user space solution? It simply uses the >>>backing store of whatever device supplies the data. That means it takes >>>advantage of the existing mechanisms for caching. >> >>No, please reread this thread, especially this message: >>http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of >>the advantages of the kernel space implementation. The user space >>implementation has to have data copied between the cache and user space >>buffer, but the kernel space one can use pages in the cache directly, >>without extra copy. > > > Well, you've said it thrice (the bellman cried) but that doesn't make it > true. > > The way a user space solution should work is to schedule mmapped I/O > from the backing store and then send this mmapped region off for target > I/O. For reads, the page gather will ensure that the pages are up to > date from the backing store to the cache before sending the I/O out. > For writes, You actually have to do a msync on the region to get the > data secured to the backing store. James, have you checked how fast is mmaped I/O if work size > size of RAM? It's several times slower comparing to buffered I/O. It was many times discussed in LKML and, seems, VM people consider it unavoidable. So, using mmaped IO isn't an option for high performance. Plus, mmaped IO isn't an option for high reliability requirements, since it doesn't provide a practical way to handle I/O errors. > You also have to pull tricks with > the mmap region in the case of writes to prevent useless data being read > in from the backing store. Can you be more exact and specify what kind of tricks should be done for that? > However, none of this involves data copies. > > James > > > - > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 17:56 ` Vladislav Bolkhovitin @ 2008-02-04 18:22 ` James Bottomley 2008-02-04 18:38 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 148+ messages in thread From: James Bottomley @ 2008-02-04 18:22 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Bart Van Assche, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel On Mon, 2008-02-04 at 20:56 +0300, Vladislav Bolkhovitin wrote: > James Bottomley wrote: > > On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote: > > > >>James Bottomley wrote: > >> > >>>>>>So, James, what is your opinion on the above? Or the overall SCSI target > >>>>>>project simplicity doesn't matter much for you and you think it's fine > >>>>>>to duplicate Linux page cache in the user space to keep the in-kernel > >>>>>>part of the project as small as possible? > >>>>> > >>>>> > >>>>>The answers were pretty much contained here > >>>>> > >>>>>http://marc.info/?l=linux-scsi&m=120164008302435 > >>>>> > >>>>>and here: > >>>>> > >>>>>http://marc.info/?l=linux-scsi&m=120171067107293 > >>>>> > >>>>>Weren't they? > >>>> > >>>>No, sorry, it doesn't look so for me. They are about performance, but > >>>>I'm asking about the overall project's architecture, namely about one > >>>>part of it: simplicity. Particularly, what do you think about > >>>>duplicating Linux page cache in the user space to have zero-copy cached > >>>>I/O? Or can you suggest another architectural solution for that problem > >>>>in the STGT's approach? > >>> > >>> > >>>Isn't that an advantage of a user space solution? It simply uses the > >>>backing store of whatever device supplies the data. That means it takes > >>>advantage of the existing mechanisms for caching. > >> > >>No, please reread this thread, especially this message: > >>http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of > >>the advantages of the kernel space implementation. The user space > >>implementation has to have data copied between the cache and user space > >>buffer, but the kernel space one can use pages in the cache directly, > >>without extra copy. > > > > > > Well, you've said it thrice (the bellman cried) but that doesn't make it > > true. > > > > The way a user space solution should work is to schedule mmapped I/O > > from the backing store and then send this mmapped region off for target > > I/O. For reads, the page gather will ensure that the pages are up to > > date from the backing store to the cache before sending the I/O out. > > For writes, You actually have to do a msync on the region to get the > > data secured to the backing store. > > James, have you checked how fast is mmaped I/O if work size > size of > RAM? It's several times slower comparing to buffered I/O. It was many > times discussed in LKML and, seems, VM people consider it unavoidable. Erm, but if you're using the case of work size > size of RAM, you'll find buffered I/O won't help because you don't have the memory for buffers either. > So, using mmaped IO isn't an option for high performance. Plus, mmaped > IO isn't an option for high reliability requirements, since it doesn't > provide a practical way to handle I/O errors. I think you'll find it does ... the page gather returns -EFAULT if there's an I/O error in the gathered region. msync does something similar if there's a write failure. > > You also have to pull tricks with > > the mmap region in the case of writes to prevent useless data being read > > in from the backing store. 
> > Can you be more exact and specify what kind of tricks should be done for > that? Actually, just avoid touching it seems to do the trick with a recent kernel. James ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 18:22 ` James Bottomley @ 2008-02-04 18:38 ` Vladislav Bolkhovitin 2008-02-04 18:54 ` James Bottomley 0 siblings, 1 reply; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-04 18:38 UTC (permalink / raw) To: James Bottomley Cc: Bart Van Assche, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel James Bottomley wrote: > On Mon, 2008-02-04 at 20:56 +0300, Vladislav Bolkhovitin wrote: > >>James Bottomley wrote: >> >>>On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote: >>> >>> >>>>James Bottomley wrote: >>>> >>>> >>>>>>>>So, James, what is your opinion on the above? Or the overall SCSI target >>>>>>>>project simplicity doesn't matter much for you and you think it's fine >>>>>>>>to duplicate Linux page cache in the user space to keep the in-kernel >>>>>>>>part of the project as small as possible? >>>>>>> >>>>>>> >>>>>>>The answers were pretty much contained here >>>>>>> >>>>>>>http://marc.info/?l=linux-scsi&m=120164008302435 >>>>>>> >>>>>>>and here: >>>>>>> >>>>>>>http://marc.info/?l=linux-scsi&m=120171067107293 >>>>>>> >>>>>>>Weren't they? >>>>>> >>>>>>No, sorry, it doesn't look so for me. They are about performance, but >>>>>>I'm asking about the overall project's architecture, namely about one >>>>>>part of it: simplicity. Particularly, what do you think about >>>>>>duplicating Linux page cache in the user space to have zero-copy cached >>>>>>I/O? Or can you suggest another architectural solution for that problem >>>>>>in the STGT's approach? >>>>> >>>>> >>>>>Isn't that an advantage of a user space solution? It simply uses the >>>>>backing store of whatever device supplies the data. That means it takes >>>>>advantage of the existing mechanisms for caching. >>>> >>>>No, please reread this thread, especially this message: >>>>http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of >>>>the advantages of the kernel space implementation. The user space >>>>implementation has to have data copied between the cache and user space >>>>buffer, but the kernel space one can use pages in the cache directly, >>>>without extra copy. >>> >>> >>>Well, you've said it thrice (the bellman cried) but that doesn't make it >>>true. >>> >>>The way a user space solution should work is to schedule mmapped I/O >>>from the backing store and then send this mmapped region off for target >>>I/O. For reads, the page gather will ensure that the pages are up to >>>date from the backing store to the cache before sending the I/O out. >>>For writes, You actually have to do a msync on the region to get the >>>data secured to the backing store. >> >>James, have you checked how fast is mmaped I/O if work size > size of >>RAM? It's several times slower comparing to buffered I/O. It was many >>times discussed in LKML and, seems, VM people consider it unavoidable. > > > Erm, but if you're using the case of work size > size of RAM, you'll > find buffered I/O won't help because you don't have the memory for > buffers either. James, just check and you will see, buffered I/O is a lot faster. >>So, using mmaped IO isn't an option for high performance. Plus, mmaped >>IO isn't an option for high reliability requirements, since it doesn't >>provide a practical way to handle I/O errors. > > I think you'll find it does ... the page gather returns -EFAULT if > there's an I/O error in the gathered region. Err, to whom return? 
If you try to read from a mmaped page, which can't be populated due to I/O error, you will get SIGBUS or SIGSEGV, I don't remember exactly. It's quite tricky to get back to the faulted command from the signal handler. Or do you mean mmap(MAP_POPULATE)/munmap() for each command? Do you think that such mapping/unmapping is good for performance? > msync does something > similar if there's a write failure. > >>>You also have to pull tricks with >>>the mmap region in the case of writes to prevent useless data being read >>>in from the backing store. >> >>Can you be more exact and specify what kind of tricks should be done for >>that? > > Actually, just avoid touching it seems to do the trick with a recent > kernel. Hmm, how can one write to an mmaped page and don't touch it? > James > > > ^ permalink raw reply [flat|nested] 148+ messages in thread
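For what the "tricky" part looks like in practice, here is a hedged sketch of the SIGBUS-and-longjmp pattern a single-threaded user-space target would need in order to turn a faulting access to an mmaped page into a per-command error. The names are illustrative, and a kernel-space target would see the I/O error as a return code rather than a signal.

/*
 * Hedged sketch of "getting back to the faulted command from the
 * signal handler": catch the SIGBUS raised by touching an mmaped page
 * that cannot be populated, and long-jump back to per-command context.
 * Illustrative only; not safe as-is for multi-threaded targets.
 */
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf io_err_env;

static void sigbus_handler(int sig)
{
	(void)sig;
	siglongjmp(io_err_env, 1);
}

/* returns 0 on success, -1 if the mapped page could not be populated */
int copy_from_mapping(void *dst, const void *mapped_src, size_t len)
{
	struct sigaction sa, old;
	int ret = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigbus_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	if (sigsetjmp(io_err_env, 1) == 0)
		memcpy(dst, mapped_src, len);	/* may fault -> SIGBUS */
	else
		ret = -1;			/* backing-store I/O error */

	sigaction(SIGBUS, &old, NULL);
	return ret;
}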
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 18:38 ` Vladislav Bolkhovitin @ 2008-02-04 18:54 ` James Bottomley 2008-02-05 18:59 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 148+ messages in thread From: James Bottomley @ 2008-02-04 18:54 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Bart Van Assche, Linus Torvalds, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel On Mon, 2008-02-04 at 21:38 +0300, Vladislav Bolkhovitin wrote: > James Bottomley wrote: > > On Mon, 2008-02-04 at 20:56 +0300, Vladislav Bolkhovitin wrote: > > > >>James Bottomley wrote: > >> > >>>On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote: > >>> > >>> > >>>>James Bottomley wrote: > >>>> > >>>> > >>>>>>>>So, James, what is your opinion on the above? Or the overall SCSI target > >>>>>>>>project simplicity doesn't matter much for you and you think it's fine > >>>>>>>>to duplicate Linux page cache in the user space to keep the in-kernel > >>>>>>>>part of the project as small as possible? > >>>>>>> > >>>>>>> > >>>>>>>The answers were pretty much contained here > >>>>>>> > >>>>>>>http://marc.info/?l=linux-scsi&m=120164008302435 > >>>>>>> > >>>>>>>and here: > >>>>>>> > >>>>>>>http://marc.info/?l=linux-scsi&m=120171067107293 > >>>>>>> > >>>>>>>Weren't they? > >>>>>> > >>>>>>No, sorry, it doesn't look so for me. They are about performance, but > >>>>>>I'm asking about the overall project's architecture, namely about one > >>>>>>part of it: simplicity. Particularly, what do you think about > >>>>>>duplicating Linux page cache in the user space to have zero-copy cached > >>>>>>I/O? Or can you suggest another architectural solution for that problem > >>>>>>in the STGT's approach? > >>>>> > >>>>> > >>>>>Isn't that an advantage of a user space solution? It simply uses the > >>>>>backing store of whatever device supplies the data. That means it takes > >>>>>advantage of the existing mechanisms for caching. > >>>> > >>>>No, please reread this thread, especially this message: > >>>>http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of > >>>>the advantages of the kernel space implementation. The user space > >>>>implementation has to have data copied between the cache and user space > >>>>buffer, but the kernel space one can use pages in the cache directly, > >>>>without extra copy. > >>> > >>> > >>>Well, you've said it thrice (the bellman cried) but that doesn't make it > >>>true. > >>> > >>>The way a user space solution should work is to schedule mmapped I/O > >>>from the backing store and then send this mmapped region off for target > >>>I/O. For reads, the page gather will ensure that the pages are up to > >>>date from the backing store to the cache before sending the I/O out. > >>>For writes, You actually have to do a msync on the region to get the > >>>data secured to the backing store. > >> > >>James, have you checked how fast is mmaped I/O if work size > size of > >>RAM? It's several times slower comparing to buffered I/O. It was many > >>times discussed in LKML and, seems, VM people consider it unavoidable. > > > > > > Erm, but if you're using the case of work size > size of RAM, you'll > > find buffered I/O won't help because you don't have the memory for > > buffers either. > > James, just check and you will see, buffered I/O is a lot faster. So in an out of memory situation the buffers you don't have are a lot faster than the pages I don't have? > >>So, using mmaped IO isn't an option for high performance. 
Plus, mmaped > >>IO isn't an option for high reliability requirements, since it doesn't > >>provide a practical way to handle I/O errors. > > > > I think you'll find it does ... the page gather returns -EFAULT if > > there's an I/O error in the gathered region. > > Err, to whom return? If you try to read from a mmaped page, which can't > be populated due to I/O error, you will get SIGBUS or SIGSEGV, I don't > remember exactly. It's quite tricky to get back to the faulted command > from the signal handler. > > Or do you mean mmap(MAP_POPULATE)/munmap() for each command? Do you > think that such mapping/unmapping is good for performance? > > > msync does something > > similar if there's a write failure. > > > >>>You also have to pull tricks with > >>>the mmap region in the case of writes to prevent useless data being read > >>>in from the backing store. > >> > >>Can you be more exact and specify what kind of tricks should be done for > >>that? > > > > Actually, just avoid touching it seems to do the trick with a recent > > kernel. > > Hmm, how can one write to an mmaped page and don't touch it? I meant from user space ... the writes are done inside the kernel. However, as Linus has pointed out, this discussion is getting a bit off topic. There's no actual evidence that copy problems are causing any performatince issues issues for STGT. In fact, there's evidence that they're not for everything except IB networks. James ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 18:54 ` James Bottomley @ 2008-02-05 18:59 ` Vladislav Bolkhovitin 2008-02-05 19:13 ` James Bottomley 0 siblings, 1 reply; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-05 18:59 UTC (permalink / raw) To: James Bottomley Cc: FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds James Bottomley wrote: > On Mon, 2008-02-04 at 21:38 +0300, Vladislav Bolkhovitin wrote: > >>James Bottomley wrote: >> >>>On Mon, 2008-02-04 at 20:56 +0300, Vladislav Bolkhovitin wrote: >>> >>> >>>>James Bottomley wrote: >>>> >>>> >>>>>On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote: >>>>> >>>>> >>>>> >>>>>>James Bottomley wrote: >>>>>> >>>>>> >>>>>> >>>>>>>>>>So, James, what is your opinion on the above? Or the overall SCSI target >>>>>>>>>>project simplicity doesn't matter much for you and you think it's fine >>>>>>>>>>to duplicate Linux page cache in the user space to keep the in-kernel >>>>>>>>>>part of the project as small as possible? >>>>>>>>> >>>>>>>>> >>>>>>>>>The answers were pretty much contained here >>>>>>>>> >>>>>>>>>http://marc.info/?l=linux-scsi&m=120164008302435 >>>>>>>>> >>>>>>>>>and here: >>>>>>>>> >>>>>>>>>http://marc.info/?l=linux-scsi&m=120171067107293 >>>>>>>>> >>>>>>>>>Weren't they? >>>>>>>> >>>>>>>>No, sorry, it doesn't look so for me. They are about performance, but >>>>>>>>I'm asking about the overall project's architecture, namely about one >>>>>>>>part of it: simplicity. Particularly, what do you think about >>>>>>>>duplicating Linux page cache in the user space to have zero-copy cached >>>>>>>>I/O? Or can you suggest another architectural solution for that problem >>>>>>>>in the STGT's approach? >>>>>>> >>>>>>> >>>>>>>Isn't that an advantage of a user space solution? It simply uses the >>>>>>>backing store of whatever device supplies the data. That means it takes >>>>>>>advantage of the existing mechanisms for caching. >>>>>> >>>>>>No, please reread this thread, especially this message: >>>>>>http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of >>>>>>the advantages of the kernel space implementation. The user space >>>>>>implementation has to have data copied between the cache and user space >>>>>>buffer, but the kernel space one can use pages in the cache directly, >>>>>>without extra copy. >>>>> >>>>> >>>>>Well, you've said it thrice (the bellman cried) but that doesn't make it >>>>>true. >>>>> >>>>>The way a user space solution should work is to schedule mmapped I/O >>>> >>>>>from the backing store and then send this mmapped region off for target >>>> >>>>>I/O. For reads, the page gather will ensure that the pages are up to >>>>>date from the backing store to the cache before sending the I/O out. >>>>>For writes, You actually have to do a msync on the region to get the >>>>>data secured to the backing store. >>>> >>>>James, have you checked how fast is mmaped I/O if work size > size of >>>>RAM? It's several times slower comparing to buffered I/O. It was many >>>>times discussed in LKML and, seems, VM people consider it unavoidable. >>> >>> >>>Erm, but if you're using the case of work size > size of RAM, you'll >>>find buffered I/O won't help because you don't have the memory for >>>buffers either. >> >>James, just check and you will see, buffered I/O is a lot faster. > > So in an out of memory situation the buffers you don't have are a lot > faster than the pages I don't have? There isn't OOM in both cases. 
Just pages reclamation/readahead work much better in the buffered case. >>>>So, using mmaped IO isn't an option for high performance. Plus, mmaped >>>>IO isn't an option for high reliability requirements, since it doesn't >>>>provide a practical way to handle I/O errors. >>> >>>I think you'll find it does ... the page gather returns -EFAULT if >>>there's an I/O error in the gathered region. >> >>Err, to whom return? If you try to read from a mmaped page, which can't >>be populated due to I/O error, you will get SIGBUS or SIGSEGV, I don't >>remember exactly. It's quite tricky to get back to the faulted command >>from the signal handler. >> >>Or do you mean mmap(MAP_POPULATE)/munmap() for each command? Do you >>think that such mapping/unmapping is good for performance? >> >> >>>msync does something >>>similar if there's a write failure. >>> >>> >>>>>You also have to pull tricks with >>>>>the mmap region in the case of writes to prevent useless data being read >>>>>in from the backing store. >>>> >>>>Can you be more exact and specify what kind of tricks should be done for >>>>that? >>> >>>Actually, just avoid touching it seems to do the trick with a recent >>>kernel. >> >>Hmm, how can one write to an mmaped page and don't touch it? > > I meant from user space ... the writes are done inside the kernel. Sure, the mmap() approach agreed to be unpractical, but could you elaborate more on this anyway, please? I'm just curious. Do you think about implementing a new syscall, which would put pages with data in the mmap'ed area? > However, as Linus has pointed out, this discussion is getting a bit off > topic. No, that isn't off topic. We've just proved that there is no good way to implement zero-copy cached I/O for STGT. I see the only practical way for that, proposed by FUJITA Tomonori some time ago: duplicating Linux page cache in the user space. But will you like it? > There's no actual evidence that copy problems are causing any > performatince issues issues for STGT. In fact, there's evidence that > they're not for everything except IB networks. The zero-copy cached I/O has not yet been implemented in SCST, I simply so far have not had time for that. Currently SCST performs better STGT, because of simpler processing path and less context switches per command. Memcpy() speed on modern systems is about the same as throughput of 20Gbps link (1600MB/s), so when the zero-copy will be implemented, I won't be surprised by more 50-70% SCST advantage. > James > > > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Microsoft > Defy all challenges. Microsoft(R) Visual Studio 2008. > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > _______________________________________________ > Scst-devel mailing list > Scst-devel@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/scst-devel > ^ permalink raw reply [flat|nested] 148+ messages in thread
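A back-of-envelope version of that estimate, under the explicit assumption that the extra memcpy() is fully serialized with the wire transfer; with partial overlap between copy and transfer the advantage shrinks toward the quoted 50-70%. The two bandwidth constants are the ones mentioned in the mail, not measurements.

/*
 * Back-of-envelope estimate of the cost of one extra data copy,
 * assuming copy and wire transfer do not overlap.
 */
#include <stdio.h>

int main(void)
{
	double link_mbps = 1600.0;	/* ~20 Gbit/s link, in MB/s */
	double copy_mbps = 1600.0;	/* rough memcpy() throughput, MB/s */

	/* per-byte costs add when the copy is not overlapped with the send */
	double with_copy = 1.0 / (1.0 / link_mbps + 1.0 / copy_mbps);

	printf("zero-copy: %.0f MB/s\n", link_mbps);
	printf("with copy: %.0f MB/s\n", with_copy);
	printf("advantage: %.0f%%\n", (link_mbps / with_copy - 1.0) * 100.0);
	return 0;
}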
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 18:59 ` Vladislav Bolkhovitin @ 2008-02-05 19:13 ` James Bottomley 2008-02-06 18:07 ` Vladislav Bolkhovitin 2008-02-07 13:13 ` [Scst-devel] " Bart Van Assche 0 siblings, 2 replies; 148+ messages in thread From: James Bottomley @ 2008-02-05 19:13 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds On Tue, 2008-02-05 at 21:59 +0300, Vladislav Bolkhovitin wrote: > >>Hmm, how can one write to an mmaped page and don't touch it? > > > > I meant from user space ... the writes are done inside the kernel. > > Sure, the mmap() approach agreed to be unpractical, but could you > elaborate more on this anyway, please? I'm just curious. Do you think > about implementing a new syscall, which would put pages with data in the > mmap'ed area? No, it has to do with the way invalidation occurs. When you mmap a region from a device or file, the kernel places page translations for that region into your vm_area. The regions themselves aren't backed until faulted. For write (i.e. incoming command to target) you specify the write flag and send the area off to receive the data. The gather, expecting the pages to be overwritten, backs them with pages marked dirty but doesn't fault in the contents (unless it already exists in the page cache). The kernel writes the data to the pages and the dirty pages go back to the user. msync() flushes them to the device. The disadvantage of all this is that the handle for the I/O if you will is a virtual address in a user process that doesn't actually care to see the data. non-x86 architectures will do flushes/invalidates on this address space as the I/O occurs. > > However, as Linus has pointed out, this discussion is getting a bit off > > topic. > > No, that isn't off topic. We've just proved that there is no good way to > implement zero-copy cached I/O for STGT. I see the only practical way > for that, proposed by FUJITA Tomonori some time ago: duplicating Linux > page cache in the user space. But will you like it? Well, there's no real evidence that zero copy or lack of it is a problem yet. James ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 19:13 ` James Bottomley @ 2008-02-06 18:07 ` Vladislav Bolkhovitin 2008-02-07 13:13 ` [Scst-devel] " Bart Van Assche 1 sibling, 0 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-06 18:07 UTC (permalink / raw) To: James Bottomley Cc: linux-scsi, linux-kernel, FUJITA Tomonori, scst-devel, Andrew Morton, Linus Torvalds James Bottomley wrote: > On Tue, 2008-02-05 at 21:59 +0300, Vladislav Bolkhovitin wrote: > >>>>Hmm, how can one write to an mmaped page and don't touch it? >>> >>>I meant from user space ... the writes are done inside the kernel. >> >>Sure, the mmap() approach agreed to be unpractical, but could you >>elaborate more on this anyway, please? I'm just curious. Do you think >>about implementing a new syscall, which would put pages with data in the >>mmap'ed area? > > No, it has to do with the way invalidation occurs. When you mmap a > region from a device or file, the kernel places page translations for > that region into your vm_area. The regions themselves aren't backed > until faulted. For write (i.e. incoming command to target) you specify > the write flag and send the area off to receive the data. The gather, > expecting the pages to be overwritten, backs them with pages marked > dirty but doesn't fault in the contents (unless it already exists in the > page cache). The kernel writes the data to the pages and the dirty > pages go back to the user. msync() flushes them to the device. > > The disadvantage of all this is that the handle for the I/O if you will > is a virtual address in a user process that doesn't actually care to see > the data. non-x86 architectures will do flushes/invalidates on this > address space as the I/O occurs. I more or less see, thanks. But (1) pages still needs to be mmaped to the user space process before the data transmission, i.e. they must be zeroed before being mmaped, which isn't much faster, than data copy, and (2) I suspect, it would be hard to make it race free, e.g. if another process would want to write to the same area simultaneously >>>However, as Linus has pointed out, this discussion is getting a bit off >>>topic. >> >>No, that isn't off topic. We've just proved that there is no good way to >>implement zero-copy cached I/O for STGT. I see the only practical way >>for that, proposed by FUJITA Tomonori some time ago: duplicating Linux >>page cache in the user space. But will you like it? > > Well, there's no real evidence that zero copy or lack of it is a problem > yet. The performance improvement from zero copy can be easily estimated, knowing the link throughput and data copy throughput, which are about the same for 20Gbps links (I did that few e-mail ago). Vlad ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-05 19:13 ` James Bottomley 2008-02-06 18:07 ` Vladislav Bolkhovitin @ 2008-02-07 13:13 ` Bart Van Assche 2008-02-07 13:45 ` Vladislav Bolkhovitin 2008-02-07 15:38 ` [Scst-devel] " Nicholas A. Bellinger 1 sibling, 2 replies; 148+ messages in thread From: Bart Van Assche @ 2008-02-07 13:13 UTC (permalink / raw) To: James Bottomley, Nicholas A. Bellinger, Vladislav Bolkhovitin, FUJITA Tomonori Cc: linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds Since the focus of this thread shifted somewhat in the last few messages, I'll try to summarize what has been discussed so far: - There was a number of participants who joined this discussion spontaneously. This suggests that there is considerable interest in networked storage and iSCSI. - It has been motivated why iSCSI makes sense as a storage protocol (compared to ATA over Ethernet and Fibre Channel over Ethernet). - The direct I/O performance results for block transfer sizes below 64 KB are a meaningful benchmark for storage target implementations. - It has been discussed whether an iSCSI target should be implemented in user space or in kernel space. It is clear now that an implementation in the kernel can be made faster than a user space implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804). Regarding existing implementations, measurements have a.o. shown that SCST is faster than STGT (30% with the following setup: iSCSI via IPoIB and direct I/O block transfers with a size of 512 bytes). - It has been discussed which iSCSI target implementation should be in the mainstream Linux kernel. There is no agreement on this subject yet. The short-term options are as follows: 1) Do not integrate any new iSCSI target implementation in the mainstream Linux kernel. 2) Add one of the existing in-kernel iSCSI target implementations to the kernel, e.g. SCST or PyX/LIO. 3) Create a new in-kernel iSCSI target implementation that combines the advantages of the existing iSCSI kernel target implementations (iETD, STGT, SCST and PyX/LIO). As an iSCSI user, I prefer option (3). The big question is whether the various storage target authors agree with this ? Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-07 13:13 ` [Scst-devel] " Bart Van Assche @ 2008-02-07 13:45 ` Vladislav Bolkhovitin 2008-02-07 22:51 ` david 2008-02-15 15:02 ` Bart Van Assche 2008-02-07 15:38 ` [Scst-devel] " Nicholas A. Bellinger 1 sibling, 2 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-07 13:45 UTC (permalink / raw) To: Bart Van Assche Cc: James Bottomley, Nicholas A. Bellinger, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds Bart Van Assche wrote: > Since the focus of this thread shifted somewhat in the last few > messages, I'll try to summarize what has been discussed so far: > - There was a number of participants who joined this discussion > spontaneously. This suggests that there is considerable interest in > networked storage and iSCSI. > - It has been motivated why iSCSI makes sense as a storage protocol > (compared to ATA over Ethernet and Fibre Channel over Ethernet). > - The direct I/O performance results for block transfer sizes below 64 > KB are a meaningful benchmark for storage target implementations. > - It has been discussed whether an iSCSI target should be implemented > in user space or in kernel space. It is clear now that an > implementation in the kernel can be made faster than a user space > implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804). > Regarding existing implementations, measurements have a.o. shown that > SCST is faster than STGT (30% with the following setup: iSCSI via > IPoIB and direct I/O block transfers with a size of 512 bytes). > - It has been discussed which iSCSI target implementation should be in > the mainstream Linux kernel. There is no agreement on this subject > yet. The short-term options are as follows: > 1) Do not integrate any new iSCSI target implementation in the > mainstream Linux kernel. > 2) Add one of the existing in-kernel iSCSI target implementations to > the kernel, e.g. SCST or PyX/LIO. > 3) Create a new in-kernel iSCSI target implementation that combines > the advantages of the existing iSCSI kernel target implementations > (iETD, STGT, SCST and PyX/LIO). > > As an iSCSI user, I prefer option (3). The big question is whether the > various storage target authors agree with this ? I tend to agree with some important notes: 1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST framework with a lot of bugfixes and improvements. 2. I think, everybody will agree that Linux iSCSI target should work over some standard SCSI target framework. Hence the choice gets narrower: SCST vs STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO) in the mainline, because of a lot of code duplication. Nicholas could decide to move to either existing framework (although, frankly, I don't think there's a possibility for in-kernel iSCSI target and user space SCSI target framework) and if he decide to go with SCST, I'll be glad to offer my help and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The better one should win. Vlad ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-07 13:45 ` Vladislav Bolkhovitin @ 2008-02-07 22:51 ` david 2008-02-08 10:37 ` Vladislav Bolkhovitin 2008-02-08 11:33 ` Nicholas A. Bellinger 2008-02-15 15:02 ` Bart Van Assche 1 sibling, 2 replies; 148+ messages in thread From: david @ 2008-02-07 22:51 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Bart Van Assche, James Bottomley, Nicholas A. Bellinger, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds On Thu, 7 Feb 2008, Vladislav Bolkhovitin wrote: > Bart Van Assche wrote: >> - It has been discussed which iSCSI target implementation should be in >> the mainstream Linux kernel. There is no agreement on this subject >> yet. The short-term options are as follows: >> 1) Do not integrate any new iSCSI target implementation in the >> mainstream Linux kernel. >> 2) Add one of the existing in-kernel iSCSI target implementations to >> the kernel, e.g. SCST or PyX/LIO. >> 3) Create a new in-kernel iSCSI target implementation that combines >> the advantages of the existing iSCSI kernel target implementations >> (iETD, STGT, SCST and PyX/LIO). >> >> As an iSCSI user, I prefer option (3). The big question is whether the >> various storage target authors agree with this ? > > I tend to agree with some important notes: > > 1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST > framework with a lot of bugfixes and improvements. > > 2. I think, everybody will agree that Linux iSCSI target should work over > some standard SCSI target framework. Hence the choice gets narrower: SCST vs > STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO) > in the mainline, because of a lot of code duplication. Nicholas could decide > to move to either existing framework (although, frankly, I don't think > there's a possibility for in-kernel iSCSI target and user space SCSI target > framework) and if he decide to go with SCST, I'll be glad to offer my help > and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The > better one should win. why should linux as an iSCSI target be limited to passthrough to a SCSI device. the most common use of this sort of thing that I would see is to load up a bunch of 1TB SATA drives in a commodity PC, run software RAID, and then export the resulting volume to other servers via iSCSI. not a 'real' SCSI device in sight. As far as how good a standard iSCSI is, at this point I don't think it really matters. There are too many devices and manufacturers out there that implement iSCSI as their storage protocol (from both sides, offering storage to other systems, and using external storage). Sometimes the best technology doesn't win, but Linux should be interoperable with as much as possible and be ready to support the winners and the loosers in technology options, for as long as anyone chooses to use the old equipment (after all, we support things like Arcnet networking, which lost to Ethernet many years ago) David Lang ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-07 22:51 ` david @ 2008-02-08 10:37 ` Vladislav Bolkhovitin 2008-02-09 7:40 ` david 2008-02-08 11:33 ` Nicholas A. Bellinger 1 sibling, 1 reply; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-08 10:37 UTC (permalink / raw) To: david Cc: Bart Van Assche, James Bottomley, Nicholas A. Bellinger, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds david@lang.hm wrote: > On Thu, 7 Feb 2008, Vladislav Bolkhovitin wrote: > >> Bart Van Assche wrote: >> >>> - It has been discussed which iSCSI target implementation should be in >>> the mainstream Linux kernel. There is no agreement on this subject >>> yet. The short-term options are as follows: >>> 1) Do not integrate any new iSCSI target implementation in the >>> mainstream Linux kernel. >>> 2) Add one of the existing in-kernel iSCSI target implementations to >>> the kernel, e.g. SCST or PyX/LIO. >>> 3) Create a new in-kernel iSCSI target implementation that combines >>> the advantages of the existing iSCSI kernel target implementations >>> (iETD, STGT, SCST and PyX/LIO). >>> >>> As an iSCSI user, I prefer option (3). The big question is whether the >>> various storage target authors agree with this ? >> >> >> I tend to agree with some important notes: >> >> 1. IET should be excluded from this list, iSCSI-SCST is IET updated >> for SCST framework with a lot of bugfixes and improvements. >> >> 2. I think, everybody will agree that Linux iSCSI target should work >> over some standard SCSI target framework. Hence the choice gets >> narrower: SCST vs STGT. I don't think there's a way for a dedicated >> iSCSI target (i.e. PyX/LIO) in the mainline, because of a lot of code >> duplication. Nicholas could decide to move to either existing >> framework (although, frankly, I don't think there's a possibility for >> in-kernel iSCSI target and user space SCSI target framework) and if he >> decide to go with SCST, I'll be glad to offer my help and support and >> wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The better >> one should win. > > > why should linux as an iSCSI target be limited to passthrough to a SCSI > device. > > the most common use of this sort of thing that I would see is to load up > a bunch of 1TB SATA drives in a commodity PC, run software RAID, and > then export the resulting volume to other servers via iSCSI. not a > 'real' SCSI device in sight. > > As far as how good a standard iSCSI is, at this point I don't think it > really matters. There are too many devices and manufacturers out there > that implement iSCSI as their storage protocol (from both sides, > offering storage to other systems, and using external storage). > Sometimes the best technology doesn't win, but Linux should be > interoperable with as much as possible and be ready to support the > winners and the loosers in technology options, for as long as anyone > chooses to use the old equipment (after all, we support things like > Arcnet networking, which lost to Ethernet many years ago) David, your question surprises me a lot. From where have you decided that SCST supports only pass-through backstorage? Does the RAM disk, which Bart has been using for performance tests, look like a SCSI device? SCST supports all backstorage types you can imagine and Linux kernel supports. 
> David Lang ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-08 10:37 ` Vladislav Bolkhovitin @ 2008-02-09 7:40 ` david 0 siblings, 0 replies; 148+ messages in thread From: david @ 2008-02-09 7:40 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Bart Van Assche, James Bottomley, Nicholas A. Bellinger, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds On Fri, 8 Feb 2008, Vladislav Bolkhovitin wrote: >>> 2. I think, everybody will agree that Linux iSCSI target should work over >>> some standard SCSI target framework. Hence the choice gets narrower: SCST >>> vs STGT. I don't think there's a way for a dedicated iSCSI target (i.e. >>> PyX/LIO) in the mainline, because of a lot of code duplication. Nicholas >>> could decide to move to either existing framework (although, frankly, I >>> don't think there's a possibility for in-kernel iSCSI target and user >>> space SCSI target framework) and if he decide to go with SCST, I'll be >>> glad to offer my help and support and wouldn't care if LIO-SCST eventually >>> replaced iSCSI-SCST. The better one should win. >> >> >> why should linux as an iSCSI target be limited to passthrough to a SCSI >> device. >> >> the most common use of this sort of thing that I would see is to load up a >> bunch of 1TB SATA drives in a commodity PC, run software RAID, and then >> export the resulting volume to other servers via iSCSI. not a 'real' SCSI >> device in sight. >> > David, your question surprises me a lot. From where have you decided that > SCST supports only pass-through backstorage? Does the RAM disk, which Bart > has been using for performance tests, look like a SCSI device? I was responding to the start of item #2 that I left in the quote above. it asn't saying that SCST didn't support that, but was stating that any implementation of a iSCSI target should use the SCSI framework. I read this to mean that this would only be able to access things that the SCSI framework can access, and that would not be things like ramdisks, raid arrays, etc. David Lang ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-07 22:51 ` david 2008-02-08 10:37 ` Vladislav Bolkhovitin @ 2008-02-08 11:33 ` Nicholas A. Bellinger 2008-02-08 14:36 ` Vladislav Bolkhovitin 1 sibling, 1 reply; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-08 11:33 UTC (permalink / raw) To: david Cc: Vladislav Bolkhovitin, Bart Van Assche, James Bottomley, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds On Thu, 2008-02-07 at 14:51 -0800, david@lang.hm wrote: > On Thu, 7 Feb 2008, Vladislav Bolkhovitin wrote: > > > Bart Van Assche wrote: > >> - It has been discussed which iSCSI target implementation should be in > >> the mainstream Linux kernel. There is no agreement on this subject > >> yet. The short-term options are as follows: > >> 1) Do not integrate any new iSCSI target implementation in the > >> mainstream Linux kernel. > >> 2) Add one of the existing in-kernel iSCSI target implementations to > >> the kernel, e.g. SCST or PyX/LIO. > >> 3) Create a new in-kernel iSCSI target implementation that combines > >> the advantages of the existing iSCSI kernel target implementations > >> (iETD, STGT, SCST and PyX/LIO). > >> > >> As an iSCSI user, I prefer option (3). The big question is whether the > >> various storage target authors agree with this ? > > > > I tend to agree with some important notes: > > > > 1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST > > framework with a lot of bugfixes and improvements. > > > > 2. I think, everybody will agree that Linux iSCSI target should work over > > some standard SCSI target framework. Hence the choice gets narrower: SCST vs > > STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO) > > in the mainline, because of a lot of code duplication. Nicholas could decide > > to move to either existing framework (although, frankly, I don't think > > there's a possibility for in-kernel iSCSI target and user space SCSI target > > framework) and if he decide to go with SCST, I'll be glad to offer my help > > and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The > > better one should win. > > why should linux as an iSCSI target be limited to passthrough to a SCSI > device. > <nod> I don't think anyone is saying it should be. It makes sense that the more mature SCSI engines that have working code will be providing alot of the foundation as we talk about options.. >From comparing the designs of SCST and LIO-SE, we know that SCST has supports very SCSI specific target mode hardware, including software target mode forks of other kernel code. This code for the target mode pSCSI, FC and SAS control paths (more for the state machines, that CDB emulation) that will most likely never need to be emulated on non SCSI target engine. SCST has support for the most SCSI fabric protocols of the group (although it is lacking iSER) while the LIO-SE only supports traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6). The design of LIO-SE was to make every iSCSI initiator that sends SCSI CDBs and data to talk to every potential device in the Linux storage stack on the largest amount of hardware architectures possible. Most of the iSCSI Initiators I know (including non Linux) do not rely on heavy SCSI task management, and I think this would be a lower priority item to get real SCSI specific recovery in the traditional iSCSI target for users. 
Espically things like SCSI target mode queue locking (affectionally called Auto Contingent Allegiance) make no sense for traditional iSCSI or iSER, because CmdSN rules are doing this for us. > the most common use of this sort of thing that I would see is to load up a > bunch of 1TB SATA drives in a commodity PC, run software RAID, and then > export the resulting volume to other servers via iSCSI. not a 'real' SCSI > device in sight. > I recently moved the last core LIO target machine from a hardware RAID5 to MD RAID6 with struct block_device exported LVM objects via Linux/iSCSI to PVM and HVM domains, and I have been very happy with the results. Being able to export any physical or virtual storage object from whatever layer makes sense for your particular case. This applies to both block and file level access. For example, making an iSCSI Initiator and Target run in the most limited in environments places where NAS (espically userspace server side) would have a really hard time fitting, has always been a requirement. You can imagine a system with a smaller amount of memory (say 32MB) having a difficult time doing I/O to any amount of NAS clients. If are talking about memory required to get best performance, using kernel level DMA ring allocation and submission to a generic target engine uses a significantly smaller amount of memory, than say traditional buffered FILEIO. Going futher up the storage stack with buffered file IO, regardless of if its block or file level, will always start to add overhead. I think that kernel level FILEIO with O_DIRECT and asyncio would probably help alot in this case for general target mode usage of MD and LVM block devices. This is because when we are using PSCSI or IBLOCK to queue I/Os which, may need be different from the original IO from the initiator/client due to OS storage subsystem differences and/or physical HBA limitiations for the layers below block. The current LIO-SE API excepts the storage object to present these physical limitiations if to engine they exist. This is called iscsi_transport_t in iscsi_target_transport.h currently, but really should be called something like target_subsytem_api_t and plugins called target_pscsi_t, target_bio_t, target_file_t, etc. > As far as how good a standard iSCSI is, at this point I don't think it > really matters. There are too many devices and manufacturers out there > that implement iSCSI as their storage protocol (from both sides, offering > storage to other systems, and using external storage). Sometimes the best > technology doesn't win, but Linux should be interoperable with as much as > possible and be ready to support the winners and the loosers in technology > options, for as long as anyone chooses to use the old equipment (after > all, we support things like Arcnet networking, which lost to Ethernet many > years ago) > The RFC-3720 standard has been stable for going on four years in 2008, and as the implementations continue to mature, having Linux lead the way in iSCSI Target, Initiator and Target/Initiator that can potentially run on anything that can boot Linux on the many, many types of system and storage around these days is the goal. I can't personally comment on how many of these types of systems that target mode or iSCSI stacks have run in other people's environments, but I have personally been involved getting LIO/SE and Core/iSCSI running on i386 and x86_64, along with Alpha, ia64, MIPS, PPC and POWER, and lots of ARM. 
I believe the LIO Target and Initiator stacks have been able to run on the smallest systems so far, including a uClinux 2.6 sub-100 MHz board with ~4 MB of usable system memory. This is still the case today with the LIO target stack, which has been successfully run on the OpenMoko device with memory and FILEIO. :-) --nab Btw, I definitely agree that being able to export the large number of legacy drivers will continue to be an important part. ^ permalink raw reply [flat|nested] 148+ messages in thread
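For illustration of the kernel-level FILEIO with O_DIRECT point made above, here is a minimal user-space sketch of the alignment rules O_DIRECT imposes; the backing file path and the 4 KiB block size are assumptions, and this is not LIO or SCST code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        const size_t blk = 4096;   /* assumed logical block size */
        void *buf;
        int fd;
        ssize_t ret;

        /* O_DIRECT requires the buffer, file offset and length to be
         * block aligned, which is part of why buffered FILEIO is so
         * much easier to get "working" (but without durability). */
        if (posix_memalign(&buf, blk, blk)) {
                perror("posix_memalign");
                return 1;
        }

        fd = open("/tmp/backing.img", O_RDWR | O_DIRECT);  /* path is illustrative */
        if (fd < 0) {
                perror("open");
                return 1;
        }

        ret = pread(fd, buf, blk, 0);   /* aligned offset, aligned length */
        printf("read %zd bytes\n", ret);

        close(fd);
        free(buf);
        return 0;
}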
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-08 11:33 ` Nicholas A. Bellinger @ 2008-02-08 14:36 ` Vladislav Bolkhovitin 2008-02-08 23:53 ` Nicholas A. Bellinger 0 siblings, 1 reply; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-08 14:36 UTC (permalink / raw) To: Nicholas A. Bellinger Cc: david, Bart Van Assche, James Bottomley, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds Nicholas A. Bellinger wrote: >>>>- It has been discussed which iSCSI target implementation should be in >>>>the mainstream Linux kernel. There is no agreement on this subject >>>>yet. The short-term options are as follows: >>>>1) Do not integrate any new iSCSI target implementation in the >>>>mainstream Linux kernel. >>>>2) Add one of the existing in-kernel iSCSI target implementations to >>>>the kernel, e.g. SCST or PyX/LIO. >>>>3) Create a new in-kernel iSCSI target implementation that combines >>>>the advantages of the existing iSCSI kernel target implementations >>>>(iETD, STGT, SCST and PyX/LIO). >>>> >>>>As an iSCSI user, I prefer option (3). The big question is whether the >>>>various storage target authors agree with this ? >>> >>>I tend to agree with some important notes: >>> >>>1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST >>>framework with a lot of bugfixes and improvements. >>> >>>2. I think, everybody will agree that Linux iSCSI target should work over >>>some standard SCSI target framework. Hence the choice gets narrower: SCST vs >>>STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO) >>>in the mainline, because of a lot of code duplication. Nicholas could decide >>>to move to either existing framework (although, frankly, I don't think >>>there's a possibility for in-kernel iSCSI target and user space SCSI target >>>framework) and if he decide to go with SCST, I'll be glad to offer my help >>>and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The >>>better one should win. >> >>why should linux as an iSCSI target be limited to passthrough to a SCSI >>device. > > <nod> > > I don't think anyone is saying it should be. It makes sense that the > more mature SCSI engines that have working code will be providing alot > of the foundation as we talk about options.. > >>From comparing the designs of SCST and LIO-SE, we know that SCST has > supports very SCSI specific target mode hardware, including software > target mode forks of other kernel code. This code for the target mode > pSCSI, FC and SAS control paths (more for the state machines, that CDB > emulation) that will most likely never need to be emulated on non SCSI > target engine. ...but required for SCSI. So, it must be, anyway. > SCST has support for the most SCSI fabric protocols of > the group (although it is lacking iSER) while the LIO-SE only supports > traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6). The > design of LIO-SE was to make every iSCSI initiator that sends SCSI CDBs > and data to talk to every potential device in the Linux storage stack on > the largest amount of hardware architectures possible. > > Most of the iSCSI Initiators I know (including non Linux) do not rely on > heavy SCSI task management, and I think this would be a lower priority > item to get real SCSI specific recovery in the traditional iSCSI target > for users. 
Especially things like SCSI target mode queue locking > (affectionately called Auto Contingent Allegiance) make no sense for > traditional iSCSI or iSER, because the CmdSN rules are already doing this for us. Sorry, that isn't correct. ACA provides the possibility to lock the command queue in case of a CHECK CONDITION, and so allows the command execution order to be kept in case of errors. CmdSN keeps the command execution order only in the success case; in case of an error, the next queued command will be executed immediately after the failed one, although the application might require all commands queued after the failed one to be aborted. Think about journaled file systems, for instance. ACA also allows the failed command to be retried and the queue then resumed. Vlad ^ permalink raw reply [flat|nested] 148+ messages in thread
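To make the ordering argument concrete, here is a toy model in plain C (not SCST or LIO code; the command set and the freeze logic are invented for illustration): with CmdSN ordering alone, the commands queued behind a failure still execute, while an ACA-style freeze holds them until the initiator clears the condition.

#include <stdbool.h>
#include <stdio.h>

struct cmd { int cmdsn; bool will_fail; };

static void run_queue(struct cmd *q, int n, bool aca)
{
        bool frozen = false;

        for (int i = 0; i < n; i++) {
                if (frozen) {
                        printf("CmdSN %d: held (queue frozen by ACA)\n", q[i].cmdsn);
                        continue;
                }
                if (q[i].will_fail) {
                        printf("CmdSN %d: CHECK CONDITION\n", q[i].cmdsn);
                        if (aca)
                                frozen = true;  /* stop executing later commands */
                } else {
                        printf("CmdSN %d: completed\n", q[i].cmdsn);
                }
        }
}

int main(void)
{
        struct cmd q[] = { {1, false}, {2, true}, {3, false}, {4, false} };

        printf("-- CmdSN ordering only --\n");
        run_queue(q, 4, false);   /* commands 3 and 4 run despite the failure */
        printf("-- with ACA-style freeze --\n");
        run_queue(q, 4, true);    /* commands 3 and 4 are held back */
        return 0;
}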
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-08 14:36 ` Vladislav Bolkhovitin @ 2008-02-08 23:53 ` Nicholas A. Bellinger 0 siblings, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-08 23:53 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: david, Bart Van Assche, James Bottomley, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds On Fri, 2008-02-08 at 17:36 +0300, Vladislav Bolkhovitin wrote: > Nicholas A. Bellinger wrote: > >>>>- It has been discussed which iSCSI target implementation should be in > >>>>the mainstream Linux kernel. There is no agreement on this subject > >>>>yet. The short-term options are as follows: > >>>>1) Do not integrate any new iSCSI target implementation in the > >>>>mainstream Linux kernel. > >>>>2) Add one of the existing in-kernel iSCSI target implementations to > >>>>the kernel, e.g. SCST or PyX/LIO. > >>>>3) Create a new in-kernel iSCSI target implementation that combines > >>>>the advantages of the existing iSCSI kernel target implementations > >>>>(iETD, STGT, SCST and PyX/LIO). > >>>> > >>>>As an iSCSI user, I prefer option (3). The big question is whether the > >>>>various storage target authors agree with this ? > >>> > >>>I tend to agree with some important notes: > >>> > >>>1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST > >>>framework with a lot of bugfixes and improvements. > >>> > >>>2. I think, everybody will agree that Linux iSCSI target should work over > >>>some standard SCSI target framework. Hence the choice gets narrower: SCST vs > >>>STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO) > >>>in the mainline, because of a lot of code duplication. Nicholas could decide > >>>to move to either existing framework (although, frankly, I don't think > >>>there's a possibility for in-kernel iSCSI target and user space SCSI target > >>>framework) and if he decide to go with SCST, I'll be glad to offer my help > >>>and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The > >>>better one should win. > >> > >>why should linux as an iSCSI target be limited to passthrough to a SCSI > >>device. > > > > <nod> > > > > I don't think anyone is saying it should be. It makes sense that the > > more mature SCSI engines that have working code will be providing alot > > of the foundation as we talk about options.. > > > >>From comparing the designs of SCST and LIO-SE, we know that SCST has > > supports very SCSI specific target mode hardware, including software > > target mode forks of other kernel code. This code for the target mode > > pSCSI, FC and SAS control paths (more for the state machines, that CDB > > emulation) that will most likely never need to be emulated on non SCSI > > target engine. > > ...but required for SCSI. So, it must be, anyway. > > > SCST has support for the most SCSI fabric protocols of > > the group (although it is lacking iSER) while the LIO-SE only supports > > traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6). The > > design of LIO-SE was to make every iSCSI initiator that sends SCSI CDBs > > and data to talk to every potential device in the Linux storage stack on > > the largest amount of hardware architectures possible. > > > > Most of the iSCSI Initiators I know (including non Linux) do not rely on > > heavy SCSI task management, and I think this would be a lower priority > > item to get real SCSI specific recovery in the traditional iSCSI target > > for users. 
Espically things like SCSI target mode queue locking > > (affectionally called Auto Contingent Allegiance) make no sense for > > traditional iSCSI or iSER, because CmdSN rules are doing this for us. > > Sorry, it isn't correct. ACA provides possibility to lock commands queue > in case of CHECK CONDITION, so allows to keep commands execution order > in case of errors. CmdSN keeps commands execution order only in case of > success, in case of error the next queued command will be executed > immediately after the failed one, although application might require to > have all subsequent after the failed one commands aborted. Think about > journaled file systems, for instance. Also ACA allows to retry the > failed command and then resume the queue. > Fair enough. The point I was making is that I have never actually seen an iSCSI Initiator use ACA functionality (I don't believe that the Linux SCSI Ml implements this), or actually generate a CLEAR_ACA task management request. --nab > Vlad > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-07 13:45 ` Vladislav Bolkhovitin 2008-02-07 22:51 ` david @ 2008-02-15 15:02 ` Bart Van Assche 1 sibling, 0 replies; 148+ messages in thread From: Bart Van Assche @ 2008-02-15 15:02 UTC (permalink / raw) To: Vladislav Bolkhovitin, Nicholas A. Bellinger Cc: James Bottomley, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton On Thu, Feb 7, 2008 at 2:45 PM, Vladislav Bolkhovitin <vst@vlnb.net> wrote: > > Bart Van Assche wrote: > > Since the focus of this thread shifted somewhat in the last few > > messages, I'll try to summarize what has been discussed so far: > > - There was a number of participants who joined this discussion > > spontaneously. This suggests that there is considerable interest in > > networked storage and iSCSI. > > - It has been motivated why iSCSI makes sense as a storage protocol > > (compared to ATA over Ethernet and Fibre Channel over Ethernet). > > - The direct I/O performance results for block transfer sizes below 64 > > KB are a meaningful benchmark for storage target implementations. > > - It has been discussed whether an iSCSI target should be implemented > > in user space or in kernel space. It is clear now that an > > implementation in the kernel can be made faster than a user space > > implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804). > > Regarding existing implementations, measurements have a.o. shown that > > SCST is faster than STGT (30% with the following setup: iSCSI via > > IPoIB and direct I/O block transfers with a size of 512 bytes). > > - It has been discussed which iSCSI target implementation should be in > > the mainstream Linux kernel. There is no agreement on this subject > > yet. The short-term options are as follows: > > 1) Do not integrate any new iSCSI target implementation in the > > mainstream Linux kernel. > > 2) Add one of the existing in-kernel iSCSI target implementations to > > the kernel, e.g. SCST or PyX/LIO. > > 3) Create a new in-kernel iSCSI target implementation that combines > > the advantages of the existing iSCSI kernel target implementations > > (iETD, STGT, SCST and PyX/LIO). > > > > As an iSCSI user, I prefer option (3). The big question is whether the > > various storage target authors agree with this ? > > I tend to agree with some important notes: > > 1. IET should be excluded from this list, iSCSI-SCST is IET updated for > SCST framework with a lot of bugfixes and improvements. > > 2. I think, everybody will agree that Linux iSCSI target should work > over some standard SCSI target framework. Hence the choice gets > narrower: SCST vs STGT. I don't think there's a way for a dedicated > iSCSI target (i.e. PyX/LIO) in the mainline, because of a lot of code > duplication. Nicholas could decide to move to either existing framework > (although, frankly, I don't think there's a possibility for in-kernel > iSCSI target and user space SCSI target framework) and if he decide to > go with SCST, I'll be glad to offer my help and support and wouldn't > care if LIO-SCST eventually replaced iSCSI-SCST. The better one should win. If I understood the above correctly, regarding a kernel space iSCSI target implementation, only LIO-SE and SCST should be considered. What I know today about these Linux iSCSI target implementations is as follows: * SCST performs slightly better than LIO-SE, and LIO-SE performs slightly better than STGT (both with regard to latency and with regard to bandwidth). 
* The coding style of SCST is closer to the Linux kernel coding style than the coding style of the LIO-SE project. * The structure of SCST is closer to what Linus expects than the structure of LIO-SE (i.e., authentication handled in userspace, data transfer handled by the kernel -- LIO-SE handles both in kernel space). * Until now I did not encounter any strange behavior in SCST. The issues I encountered with LIO-SE are being resolved via the LIO-SE mailing list (http://groups.google.com/group/linux-iscsi-target-dev). It would take too much effort to develop a new kernel space iSCSI target from scratch -- we should start from either LIO-SE or SCST. My opinion is that the best approach is to start with integrating SCST in the mainstream kernel, and that the more advanced features from LIO-SE that are not yet in SCST can be ported from LIO-SE to the SCST framework. Nicholas, do you think the structure of SCST is powerful enough to be extended with LIO-SE's powerful features like ERL-2 ? Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-07 13:13 ` [Scst-devel] " Bart Van Assche 2008-02-07 13:45 ` Vladislav Bolkhovitin @ 2008-02-07 15:38 ` Nicholas A. Bellinger 2008-02-07 20:37 ` Luben Tuikov 1 sibling, 1 reply; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-07 15:38 UTC (permalink / raw) To: Bart Van Assche Cc: James Bottomley, Vladislav Bolkhovitin, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds, Ming Zhang On Thu, 2008-02-07 at 14:13 +0100, Bart Van Assche wrote: > Since the focus of this thread shifted somewhat in the last few > messages, I'll try to summarize what has been discussed so far: > - There was a number of participants who joined this discussion > spontaneously. This suggests that there is considerable interest in > networked storage and iSCSI. > - It has been motivated why iSCSI makes sense as a storage protocol > (compared to ATA over Ethernet and Fibre Channel over Ethernet). > - The direct I/O performance results for block transfer sizes below 64 > KB are a meaningful benchmark for storage target implementations. > - It has been discussed whether an iSCSI target should be implemented > in user space or in kernel space. It is clear now that an > implementation in the kernel can be made faster than a user space > implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804). > Regarding existing implementations, measurements have a.o. shown that > SCST is faster than STGT (30% with the following setup: iSCSI via > IPoIB and direct I/O block transfers with a size of 512 bytes). > - It has been discussed which iSCSI target implementation should be in > the mainstream Linux kernel. There is no agreement on this subject > yet. The short-term options are as follows: > 1) Do not integrate any new iSCSI target implementation in the > mainstream Linux kernel. > 2) Add one of the existing in-kernel iSCSI target implementations to > the kernel, e.g. SCST or PyX/LIO. > 3) Create a new in-kernel iSCSI target implementation that combines > the advantages of the existing iSCSI kernel target implementations > (iETD, STGT, SCST and PyX/LIO). > > As an iSCSI user, I prefer option (3). The big question is whether the > various storage target authors agree with this ? > I think the other data point here would be that the final target design needs to be as generic as possible. Generic in the sense that the engine eventually needs to be able to accept NBD and other Ethernet-based target mode storage configurations to an abstracted device object (struct scsi_device, struct block_device, or struct file) just as it would for an IP Storage based request. We know that NBD and *oE will have their own naming and discovery, and the first set of IO tasks to be completed would be those using (iscsi_cmd_t->cmd_flags & ICF_SCSI_DATA_SG_IO_CDB) in iscsi_target_transport.c in the current code. These are single READ_* and WRITE_* codepaths that perform DMA memory pre-processing in v2.9 LIO-SE. Also, by being able to tell the engine to accelerate to DMA ring operation (say, to an underlying struct scsi_device or struct block_device) instead of fileio, in some cases you will see better performance when using hardware (i.e., not an underlying kernel thread queueing IO into block). But I have found FILEIO with sendpage on MD to be faster in single-threaded tests than struct block_device. I am currently using IBLOCK for LVM for core LIO operation (which actually sits on software MD RAID6).
I do this because using submit_bio() with se_mem_t mapped arrays of struct scatterlist -> struct bio_vec can handle power failures properly, and not send back StatSN Acks to the Initiator, who would otherwise think that everything has already made it to disk. That is what happens when doing IO to struct file in the kernel today without a kernel-level O_DIRECT. Also, for proper kernel-level target mode support, using struct file with O_DIRECT for storage blocks and emulating control path CDBs is one of the work items. This can be made generic or obtained from the underlying storage object (anything that can be exported from an LIO Subsystem TPI) for real hardware (struct scsi_device in just about all the cases these days). Last time I looked, this (the lack of a kernel-level O_DIRECT) was due to fs/direct-io.c:dio_refill_pages() using get_user_pages()... As for the really transport-specific CDB and control code: in a good amount of cases we are eventually going to be expected to emulate it in software. I really like how STGT breaks this up into per-device-type code segments: spc.c, sbc.c, mmc.c, ssc.c, smc.c, etc. Having all of these split out properly is one strong point of STGT IMHO, and really makes learning things much easier. Also interesting is being able to queue these IOs into userspace and receive an asynchronous response back up the storage stack. I think there is pretty interesting potential here for passing storage protocol packets into userspace apps while leaving the protocol state machines and recovery paths in the kernel with a generic target engine. Also, I know that the SCST folks have put a lot of time into getting the very SCSI hardware specific target mode control modes to work. I personally own a bunch of these adapters, and would really like to see better support for target mode on non-iSCSI type adapters with a single target mode storage engine that abstracts storage subsystems and wire protocol fabrics. --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
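A rough sketch of the per-device-type split praised above, routing CDBs either to the command set shared by all device types (SPC) or to the block-device set (SBC). The opcode values are the standard SCSI ones; the handler names are invented stand-ins for spc.c/sbc.c style code, not STGT's actual functions:

#include <stdint.h>
#include <stdio.h>

/* A few well-known SCSI opcodes. */
#define INQUIRY         0x12
#define TEST_UNIT_READY 0x00
#define READ_10         0x28
#define WRITE_10        0x2a

/* Hypothetical handlers standing in for spc.c / sbc.c style code. */
static int spc_emulate(const uint8_t *cdb) { printf("SPC: 0x%02x\n", cdb[0]); return 0; }
static int sbc_emulate(const uint8_t *cdb) { printf("SBC: 0x%02x\n", cdb[0]); return 0; }

static int dispatch_cdb(const uint8_t *cdb)
{
        switch (cdb[0]) {
        case INQUIRY:
        case TEST_UNIT_READY:
                /* Commands every device type must answer: SPC. */
                return spc_emulate(cdb);
        case READ_10:
        case WRITE_10:
                /* Block-device specific commands: SBC. */
                return sbc_emulate(cdb);
        default:
                fprintf(stderr, "unsupported opcode 0x%02x\n", cdb[0]);
                return -1;
        }
}

int main(void)
{
        uint8_t inquiry[6] = { INQUIRY };
        uint8_t read10[10] = { READ_10 };

        dispatch_cdb(inquiry);
        dispatch_cdb(read10);
        return 0;
}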
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-07 15:38 ` [Scst-devel] " Nicholas A. Bellinger @ 2008-02-07 20:37 ` Luben Tuikov 2008-02-08 10:32 ` Vladislav Bolkhovitin 2008-02-08 11:53 ` [Scst-devel] " Nicholas A. Bellinger 0 siblings, 2 replies; 148+ messages in thread From: Luben Tuikov @ 2008-02-07 20:37 UTC (permalink / raw) To: Bart Van Assche, Nicholas A. Bellinger Cc: James Bottomley, Vladislav Bolkhovitin, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds, Ming Zhang Is there an open iSCSI Target implementation which does NOT issue commands to sub-target devices via the SCSI mid-layer, but bypasses it completely? Luben ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-07 20:37 ` Luben Tuikov @ 2008-02-08 10:32 ` Vladislav Bolkhovitin 2008-02-09 7:32 ` Luben Tuikov 2008-02-08 11:53 ` [Scst-devel] " Nicholas A. Bellinger 1 sibling, 1 reply; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-08 10:32 UTC (permalink / raw) To: ltuikov Cc: Bart Van Assche, Nicholas A. Bellinger, James Bottomley, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds, Ming Zhang Luben Tuikov wrote: > Is there an open iSCSI Target implementation which does NOT > issue commands to sub-target devices via the SCSI mid-layer, but > bypasses it completely? What do you mean? Calling the low-level backstorage SCSI driver's queuecommand() routine directly? What would be the advantages of that? > Luben > > - > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-08 10:32 ` Vladislav Bolkhovitin @ 2008-02-09 7:32 ` Luben Tuikov 2008-02-11 10:02 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 148+ messages in thread From: Luben Tuikov @ 2008-02-09 7:32 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Bart Van Assche, Nicholas A. Bellinger, James Bottomley, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds, Ming Zhang --- On Fri, 2/8/08, Vladislav Bolkhovitin <vst@vlnb.net> wrote: > > Is there an open iSCSI Target implementation which > does NOT > > issue commands to sub-target devices via the SCSI > mid-layer, but > > bypasses it completely? > > What do you mean? To call directly low level backstorage > SCSI drivers > queuecommand() routine? What are advantages of it? Yes, that's what I meant. Just curious. Thanks, Luben ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-09 7:32 ` Luben Tuikov @ 2008-02-11 10:02 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-11 10:02 UTC (permalink / raw) To: ltuikov Cc: James Bottomley, linux-scsi, linux-kernel, Nicholas A. Bellinger, scst-devel, Andrew Morton, Linus Torvalds, FUJITA Tomonori Luben Tuikov wrote: >>>Is there an open iSCSI Target implementation which >> >>does NOT >> >>>issue commands to sub-target devices via the SCSI >> >>mid-layer, but >> >>>bypasses it completely? >> >>What do you mean? To call directly low level backstorage >>SCSI drivers >>queuecommand() routine? What are advantages of it? > > Yes, that's what I meant. Just curious. What's the advantage of it? > Thanks, > Luben ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-07 20:37 ` Luben Tuikov 2008-02-08 10:32 ` Vladislav Bolkhovitin @ 2008-02-08 11:53 ` Nicholas A. Bellinger 2008-02-08 14:42 ` Vladislav Bolkhovitin 1 sibling, 1 reply; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-08 11:53 UTC (permalink / raw) To: ltuikov Cc: Bart Van Assche, James Bottomley, Vladislav Bolkhovitin, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds, Ming Zhang On Thu, 2008-02-07 at 12:37 -0800, Luben Tuikov wrote: > Is there an open iSCSI Target implementation which does NOT > issue commands to sub-target devices via the SCSI mid-layer, but > bypasses it completely? > > Luben > Hi Luben, I am guessing you mean further down the stack, which I don't know to be the case. Going further up the layers is the design of v2.9 LIO-SE. There is a diagram explaining the basic concepts from a 10,000 foot level. http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf Note that only the traditional iSCSI target is currently implemented in the v2.9 LIO-SE codebase, in the list of target mode fabrics on the left side of the layout. The API between the protocol headers that does the encoding/decoding of target mode storage packets is probably the least mature area of the LIO stack (because it has always been iSCSI looking towards iSER :). I don't know who has the more mature API between the storage engine and the target storage protocol for doing this, SCST or STGT; I am guessing SCST because of the difference in age of the projects. Could someone be so kind as to fill me in on this? Also note, the storage engine plugin for doing userspace passthrough on the right is also currently not implemented. Userspace passthrough in this context is target engine I/O that enforces max_sector and sector_size limitations, and encodes/decodes target storage protocol packets all out of view of userspace. The addressing will be completely different if we are pointing SE target packets at non-SCSI target ports in userspace. --nab > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-08 11:53 ` [Scst-devel] " Nicholas A. Bellinger @ 2008-02-08 14:42 ` Vladislav Bolkhovitin 2008-02-09 0:00 ` Nicholas A. Bellinger 0 siblings, 1 reply; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-08 14:42 UTC (permalink / raw) To: Nicholas A. Bellinger Cc: ltuikov, Bart Van Assche, James Bottomley, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds, Ming Zhang Nicholas A. Bellinger wrote: > On Thu, 2008-02-07 at 12:37 -0800, Luben Tuikov wrote: > >>Is there an open iSCSI Target implementation which does NOT >>issue commands to sub-target devices via the SCSI mid-layer, but >>bypasses it completely? >> >> Luben >> > > > Hi Luben, > > I am guessing you mean futher down the stack, which I don't know this to > be the case. Going futher up the layers is the design of v2.9 LIO-SE. > There is a diagram explaining the basic concepts from a 10,000 foot > level. > > http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf > > Note that only traditional iSCSI target is currently implemented in v2.9 > LIO-SE codebase in the list of target mode fabrics on left side of the > layout. The API between the protocol headers that does > encoding/decoding target mode storage packets is probably the least > mature area of the LIO stack (because it has always been iSCSI looking > towards iSER :). I don't know who has the most mature API between the > storage engine and target storage protocol for doing this between SCST > and STGT, I am guessing SCST because of the difference in age of the > projects. Could someone be so kind to fill me in on this..? SCST uses scsi_execute_async_fifo() function to submit commands to SCSI devices in the pass-through mode. This function is slightly modified version of scsi_execute_async(), which submits requests in FIFO order instead of LIFO as scsi_execute_async() does (so with scsi_execute_async() they are executed in the reverse order). Scsi_execute_async_fifo() added as a separate patch to the kernel. > Also note, the storage engine plugin for doing userspace passthrough on > the right is also currently not implemented. Userspace passthrough in > this context is an target engine I/O that is enforcing max_sector and > sector_size limitiations, and encodes/decodes target storage protocol > packets all out of view of userspace. The addressing will be completely > different if we are pointing SE target packets at non SCSI target ports > in userspace. > > --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
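The FIFO-versus-LIFO point above comes down to whether a newly submitted request is linked at the tail or at the head of the queue. A toy user-space sketch (not the kernel's struct request machinery, and not the actual scsi_execute_async() internals) of why head insertion reverses the order of back-to-back pass-through submissions:

#include <stdio.h>

struct req { int tag; struct req *next; };

static void add_head(struct req **q, struct req *r) { r->next = *q; *q = r; }

static void add_tail(struct req **q, struct req *r)
{
        struct req **p = q;
        while (*p)
                p = &(*p)->next;
        r->next = NULL;
        *p = r;
}

static void drain(const struct req *q)
{
        for (; q; q = q->next)
                printf(" %d", q->tag);
        printf("\n");
}

int main(void)
{
        /* Two independent sets of nodes so the lists do not alias. */
        struct req a1 = {1}, a2 = {2}, a3 = {3};
        struct req b1 = {1}, b2 = {2}, b3 = {3};
        struct req *lifo = NULL, *fifo = NULL;

        add_head(&lifo, &a1); add_head(&lifo, &a2); add_head(&lifo, &a3);
        add_tail(&fifo, &b1); add_tail(&fifo, &b2); add_tail(&fifo, &b3);

        printf("head insertion (LIFO):"); drain(lifo);  /* 3 2 1 */
        printf("tail insertion (FIFO):"); drain(fifo);  /* 1 2 3 */
        return 0;
}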
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-08 14:42 ` Vladislav Bolkhovitin @ 2008-02-09 0:00 ` Nicholas A. Bellinger 0 siblings, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-09 0:00 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: ltuikov, Bart Van Assche, James Bottomley, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds, Ming Zhang On Fri, 2008-02-08 at 17:42 +0300, Vladislav Bolkhovitin wrote: > Nicholas A. Bellinger wrote: > > On Thu, 2008-02-07 at 12:37 -0800, Luben Tuikov wrote: > > > >>Is there an open iSCSI Target implementation which does NOT > >>issue commands to sub-target devices via the SCSI mid-layer, but > >>bypasses it completely? > >> > >> Luben > >> > > > > > > Hi Luben, > > > > I am guessing you mean futher down the stack, which I don't know this to > > be the case. Going futher up the layers is the design of v2.9 LIO-SE. > > There is a diagram explaining the basic concepts from a 10,000 foot > > level. > > > > http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf > > > > Note that only traditional iSCSI target is currently implemented in v2.9 > > LIO-SE codebase in the list of target mode fabrics on left side of the > > layout. The API between the protocol headers that does > > encoding/decoding target mode storage packets is probably the least > > mature area of the LIO stack (because it has always been iSCSI looking > > towards iSER :). I don't know who has the most mature API between the > > storage engine and target storage protocol for doing this between SCST > > and STGT, I am guessing SCST because of the difference in age of the > > projects. Could someone be so kind to fill me in on this..? > > SCST uses scsi_execute_async_fifo() function to submit commands to SCSI > devices in the pass-through mode. This function is slightly modified > version of scsi_execute_async(), which submits requests in FIFO order > instead of LIFO as scsi_execute_async() does (so with > scsi_execute_async() they are executed in the reverse order). > Scsi_execute_async_fifo() added as a separate patch to the kernel. The LIO-SE PSCSI Plugin also depends on scsi_execute_async() for builds on >= 2.6.18. Note in the core LIO storage engine code (would be iscsi_target_transport.c), there is no subsystem dependence logic. The LIO-SE API is what allows the SE plugins to remain simple and small: -rw-r--r-- 1 root root 35008 2008-02-02 03:25 iscsi_target_pscsi.c -rw-r--r-- 1 root root 7537 2008-02-02 17:27 iscsi_target_pscsi.h -rw-r--r-- 1 root root 18269 2008-02-04 02:23 iscsi_target_iblock.c -rw-r--r-- 1 root root 6834 2008-02-04 02:25 iscsi_target_iblock.h -rw-r--r-- 1 root root 30611 2008-02-02 03:25 iscsi_target_file.c -rw-r--r-- 1 root root 7833 2008-02-02 17:27 iscsi_target_file.h -rw-r--r-- 1 root root 35154 2008-02-02 04:01 iscsi_target_rd.c -rw-r--r-- 1 root root 9900 2008-02-02 17:27 iscsi_target_rd.h It also means that the core LIO-SE code does not have to change when the subsystem APIs change. This has been important in the past for the project, but for upstream code, probably would not make a huge difference. --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
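For readers trying to picture what keeping "no subsystem dependence logic" in the core engine means, here is a hypothetical sketch of a plugin operations table; every name in it is invented for illustration and is not the actual LIO-SE or SCST API:

#include <stdio.h>

/* Hypothetical subsystem plugin interface: the core engine would only see
 * this table, and each backend (pSCSI, IBLOCK, FILEIO, ramdisk) would fill
 * it in.  All names are invented. */
struct se_task;                 /* opaque I/O descriptor */

struct target_subsystem_api {
        const char *name;
        void *(*attach_device)(const char *dev_path);
        void  (*detach_device)(void *dev);
        int   (*queue_task)(void *dev, struct se_task *task);
        int   (*get_block_size)(void *dev);
        long long (*get_num_blocks)(void *dev);
};

/* A FILEIO-style backend would register itself by handing the core a
 * filled-in table; the stubs below only show the shape of the contract. */
static void *fileio_attach(const char *path) { (void)path; return NULL; }
static void fileio_detach(void *dev) { (void)dev; }
static int fileio_queue(void *dev, struct se_task *task) { (void)dev; (void)task; return 0; }
static int fileio_block_size(void *dev) { (void)dev; return 512; }
static long long fileio_num_blocks(void *dev) { (void)dev; return 0; }

static const struct target_subsystem_api fileio_example_ops = {
        .name           = "fileio_example",
        .attach_device  = fileio_attach,
        .detach_device  = fileio_detach,
        .queue_task     = fileio_queue,
        .get_block_size = fileio_block_size,
        .get_num_blocks = fileio_num_blocks,
};

int main(void)
{
        printf("registered backend: %s\n", fileio_example_ops.name);
        return 0;
}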
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 17:25 ` James Bottomley 2008-02-04 17:56 ` Vladislav Bolkhovitin @ 2008-02-04 18:29 ` Linus Torvalds 2008-02-04 18:49 ` James Bottomley 2008-02-04 19:06 ` Nicholas A. Bellinger 1 sibling, 2 replies; 148+ messages in thread From: Linus Torvalds @ 2008-02-04 18:29 UTC (permalink / raw) To: James Bottomley Cc: Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel On Mon, 4 Feb 2008, James Bottomley wrote: > > The way a user space solution should work is to schedule mmapped I/O > from the backing store and then send this mmapped region off for target > I/O. mmap'ing may avoid the copy, but the overhead of a mmap operation is quite often much *bigger* than the overhead of a copy operation. Please do not advocate the use of mmap() as a way to avoid memory copies. It's not realistic. Even if you can do it with a single "mmap()" system call (which is not at all a given, considering that block devices can easily be much larger than the available virtual memory space), the fact is that page table games along with the fault (and even just TLB miss) overhead is easily more than the cost of copying a page in a nice streaming manner. Yes, memory is "slow", but dammit, so is mmap(). > You also have to pull tricks with the mmap region in the case of writes > to prevent useless data being read in from the backing store. However, > none of this involves data copies. "data copies" is irrelevant. The only thing that matters is performance. And if avoiding data copies is more costly (or even of a similar cost) than the copies themselves would have been, there is absolutely no upside, and only downsides due to extra complexity. If you want good performance for a service like this, you really generally *do* need to be in kernel space. You can play games in user space, but you're fooling yourself if you think you can do as well as doing it in the kernel. And you're *definitely* fooling yourself if you think mmap() solves performance issues. "Zero-copy" does not equate to "fast". Memory speeds may be slower than core CPU speeds, but not infinitely so! (That said: there *are* alternatives to mmap, like "splice()", that really do potentially solve some issues without the page table and TLB overheads. But while splice() avoids the costs of paging, I strongly suspect it would still have easily measurable latency issues. Switching between user and kernel space multiple times is definitely not going to be free, although it's probably not a huge issue if you have big enough requests). Linus ^ permalink raw reply [flat|nested] 148+ messages in thread
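Since splice() comes up above as the realistic zero-copy alternative, here is a minimal user-space sketch of the classic file-to-socket pattern: the data moves file -> pipe -> socket inside the kernel without being copied through user space. The descriptor setup is assumed to have happened elsewhere; this is only an illustration, not a proposed target data path.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move 'len' bytes from a regular file fd to a connected socket fd
 * without copying the data through user space. */
int splice_file_to_socket(int file_fd, int sock_fd, size_t len)
{
        int p[2];

        if (pipe(p) < 0)
                return -1;

        while (len > 0) {
                /* pull file pages into the pipe */
                ssize_t in = splice(file_fd, NULL, p[1], NULL, len,
                                    SPLICE_F_MOVE | SPLICE_F_MORE);
                if (in <= 0)
                        break;

                /* push them out to the socket */
                ssize_t left = in;
                while (left > 0) {
                        ssize_t out = splice(p[0], NULL, sock_fd, NULL, left,
                                             SPLICE_F_MOVE | SPLICE_F_MORE);
                        if (out <= 0)
                                goto out;
                        left -= out;
                }
                len -= in;
        }
out:
        close(p[0]);
        close(p[1]);
        return len == 0 ? 0 : -1;
}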
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 18:29 ` Linus Torvalds @ 2008-02-04 18:49 ` James Bottomley 2008-02-04 19:06 ` Nicholas A. Bellinger 1 sibling, 0 replies; 148+ messages in thread From: James Bottomley @ 2008-02-04 18:49 UTC (permalink / raw) To: Linus Torvalds Cc: Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel On Mon, 2008-02-04 at 10:29 -0800, Linus Torvalds wrote: > > On Mon, 4 Feb 2008, James Bottomley wrote: > > > > The way a user space solution should work is to schedule mmapped I/O > > from the backing store and then send this mmapped region off for target > > I/O. > > mmap'ing may avoid the copy, but the overhead of a mmap operation is > quite often much *bigger* than the overhead of a copy operation. > > Please do not advocate the use of mmap() as a way to avoid memory copies. > It's not realistic. Even if you can do it with a single "mmap()" system > call (which is not at all a given, considering that block devices can > easily be much larger than the available virtual memory space), the fact > is that page table games along with the fault (and even just TLB miss) > overhead is easily more than the cost of copying a page in a nice > streaming manner. > > Yes, memory is "slow", but dammit, so is mmap(). > > > You also have to pull tricks with the mmap region in the case of writes > > to prevent useless data being read in from the backing store. However, > > none of this involves data copies. > > "data copies" is irrelevant. The only thing that matters is performance. > And if avoiding data copies is more costly (or even of a similar cost) > than the copies themselves would have been, there is absolutely no upside, > and only downsides due to extra complexity. > > If you want good performance for a service like this, you really generally > *do* need to in kernel space. You can play games in user space, but you're > fooling yourself if you think you can do as well as doing it in the > kernel. And you're *definitely* fooling yourself if you think mmap() > solves performance issues. "Zero-copy" does not equate to "fast". Memory > speeds may be slower that core CPU speeds, but not infinitely so! > > (That said: there *are* alternatives to mmap, like "splice()", that really > do potentially solve some issues without the page table and TLB overheads. > But while splice() avoids the costs of paging, I strongly suspect it would > still have easily measurable latency issues. Switching between user and > kernel space multiple times is definitely not going to be free, although > it's probably not a huge issue if you have big enough requests). Sorry ... this is really just a discussion of how something (zero copy) could be done, rather than an implementation proposal. (I'm not actually planning to make the STGT people do anything ... although investigating splice does sound interesting). Right at the moment, STGT seems to be performing just fine on measurements up to gigabit networks. There are suggestions that there may be a problem on 8G IB networks, but it's not definitive yet. I'm already on record as saying I think the best fix for IB networks is just to reduce the context switches by increasing the transfer size, but the infrastructure to allow that only just went into git head. James ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 18:29 ` Linus Torvalds 2008-02-04 18:49 ` James Bottomley @ 2008-02-04 19:06 ` Nicholas A. Bellinger 2008-02-04 19:19 ` Nicholas A. Bellinger 2008-02-04 19:44 ` Linus Torvalds 1 sibling, 2 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-04 19:06 UTC (permalink / raw) To: Linus Torvalds Cc: James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel, Mike Christie On Mon, 2008-02-04 at 10:29 -0800, Linus Torvalds wrote: > > On Mon, 4 Feb 2008, James Bottomley wrote: > > > > The way a user space solution should work is to schedule mmapped I/O > > from the backing store and then send this mmapped region off for target > > I/O. > > mmap'ing may avoid the copy, but the overhead of a mmap operation is > quite often much *bigger* than the overhead of a copy operation. > > Please do not advocate the use of mmap() as a way to avoid memory copies. > It's not realistic. Even if you can do it with a single "mmap()" system > call (which is not at all a given, considering that block devices can > easily be much larger than the available virtual memory space), the fact > is that page table games along with the fault (and even just TLB miss) > overhead is easily more than the cost of copying a page in a nice > streaming manner. > > Yes, memory is "slow", but dammit, so is mmap(). > > > You also have to pull tricks with the mmap region in the case of writes > > to prevent useless data being read in from the backing store. However, > > none of this involves data copies. > > "data copies" is irrelevant. The only thing that matters is performance. > And if avoiding data copies is more costly (or even of a similar cost) > than the copies themselves would have been, there is absolutely no upside, > and only downsides due to extra complexity. > The iSER spec (RFC-5046) quotes the following in the TCP case for direct data placement: " Out-of-order TCP segments in the Traditional iSCSI model have to be stored and reassembled before the iSCSI protocol layer within an end node can place the data in the iSCSI buffers. This reassembly is required because not every TCP segment is likely to contain an iSCSI header to enable its placement, and TCP itself does not have a built-in mechanism for signaling Upper Level Protocol (ULP) message boundaries to aid placement of out-of-order segments. This TCP reassembly at high network speeds is quite counter-productive for the following reasons: wasted memory bandwidth in data copying, the need for reassembly memory, wasted CPU cycles in data copying, and the general store-and-forward latency from an application perspective." While this does not have anything to do directly with the kernel vs. user discussion for target mode storage engine, the scaling and latency case is easy enough to make if we are talking about scaling TCP for 10 Gb/sec storage fabrics. > If you want good performance for a service like this, you really generally > *do* need to in kernel space. You can play games in user space, but you're > fooling yourself if you think you can do as well as doing it in the > kernel. And you're *definitely* fooling yourself if you think mmap() > solves performance issues. "Zero-copy" does not equate to "fast". Memory > speeds may be slower that core CPU speeds, but not infinitely so! 
> From looking at this problem from a kernel space perspective for a number of years, I would be inclined to believe this is true for software and hardware data-path cases. The benefits of moving the various control state machines for something like, say, traditional iSCSI to userspace have always been debatable. The most obvious candidates are things like authentication, especially if something more complex than CHAP is involved -- that is the obvious case for userspace. However, I have always thought that recovery from communication path (iSCSI connection) or entire nexus (iSCSI session) failures would be very problematic if we had to potentially push I/O state down to userspace. The protocol and/or fabric-specific state machines (CSM-E and CSM-I from connection recovery in iSCSI and iSER are the obvious ones) are the best candidates for residing in kernel space. > (That said: there *are* alternatives to mmap, like "splice()", that really > do potentially solve some issues without the page table and TLB overheads. > But while splice() avoids the costs of paging, I strongly suspect it would > still have easily measurable latency issues. Switching between user and > kernel space multiple times is definitely not going to be free, although > it's probably not a huge issue if you have big enough requests). > Most of the SCSI OS storage subsystems that I have worked with in the context of iSCSI have used 256 * 512-byte sector requests, with the default traditional iSCSI PDU data payload (MRDSL) being 64k to hit the sweet spot with crc32c checksum calculations. I am assuming this is going to be the case for other fabrics as well. --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
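For reference on the crc32c sweet-spot remark above, here is a small bitwise sketch of the Castagnoli CRC used for iSCSI header and data digests; a real implementation would be table-driven or use the SSE4.2 crc32 instruction, so this is only meant to pin down the polynomial and the reflected algorithm.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Slow but small; real code uses a lookup table or hardware. */
static uint32_t crc32c(const void *data, size_t len)
{
        const uint8_t *p = data;
        uint32_t crc = 0xFFFFFFFFu;

        while (len--) {
                crc ^= *p++;
                for (int i = 0; i < 8; i++)
                        crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
        }
        return crc ^ 0xFFFFFFFFu;
}

int main(void)
{
        const char *s = "123456789";

        /* The commonly cited check value for CRC32C("123456789")
         * is 0xE3069283. */
        printf("crc32c = 0x%08X\n", crc32c(s, strlen(s)));
        return 0;
}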
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 19:06 ` Nicholas A. Bellinger @ 2008-02-04 19:19 ` Nicholas A. Bellinger 2008-02-04 19:44 ` Linus Torvalds 1 sibling, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-04 19:19 UTC (permalink / raw) To: Linus Torvalds Cc: James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel, Mike Christie, CBE-OSS-DEV On Mon, 2008-02-04 at 11:06 -0800, Nicholas A. Bellinger wrote: > On Mon, 2008-02-04 at 10:29 -0800, Linus Torvalds wrote: > > > > On Mon, 4 Feb 2008, James Bottomley wrote: > > > > > > The way a user space solution should work is to schedule mmapped I/O > > > from the backing store and then send this mmapped region off for target > > > I/O. > > > > mmap'ing may avoid the copy, but the overhead of a mmap operation is > > quite often much *bigger* than the overhead of a copy operation. > > > > Please do not advocate the use of mmap() as a way to avoid memory copies. > > It's not realistic. Even if you can do it with a single "mmap()" system > > call (which is not at all a given, considering that block devices can > > easily be much larger than the available virtual memory space), the fact > > is that page table games along with the fault (and even just TLB miss) > > overhead is easily more than the cost of copying a page in a nice > > streaming manner. > > > > Yes, memory is "slow", but dammit, so is mmap(). > > > > > You also have to pull tricks with the mmap region in the case of writes > > > to prevent useless data being read in from the backing store. However, > > > none of this involves data copies. > > > > "data copies" is irrelevant. The only thing that matters is performance. > > And if avoiding data copies is more costly (or even of a similar cost) > > than the copies themselves would have been, there is absolutely no upside, > > and only downsides due to extra complexity. > > > > The iSER spec (RFC-5046) quotes the following in the TCP case for direct > data placement: > > " Out-of-order TCP segments in the Traditional iSCSI model have to be > stored and reassembled before the iSCSI protocol layer within an end > node can place the data in the iSCSI buffers. This reassembly is > required because not every TCP segment is likely to contain an iSCSI > header to enable its placement, and TCP itself does not have a > built-in mechanism for signaling Upper Level Protocol (ULP) message > boundaries to aid placement of out-of-order segments. This TCP > reassembly at high network speeds is quite counter-productive for the > following reasons: wasted memory bandwidth in data copying, the need > for reassembly memory, wasted CPU cycles in data copying, and the > general store-and-forward latency from an application perspective." > > While this does not have anything to do directly with the kernel vs. user discussion > for target mode storage engine, the scaling and latency case is easy enough > to make if we are talking about scaling TCP for 10 Gb/sec storage fabrics. > > > If you want good performance for a service like this, you really generally > > *do* need to in kernel space. You can play games in user space, but you're > > fooling yourself if you think you can do as well as doing it in the > > kernel. And you're *definitely* fooling yourself if you think mmap() > > solves performance issues. "Zero-copy" does not equate to "fast". Memory > > speeds may be slower that core CPU speeds, but not infinitely so! 
> > > > >From looking at this problem from a kernel space perspective for a > number of years, I would be inclined to believe this is true for > software and hardware data-path cases. The benefits of moving various > control statemachines for something like say traditional iSCSI to > userspace has always been debateable. The most obvious ones are things > like authentication, espically if something more complex than CHAP are > the obvious case for userspace. However, I have thought recovery for > failures caused from communication path (iSCSI connections) or entire > nexuses (iSCSI sessions) failures was very problematic to expect to have > to potentially push down IOs state to userspace. > > Keeping statemachines for protocol and/or fabric specific statemachines > (CSM-E and CSM-I from connection recovery in iSCSI and iSER are the > obvious ones) are the best canidates for residing in kernel space. > > > (That said: there *are* alternatives to mmap, like "splice()", that really > > do potentially solve some issues without the page table and TLB overheads. > > But while splice() avoids the costs of paging, I strongly suspect it would > > still have easily measurable latency issues. Switching between user and > > kernel space multiple times is definitely not going to be free, although > > it's probably not a huge issue if you have big enough requests). > > > Then again, having some data-path for software and hardware bulk IO operation of storage fabric protocol / statemachine in userspace would be really interesting for something like an SPU enabled engine for the Cell Broadband Architecture. --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 19:06 ` Nicholas A. Bellinger 2008-02-04 19:19 ` Nicholas A. Bellinger @ 2008-02-04 19:44 ` Linus Torvalds 2008-02-04 20:06 ` [Scst-devel] " 4news ` (4 more replies) 1 sibling, 5 replies; 148+ messages in thread From: Linus Torvalds @ 2008-02-04 19:44 UTC (permalink / raw) To: Nicholas A. Bellinger Cc: James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Mon, 4 Feb 2008, Nicholas A. Bellinger wrote: > > While this does not have anything to do directly with the kernel vs. > user discussion for target mode storage engine, the scaling and latency > case is easy enough to make if we are talking about scaling TCP for 10 > Gb/sec storage fabrics. I would like to point out that while I think there is no question that the basic data transfer engine would perform better in kernel space, there stll *are* questions whether - iSCSI is relevant enough for us to even care ... - ... and the complexity is actually worth it. That said, I also tend to believe that trying to split things up between kernel and user space is often more complex than just keeping things in one place, because the trade-offs of which part goes where wll inevitably be wrong in *some* area, and then you're really screwed. So from a purely personal standpoint, I'd like to say that I'm not really interested in iSCSI (and I don't quite know why I've been cc'd on this whole discussion) and think that other approaches are potentially *much* better. So for example, I personally suspect that ATA-over-ethernet is way better than some crazy SCSI-over-TCP crap, but I'm biased for simple and low-level, and against those crazy SCSI people to begin with. So take any utterances of mine with a big pinch of salt. Historically, the only split that has worked pretty well is "connection initiation/setup in user space, actual data transfers in kernel space". Pure user-space solutions work, but tend to eventually be turned into kernel-space if they are simple enough and really do have throughput and latency considerations (eg nfsd), and aren't quite complex and crazy enough to have a large impedance-matching problem even for basic IO stuff (eg samba). And totally pure kernel solutions work only if there are very stable standards and no major authentication or connection setup issues (eg local disks). So just going by what has happened in the past, I'd assume that iSCSI would eventually turn into "connecting/authentication in user space" with "data transfers in kernel space". But only if it really does end up mattering enough. We had a totally user-space NFS daemon for a long time, and it was perfectly fine until people really started caring. Linus ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-04 19:44 ` Linus Torvalds @ 2008-02-04 20:06 ` 4news 2008-02-04 20:24 ` Nicholas A. Bellinger ` (3 subsequent siblings) 4 siblings, 0 replies; 148+ messages in thread From: 4news @ 2008-02-04 20:06 UTC (permalink / raw) To: scst-devel Cc: Linus Torvalds, Nicholas A. Bellinger, Mike Christie, Vladislav Bolkhovitin, linux-scsi, Linux Kernel Mailing List, James Bottomley, Andrew Morton, FUJITA Tomonori On Monday 4 February 2008, Linus Torvalds wrote: > So from a purely personal standpoint, I'd like to say that I'm not really > interested in iSCSI (and I don't quite know why I've been cc'd on this > whole discussion) and think that other approaches are potentially *much* > better. So for example, I personally suspect that ATA-over-ethernet is way > better than some crazy SCSI-over-TCP crap, but I'm biased for simple and > low-level, and against those crazy SCSI people to begin with. Surely AoE is better than iSCSI, at least on performance, because of the thinner protocol stack: iscsi -> scsi - ip - eth aoe -> ata - eth but iSCSI is more of a standard than AoE and is more actively used in the real world. Other really useful features are: - iSCSI can move SCSI devices onto an IP-based SAN by routing them (I have some tape changers routed by SCST to systems that have no other way to see a tape). - because it works on the IP layer it can be routed over long distances, so, given the needed bandwidth, you can have a truly remote block device speaking a standard protocol between heterogeneous systems. - iSCSI is now the cheapest SAN available. bye, marco. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 19:44 ` Linus Torvalds 2008-02-04 20:06 ` [Scst-devel] " 4news @ 2008-02-04 20:24 ` Nicholas A. Bellinger 2008-02-04 21:01 ` J. Bruce Fields ` (2 subsequent siblings) 4 siblings, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-04 20:24 UTC (permalink / raw) To: Linus Torvalds Cc: James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Mon, 2008-02-04 at 11:44 -0800, Linus Torvalds wrote: > > On Mon, 4 Feb 2008, Nicholas A. Bellinger wrote: > > > > While this does not have anything to do directly with the kernel vs. > > user discussion for target mode storage engine, the scaling and latency > > case is easy enough to make if we are talking about scaling TCP for 10 > > Gb/sec storage fabrics. > > I would like to point out that while I think there is no question that the > basic data transfer engine would perform better in kernel space, there > stll *are* questions whether > > - iSCSI is relevant enough for us to even care ... > > - ... and the complexity is actually worth it. > > That said, I also tend to believe that trying to split things up between > kernel and user space is often more complex than just keeping things in > one place, because the trade-offs of which part goes where wll inevitably > be wrong in *some* area, and then you're really screwed. > > So from a purely personal standpoint, I'd like to say that I'm not really > interested in iSCSI (and I don't quite know why I've been cc'd on this > whole discussion) The generic target mode storage engine discussion quickly goes to transport specific scenarios. With so much interest in the SCSI transports, in particuarly iSCSI, there are lots of devs, users, and vendors who would like to see Linux improve in this respect. > and think that other approaches are potentially *much* > better. So for example, I personally suspect that ATA-over-ethernet is way > better than some crazy SCSI-over-TCP crap, Having the non SCSI target mode transports use the same data IO path as the SCSI ones to SCSI, BIO, and FILE subsystems is something that can easily be agreed on. Also having to emulate the non SCSI control paths in a non generic matter to a target mode engine has to suck (I don't know what AoE does for that now, considering that this is going down to libata or real SCSI hardware in some cases. There are some of the more arcane task management functionality in SCSI (ACA anyone?) that even generic SCSI target mode engines do not use, and only seem to make endlessly complex implement and emulate. But aside from those very SCSI hardware specific cases, having a generic method to use something like ABORT_TASK or LUN_RESET for a target mode engine (along with the data path to all of the subsystems) would be beneficial for any fabric. > but I'm biased for simple and > low-level, and against those crazy SCSI people to begin with. Well, having no obvious preconception (well, aside from the email address), I am of the mindset than the iSCSI people are the LEAST crazy said crazy SCSI people. Some people (usually least crazy iSCSI standards folks) say that FCoE people are crazy. Being one of the iSCSI people I am kinda obligated to agree, but the technical points are really solid, and have been so for over a decade. 
They are listed here for those who are interested: http://www.ietf.org/mail-archive/web/ips/current/msg02325.html > > So take any utterances of mine with a big pinch of salt. > > Historically, the only split that has worked pretty well is "connection > initiation/setup in user space, actual data transfers in kernel space". > > Pure user-space solutions work, but tend to eventually be turned into > kernel-space if they are simple enough and really do have throughput and > latency considerations (eg nfsd), and aren't quite complex and crazy > enough to have a large impedance-matching problem even for basic IO stuff > (eg samba). > > And totally pure kernel solutions work only if there are very stable > standards and no major authentication or connection setup issues (eg local > disks). > > So just going by what has happened in the past, I'd assume that iSCSI > would eventually turn into "connecting/authentication in user space" with > "data transfers in kernel space". But only if it really does end up > mattering enough. We had a totally user-space NFS daemon for a long time, > and it was perfectly fine until people really started caring. Thanks for putting this into an historical perspective. It is also interesting to note that the iSCSI spec (RFC-3720) was ratified in April 2004, so it will be going on 4 years soon, with pre-RFC products first going out in 2001 (yikes!). In my experience, iSCSI interop amongst implementations (especially between different OSes) has been stable since about late 2004, early 2005, with interop between OS SCSI subsystems (especially talking to non-SCSI hardware) being the slower of the two. --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 19:44 ` Linus Torvalds 2008-02-04 20:06 ` [Scst-devel] " 4news 2008-02-04 20:24 ` Nicholas A. Bellinger @ 2008-02-04 21:01 ` J. Bruce Fields 2008-02-04 21:24 ` Linus Torvalds 2008-02-04 22:43 ` Alan Cox 2008-02-05 19:00 ` Vladislav Bolkhovitin 4 siblings, 1 reply; 148+ messages in thread From: J. Bruce Fields @ 2008-02-04 21:01 UTC (permalink / raw) To: Linus Torvalds Cc: Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Mon, Feb 04, 2008 at 11:44:31AM -0800, Linus Torvalds wrote: ... > Pure user-space solutions work, but tend to eventually be turned into > kernel-space if they are simple enough and really do have throughput and > latency considerations (eg nfsd), and aren't quite complex and crazy > enough to have a large impedance-matching problem even for basic IO stuff > (eg samba). ... > So just going by what has happened in the past, I'd assume that iSCSI > would eventually turn into "connecting/authentication in user space" with > "data transfers in kernel space". But only if it really does end up > mattering enough. We had a totally user-space NFS daemon for a long time, > and it was perfectly fine until people really started caring. I'd assumed the move was primarily because of the difficulty of getting correct semantics on a shared filesystem--if you're content with NFS-only access to your filesystem, then you can probably do everything in userspace, but once you start worrying about getting stable filehandles, consistent file locking, etc., from a real disk filesystem with local users, then you require much closer cooperation from the kernel. And I seem to recall being told that sort of thing was the motivation more than performance, but I wasn't there (and I haven't seen performance comparisons). --b. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 21:01 ` J. Bruce Fields @ 2008-02-04 21:24 ` Linus Torvalds 2008-02-04 22:00 ` Nicholas A. Bellinger ` (2 more replies) 0 siblings, 3 replies; 148+ messages in thread From: Linus Torvalds @ 2008-02-04 21:24 UTC (permalink / raw) To: J. Bruce Fields Cc: Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Mon, 4 Feb 2008, J. Bruce Fields wrote: > > I'd assumed the move was primarily because of the difficulty of getting > correct semantics on a shared filesystem .. not even shared. It was hard to get correct semantics full stop. Which is a traditional problem. The thing is, the kernel always has some internal state, and it's hard to expose all the semantics that the kernel knows about to user space. So no, performance is not the only reason to move to kernel space. It can easily be things like needing direct access to internal data queues (for an iSCSI target, this could be things like barriers or just tagged commands - yes, you can probably emulate things like that without access to the actual IO queues, but are you sure the semantics will be entirely right?) The kernel/userland boundary is not just a performance boundary, it's an abstraction boundary too, and these kinds of protocols tend to break abstractions. NFS broke it by having "file handles" (which is not something that really exists in user space, and is almost impossible to emulate correctly), and I bet the same thing happens when emulating a SCSI target in user space. Maybe not. I _really_ haven't looked into iSCSI, I'm just guessing there would be things like ordering issues. Linus ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 21:24 ` Linus Torvalds @ 2008-02-04 22:00 ` Nicholas A. Bellinger 2008-02-04 22:57 ` Jeff Garzik 2008-02-05 19:01 ` Vladislav Bolkhovitin 2 siblings, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-04 22:00 UTC (permalink / raw) To: Linus Torvalds Cc: J. Bruce Fields, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Mon, 2008-02-04 at 13:24 -0800, Linus Torvalds wrote: > > On Mon, 4 Feb 2008, J. Bruce Fields wrote: > > > > I'd assumed the move was primarily because of the difficulty of getting > > correct semantics on a shared filesystem > > .. not even shared. It was hard to get correct semantics full stop. > > Which is a traditional problem. The thing is, the kernel always has some > internal state, and it's hard to expose all the semantics that the kernel > knows about to user space. > > So no, performance is not the only reason to move to kernel space. It can > easily be things like needing direct access to internal data queues (for an > iSCSI target, this could be things like barriers or just tagged commands - > yes, you can probably emulate things like that without access to the > actual IO queues, but are you sure the semantics will be entirely right?) > > The kernel/userland boundary is not just a performance boundary, it's an > abstraction boundary too, and these kinds of protocols tend to break > abstractions. NFS broke it by having "file handles" (which is not > something that really exists in user space, and is almost impossible to > emulate correctly), and I bet the same thing happens when emulating a SCSI > target in user space. > > Maybe not. I _really_ haven't looked into iSCSI, I'm just guessing there > would be things like ordering issues. > <nod>. The iSCSI CDBs and write immediate, unsolicited, or solicited data payloads may be received out of order across communication paths (which may be going over different subnets) within the nexus, but the execution of the CDB to the SCSI Target Port must be in the same order as it came down from the SCSI subsystem on the initiator port. In iSCSI and iSER terms, this is called Command Sequence Number (CmdSN) ordering, and is enforced within each nexus. The initiator node will be assigning the CmdSNs as the CDBs come down, and when communication paths fail, unacknowledged CmdSNs will be retried on a different communication path when using iSCSI/iSER connection recovery. Already acknowledged CmdSNs will be explicitly retried using an iSCSI specific task management function called TASK_REASSIGN. This, along with the CSM-I and CSM-E state machines, is collectively known as ErrorRecoveryLevel=2 in iSCSI. Anyways, here is a great visual of a modern iSCSI Target processor and SCSI Target Engine. The CmdSN ordering is represented by the oval across iSCSI connections going to various network portal groups on the left side of the diagram. Thanks Eddy Q! http://www.haifa.il.ibm.com/satran/ips/EddyQuicksall-iSCSI-in-diagrams/portal_groups.pdf --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
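For readers following the CmdSN discussion above, a minimal C sketch of the ordering rule Nicholas describes (illustrative only -- this is not code from LIO, SCST, Open-iSCSI, or any real target; all names are made up, and RFC-3720 serial-number wraparound and the MaxCmdSN window are ignored): commands may arrive on any connection of the nexus in any order, but are handed to the execution engine strictly in CmdSN order.

#include <stdint.h>

struct cmd {
        uint32_t cmdsn;
        struct cmd *next;
};

struct nexus {
        uint32_t exp_cmdsn;     /* next CmdSN allowed to execute */
        struct cmd *ooo_list;   /* early arrivals, kept sorted by CmdSN */
};

/* Called when a CDB arrives on any connection within the nexus. */
static void cmd_arrived(struct nexus *n, struct cmd *c,
                        void (*execute)(struct cmd *))
{
        if (c->cmdsn != n->exp_cmdsn) {
                /* Arrived early: park it in CmdSN order and wait. */
                struct cmd **p = &n->ooo_list;
                while (*p && (*p)->cmdsn < c->cmdsn)
                        p = &(*p)->next;
                c->next = *p;
                *p = c;
                return;
        }
        /* In order: run it, then drain any queued successors. */
        execute(c);
        n->exp_cmdsn++;
        while (n->ooo_list && n->ooo_list->cmdsn == n->exp_cmdsn) {
                struct cmd *q = n->ooo_list;
                n->ooo_list = q->next;
                execute(q);
                n->exp_cmdsn++;
        }
}

A real implementation layers connection recovery and TASK_REASSIGN on top of this: a command parked or in flight on a failed connection keeps its CmdSN and is simply re-presented on another connection of the same nexus.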
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 21:24 ` Linus Torvalds 2008-02-04 22:00 ` Nicholas A. Bellinger @ 2008-02-04 22:57 ` Jeff Garzik 2008-02-04 23:45 ` Linus Torvalds ` (2 more replies) 2008-02-05 19:01 ` Vladislav Bolkhovitin 2 siblings, 3 replies; 148+ messages in thread From: Jeff Garzik @ 2008-02-04 22:57 UTC (permalink / raw) To: Linus Torvalds Cc: J. Bruce Fields, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie Linus Torvalds wrote: > So no, performance is not the only reason to move to kernel space. It can > easily be things like needing direct access to internal data queues (for a > iSCSI target, this could be things like barriers or just tagged commands - > yes, you can probably emulate things like that without access to the > actual IO queues, but are you sure the semantics will be entirely right? > > The kernel/userland boundary is not just a performance boundary, it's an > abstraction boundary too, and these kinds of protocols tend to break > abstractions. NFS broke it by having "file handles" (which is not > something that really exists in user space, and is almost impossible to > emulate correctly), and I bet the same thing happens when emulating a SCSI > target in user space. Well, speaking as a complete nutter who just finished the bare bones of an NFSv4 userland server[1]... it depends on your approach. If the userland server is the _only_ one accessing the data[2] -- i.e. the database server model where ls(1) shows a couple multi-gigabyte files or a raw partition -- then it's easy to get all the semantics right, including file handles. You're not racing with local kernel fileserving. Couple that with sendfile(2), sync_file_range(2) and a few other Linux-specific syscalls, and you've got an efficient NFS file server. It becomes a solution similar to Apache or MySQL or Oracle. I quite grant there are many good reasons to do NFS or iSCSI data path in the kernel... my point is more that "impossible" is just from one point of view ;-) > Maybe not. I _rally_ haven't looked into iSCSI, I'm just guessing there > would be things like ordering issues. iSCSI and NBD were passe ideas at birth. :) Networked block devices are attractive because the concepts and implementation are more simple than networked filesystems... but usually you want to run some sort of filesystem on top. At that point you might as well run NFS or [gfs|ocfs|flavor-of-the-week], and ditch your networked block device (and associated complexity). iSCSI is barely useful, because at least someone finally standardized SCSI over LAN/WAN. But you just don't need its complexity if your filesystem must have its own authentication, distributed coordination, multiple-connection management code of its own. Jeff P.S. Clearly my NFSv4 server is NOT intended to replace the kernel one. It's more for experiments, and doing FUSE-like filesystem work. [1] http://linux.yyz.us/projects/nfsv4.html [2] well, outside of dd(1) and similar tricks... the same "going around its back" tricks that can screw up a mounted filesystem. ^ permalink raw reply [flat|nested] 148+ messages in thread
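Jeff's sendfile(2)/sync_file_range(2) point can be made concrete with a small sketch (the serve_read()/start_writeout() helpers and the single-backing-image layout are hypothetical, not taken from his server; error handling is trimmed): a userland file server that owns one big image file can push READ payloads to the client socket without copying them through user space, and can kick off write-back of just the byte range a WRITE dirtied.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/sendfile.h>
#include <fcntl.h>

/* Stream 'count' bytes of the backing image, starting at 'offset',
 * straight to the client socket. */
static ssize_t serve_read(int client_fd, int image_fd,
                          off_t offset, size_t count)
{
        size_t done = 0;

        while (done < count) {
                ssize_t n = sendfile(client_fd, image_fd, &offset,
                                     count - done);
                if (n <= 0)
                        return n < 0 ? -1 : (ssize_t)done;  /* error or EOF */
                done += (size_t)n;
        }
        return (ssize_t)done;
}

/* After applying a WRITE to the image, start write-out of just that
 * range so a later commit/flush is cheap. */
static int start_writeout(int image_fd, off_t offset, size_t count)
{
        return sync_file_range(image_fd, offset, count,
                               SYNC_FILE_RANGE_WRITE);
}

This is the sense in which a userland server can be efficient on the data path; the harder parts, as the rest of the thread notes, are the semantics around handles and ordering.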
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 22:57 ` Jeff Garzik @ 2008-02-04 23:45 ` Linus Torvalds 2008-02-05 0:08 ` Jeff Garzik 2008-02-05 8:38 ` Bart Van Assche 2008-02-05 13:05 ` Olivier Galibert 2 siblings, 1 reply; 148+ messages in thread From: Linus Torvalds @ 2008-02-04 23:45 UTC (permalink / raw) To: Jeff Garzik Cc: J. Bruce Fields, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Mon, 4 Feb 2008, Jeff Garzik wrote: > > Well, speaking as a complete nutter who just finished the bare bones of an > NFSv4 userland server[1]... it depends on your approach. You definitely are a complete nutter ;) > If the userland server is the _only_ one accessing the data[2] -- i.e. the > database server model where ls(1) shows a couple multi-gigabyte files or a raw > partition -- then it's easy to get all the semantics right, including file > handles. You're not racing with local kernel fileserving. It's not really simple in general even then. The problems come with file handles, and two big issues in particular: - handling a reboot (of the server) without impacting the client really does need a "look up by file handle" operation (which you can do by logging the pathname to filehandle translation, but it certainly gets problematic). - non-Unix-like filesystems don't necessarily have a stable "st_ino" field (ie it may change over a rename or have no meaning what-so-ever, things like that), and that makes trying to generate a filehandle really interesting for them. I do agree that it's possible - we obviously _did_ have a user-level NFSD for a long while, after all - but it's quite painful if you want to handle things well. Only allowing access through the NFSD certainly helps a lot, but still doesn't make it quite as trivial as you claim ;) Of course, I think you can make NFSv4 to use volatile filehandles instead of the traditional long-lived ones, and that really should avoid almost all of the problems with doing a NFSv4 server in user space. However, I'd expect there to be clients that don't do the whole volatile thing, or support the file handle becoming stale only at certain well-defined points (ie after renames, not at random reboot times). Linus ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 23:45 ` Linus Torvalds @ 2008-02-05 0:08 ` Jeff Garzik 2008-02-05 1:20 ` Linus Torvalds 0 siblings, 1 reply; 148+ messages in thread From: Jeff Garzik @ 2008-02-05 0:08 UTC (permalink / raw) To: Linus Torvalds Cc: J. Bruce Fields, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie Linus Torvalds wrote: > > On Mon, 4 Feb 2008, Jeff Garzik wrote: >> Well, speaking as a complete nutter who just finished the bare bones of an >> NFSv4 userland server[1]... it depends on your approach. > > You definitely are a complete nutter ;) > >> If the userland server is the _only_ one accessing the data[2] -- i.e. the >> database server model where ls(1) shows a couple multi-gigabyte files or a raw >> partition -- then it's easy to get all the semantics right, including file >> handles. You're not racing with local kernel fileserving. > > It's not really simple in general even then. The problems come with file > handles, and two big issues in particular: > > - handling a reboot (of the server) without impacting the client really > does need a "look up by file handle" operation (which you can do by > logging the pathname to filehandle translation, but it certainly gets > problematic). > > - non-Unix-like filesystems don't necessarily have a stable "st_ino" > field (ie it may change over a rename or have no meaning what-so-ever, > things like that), and that makes trying to generate a filehandle > really interesting for them. Both of these are easily handled if the server is 100% in charge of managing the filesystem _metadata_ and data. That's what I meant by complete control. i.e. it's not ext3 or reiserfs or vfat, it's a block device or 1000GB file managed by a userland process. Doing it that way gives one a bit more freedom to tune the filesystem format directly. Stable inode numbers and filehandles are just as easy as they are with ext3. I'm the filesystem format designer, after all. (run for your lives...) You do wind up having to roll your own dcache in userspace, though. A matter of taste in implementation, but it is not difficult... I've certainly never been accused of having good taste :) > I do agree that it's possible - we obviously _did_ have a user-level NFSD > for a long while, after all - but it's quite painful if you want to handle > things well. Only allowing access through the NFSD certainly helps a lot, > but still doesn't make it quite as trivial as you claim ;) Nah, you're thinking about something different: a userland NFSD competing with other userland processes for access to the same files, while the kernel ultimately manages the filesystem metadata. Recipe for races and inequities, and it's good we moved away from that. I'm talking about where a userland process manages the filesystem metadata too. In a filesystem with a million files, ls(1) on the server will only show a single file: [jgarzik@core ~]$ ls -l /spare/fileserver-data/ total 70657116 -rw-r--r-- 1 jgarzik jgarzik 1818064825 2007-12-29 06:40 fsimage.1 > Of course, I think you can make NFSv4 to use volatile filehandles instead > of the traditional long-lived ones, and that really should avoid almost > all of the problems with doing a NFSv4 server in user space.
However, I'd > expect there to be clients that don't do the whole volatile thing, or > support the file handle becoming stale only at certain well-defined points > (ie after renames, not at random reboot times). Don't get me started on "volatile" versus "persistent" filehandles in NFSv4... groan. Jeff ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 0:08 ` Jeff Garzik @ 2008-02-05 1:20 ` Linus Torvalds 0 siblings, 0 replies; 148+ messages in thread From: Linus Torvalds @ 2008-02-05 1:20 UTC (permalink / raw) To: Jeff Garzik Cc: J. Bruce Fields, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Mon, 4 Feb 2008, Jeff Garzik wrote: > > Both of these are easily handled if the server is 100% in charge of managing > the filesystem _metadata_ and data. That's what I meant by complete control. > > i.e. it not ext3 or reiserfs or vfat, its a block device or 1000GB file > managed by a userland process. Oh ok. Yes, if you bring the filesystem into user mode too, then the problems go away - because now your NFSD can interact directly with the filesystem without any kernel/usermode abstraction layer rules in between. So that has all the same properties as moving NFSD entirely into the kernel. Linus ^ permalink raw reply [flat|nested] 148+ messages in thread
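To illustrate why owning the metadata makes the filehandle problem go away (a purely hypothetical layout, not Jeff's actual image format and not the kernel's handle encoding): if the server keeps its own inode table inside the image, a persistent handle can just name a table slot plus a generation counter, and lookup-by-handle is a table read that survives renames and server restarts.

#include <stddef.h>
#include <stdint.h>

struct fh {                     /* handle as sent to the client */
        uint64_t ino;           /* slot in the server's own inode table */
        uint32_t generation;    /* bumped whenever the slot is reused */
};

struct inode_rec {              /* one slot of the on-image inode table */
        uint32_t generation;
        uint32_t mode;
        uint64_t size;
        uint64_t data_offset;   /* where the file's bytes live in the image */
};

/* Resolve a client-supplied handle against the server-owned table. */
static struct inode_rec *lookup_fh(struct inode_rec *table, uint64_t nslots,
                                   const struct fh *fh)
{
        if (fh->ino >= nslots)
                return NULL;                    /* malformed handle */
        if (table[fh->ino].generation != fh->generation)
                return NULL;                    /* stale: slot was reused */
        return &table[fh->ino];
}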
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 22:57 ` Jeff Garzik 2008-02-04 23:45 ` Linus Torvalds @ 2008-02-05 8:38 ` Bart Van Assche 2008-02-05 17:50 ` Jeff Garzik 2008-02-05 13:05 ` Olivier Galibert 2 siblings, 1 reply; 148+ messages in thread From: Bart Van Assche @ 2008-02-05 8:38 UTC (permalink / raw) To: Jeff Garzik Cc: Linus Torvalds, J. Bruce Fields, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Feb 4, 2008 11:57 PM, Jeff Garzik <jeff@garzik.org> wrote: > Networked block devices are attractive because the concepts and > implementation are more simple than networked filesystems... but usually > you want to run some sort of filesystem on top. At that point you might > as well run NFS or [gfs|ocfs|flavor-of-the-week], and ditch your > networked block device (and associated complexity). Running a filesystem on top of iSCSI results in better performance than NFS, especially if the NFS client conforms to the NFS standard (=synchronous writes). By searching the web for the keywords NFS, iSCSI and performance I found the following (6 years old) document: http://www.technomagesinc.com/papers/ip_paper.html. A quote from the conclusion: Our results, generated by running some of industry standard benchmarks, show that iSCSI significantly outperforms NFS for situations when performing streaming, database like accesses and small file transactions. Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 8:38 ` Bart Van Assche @ 2008-02-05 17:50 ` Jeff Garzik 2008-02-06 10:22 ` Bart Van Assche 0 siblings, 1 reply; 148+ messages in thread From: Jeff Garzik @ 2008-02-05 17:50 UTC (permalink / raw) To: Bart Van Assche Cc: Linus Torvalds, J. Bruce Fields, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie Bart Van Assche wrote: > On Feb 4, 2008 11:57 PM, Jeff Garzik <jeff@garzik.org> wrote: > >> Networked block devices are attractive because the concepts and >> implementation are more simple than networked filesystems... but usually >> you want to run some sort of filesystem on top. At that point you might >> as well run NFS or [gfs|ocfs|flavor-of-the-week], and ditch your >> networked block device (and associated complexity). > > Running a filesystem on top of iSCSI results in better performance > than NFS, especially if the NFS client conforms to the NFS standard > (=synchronous writes). > By searching the web search for the keywords NFS, iSCSI and > performance I found the following (6 years old) document: > http://www.technomagesinc.com/papers/ip_paper.html. A quote from the > conclusion: > Our results, generated by running some of industry standard benchmarks, > show that iSCSI significantly outperforms NFS for situations when > performing streaming, database like accesses and small file transactions. async performs better than sync... this is news? Furthermore, NFSv4 has not only async capability but delegation too (and RDMA if you like such things), so the comparison is not relevant to modern times. But a networked filesystem (note I'm using that term, not "NFS", from here on) is simply far more useful to the average user. A networked block device is a building block -- and a useful one. A networked filesystem is an immediately usable solution. For remotely accessing data, iSCSI+fs is quite simply more overhead than a networked fs. With iSCSI you are doing local VFS -> local blkdev -> network whereas a networked filesystem is local VFS -> network iSCSI+fs also adds new manageability issues, because unless the filesystem is single-computer (such as diskless iSCSI root fs), you still need to go across the network _once again_ to handle filesystem locking and coordination issues. There is no _fundamental_ reason why remote shared storage via iSCSI OSD is any faster than a networked filesystem. SCSI-over-IP has its uses. Absolutely. It needed to be standardized. But let's not pretend iSCSI is anything more than what it is. Its a bloated cat5 cabling standard :) Jeff ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 17:50 ` Jeff Garzik @ 2008-02-06 10:22 ` Bart Van Assche 2008-02-06 14:21 ` Jeff Garzik 0 siblings, 1 reply; 148+ messages in thread From: Bart Van Assche @ 2008-02-06 10:22 UTC (permalink / raw) To: Jeff Garzik Cc: Linus Torvalds, J. Bruce Fields, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Feb 5, 2008 6:50 PM, Jeff Garzik <jeff@garzik.org> wrote: > For remotely accessing data, iSCSI+fs is quite simply more overhead than > a networked fs. With iSCSI you are doing > > local VFS -> local blkdev -> network > > whereas a networked filesystem is > > local VFS -> network There are use cases that can be solved better via iSCSI and a filesystem than via a network filesystem. One such use case is when deploying a virtual machine whose data is stored on a network server: in that case there is only one user of the data (so there are no locking issues) and the filesystem and block device each run in a different operating system: the filesystem runs inside the virtual machine and iSCSI either runs in the hypervisor or in the native OS. Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-06 10:22 ` Bart Van Assche @ 2008-02-06 14:21 ` Jeff Garzik 0 siblings, 0 replies; 148+ messages in thread From: Jeff Garzik @ 2008-02-06 14:21 UTC (permalink / raw) To: Bart Van Assche Cc: Linus Torvalds, J. Bruce Fields, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie Bart Van Assche wrote: > On Feb 5, 2008 6:50 PM, Jeff Garzik <jeff@garzik.org> wrote: >> For remotely accessing data, iSCSI+fs is quite simply more overhead than >> a networked fs. With iSCSI you are doing >> >> local VFS -> local blkdev -> network >> >> whereas a networked filesystem is >> >> local VFS -> network > > There are use cases than can be solved better via iSCSI and a > filesystem than via a network filesystem. One such use case is when > deploying a virtual machine whose data is stored on a network server: > in that case there is only one user of the data (so there are no > locking issues) and filesystem and block device each run in another > operating system: the filesystem runs inside the virtual machine and > iSCSI either runs in the hypervisor or in the native OS. Hence the diskless root fs configuration I referred to in multiple emails... whoopee, you reinvented NFS root with quotas :) Jeff ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 22:57 ` Jeff Garzik 2008-02-04 23:45 ` Linus Torvalds 2008-02-05 8:38 ` Bart Van Assche @ 2008-02-05 13:05 ` Olivier Galibert 2008-02-05 18:08 ` Jeff Garzik 2 siblings, 1 reply; 148+ messages in thread From: Olivier Galibert @ 2008-02-05 13:05 UTC (permalink / raw) To: Jeff Garzik Cc: Linus Torvalds, J. Bruce Fields, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Mon, Feb 04, 2008 at 05:57:47PM -0500, Jeff Garzik wrote: > iSCSI and NBD were passe ideas at birth. :) > > Networked block devices are attractive because the concepts and > implementation are more simple than networked filesystems... but usually > you want to run some sort of filesystem on top. At that point you might > as well run NFS or [gfs|ocfs|flavor-of-the-week], and ditch your > networked block device (and associated complexity). Call me a sysadmin, but I find it easier to plug in and keep in place an ethernet cable than these parallel scsi cables from hell. Every server has at least two ethernet ports by default, with rarely any surprises at the kernel level. Adding ethernet cards is inexpensive, and you pretty much never hear of compatibility problems between cards. So ethernet as a connection medium is really nice compared to scsi. Too bad iscsi is demented and ATAoE/NBD nonexistent. Maybe external SAS will be nice, but I don't see it getting to the level of universality of ethernet any time soon. And it won't get the same amount of user-level compatibility testing in any case. OG. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 13:05 ` Olivier Galibert @ 2008-02-05 18:08 ` Jeff Garzik 0 siblings, 0 replies; 148+ messages in thread From: Jeff Garzik @ 2008-02-05 18:08 UTC (permalink / raw) To: Olivier Galibert, Linus Torvalds, J. Bruce Fields, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie Olivier Galibert wrote: > On Mon, Feb 04, 2008 at 05:57:47PM -0500, Jeff Garzik wrote: >> iSCSI and NBD were passe ideas at birth. :) >> >> Networked block devices are attractive because the concepts and >> implementation are more simple than networked filesystems... but usually >> you want to run some sort of filesystem on top. At that point you might >> as well run NFS or [gfs|ocfs|flavor-of-the-week], and ditch your >> networked block device (and associated complexity). > > Call me a sysadmin, but I find easier to plug in and keep in place an > ethernet cable than these parallel scsi cables from hell. Every > server has at least two ethernet ports by default, with rarely any > surprises at the kernel level. Adding ethernet cards is inexpensive, > and you pretty much never hear of compatibility problems between > cards. > > So ethernet as a connection medium is really nice compared to scsi. > Too bad iscsi is demented and ATAoE/NBD inexistant. Maybe external > SAS will be nice, but I don't see it getting to the level of > universality of ethernet any time soon. And it won't get the same > amount of user-level compatibility testing in any case. Indeed, at the end of the day iSCSI is a bloated cabling standard. :) It has its uses, but I don't see it as ever coming close to replacing direct-to-network (perhaps backed with local cachefs) filesystems... which is how all the hype comes across to me. Cheap "Lintel" boxes everybody is familiar with _are_ the storage appliances. Until mass-produced ATA and SCSI devices start shipping with ethernet connectors, anyway. Jeff ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 21:24 ` Linus Torvalds 2008-02-04 22:00 ` Nicholas A. Bellinger 2008-02-04 22:57 ` Jeff Garzik @ 2008-02-05 19:01 ` Vladislav Bolkhovitin 2 siblings, 0 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-05 19:01 UTC (permalink / raw) To: Linus Torvalds Cc: J. Bruce Fields, Mike Christie, linux-scsi, Linux Kernel Mailing List, Nicholas A. Bellinger, James Bottomley, scst-devel, Andrew Morton, FUJITA Tomonori Linus Torvalds wrote: >>I'd assumed the move was primarily because of the difficulty of getting >>correct semantics on a shared filesystem > > > .. not even shared. It was hard to get correct semantics full stop. > > Which is a traditional problem. The thing is, the kernel always has some > internal state, and it's hard to expose all the semantics that the kernel > knows about to user space. > > So no, performance is not the only reason to move to kernel space. It can > easily be things like needing direct access to internal data queues (for a > iSCSI target, this could be things like barriers or just tagged commands - > yes, you can probably emulate things like that without access to the > actual IO queues, but are you sure the semantics will be entirely right? > > The kernel/userland boundary is not just a performance boundary, it's an > abstraction boundary too, and these kinds of protocols tend to break > abstractions. NFS broke it by having "file handles" (which is not > something that really exists in user space, and is almost impossible to > emulate correctly), and I bet the same thing happens when emulating a SCSI > target in user space. Yes, there is something like that for SCSI target as well. It's a "local initiator" or "local nexus", see http://thread.gmane.org/gmane.linux.scsi/31288 and http://news.gmane.org/find-root.php?message_id=%3c463F36AC.3010207%40vlnb.net%3e for more info about that. In fact, existence of local nexus is one more point why SCST is better, than STGT, because for STGT it's pretty hard to support it (all locally generated commands would have to be passed through its daemon, which would be a total disaster for performance), while for SCST it can be done relatively simply. Vlad ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 19:44 ` Linus Torvalds ` (2 preceding siblings ...) 2008-02-04 21:01 ` J. Bruce Fields @ 2008-02-04 22:43 ` Alan Cox 2008-02-04 17:30 ` Douglas Gilbert ` (4 more replies) 2008-02-05 19:00 ` Vladislav Bolkhovitin 4 siblings, 5 replies; 148+ messages in thread From: Alan Cox @ 2008-02-04 22:43 UTC (permalink / raw) To: Linus Torvalds Cc: Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie > better. So for example, I personally suspect that ATA-over-ethernet is way > better than some crazy SCSI-over-TCP crap, but I'm biased for simple and > low-level, and against those crazy SCSI people to begin with. Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP would probably trash iSCSI for latency if nothing else. Alan ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 22:43 ` Alan Cox @ 2008-02-04 17:30 ` Douglas Gilbert 2008-02-05 2:07 ` [Scst-devel] " Chris Weiss 2008-02-04 22:59 ` Nicholas A. Bellinger ` (3 subsequent siblings) 4 siblings, 1 reply; 148+ messages in thread From: Douglas Gilbert @ 2008-02-04 17:30 UTC (permalink / raw) To: Alan Cox Cc: Linus Torvalds, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie Alan Cox wrote: >> better. So for example, I personally suspect that ATA-over-ethernet is way >> better than some crazy SCSI-over-TCP crap, but I'm biased for simple and >> low-level, and against those crazy SCSI people to begin with. > > Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP > would probably trash iSCSI for latency if nothing else. And a variant that doesn't do ATA or IP: http://www.fcoe.com/ ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-04 17:30 ` Douglas Gilbert @ 2008-02-05 2:07 ` Chris Weiss 2008-02-05 14:19 ` FUJITA Tomonori 0 siblings, 1 reply; 148+ messages in thread From: Chris Weiss @ 2008-02-05 2:07 UTC (permalink / raw) To: dougg Cc: Alan Cox, Mike Christie, Vladislav Bolkhovitin, linux-scsi, Linux Kernel Mailing List, Nicholas A. Bellinger, James Bottomley, scst-devel, Andrew Morton, Linus Torvalds, FUJITA Tomonori On Feb 4, 2008 11:30 AM, Douglas Gilbert <dougg@torque.net> wrote: > Alan Cox wrote: > >> better. So for example, I personally suspect that ATA-over-ethernet is way > >> better than some crazy SCSI-over-TCP crap, but I'm biased for simple and > >> low-level, and against those crazy SCSI people to begin with. > > > > Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP > > would probably trash iSCSI for latency if nothing else. > > And a variant that doesn't do ATA or IP: > http://www.fcoe.com/ > however, and interestingly enough, the open-fcoe software target depends on scst (for now anyway) ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-05 2:07 ` [Scst-devel] " Chris Weiss @ 2008-02-05 14:19 ` FUJITA Tomonori 0 siblings, 0 replies; 148+ messages in thread From: FUJITA Tomonori @ 2008-02-05 14:19 UTC (permalink / raw) To: cweiss Cc: dougg, alan, michaelc, vst, linux-scsi, linux-kernel, nab, James.Bottomley, scst-devel, akpm, torvalds, fujita.tomonori, fujita.tomonori On Mon, 4 Feb 2008 20:07:01 -0600 "Chris Weiss" <cweiss@gmail.com> wrote: > On Feb 4, 2008 11:30 AM, Douglas Gilbert <dougg@torque.net> wrote: > > Alan Cox wrote: > > >> better. So for example, I personally suspect that ATA-over-ethernet is way > > >> better than some crazy SCSI-over-TCP crap, but I'm biased for simple and > > >> low-level, and against those crazy SCSI people to begin with. > > > > > > Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP > > > would probably trash iSCSI for latency if nothing else. > > > > And a variant that doesn't do ATA or IP: > > http://www.fcoe.com/ > > > > however, and interestingly enough, the open-fcoe software target > depends on scst (for now anyway) STGT also supports software FCoE target driver though it's still experimental stuff. http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg12705.html It works in user space like STGT's iSCSI (and iSER) target driver (i.e. no kernel/user space interaction). ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 22:43 ` Alan Cox 2008-02-04 17:30 ` Douglas Gilbert @ 2008-02-04 22:59 ` Nicholas A. Bellinger 2008-02-04 23:00 ` James Bottomley ` (2 subsequent siblings) 4 siblings, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-04 22:59 UTC (permalink / raw) To: Alan Cox Cc: Linus Torvalds, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie, Julian Satran On Mon, 2008-02-04 at 22:43 +0000, Alan Cox wrote: > > better. So for example, I personally suspect that ATA-over-ethernet is way > > better than some crazy SCSI-over-TCP crap, but I'm biased for simple and > > low-level, and against those crazy SCSI people to begin with. > > Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP > would probably trash iSCSI for latency if nothing else. > In the previous iSCSI vs. FCoE points (here is the link again): http://www.ietf.org/mail-archive/web/ips/current/msg02325.html the latency discussion is the one bit that is not mentioned. I always assumed that back then (as with today) the biggest issue was getting ethernet hardware, especially switching equipment, down to sub-millisecond latency, and on par with what you would expect from 'real RDMA' hardware. At the lowest of the low, say sub-10 ns latency, which is apparently possible with point-to-point on high-end 10 Gb/sec adapters today, it would be really interesting to know how much more latency would be expected between software iSCSI vs. *oE when we work our way back up the networking stack. Julo, do you have any idea on this..? --nab > > Alan > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 22:43 ` Alan Cox 2008-02-04 17:30 ` Douglas Gilbert 2008-02-04 22:59 ` Nicholas A. Bellinger @ 2008-02-04 23:00 ` James Bottomley 2008-02-04 23:12 ` Nicholas A. Bellinger 2008-02-04 23:04 ` Jeff Garzik 2008-02-05 0:07 ` Matt Mackall 4 siblings, 1 reply; 148+ messages in thread From: James Bottomley @ 2008-02-04 23:00 UTC (permalink / raw) To: Alan Cox Cc: Linus Torvalds, Nicholas A. Bellinger, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Mon, 2008-02-04 at 22:43 +0000, Alan Cox wrote: > > better. So for example, I personally suspect that ATA-over-ethernet is way > > better than some crazy SCSI-over-TCP crap, but I'm biased for simple and > > low-level, and against those crazy SCSI people to begin with. > > Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP > would probably trash iSCSI for latency if nothing else. Actually, there's also FCoE now ... which is essentially SCSI encapsulated in Fibre Channel Protocols (FCP) running over ethernet with Jumbo frames. It does the standard SCSI TCQ, so should answer all the latency pieces. Intel even has an implementation: http://www.open-fcoe.org/ I tend to prefer the low levels as well. The whole disadvantage for IP as regards iSCSI was the layers of protocols on top of it for addressing, authenticating, encrypting and finding any iSCSI device anywhere in the connected universe. I tend to see loss of routing from operating at the MAC level to be a nicely justifiable tradeoff (most storage networks tend to be hubbed or switched anyway). Plus an ethernet MAC with jumbo frames is a large framed nearly lossless medium, which is practically what FCP is expecting. If you really have to connect large remote sites ... well that's what tunnelling bridges are for. James ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 23:00 ` James Bottomley @ 2008-02-04 23:12 ` Nicholas A. Bellinger 2008-02-04 23:16 ` Nicholas A. Bellinger 2008-02-05 18:37 ` James Bottomley 0 siblings, 2 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-04 23:12 UTC (permalink / raw) To: James Bottomley Cc: Alan Cox, Linus Torvalds, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie, Julian Satran On Mon, 2008-02-04 at 17:00 -0600, James Bottomley wrote: > On Mon, 2008-02-04 at 22:43 +0000, Alan Cox wrote: > > > better. So for example, I personally suspect that ATA-over-ethernet is way > > > better than some crazy SCSI-over-TCP crap, but I'm biased for simple and > > > low-level, and against those crazy SCSI people to begin with. > > > > Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP > > would probably trash iSCSI for latency if nothing else. > > Actually, there's also FCoE now ... which is essentially SCSI > encapsulated in Fibre Channel Protocols (FCP) running over ethernet with > Jumbo frames. It does the standard SCSI TCQ, so should answer all the > latency pieces. Intel even has an implementation: > > http://www.open-fcoe.org/ > > I tend to prefer the low levels as well. The whole disadvantage for IP > as regards iSCSI was the layers of protocols on top of it for > addressing, authenticating, encrypting and finding any iSCSI device > anywhere in the connected universe. Btw, while simple in-band discovery of iSCSI exists, the standards based IP storage deployments (iSCSI and iFCP) use iSNS (RFC-4171) for discovery and network fabric management, for things like sending state change notifications when a particular network portal is going away so that the initiator can bring up a different communication path to a different network portal, etc. > > I tend to see loss of routing from operating at the MAC level to be a > nicely justifiable tradeoff (most storage networks tend to be hubbed or > switched anyway). Plus an ethernet MAC with jumbo frames is a large > framed nearly lossless medium, which is practically what FCP is > expecting. If you really have to connect large remote sites ... well > that's what tunnelling bridges are for. > Some of the points by Julo on the IPS TWG iSCSI vs. FCoE thread: * the network is limited in physical span and logical span (number of switches) * flow-control/congestion control is achieved with a mechanism adequate for a limited span network (credits). The packet loss rate is almost nil and that allows FCP to avoid using a transport (end-to-end) layer * FCP switches are simple (addresses are local and the memory requirements can be limited through the credit mechanism) * The credit mechanism is highly unstable for large networks (check switch vendors planning docs for the network diameter limits) – the scaling argument * Ethernet has no credit mechanism and any mechanism with a similar effect increases the end point cost. Building a transport layer in the protocol stack has always been the preferred choice of the networking community – the community argument * The "performance penalty" of a complete protocol stack has always been overstated (and overrated).
Moreover the multicore processors that have become dominant on the computing scene have enough compute cycles available to make any "offloading" possible as a mere code restructuring exercise (see the stack reports from Intel, IBM etc.) * Building on a complete stack makes available a wealth of operational and management mechanisms built over the years by the networking community (routing, provisioning, security, service location etc.) – the community argument * Higher level storage access over an IP network is widely available and having both block and file served over the same connection with the same support and management structure is compelling – the community argument * Highly efficient networks are easy to build over IP with optimal (shortest path) routing while Layer 2 networks use bridging and are limited by the logical tree structure that bridges must follow. The effort to combine routers and bridges (rbridges) is promising to change that but it will take some time to finalize (and we don't know exactly how it will operate). Until then the scale of Layer 2 networks is going to be seriously limited – the scaling argument Perhaps it would be worthwhile to get some more linux-net guys in on the discussion. :-) --nab > James > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 23:12 ` Nicholas A. Bellinger @ 2008-02-04 23:16 ` Nicholas A. Bellinger 2008-02-05 18:37 ` James Bottomley 1 sibling, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-04 23:16 UTC (permalink / raw) To: James Bottomley Cc: Alan Cox, Linus Torvalds, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie, Julian Satran On Mon, 2008-02-04 at 15:12 -0800, Nicholas A. Bellinger wrote: > On Mon, 2008-02-04 at 17:00 -0600, James Bottomley wrote: > > On Mon, 2008-02-04 at 22:43 +0000, Alan Cox wrote: > > > > better. So for example, I personally suspect that ATA-over-ethernet is way > > > > better than some crazy SCSI-over-TCP crap, but I'm biased for simple and > > > > low-level, and against those crazy SCSI people to begin with. > > > > > > Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP > > > would probably trash iSCSI for latency if nothing else. > > > > Actually, there's also FCoE now ... which is essentially SCSI > > encapsulated in Fibre Channel Protocols (FCP) running over ethernet with > > Jumbo frames. It does the standard SCSI TCQ, so should answer all the > > latency pieces. Intel even has an implementation: > > > > http://www.open-fcoe.org/ > > > > I tend to prefer the low levels as well. The whole disadvantage for IP > > as regards iSCSI was the layers of protocols on top of it for > > addressing, authenticating, encrypting and finding any iSCSI device > > anywhere in the connected universe. > > Btw, while simple in-band discovery of iSCSI exists, the standards based > IP storage deployments (iSCSI and iFCP) use iSNS (RFC-4171) for > discovery and network fabric management, for things like sending state > change notifications when a particular network portal is going away so > that the initiator can bring up a different communication patch to a > different network portal, etc. > > > > > I tend to see loss of routing from operating at the MAC level to be a > > nicely justifiable tradeoff (most storage networks tend to be hubbed or > > switched anyway). Plus an ethernet MAC with jumbo frames is a large > > framed nearly lossless medium, which is practically what FCP is > > expecting. If you really have to connect large remote sites ... well > > that's what tunnelling bridges are for. > > > > Some of the points by Julo on the IPS TWG iSCSI vs. FCoE thread: > > * the network is limited in physical span and logical span (number > of switches) > * flow-control/congestion control is achieved with a mechanism > adequate for a limited span network (credits). The packet loss > rate is almost nil and that allows FCP to avoid using a > transport (end-to-end) layer > * FCP she switches are simple (addresses are local and the memory > requirements cam be limited through the credit mechanism) > * The credit mechanisms is highly unstable for large networks > (check switch vendors planning docs for the network diameter > limits) – the scaling argument > * Ethernet has no credit mechanism and any mechanism with a > similar effect increases the end point cost. Building a > transport layer in the protocol stack has always been the > preferred choice of the networking community – the community > argument > * The "performance penalty" of a complete protocol stack has > always been overstated (and overrated). 
Advances in protocol > stack implementation and finer tuning of the congestion control > mechanisms make conventional TCP/IP performing well even at 10 > Gb/s and over. Moreover the multicore processors that become > dominant on the computing scene have enough compute cycles > available to make any "offloading" possible as a mere code > restructuring exercise (see the stack reports from Intel, IBM > etc.) > * Building on a complete stack makes available a wealth of > operational and management mechanisms built over the years by > the networking community (routing, provisioning, security, > service location etc.) – the community argument > * Higher level storage access over an IP network is widely > available and having both block and file served over the same > connection with the same support and management structure is > compelling– the community argument > * Highly efficient networks are easy to build over IP with optimal > (shortest path) routing while Layer 2 networks use bridging and > are limited by the logical tree structure that bridges must > follow. The effort to combine routers and bridges (rbridges) is > promising to change that but it will take some time to finalize > (and we don't know exactly how it will operate). Untill then the > scale of Layer 2 network is going to seriously limited – the > scaling argument > Another data point from the "The "performance penalty of a complete protocol stack has always been overstated (and overrated)" bullet above: "As a side argument – a performance comparison made in 1998 showed SCSI over TCP (a predecessor of the later iSCSI) to perform better than FCP at 1Gbs for block sizes typical for OLTP (4-8KB). That was what convinced us to take the path that lead to iSCSI – and we used plain vanilla x86 servers with plain-vanilla NICs and Linux (with similar measurements conducted on Windows)." --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 23:12 ` Nicholas A. Bellinger 2008-02-04 23:16 ` Nicholas A. Bellinger @ 2008-02-05 18:37 ` James Bottomley 1 sibling, 0 replies; 148+ messages in thread From: James Bottomley @ 2008-02-05 18:37 UTC (permalink / raw) To: Nicholas A. Bellinger Cc: Alan Cox, Linus Torvalds, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie, Julian Satran This email somehow didn't manage to make it to the list (I suspect because it had html attachments). James --- From: Julian Satran <Julian_Satran@il.ibm.com> To: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Andrew Morton <akpm@linux-foundation.org>, Alan Cox <alan@lxorguk.ukuu.org.uk>, Bart Van Assche <bart.vanassche@gmail.com>, FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>, James Bottomley <James.Bottomley@HansenPartnership.com>, ... Subject: Re: Integration of SCST in the mainstream Linux kernel Date: Mon, 4 Feb 2008 21:31:48 -0500 (20:31 CST) Well stated. In fact the "layers" above ethernet do provide the services that make the TCP/IP stack compelling - a whole complement of services. ALL services required (naming, addressing, discovery, security etc.) will have to be recreated if you take the FcOE route. That makes good business for some but is not necessary for the users. Those services BTW are not on the data path and are not "overhead". The TCP/IP stack pathlength is decently low. What makes most implementations poor is that they were naively extended in the SMP world. Recent implementations (published) from IBM and Intel show excellent performance (4-6 times the regular stack). Unfortunately I do not have latency numbers (as the community's major stress has been throughput) but I assume that RDMA (not necessarily hardware RDMA) and/or the use of infiniband or latency critical applications - within clusters may be the ultimate low latency solution. Ethernet has some inherent latency issues (the bridges) that are inherited by anything on ethernet (FcOE included). The IP protocol stack is not inherently slow but some implementations are somewhat sluggish. But instead of replacing them with new and half-baked contraptions we would all be better off improving what we have and understand. In the whole debate around FcOE I heard a single argument that may have some merit - building iSCSI-FCP convertors to support legacy islands of FCP (read storage products that do not support iSCSI natively) is expensive. It is correct technically - only that FcOE eliminates an expense at the wrong end of the wire - it reduces the cost of the storage box at the expense of added cost at the server (and usually there are many servers using a storage box). FcOE vendors are also bound to provide FCP-like services for FcOE - naming, security, discovery etc. - that do not exist on Ethernet. It is a good business for FcOE vendors - a duplicate set of solutions for users. It should be apparent by now that if one speaks about a "converged" network we should speak about an IP network and not about Ethernet. If we take this route we might perhaps also get to "infrastructure physical variants" that support very low latency better than ethernet and we might be able to use them with the same "stack" - a definite forward-looking solution. IMHO it is foolish to insist on throwing away the whole stack whenever we make a slight improvement in the physical layer of the network.
We have a substantial investment and body of knowledge in the protocol stack and nothing proposed improves on it - neither in its total level of service nor in performance. Julo ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 22:43 ` Alan Cox ` (2 preceding siblings ...) 2008-02-04 23:00 ` James Bottomley @ 2008-02-04 23:04 ` Jeff Garzik 2008-02-04 23:27 ` Linus Torvalds 2008-02-05 19:01 ` Vladislav Bolkhovitin 2008-02-05 0:07 ` Matt Mackall 4 siblings, 2 replies; 148+ messages in thread From: Jeff Garzik @ 2008-02-04 23:04 UTC (permalink / raw) To: Alan Cox Cc: Linus Torvalds, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie Alan Cox wrote: >> better. So for example, I personally suspect that ATA-over-ethernet is way >> better than some crazy SCSI-over-TCP crap, but I'm biased for simple and >> low-level, and against those crazy SCSI people to begin with. > > Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP > would probably trash iSCSI for latency if nothing else. AoE is truly a thing of beauty. It has a two/three page RFC (say no more!). But quite so... AoE is limited to MTU size, which really hurts. Can't really do tagged queueing, etc. iSCSI is way, way too complicated. It's an Internet protocol designed by storage designers, what do you expect? For years I have been hoping that someone will invent a simple protocol (w/ strong auth) that can transit ATA and SCSI commands and responses. Heck, it would be almost trivial if the kernel had a TLS/SSL implementation. Jeff ^ permalink raw reply [flat|nested] 148+ messages in thread
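The sort of "simple protocol" being wished for above might frame each exchange with something like the following (hypothetical structures for illustration only -- this is not AoE, not iSCSI, and not any existing spec; authentication/TLS would sit underneath the byte stream rather than inside the frame):

#include <stdint.h>

enum xfr_proto { XFR_SCSI = 0, XFR_ATA = 1 };

struct xfr_cmd {                /* fixed-size command frame on the wire */
        uint32_t magic;         /* frame delimiter / protocol version */
        uint32_t tag;           /* echoed in the response, enables queueing */
        uint8_t  proto;         /* enum xfr_proto */
        uint8_t  cdb_len;       /* bytes used in cdb[] */
        uint16_t flags;         /* data direction, task attributes, ... */
        uint32_t data_len;      /* write payload bytes that follow */
        uint8_t  cdb[32];       /* SCSI CDB or ATA taskfile, zero padded */
} __attribute__((packed));

struct xfr_rsp {                /* fixed-size response frame */
        uint32_t magic;
        uint32_t tag;           /* matches the command's tag */
        uint8_t  status;        /* SCSI status or ATA status register */
        uint8_t  sense_len;     /* sense/error bytes that follow */
        uint16_t reserved;
        uint32_t data_len;      /* read payload bytes that follow */
} __attribute__((packed));

The tag field is what would buy NCQ/TCQ-style queueing over a single connection, which is the piece Alan notes current AoE lacks.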
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 23:04 ` Jeff Garzik @ 2008-02-04 23:27 ` Linus Torvalds 2008-02-05 19:01 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 148+ messages in thread From: Linus Torvalds @ 2008-02-04 23:27 UTC (permalink / raw) To: Jeff Garzik Cc: Alan Cox, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Mon, 4 Feb 2008, Jeff Garzik wrote: > > For years I have been hoping that someone will invent a simple protocol (w/ > strong auth) that can transit ATA and SCSI commands and responses. Heck, it > would be almost trivial if the kernel had a TLS/SSL implementation. Why would you want authorization? If you don't use IP (just ethernet framing), then 99% of the time the solution is to just trust the subnet. So most people would never want TLS/SSL, and the ones that *do* want it would probably also want IP routing, so you'd actually be better off with a separate higher-level bridging protocol rather than have TLS/SSL as part of the actual packet protocol. So don't add complexity. The beauty of ATA-over-ethernet is exactly that it's simple and straightforward. (Simple and straightforward is also nice for actually creating devices that are the targets of this. I just *bet* that an iSCSI target device probably needs two orders of magnitude more CPU power than a simple AoE thing that can probably be done in an FPGA with no real software at all). Whatever. We have now officially gotten totally off topic ;) Linus ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 23:04 ` Jeff Garzik 2008-02-04 23:27 ` Linus Torvalds @ 2008-02-05 19:01 ` Vladislav Bolkhovitin 2008-02-05 19:12 ` Jeff Garzik 2008-02-06 0:48 ` Nicholas A. Bellinger 1 sibling, 2 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-05 19:01 UTC (permalink / raw) To: Jeff Garzik Cc: Alan Cox, Mike Christie, linux-scsi, Linux Kernel Mailing List, Nicholas A. Bellinger, James Bottomley, scst-devel, Andrew Morton, Linus Torvalds, FUJITA Tomonori Jeff Garzik wrote: > Alan Cox wrote: > >>>better. So for example, I personally suspect that ATA-over-ethernet is way >>>better than some crazy SCSI-over-TCP crap, but I'm biased for simple and >>>low-level, and against those crazy SCSI people to begin with. >> >>Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP >>would probably trash iSCSI for latency if nothing else. > > > AoE is truly a thing of beauty. It has a two/three page RFC (say no more!). > > But quite so... AoE is limited to MTU size, which really hurts. Can't > really do tagged queueing, etc. > > > iSCSI is way, way too complicated. I fully agree. From one side, all that complexity is unavoidable for the case of multiple connections per session, but for the regular case of one connection per session it must be a lot simpler. And now think about iSER, which brings iSCSI to a whole new complexity level ;) > It's an Internet protocol designed > by storage designers, what do you expect? > > For years I have been hoping that someone will invent a simple protocol > (w/ strong auth) that can transit ATA and SCSI commands and responses. > Heck, it would be almost trivial if the kernel had a TLS/SSL implementation. > > Jeff ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 19:01 ` Vladislav Bolkhovitin @ 2008-02-05 19:12 ` Jeff Garzik 2008-02-05 19:21 ` Vladislav Bolkhovitin 2008-02-06 0:17 ` Integration of SCST in the mainstream Linux kernel Nicholas A. Bellinger 2008-02-06 0:48 ` Nicholas A. Bellinger 1 sibling, 2 replies; 148+ messages in thread From: Jeff Garzik @ 2008-02-05 19:12 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Alan Cox, Mike Christie, linux-scsi, Linux Kernel Mailing List, Nicholas A. Bellinger, James Bottomley, scst-devel, Andrew Morton, Linus Torvalds, FUJITA Tomonori Vladislav Bolkhovitin wrote: > Jeff Garzik wrote: >> iSCSI is way, way too complicated. > > I fully agree. From one side, all that complexity is unavoidable for > case of multiple connections per session, but for the regular case of > one connection per session it must be a lot simpler. Actually, think about those multiple connections... we already had to implement fast-failover (and load bal) SCSI multi-pathing at a higher level. IMO that portion of the protocol is redundant: You need the same capability elsewhere in the OS _anyway_, if you are to support multi-pathing. Jeff ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 19:12 ` Jeff Garzik @ 2008-02-05 19:21 ` Vladislav Bolkhovitin 2008-02-06 0:11 ` Nicholas A. Bellinger 2008-02-06 0:17 ` Integration of SCST in the mainstream Linux kernel Nicholas A. Bellinger 1 sibling, 1 reply; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-05 19:21 UTC (permalink / raw) To: Jeff Garzik Cc: Alan Cox, Mike Christie, linux-scsi, Linux Kernel Mailing List, Nicholas A. Bellinger, James Bottomley, scst-devel, Andrew Morton, Linus Torvalds, FUJITA Tomonori Jeff Garzik wrote: >>> iSCSI is way, way too complicated. >> >> I fully agree. From one side, all that complexity is unavoidable for >> case of multiple connections per session, but for the regular case of >> one connection per session it must be a lot simpler. > > Actually, think about those multiple connections... we already had to > implement fast-failover (and load bal) SCSI multi-pathing at a higher > level. IMO that portion of the protocol is redundant: You need the > same capability elsewhere in the OS _anyway_, if you are to support > multi-pathing. I'm thinking about MC/S as about a way to improve performance using several physical links. There's no other way, except MC/S, to keep commands processing order in that case. So, it's really valuable property of iSCSI, although with a limited application. Vlad ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 19:21 ` Vladislav Bolkhovitin @ 2008-02-06 0:11 ` Nicholas A. Bellinger 2008-02-06 1:43 ` Nicholas A. Bellinger 2008-02-12 16:05 ` [Scst-devel] " Bart Van Assche 0 siblings, 2 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-06 0:11 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Jeff Garzik, Alan Cox, Mike Christie, linux-scsi, Linux Kernel Mailing List, James Bottomley, scst-devel, Andrew Morton, Linus Torvalds, FUJITA Tomonori, Julian Satran On Tue, 2008-02-05 at 22:21 +0300, Vladislav Bolkhovitin wrote: > Jeff Garzik wrote: > >>> iSCSI is way, way too complicated. > >> > >> I fully agree. From one side, all that complexity is unavoidable for > >> case of multiple connections per session, but for the regular case of > >> one connection per session it must be a lot simpler. > > > > Actually, think about those multiple connections... we already had to > > implement fast-failover (and load bal) SCSI multi-pathing at a higher > > level. IMO that portion of the protocol is redundant: You need the > > same capability elsewhere in the OS _anyway_, if you are to support > > multi-pathing. > > I'm thinking about MC/S as about a way to improve performance using > several physical links. There's no other way, except MC/S, to keep > commands processing order in that case. So, it's really valuable > property of iSCSI, although with a limited application. > > Vlad > Greetings, I have always observed the case with LIO SE/iSCSI target mode (as well as with other software initiators we can leave out of the discussion for now, and congrats to the open/iscsi folks on the recent release. :-) that execution core hardware thread and inter-nexus per 1 Gb/sec ethernet port performance scales up to 4x and 2x core x86_64 very well with MC/S. I have been seeing 450 MB/sec using 2x socket 4x core x86_64 for a number of years with MC/S. Using MC/S on 10 Gb/sec (on PCI-X v2.0 266mhz as well), which was the first transport that LIO Target ran on that was able to handle duplex ~1200 MB/sec with 3 initiators and MC/S. In the point to point 10 Gb/sec tests on IBM p404 machines, the initiators were able to reach ~910 MB/sec with MC/S. Open/iSCSI was able to go a bit faster (~950 MB/sec) because it uses struct sk_buff directly. A good rule to keep in mind here while considering performance is that, because of context switching overhead and pipeline <-> bus stalling (along with other legacy OS specific storage stack limitations with BLOCK and VFS with O_DIRECT, et al, which I will leave out of the discussion for iSCSI and SE engine target mode), an initiator will scale roughly 1/2 as well as a target, given comparable hardware and virsh output. The software target case also depends, in great regard in many cases, on whether we are talking about something as simple as doing contiguous DMA memory allocations from a SINGLE kernel thread, and handling direction of execution to a storage hardware DMA ring that may not have been allocated in the current kernel thread. In MC/S mode this breaks down to: 1) Sorting logic that handles the pre-execution statemachine for transport from local RDMA memory and OS specific data buffers (TCP application data buffer, struct sk_buff, or RDMA struct page or SG). This should be generic between iSCSI and iSER. 2) Allocation of said memory buffers to OS subsystem dependent code that can be queued up to these drivers. 
It breaks down to what you can get drivers and OS subsystem folks to agree to implement, and what can be made generic in a Transport / BLOCK / VFS layered storage stack. In the "allocate thread DMA ring and use OS supported software and vendor available hardware" case, I don't think the kernel space requirement will ever completely be able to go away. Without diving into RFC-3720 specifics, the MC/S side breaks down into the statemachines for memory allocation, for login and logout (generic to iSCSI and iSER), and for ERL=2 recovery. My plan is to post the locations in the LIO code where this has been implemented, and where we can make this easier, etc. Early in the development of what eventually became the LIO Target code, ERL was broken into separate files and separate function prefixes: iscsi_target_erl0, iscsi_target_erl1 and iscsi_target_erl2. The statemachine for ERL=0 and ERL=2 is pretty simple in RFC-3720 (have a look at 7.1.1. State Descriptions for Initiators and Targets, for those interested in the discussion). The LIO target code is also pretty simple for this: [root@ps3-cell target]# wc -l iscsi_target_erl* 1115 iscsi_target_erl0.c 45 iscsi_target_erl0.h 526 iscsi_target_erl0.o 1426 iscsi_target_erl1.c 51 iscsi_target_erl1.h 1253 iscsi_target_erl1.o 605 iscsi_target_erl2.c 45 iscsi_target_erl2.h 447 iscsi_target_erl2.o 5513 total erl1.c is a bit larger than the others because it contains the MC/S statemachine functions. iscsi_target_erl1.c:iscsi_execute_cmd() and iscsi_target_util.c:iscsi_check_received_cmdsn() do most of the work for the LIO MC/S state machine. It would probably benefit from being broken up into, say, iscsi_target_mcs.c. Note that all of this code is MC/S safe, with the exception of the specific SCSI TMR functions. For the SCSI TMR pieces, I have always hoped to use SCST code for doing this... Most of the login/logout code is done in iscsi_target.c, which could probably also benefit from getting broken out... --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
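To make the MC/S ordering point in the exchange above concrete: RFC-3720 requires that non-immediate commands be executed in CmdSN order across all connections of a session, which is what logic like iscsi_check_received_cmdsn() has to enforce. Below is a minimal sketch of that rule using RFC 1982 serial-number arithmetic; the names and structure are illustrative only, not LIO's actual code.

#include <stdint.h>
#include <stdbool.h>

/* Serial number comparison per RFC 1982, which RFC-3720 mandates for CmdSN. */
static bool sn_lt(uint32_t a, uint32_t b)
{
        return (int32_t)(a - b) < 0;
}

struct session {
        uint32_t exp_cmd_sn;    /* next CmdSN the target will execute */
        uint32_t max_cmd_sn;    /* top of the advertised command window */
};

enum cmd_verdict { CMD_EXECUTE, CMD_QUEUE_OUT_OF_ORDER, CMD_PROTOCOL_ERROR };

/* Decide what to do with a non-immediate command received on any
 * connection belonging to the session. */
static enum cmd_verdict check_received_cmdsn(struct session *sess, uint32_t cmd_sn)
{
        if (sn_lt(sess->max_cmd_sn, cmd_sn))    /* outside the command window */
                return CMD_PROTOCOL_ERROR;
        if (cmd_sn == sess->exp_cmd_sn) {       /* in order: execute now */
                sess->exp_cmd_sn++;
                return CMD_EXECUTE;
        }
        if (sn_lt(sess->exp_cmd_sn, cmd_sn))    /* ahead of ExpCmdSN: hold it */
                return CMD_QUEUE_OUT_OF_ORDER;
        return CMD_PROTOCOL_ERROR;              /* behind ExpCmdSN: stale */
}

A command held back as CMD_QUEUE_OUT_OF_ORDER would sit in a per-session list until the missing CmdSNs arrive on whichever connection carries them, after which the queued commands are released in order; that release loop is the part that has to stay safe across connections.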
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-06 0:11 ` Nicholas A. Bellinger @ 2008-02-06 1:43 ` Nicholas A. Bellinger 2008-02-12 16:05 ` [Scst-devel] " Bart Van Assche 1 sibling, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-06 1:43 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Jeff Garzik, Alan Cox, Mike Christie, linux-scsi, Linux Kernel Mailing List, James Bottomley, scst-devel, Andrew Morton, Linus Torvalds, FUJITA Tomonori, Julian Satran On Tue, 2008-02-05 at 16:11 -0800, Nicholas A. Bellinger wrote: > On Tue, 2008-02-05 at 22:21 +0300, Vladislav Bolkhovitin wrote: > > Jeff Garzik wrote: > > >>> iSCSI is way, way too complicated. > > >> > > >> I fully agree. From one side, all that complexity is unavoidable for > > >> case of multiple connections per session, but for the regular case of > > >> one connection per session it must be a lot simpler. > > > > > > Actually, think about those multiple connections... we already had to > > > implement fast-failover (and load bal) SCSI multi-pathing at a higher > > > level. IMO that portion of the protocol is redundant: You need the > > > same capability elsewhere in the OS _anyway_, if you are to support > > > multi-pathing. > > > > I'm thinking about MC/S as about a way to improve performance using > > several physical links. There's no other way, except MC/S, to keep > > commands processing order in that case. So, it's really valuable > > property of iSCSI, although with a limited application. > > > > Vlad > > > > Greetings, > > I have always observed the case with LIO SE/iSCSI target mode (as well > as with other software initiators we can leave out of the discussion for > now, and congrats to the open/iscsi folks on the recent release. :-) that > execution core hardware thread and inter-nexus per 1 Gb/sec ethernet > port performance scales up to 4x and 2x core x86_64 very well with > MC/S. I have been seeing 450 MB/sec using 2x socket 4x core x86_64 for > a number of years with MC/S. Using MC/S on 10 Gb/sec (on PCI-X v2.0 > 266mhz as well), which was the first transport that LIO Target ran on > that was able to handle duplex ~1200 MB/sec with 3 initiators and > MC/S. In the point to point 10 Gb/sec tests on IBM p404 machines, the > initiators were able to reach ~910 MB/sec with MC/S. Open/iSCSI was > able to go a bit faster (~950 MB/sec) because it uses struct sk_buff > directly. > Sorry, these were IBM p505 express (not p404, duh) which had a 2x socket 2x core POWER5 setup. These (along with an IBM X-series machine) were the only ones available for PCI-X v2.0, and this probably is still the case. :-) Also, these numbers were with a ~9000 MTU (I don't recall what the hardware limit on the 10 Gb/sec switch was) doing direct struct iovec to preallocated struct page mapping for payload on the target side. This is known as the RAMDISK_DR plugin in the LIO-SE. On the initiator, LTP disktest and O_DIRECT were used for direct SCSI block device access. I can dig up this paper if anyone is interested. --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-06 0:11 ` Nicholas A. Bellinger 2008-02-06 1:43 ` Nicholas A. Bellinger @ 2008-02-12 16:05 ` Bart Van Assche 2008-02-13 3:44 ` Nicholas A. Bellinger 1 sibling, 1 reply; 148+ messages in thread From: Bart Van Assche @ 2008-02-12 16:05 UTC (permalink / raw) To: Nicholas A. Bellinger Cc: Vladislav Bolkhovitin, FUJITA Tomonori, Mike Christie, linux-scsi, Linux Kernel Mailing List, James Bottomley, scst-devel, Andrew Morton On Feb 6, 2008 1:11 AM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > I have always observed the case with LIO SE/iSCSI target mode ... Hello Nicholas, Are you sure that the LIO-SE kernel module source code is ready for inclusion in the mainstream Linux kernel ? As you know I tried to test the LIO-SE iSCSI target. Already while configuring the target I encountered a kernel crash that froze the whole system. I can reproduce this kernel crash easily, and I reported it 11 days ago on the LIO-SE mailing list (February 4, 2008). One of the call stacks I posted shows a crash in mempool_alloc() called from jbd. Or: the crash is most likely the result of memory corruption caused by LIO-SE. Because I was curious to know why it took so long to fix such a severe crash, I started browsing through the LIO-SE source code. Analysis of the LIO-SE kernel module source code learned me that this crash is not a coincidence. Dynamic memory allocation (kmalloc()/kfree()) in the LIO-SE kernel module is complex and hard to verify. There are 412 memory allocation/deallocation calls in the current version of the LIO-SE kernel module source code, which is a lot. Additionally, because of the complexity of the memory handling in LIO-SE, it is not possible to verify the correctness of the memory handling by analyzing a single function at a time. In my opinion this makes the LIO-SE source code hard to maintain. Furthermore, the LIO-SE kernel module source code does not follow conventions that have proven their value in the past like grouping all error handling at the end of a function. As could be expected, the consequence is that error handling is not correct in several functions, resulting in memory leaks in case of an error. Some examples of functions in which error handling is clearly incorrect: * transport_allocate_passthrough(). * iscsi_do_build_list(). Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
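The convention Bart refers to, grouping all error handling at the end of a function, is the usual kernel goto-unwind pattern. A minimal sketch follows; the function and label names are hypothetical and userspace allocators stand in for kmalloc()/kfree(), so this only illustrates the style, not LIO-SE or SCST code.

#include <stdlib.h>
#include <errno.h>

struct buf { void *data; };

static int setup_command(struct buf **out_hdr, struct buf **out_payload)
{
        struct buf *hdr, *payload;
        int ret = -ENOMEM;

        hdr = calloc(1, sizeof(*hdr));
        if (!hdr)
                goto out;

        hdr->data = malloc(4096);
        if (!hdr->data)
                goto free_hdr;

        payload = calloc(1, sizeof(*payload));
        if (!payload)
                goto free_hdr_data;

        *out_hdr = hdr;
        *out_payload = payload;
        return 0;

free_hdr_data:          /* unwind in reverse order of acquisition */
        free(hdr->data);
free_hdr:
        free(hdr);
out:
        return ret;
}

Every failure path funnels through one cleanup sequence, so a reviewer can check the whole unwind at a glance instead of auditing each allocation site separately.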
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-12 16:05 ` [Scst-devel] " Bart Van Assche @ 2008-02-13 3:44 ` Nicholas A. Bellinger 2008-02-13 6:18 ` CONFIG_SLUB and reproducable general protection faults on 2.6.2x Nicholas A. Bellinger 0 siblings, 1 reply; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-13 3:44 UTC (permalink / raw) To: Bart Van Assche Cc: Vladislav Bolkhovitin, FUJITA Tomonori, Mike Christie, linux-scsi, Linux Kernel Mailing List, James Bottomley, scst-devel, Andrew Morton, Christoph Hellwig, Rik van Riel, Chris Weiss, Linus Torvalds Greetings all, On Tue, 2008-02-12 at 17:05 +0100, Bart Van Assche wrote: > On Feb 6, 2008 1:11 AM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > I have always observed the case with LIO SE/iSCSI target mode ... > > Hello Nicholas, > > Are you sure that the LIO-SE kernel module source code is ready for > inclusion in the mainstream Linux kernel ? As you know I tried to test > the LIO-SE iSCSI target. Already while configuring the target I > encountered a kernel crash that froze the whole system. I can > reproduce this kernel crash easily, and I reported it 11 days ago on > the LIO-SE mailing list (February 4, 2008). One of the call stacks I > posted shows a crash in mempool_alloc() called from jbd. Or: the crash > is most likely the result of memory corruption caused by LIO-SE. > So I was able to FINALLY track this down to: -# CONFIG_SLUB_DEBUG is not set -# CONFIG_SLAB is not set -CONFIG_SLUB=y +CONFIG_SLAB=y in both your and Chris Weiss's configs that was causing the reproduceable general protection faults. I also disabled CONFIG_RELOCATABLE and crash dump because I was debugging using kdb in x86_64 VM on 2.6.24 with your config. I am pretty sure you can leave this (crash dump) in your config for testing. This can take a while to compile and take up alot of space, esp. with all of the kernel debug options enabled, which on 2.6.24, really amounts to alot of CPU time when building. Also with your original config, I was seeing some strange undefined module objects after Stage 2 Link with iscsi_target_mod with modpost with the SLUB the lockups (which are not random btw, and are tracked back to __kmalloc()).. Also, at module load time with the original config, there where some warning about symbol objects (I believe it was SCSI related, same as the ones with modpost). In any event, the dozen 1000 loop discovery test is now working fine (as well as IPoIB) with the above config change, and you should be ready to go for your testing. Tomo, Vlad, Andrew and Co: Do you have any ideas why this would be the case with LIO-Target..? Is anyone else seeing something similar to this with their target mode (mabye its all out of tree code..?) that is having an issue..? I am using Debian x86_64 and Bart and Chris are using Ubuntu x86_64 and we both have this problem with CONFIG_SLUB on >= 2.6.22 kernel.org kernels. Also, I will recompile some of my non x86 machines with the above enabled and see if I can reproduce.. Here the Bart's config again: http://groups.google.com/group/linux-iscsi-target-dev/browse_thread/thread/30835aede1028188 > Because I was curious to know why it took so long to fix such a severe > crash, I started browsing through the LIO-SE source code. Analysis of > the LIO-SE kernel module source code learned me that this crash is not > a coincidence. Dynamic memory allocation (kmalloc()/kfree()) in the > LIO-SE kernel module is complex and hard to verify. 
What the LIO-SE Target module does is complex. :P Sorry for taking so long, I had to start tracking this down by CONFIG_ option with your config on an x86_64 VM. > There are 412 > memory allocation/deallocation calls in the current version of the > LIO-SE kernel module source code, which is a lot. Additionally, > because of the complexity of the memory handling in LIO-SE, it is not > possible to verify the correctness of the memory handling by analyzing > a single function at a time. In my opinion this makes the LIO-SE > source code hard to maintain. > Furthermore, the LIO-SE kernel module source code does not follow > conventions that have proven their value in the past like grouping all > error handling at the end of a function. As could be expected, the > consequence is that error handling is not correct in several > functions, resulting in memory leaks in case of an error. I would be more than happy to point out the release paths for iSCSI Target and LIO-SE to show they are not actual memory leaks (as I mentioned, this code has been stable for a number of years) for some particular SE or iSCSI Target logic if you are interested. Also, if we are talking about a target mode storage engine that should be going upstream, we are talking about the API to the current stable and future storage systems, and of course the Mem->SG and SG->Mem mapping that handles all possible cases of max_sectors and sector_size, past, present, and future. I am really glad that you have been taking a look at this, because some of the code (as you mention) can get very complex to make this a reality, as it has been with LIO-Target since v2.2. > Some > examples of functions in which error handling is clearly incorrect: > * transport_allocate_passthrough(). > * iscsi_do_build_list(). > You did find the one in transport_allocate_passthrough() and the strncpy() + strlen() in userspace. Also, thanks for pointing me to the missing sg_init_table() and sg_mark_end() usage for 2.6.24. I will post an update to my thread about how to do this for other drivers. I will have a look at your new changes and post them on LIO-Target-Dev for your review. Please feel free to Ack them when I post. (Thanks Bart !!) PS: Sometimes it takes a while when you are on the bleeding edge of development to track these types of issues down. :-) --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
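On the sg_init_table() and sg_mark_end() point above, for anyone porting similar out-of-tree target code to 2.6.24: since the chained scatterlist rework, an SG table has to be initialized with the new helpers before it is handed to the block layer or an HBA driver. A minimal sketch of the pattern; the surrounding pages[]/lengths[] arrays are illustrative and this is not the LIO-SE code.

#include <linux/scatterlist.h>

static void fill_sgl(struct scatterlist *sgl, struct page **pages,
                     unsigned int *lengths, unsigned int nr_ents)
{
        unsigned int i;

        /* Zeroes the table, sets the debug magic when CONFIG_DEBUG_SG is
         * enabled, and terminates the last of the nr_ents entries. */
        sg_init_table(sgl, nr_ents);

        for (i = 0; i < nr_ents; i++)
                sg_set_page(&sgl[i], pages[i], lengths[i], 0);

        /* If fewer than nr_ents entries end up being used, the shorter
         * list must be re-terminated explicitly, e.g.:
         *      sg_mark_end(&sgl[used - 1]);
         */
}

Skipping these helpers is exactly the kind of thing that worked silently on older kernels and can crash or corrupt memory with the 2.6.24 chained scatterlists.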
* CONFIG_SLUB and reproducable general protection faults on 2.6.2x 2008-02-13 3:44 ` Nicholas A. Bellinger @ 2008-02-13 6:18 ` Nicholas A. Bellinger 2008-02-13 16:37 ` Nicholas A. Bellinger 0 siblings, 1 reply; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-13 6:18 UTC (permalink / raw) To: Bart Van Assche Cc: Vladislav Bolkhovitin, FUJITA Tomonori, Mike Christie, linux-scsi, Linux Kernel Mailing List, James Bottomley, Andrew Morton, Christoph Hellwig, Rik van Riel, Chris Weiss, Linus Torvalds On Tue, 2008-02-12 at 19:57 -0800, Nicholas A. Bellinger wrote: > Greetings all, > > On Tue, 2008-02-12 at 17:05 +0100, Bart Van Assche wrote: > > On Feb 6, 2008 1:11 AM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > > I have always observed the case with LIO SE/iSCSI target mode ... > > > > Hello Nicholas, > > > > Are you sure that the LIO-SE kernel module source code is ready for > > inclusion in the mainstream Linux kernel ? As you know I tried to test > > the LIO-SE iSCSI target. Already while configuring the target I > > encountered a kernel crash that froze the whole system. I can > > reproduce this kernel crash easily, and I reported it 11 days ago on > > the LIO-SE mailing list (February 4, 2008). One of the call stacks I > > posted shows a crash in mempool_alloc() called from jbd. Or: the crash > > is most likely the result of memory corruption caused by LIO-SE. > > > > So I was able to FINALLY track this down to: > > -# CONFIG_SLUB_DEBUG is not set > -# CONFIG_SLAB is not set > -CONFIG_SLUB=y > +CONFIG_SLAB=y > > in both your and Chris Weiss's configs that was causing the > reproduceable general protection faults. I also disabled > CONFIG_RELOCATABLE and crash dump because I was debugging using kdb in > x86_64 VM on 2.6.24 with your config. I am pretty sure you can leave > this (crash dump) in your config for testing. > > This can take a while to compile and take up alot of space, esp. with > all of the kernel debug options enabled, which on 2.6.24, really amounts > to alot of CPU time when building. Also with your original config, I > was seeing some strange undefined module objects after Stage 2 Link with > iscsi_target_mod with modpost with the SLUB the lockups (which are not > random btw, and are tracked back to __kmalloc()).. Also, at module load > time with the original config, there where some warning about symbol > objects (I believe it was SCSI related, same as the ones with modpost). > > In any event, the dozen 1000 loop discovery test is now working fine (as > well as IPoIB) with the above config change, and you should be ready to > go for your testing. > > Tomo, Vlad, Andrew and Co: > > Do you have any ideas why this would be the case with LIO-Target..? Is > anyone else seeing something similar to this with their target mode > (mabye its all out of tree code..?) that is having an issue..? I am > using Debian x86_64 and Bart and Chris are using Ubuntu x86_64 and we > both have this problem with CONFIG_SLUB on >= 2.6.22 kernel.org > kernels. > > Also, I will recompile some of my non x86 machines with the above > enabled and see if I can reproduce.. Here the Bart's config again: > > http://groups.google.com/group/linux-iscsi-target-dev/browse_thread/thread/30835aede1028188 > This is also failing on CONFIG_SLUB on 2.6.24 ppc64. Since the rest of the system seems to work fine, my only guess it may be related to the fact that the module is being compiled out of tree. 
Since the rest of the system seems to work fine, my only guess is that it may be related to the fact that the module is being compiled out of tree. I took a quick glance at what kbuild was using for compiler and linker parameters, but nothing looked out of the ordinary. I will take a look with kdb and SLUB re-enabled on x86_64 and see if this helps shed any light on the issue. Is anyone else seeing an issue with CONFIG_SLUB..? Also, I wonder who else aside from Ubuntu is using this by default in their .config..? --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: CONFIG_SLUB and reproducable general protection faults on 2.6.2x 2008-02-13 6:18 ` CONFIG_SLUB and reproducable general protection faults on 2.6.2x Nicholas A. Bellinger @ 2008-02-13 16:37 ` Nicholas A. Bellinger 0 siblings, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-13 16:37 UTC (permalink / raw) To: Bart Van Assche Cc: Vladislav Bolkhovitin, FUJITA Tomonori, Mike Christie, linux-scsi, Linux Kernel Mailing List, James Bottomley, Andrew Morton, Christoph Hellwig, Rik van Riel, Chris Weiss, Linus Torvalds On Tue, 2008-02-12 at 22:18 -0800, Nicholas A. Bellinger wrote: > On Tue, 2008-02-12 at 19:57 -0800, Nicholas A. Bellinger wrote: > > Greetings all, > > > > On Tue, 2008-02-12 at 17:05 +0100, Bart Van Assche wrote: > > > On Feb 6, 2008 1:11 AM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > > > I have always observed the case with LIO SE/iSCSI target mode ... > > > > > > Hello Nicholas, > > > > > > Are you sure that the LIO-SE kernel module source code is ready for > > > inclusion in the mainstream Linux kernel ? As you know I tried to test > > > the LIO-SE iSCSI target. Already while configuring the target I > > > encountered a kernel crash that froze the whole system. I can > > > reproduce this kernel crash easily, and I reported it 11 days ago on > > > the LIO-SE mailing list (February 4, 2008). One of the call stacks I > > > posted shows a crash in mempool_alloc() called from jbd. Or: the crash > > > is most likely the result of memory corruption caused by LIO-SE. > > > > > > > So I was able to FINALLY track this down to: > > > > -# CONFIG_SLUB_DEBUG is not set > > -# CONFIG_SLAB is not set > > -CONFIG_SLUB=y > > +CONFIG_SLAB=y > > > > in both your and Chris Weiss's configs that was causing the > > reproduceable general protection faults. I also disabled > > CONFIG_RELOCATABLE and crash dump because I was debugging using kdb in > > x86_64 VM on 2.6.24 with your config. I am pretty sure you can leave > > this (crash dump) in your config for testing. > > > > This can take a while to compile and take up alot of space, esp. with > > all of the kernel debug options enabled, which on 2.6.24, really amounts > > to alot of CPU time when building. Also with your original config, I > > was seeing some strange undefined module objects after Stage 2 Link with > > iscsi_target_mod with modpost with the SLUB the lockups (which are not > > random btw, and are tracked back to __kmalloc()).. Also, at module load > > time with the original config, there where some warning about symbol > > objects (I believe it was SCSI related, same as the ones with modpost). > > > > In any event, the dozen 1000 loop discovery test is now working fine (as > > well as IPoIB) with the above config change, and you should be ready to > > go for your testing. > > > > Tomo, Vlad, Andrew and Co: > > > > Do you have any ideas why this would be the case with LIO-Target..? Is > > anyone else seeing something similar to this with their target mode > > (mabye its all out of tree code..?) that is having an issue..? I am > > using Debian x86_64 and Bart and Chris are using Ubuntu x86_64 and we > > both have this problem with CONFIG_SLUB on >= 2.6.22 kernel.org > > kernels. > > > > Also, I will recompile some of my non x86 machines with the above > > enabled and see if I can reproduce.. Here the Bart's config again: > > > > http://groups.google.com/group/linux-iscsi-target-dev/browse_thread/thread/30835aede1028188 > > > > This is also failing on CONFIG_SLUB on 2.6.24 ppc64. 
Since the rest of > the system seems to work fine, my only guess is that it may be related to the > fact that the module is being compiled out of tree. I took a quick > glance at what kbuild was using for compiler and linker parameters, but > nothing looked out of the ordinary. > > I will take a look with kdb and SLUB re-enabled on x86_64 and see if this > helps shed any light on the issue. Is anyone else seeing an issue with CONFIG_SLUB..? > I was able to track this down to a memory corruption issue in the in-band iSCSI discovery path. I just made the change, and the diff can be located at: http://groups.google.com/group/linux-iscsi-target-dev/browse_thread/thread/a70d4835c55be392 In any event, this is now fixed, and I will be generating some new builds for LIO-Target shortly for Debian, CentOS and Ubuntu. Especially for the Ubuntu folks, this is going to be an issue with their default kernel config. A big thanks to Bart Van Assche for helping me locate the actual issue, and giving me a clue with slub_debug=FZPU. Thanks again, --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
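On the slub_debug=FZPU hint mentioned above: F turns on sanity checks, Z red zoning, P poisoning and U user tracking, which is what turns a silent heap overrun into an immediate report at the next allocation or free. A sketch of the general class of off-by-one that red zoning catches; this is purely illustrative and not the actual discovery-path bug.

#include <linux/slab.h>
#include <linux/string.h>
#include <linux/errno.h>

static int copy_name(const char *name)
{
        size_t len = strlen(name);
        char *buf = kmalloc(len, GFP_KERNEL);   /* BUG: no room for '\0' */

        if (!buf)
                return -ENOMEM;
        strcpy(buf, name);                      /* writes len + 1 bytes */
        /* ... use buf ... */
        kfree(buf);                             /* slub_debug reports the
                                                   clobbered red zone here */
        return 0;
}

Without the debug options, the extra byte lands in a neighbouring object and the crash shows up much later in an unrelated path (mempool_alloc() from jbd, in the report that started this subthread), which is also why switching from SLUB to SLAB only masked the symptom rather than fixing anything.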
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 19:12 ` Jeff Garzik 2008-02-05 19:21 ` Vladislav Bolkhovitin @ 2008-02-06 0:17 ` Nicholas A. Bellinger 1 sibling, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-06 0:17 UTC (permalink / raw) To: Jeff Garzik Cc: Vladislav Bolkhovitin, Alan Cox, Mike Christie, linux-scsi, Linux Kernel Mailing List, James Bottomley, scst-devel, Andrew Morton, Linus Torvalds, FUJITA Tomonori, Julian Satran On Tue, 2008-02-05 at 14:12 -0500, Jeff Garzik wrote: > Vladislav Bolkhovitin wrote: > > Jeff Garzik wrote: > >> iSCSI is way, way too complicated. > > > > I fully agree. From one side, all that complexity is unavoidable for > > case of multiple connections per session, but for the regular case of > > one connection per session it must be a lot simpler. > > > Actually, think about those multiple connections... we already had to > implement fast-failover (and load bal) SCSI multi-pathing at a higher > level. IMO that portion of the protocol is redundant: You need the > same capability elsewhere in the OS _anyway_, if you are to support > multi-pathing. > > Jeff > > Hey Jeff, I put a whitepaper on the LIO cluster recently about this topic. It is from a few years ago but the datapoints are very relevant. http://linux-iscsi.org/builds/user/nab/Inter.vs.OuterNexus.Multiplexing.pdf The key advantage to MC/S and ERL=2 has always been that they are completely OS independent. They are designed to work together and actually benefit from one another. They are also protocol independent between Traditional iSCSI and iSER. --nab PS: A great thanks to my former colleague Edward Cheng for putting this together. > > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 19:01 ` Vladislav Bolkhovitin 2008-02-05 19:12 ` Jeff Garzik @ 2008-02-06 0:48 ` Nicholas A. Bellinger 2008-02-06 0:51 ` Nicholas A. Bellinger 1 sibling, 1 reply; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-06 0:48 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Jeff Garzik, Alan Cox, Mike Christie, linux-scsi, Linux Kernel Mailing List, James Bottomley, scst-devel, Andrew Morton, Linus Torvalds, FUJITA Tomonori, Julian Satran On Tue, 2008-02-05 at 22:01 +0300, Vladislav Bolkhovitin wrote: > Jeff Garzik wrote: > > Alan Cox wrote: > > > >>>better. So for example, I personally suspect that ATA-over-ethernet is way > >>>better than some crazy SCSI-over-TCP crap, but I'm biased for simple and > >>>low-level, and against those crazy SCSI people to begin with. > >> > >>Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP > >>would probably trash iSCSI for latency if nothing else. > > > > > > AoE is truly a thing of beauty. It has a two/three page RFC (say no more!). > > > > But quite so... AoE is limited to MTU size, which really hurts. Can't > > really do tagged queueing, etc. > > > > > > iSCSI is way, way too complicated. > > I fully agree. From one side, all that complexity is unavoidable for > case of multiple connections per session, but for the regular case of > one connection per session it must be a lot simpler. > > And now think about iSER, which brings iSCSI on the whole new complexity > level ;) Actually, the iSER protocol wire protocol itself is quite simple, because it builds on iSCSI and IPS fundamentals, and because traditional iSCSI's recovery logic for CRC failures (and hence alot of acknowledgement sequence PDUs that go missing, etc) and the RDMA Capable Protocol (RCaP). The logic that iSER collectively disables is known as within-connection and within-command recovery (negotiated as ErrorRecoveryLevel=1 on the wire), RFC-5046 requires that the iSCSI layer that iSER is being enabled to disable CRC32C checksums and any associated timeouts for ERL=1. Also, have a look at Appendix A. in the iSER spec. A.1. iWARP Message Format for iSER Hello Message ...............73 A.2. iWARP Message Format for iSER HelloReply Message ..........74 A.3. iWARP Message Format for SCSI Read Command PDU ............75 A.4. iWARP Message Format for SCSI Read Data ...................76 A.5. iWARP Message Format for SCSI Write Command PDU ...........77 A.6. iWARP Message Format for RDMA Read Request ................78 A.7. iWARP Message Format for Solicited SCSI Write Data ........79 A.8. iWARP Message Format for SCSI Response PDU ................80 This is about as 1/2 as many traditional iSCSI PDUs, that iSER encapulates. --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-06 0:48 ` Nicholas A. Bellinger @ 2008-02-06 0:51 ` Nicholas A. Bellinger 0 siblings, 0 replies; 148+ messages in thread From: Nicholas A. Bellinger @ 2008-02-06 0:51 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Jeff Garzik, Alan Cox, Mike Christie, linux-scsi, Linux Kernel Mailing List, James Bottomley, scst-devel, Andrew Morton, Linus Torvalds, FUJITA Tomonori, Julian Satran On Tue, 2008-02-05 at 16:48 -0800, Nicholas A. Bellinger wrote: > On Tue, 2008-02-05 at 22:01 +0300, Vladislav Bolkhovitin wrote: > > Jeff Garzik wrote: > > > Alan Cox wrote: > > > > > >>>better. So for example, I personally suspect that ATA-over-ethernet is way > > >>>better than some crazy SCSI-over-TCP crap, but I'm biased for simple and > > >>>low-level, and against those crazy SCSI people to begin with. > > >> > > >>Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP > > >>would probably trash iSCSI for latency if nothing else. > > > > > > > > > AoE is truly a thing of beauty. It has a two/three page RFC (say no more!). > > > > > > But quite so... AoE is limited to MTU size, which really hurts. Can't > > > really do tagged queueing, etc. > > > > > > > > > iSCSI is way, way too complicated. > > > > I fully agree. From one side, all that complexity is unavoidable for > > case of multiple connections per session, but for the regular case of > > one connection per session it must be a lot simpler. > > > > And now think about iSER, which brings iSCSI on the whole new complexity > > level ;) > > Actually, the iSER protocol wire protocol itself is quite simple, > because it builds on iSCSI and IPS fundamentals, and because traditional > iSCSI's recovery logic for CRC failures (and hence alot of > acknowledgement sequence PDUs that go missing, etc) and the RDMA > Capable > Protocol (RCaP). this should be: .. and instead the RDMA Capacle Protocol (RCaP) provides the 32-bit or greater data integrity. --nab ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 22:43 ` Alan Cox ` (3 preceding siblings ...) 2008-02-04 23:04 ` Jeff Garzik @ 2008-02-05 0:07 ` Matt Mackall 2008-02-05 0:24 ` Linus Torvalds 4 siblings, 1 reply; 148+ messages in thread From: Matt Mackall @ 2008-02-05 0:07 UTC (permalink / raw) To: Alan Cox Cc: Linus Torvalds, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Mon, 2008-02-04 at 22:43 +0000, Alan Cox wrote: > > better. So for example, I personally suspect that ATA-over-ethernet is way > > better than some crazy SCSI-over-TCP crap, but I'm biased for simple and > > low-level, and against those crazy SCSI people to begin with. > > Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP > would probably trash iSCSI for latency if nothing else. But ATAoE is boring because it's not IP. Which means no routing, firewalls, tunnels, congestion control, etc. NBD and iSCSI (for all its hideous growths) can take advantage of these things. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 0:07 ` Matt Mackall @ 2008-02-05 0:24 ` Linus Torvalds 2008-02-05 0:42 ` Jeff Garzik ` (2 more replies) 0 siblings, 3 replies; 148+ messages in thread From: Linus Torvalds @ 2008-02-05 0:24 UTC (permalink / raw) To: Matt Mackall Cc: Alan Cox, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Mon, 4 Feb 2008, Matt Mackall wrote: > > But ATAoE is boring because it's not IP. Which means no routing, > firewalls, tunnels, congestion control, etc. The thing is, that's often an advantage. Not just for performance. > NBD and iSCSI (for all its hideous growths) can take advantage of these > things. .. and all this could equally well be done by a simple bridging protocol (completely independently of any AoE code). The thing is, iSCSI does things at the wrong level. It *forces* people to use the complex protocols, when it's a known that a lot of people don't want it. Which is why these AoE and FCoE things keep popping up. It's easy to bridge ethernet and add a new layer on top of AoE if you need it. In comparison, it's *impossible* to remove an unnecessary layer from iSCSI. This is why "simple and low-level is good". It's always possible to build on top of low-level protocols, while it's generally never possible to simplify overly complex ones. Linus ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 0:24 ` Linus Torvalds @ 2008-02-05 0:42 ` Jeff Garzik 2008-02-05 0:45 ` Matt Mackall 2008-02-05 4:43 ` [Scst-devel] " Matteo Tescione 2 siblings, 0 replies; 148+ messages in thread From: Jeff Garzik @ 2008-02-05 0:42 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Alan Cox, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie Linus Torvalds wrote: > On Mon, 4 Feb 2008, Matt Mackall wrote: >> But ATAoE is boring because it's not IP. Which means no routing, >> firewalls, tunnels, congestion control, etc. > > The thing is, that's often an advantage. Not just for performance. > >> NBD and iSCSI (for all its hideous growths) can take advantage of these >> things. > > .. and all this could equally well be done by a simple bridging protocol > (completely independently of any AoE code). > > The thing is, iSCSI does things at the wrong level. It *forces* people to > use the complex protocols, when it's a known that a lot of people don't > want it. > > Which is why these AoE and FCoE things keep popping up. > > It's easy to bridge ethernet and add a new layer on top of AoE if you need > it. In comparison, it's *impossible* to remove an unnecessary layer from > iSCSI. > > This is why "simple and low-level is good". It's always possible to build > on top of low-level protocols, while it's generally never possible to > simplify overly complex ones. Never discount "easy" and "just works", which is what IP (and TCP) gives you... Sure you can use a bridging protocol and all that jazz, but I wager, to a network admin yet-another-IP-application is easier to evaluate, deploy and manage on existing networks. Jeff ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 0:24 ` Linus Torvalds 2008-02-05 0:42 ` Jeff Garzik @ 2008-02-05 0:45 ` Matt Mackall 2008-02-05 4:43 ` [Scst-devel] " Matteo Tescione 2 siblings, 0 replies; 148+ messages in thread From: Matt Mackall @ 2008-02-05 0:45 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Nicholas A. Bellinger, James Bottomley, Vladislav Bolkhovitin, Bart Van Assche, Andrew Morton, FUJITA Tomonori, linux-scsi, scst-devel, Linux Kernel Mailing List, Mike Christie On Mon, 2008-02-04 at 16:24 -0800, Linus Torvalds wrote: > > On Mon, 4 Feb 2008, Matt Mackall wrote: > > > > But ATAoE is boring because it's not IP. Which means no routing, > > firewalls, tunnels, congestion control, etc. > > The thing is, that's often an advantage. Not just for performance. > > > NBD and iSCSI (for all its hideous growths) can take advantage of these > > things. > > .. and all this could equally well be done by a simple bridging protocol > (completely independently of any AoE code). > > The thing is, iSCSI does things at the wrong level. It *forces* people to > use the complex protocols, when it's a known that a lot of people don't > want it. I frankly think NBD is at a pretty comfortable level. It's internally very simple (and hardware-agnostic). And moderately easy to do in silicon. But I'm not going to defend iSCSI. I worked on the first implementation (what became the Cisco iSCSI driver) and I have no love for iSCSI at all. It should have been (and started out as) a nearly trivial encapsulation of SCSI over TCP much like ATA over Ethernet but quickly lost the plot when committees got ahold of it. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 148+ messages in thread
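As a concrete illustration of the "nearly trivial encapsulation" Matt describes: classic NBD framing is just two fixed big-endian headers on a TCP stream, a request optionally followed by write data and a reply optionally followed by read data. The sketch below is from memory of the protocol, so treat the field details as approximate rather than normative.

#include <stdint.h>

/* All fields are big-endian on the wire. */
struct nbd_request_hdr {
        uint32_t magic;         /* 0x25609513 */
        uint32_t type;          /* 0 = read, 1 = write, 2 = disconnect */
        uint64_t handle;        /* opaque cookie, echoed in the reply */
        uint64_t from;          /* byte offset into the export */
        uint32_t len;           /* byte count */
} __attribute__((packed));

struct nbd_reply_hdr {
        uint32_t magic;         /* 0x67446698 */
        uint32_t error;         /* 0 on success */
        uint64_t handle;        /* matches the originating request */
} __attribute__((packed));

Everything else (discovery, authentication, multipathing) is left to other layers, which is the design point being argued over in this thread.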
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-05 0:24 ` Linus Torvalds 2008-02-05 0:42 ` Jeff Garzik 2008-02-05 0:45 ` Matt Mackall @ 2008-02-05 4:43 ` Matteo Tescione 2008-02-05 5:07 ` James Bottomley 2008-02-05 13:38 ` FUJITA Tomonori 2 siblings, 2 replies; 148+ messages in thread From: Matteo Tescione @ 2008-02-05 4:43 UTC (permalink / raw) To: Linus Torvalds, Matt Mackall Cc: Mike Christie, Vladislav Bolkhovitin, linux-scsi, Linux Kernel Mailing List, Nicholas A. Bellinger, James Bottomley, scst-devel, Andrew Morton, FUJITA Tomonori, Alan Cox Hi all, And sorry for intrusion, i am not a developer but i work everyday with iscsi and i found it fantastic. Altough Aoe, Fcoe and so on could be better, we have to look in real world implementations what is needed *now*, and if we look at vmware world, virtual iron, microsoft clustering etc, the answer is iSCSI. And now, SCST is the best open-source iSCSI target. So, from an end-user point of view, what are the really problems to not integrate scst in the mainstream kernel? Just my two cent, -- So long and thank for all the fish -- #Matteo Tescione #RMnet srl > > > On Mon, 4 Feb 2008, Matt Mackall wrote: >> >> But ATAoE is boring because it's not IP. Which means no routing, >> firewalls, tunnels, congestion control, etc. > > The thing is, that's often an advantage. Not just for performance. > >> NBD and iSCSI (for all its hideous growths) can take advantage of these >> things. > > .. and all this could equally well be done by a simple bridging protocol > (completely independently of any AoE code). > > The thing is, iSCSI does things at the wrong level. It *forces* people to > use the complex protocols, when it's a known that a lot of people don't > want it. > > Which is why these AoE and FCoE things keep popping up. > > It's easy to bridge ethernet and add a new layer on top of AoE if you need > it. In comparison, it's *impossible* to remove an unnecessary layer from > iSCSI. > > This is why "simple and low-level is good". It's always possible to build > on top of low-level protocols, while it's generally never possible to > simplify overly complex ones. > > Linus > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Microsoft > Defy all challenges. Microsoft(R) Visual Studio 2008. > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > _______________________________________________ > Scst-devel mailing list > Scst-devel@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/scst-devel > ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-05 4:43 ` [Scst-devel] " Matteo Tescione @ 2008-02-05 5:07 ` James Bottomley 2008-02-05 13:38 ` FUJITA Tomonori 1 sibling, 0 replies; 148+ messages in thread From: James Bottomley @ 2008-02-05 5:07 UTC (permalink / raw) To: Matteo Tescione Cc: Linus Torvalds, Matt Mackall, Mike Christie, Vladislav Bolkhovitin, linux-scsi, Linux Kernel Mailing List, Nicholas A. Bellinger, scst-devel, Andrew Morton, FUJITA Tomonori, Alan Cox On Tue, 2008-02-05 at 05:43 +0100, Matteo Tescione wrote: > Hi all, > And sorry for intrusion, i am not a developer but i work everyday with iscsi > and i found it fantastic. > Altough Aoe, Fcoe and so on could be better, we have to look in real world > implementations what is needed *now*, and if we look at vmware world, > virtual iron, microsoft clustering etc, the answer is iSCSI. > And now, SCST is the best open-source iSCSI target. So, from an end-user > point of view, what are the really problems to not integrate scst in the > mainstream kernel? The fact that your last statement is conjecture. It's definitely untrue for non-IB networks, and the jury is still out on IB networks. James ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel 2008-02-05 4:43 ` [Scst-devel] " Matteo Tescione 2008-02-05 5:07 ` James Bottomley @ 2008-02-05 13:38 ` FUJITA Tomonori 1 sibling, 0 replies; 148+ messages in thread From: FUJITA Tomonori @ 2008-02-05 13:38 UTC (permalink / raw) To: matteo Cc: torvalds, mpm, michaelc, vst, linux-scsi, linux-kernel, nab, James.Bottomley, scst-devel, akpm, fujita.tomonori, alan, fujita.tomonori On Tue, 05 Feb 2008 05:43:10 +0100 Matteo Tescione <matteo@rmnet.it> wrote: > Hi all, > And sorry for intrusion, i am not a developer but i work everyday with iscsi > and i found it fantastic. > Altough Aoe, Fcoe and so on could be better, we have to look in real world > implementations what is needed *now*, and if we look at vmware world, > virtual iron, microsoft clustering etc, the answer is iSCSI. > And now, SCST is the best open-source iSCSI target. So, from an end-user > point of view, what are the really problems to not integrate scst in the > mainstream kernel? Currently, the best open-source iSCSI target implemenation in Linux is Nicholas's LIO, I guess. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-04 19:44 ` Linus Torvalds ` (3 preceding siblings ...) 2008-02-04 22:43 ` Alan Cox @ 2008-02-05 19:00 ` Vladislav Bolkhovitin 4 siblings, 0 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-05 19:00 UTC (permalink / raw) To: Linus Torvalds Cc: Nicholas A. Bellinger, Mike Christie, linux-scsi, Linux Kernel Mailing List, James Bottomley, scst-devel, Andrew Morton, FUJITA Tomonori Linus Torvalds wrote: > So just going by what has happened in the past, I'd assume that iSCSI > would eventually turn into "connecting/authentication in user space" with > "data transfers in kernel space". This is exactly how iSCSI-SCST (iSCSI target driver for SCST) is implemented, credits to IET and Ardis target developers. Vlad ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-01-23 14:22 Integration of SCST in the mainstream Linux kernel Bart Van Assche 2008-01-23 17:11 ` Vladislav Bolkhovitin 2008-01-29 20:42 ` James Bottomley @ 2008-02-05 17:10 ` Erez Zilber 2008-02-05 19:02 ` Bart Van Assche 2008-02-05 19:02 ` Vladislav Bolkhovitin 2 siblings, 2 replies; 148+ messages in thread From: Erez Zilber @ 2008-02-05 17:10 UTC (permalink / raw) To: Bart Van Assche Cc: Linus Torvalds, Andrew Morton, Vladislav Bolkhovitin, James.Bottomley, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel Bart Van Assche wrote: > As you probably know there is a trend in enterprise computing towards > networked storage. This is illustrated by the emergence during the > past few years of standards like SRP (SCSI RDMA Protocol), iSCSI > (Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different > pieces of software are necessary to make networked storage possible: > initiator software and target software. As far as I know there exist > three different SCSI target implementations for Linux: > - The iSCSI Enterprise Target Daemon (IETD, > http://iscsitarget.sourceforge.net/); > - The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/); > - The Generic SCSI Target Middle Level for Linux project (SCST, > http://scst.sourceforge.net/). > Since I was wondering which SCSI target software would be best suited > for an InfiniBand network, I started evaluating the STGT and SCST SCSI > target implementations. Apparently the performance difference between > STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks, > but the SCST target software outperforms the STGT software on an > InfiniBand network. See also the following thread for the details: > http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel. > > Sorry for the late response (but better late than never). One may claim that STGT should have lower performance than SCST because its data path is from userspace. However, your results show that for non-IB transports, they both show the same numbers. Furthermore, with IB there shouldn't be any additional difference between the 2 targets because data transfer from userspace is as efficient as data transfer from kernel space. The only explanation that I see is that fine tuning for iSCSI & iSER is required. As was already mentioned in this thread, with SDR you can get ~900 MB/sec with iSER (on STGT). Erez ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 17:10 ` Erez Zilber @ 2008-02-05 19:02 ` Bart Van Assche 2008-02-05 19:02 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 148+ messages in thread From: Bart Van Assche @ 2008-02-05 19:02 UTC (permalink / raw) To: Erez Zilber Cc: Linus Torvalds, Andrew Morton, Vladislav Bolkhovitin, James.Bottomley, FUJITA Tomonori, linux-scsi, scst-devel, linux-kernel On Feb 5, 2008 6:10 PM, Erez Zilber <erezz@voltaire.com> wrote: > One may claim that STGT should have lower performance than SCST because > its data path is from userspace. However, your results show that for > non-IB transports, they both show the same numbers. Furthermore, with IB > there shouldn't be any additional difference between the 2 targets > because data transfer from userspace is as efficient as data transfer > from kernel space. > > The only explanation that I see is that fine tuning for iSCSI & iSER is > required. As was already mentioned in this thread, with SDR you can get > ~900 MB/sec with iSER (on STGT). My most recent measurements also show that one can get 900 MB/s with STGT + iSER on an SDR IB network, but only for very large block sizes (>= 100 MB). A quote from Linus Torvalds is relevant here (February 5, 2008): Block transfer sizes over about 64kB are totally irrelevant for 99% of all people. Please read my e-mail (posted earlier today) with a comparison for 4 KB - 64 KB block transfer sizes between SCST and STGT. Bart Van Assche. ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: Integration of SCST in the mainstream Linux kernel 2008-02-05 17:10 ` Erez Zilber 2008-02-05 19:02 ` Bart Van Assche @ 2008-02-05 19:02 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 148+ messages in thread From: Vladislav Bolkhovitin @ 2008-02-05 19:02 UTC (permalink / raw) To: Erez Zilber Cc: Bart Van Assche, FUJITA Tomonori, linux-scsi, linux-kernel, James.Bottomley, scst-devel, Andrew Morton, Linus Torvalds Erez Zilber wrote: > Bart Van Assche wrote: > >>As you probably know there is a trend in enterprise computing towards >>networked storage. This is illustrated by the emergence during the >>past few years of standards like SRP (SCSI RDMA Protocol), iSCSI >>(Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different >>pieces of software are necessary to make networked storage possible: >>initiator software and target software. As far as I know there exist >>three different SCSI target implementations for Linux: >>- The iSCSI Enterprise Target Daemon (IETD, >>http://iscsitarget.sourceforge.net/); >>- The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/); >>- The Generic SCSI Target Middle Level for Linux project (SCST, >>http://scst.sourceforge.net/). >>Since I was wondering which SCSI target software would be best suited >>for an InfiniBand network, I started evaluating the STGT and SCST SCSI >>target implementations. Apparently the performance difference between >>STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks, >>but the SCST target software outperforms the STGT software on an >>InfiniBand network. See also the following thread for the details: >>http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel. >> >> > > Sorry for the late response (but better late than never). > > One may claim that STGT should have lower performance than SCST because > its data path is from userspace. However, your results show that for > non-IB transports, they both show the same numbers. Furthermore, with IB > there shouldn't be any additional difference between the 2 targets > because data transfer from userspace is as efficient as data transfer > from kernel space. And now consider if one target has zero-copy cached I/O. How much that will improve its performance? > The only explanation that I see is that fine tuning for iSCSI & iSER is > required. As was already mentioned in this thread, with SDR you can get > ~900 MB/sec with iSER (on STGT). > > Erez ^ permalink raw reply [flat|nested] 148+ messages in thread
* Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel @ 2008-02-09 7:44 Luben Tuikov 0 siblings, 0 replies; 148+ messages in thread From: Luben Tuikov @ 2008-02-09 7:44 UTC (permalink / raw) To: Nicholas A. Bellinger Cc: Bart Van Assche, James Bottomley, Vladislav Bolkhovitin, FUJITA Tomonori, linux-scsi, linux-kernel, scst-devel, Andrew Morton, Linus Torvalds, Ming Zhang --- On Fri, 2/8/08, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote: > > Is there an open iSCSI Target implementation which > does NOT > > issue commands to sub-target devices via the SCSI > mid-layer, but > > bypasses it completely? > > > > Luben > > > > Hi Luben, > > I am guessing you mean futher down the stack, which I > don't know this to Yes, that's what I meant. > be the case. Going futher up the layers is the design of > v2.9 LIO-SE. > There is a diagram explaining the basic concepts from a > 10,000 foot > level. > > http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf Thanks! Luben ^ permalink raw reply [flat|nested] 148+ messages in thread
end of thread, other threads:[~2008-02-20 8:41 UTC | newest] Thread overview: 148+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2008-01-23 14:22 Integration of SCST in the mainstream Linux kernel Bart Van Assche 2008-01-23 17:11 ` Vladislav Bolkhovitin 2008-01-29 20:42 ` James Bottomley 2008-01-29 21:31 ` Roland Dreier 2008-01-29 23:32 ` FUJITA Tomonori 2008-01-30 1:15 ` [Scst-devel] " Vu Pham 2008-01-30 8:38 ` Bart Van Assche 2008-01-30 10:56 ` FUJITA Tomonori 2008-01-30 11:40 ` Vladislav Bolkhovitin 2008-01-30 13:10 ` Bart Van Assche 2008-01-30 13:54 ` FUJITA Tomonori 2008-01-31 7:48 ` Bart Van Assche 2008-01-31 13:25 ` Nicholas A. Bellinger 2008-01-31 14:34 ` Bart Van Assche 2008-01-31 14:44 ` Nicholas A. Bellinger 2008-01-31 15:50 ` Vladislav Bolkhovitin 2008-01-31 16:25 ` [Scst-devel] " Joe Landman 2008-01-31 17:08 ` Bart Van Assche 2008-01-31 17:13 ` Joe Landman 2008-01-31 18:12 ` David Dillow 2008-02-01 11:50 ` Vladislav Bolkhovitin 2008-02-01 11:50 ` Vladislav Bolkhovitin 2008-02-01 12:25 ` Vladislav Bolkhovitin 2008-01-31 17:14 ` Nicholas A. Bellinger 2008-01-31 17:40 ` Bart Van Assche 2008-01-31 18:15 ` Nicholas A. Bellinger 2008-02-01 9:08 ` Bart Van Assche 2008-02-01 8:11 ` Bart Van Assche 2008-02-01 10:39 ` Nicholas A. Bellinger 2008-02-01 11:04 ` Bart Van Assche 2008-02-01 12:05 ` Nicholas A. Bellinger 2008-02-01 13:25 ` Bart Van Assche 2008-02-01 14:36 ` Nicholas A. Bellinger 2008-01-30 16:34 ` James Bottomley 2008-01-30 16:50 ` Bart Van Assche 2008-02-02 15:32 ` Pete Wyckoff 2008-02-05 17:01 ` Erez Zilber 2008-02-06 12:16 ` Bart Van Assche 2008-02-06 16:45 ` Benny Halevy 2008-02-06 17:06 ` Roland Dreier 2008-02-18 9:43 ` Erez Zilber 2008-02-18 11:01 ` Bart Van Assche 2008-02-20 7:34 ` Erez Zilber 2008-02-20 8:41 ` Bart Van Assche 2008-01-30 11:18 ` Vladislav Bolkhovitin 2008-01-30 8:29 ` Bart Van Assche 2008-01-30 16:22 ` James Bottomley 2008-01-30 17:03 ` Bart Van Assche 2008-02-05 7:14 ` [Scst-devel] " Tomasz Chmielewski 2008-02-05 13:38 ` FUJITA Tomonori 2008-02-05 16:07 ` Tomasz Chmielewski 2008-02-05 16:21 ` Ming Zhang 2008-02-05 16:43 ` FUJITA Tomonori 2008-02-05 17:09 ` Matteo Tescione 2008-02-06 1:29 ` FUJITA Tomonori 2008-02-06 2:01 ` Nicholas A. Bellinger 2008-01-30 11:17 ` Vladislav Bolkhovitin 2008-02-04 12:27 ` Vladislav Bolkhovitin 2008-02-04 13:53 ` Bart Van Assche 2008-02-04 17:00 ` David Dillow 2008-02-04 17:08 ` Vladislav Bolkhovitin 2008-02-05 16:25 ` Bart Van Assche 2008-02-05 18:18 ` Linus Torvalds 2008-02-04 15:30 ` James Bottomley 2008-02-04 16:25 ` Vladislav Bolkhovitin 2008-02-04 17:06 ` James Bottomley 2008-02-04 17:16 ` Vladislav Bolkhovitin 2008-02-04 17:25 ` James Bottomley 2008-02-04 17:56 ` Vladislav Bolkhovitin 2008-02-04 18:22 ` James Bottomley 2008-02-04 18:38 ` Vladislav Bolkhovitin 2008-02-04 18:54 ` James Bottomley 2008-02-05 18:59 ` Vladislav Bolkhovitin 2008-02-05 19:13 ` James Bottomley 2008-02-06 18:07 ` Vladislav Bolkhovitin 2008-02-07 13:13 ` [Scst-devel] " Bart Van Assche 2008-02-07 13:45 ` Vladislav Bolkhovitin 2008-02-07 22:51 ` david 2008-02-08 10:37 ` Vladislav Bolkhovitin 2008-02-09 7:40 ` david 2008-02-08 11:33 ` Nicholas A. Bellinger 2008-02-08 14:36 ` Vladislav Bolkhovitin 2008-02-08 23:53 ` Nicholas A. Bellinger 2008-02-15 15:02 ` Bart Van Assche 2008-02-07 15:38 ` [Scst-devel] " Nicholas A. 
Bellinger 2008-02-07 20:37 ` Luben Tuikov 2008-02-08 10:32 ` Vladislav Bolkhovitin 2008-02-09 7:32 ` Luben Tuikov 2008-02-11 10:02 ` Vladislav Bolkhovitin 2008-02-08 11:53 ` [Scst-devel] " Nicholas A. Bellinger 2008-02-08 14:42 ` Vladislav Bolkhovitin 2008-02-09 0:00 ` Nicholas A. Bellinger 2008-02-04 18:29 ` Linus Torvalds 2008-02-04 18:49 ` James Bottomley 2008-02-04 19:06 ` Nicholas A. Bellinger 2008-02-04 19:19 ` Nicholas A. Bellinger 2008-02-04 19:44 ` Linus Torvalds 2008-02-04 20:06 ` [Scst-devel] " 4news 2008-02-04 20:24 ` Nicholas A. Bellinger 2008-02-04 21:01 ` J. Bruce Fields 2008-02-04 21:24 ` Linus Torvalds 2008-02-04 22:00 ` Nicholas A. Bellinger 2008-02-04 22:57 ` Jeff Garzik 2008-02-04 23:45 ` Linus Torvalds 2008-02-05 0:08 ` Jeff Garzik 2008-02-05 1:20 ` Linus Torvalds 2008-02-05 8:38 ` Bart Van Assche 2008-02-05 17:50 ` Jeff Garzik 2008-02-06 10:22 ` Bart Van Assche 2008-02-06 14:21 ` Jeff Garzik 2008-02-05 13:05 ` Olivier Galibert 2008-02-05 18:08 ` Jeff Garzik 2008-02-05 19:01 ` Vladislav Bolkhovitin 2008-02-04 22:43 ` Alan Cox 2008-02-04 17:30 ` Douglas Gilbert 2008-02-05 2:07 ` [Scst-devel] " Chris Weiss 2008-02-05 14:19 ` FUJITA Tomonori 2008-02-04 22:59 ` Nicholas A. Bellinger 2008-02-04 23:00 ` James Bottomley 2008-02-04 23:12 ` Nicholas A. Bellinger 2008-02-04 23:16 ` Nicholas A. Bellinger 2008-02-05 18:37 ` James Bottomley 2008-02-04 23:04 ` Jeff Garzik 2008-02-04 23:27 ` Linus Torvalds 2008-02-05 19:01 ` Vladislav Bolkhovitin 2008-02-05 19:12 ` Jeff Garzik 2008-02-05 19:21 ` Vladislav Bolkhovitin 2008-02-06 0:11 ` Nicholas A. Bellinger 2008-02-06 1:43 ` Nicholas A. Bellinger 2008-02-12 16:05 ` [Scst-devel] " Bart Van Assche 2008-02-13 3:44 ` Nicholas A. Bellinger 2008-02-13 6:18 ` CONFIG_SLUB and reproducable general protection faults on 2.6.2x Nicholas A. Bellinger 2008-02-13 16:37 ` Nicholas A. Bellinger 2008-02-06 0:17 ` Integration of SCST in the mainstream Linux kernel Nicholas A. Bellinger 2008-02-06 0:48 ` Nicholas A. Bellinger 2008-02-06 0:51 ` Nicholas A. Bellinger 2008-02-05 0:07 ` Matt Mackall 2008-02-05 0:24 ` Linus Torvalds 2008-02-05 0:42 ` Jeff Garzik 2008-02-05 0:45 ` Matt Mackall 2008-02-05 4:43 ` [Scst-devel] " Matteo Tescione 2008-02-05 5:07 ` James Bottomley 2008-02-05 13:38 ` FUJITA Tomonori 2008-02-05 19:00 ` Vladislav Bolkhovitin 2008-02-05 17:10 ` Erez Zilber 2008-02-05 19:02 ` Bart Van Assche 2008-02-05 19:02 ` Vladislav Bolkhovitin 2008-02-09 7:44 [Scst-devel] " Luben Tuikov