LKML Archive on lore.kernel.org
* e1000 full-duplex TCP performance well below wire speed
@ 2008-01-30  9:51 Bruce Allen
  2008-01-30 13:18 ` Andi Kleen
  2008-01-30 13:53 ` David Miller
  0 siblings, 2 replies; 19+ messages in thread
From: Bruce Allen @ 2008-01-30  9:51 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Henning Fehrmann, Carsten Aulbert, Bruce Allen

Dear LKML,

We've connected a pair of modern high-performance boxes with integrated 
copper Gb/s Intel NICs, using an ethernet crossover cable, and have run 
some netperf full duplex TCP tests.  The transfer rates are well below 
wire speed.  We're reporting this as a kernel bug, because we expect a 
vanilla kernel with default settings to give wire speed (or close to wire 
speed) performance in this case. We DO see wire speed in simplex 
transfers. The behavior has been verified on multiple machines with 
identical hardware.

Details:
Kernel version: 2.6.23.12
ethernet NIC: Intel 82573L
ethernet driver: e1000 version 7.3.20-k2
motherboard: Supermicro PDSML-LN2+ (one quad core Intel Xeon X3220, Intel 
3000 chipset, 8GB memory)

The test was done with various mtu sizes ranging from 1500 to 9000, with 
ethernet flow control switched on and off, and using reno and cubic as 
the TCP congestion control algorithms.

The behavior depends on the setup. In one test we used cubic congestion 
control, flow control off. The transfer rate in one direction was above 
0.9Gb/s while in the other direction it was 0.6 to 0.8 Gb/s. After 15-20s 
the rates flipped. Perhaps the two streams are fighting for resources. (The 
performance of a full duplex stream should be close to 1Gb/s in both 
directions.)  A graph of the transfer speed as a function of time is here: 
https://n0.aei.uni-hannover.de/networktest/node19-new20-noflow.jpg
Red shows transmit and green shows receive (please ignore the other plots).

We're happy to do additional testing, if that would help, and very 
grateful for any advice!

Bruce Allen
Carsten Aulbert
Henning Fehrmann

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-30  9:51 e1000 full-duplex TCP performance well below wire speed Bruce Allen
@ 2008-01-30 13:18 ` Andi Kleen
  2008-01-30 13:38   ` Bruce Allen
  2008-01-30 13:53 ` David Miller
  1 sibling, 1 reply; 19+ messages in thread
From: Andi Kleen @ 2008-01-30 13:18 UTC (permalink / raw)
  To: Bruce Allen
  Cc: Linux Kernel Mailing List, Henning Fehrmann, Carsten Aulbert,
	Bruce Allen

Bruce Allen <ballen@gravity.phys.uwm.edu> writes:

> Dear LKML,

You forgot to specify what user programs you used to obtain the
benchmark results, e.g. if user space does not use large 
enough reads/writes then performance will not be optimal.

Also best you repost your results with full information
on netdev@vger.kernel.org

-Andi


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-30 13:18 ` Andi Kleen
@ 2008-01-30 13:38   ` Bruce Allen
  2008-01-30 14:08     ` David Miller
  0 siblings, 1 reply; 19+ messages in thread
From: Bruce Allen @ 2008-01-30 13:38 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Henning Fehrmann, Andi Kleen, Carsten Aulbert, Bruce Allen

Hi Andi,

Thanks for the reply.

> You forgot to specify what user programs you used to obtain the 
> benchmark results, e.g. if user space does not use large enough 
> reads/writes then performance will not be optimal.

We used netperf (as stated in the first paragraph of the original post). 
Tell us if you want the command line.  Previous testing with older kernels 
and Broadcom NICs has shown full-duplex wire speed.

> Also best you repost your results with full information on 
> netdev@vger.kernel.org

Wilco.  Just subscribing now.

Cheers,
 	Bruce


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-30  9:51 e1000 full-duplex TCP performance well below wire speed Bruce Allen
  2008-01-30 13:18 ` Andi Kleen
@ 2008-01-30 13:53 ` David Miller
  2008-01-30 14:01   ` Bruce Allen
  1 sibling, 1 reply; 19+ messages in thread
From: David Miller @ 2008-01-30 13:53 UTC (permalink / raw)
  To: ballen
  Cc: linux-kernel, henning.fehrmann, carsten.aulbert, bruce.allen, netdev

From: Bruce Allen <ballen@gravity.phys.uwm.edu>
Date: Wed, 30 Jan 2008 03:51:51 -0600 (CST)

[ netdev@vger.kernel.org added to CC: list, that is where
  kernel networking issues are discussed. ]

> (The performance of a full duplex stream should be close to 1Gb/s in
> both directions.)

This is not a reasonable expectation.

ACKs take up space on the link in the opposite direction of the
transfer.

So the link usage in the opposite direction of the transfer is
very far from zero.


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-30 13:53 ` David Miller
@ 2008-01-30 14:01   ` Bruce Allen
  2008-01-30 16:21     ` Stephen Hemminger
  0 siblings, 1 reply; 19+ messages in thread
From: Bruce Allen @ 2008-01-30 14:01 UTC (permalink / raw)
  To: netdev, Linux Kernel Mailing List, David Miller
  Cc: Henning Fehrmann, Carsten Aulbert, Bruce Allen

Hi David,

Thanks for your note.

>> (The performance of a full duplex stream should be close to 1Gb/s in
>> both directions.)
>
> This is not a reasonable expectation.
>
> ACKs take up space on the link in the opposite direction of the
> transfer.
>
> So the link usage in the opposite direction of the transfer is
> very far from zero.

Indeed, we are not asking to see 1000 Mb/s.  We'd be happy to see 900 
Mb/s.

Netperf is transmitting a large buffer in MTU-sized packets (min 1500 
bytes).  Since the ACKs are only about 60 bytes in size, they should be 
around 4% of the total traffic.  Hence we would not expect to see more 
than 960 Mb/s.
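A quick sketch of that arithmetic (the 60-byte ACK size is the estimate above; real stacks often ACK every other segment, which would roughly halve the overhead):

```python
# Rough ACK-overhead estimate for a full-duplex GigE TCP stream (sketch).
# Assumes one ~60-byte ACK per 1500-byte MTU-sized frame.
mtu_bytes = 1500
ack_bytes = 60
overhead = ack_bytes / mtu_bytes
print(f"ACK overhead: {overhead:.0%}")               # 4%
ceiling_mbps = 1000 * (1 - overhead)
print(f"expected ceiling: {ceiling_mbps:.0f} Mb/s")  # 960 Mb/s
```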

We have run these same tests on older kernels (with Broadcom NICs) and 
gotten above 900 Mb/s full duplex.

Cheers,
     Bruce


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-30 13:38   ` Bruce Allen
@ 2008-01-30 14:08     ` David Miller
  0 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2008-01-30 14:08 UTC (permalink / raw)
  To: ballen; +Cc: linux-kernel, henning.fehrmann, carsten.aulbert, bruce.allen

From: Bruce Allen <ballen@gravity.phys.uwm.edu>
Date: Wed, 30 Jan 2008 07:38:56 -0600 (CST)

> Wilco.  Just subscribing now.

You don't need to subscribe to any list at vger.kernel.org in order to
post a message to it.


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-30 14:01   ` Bruce Allen
@ 2008-01-30 16:21     ` Stephen Hemminger
  2008-01-30 22:25       ` Bruce Allen
  0 siblings, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2008-01-30 16:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev

On Wed, 30 Jan 2008 08:01:46 -0600 (CST)
Bruce Allen <ballen@gravity.phys.uwm.edu> wrote:

> Hi David,
> 
> Thanks for your note.
> 
> >> (The performance of a full duplex stream should be close to 1Gb/s in
> >> both directions.)
> >
> > This is not a reasonable expectation.
> >
> > ACKs take up space on the link in the opposite direction of the
> > transfer.
> >
> > So the link usage in the opposite direction of the transfer is
> > very far from zero.
> 
> Indeed, we are not asking to see 1000 Mb/s.  We'd be happy to see 900 
> Mb/s.
> 
> Netperf is transmitting a large buffer in MTU-sized packets (min 1500 
> bytes).  Since the ACKs are only about 60 bytes in size, they should be 
> around 4% of the total traffic.  Hence we would not expect to see more 
> than 960 Mb/s.
> 
> We have run these same tests on older kernels (with Broadcom NICs) and 
> gotten above 900 Mb/s full duplex.
> 
> Cheers,
>      Bruce

Don't forget the network overhead: http://sd.wareonearth.com/~phil/net/overhead/
 Max TCP Payload data rates over ethernet:
  (1500-40)/(38+1500) = 94.9285 %  IPv4, minimal headers
  (1500-52)/(38+1500) = 94.1482 %  IPv4, TCP timestamps
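Those figures can be reproduced directly; the constants (38 bytes of per-packet Ethernet framing, 40 or 52 bytes of IPv4+TCP headers) are taken from the page linked above:

```python
# Max TCP payload efficiency over Ethernet at MTU 1500 (sketch).
# framing = preamble + header + FCS + interframe gap (38 bytes);
# headers = IPv4 + TCP (40 bytes), or 52 bytes with TCP timestamps.
def tcp_efficiency(mtu=1500, headers=40, framing=38):
    return (mtu - headers) / (framing + mtu)

print(f"{tcp_efficiency():.4%}")            # 94.9285% (minimal headers)
print(f"{tcp_efficiency(headers=52):.4%}")  # 94.1482% (TCP timestamps)
```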

I believe what you are seeing is an effect that occurs when using
cubic on otherwise idle links. With two flows at high speed,
the first flow consumes most of the router buffer and backs off gradually,
while the second flow is not very aggressive.  It has been discussed
back and forth between TCP researchers with no agreement: one side
says that it is unfairness, and the other side says it is not a problem in
the real world because of the presence of background traffic.

See:
  http://www.hamilton.ie/net/pfldnet2007_cubic_final.pdf
  http://www.csc.ncsu.edu/faculty/rhee/Rebuttal-LSM-new.pdf
   

-- 
Stephen Hemminger <stephen.hemminger@vyatta.com>




* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-30 16:21     ` Stephen Hemminger
@ 2008-01-30 22:25       ` Bruce Allen
  2008-01-30 22:33         ` Stephen Hemminger
  2008-01-31  0:17         ` SANGTAE HA
  0 siblings, 2 replies; 19+ messages in thread
From: Bruce Allen @ 2008-01-30 22:25 UTC (permalink / raw)
  To: Linux Kernel Mailing List, netdev, Stephen Hemminger

Hi Stephen,

Thanks for your helpful reply and especially for the literature pointers.

>> Indeed, we are not asking to see 1000 Mb/s.  We'd be happy to see 900
>> Mb/s.
>>
>> Netperf is transmitting a large buffer in MTU-sized packets (min 1500
>> bytes).  Since the ACKs are only about 60 bytes in size, they should be
>> around 4% of the total traffic.  Hence we would not expect to see more
>> than 960 Mb/s.

> Don't forget the network overhead: http://sd.wareonearth.com/~phil/net/overhead/
> Max TCP Payload data rates over ethernet:
>  (1500-40)/(38+1500) = 94.9285 %  IPv4, minimal headers
>  (1500-52)/(38+1500) = 94.1482 %  IPv4, TCP timestamps

Yes.  If you look further down the page, you will see that with jumbo 
frames (which we have also tried) on Gb/s ethernet the maximum throughput 
is:

   (9000-20-20-12)/(9000+14+4+7+1+12)*1000000000/1000000 = 990.042 Mbps

We are very far from this number -- averaging perhaps 600 or 700 Mbps.
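For reference, that jumbo-frame figure works out as follows (header constants as on the cited page: 20+20+12 bytes of IPv4, TCP, and timestamp headers; 14+4+7+1+12 bytes of Ethernet framing):

```python
# TCP throughput ceiling over GigE with 9000-byte jumbo frames (sketch).
payload = 9000 - 20 - 20 - 12         # IPv4 + TCP + timestamp headers
on_wire = 9000 + 14 + 4 + 7 + 1 + 12  # Ethernet header/FCS/preamble/gap
ceiling_mbps = payload / on_wire * 1_000_000_000 / 1_000_000
print(f"{ceiling_mbps:.3f} Mbps")  # 990.042
```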

> I believe what you are seeing is an effect that occurs when using
> cubic on otherwise idle links. With two flows at high speed,
> the first flow consumes most of the router buffer and backs off gradually,
> while the second flow is not very aggressive.  It has been discussed
> back and forth between TCP researchers with no agreement: one side
> says that it is unfairness, and the other side says it is not a problem in
> the real world because of the presence of background traffic.

At least in principle, we should have NO congestion here.  We have ports 
on two different machines wired with a crossover cable.  Box A cannot 
transmit faster than 1 Gb/s.  Box B should be able to receive that data 
without dropping packets.  It's not doing anything else!

> See:
>  http://www.hamilton.ie/net/pfldnet2007_cubic_final.pdf
>  http://www.csc.ncsu.edu/faculty/rhee/Rebuttal-LSM-new.pdf

This is extremely helpful.  The typical oscillation (startup) period shown 
in the plots in these papers is of order 10 seconds, which is similar to 
the types of oscillation periods that we are seeing.

*However* we have also seen similar behavior with the Reno congestion 
control algorithm.  So this might not be due to cubic, or at least not 
entirely due to cubic.

In our application (cluster computing) we use a very tightly coupled 
high-speed low-latency network.  There is no 'wide area traffic'.  So it's 
hard for me to understand why any networking components or software layers 
should take more than milliseconds to ramp up or back off in speed. 
Perhaps we should be asking for a TCP congestion avoidance algorithm which 
is designed for a data center environment where there are very few hops 
and typical packet delivery times are tens or hundreds of microseconds. 
It's very different from delivering data thousands of km across a WAN.
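For anyone reproducing this, the congestion control algorithm in play can be inspected and switched at runtime; the commands below are illustrative (the set of available algorithms depends on the kernel build):

```shell
# List the compiled-in congestion control algorithms and the current default
sysctl net.ipv4.tcp_available_congestion_control
sysctl net.ipv4.tcp_congestion_control

# Switch between cubic and reno, as in the tests described above
sysctl -w net.ipv4.tcp_congestion_control=reno
```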

Cheers,
 	Bruce


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-30 22:25       ` Bruce Allen
@ 2008-01-30 22:33         ` Stephen Hemminger
  2008-01-30 23:23           ` Bruce Allen
  2008-01-31  0:17         ` SANGTAE HA
  1 sibling, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2008-01-30 22:33 UTC (permalink / raw)
  To: Bruce Allen; +Cc: Linux Kernel Mailing List, netdev

On Wed, 30 Jan 2008 16:25:12 -0600 (CST)
Bruce Allen <ballen@gravity.phys.uwm.edu> wrote:

> Hi Stephen,
> 
> Thanks for your helpful reply and especially for the literature pointers.
> 
> >> Indeed, we are not asking to see 1000 Mb/s.  We'd be happy to see 900
> >> Mb/s.
> >>
> >> Netperf is transmitting a large buffer in MTU-sized packets (min 1500
> >> bytes).  Since the ACKs are only about 60 bytes in size, they should be
> >> around 4% of the total traffic.  Hence we would not expect to see more
> >> than 960 Mb/s.
> 
> > Don't forget the network overhead: http://sd.wareonearth.com/~phil/net/overhead/
> > Max TCP Payload data rates over ethernet:
> >  (1500-40)/(38+1500) = 94.9285 %  IPv4, minimal headers
> >  (1500-52)/(38+1500) = 94.1482 %  IPv4, TCP timestamps
> 
> Yes.  If you look further down the page, you will see that with jumbo 
> frames (which we have also tried) on Gb/s ethernet the maximum throughput 
> is:
> 
>    (9000-20-20-12)/(9000+14+4+7+1+12)*1000000000/1000000 = 990.042 Mbps
> 
> We are very far from this number -- averaging perhaps 600 or 700 Mbps.
>


That is the upper bound of performance on a standard PCI bus (32-bit).
To go higher you need PCI-X or PCI-Express. Also make sure you are really
getting 64-bit PCI, because I have seen some e1000 PCI-X boards that
are only 32-bit.


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-30 22:33         ` Stephen Hemminger
@ 2008-01-30 23:23           ` Bruce Allen
  0 siblings, 0 replies; 19+ messages in thread
From: Bruce Allen @ 2008-01-30 23:23 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Linux Kernel Mailing List, netdev

Hi Stephen,

>>>> Indeed, we are not asking to see 1000 Mb/s.  We'd be happy to see 900
>>>> Mb/s.
>>>>
>>>> Netperf is transmitting a large buffer in MTU-sized packets (min 1500
>>>> bytes).  Since the ACKs are only about 60 bytes in size, they should be
>>>> around 4% of the total traffic.  Hence we would not expect to see more
>>>> than 960 Mb/s.
>>
>>> Don't forget the network overhead: http://sd.wareonearth.com/~phil/net/overhead/
>>> Max TCP Payload data rates over ethernet:
>>>  (1500-40)/(38+1500) = 94.9285 %  IPv4, minimal headers
>>>  (1500-52)/(38+1500) = 94.1482 %  IPv4, TCP timestamps
>>
>> Yes.  If you look further down the page, you will see that with jumbo
>> frames (which we have also tried) on Gb/s ethernet the maximum throughput
>> is:
>>
>>    (9000-20-20-12)/(9000+14+4+7+1+12)*1000000000/1000000 = 990.042 Mbps
>>
>> We are very far from this number -- averaging perhaps 600 or 700 Mbps.

> That is the upper bound of performance on a standard PCI bus (32-bit).
> To go higher you need PCI-X or PCI-Express. Also make sure you are really
> getting 64-bit PCI, because I have seen some e1000 PCI-X boards that
> are only 32-bit.

The motherboard NIC is in a PCI-e x1 slot.  This has a maximum speed of 
250 MB/s (2 Gb/s) in each direction, a factor of two more interface 
bandwidth than is needed.
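That headroom can be sanity-checked: a PCIe 1.x lane signals at 2.5 GT/s with 8b/10b encoding, leaving 2 Gb/s of data bandwidth per direction (a sketch; it ignores transaction-layer packet overhead, which eats a bit more):

```python
# Usable bandwidth of a PCIe 1.x x1 slot, per direction (sketch).
raw_gtps = 2.5                    # 2.5 GT/s line rate per lane
data_gbps = raw_gtps * 8 / 10     # 8b/10b encoding: 80% carries data
data_mbytes = data_gbps * 1000 / 8
print(data_gbps, "Gb/s =", data_mbytes, "MB/s per direction")
```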

Cheers,
 	Bruce


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-30 22:25       ` Bruce Allen
  2008-01-30 22:33         ` Stephen Hemminger
@ 2008-01-31  0:17         ` SANGTAE HA
  2008-01-31  8:52           ` Bruce Allen
  2008-01-31 11:45           ` Bill Fink
  1 sibling, 2 replies; 19+ messages in thread
From: SANGTAE HA @ 2008-01-31  0:17 UTC (permalink / raw)
  To: Bruce Allen; +Cc: Linux Kernel Mailing List, netdev, Stephen Hemminger

Hi Bruce,

On Jan 30, 2008 5:25 PM, Bruce Allen <ballen@gravity.phys.uwm.edu> wrote:
>
> In our application (cluster computing) we use a very tightly coupled
> high-speed low-latency network.  There is no 'wide area traffic'.  So it's
> hard for me to understand why any networking components or software layers
> should take more than milliseconds to ramp up or back off in speed.
> Perhaps we should be asking for a TCP congestion avoidance algorithm which
> is designed for a data center environment where there are very few hops
> and typical packet delivery times are tens or hundreds of microseconds.
> It's very different than delivering data thousands of km across a WAN.
>

If your network latency is low, almost any type of protocol should
give you more than 900 Mbps. I can guess the RTT of the two machines is
less than 4 ms in your case, and I remember the throughputs of all
high-speed protocols (including tcp-reno) were more than 900 Mbps with
a 4 ms RTT. So, my question is: which kernel version did you use with
your Broadcom NIC when you got more than 900 Mbps?

I have two machines connected by a gig switch, so I can see what
happens in my environment. Could you post the parameters you used
for your netperf testing? Also, if you set any other parameters for
your testing, please post them here so that I can check whether the
same happens to me as well.

Regards,
Sangtae


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-31  0:17         ` SANGTAE HA
@ 2008-01-31  8:52           ` Bruce Allen
  2008-01-31 11:45           ` Bill Fink
  1 sibling, 0 replies; 19+ messages in thread
From: Bruce Allen @ 2008-01-31  8:52 UTC (permalink / raw)
  To: SANGTAE HA; +Cc: Linux Kernel Mailing List, netdev, Stephen Hemminger

Hi Sangtae,

Thanks for joining this discussion -- it's good to have a CUBIC author and 
expert here!

>> In our application (cluster computing) we use a very tightly coupled 
>> high-speed low-latency network.  There is no 'wide area traffic'.  So 
>> it's hard for me to understand why any networking components or 
>> software layers should take more than milliseconds to ramp up or back 
>> off in speed. Perhaps we should be asking for a TCP congestion 
>> avoidance algorithm which is designed for a data center environment 
>> where there are very few hops and typical packet delivery times are 
>> tens or hundreds of microseconds. It's very different from delivering 
>> data thousands of km across a WAN.

> If your network latency is low, almost any type of protocol should 
> give you more than 900 Mbps.

Yes, this is also what I had thought.

In the graph that we posted, the two machines are connected by an ethernet 
crossover cable.  The total RTT of the two machines is probably AT MOST a 
couple of hundred microseconds.  Typically it takes 20 or 30 microseconds 
to get the first packet out the NIC.  Travel across the wire is a few 
nanoseconds.  Then getting the packet into the receiving NIC might be 
another 20 or 30 microseconds.  The ACK should fly back in about the same 
time.
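With those numbers the bandwidth-delay product is tiny, which is why a slow ramp-up is so surprising (a sketch using the couple-hundred-microsecond upper-bound RTT above):

```python
# Bandwidth-delay product for GigE at a ~200 microsecond RTT (sketch).
rate_bps = 1_000_000_000   # 1 Gb/s
rtt_s = 200e-6             # generous upper-bound RTT from above
bdp_bytes = rate_bps * rtt_s / 8
print(f"BDP: {bdp_bytes:.0f} bytes (~{bdp_bytes / 1024:.0f} KiB)")
```

So a window of only about 25 KB should keep the link full; a congestion window that size should build almost immediately on an uncongested crossover link.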

> I can guess the RTT of the two machines is less than 4 ms in your case, 
> and I remember the throughputs of all high-speed protocols (including 
> tcp-reno) were more than 900 Mbps with a 4 ms RTT. So, my question is: 
> which kernel version did you use with your Broadcom NIC when you got 
> more than 900 Mbps?

We are going to double-check this (we did the Broadcom testing about two 
months ago). Carsten is going to re-run the Broadcom experiments later 
today and will then post the results.

You can see results from some testing on crossover-cable wired systems 
with Broadcom NICs that I did about two years ago, here:
http://www.lsc-group.phys.uwm.edu/beowulf/nemo/design/SMC_8508T_Performance.html
You'll notice that total TCP throughput on the crossover cable was about 
220 MB/sec.  With TCP overhead this is very close to 2 Gb/s.

> I have two machines connected by a gig switch, so I can see what happens 
> in my environment. Could you post the parameters you used for your 
> netperf testing?

Carsten will post these in the next few hours.  If you want to simplify 
further, you can even take away the gig switch and just use a crossover 
cable.

> Also, if you set any other parameters for your testing, please post them 
> here so that I can check whether the same happens to me as well.

Carsten will post all the sysctl and ethtool parameters shortly.

Thanks again for chiming in. I am sure that with help from you, Jesse, and 
Rick, we can figure out what is going on here, and get it fixed.

Cheers,
 	Bruce


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-31  0:17         ` SANGTAE HA
  2008-01-31  8:52           ` Bruce Allen
@ 2008-01-31 11:45           ` Bill Fink
  2008-01-31 14:50             ` David Acker
                               ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Bill Fink @ 2008-01-31 11:45 UTC (permalink / raw)
  To: SANGTAE HA
  Cc: Bruce Allen, Linux Kernel Mailing List, netdev, Stephen Hemminger

On Wed, 30 Jan 2008, SANGTAE HA wrote:

> On Jan 30, 2008 5:25 PM, Bruce Allen <ballen@gravity.phys.uwm.edu> wrote:
> >
> > In our application (cluster computing) we use a very tightly coupled
> > high-speed low-latency network.  There is no 'wide area traffic'.  So it's
> > hard for me to understand why any networking components or software layers
> > should take more than milliseconds to ramp up or back off in speed.
> > Perhaps we should be asking for a TCP congestion avoidance algorithm which
> > is designed for a data center environment where there are very few hops
> > and typical packet delivery times are tens or hundreds of microseconds.
> > It's very different than delivering data thousands of km across a WAN.
> >
> 
> If your network latency is low, almost any type of protocol should
> give you more than 900 Mbps. I can guess the RTT of the two machines is
> less than 4 ms in your case, and I remember the throughputs of all
> high-speed protocols (including tcp-reno) were more than 900 Mbps with
> a 4 ms RTT. So, my question is: which kernel version did you use with
> your Broadcom NIC when you got more than 900 Mbps?
> 
> I have two machines connected by a gig switch, so I can see what
> happens in my environment. Could you post the parameters you used
> for your netperf testing? Also, if you set any other parameters for
> your testing, please post them here so that I can check whether the
> same happens to me as well.

I see similar results on my test systems, using Tyan Thunder K8WE (S2895)
motherboard with dual Intel Xeon 3.06 GHz CPUs and 1 GB memory, running
a 2.6.15.4 kernel.  The GigE NICs are Intel PRO/1000 82546EB_QUAD_COPPER,
on a 64-bit/133-MHz PCI-X bus, using version 6.1.16-k2 of the e1000
driver, and running with 9000-byte jumbo frames.  The TCP congestion
control is BIC.

Unidirectional TCP test:

[bill@chance4 ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79
tx:  1186.5649 MB /  10.05 sec =  990.2741 Mbps 11 %TX 9 %RX 0 retrans

and:

[bill@chance4 ~]$ nuttcp -f-beta -Irx -r -w2m 192.168.6.79
rx:  1186.8281 MB /  10.05 sec =  990.5634 Mbps 14 %TX 9 %RX 0 retrans

Each direction gets full GigE line rate.

Bidirectional TCP test:

[bill@chance4 ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.6.79
tx:   898.9934 MB /  10.05 sec =  750.1634 Mbps 10 %TX 8 %RX 0 retrans
rx:  1167.3750 MB /  10.06 sec =  973.8617 Mbps 14 %TX 11 %RX 0 retrans

While one direction gets close to line rate, the other only got 750 Mbps.
Note there were no TCP retransmitted segments for either data stream, so
that doesn't appear to be the cause of the slower transfer rate in one
direction.

If the receive direction uses a different GigE NIC that's part of the
same quad-GigE, all is fine:

[bill@chance4 ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.5.79
tx:  1186.5051 MB /  10.05 sec =  990.2250 Mbps 12 %TX 13 %RX 0 retrans
rx:  1186.7656 MB /  10.05 sec =  990.5204 Mbps 15 %TX 14 %RX 0 retrans

Here's a test using the same GigE NIC for both directions with 1-second
interval reports:

[bill@chance4 ~]$ nuttcp -f-beta -Itx -i1 -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -i1 -w2m 192.168.6.79
tx:    92.3750 MB /   1.01 sec =  767.2277 Mbps     0 retrans
rx:   104.5625 MB /   1.01 sec =  872.4757 Mbps     0 retrans
tx:    83.3125 MB /   1.00 sec =  700.1845 Mbps     0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5541 Mbps     0 retrans
tx:    83.8125 MB /   1.00 sec =  703.0322 Mbps     0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5502 Mbps     0 retrans
tx:    83.0000 MB /   1.00 sec =  696.1779 Mbps     0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5522 Mbps     0 retrans
tx:    83.7500 MB /   1.00 sec =  702.4989 Mbps     0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5512 Mbps     0 retrans
tx:    83.1250 MB /   1.00 sec =  697.2270 Mbps     0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5512 Mbps     0 retrans
tx:    84.1875 MB /   1.00 sec =  706.1665 Mbps     0 retrans
rx:   117.5625 MB /   1.00 sec =  985.5510 Mbps     0 retrans
tx:    83.0625 MB /   1.00 sec =  696.7167 Mbps     0 retrans
rx:   117.6875 MB /   1.00 sec =  987.5543 Mbps     0 retrans
tx:    84.1875 MB /   1.00 sec =  706.1545 Mbps     0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5472 Mbps     0 retrans
rx:   117.6875 MB /   1.00 sec =  987.0724 Mbps     0 retrans
tx:    83.3125 MB /   1.00 sec =  698.8137 Mbps     0 retrans

tx:   844.9375 MB /  10.07 sec =  703.7699 Mbps 11 %TX 6 %RX 0 retrans
rx:  1167.4414 MB /  10.05 sec =  973.9980 Mbps 14 %TX 11 %RX 0 retrans

In this test case, the receiver ramped up to nearly full GigE line rate,
while the transmitter was stuck at about 700 Mbps.  I ran one longer
60-second test and didn't see the oscillating behavior between receiver
and transmitter, but maybe that's because I have the GigE NIC interrupts
and nuttcp client/server applications both locked to CPU 0.
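For reference, that kind of pinning is typically done along these lines (illustrative commands; the IRQ number, interface, and target host are placeholders):

```shell
# Pin the NIC's interrupt to CPU 0 via the affinity bitmask
# (the IRQ number 123 is hypothetical; find the real one in /proc/interrupts)
echo 1 > /proc/irq/123/smp_affinity

# Run the benchmark client on CPU 0 as well
taskset -c 0 nuttcp -Itx -w2m 192.168.6.79
```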

So in my tests, once one direction gets the upper hand, it seems to
stay that way.  Could this be because the slower side is so busy
processing the transmits of the faster side that it just doesn't
get to do its fair share of transmits (although it doesn't seem to
be a bus or CPU issue)?  Hopefully those more knowledgeable about
the Linux TCP/IP stack and network drivers might have some more
concrete ideas.

						-Bill


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-31 11:45           ` Bill Fink
@ 2008-01-31 14:50             ` David Acker
  2008-01-31 15:57               ` Bruce Allen
  2008-01-31 15:54             ` Bruce Allen
  2008-01-31 18:26             ` Brandeburg, Jesse
  2 siblings, 1 reply; 19+ messages in thread
From: David Acker @ 2008-01-31 14:50 UTC (permalink / raw)
  To: Bill Fink
  Cc: SANGTAE HA, Bruce Allen, Linux Kernel Mailing List, netdev,
	Stephen Hemminger

Bill Fink wrote:
> If the receive direction uses a different GigE NIC that's part of the
> same quad-GigE, all is fine:
> 
> [bill@chance4 ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.5.79
> tx:  1186.5051 MB /  10.05 sec =  990.2250 Mbps 12 %TX 13 %RX 0 retrans
> rx:  1186.7656 MB /  10.05 sec =  990.5204 Mbps 15 %TX 14 %RX 0 retrans
Could this be an issue with pause frames?  At a previous job I remember 
having issues with a similar configuration using two Broadcom SB1250 
three-port GigE devices. If I ran bidirectional tests on a single pair of 
ports connected via crossover, it was slower than when I gave each 
direction its own pair of ports.  The problem turned out to be that 
pause-frame generation and handling was not configured correctly.
-Ack


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-31 11:45           ` Bill Fink
  2008-01-31 14:50             ` David Acker
@ 2008-01-31 15:54             ` Bruce Allen
  2008-01-31 17:36               ` Bill Fink
  2008-01-31 18:26             ` Brandeburg, Jesse
  2 siblings, 1 reply; 19+ messages in thread
From: Bruce Allen @ 2008-01-31 15:54 UTC (permalink / raw)
  To: Bill Fink
  Cc: SANGTAE HA, Linux Kernel Mailing List, netdev, Stephen Hemminger

Hi Bill,

> I see similar results on my test systems

Thanks for this report and for confirming our observations.  Could you 
please confirm that a single-port bidirectional UDP link runs at wire 
speed?  This helps to localize the problem to the TCP stack or the 
interaction of the TCP stack with the e1000 driver and hardware.

Cheers,
 	Bruce


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-31 14:50             ` David Acker
@ 2008-01-31 15:57               ` Bruce Allen
  0 siblings, 0 replies; 19+ messages in thread
From: Bruce Allen @ 2008-01-31 15:57 UTC (permalink / raw)
  To: David Acker
  Cc: Bill Fink, SANGTAE HA, Linux Kernel Mailing List, netdev,
	Stephen Hemminger

Hi David,

> Could this be an issue with pause frames?  At a previous job I remember 
> having issues with a similar configuration using two Broadcom SB1250 
> three-port GigE devices. If I ran bidirectional tests on a single pair of 
> ports connected via crossover, it was slower than when I gave each 
> direction its own pair of ports.  The problem turned out to be that 
> pause-frame generation and handling was not configured correctly.

We had PAUSE frames turned off for our testing.  The idea is to let TCP 
do the flow and congestion control.

The problem with PAUSE+TCP is that it can cause head-of-line blocking, 
where a single oversubscribed output port on a switch can PAUSE a large 
number of flows on other paths.
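For anyone checking their own setup, pause-frame behavior is controlled per interface with ethtool (illustrative; eth0 is a placeholder):

```shell
# Show the current pause-frame (flow control) settings
ethtool -a eth0

# Turn flow control off entirely, leaving congestion control to TCP
ethtool -A eth0 autoneg off rx off tx off
```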

Cheers,
 	Bruce


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-31 15:54             ` Bruce Allen
@ 2008-01-31 17:36               ` Bill Fink
  2008-01-31 19:37                 ` Bruce Allen
  0 siblings, 1 reply; 19+ messages in thread
From: Bill Fink @ 2008-01-31 17:36 UTC (permalink / raw)
  To: Bruce Allen
  Cc: SANGTAE HA, Linux Kernel Mailing List, netdev, Stephen Hemminger

Hi Bruce,

On Thu, 31 Jan 2008, Bruce Allen wrote:

> > I see similar results on my test systems
> 
> Thanks for this report and for confirming our observations.  Could you 
> please confirm that a single-port bidirectional UDP link runs at wire 
> speed?  This helps to localize the problem to the TCP stack or the 
> interaction of the TCP stack with the e1000 driver and hardware.

Yes, a single-port bidirectional UDP test gets full GigE line rate
in both directions with no packet loss.

[bill@chance4 ~]$ nuttcp -f-beta -Itx -u -Ru -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -u -Ru -w2m 192.168.6.79
tx:  1187.0078 MB /  10.04 sec =  992.0550 Mbps 19 %TX 7 %RX 0 / 151937 drop/pkt 0.00 %loss
rx:  1187.1016 MB /  10.03 sec =  992.3408 Mbps 19 %TX 7 %RX 0 / 151949 drop/pkt 0.00 %loss

						-Bill


* RE: e1000 full-duplex TCP performance well below wire speed
  2008-01-31 11:45           ` Bill Fink
  2008-01-31 14:50             ` David Acker
  2008-01-31 15:54             ` Bruce Allen
@ 2008-01-31 18:26             ` Brandeburg, Jesse
  2 siblings, 0 replies; 19+ messages in thread
From: Brandeburg, Jesse @ 2008-01-31 18:26 UTC (permalink / raw)
  To: Bill Fink, SANGTAE HA
  Cc: Bruce Allen, Linux Kernel Mailing List, netdev, Stephen Hemminger

Bill Fink wrote:
> a 2.6.15.4 kernel.  The GigE NICs are Intel PRO/1000
> 82546EB_QUAD_COPPER, 
> on a 64-bit/133-MHz PCI-X bus, using version 6.1.16-k2 of the e1000
> driver, and running with 9000-byte jumbo frames.  The TCP congestion
> control is BIC.

Bill, FYI, there was a known issue with e1000 (fixed in 7.0.38-k2) and
socket charge due to truesize that kept one end or the other from
opening its window.  The result is not so great performance, and you
must upgrade the driver at both ends to fix it.

It was fixed in commit
9e2feace1acd38d7a3b1275f7f9f8a397d09040e

That commit itself needed a couple of follow on bug fixes, but the point
is that you could download 7.3.20 from sourceforge (which would compile
on your kernel) and compare the performance with it if you were
interested in a further experiment.

Jesse


* Re: e1000 full-duplex TCP performance well below wire speed
  2008-01-31 17:36               ` Bill Fink
@ 2008-01-31 19:37                 ` Bruce Allen
  0 siblings, 0 replies; 19+ messages in thread
From: Bruce Allen @ 2008-01-31 19:37 UTC (permalink / raw)
  To: Bill Fink
  Cc: SANGTAE HA, Linux Kernel Mailing List, netdev, Stephen Hemminger

Hi Bill,

>>> I see similar results on my test systems
>>
>> Thanks for this report and for confirming our observations.  Could you
>> please confirm that a single-port bidirectional UDP link runs at wire
>> speed?  This helps to localize the problem to the TCP stack or the
>> interaction of the TCP stack with the e1000 driver and hardware.
>
> Yes, a single-port bidirectional UDP test gets full GigE line rate
> in both directions with no packet loss.

Thanks for confirming this.  And thanks also for nuttcp!  I just 
recognized you as the author.

Cheers,
 	Bruce


end of thread  [newest: ~2008-01-31 19:37 UTC]

Thread overview: 19+ messages
2008-01-30  9:51 e1000 full-duplex TCP performance well below wire speed Bruce Allen
2008-01-30 13:18 ` Andi Kleen
2008-01-30 13:38   ` Bruce Allen
2008-01-30 14:08     ` David Miller
2008-01-30 13:53 ` David Miller
2008-01-30 14:01   ` Bruce Allen
2008-01-30 16:21     ` Stephen Hemminger
2008-01-30 22:25       ` Bruce Allen
2008-01-30 22:33         ` Stephen Hemminger
2008-01-30 23:23           ` Bruce Allen
2008-01-31  0:17         ` SANGTAE HA
2008-01-31  8:52           ` Bruce Allen
2008-01-31 11:45           ` Bill Fink
2008-01-31 14:50             ` David Acker
2008-01-31 15:57               ` Bruce Allen
2008-01-31 15:54             ` Bruce Allen
2008-01-31 17:36               ` Bill Fink
2008-01-31 19:37                 ` Bruce Allen
2008-01-31 18:26             ` Brandeburg, Jesse
