LKML Archive on lore.kernel.org
* Fire Engine??
@ 2003-11-26  0:15 Mr. BOFH
  2003-11-26  1:48 ` [OT] " Nick Piggin
                   ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Mr. BOFH @ 2003-11-26  0:15 UTC (permalink / raw)
  To: linux-kernel


Sun has announced that they have redone their TCP/IP stack and are showing,
for some instances, a 30% improvement over Linux....

http://www.theregister.co.uk/content/61/33440.html



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [OT] Re: Fire Engine??
  2003-11-26  0:15 Fire Engine?? Mr. BOFH
@ 2003-11-26  1:48 ` Nick Piggin
  2003-11-26  2:11   ` Larry McVoy
  2003-11-26  2:30 ` David S. Miller
  2003-11-26  5:41 ` Valdis.Kletnieks
  2 siblings, 1 reply; 39+ messages in thread
From: Nick Piggin @ 2003-11-26  1:48 UTC (permalink / raw)
  To: Mr. BOFH; +Cc: linux-kernel



Mr. BOFH wrote:

>Sun has announced that they have redone their TCP/IP stack and are showing,
>for some instances, a 30% improvement over Linux....
>
>http://www.theregister.co.uk/content/61/33440.html
>
>

That's odd. Since when did Linux's TCP/IP stack become the benchmark? :)

PS. This isn't really appropriate for this list. I'm sure an open and
    verifiable comparison would be welcomed though.




* Re: [OT] Re: Fire Engine??
  2003-11-26  1:48 ` [OT] " Nick Piggin
@ 2003-11-26  2:11   ` Larry McVoy
  2003-11-26  2:48     ` David S. Miller
  2003-11-26  3:31     ` Rik van Riel
  0 siblings, 2 replies; 39+ messages in thread
From: Larry McVoy @ 2003-11-26  2:11 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Mr. BOFH, linux-kernel

On Wed, Nov 26, 2003 at 12:48:19PM +1100, Nick Piggin wrote:
> 
> 
> Mr. BOFH wrote:
> 
> >Sun has announced that they have redone their TCP/IP stack and are showing,
> >for some instances, a 30% improvement over Linux....
> >
> >http://www.theregister.co.uk/content/61/33440.html
> >
> >
> 
> That's odd. Since when did Linux's TCP/IP stack become the benchmark? :)
> 
> PS. This isn't really appropriate for this list. I'm sure an open and
>    verifiable comparison would be welcomed though.

And not to dis my Alma Mater, but I tend to think the whole TOE idea is a lose.
I used to think otherwise, while I was a Sun employee, and Sun employee #1
pointed out to me that CPUs and memory were getting faster more quickly than
the TOE type answers could come to market.  He was right then and he seems
to still be right.

Maybe throwing processors at the problem will make him (and me now) wrong
but I have to think I could do better things with a CPU than offload some
TCP packets.

Linux has it right.  Make the normal case fast and lightweight and ignore
the other cases.  There are no other cases if the normal path is fast.

Another way to say "fast path" is "our normal path sucks".
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm


* Re: Fire Engine??
  2003-11-26  0:15 Fire Engine?? Mr. BOFH
  2003-11-26  1:48 ` [OT] " Nick Piggin
@ 2003-11-26  2:30 ` David S. Miller
  2003-11-26  5:41 ` Valdis.Kletnieks
  2 siblings, 0 replies; 39+ messages in thread
From: David S. Miller @ 2003-11-26  2:30 UTC (permalink / raw)
  To: Mr. BOFH; +Cc: linux-kernel

On Tue, 25 Nov 2003 16:15:12 -0800
"Mr. BOFH" <icerbofh@hotmail.com> wrote:

> http://www.theregister.co.uk/content/61/33440.html

This was amusing to read; let's read the claim carefully,
shall we?

	"We worked hard on efficiency, and we now measure,
	 at a given network workload on identical x86 hardware,
	 we use 30 percent less CPU than Linux."

So his claim is that, in their measurements, "CPU utilization"
was lower in their stack.  Was he using 2.6.x and TSO-capable
cards on the Linux side?  If not, it's not apples to apples
against our current upcoming technology.

And while his CPU utilization claim is interesting (I bet that gain
would go to zero if they'd used Linux TSO in 2.6.x), was the
networking bandwidth and latency any better as a result?  I think it's
not by accident that the claim was phrased the way it was.

In fact, I bet their connection setup/teardown latency will go in the
toilet with this stuff and Solaris was already horrible in this area.
It is a well established fact that TOE technologies have this problem
because of how the socket setup/teardown operation with TOE cards
requires the OS to go over the bus a few times.

I'm not worried at all about Sun's fire engine.  It's preliminary
technology, and they are going to discover all of the problems TOE
stuff has that I've discussed several times on this list.

They even mention that they don't support any current-generation
shipping TOE cards yet; at least I offer a CPU utilization reduction
optimization (TSO in 2.6.x) with multiple implementations on current-generation
hardware (e1000, tg3, etc.).

I fully welcome them to put Linux up against their incredible fire
engine crap in a sanctioned specweb run on identical hardware.  :)


* Re: [OT] Re: Fire Engine??
  2003-11-26  2:11   ` Larry McVoy
@ 2003-11-26  2:48     ` David S. Miller
  2003-11-26  3:31     ` Rik van Riel
  1 sibling, 0 replies; 39+ messages in thread
From: David S. Miller @ 2003-11-26  2:48 UTC (permalink / raw)
  To: Larry McVoy; +Cc: piggin, icerbofh, linux-kernel

On Tue, 25 Nov 2003 18:11:11 -0800
Larry McVoy <lm@bitmover.com> wrote:

> I used to think otherwise, while I was a Sun employee, and Sun employee #1
> pointed out to me that CPUs and memory were getting faster more quickly than
> the TOE type answers could come to market.  He was right then and he seems
> to still be right.

Maybe this was at least partially the impetus behind his recent
departure from the company.  And if not the impetus, a possible straw
that broke the camel's back.

How fast will cpus be when Sun actually deploys this stuff?

A commodity x86 U1 box at that time will probably have 6+ GHz
cpus in it, and super-duper-DDR or whatever the current memory
technology will be.  Why do I need Sun's TOE crap in this box?
Where's all that precious CPU I need to be saving?

This stuff isn't really useful for huge database servers either.

What do they plan to do, put Solaris 10 on iSCSI drives?  ROFL! :)

These days Sun is already several laps behind before the green flag
even comes out to start the race.


* Re: [OT] Re: Fire Engine??
  2003-11-26  2:11   ` Larry McVoy
  2003-11-26  2:48     ` David S. Miller
@ 2003-11-26  3:31     ` Rik van Riel
  1 sibling, 0 replies; 39+ messages in thread
From: Rik van Riel @ 2003-11-26  3:31 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Nick Piggin, Mr. BOFH, linux-kernel

On Tue, 25 Nov 2003, Larry McVoy wrote:

> And not to dis my Alma Mater, but I tend to think the whole TOE idea is a
> lose. I used to think otherwise, while I was a Sun employee, and Sun
> employee #1 pointed out to me that CPUs and memory were getting faster
> more quickly than the TOE type answers could come to market.  He was
> right then and he seems to still be right.

I guess TCP offloading is a good way to stub your TOE ;)

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



* Re: Fire Engine?? 
  2003-11-26  0:15 Fire Engine?? Mr. BOFH
  2003-11-26  1:48 ` [OT] " Nick Piggin
  2003-11-26  2:30 ` David S. Miller
@ 2003-11-26  5:41 ` Valdis.Kletnieks
  2 siblings, 0 replies; 39+ messages in thread
From: Valdis.Kletnieks @ 2003-11-26  5:41 UTC (permalink / raw)
  To: Mr. BOFH; +Cc: linux-kernel


On Tue, 25 Nov 2003 16:15:12 PST, "Mr. BOFH" <icerbofh@hotmail.com>  said:
> 
> Sun has announced that they have redone their TCP/IP stack and are showing,
> for some instances, a 30% improvement over Linux....
> 
> http://www.theregister.co.uk/content/61/33440.html

Hmm.. IBM tried this same idea with their 8232 Ethernet controller
(basically, an 'industrial' PC with a 3Com card and a bus&tag card)
to offload some TCP/IP functionality, back in 1988 or so.




* Re: Fire Engine??
  2003-11-26 22:58         ` Andi Kleen
@ 2003-11-27 12:16           ` Ingo Oeser
  0 siblings, 0 replies; 39+ messages in thread
From: Ingo Oeser @ 2003-11-27 12:16 UTC (permalink / raw)
  To: Andi Kleen, arjanv; +Cc: davem, linux-kernel


On Wednesday 26 November 2003 23:58, Andi Kleen wrote:
> On Wed, 26 Nov 2003 22:34:10 +0100
> Arjan van de Ven <arjanv@redhat.com> wrote:
> > question: do we need a timestamp for every packet or can we do one
> > timestamp per irq-context entry ? (eg one timestamp at irq entry time we
> > do anyway and keep that for all packets processed in the softirq)
>
> If people want the timestamp, they usually want it to be accurate
> (e.g. for tcpdump etc.). Of course there is already a lot of jitter
> in this information, because it is done relatively late in the device
> driver (long after the NIC has received the packet).
>
> Just most people never care about this at all....

Yes, these people who don't care just open a SOCK_STREAM or SOCK_DGRAM. I
don't see any field in msghdr that contains the time.

Other people have packet sockets (or other special stuff) open, which
are usually bound to a device or to a special RX/TX path. So we know
which device needs it and which doesn't.

If in doubt, there could be a sysctl option for exact time per device
or for all.

But I'm not really that familiar with the networking code, so please
ignore my ignorance on any issues here.


Regards

Ingo Oeser




* Re: Fire Engine??
  2003-11-26 19:19         ` Diego Calleja García
  2003-11-26 19:59           ` Mike Fedyk
@ 2003-11-27  3:54           ` Bill Huey
  1 sibling, 0 replies; 39+ messages in thread
From: Bill Huey @ 2003-11-27  3:54 UTC (permalink / raw)
  To: Diego Calleja García
  Cc: Mike Fedyk, john, ak, davem, linux-kernel, Bill Huey (hui)

On Wed, Nov 26, 2003 at 08:19:03PM +0100, Diego Calleja García wrote:
> It works here. I don't know if those numbers represent anything for networking.
> Some of the benchmarks look more like "vm benchmarking". And are the ones which
> are measuring latency valid, considering that the BSDs lack "preempt"?
> (shooting in the dark)

FreeBSD-current is fully preemptive. The preempt patch, which adds
preemption points, is meaningless in that context.

bill



* Re: Fire Engine??
  2003-11-26 21:38             ` David S. Miller
@ 2003-11-26 23:43               ` Jamie Lokier
  0 siblings, 0 replies; 39+ messages in thread
From: Jamie Lokier @ 2003-11-26 23:43 UTC (permalink / raw)
  To: David S. Miller; +Cc: tytso, ak, linux-kernel

David S. Miller wrote:
> > recvmsg() doesn't return timestamps until they are requested
> > using setsockopt(...SO_TIMESTAMP...).
> > 
> > See sock_recv_timestamp() in include/net/sock.h.
> 
> See MSG_ERRQUEUE and net/ipv4/ip_sockglue.c

I don't see your point.  The test for the SO_TIMESTAMP socket option
is _inside_ sock_recv_timestamp() (the flag is called sk_rcvtstamp).

The MSG_ERRQUEUE code simply calls sock_recv_timestamp(), which in
turn only reports the timestamp if the flag is set.

There are exactly two places where the timestamp is reported to
userspace, and both are at the request of userspace:

	1. sock_recv_timestamp(), called from many places including
	   ip_sockglue.c.  It _only_ reports it if SO_TIMESTAMP is
	   enabled for the socket.

	2. inet_ioctl(SIOCGSTAMP)

Nowhere else is the timestamp reported to userspace.

-- Jamie



* Re: Fire Engine??
  2003-11-26 23:13                 ` David S. Miller
  2003-11-26 23:29                   ` Andi Kleen
@ 2003-11-26 23:41                   ` Ben Greear
  1 sibling, 0 replies; 39+ messages in thread
From: Ben Greear @ 2003-11-26 23:41 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, linux-kernel

David S. Miller wrote:
> On Wed, 26 Nov 2003 23:56:41 +0100
> Andi Kleen <ak@suse.de> wrote:
> 
> 
>>On Wed, 26 Nov 2003 14:36:20 -0800
>>"David S. Miller" <davem@redhat.com> wrote:
>>
>>
>>>I don't think this is acceptable.  It's important that all
>>>of the timestamps are as accurate as they were before.
>>
>>I disagree on that. The window is small and slowing down 99.99999% of all 
>>users who never care about this for this extremely obscure
>>misdesigned API does not make much sense to me.
> 
> 
> We can't change behavior like this.  Every time we've tried to
> do it, we've been burnt.  Remember nonlocal-bind?

I'll try to write up a patch that uses the TSC and lazy conversion
to timeval as soon as I get the rx-all and rx-fcs code happily
into the kernel....

Assuming TSC is very fast and the conversion is accurate enough, I think
this can give good results....

Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com




* Re: Fire Engine??
  2003-11-26 23:23         ` Trond Myklebust
@ 2003-11-26 23:38           ` Andi Kleen
  0 siblings, 0 replies; 39+ messages in thread
From: Andi Kleen @ 2003-11-26 23:38 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andi Kleen, davem, linux-kernel

> There are still a few inefficiencies with this approach, though. Most
> notable is the fact that you need to call kmap_atomic() several times
> per page since the socket lower layers will usually be feeding you 1
> skb at a time. I thought you might be referring to those (and that you
> might have a good solution to propose ;-))

For kmap_atomic? Run an x86-64 box ;-)

In general doing things with more than one packet at a time would
be probably a good idea, but I don't have any deep thoughts on how
to implement this for TCP RX.

-Andi


* Re: Fire Engine??
  2003-11-26 23:13                 ` David S. Miller
@ 2003-11-26 23:29                   ` Andi Kleen
  2003-11-26 23:41                   ` Ben Greear
  1 sibling, 0 replies; 39+ messages in thread
From: Andi Kleen @ 2003-11-26 23:29 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

On Wed, 26 Nov 2003 15:13:52 -0800
"David S. Miller" <davem@redhat.com> wrote:

> On Wed, 26 Nov 2003 23:56:41 +0100
> Andi Kleen <ak@suse.de> wrote:
> 
> > On Wed, 26 Nov 2003 14:36:20 -0800
> > "David S. Miller" <davem@redhat.com> wrote:
> > 
> > > I don't think this is acceptable.  It's important that all
> > > of the timestamps are as accurate as they were before.
> > 
> > I disagree on that. The window is small and slowing down 99.99999% of all 
> > users who never care about this for this extremely obscure
> > misdesigned API does not make much sense to me.
> 
> We can't change behavior like this.  Every time we've tried to
> do it, we've been burnt.  Remember nonlocal-bind?

The behaviour is not really changed; just the precision of the timestamp
is temporarily (by a few tens of ms on a busy network) worse.

And the jitter in this timestamp is already higher than this when
you consider queueing delays and interrupt mitigation in the driver.

-Andi



* Re: Fire Engine??
  2003-11-26 23:01       ` Andi Kleen
@ 2003-11-26 23:23         ` Trond Myklebust
  2003-11-26 23:38           ` Andi Kleen
  0 siblings, 1 reply; 39+ messages in thread
From: Trond Myklebust @ 2003-11-26 23:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: davem, linux-kernel

>>>>> " " == Andi Kleen <ak@suse.de> writes:

     > Current sunrpc does two recvmsgs for each record to first get
     > the record length and then the payload.

     > This means you take all the locks and other overhead twice per
     > packet.

     > Having a special function that peeks directly at the TCP
     > receive queue would be much faster (and falls back to normal
     > recvmsg when there is no data waiting)

Oh, right... That would be the server code you are thinking of, then.

The client already does something like this. I've added a function
tcp_read_sock() that is called directly from tcp_data_ready() and
hence fills the page cache directly from within the softirq.

There are still a few inefficiencies with this approach, though. Most
notable is the fact that you need to call kmap_atomic() several times
per page since the socket lower layers will usually be feeding you 1
skb at a time. I thought you might be referring to those (and that you
might have a good solution to propose ;-))

Cheers,
  Trond


* Re: Fire Engine??
  2003-11-26 22:56               ` Andi Kleen
@ 2003-11-26 23:13                 ` David S. Miller
  2003-11-26 23:29                   ` Andi Kleen
  2003-11-26 23:41                   ` Ben Greear
  0 siblings, 2 replies; 39+ messages in thread
From: David S. Miller @ 2003-11-26 23:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Wed, 26 Nov 2003 23:56:41 +0100
Andi Kleen <ak@suse.de> wrote:

> On Wed, 26 Nov 2003 14:36:20 -0800
> "David S. Miller" <davem@redhat.com> wrote:
> 
> > I don't think this is acceptable.  It's important that all
> > of the timestamps are as accurate as they were before.
> 
> I disagree on that. The window is small and slowing down 99.99999% of all 
> users who never care about this for this extremely obscure
> misdesigned API does not make much sense to me.

We can't change behavior like this.  Every time we've tried to
do it, we've been burnt.  Remember nonlocal-bind?


* Re: Fire Engine??
  2003-11-26 15:00     ` Trond Myklebust
@ 2003-11-26 23:01       ` Andi Kleen
  2003-11-26 23:23         ` Trond Myklebust
  0 siblings, 1 reply; 39+ messages in thread
From: Andi Kleen @ 2003-11-26 23:01 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: davem, linux-kernel

On 26 Nov 2003 10:00:09 -0500
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> >>>>> " " == Andi Kleen <ak@suse.de> writes:
> 
>      > - If they tested TCP-over-NFS then I'm pretty sure Linux lost
>                         ^^^^^^^^^^^^ That would be inefficient 8-)

grin. 

>      > badly because the current paths for that are just awfully
>      > inefficient.
> 
> ...mind elaborating?

Current sunrpc does two recvmsgs for each record to first get the record length 
and then the payload.

This means you take all the locks and other overhead twice per packet. 

Having a special function that peeks directly at the TCP receive
queue would be much faster (and falls back to normal recvmsg when
there is no data waiting) 

But that's the really obvious case. I think if you got out a profiler
and optimized carefully you could likely make this path much more
efficient. Same for sunrpc TX, probably, although that seems to be
in better shape already.

-Andi 


* Re: Fire Engine??
  2003-11-26 21:34       ` Arjan van de Ven
@ 2003-11-26 22:58         ` Andi Kleen
  2003-11-27 12:16           ` Ingo Oeser
  0 siblings, 1 reply; 39+ messages in thread
From: Andi Kleen @ 2003-11-26 22:58 UTC (permalink / raw)
  To: arjanv; +Cc: davem, linux-kernel

On Wed, 26 Nov 2003 22:34:10 +0100
Arjan van de Ven <arjanv@redhat.com> wrote:

> On Wed, 2003-11-26 at 20:30, David S. Miller wrote:
> 
> > > - Doing gettimeofday on each incoming packet is just dumb, especially
> > > when you have gettimeofday backed with a slow southbridge timer.
> > > This shows quite badly on many profile logs.
> > > I still think the right solution for that would be to only take time stamps
> > > when there is any user for it (= no timestamps in 99% of all systems) 
> > 
> > Andi, I know this is a problem, but for the millionth time your idea
> > does not work because we don't know if the user asked for the timestamp
> > until we are deep within the recvmsg() processing, which is long after
> > the packet has arrived.
> 
> question: do we need a timestamp for every packet or can we do one
> timestamp per irq-context entry ? (eg one timestamp at irq entry time we
> do anyway and keep that for all packets processed in the softirq)

If people want the timestamp, they usually want it to be accurate
(e.g. for tcpdump etc.). Of course there is already a lot of jitter
in this information, because it is done relatively late in the device
driver (long after the NIC has received the packet).

Just most people never care about this at all.... 

-Andi


* Re: Fire Engine??
  2003-11-26 22:36             ` David S. Miller
@ 2003-11-26 22:56               ` Andi Kleen
  2003-11-26 23:13                 ` David S. Miller
  0 siblings, 1 reply; 39+ messages in thread
From: Andi Kleen @ 2003-11-26 22:56 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

On Wed, 26 Nov 2003 14:36:20 -0800
"David S. Miller" <davem@redhat.com> wrote:

> On Wed, 26 Nov 2003 23:29:09 +0100
> Andi Kleen <ak@suse.de> wrote:
> 
> > The first SIOCGTSTAMP would be inaccurate, but the following (after 
> > all untimestamped packets have been flushed) would be ok.
> 
> I don't think this is acceptable.  It's important that all
> of the timestamps are as accurate as they were before.

I disagree on that. The window is small and slowing down 99.99999% of all 
users who never care about this for this extremely obscure misdesigned API does 
not make much sense to me.

Also if you worry about these you could add an optional sysctl
to always take it, so if anybody really has an application that relies
on the first time stamp being accurate and they cannot use SO_TIMESTAMP
they could set the sysctl.

-Andi


* Re: Fire Engine??
  2003-11-26 22:39       ` Andi Kleen
@ 2003-11-26 22:46         ` David S. Miller
  0 siblings, 0 replies; 39+ messages in thread
From: David S. Miller @ 2003-11-26 22:46 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Wed, 26 Nov 2003 23:39:18 +0100
Andi Kleen <ak@suse.de> wrote:

> You only need to do a fast path for the default scheduler at the beginning.

In the end we're going to have a design and we're going to do it
right, if we decide to do this.

Sun needs fast paths, not us.

> Especially for prefetching having a list of packets helps because you
> can prefetch the next while you're working on the current one. The CPU
> hardware prefetcher cannot do that for you.

The initial prefetches are consumed by the copy implementation
setup instructions.  By the time the real loads execute, the
data is there or not very far away.

This I have measured on UltraSPARC, I suspect other cpus can
match that if not do better.

> I did look seriously at faster csum-copy/copy-to-user for K8, but the conclusion
> was that all the tricks are only worth it when you can work with bigger amounts of data.
> 1.5K at a time is just too small.

Not true: once you have ~300 or so bytes you have enough inertia
to get a good stream going in the main loop. Really, look at the
ultrasparc-III stuff I wrote for the heuristics.

You really should write the k8 code before coming to conclusions
about what it would or would not be capable of doing :)


* Re: Fire Engine??
  2003-11-26 19:30     ` David S. Miller
                         ` (3 preceding siblings ...)
  2003-11-26 21:34       ` Arjan van de Ven
@ 2003-11-26 22:39       ` Andi Kleen
  2003-11-26 22:46         ` David S. Miller
  4 siblings, 1 reply; 39+ messages in thread
From: Andi Kleen @ 2003-11-26 22:39 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

On Wed, 26 Nov 2003 11:30:40 -0800
"David S. Miller" <davem@redhat.com> wrote:

>
> > - On TX we are inefficient for the same reason. TCP builds one packet
> > at a time and then goes down through all layers taking all locks (queue,
> > device driver etc.) and submits the single packet. Then repeats that for 
> > lots of packets because many TCP writes are > MTU. Batching that would 
> > likely help a lot, like it was done in the 2.6 VFS. I think it could 
> > also make hard_start_xmit in many drivers significantly faster.
> 
> This is tricky, because of getting all of the queueing stuff right.
> All of the packet scheduler APIs would need to change, as would
> the classification stuff, not to mention netfilter et al.

You only need to do a fast path for the default scheduler at the beginning.
Every complicated "slow" API like advanced queueing or netfilter can still fall back to
one packet at a time until cleaned up (a similar strategy as was done with the
non-linear skbs).
 
> You're talking about basically redoing the whole TX path if you
> want to really support this.
> 
> I'm not saying "don't do this", just that we should be sure we know
> what we're getting if we invest the time into this.

In some profiling I did some time ago, queue locks and device driver
locks were the biggest offenders on TX after copy.

The only tricky part is to get the state machine in tcp_do_sendmsg()
right that decides when to flush.

> > - user copy and checksum could probably also be done faster if they were
> > batched for multiple packets. It is hard to optimize properly for 
> > <= 1.5K copies.
> > This is especially true for 4/4 split kernels which will eat an 
> > page table look up + lock for each individual copy, but also for others.
> 
> I disagree partially, especially in the presence of a chip that provides
> proper implementations of software initiated prefetching.

Especially for prefetching, having a list of packets helps because you
can prefetch the next while you're working on the current one. The CPU
hardware prefetcher cannot do that for you.

I did look seriously at faster csum-copy/copy-to-user for K8, but the conclusion
was that all the tricks are only worth it when you can work with bigger amounts of data.
1.5K at a time is just too small.

Ah yes:

- Investigate more performance through explicit prefetching 
(e.g. in the device drivers, to optimize eth_type_trans() when you can classify the packet
just by looking at the RX ring state: instead, do a prefetch on the packet data
and hope the data is already in cache when the IP stack gets around to looking at it)

could also be added to the list

-Andi (who shuts up now because I don't have any time to code on any of this :-( ) 


* Re: Fire Engine??
  2003-11-26 22:29           ` Andi Kleen
@ 2003-11-26 22:36             ` David S. Miller
  2003-11-26 22:56               ` Andi Kleen
  0 siblings, 1 reply; 39+ messages in thread
From: David S. Miller @ 2003-11-26 22:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Wed, 26 Nov 2003 23:29:09 +0100
Andi Kleen <ak@suse.de> wrote:

> The first SIOCGTSTAMP would be inaccurate, but the following (after 
> all untimestamped packets have been flushed) would be ok.

I don't think this is acceptable.  It's important that all
of the timestamps are as accurate as they were before.


* Re: Fire Engine??
  2003-11-26 20:03         ` David S. Miller
@ 2003-11-26 22:29           ` Andi Kleen
  2003-11-26 22:36             ` David S. Miller
  0 siblings, 1 reply; 39+ messages in thread
From: Andi Kleen @ 2003-11-26 22:29 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

On Wed, 26 Nov 2003 12:03:16 -0800
"David S. Miller" <davem@redhat.com> wrote:

> On Wed, 26 Nov 2003 11:58:44 -0800
> Paul Menage <menage@google.com> wrote:
> 
> > How about tracking the number of current sockets that have had timestamp 
> > requests for them? If this number is zero, don't bother with the 
> > timestamps. The first time you get a SIOCGSTAMP ioctl on a given socket, 
> > bump the count and set a flag; decrement the count when the socket is 
> > destroyed if the flag is set.
> 
> Reread what I said please, the user can ask for timestamps using CMSG
> objects via the recvmsg() system call, there are no ioctls or socket
> controls done on the socket.  It is completely dynamic and
> unpredictable.

The user sets the SO_TIMESTAMP setsockopt to 1 and then you get the cmsg.
That's per-socket state. The other way is to use the SIOCGTSTAMP ioctl.
That is a bit more ugly because it has no state, but you can do
a heuristic and assume that a process that does SIOCGTSTAMP once
will do it in the future too, and set a flag in this case.

The first SIOCGTSTAMP would be inaccurate, but the following ones (after
all untimestamped packets have been flushed) would be ok.

Doing this for IP would be relatively easy; the only major users of the
timestamp seem to be DECnet and the bridge, but I suppose those could be
converted to use jiffies too.

-Andi


* Re: Fire Engine??
  2003-11-26 20:01       ` Jamie Lokier
  2003-11-26 20:04         ` David S. Miller
@ 2003-11-26 21:54         ` Pekka Pietikainen
  1 sibling, 0 replies; 39+ messages in thread
From: Pekka Pietikainen @ 2003-11-26 21:54 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: David S. Miller, Andi Kleen, linux-kernel

On Wed, Nov 26, 2003 at 08:01:53PM +0000, Jamie Lokier wrote:
> > Andi, I know this is a problem, but for the millionth time your idea
> > does not work because we don't know if the user asked for the timestamp
> > until we are deep within the recvmsg() processing, which is long after
> > the packet has arrived.
> 
> Do the timestamps need to be precise and accurately reflect the
> arrival time in the irq handler?  Or, for TCP timestamps, would it be
> good enough to use the time when the protocol handlers are run, and
> only read the hardware clock once for a bunch of received packets?  Or
> even use jiffies?

> Apart from TCP, precise timestamps are only used for packet capture,
> and it's easy to keep track globally of whether anyone has packet
> sockets open.
It should probably be noted that really hardcore timestamp users
have their NICs do it for them, since interrupt coalescing
makes timestamps done in the kernel too inaccurate for them even
if rdtsc is used (http://www-didc.lbl.gov/papers/SCNM-PAM03.pdf).
Not that it's anywhere near a universal solution, since more or less only
one brand of NIC supports them.

It would probably be a useful experiment to see whether the performance is
improved in a noticeable way if, say, jiffies were used. If so, it might be a
reasonable choice for a configurable option; if not, then not.
Isn't stuff like this the reason why the experimental network patches tree
that was announced a while back is out there? ;-)


* Re: Fire Engine??
  2003-11-26 21:24           ` Jamie Lokier
@ 2003-11-26 21:38             ` David S. Miller
  2003-11-26 23:43               ` Jamie Lokier
  0 siblings, 1 reply; 39+ messages in thread
From: David S. Miller @ 2003-11-26 21:38 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: tytso, ak, linux-kernel

On Wed, 26 Nov 2003 21:24:06 +0000
Jamie Lokier <jamie@shareable.org> wrote:

> recvmsg() doesn't return timestamps until they are requested
> using setsockopt(...SO_TIMESTAMP...).
> 
> See sock_recv_timestamp() in include/net/sock.h.

See MSG_ERRQUEUE and net/ipv4/ip_sockglue.c

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26 19:30     ` David S. Miller
                         ` (2 preceding siblings ...)
  2003-11-26 20:22       ` Theodore Ts'o
@ 2003-11-26 21:34       ` Arjan van de Ven
  2003-11-26 22:58         ` Andi Kleen
  2003-11-26 22:39       ` Andi Kleen
  4 siblings, 1 reply; 39+ messages in thread
From: Arjan van de Ven @ 2003-11-26 21:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 866 bytes --]

On Wed, 2003-11-26 at 20:30, David S. Miller wrote:

> > - Doing gettimeofday on each incoming packet is just dumb, especially
> > when you have gettimeofday backed with a slow southbridge timer.
> > This shows quite badly on many profile logs.
> > I still think the right solution for that would be to only take time stamps
> > when there is any user for it (= no timestamps in 99% of all systems) 
> 
> Andi, I know this is a problem, but for the millionth time your idea
> does not work because we don't know if the user asked for the timestamp
> until we are deep within the recvmsg() processing, which is long after
> the packet has arrived.

question: do we need a timestamp for every packet or can we do one
timestamp per irq-context entry ? (eg one timestamp at irq entry time we
do anyway and keep that for all packets processed in the softirq)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26 21:02         ` David S. Miller
@ 2003-11-26 21:24           ` Jamie Lokier
  2003-11-26 21:38             ` David S. Miller
  0 siblings, 1 reply; 39+ messages in thread
From: Jamie Lokier @ 2003-11-26 21:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: Theodore Ts'o, ak, linux-kernel

David S. Miller wrote:
> > that are currently requesting timestamps, then we can dispense with
> > taking the timestamp.
> 
> You can predict what the arguments will be for the user's
> recvmsg() system call at the time of packet reception?  Wow,
> show me how :)

recvmsg() doesn't return timestamps until they are requested
using setsockopt(...SO_TIMESTAMP...).

See sock_recv_timestamp() in include/net/sock.h.

-- Jamie

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26 20:22       ` Theodore Ts'o
@ 2003-11-26 21:02         ` David S. Miller
  2003-11-26 21:24           ` Jamie Lokier
  0 siblings, 1 reply; 39+ messages in thread
From: David S. Miller @ 2003-11-26 21:02 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: ak, linux-kernel

On Wed, 26 Nov 2003 15:22:16 -0500
"Theodore Ts'o" <tytso@mit.edu> wrote:

> I believe what Andi was suggesting was if there was **no** processes
> that are currently requesting timestamps, then we can dispense with
> taking the timestamp.

You can predict what the arguments will be for the user's
recvmsg() system call at the time of packet reception?  Wow,
show me how :)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26 19:30     ` David S. Miller
  2003-11-26 19:58       ` Paul Menage
  2003-11-26 20:01       ` Jamie Lokier
@ 2003-11-26 20:22       ` Theodore Ts'o
  2003-11-26 21:02         ` David S. Miller
  2003-11-26 21:34       ` Arjan van de Ven
  2003-11-26 22:39       ` Andi Kleen
  4 siblings, 1 reply; 39+ messages in thread
From: Theodore Ts'o @ 2003-11-26 20:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, linux-kernel

On Wed, Nov 26, 2003 at 11:30:40AM -0800, David S. Miller wrote:
> > - Doing gettimeofday on each incoming packet is just dumb, especially
> > when you have gettimeofday backed with a slow southbridge timer.
> > This shows quite badly on many profile logs.
> > I still think the right solution for that would be to only take time stamps
> > when there is any user for it (= no timestamps in 99% of all systems) 
> 
> Andi, I know this is a problem, but for the millionth time your idea
> does not work because we don't know if the user asked for the timestamp
> until we are deep within the recvmsg() processing, which is long after
> the packet has arrived.

I believe what Andi was suggesting was if there was **no** processes
that are currently requesting timestamps, then we can dispense with
taking the timestamp.  If a single user asks for the timestamp, then
we would still end up taking timestamps on all packets.  Is this worth
the overhead to keep track of that factor?  It's arguable, but some
platforms, probably yes.

						- Ted

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26 20:01       ` Jamie Lokier
@ 2003-11-26 20:04         ` David S. Miller
  2003-11-26 21:54         ` Pekka Pietikainen
  1 sibling, 0 replies; 39+ messages in thread
From: David S. Miller @ 2003-11-26 20:04 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: ak, linux-kernel

On Wed, 26 Nov 2003 20:01:53 +0000
Jamie Lokier <jamie@shareable.org> wrote:

> Do the timestamps need to be precise and accurately reflect the
> arrival time in the irq handler?

It would be a regression to make the timestamps less accurate
than those provided now.

> Or, for TCP timestamps,

The timestamps we are talking about are not used for TCP.

> Apart from TCP, precise timestamps are only used for packet capture,
> and it's easy to keep track globally of whether anyone has packet
> sockets open.

We have no knowledge of what an application's requirements are,
which is why we provide as accurate a timestamp as possible.

If we were writing this stuff for the first time now, sure we could
specify things however conveniently we like, but how this stuff behaves
is already well defined.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26 19:58       ` Paul Menage
@ 2003-11-26 20:03         ` David S. Miller
  2003-11-26 22:29           ` Andi Kleen
  0 siblings, 1 reply; 39+ messages in thread
From: David S. Miller @ 2003-11-26 20:03 UTC (permalink / raw)
  To: Paul Menage; +Cc: ak, linux-kernel

On Wed, 26 Nov 2003 11:58:44 -0800
Paul Menage <menage@google.com> wrote:

> How about tracking the number of current sockets that have had timestamp 
> requests for them? If this number is zero, don't bother with the 
> timestamps. The first time you get a SIOCGSTAMP ioctl on a given socket, 
> bump the count and set a flag; decrement the count when the socket is 
> destroyed if the flag is set.

Reread what I said please, the user can ask for timestamps using CMSG
objects via the recvmsg() system call, there are no ioctls or socket
controls done on the socket.  It is completely dynamic and
unpredictable.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26 19:30     ` David S. Miller
  2003-11-26 19:58       ` Paul Menage
@ 2003-11-26 20:01       ` Jamie Lokier
  2003-11-26 20:04         ` David S. Miller
  2003-11-26 21:54         ` Pekka Pietikainen
  2003-11-26 20:22       ` Theodore Ts'o
                         ` (2 subsequent siblings)
  4 siblings, 2 replies; 39+ messages in thread
From: Jamie Lokier @ 2003-11-26 20:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, linux-kernel

David S. Miller wrote:
> > - Doing gettimeofday on each incoming packet is just dumb, especially
> > when you have gettimeofday backed with a slow southbridge timer.
> > This shows quite badly on many profile logs.
> > I still think the right solution for that would be to only take time stamps
> > when there is any user for it (= no timestamps in 99% of all systems) 
> 
> Andi, I know this is a problem, but for the millionth time your idea
> does not work because we don't know if the user asked for the timestamp
> until we are deep within the recvmsg() processing, which is long after
> the packet has arrived.

Do the timestamps need to be precise and accurately reflect the
arrival time in the irq handler?  Or, for TCP timestamps, would it be
good enough to use the time when the protocol handlers are run, and
only read the hardware clock once for a bunch of received packets?  Or
even use jiffies?

Apart from TCP, precise timestamps are only used for packet capture,
and it's easy to keep track globally of whether anyone has packet
sockets open.

-- Jamie

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26 19:19         ` Diego Calleja García
@ 2003-11-26 19:59           ` Mike Fedyk
  2003-11-27  3:54           ` Bill Huey
  1 sibling, 0 replies; 39+ messages in thread
From: Mike Fedyk @ 2003-11-26 19:59 UTC (permalink / raw)
  To: Diego Calleja García; +Cc: john, ak, davem, linux-kernel

On Wed, Nov 26, 2003 at 08:19:03PM +0100, Diego Calleja García wrote:
> El Wed, 26 Nov 2003 10:50:28 -0800 Mike Fedyk <mfedyk@matchmail.com> escribió:
> 
> > > http://bulk.fefe.de/scalability/
> > 
> > No such file or directory.
> 
> It works here. I don't know if those numbers represent anything for networking.
> Some of the benchmarks look more like "vm benchmarking". And are the ones
> measuring latency valid, considering that the BSDs lack "preempt"?
> (shooting in the dark)

Grr, that trailing "/" made the difference. :-/

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26 19:30     ` David S. Miller
@ 2003-11-26 19:58       ` Paul Menage
  2003-11-26 20:03         ` David S. Miller
  2003-11-26 20:01       ` Jamie Lokier
                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Paul Menage @ 2003-11-26 19:58 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, linux-kernel

David S. Miller wrote:
  >
> Andi, I know this is a problem, but for the millionth time your idea
> does not work because we don't know if the user asked for the timestamp
> until we are deep within the recvmsg() processing, which is long after
> the packet has arrived.

How about tracking the number of current sockets that have had timestamp 
requests for them? If this number is zero, don't bother with the 
timestamps. The first time you get a SIOCGSTAMP ioctl on a given socket, 
bump the count and set a flag; decrement the count when the socket is 
destroyed if the flag is set.

The drawback is that the first SIOCGSTAMP on any particular socket will 
have to return a bogus value (maybe just the current time?). Ways to 
mitigate that are:

- have a /proc option to let the sysadmin enforce timestamps on all 
packets (just bump the counter)

- bump the counter whenever an interface is in promiscuous mode (I 
imagine that tcpdump et al are the main users of the timestamps?)

Paul


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26  9:53   ` Andi Kleen
  2003-11-26 11:35     ` John Bradford
  2003-11-26 15:00     ` Trond Myklebust
@ 2003-11-26 19:30     ` David S. Miller
  2003-11-26 19:58       ` Paul Menage
                         ` (4 more replies)
  2 siblings, 5 replies; 39+ messages in thread
From: David S. Miller @ 2003-11-26 19:30 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On 26 Nov 2003 10:53:21 +0100
Andi Kleen <ak@suse.de> wrote:

> Some issues just from the top of my head. I have not done detailed profiling
> recently and don't know if any of this would help significantly. It is 
> just what I remember right now.

Thanks for the list Andi, I'll keep it around.  I'd like
to comment on one entry though.

> - On TX we are inefficient for the same reason. TCP builds one packet
> at a time and then goes down through all layers taking all locks (queue,
> device driver etc.) and submits the single packet. Then repeats that for 
> lots of packets because many TCP writes are > MTU. Batching that would 
> likely help a lot, like it was done in the 2.6 VFS. I think it could 
> also make hard_start_xmit in many drivers significantly faster.

This is tricky, because of getting all of the queueing stuff right.
All of the packet scheduler APIs would need to change, as would
the classification stuff, not to mention netfilter et al.

You're talking about basically redoing the whole TX path if you
want to really support this.

I'm not saying "don't do this", just that we should be sure we know
what we're getting if we invest the time into this.

> - The hash tables are too big. This causes unnecessary cache misses all the 
> time.

I agree.  See my comments on this topic in another recent linux-kernel
thread wrt. huge hash tables on numa systems.

> - Doing gettimeofday on each incoming packet is just dumb, especially
> when you have gettimeofday backed with a slow southbridge timer.
> This shows quite badly on many profile logs.
> I still think the right solution for that would be to only take time stamps
> when there is any user for it (= no timestamps in 99% of all systems) 

Andi, I know this is a problem, but for the millionth time your idea
does not work because we don't know if the user asked for the timestamp
until we are deep within the recvmsg() processing, which is long after
the packet has arrived.

> - user copy and checksum could probably also be done faster if they were
> batched for multiple packets. It is hard to optimize properly for 
> <= 1.5K copies.
> This is especially true for 4/4 split kernels which will eat a 
> page table lookup + lock for each individual copy, but also for others.

I disagree partially, especially in the presence of a chip that provides
proper implementations of software initiated prefetching.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26 18:50       ` Mike Fedyk
@ 2003-11-26 19:19         ` Diego Calleja García
  2003-11-26 19:59           ` Mike Fedyk
  2003-11-27  3:54           ` Bill Huey
  0 siblings, 2 replies; 39+ messages in thread
From: Diego Calleja García @ 2003-11-26 19:19 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: john, ak, davem, linux-kernel

El Wed, 26 Nov 2003 10:50:28 -0800 Mike Fedyk <mfedyk@matchmail.com> escribió:

> > http://bulk.fefe.de/scalability/
> 
> No such file or directory.

It works here. I don't know if those numbers represent anything for networking.
Some of the benchmarks look more like "vm benchmarking". And are the ones
measuring latency valid, considering that the BSDs lack "preempt"?
(shooting in the dark)

Diego Calleja.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26 11:35     ` John Bradford
@ 2003-11-26 18:50       ` Mike Fedyk
  2003-11-26 19:19         ` Diego Calleja García
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Fedyk @ 2003-11-26 18:50 UTC (permalink / raw)
  To: John Bradford; +Cc: Andi Kleen, David S. Miller, linux-kernel

On Wed, Nov 26, 2003 at 11:35:03AM +0000, John Bradford wrote:
> Quote from Andi Kleen <ak@suse.de>:
> > "David S. Miller" <davem@redhat.com> writes:
> > > 
> > > So his claim is that, in their measurements, "CPU utilization"
> > > was lower in their stack.  Was he using 2.6.x and TSO capable
> > > cards on the Linux side?  If not, it's not apples to apples
> > > against our current upcoming technology.
> > 
> > Maybe they just have a better copy_to_user(). That eats most time anyways.
> > 
> > I think there are definitely areas of improvements left in current TCP.
> > It has gotten quite fat over the last years.
> 
> On the subject of general networking performance in Linux, I thought
> this set of benchmarks was quite interesting:
> 
> http://bulk.fefe.de/scalability/

No such file or directory.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26  9:53   ` Andi Kleen
  2003-11-26 11:35     ` John Bradford
@ 2003-11-26 15:00     ` Trond Myklebust
  2003-11-26 23:01       ` Andi Kleen
  2003-11-26 19:30     ` David S. Miller
  2 siblings, 1 reply; 39+ messages in thread
From: Trond Myklebust @ 2003-11-26 15:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, linux-kernel

>>>>> " " == Andi Kleen <ak@suse.de> writes:

     > - If they tested TCP-over-NFS then I'm pretty sure Linux lost
                        ^^^^^^^^^^^^ That would be inefficient 8-)
     > badly because the current paths for that are just awfully
     > inefficient.

...mind elaborating?

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
  2003-11-26  9:53   ` Andi Kleen
@ 2003-11-26 11:35     ` John Bradford
  2003-11-26 18:50       ` Mike Fedyk
  2003-11-26 15:00     ` Trond Myklebust
  2003-11-26 19:30     ` David S. Miller
  2 siblings, 1 reply; 39+ messages in thread
From: John Bradford @ 2003-11-26 11:35 UTC (permalink / raw)
  To: Andi Kleen, David S. Miller; +Cc: linux-kernel

Quote from Andi Kleen <ak@suse.de>:
> "David S. Miller" <davem@redhat.com> writes:
> > 
> > So his claim is that, in their measurements, "CPU utilization"
> > was lower in their stack.  Was he using 2.6.x and TSO capable
> > cards on the Linux side?  If not, it's not apples to apples
> > against our current upcoming technology.
> 
> Maybe they just have a better copy_to_user(). That eats most time anyways.
> 
> I think there are definitely areas of improvements left in current TCP.
> It has gotten quite fat over the last years.

On the subject of general networking performance in Linux, I thought
this set of benchmarks was quite interesting:

http://bulk.fefe.de/scalability/

particularly the 2.4 -> 2.6 comparisons.

John.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fire Engine??
       [not found] ` <20031125183035.1c17185a.davem@redhat.com.suse.lists.linux.kernel>
@ 2003-11-26  9:53   ` Andi Kleen
  2003-11-26 11:35     ` John Bradford
                       ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Andi Kleen @ 2003-11-26  9:53 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

"David S. Miller" <davem@redhat.com> writes:
> 
> So his claim is that, in their measurements, "CPU utilization"
> was lower in their stack.  Was he using 2.6.x and TSO capable
> cards on the Linux side?  If not, it's not apples to apples
> against our current upcoming technology.

Maybe they just have a better copy_to_user(). That eats most time anyways.

I think there are definitely areas of improvements left in current TCP.
It has gotten quite fat over the last years.

Some issues just from the top of my head. I have not done detailed profiling
recently and don't know if any of this would help significantly. It is 
just what I remember right now.

- Window computation for incoming packets is quite dumbly coded right now
and could be optimized
- I suspect the copy/process-in-user-context setup needs to be rethought/
rebenchmarked in Gigabit setups.  There was at least one test case
where tcp_low_latency=1 helped. It just adds latency that might hurt
and is not very useful when you have hardware checksums anyways
- If they tested TCP-over-NFS then I'm pretty sure Linux lost badly because
the current paths for that are just awfully inefficient.
- Overall IP/TCP could probably have some more instructions, and hopefully
cache misses, shaved off with some careful going over the fast paths.
- There are too many locks. That hurts when you have slow atomic operations
(like on P4), especially together with the next issue. 
- We do most things one packet at a time. This means locking and multiple
layer overhead multiplies. Most network operations come in packet bursts
and it would be much more efficient to batch operations: always process
lists of packets instead of single packets. This could probably lower
locking overhead a lot.
- On TX we are inefficient for the same reason. TCP builds one packet
at a time and then goes down through all layers taking all locks (queue,
device driver etc.) and submits the single packet. Then repeats that for 
lots of packets because many TCP writes are > MTU. Batching that would 
likely help a lot, like it was done in the 2.6 VFS. I think it could 
also make hard_start_xmit in many drivers significantly faster.
- The hash tables are too big. This causes unnecessary cache misses all the 
time.
- Doing gettimeofday on each incoming packet is just dumb, especially
when you have gettimeofday backed with a slow southbridge timer.
This shows quite badly on many profile logs.
I still think the right solution for that would be to only take time stamps
when there is any user for it (= no timestamps in 99% of all systems) 
- user copy and checksum could probably also be done faster if they were
batched for multiple packets. It is hard to optimize properly for 
<= 1.5K copies.
This is especially true for 4/4 split kernels which will eat a 
page table lookup + lock for each individual copy, but also for others.

-Andi

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2003-11-27 12:18 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-11-26  0:15 Fire Engine?? Mr. BOFH
2003-11-26  1:48 ` [OT] " Nick Piggin
2003-11-26  2:11   ` Larry McVoy
2003-11-26  2:48     ` David S. Miller
2003-11-26  3:31     ` Rik van Riel
2003-11-26  2:30 ` David S. Miller
2003-11-26  5:41 ` Valdis.Kletnieks
     [not found] <BAY1-DAV15JU71pROHD000040e2@hotmail.com.suse.lists.linux.kernel>
     [not found] ` <20031125183035.1c17185a.davem@redhat.com.suse.lists.linux.kernel>
2003-11-26  9:53   ` Andi Kleen
2003-11-26 11:35     ` John Bradford
2003-11-26 18:50       ` Mike Fedyk
2003-11-26 19:19         ` Diego Calleja García
2003-11-26 19:59           ` Mike Fedyk
2003-11-27  3:54           ` Bill Huey
2003-11-26 15:00     ` Trond Myklebust
2003-11-26 23:01       ` Andi Kleen
2003-11-26 23:23         ` Trond Myklebust
2003-11-26 23:38           ` Andi Kleen
2003-11-26 19:30     ` David S. Miller
2003-11-26 19:58       ` Paul Menage
2003-11-26 20:03         ` David S. Miller
2003-11-26 22:29           ` Andi Kleen
2003-11-26 22:36             ` David S. Miller
2003-11-26 22:56               ` Andi Kleen
2003-11-26 23:13                 ` David S. Miller
2003-11-26 23:29                   ` Andi Kleen
2003-11-26 23:41                   ` Ben Greear
2003-11-26 20:01       ` Jamie Lokier
2003-11-26 20:04         ` David S. Miller
2003-11-26 21:54         ` Pekka Pietikainen
2003-11-26 20:22       ` Theodore Ts'o
2003-11-26 21:02         ` David S. Miller
2003-11-26 21:24           ` Jamie Lokier
2003-11-26 21:38             ` David S. Miller
2003-11-26 23:43               ` Jamie Lokier
2003-11-26 21:34       ` Arjan van de Ven
2003-11-26 22:58         ` Andi Kleen
2003-11-27 12:16           ` Ingo Oeser
2003-11-26 22:39       ` Andi Kleen
2003-11-26 22:46         ` David S. Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).