LKML Archive on lore.kernel.org
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
@ 2007-02-25 19:11 Al Boldi
  0 siblings, 0 replies; 277+ messages in thread
From: Al Boldi @ 2007-02-25 19:11 UTC (permalink / raw)
  To: linux-kernel

Ingo Molnar wrote:
> now look at kevents as the queueing model. It does not queue 'tasks', it
> lets user-space queue requests in essence, in various states. But it's
> still the same conceptual thing: a memory buffer with some state
> associated to it. Yes, it has no legacies, it has no priorities and
> other queueing concepts attached to it ... yet. If kevents got
> mainstream, it would get the same kind of pressure to grow 'more
> advanced' event queueing and event scheduling capabilities.
> Prioritization would be needed, etc.

But it would probably be tuned specifically to its use case, which would mean 
inherently better performance.

> So my fundamental claim is: a kernel thread /is/ our main request
> structure. We've got tons of really good system calls that queue these
> 'requests' around the place and offer functionality around this concept.
> Plus there's a 1.2+ billion lines of Linux userspace code that works
> well with this abstraction - while there's nary a few thousand lines of
> event-based user-space code.

Think of the kernel scheduler as a default fallback scheduler, for procs that 
are randomly queued.  Anytime you can identify a group of procs/threads that 
behave in a similar way, it's almost always best to do specific/private 
scheduling, for performance reasons.

> I also say that you'll likely get kevents outperform threadlets. Maybe
> even significantly so under the right conditions. But i very much
> believe we want to get similar kind of performance out of thread/task
> scheduling, and not introduce a parallel framework to do request
> scheduling the hard way ... just because our task concept and scheduling
> implementation got too fat. For the same reason i didnt really like
> fibrils: they are nice, and Zach's core idea i think nicely survived in
> the syslet/threadlet model too, but they are more limited than true
> threads. So doing that parallel infrastructure, which really just
> implements the same, and is only faster because it skips features, would
> just be hiding the problem with our primary abstraction. Ok?

Ok.  But what you are proposing here is a dynamically pluggable scheduler that 
is extensible on top of that.

Sounds Great!


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-07 18:21                                                                                       ` Kirk Kuchov
  2007-03-07 18:24                                                                                         ` Jens Axboe
@ 2007-03-07 18:32                                                                                         ` Evgeniy Polyakov
  1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-07 18:32 UTC (permalink / raw)
  To: Kirk Kuchov
  Cc: Ingo Molnar, Pavel Machek, Davide Libenzi, Eric Dumazet,
	Theodore Tso, Linus Torvalds, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Wed, Mar 07, 2007 at 03:21:19PM -0300, Kirk Kuchov (kirk.kuchov@gmail.com) wrote:
> On 3/7/07, Ingo Molnar <mingo@elte.hu> wrote:
> >
> >* Kirk Kuchov <kirk.kuchov@gmail.com> wrote:
> >
> >> I don't believe I'm wasting my time explaining this. They don't exist
> >> as /dev/null, they are just fucking _LINKS_.
> >[...]
> >> > Either stop flaming kernel developers or become one. It is that
> >> > simple.
> >>
> >> If I were to become a kernel developer I would stick with FreeBSD.
> >> [...]
> >
> >Hey, really, this is an excellent idea: what a boon you could become to
> >FreeBSD, again! How much they must be longing for your insightful
> >feedback, how much they must be missing your charming style and tactful
> >approach! I bet they'll want to print your mails out, frame them and
> >hang them over their fireplace, to remember the good old days on cold
> >snowy winter days, with warmth in their hearts! Please?
> >
> 
> http://www.totallytom.com/thecureforgayness.html

Fonts are a bit bad in my browser :)

Kirk, I understand your frustration - yes, Linux is not the perfect
place to land brand-new ideas, and yes, it lacks some features that modern
(or old) systems have supported for years, but things change with time.

I posted a patch which allows polling for signals; it can be trivially
adapted to support timers and essentially any other events.
Kevent did that too, but some things are just too radical for immediate
inclusion, especially when the majority of users do not require the
additional functionality.

People do work, and a lot of them do really good work, so there is no
need for rude talk about how bad things are. Things change - even I
support that, although the way kevent was ignored should put me in the
front line with you :)

Be good, and be cool.

> --
> Kirk Kuchov

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-07 18:21                                                                                       ` Kirk Kuchov
@ 2007-03-07 18:24                                                                                         ` Jens Axboe
  2007-03-07 18:32                                                                                         ` Evgeniy Polyakov
  1 sibling, 0 replies; 277+ messages in thread
From: Jens Axboe @ 2007-03-07 18:24 UTC (permalink / raw)
  To: Kirk Kuchov
  Cc: Ingo Molnar, Pavel Machek, Davide Libenzi, Evgeniy Polyakov,
	Eric Dumazet, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Thomas Gleixner

On Wed, Mar 07 2007, Kirk Kuchov wrote:
> On 3/7/07, Ingo Molnar <mingo@elte.hu> wrote:
> >
> >* Kirk Kuchov <kirk.kuchov@gmail.com> wrote:
> >
> >> I don't believe I'm wasting my time explaining this. They don't exist
> >> as /dev/null, they are just fucking _LINKS_.
> >[...]
> >> > Either stop flaming kernel developers or become one. It is that
> >> > simple.
> >>
> >> If I were to become a kernel developer I would stick with FreeBSD.
> >> [...]
> >
> >Hey, really, this is an excellent idea: what a boon you could become to
> >FreeBSD, again! How much they must be longing for your insightful
> >feedback, how much they must be missing your charming style and tactful
> >approach! I bet they'll want to print your mails out, frame them and
> >hang them over their fireplace, to remember the good old days on cold
> >snowy winter days, with warmth in their hearts! Please?
> >
> 
> http://www.totallytom.com/thecureforgayness.html

Dude, get a life. But more importantly, go waste somebody else's time
instead of lkml's.

-- 
Jens Axboe, updating killfile



* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-07 17:39                                                                                     ` Ingo Molnar
@ 2007-03-07 18:21                                                                                       ` Kirk Kuchov
  2007-03-07 18:24                                                                                         ` Jens Axboe
  2007-03-07 18:32                                                                                         ` Evgeniy Polyakov
  0 siblings, 2 replies; 277+ messages in thread
From: Kirk Kuchov @ 2007-03-07 18:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Davide Libenzi, Evgeniy Polyakov, Eric Dumazet,
	Theodore Tso, Linus Torvalds, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On 3/7/07, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Kirk Kuchov <kirk.kuchov@gmail.com> wrote:
>
> > I don't believe I'm wasting my time explaining this. They don't exist
> > as /dev/null, they are just fucking _LINKS_.
> [...]
> > > Either stop flaming kernel developers or become one. It is that
> > > simple.
> >
> > If I were to become a kernel developer I would stick with FreeBSD.
> > [...]
>
> Hey, really, this is an excellent idea: what a boon you could become to
> FreeBSD, again! How much they must be longing for your insightful
> feedback, how much they must be missing your charming style and tactful
> approach! I bet they'll want to print your mails out, frame them and
> hang them over their fireplace, to remember the good old days on cold
> snowy winter days, with warmth in their hearts! Please?
>

http://www.totallytom.com/thecureforgayness.html

--
Kirk Kuchov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-07 12:02                                                                                   ` Kirk Kuchov
  2007-03-07 17:26                                                                                     ` Linus Torvalds
@ 2007-03-07 17:39                                                                                     ` Ingo Molnar
  2007-03-07 18:21                                                                                       ` Kirk Kuchov
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-07 17:39 UTC (permalink / raw)
  To: Kirk Kuchov
  Cc: Pavel Machek, Davide Libenzi, Evgeniy Polyakov, Eric Dumazet,
	Theodore Tso, Linus Torvalds, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Kirk Kuchov <kirk.kuchov@gmail.com> wrote:

> I don't believe I'm wasting my time explaining this. They don't exist 
> as /dev/null, they are just fucking _LINKS_.
[...]
> > Either stop flaming kernel developers or become one. It is that 
> > simple.
> 
> If I were to become a kernel developer I would stick with FreeBSD. 
> [...]

Hey, really, this is an excellent idea: what a boon you could become to 
FreeBSD, again! How much they must be longing for your insightful 
feedback, how much they must be missing your charming style and tactful 
approach! I bet they'll want to print your mails out, frame them and 
hang them over their fireplace, to remember the good old days on cold 
snowy winter days, with warmth in their hearts! Please?

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-07 12:02                                                                                   ` Kirk Kuchov
@ 2007-03-07 17:26                                                                                     ` Linus Torvalds
  2007-03-07 17:39                                                                                     ` Ingo Molnar
  1 sibling, 0 replies; 277+ messages in thread
From: Linus Torvalds @ 2007-03-07 17:26 UTC (permalink / raw)
  To: Kirk Kuchov
  Cc: Pavel Machek, Davide Libenzi, Evgeniy Polyakov, Ingo Molnar,
	Eric Dumazet, Theodore Tso, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner



On Wed, 7 Mar 2007, Kirk Kuchov wrote:
> 
> I don't believe I'm wasting my time explaining this. They don't exist
> as /dev/null, they are just fucking _LINKS_. I could even "ln -s
> /proc/self/fd/0 sucker". A real /dev/stdout can/could even exist, but
> that's not the point!

Actually, one large reason for /proc/self/ existing is exactly /dev/stdin 
and friends.

And yes, /proc/self looks like a link too, but that doesn't change the 
fact that it's a very special file. No different from /dev/null or 
friends.

		Linus


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-06  8:19                                                                                 ` Pavel Machek
@ 2007-03-07 12:02                                                                                   ` Kirk Kuchov
  2007-03-07 17:26                                                                                     ` Linus Torvalds
  2007-03-07 17:39                                                                                     ` Ingo Molnar
  0 siblings, 2 replies; 277+ messages in thread
From: Kirk Kuchov @ 2007-03-07 12:02 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Davide Libenzi, Evgeniy Polyakov, Ingo Molnar, Eric Dumazet,
	Theodore Tso, Linus Torvalds, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On 3/6/07, Pavel Machek <pavel@ucw.cz> wrote:
> > >As for why common abstractions like file are a good thing, think about why
> > >having "/dev/null" is cleaner than having a special plug DEVNULL_FD fd
> > >value to be plugged everywhere,
> >
> > This is a stupid comparison. By your logic we should also have /dev/stdin,
> > /dev/stdout and /dev/stderr.
>
> Bzzt, wrong. We have them.
>
> pavel@amd:~$ ls -al /dev/std*
> lrwxrwxrwx 1 root root 4 Nov 12  2003 /dev/stderr -> fd/2
> lrwxrwxrwx 1 root root 4 Nov 12  2003 /dev/stdin -> fd/0
> lrwxrwxrwx 1 root root 4 Nov 12  2003 /dev/stdout -> fd/1
> pavel@amd:~$ ls -al /proc/self/fd
> total 0
> dr-x------ 2 pavel users  0 Mar  6 09:18 .
> dr-xr-xr-x 4 pavel users  0 Mar  6 09:18 ..
> lrwx------ 1 pavel users 64 Mar  6 09:18 0 -> /dev/ttyp2
> lrwx------ 1 pavel users 64 Mar  6 09:18 1 -> /dev/ttyp2
> lrwx------ 1 pavel users 64 Mar  6 09:18 2 -> /dev/ttyp2
> lr-x------ 1 pavel users 64 Mar  6 09:18 3 -> /proc/2299/fd
> pavel@amd:~$

I don't believe I'm wasting my time explaining this. They don't exist
the way /dev/null does, they are just fucking _LINKS_. I could even
"ln -s /proc/self/fd/0 sucker". A real /dev/stdout can/could even
exist, but that's not the point!

It remains a stupid comparison because /dev/stdin/stderr/whatever
"must" be plugged - else how could a process write to a stdout/stderr
that it couldn't open? Things are the way they are not because it's
cleaner to have them as files but because it's the only sane way.
/dev/null is not a must-have; it's mainly used for redirection.
A sys_nullify(fileno(stdout)) would rule out almost every use
of /dev/null.

> > >As for why common abstractions like file are a good thing, think about why
> > >having "/dev/null" is cleaner than having a special plug DEVNULL_FD fd
> > >value to be plugged everywhere,

> > >But here the list could be almost endless.
> > >And please don't start the, they don't scale or they need heavy file
> > >binding tossfeast. They scale as well as the interface that will receive
> > >them (poll, select, epoll). Heavy file binding what? 100 or so bytes for
> > >the struct file? How many signal/timer fd are you gonna have? Like 100K?
> > >Really moot argument when opposed to the benefit of being compatible with
> > >existing POSIX interfaces and being more Unix friendly.
> >
> > So why the HELL don't we have those yet? Why haven't you designed
> > epoll with those in mind? Why don't you back your claims with patches?
> > (I'm not a kernel developer.)
>
> Either stop flaming kernel developers or become one. It is  that
> simple.
>

If I were to become a kernel developer I would stick with FreeBSD. At
least they have kqueue for about seven years now.

--
Kirk Kuchov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-04 22:49                                                                               ` Kirk Kuchov
  2007-03-04 22:57                                                                                 ` Davide Libenzi
  2007-03-05  4:46                                                                                 ` Magnus Naeslund(k)
@ 2007-03-06  8:19                                                                                 ` Pavel Machek
  2007-03-07 12:02                                                                                   ` Kirk Kuchov
  2 siblings, 1 reply; 277+ messages in thread
From: Pavel Machek @ 2007-03-06  8:19 UTC (permalink / raw)
  To: Kirk Kuchov
  Cc: Davide Libenzi, Evgeniy Polyakov, Ingo Molnar, Eric Dumazet,
	Theodore Tso, Linus Torvalds, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

> >As for why common abstractions like file are a good thing, think about why
> >having "/dev/null" is cleaner than having a special plug DEVNULL_FD fd
> >value to be plugged everywhere,
> 
> This is a stupid comparison. By your logic we should also have /dev/stdin,
> /dev/stdout and /dev/stderr.

Bzzt, wrong. We have them.

pavel@amd:~$ ls -al /dev/std*
lrwxrwxrwx 1 root root 4 Nov 12  2003 /dev/stderr -> fd/2
lrwxrwxrwx 1 root root 4 Nov 12  2003 /dev/stdin -> fd/0
lrwxrwxrwx 1 root root 4 Nov 12  2003 /dev/stdout -> fd/1
pavel@amd:~$ ls -al /proc/self/fd
total 0
dr-x------ 2 pavel users  0 Mar  6 09:18 .
dr-xr-xr-x 4 pavel users  0 Mar  6 09:18 ..
lrwx------ 1 pavel users 64 Mar  6 09:18 0 -> /dev/ttyp2
lrwx------ 1 pavel users 64 Mar  6 09:18 1 -> /dev/ttyp2
lrwx------ 1 pavel users 64 Mar  6 09:18 2 -> /dev/ttyp2
lr-x------ 1 pavel users 64 Mar  6 09:18 3 -> /proc/2299/fd
pavel@amd:~$

> >But here the list could be almost endless.
> >And please don't start the, they don't scale or they need heavy file
> >binding tossfeast. They scale as well as the interface that will receive
> >them (poll, select, epoll). Heavy file binding what? 100 or so bytes for
> >the struct file? How many signal/timer fd are you gonna have? Like 100K?
> >Really moot argument when opposed to the benefit of being compatible with
> >existing POSIX interfaces and being more Unix friendly.
> 
> So why the HELL don't we have those yet? Why haven't you designed
> epoll with those in mind? Why don't you back your claims with patches?
> (I'm not a kernel developer.)

Either stop flaming kernel developers or become one. It is that
simple.

								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-04 17:46                                                                             ` Kyle Moffett
@ 2007-03-05  5:23                                                                               ` Michael K. Edwards
  0 siblings, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-03-05  5:23 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Kirk Kuchov, Davide Libenzi, Evgeniy Polyakov, Ingo Molnar,
	Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On 3/4/07, Kyle Moffett <mrmacman_g4@mac.com> wrote:
> Well, even this far into 2.6, Linus' patch from 2003 still (mostly)
> applies; the maintenance cost for this kind of code is virtually
> zilch.  If it matters that much to you clean it up and make it apply;
> add an alarmfd() syscall (another 100 lines of code at most?) and
> make a "read" return an architecture-independent siginfo-like
> structure and submit it for inclusion.  Adding epoll() support for
> random objects is as simple as a 75-line object-filesystem and a 25-
> line syscall to return an FD to a new inode.  Have fun!  Go wild!
> Something this trivially simple could probably spend a week in -mm
> and go to linus for 2.6.22.

Or, if you want to do slightly more work and produce something a great
deal more useful, you could implement additional netlink address
families for additional "event" sources.  The socket - setsockopt -
bind - sendmsg/recvmsg sequence is a well understood and well
documented UNIX paradigm for multiplexing non-blocking I/O to many
destinations over one socket.  Everyone who has read Stevens is
familiar with the basic UDP and "fd open server" techniques, and if
you look at Linux's IP_PKTINFO and NETLINK_W1 (bravo, Evgeniy!) you'll
see how easily they could be extended to file AIO and other kinds of
event sources.

For file AIO, you might have the application open one AIO socket per
mount point, open files indirectly via the SCM_RIGHTS mechanism, and
submit/retire read/write requests via sendmsg/recvmsg with ancillary
data consisting of an lseek64 tuple and a user-provided cookie.
Although the process still has to have one fd open per actual open
file (because trying to authenticate file accesses without opening fds
is madness), the only fds it has to manipulate directly are those
representing entire pools of outstanding requests.  This is usually a
small enough set that select() will do just fine, if you're careful
with fd allocation.  (You can simply punt indirectly opened fds up to
a high numerical range, where they can't be accessed directly from
userspace but still make fine cookies for use in lseek64 tuples within
cmsg headers).

The same basic approach will work for timers, signals, and just about
any other event source.  Userspace is of course still stuck doing its
own state machines / thread scheduling / however you choose to think
of it.  But all the important activity goes through socketcall(), and
the data and control parameters are all packaged up into a struct
msghdr instead of the bare buffer pointers of read/write.  So if
someone else does come along later and design an ultralight threading
mechanism that isn't a total botch, the actual data paths won't need
much rework; the exception handling will just get a lot simpler.

Cheers,
- Michael


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-04 22:49                                                                               ` Kirk Kuchov
  2007-03-04 22:57                                                                                 ` Davide Libenzi
@ 2007-03-05  4:46                                                                                 ` Magnus Naeslund(k)
  2007-03-06  8:19                                                                                 ` Pavel Machek
  2 siblings, 0 replies; 277+ messages in thread
From: Magnus Naeslund(k) @ 2007-03-05  4:46 UTC (permalink / raw)
  To: Kirk Kuchov; +Cc: linux-kernel

Kirk Kuchov wrote:
[snip]
> 
> This is a stupid comparison. By your logic we should also have /dev/stdin,
> /dev/stdout and /dev/stderr.
> 

Well, as a matter of fact (on my system):

# ls -l /dev/std*
lrwxrwxrwx  1 root root 4 Feb  1  2006 /dev/stderr -> fd/2
lrwxrwxrwx  1 root root 4 Feb  1  2006 /dev/stdin -> fd/0
lrwxrwxrwx  1 root root 4 Feb  1  2006 /dev/stdout -> fd/1

Please don't bother to respond to this mail, I just saw that you 
apparently needed the info.

Magnus

P.S.: *PLONK*


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-04 22:49                                                                               ` Kirk Kuchov
@ 2007-03-04 22:57                                                                                 ` Davide Libenzi
  2007-03-05  4:46                                                                                 ` Magnus Naeslund(k)
  2007-03-06  8:19                                                                                 ` Pavel Machek
  2 siblings, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-04 22:57 UTC (permalink / raw)
  To: Kirk Kuchov
  Cc: Evgeniy Polyakov, Ingo Molnar, Eric Dumazet, Pavel Machek,
	Theodore Tso, Linus Torvalds, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Sun, 4 Mar 2007, Kirk Kuchov wrote:

> I don't give a shit.

Here's another good use of /dev/null:

*PLONK*



- Davide




* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-04 21:17                                                                             ` Davide Libenzi
@ 2007-03-04 22:49                                                                               ` Kirk Kuchov
  2007-03-04 22:57                                                                                 ` Davide Libenzi
                                                                                                   ` (2 more replies)
  0 siblings, 3 replies; 277+ messages in thread
From: Kirk Kuchov @ 2007-03-04 22:49 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Evgeniy Polyakov, Ingo Molnar, Eric Dumazet, Pavel Machek,
	Theodore Tso, Linus Torvalds, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On 3/4/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> On Sun, 4 Mar 2007, Kirk Kuchov wrote:
>
> > On 3/3/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> > > <snip>
> > >
> > > Those *other* (tons?!?) interfaces can be created *when* the need comes
> > > (see Linus signalfd [1] example to show how urgent that was). *When*
> > > the need comes, they will work with existing POSIX interfaces, without
> > > requiring your own just-another event interface. Those other interfaces
> > > could also be more easily adopted by other Unix cousins, because of
> > > the fact that they rely on existing POSIX interfaces.
> >
> > Please stop with this crap, this chicken or the egg argument of yours is utter
> > BULLSHIT!
>
> Wow, wow, fella! You _definitely_ cannot afford rudeness here.

I don't give a shit.

> You started badly, and you ended even worse, by listing some APIs that will
> work only with epoll. As I said already, and as it was listed in the
> thread I posted the link to, something like:
>
> int signalfd(...);  // Linus initial interface would be perfectly fine
> int timerfd(...);   // Open ...
> int eventfd(...);   // [1]
>
> Will work *even* with standard POSIX select/poll. 95% or more of the
> software does not have scalability issues, and select/poll are more
> portable and easy to use for simple stuff. On top of that, as I already
> said, they are *confined* interfaces that could be more easily adopted by
> other Unixes (if they are 100-200 lines on Linux, don't expect them to be
> a lot more on other Unixes) [2]. We *already* have the infrastructure
> inside Linux to deliver events (f_op->poll subsystem), how about we use
> that instead of just-another way? [3]

Man you're so full of shit, your eyes are brown. NOBODY cares about
select/poll or whether the interfaces are going to be adopted by other
Unixes. This issue was already solved by them YEARS ago.

What I want (and a ton of other users) is a SIMPLE and generic way to
receive events from _MULTIPLE_ sources. I don't care about
kernel-level portability, easiness or whatever; the Linux kernel
developers are good at not knowing what their users want.

> As for why common abstractions like file are a good thing, think about why
> having "/dev/null" is cleaner than having a special plug DEVNULL_FD fd
> value to be plugged everywhere,

This is a stupid comparison. By your logic we should also have /dev/stdin,
/dev/stdout and /dev/stderr.

> or why I can use find/grep/cat/echo/... to
> look/edit at my configuration inside /proc, instead of using a frigging
> registry editor.

Yet another stupid comparison: /proc is a MESS! Almost as bad as
the registry. Linux now has three pieces of crap for
configuration/information: /proc, sysfs and sysctl. Nobody knows
exactly what should go into each of those. Crap design at its
best.

> But here the list could be almost endless.
> And please don't start the, they don't scale or they need heavy file
> binding tossfeast. They scale as well as the interface that will receive
> them (poll, select, epoll). Heavy file binding what? 100 or so bytes for
> the struct file? How many signal/timer fd are you gonna have? Like 100K?
> Really moot argument when opposed to the benefit of being compatible with
> existing POSIX interfaces and being more Unix friendly.

So why the HELL don't we have those yet? Why haven't you designed
epoll with those in mind? Why don't you back your claims with patches?
(I'm not a kernel developer.)

> As for the AIO stuff, if threadlets/syslets will prove effective, you can
> host an epoll_wait over a syslet/threadlet. Or, if the 3 lines of
> userspace code needed to do that, fall inside your definition of "kludge",
> we can even find a way to bridge the two.

I don't care about threadlets in this context, I just want to wait for
EVENTS from MULTIPLE sources WITHOUT mixing signals and other crap.
Your arrogance is amusing, stop pushing narrow-minded beliefs down the
throats of all Linux users. Kqueue, event ports,
WaitForMultipleObjects, epoll with multiple sources. That's what users
want, not yet another syscall/whatever hack.

> Now, how about we focus on the topic of this thread?
>
> [1] This could be an idea. People already uses pipes for this, but pipes
>     has some memory overhead inside the kernel (plus use two fds) that
>     could, if really felt necessary, be avoided.

Yet another hack!! 64 KiB of buffer space just to push some user
events around. Great idea!

>
> [2] This is how those kind of interfaces should be designed. Modular,
>     re-usable, file-based interfaces, whose acceptance is not linked into
>     slurping-in a whole new interface with tenths of sub, interface-only,
>     objects. And from this POV, epoll is the friendlier.

Who said I want yet another interface? I just fucking want to receive
events from MULTIPLE sources through epoll. With or without an fd! My
anger and frustration is that we can't get past this SIMPLE need!

> [3] Notice the similarity between threadlets/syslets and epoll? They
>     enable pretty darn good scalability, with *existing* infrastructure,
>     and w/out special ad-hoc code to be plugged everywhere. This translate
>     directly in easier to maintain code.

Easier for kernel developers, of course. <sarcasm> Who cares? Fuck the
users. </sarcasm>

--
Kirk Kuchov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-04 16:23                                                                           ` Kirk Kuchov
  2007-03-04 17:46                                                                             ` Kyle Moffett
@ 2007-03-04 21:17                                                                             ` Davide Libenzi
  2007-03-04 22:49                                                                               ` Kirk Kuchov
  1 sibling, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-04 21:17 UTC (permalink / raw)
  To: Kirk Kuchov
  Cc: Evgeniy Polyakov, Ingo Molnar, Eric Dumazet, Pavel Machek,
	Theodore Tso, Linus Torvalds, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Sun, 4 Mar 2007, Kirk Kuchov wrote:

> On 3/3/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> > <snip>
> > 
> > Those *other* (tons?!?) interfaces can be created *when* the need comes
> > (see the Linus signalfd [1] example to show how urgent that was). *When*
> > the need comes, they will work with existing POSIX interfaces, without
> > requiring your own just-another event interface. Those other interfaces
> > could also be more easily adopted by other Unix cousins, because
> > they rely on existing POSIX interfaces.
> 
> Please stop with this crap, this chicken-or-the-egg argument of yours is
> utter BULLSHIT!

Wow, wow, fella! You _definitely_ cannot afford rudeness here.
You started badly, and you ended even worse, by listing some APIs that will
work only with epoll. As I said already, and as was listed in the
thread whose link I posted, something like:

int signalfd(...);  // Linus initial interface would be perfectly fine
int timerfd(...);   // Open ...
int eventfd(...);   // [1]

Will work *even* with standard POSIX select/poll. 95% or more of the
software out there does not have scalability issues, and select/poll are
more portable and easier to use for simple stuff. On top of that, as I
already said, they are *confined* interfaces that could be more easily
adopted by other Unixes (if they are 100-200 lines on Linux, don't expect
them to be a lot more on other Unixes) [2]. We *already* have the
infrastructure inside Linux to deliver events (the f_op->poll subsystem),
so how about we use that instead of just-another way? [3]
As for why common abstractions like files are a good thing, think about
why having "/dev/null" is cleaner than having a special DEVNULL_FD fd
value to be plugged everywhere, or why I can use find/grep/cat/echo/... to
look at and edit my configuration inside /proc, instead of using a
frigging registry editor. But the list could go on almost endlessly.
And please don't start the "they don't scale" or "they need heavy file
binding" tossfeast. They scale as well as the interface that will receive
them (poll, select, epoll). Heavy file binding, what? 100 or so bytes for
the struct file? How many signal/timer fds are you gonna have? Like 100K?
A really moot argument when opposed to the benefit of being compatible
with existing POSIX interfaces and being more Unix friendly.
As for the AIO stuff, if threadlets/syslets prove effective, you can
host an epoll_wait over a syslet/threadlet. Or, if the 3 lines of
userspace code needed to do that fall inside your definition of "kludge",
we can even find a way to bridge the two.
Now, how about we focus on the topic of this thread?




[1] This could be an idea. People already use pipes for this, but pipes
    have some memory overhead inside the kernel (plus they use two fds) that
    could, if really felt necessary, be avoided.

[2] This is how those kinds of interfaces should be designed. Modular,
    re-usable, file-based interfaces, whose acceptance is not tied to
    slurping in a whole new interface with tens of sub, interface-only,
    objects. And from this POV, epoll is the friendliest.

[3] Notice the similarity between threadlets/syslets and epoll? They
    enable pretty darn good scalability, with *existing* infrastructure,
    and w/out special ad-hoc code to be plugged everywhere. This translates
    directly into easier-to-maintain code.



- Davide




* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-04 16:23                                                                           ` Kirk Kuchov
@ 2007-03-04 17:46                                                                             ` Kyle Moffett
  2007-03-05  5:23                                                                               ` Michael K. Edwards
  2007-03-04 21:17                                                                             ` Davide Libenzi
  1 sibling, 1 reply; 277+ messages in thread
From: Kyle Moffett @ 2007-03-04 17:46 UTC (permalink / raw)
  To: Kirk Kuchov
  Cc: Davide Libenzi, Evgeniy Polyakov, Ingo Molnar, Eric Dumazet,
	Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Mar 04, 2007, at 11:23:37, Kirk Kuchov wrote:
> So here we are, 2007. epoll() works with files, pipes, sockets,
> inotify and anything pollable (file descriptors) but aio, timers,
> signals and user-defined events. Can we please get those working
> with epoll? Something as simple as:
>
> [code snipped]
>
> Would this be acceptable? Can we finally move on?

Well, even this far into 2.6, Linus' patch from 2003 still (mostly)
applies; the maintenance cost for this kind of code is virtually
zilch.  If it matters that much to you, clean it up and make it apply;
add an alarmfd() syscall (another 100 lines of code at most?), make a
"read" return an architecture-independent siginfo-like structure, and
submit it for inclusion.  Adding epoll() support for random objects is
as simple as a 75-line object-filesystem and a 25-line syscall to
return an FD to a new inode.  Have fun!  Go wild!  Something this
trivially simple could probably spend a week in -mm and go to Linus
for 2.6.22.

Cheers,
Kyle Moffett



* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03 21:57                                                                         ` Davide Libenzi
  2007-03-03 22:10                                                                           ` Davide Libenzi
@ 2007-03-04 16:23                                                                           ` Kirk Kuchov
  2007-03-04 17:46                                                                             ` Kyle Moffett
  2007-03-04 21:17                                                                             ` Davide Libenzi
  1 sibling, 2 replies; 277+ messages in thread
From: Kirk Kuchov @ 2007-03-04 16:23 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Evgeniy Polyakov, Ingo Molnar, Eric Dumazet, Pavel Machek,
	Theodore Tso, Linus Torvalds, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On 3/3/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> <snip>
>
> Those *other* (tons?!?) interfaces can be created *when* the need comes
> (see the Linus signalfd [1] example to show how urgent that was). *When*
> the need comes, they will work with existing POSIX interfaces, without
> requiring your own just-another event interface. Those other interfaces
> could also be more easily adopted by other Unix cousins, because
> they rely on existing POSIX interfaces.

Please stop with this crap, this chicken-or-the-egg argument of yours is
utter BULLSHIT! Just because Linux doesn't have a decent kernel event
notification mechanism, it does not mean that users don't need one.
Nobody cared about Linus's signalfd because it wasn't mainline.

Look at any of the event notification libraries out there; it makes me
sick how much kludge they have to go through to get anywhere near the
same functionality as kqueue on Linux.

Solaris has had the Event Ports mechanism since 2003. FreeBSD, NetBSD,
OpenBSD and Mac OS X have supported kqueue since around 2000. Windows has
had event notification for ages now. These _facilities_ are all widely
used, given the platforms' popularity.

So here we are, 2007. epoll() works with files, pipes, sockets, inotify
and anything pollable (file descriptors) but aio, timers, signals and
user-defined events. Can we please get those working with epoll?
Something as simple as:

struct epoll_event ev;

ev.events = EV_TIMER | EPOLLONESHOT;
ev.data.u64 = 1000; /* timeout */

epoll_ctl(epfd, EPOLL_CTL_ADD, 0 /* ignored */, &ev);

or

struct sigevent ev;

ev.sigev_notify = SIGEV_EPOLL;
ev.sigev_signo = epfd;
ev.sigev_value = &ev;

timer_create(CLOCK_MONOTONIC, &ev, &timerid);

AIO:

struct sigevent ev;
int fd = io_setup(..); /* oh boy, I wish... but it works */

ev.events = EV_AIO | EPOLLONESHOT;
/* event.data.ptr returns pointer to the iocb */
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

or

struct iocb iocb;

iocb.aio_fildes = fileno(stdin);
iocb.aio_lio_opcode = IO_CMD_PREAD;
iocb.c.notify = IO_NOTIFY_EPOLL; /* __pad3/4 */

Would this be acceptable? Can we finally move on?

--
Kirk Kuchov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03 21:12     ` Ray Lee
  2007-03-03 22:14       ` Ihar `Philips` Filipau
@ 2007-03-04  9:33       ` Michael K. Edwards
  1 sibling, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-03-04  9:33 UTC (permalink / raw)
  To: ray-gmail; +Cc: Ihar `Philips` Filipau, Evgeniy Polyakov, linux-kernel

Please don't take this the wrong way, Ray, but I don't think _you_
understand the problem space that people are (or should be) trying to
address here.

Servers want to always, always block.  Not on a socket, not on a stat,
not on any _one_ thing, but in a condition where the optimum number of
concurrent I/O requests are outstanding (generally of several kinds
with widely varying expected latencies).  I have an embedded server I
wrote that avoids forking internally for any reason, although it
watches the damn serial port signals in parallel with handling network
I/O, audio, and child processes that handle VoIP signaling protocols
(which are separate processes because it was more practical to write
them in a different language with mediocre embeddability).  There's a
lot of things that can block out there, not just disk I/O, but the
only thing a genuinely scalable server process ever blocks on (apart
from the odd spinlock) is a wait-for-IO-from-somewhere mechanism like
select or epoll or kqueue (or even sleep() while awaiting SIGRT+n, or
if it genuinely doesn't suck, the thread scheduler).

Furthermore, not only do servers want to block rather than shove more
I/O into the plumbing than it can handle without backing up, they also
want to throttle the concurrency of requests at the kernel level *for
the kernel's benefit*.  In particular, a server wants to submit to the
kernel a ton of stats and I/O in parallel, far more than it makes
sense to actually issue concurrently, so that efficient sequencing of
these requests can be left to the kernel.  But the server wants to
guide the kernel with regard to the ratios of concurrency appropriate
to the various classes and the relative urgency of the individual
requests within each class.  The server also wants to be able to
reprioritize groups of requests or cancel them altogether based on new
information about hardware status and user behavior.

Finally, the biggest argument against syslets/threadlets AFAICS is
that -- if done incorrectly, as currently proposed -- they would unify
the AIO and normal IO paths in the kernel.  This would shackle AIO to
the current semantics of synchronous syscalls, in which buffers are
passed as bare pointers and exceptional results are tangled up with
programming errors.  This would, in turn, make it quite impossible for
future hardware to pipeline and speculatively execute chains of AIO
operations, leaving "syslets" to a few RDBMS programmers with time to
burn.  The unimproved ease of long term maintenance on the kernel (not
to mention the complete failure to make the writing of _correct_,
performant server code any easier) makes them unworthy of
consideration for inclusion.

So, while everybody has been talking about cached and non-cached
cases, those are really total irrelevancies.  The principal problem
that needs solving is to model the process's pool of in-flight I/O
requests, together with a much larger number of submitted but not yet
issued requests whose results are foreseeably likely to be needed
soon, using a data structure that efficiently supports _all_ of the
operations needed, including bulk cancellation, reprioritization, and
batch migration based on affinities among requests and locality to the
correct I/O resources.  Memory footprint and gentle-on-real-hardware
scheduling are secondary, but also important, considerations.  If you
happen to be able to service certain things directly from cache,
that's gravy -- but it's not very smart IMHO to put that central to
your design process.

Cheers,
- Michael


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03 22:14       ` Ihar `Philips` Filipau
@ 2007-03-03 23:54         ` Ray Lee
  0 siblings, 0 replies; 277+ messages in thread
From: Ray Lee @ 2007-03-03 23:54 UTC (permalink / raw)
  To: Ihar `Philips` Filipau; +Cc: Evgeniy Polyakov, linux-kernel

Ihar `Philips` Filipau wrote:
> On 3/3/07, Ray Lee <madrabbit@gmail.com> wrote:
>> On 3/3/07, Ihar `Philips` Filipau <thephilips@gmail.com> wrote:
>> > What I'm trying to get to: keep things simple. The proposed
>> > optimization by Ingo does nothing but allow AIO to probe the file
>> > cache - if the data is there, go with the fast path. So why not
>> > implement what the people want - probing of the cache? Because it
>> > sounds bad? But they are in fact proposing precisely that, just
>> > masked with "fast threads".
>>
>>
>> Servers want to never, ever block. Not on a socket, not on a stat, not
>> on anything. (I have an embedded server I wrote that has to fork
>> internally just to watch the damn serial port signals in parallel with
>> handling network I/O, audio, and child processes that handle H323.)
>> There's a lot of things that can block out there, and it's not just
>> disk I/O.
>>
> 
> Why do select/poll/epoll and friends not work? I have programmed on both
> sides - user-space network servers and in-kernel network protocols -
> and the "never blocking" thing was implemented in *nix back when I was
> still walking under the table.
> 

Then you've never had to write something that watches serial port
signals. Google on TIOCMIWAIT to see what I'm talking about. The only
option for a userspace programmer to deal with that is to fork() or poll
the signals every so many milliseconds. There are probably more easy
examples, but that's the one off the top of my head that affected me.

In short, this isn't just about network IO, this isn't just about file IO.

> One can poll() more or less *any* device in the system. With the frigging
> exception of - right - files.

The problem is the "more or less." Say you're right, and 95% of the
system calls are either already asynchronous or non-blocking/poll()able.
One of the questions on the table is how to extend it to the last 5%.

> User-space-wise, check how squid (a caching http proxy) does it: you
> have several (forked) instances to serve network requests and you have
> one/several disk I/O daemons (the so-called "diskd storeio"). Why? Because
> you cannot poll() file descriptors, but you can poll a unix socket
> connected to diskd. If diskd blocks, squid can still serve requests.
> How are threadlets better than a pool of diskd instances? All the
> nastiness of shared memory set loose...

Samba/lighttpd/git want to issue dozens of stats in parallel so that the
kernel can have an opportunity to sort them better. Are you saying they
should fork() a process per stat that they want to issue in parallel?

> What I'm trying to get to: threadlets wouldn't help existing
> single-threaded applications - which are about 95% of all applications.

Eh, I don't think that's right. Part of the reason threadlets and
syslets are on the table is that they may be a more efficient way to do
AIO. And the differences between the syslet API and the current kernel
Async IO API can be abstracted away by glibc, so that today's apps that
do AIO would immediately benefit.

> What's more, having some limited experience of kernel programming,
> I fail to see what threadlets would simplify on the kernel side.

You can yank the entire separate AIO path, and just treat it as
another blocking API that syslets make nonblocking. Immediate reduction
of code, and everybody is now using the same code paths, which means
higher test coverage and reduced maintenance cost.

This last point is really important. Even if no extra functionality
eventually makes it to userspace, this last point would still be enough
to make the powers that be consider inclusion.

Ray


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03 21:12     ` Ray Lee
@ 2007-03-03 22:14       ` Ihar `Philips` Filipau
  2007-03-03 23:54         ` Ray Lee
  2007-03-04  9:33       ` Michael K. Edwards
  1 sibling, 1 reply; 277+ messages in thread
From: Ihar `Philips` Filipau @ 2007-03-03 22:14 UTC (permalink / raw)
  To: ray-gmail; +Cc: Evgeniy Polyakov, linux-kernel

On 3/3/07, Ray Lee <madrabbit@gmail.com> wrote:
> On 3/3/07, Ihar `Philips` Filipau <thephilips@gmail.com> wrote:
> > What I'm trying to get to: keep things simple. The proposed
> > optimization by Ingo does nothing but allow AIO to probe the file
> > cache - if the data is there, go with the fast path. So why not
> > implement what the people want - probing of the cache? Because it
> > sounds bad? But they are in fact proposing precisely that, just
> > masked with "fast threads".
>
>
> Servers want to never, ever block. Not on a socket, not on a stat, not
> on anything. (I have an embedded server I wrote that has to fork
> internally just to watch the damn serial port signals in parallel with
> handling network I/O, audio, and child processes that handle H323.)
> There's a lot of things that can block out there, and it's not just
> disk I/O.
>

Why do select/poll/epoll and friends not work? I have programmed on both
sides - user-space network servers and in-kernel network protocols -
and the "never blocking" thing was implemented in *nix back when I was
still walking under the table.

One can poll() more or less *any* device in the system. With the frigging
exception of - right - files. IOW, for 75% of I/O the problem doesn't
exist, since there is a proper interface - e.g. sockets - in place.

User-space-wise, check how squid (a caching http proxy) does it: you
have several (forked) instances to serve network requests and you have
one/several disk I/O daemons (the so-called "diskd storeio"). Why? Because
you cannot poll() file descriptors, but you can poll a unix socket
connected to diskd. If diskd blocks, squid can still serve requests.
How are threadlets better than a pool of diskd instances? All the
nastiness of shared memory set loose...

What I'm trying to get to: threadlets wouldn't help existing
single-threaded applications - which are about 95% of all applications.
And multi-threaded applications would gain little, because few real
applications create threads dynamically: creation needs resources and
can fail, uncontrollable thread spawning hurts overall manageability,
and additional care is needed to guard against deadlocks/lock contention.
(The category of applications which want the performance gain are also
the applications which need to ensure greater stability over long
non-stop runs. Uncontrollable dynamism helps nothing.)

Having implemented several "file servers" - daemons serving file I/O
to other daemons - I honestly see hardly any improvement. Today people
configure such file servers to issue e.g. 10 file operations
simultaneously - using a pool of 10 threads. What do threadlets change?
In the end, just to keep threadlets in check I would need to issue
pthread_join() after some number of threadlets created. And the latter
number is the former "e.g. 10". IOW, programmer-wise the implementation
remains the same - and all the limitations remain the same. And all the
overhead of user-space locking remains the same. (*)

What's more, having some limited experience of kernel programming,
I fail to see what threadlets would simplify on the kernel side. The end
result as I see it: user space becomes a bit more complicated because of
dynamic multi-threading, and kernel space also becomes more complicated
because of the same added dynamism.

(*) Hm... On the other hand, if an application were able to tell the
kernel to limit the number of issued threadlets to N, that might simplify
the job. The application could tell the kernel "I need at most 10 blocking
threadlets, block me if there are more" and then dumbly throw I/O
threadlets at the kernel as they come in. And the kernel would then put
the process to sleep if N+1 threadlets are blocking. That would definitely
simplify the job in user space: it wouldn't need to call pthread_join().
But it is still no replacement for a poll()able file descriptor or a
truly async mmap().

-- 
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
    -- Albert Camus (attributed to)


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03 21:57                                                                         ` Davide Libenzi
@ 2007-03-03 22:10                                                                           ` Davide Libenzi
  2007-03-04 16:23                                                                           ` Kirk Kuchov
  1 sibling, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-03 22:10 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
	Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Sat, 3 Mar 2007, Davide Libenzi wrote:

> Those *other* (tons?!?) interfaces can be created *when* the need comes
> (see the Linus signalfd [1] example to show how urgent that was). *When*
> the need comes, they will work with existing POSIX interfaces, without
> requiring your own just-another event interface. Those other interfaces
> could also be more easily adopted by other Unix cousins, because
> they rely on existing POSIX interfaces. One of the points of
> the Unix file abstraction is that you do *not* have to plan and
> bloat interfaces up front. As long as your new abstraction behaves
> in a file fashion, it can be automatically used with existing interfaces.
> And you create them *when* the need comes.

Now, if you don't mind, my spare time is really limited and I prefer to
spend it looking at the stuff the topic of this thread talks about.
Especially because the whole epoll/kevent discussion depends heavily on
whether syslets/threadlets will or will not turn out to be a viable
method for generic AIO. Savvy?



- Davide




* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03 20:31                                                                       ` Evgeniy Polyakov
@ 2007-03-03 21:57                                                                         ` Davide Libenzi
  2007-03-03 22:10                                                                           ` Davide Libenzi
  2007-03-04 16:23                                                                           ` Kirk Kuchov
  0 siblings, 2 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-03 21:57 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
	Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Sat, 3 Mar 2007, Evgeniy Polyakov wrote:

> > I was referring to dropping an event directly to a userspace buffer, from 
> > the poll callback. If pages are not there, you might sleep, and you can't 
> > since the wakeup function holds a spinlock on the waitqueue head while 
> > looping through the waiters to issue the wakeup. Also, you don't know from 
> > where the poll wakeup is called.
> 
> Ugh, no, that is a very limited solution - memory must be either pinned
> (which leads to DoS and a limited ring buffer), or the callback must
> sleep. In any case there _must_ exist a queue - if the ring buffer is
> full, an event is not allowed to be dropped - it must be stored in some
> other place, for example in a queue from which entries will be copied
> into the ring buffer as slots become available (that is how it is
> implemented in kevent at least).

I was not advocating that, if you read carefully. The fact that epoll
does not do that should be a clear hint. The old /dev/epoll IIRC was only
10% faster than the current epoll under a *heavy* event frequency
micro-bench like pipetest (and that version of epoll did not have the
single pass over the ready set optimization). And /dev/epoll was
delivering events *directly* into userspace-visible (mmaped) memory in a
zero-copy fashion.




> > BTW, Linus made a signalfd sketch code time ago, to deliver signals to an 
> > fd. Code remained there and nobody cared. Question: Was it because
> > 1) it had file bindings or 2) because nobody really cared to deliver 
> > signals to an event collector?
> > And *if* later requirements come, you don't need to change the API by 
> > adding an XXEVENT_SIGNAL_ADD or XXEVENT_TIMER_ADD, or creating a new 
> > XXEVENT-only submission structure. You create an API that automatically 
> > makes that new abstraction work with POSIX poll/select, and you get epoll 
> > support for free. Without even changing a bit in the epoll API.
> 
> Well, we get epoll support for free, but we need to create tons of other
> interfaces and infrastructure for kernel users, and we need to change 
> userspace anyway.

Those *other* (tons?!?) interfaces can be created *when* the need comes
(see the Linus signalfd [1] example to show how urgent that was). *When*
the need comes, they will work with existing POSIX interfaces, without
requiring your own just-another event interface. Those other interfaces
could also be more easily adopted by other Unix cousins, because
they rely on existing POSIX interfaces. One of the points of
the Unix file abstraction is that you do *not* have to plan and
bloat interfaces up front. As long as your new abstraction behaves
in a file fashion, it can be automatically used with existing interfaces.
And you create them *when* the need comes.




[1] That was like 100 lines of code or so. See here:

    http://tinyurl.com/3yuna5



- Davide




* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03 10:58   ` Ihar `Philips` Filipau
  2007-03-03 11:03     ` Evgeniy Polyakov
@ 2007-03-03 21:12     ` Ray Lee
  2007-03-03 22:14       ` Ihar `Philips` Filipau
  2007-03-04  9:33       ` Michael K. Edwards
  1 sibling, 2 replies; 277+ messages in thread
From: Ray Lee @ 2007-03-03 21:12 UTC (permalink / raw)
  To: Ihar `Philips` Filipau; +Cc: Evgeniy Polyakov, linux-kernel

On 3/3/07, Ihar `Philips` Filipau <thephilips@gmail.com> wrote:
> What I'm trying to get to: keep things simple. The proposed
> optimization by Ingo does nothing but allow AIO to probe the file
> cache - if the data is there, go with the fast path. So why not
> implement what the people want - probing of the cache? Because it
> sounds bad? But they are in fact proposing precisely that, just
> masked with "fast threads".

Please don't take this the wrong way, but I don't think you understand
the problem space that people are trying to address here.

Servers want to never, ever block. Not on a socket, not on a stat, not
on anything. (I have an embedded server I wrote that has to fork
internally just to watch the damn serial port signals in parallel with
handling network I/O, audio, and child processes that handle H323.)
There's a lot of things that can block out there, and it's not just
disk I/O.

Further, not only do servers not want to block, they also want to cram
a lot more requests into the kernel at once *for the kernel's
benefit*. In particular, a server wants to issue a ton of stats and
I/O in parallel so that the kernel can optimize which order to handle
the requests.

Finally, the biggest argument in favor of syslets/threadlets AFAICS is
that -- if done correctly -- it would unify the AIO and normal IO
paths in the kernel. The improved ease of long term maintenance on the
kernel (and more test coverage, and more directed optimization,
etc...) just for this point alone makes them worth considering for
inclusion.

So, while everybody has been talking about cached and non-cached
cases, those are really special cases of the entire package that the
rest of us want.

Ray


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03 18:46                                                                     ` Davide Libenzi
@ 2007-03-03 20:31                                                                       ` Evgeniy Polyakov
  2007-03-03 21:57                                                                         ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-03 20:31 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
	Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Sat, Mar 03, 2007 at 10:46:59AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Sat, 3 Mar 2007, Evgeniy Polyakov wrote:
> 
> > > You've to excuse me if my memory is bad, but IIRC the whole discussion 
> > > and loong benchmark feast born with you throwing a benchmark at Ingo 
> > > (with kevent showing a 1.9x performance boost WRT epoll), not with you 
> > > making any other point.
> > 
> > So, how does it sound?
> > "Threadlets are bad for IO because kevent is 2 times faster than epoll?"
> > 
> > I said threadlets are bad for IO (and we agreed that both approaches
> > should be used for maximum performance) because of rescheduling overhead -
> > tasks are quite heavy structures to move around - even the pt_regs copy
> > takes more than an event structure - but not because there is something
> > in one galaxy which might work faster than something in another galaxy.
> > That was stupid even to think about.
> 
Evgeniy, other folks on this thread read what you said, so let's not drag
this out.
 
Sure, I was wrong to start this again, but try to see my position - I
am really tired of trying to prove that I'm not a camel just because we
had some misunderstanding at the start.

I do think that threadlets are a really cool solution and indeed a very
good approach for the majority of parallel processing, but my point is
still that they are not a perfect solution for all tasks.

Just to draw a line: the kevent example is an extrapolation of what can
be achieved with an event-driven model, but that does not mean it must
be the _only_ model used for AIO - threadlets _and_ the event-driven
model (yes, I accepted Ingo's point about its decline) are the best
solution.
 
> > > And if you really feel raw about the single O(nready) loop that epoll
> > > currently does, a new epoll_wait2 (or whatever) API could be used to
> > > deliver the event directly into a userspace buffer [1], directly from the
> > > poll callback, w/out extra delivery loops (IRQ/event->epoll_callback->event_buffer).
> >
> > > [1] From the epoll callback, we cannot sleep, so it's gonna be either an 
> > >     mlocked userspace buffer, or some kernel pages mapped to userspace.
> > 
> > Callbacks never sleep - they add the event into a list just like the current
> > implementation (maybe some lock must be changed from a mutex to a spinlock,
> > I do not remember); the main problem is the binding to the file structure,
> > which is heavy.
> 
> I was referring to dropping an event directly to a userspace buffer, from 
> the poll callback. If pages are not there, you might sleep, and you can't 
> since the wakeup function holds a spinlock on the waitqueue head while 
> looping through the waiters to issue the wakeup. Also, you don't know from 
> where the poll wakeup is called.

Ugh, no, that is a very limited solution - memory must either be pinned
(which opens a DoS and limits the ring buffer), or the callback must sleep.
Actually, either way there _must_ be a queue - if the ring buffer is full,
an event is not allowed to be dropped; it must be stored in some other
place, for example a queue from which entries will be copied into the
ring buffer as it frees up entries (that is how it is implemented in
kevent, at least).

> File binding heavy? The first, and by *far* the biggest, source of events 
> inside an event collector, for someone who cares about scalability, is 
> sockets. And those are already files. Second would be AIO, and those (if 
> the performance figures agree) can be hosted inside syslets/threadlets.
> Then you fall into the no-care category, where the extra 100 bytes do not 
> make a case against the ability to use it with the existing POSIX 
> infrastructure (poll/select).

Well, sockets are files indeed, and sockets are already perfectly
handled by epoll - but there are other users of the potential interface,
and it must be designed to scale very well in _any_ situation.
Even if we do not have problems with some types of events right now, we
must scale with any new ones.

> BTW, Linus made a signalfd code sketch some time ago, to deliver signals to an 
> fd. The code remained there and nobody cared. Question: was it because
> 1) it had file bindings, or 2) because nobody really cared to deliver 
> signals to an event collector?
> And *if* later requirements come, you don't need to change the API by 
> adding an XXEVENT_SIGNAL_ADD or XXEVENT_TIMER_ADD, or by creating a new 
> XXEVENT-only submission structure. You create an API that automatically 
> makes the new abstraction work with POSIX poll/select, and you get epoll 
> support for free. Without even changing a bit in the epoll API.

Well, we get epoll support for free, but we need to create tons of other
interfaces and infrastructure for kernel users, and we need to change
userspace anyway.
But epoll support requires quite heavy bindings to the file structure,
so why don't we design the new interface (since we need to change
userspace anyway) so that it can scale and be very memory-optimized
from the beginning?

> - Davide
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03 10:06                                                                   ` Evgeniy Polyakov
@ 2007-03-03 18:46                                                                     ` Davide Libenzi
  2007-03-03 20:31                                                                       ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-03 18:46 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
	Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Sat, 3 Mar 2007, Evgeniy Polyakov wrote:

> You've to excuse me if my memory is bad, but IIRC the whole discussion 
> and long benchmark feast was born with you throwing a benchmark at Ingo 
> (with kevent showing a 1.9x performance boost WRT epoll), not with you 
> making any other point.
> 
> So, how does it sound?
> "Threadlets are bad for IO because kevent is 2 times faster than epoll?"
> 
> I said threadlets are bad for IO (and we agreed that both approaches
> should be used for the maximum performance) because of rescheduling overhead -
> tasks are quite heavy structures to move around - even a pt_regs copy
> takes more than an event structure - but not because there is something in one
> galaxy which might work faster than another something in another galaxy.
> It would be stupid to even think that.

Evgeniy, other folks on this thread read what you said, so let's not drag 
this out.



> > And if you really feel raw about the single O(nready) loop that epoll
> > currently does, a new epoll_wait2 (or whatever) API could be used to
> > deliver the event directly into a userspace buffer [1], directly from the
> > poll callback, w/out extra delivery loops (IRQ/event->epoll_callback->event_buffer).
>
> > [1] From the epoll callback, we cannot sleep, so it's gonna be either an 
> >     mlocked userspace buffer, or some kernel pages mapped to userspace.
> 
> Callbacks never sleep - they add the event into a list just like the current
> implementation (maybe some lock must be changed from a mutex to a spinlock,
> I do not remember); the main problem is the binding to the file structure,
> which is heavy.

I was referring to dropping an event directly to a userspace buffer, from 
the poll callback. If pages are not there, you might sleep, and you can't 
since the wakeup function holds a spinlock on the waitqueue head while 
looping through the waiters to issue the wakeup. Also, you don't know from 
where the poll wakeup is called.
File binding heavy? The first, and by *far* the biggest, source of events 
inside an event collector, for someone who cares about scalability, is 
sockets. And those are already files. Second would be AIO, and those (if 
the performance figures agree) can be hosted inside syslets/threadlets.
Then you fall into the no-care category, where the extra 100 bytes do not 
make a case against the ability to use it with the existing POSIX 
infrastructure (poll/select).
BTW, Linus made a signalfd code sketch some time ago, to deliver signals to an 
fd. The code remained there and nobody cared. Question: was it because
1) it had file bindings, or 2) because nobody really cared to deliver 
signals to an event collector?
And *if* later requirements come, you don't need to change the API by 
adding an XXEVENT_SIGNAL_ADD or XXEVENT_TIMER_ADD, or by creating a new 
XXEVENT-only submission structure. You create an API that automatically 
makes the new abstraction work with POSIX poll/select, and you get epoll 
support for free. Without even changing a bit in the epoll API.



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03 11:03     ` Evgeniy Polyakov
@ 2007-03-03 11:51       ` Ihar `Philips` Filipau
  0 siblings, 0 replies; 277+ messages in thread
From: Ihar `Philips` Filipau @ 2007-03-03 11:51 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: linux-kernel

On 3/3/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > >Threadlets can work with any function as a base - if it were
> > >recv-like it would limit the possible cases for parallel programming; so you
> > >can code anything in threadlets - it is not only about IO.
> >
> > What I'm trying to get at: keep things simple. The proposed
> > optimization by Ingo does nothing else but allow AIO to probe the file
> > cache - if the data is there, go with the fast path. So why not implement
> > what people actually want - probing of the cache? Because it sounds bad? But
> > they are in fact proposing precisely that, just masked as "fast
> > threads".
>
> There can be other parts than just plain recv/read syscalls - you can
> create a logical processing entity, and if it blocks (as a whole, no
> matter where), the whole processing will continue as a new thread.
> And having a different syscall to warm the cache can end up with a cache
> flush between the warming and the processing itself.
>

I'm not talking about cache warm-up. And if we do - and that is the whole
freaking point of AIO - Linux IIRC pins freshly loaded clean pages
anyway. So there would be a problem only under memory pressure. And if
you are under memory pressure, you have already lost the game and do not
care about performance or about which threads you are using.

It is the whole "threadlets to threads on blocking" thing that doesn't
sound convincing. It sounds more like "premature optimization". But
anyway, it's not that I'm an AIO specialist. For networking it is totally
unnecessary, since most applications which care already have rate
control and buffer management built in. Network connections/sockets
allow a greater level of application control over what they do and how.
Compare that to blockdev's plain dumb read()/write() going through a global
cache. And (judging from the interface) AIO doesn't change that much -
it is still a dumb read(), which IMHO makes no sense whatsoever in
mmap()-oriented Linux.

-- 
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
    -- Albert Camus (attributed to)

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03 10:58   ` Ihar `Philips` Filipau
@ 2007-03-03 11:03     ` Evgeniy Polyakov
  2007-03-03 11:51       ` Ihar `Philips` Filipau
  2007-03-03 21:12     ` Ray Lee
  1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-03 11:03 UTC (permalink / raw)
  To: Ihar `Philips` Filipau; +Cc: linux-kernel

On Sat, Mar 03, 2007 at 11:58:17AM +0100, Ihar `Philips` Filipau (thephilips@gmail.com) wrote:
> On 3/3/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> >On Fri, Mar 02, 2007 at 08:20:26PM +0100, Ihar `Philips` Filipau 
> >(thephilips@gmail.com) wrote:
> >> I'm not well versed in modern kernel development discussions, and
> >> since you have put the thing into the networked context anyway, can
> >> you please ask on lkml why (if they want threadlets solely for AIO)
> >> not to implement analogue of recv(*filedes*, b, l, MSG_DONTWAIT).
> >> Developers already know the interface, socket infrastructure is already
> >> in kernel, etc. And it might do precisely what they want: access file
> >> in disk cache - just like in case of socket it does access recv buffer
> >> of socket. Why bother with implicit threads/waiting/etc - if all they
> >> want some way to probe cache?
> >
> >Threadlets can work with any function as a base - if it were
> >recv-like it would limit the possible cases for parallel programming; so you
> >can code anything in threadlets - it is not only about IO.
> >
> 
> Ingo defined them as "plain function calls as long as they do not block".
> 
> But when/what function could block?
> 
> (1) File descriptors. Read: if the data is in the cache, it won't block;
> otherwise it will. Write: if there is space in the cache, it won't block;
> otherwise it will.
> 
> (2) Network sockets. Recv: if data is in the buffer, it won't block;
> otherwise it will. Send: if there is space in the send buffer, it won't
> block; otherwise it will.
> 
> (3) Pipes, fifos & unix sockets. These unfortunately gain nothing, since
> reliable local communication is used mostly for passing control
> information. If you have to block on such a socket, it is most likely
> important information anyway (e.g. X server communication, or an SQL
> query to an SQL server; or, even less important here, the case of shell
> pipes). And most users here are single-threaded and I/O bound: they would
> gain nothing from multi-threading - only the PITA of added locking.
> 
> What I'm trying to get at: keep things simple. The proposed
> optimization by Ingo does nothing else but allow AIO to probe the file
> cache - if the data is there, go with the fast path. So why not implement
> what people actually want - probing of the cache? Because it sounds bad? But
> they are in fact proposing precisely that, just masked as "fast
> threads".

There can be other parts than just plain recv/read syscalls - you can
create a logical processing entity, and if it blocks (as a whole, no
matter where), the whole processing will continue as a new thread.
And having a different syscall to warm the cache can end up with a cache
flush between the warming and the processing itself.
 
> -- 
> Don't walk behind me, I may not lead.
> Don't walk in front of me, I may not follow.
> Just walk beside me and be my friend.
>    -- Albert Camus (attributed to)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
       [not found] ` <20070303103427.GC22557@2ka.mipt.ru>
@ 2007-03-03 10:58   ` Ihar `Philips` Filipau
  2007-03-03 11:03     ` Evgeniy Polyakov
  2007-03-03 21:12     ` Ray Lee
  0 siblings, 2 replies; 277+ messages in thread
From: Ihar `Philips` Filipau @ 2007-03-03 10:58 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: linux-kernel

On 3/3/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> On Fri, Mar 02, 2007 at 08:20:26PM +0100, Ihar `Philips` Filipau (thephilips@gmail.com) wrote:
> > I'm not well versed in modern kernel development discussions, and
> > since you have put the thing into the networked context anyway, can
> > you please ask on lkml why (if they want threadlets solely for AIO)
> > not to implement analogue of recv(*filedes*, b, l, MSG_DONTWAIT).
> > Developers already know the interface, socket infrastructure is already
> > in kernel, etc. And it might do precisely what they want: access file
> > in disk cache - just like in case of socket it does access recv buffer
> > of socket. Why bother with implicit threads/waiting/etc - if all they
> > want some way to probe cache?
>
> Threadlets can work with any function as a base - if it were
> recv-like it would limit the possible cases for parallel programming; so you
> can code anything in threadlets - it is not only about IO.
>

Ingo defined them as "plain function calls as long as they do not block".

But when/what function could block?

(1) File descriptors. Read: if the data is in the cache, it won't block;
otherwise it will. Write: if there is space in the cache, it won't block;
otherwise it will.

(2) Network sockets. Recv: if data is in the buffer, it won't block;
otherwise it will. Send: if there is space in the send buffer, it won't
block; otherwise it will.

(3) Pipes, fifos & unix sockets. These unfortunately gain nothing, since
reliable local communication is used mostly for passing control
information. If you have to block on such a socket, it is most likely
important information anyway (e.g. X server communication, or an SQL
query to an SQL server; or, even less important here, the case of shell
pipes). And most users here are single-threaded and I/O bound: they would
gain nothing from multi-threading - only the PITA of added locking.

What I'm trying to get at: keep things simple. The proposed
optimization by Ingo does nothing else but allow AIO to probe the file
cache - if the data is there, go with the fast path. So why not implement
what people actually want - probing of the cache? Because it sounds bad? But
they are in fact proposing precisely that, just masked as "fast
threads".


-- 
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
    -- Albert Camus (attributed to)

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 17:28                                             ` Davide Libenzi
@ 2007-03-03 10:27                                               ` Evgeniy Polyakov
  0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-03 10:27 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Fri, Mar 02, 2007 at 09:28:10AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Fri, 2 Mar 2007, Evgeniy Polyakov wrote:
> 
> > do we really want to have per process signalfs, timerfs and so on - each 
> > simple structure must be bound to a file, which becomes too cost.
> 
> I may be old school, but if you ask me, and if you *really* want those 
> events, yes. Reason? Unix's everything-is-a-file rule, and being able to 
> use them with *existing* POSIX poll/select. Remember, not every app 
> requires huge scalability efforts, so working with simpler and familiar 
> APIs is always welcome.
> The *only* thing that was not practical to have as fd, was block requests. 
> But maybe threadlets/syslets will handle those just fine, and close the gap.
 
That means we bind a very small object like a timer or a signal to the
whole file structure - yes, as I stated, it is doable, but do we really
have to create a file each time create_timer() or signal() is called?
Signals as a filesystem are limited in the regard that we need to
create additional structures to hold the signal number<->private data
relations.
I designed kevent to be as small as possible, so I removed the
file-binding idea first. I do not say it is wrong or that epoll (and
threadlets) are broken (fsck, I hope people do understand that), but as-is
it cannot handle that scenario, so it must be extended and/or a lot of
other stuff written to be compatible with the epoll design. Kevent has a
different design (which nevertheless allows working with the old one -
there is a patch to implement epoll over kevent).
 
> - Davide
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 17:13                                                                 ` Davide Libenzi
  2007-03-02 19:13                                                                   ` Davide Libenzi
@ 2007-03-03 10:06                                                                   ` Evgeniy Polyakov
  2007-03-03 18:46                                                                     ` Davide Libenzi
  1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-03 10:06 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
	Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Fri, Mar 02, 2007 at 09:13:40AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Fri, 2 Mar 2007, Evgeniy Polyakov wrote:
> 
> > On Thu, Mar 01, 2007 at 11:31:14AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > > On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:
> > > 
> > > > Ingo, do you really think I will send mails with faked benchmarks? :))
> > > 
> > > I don't think he ever implied that. He was only suggesting that when you 
> > > post benchmarks, and even more when you make claims based on benchmarks, 
> > > you need to be extra careful about what you measure. Otherwise the 
> > > external view that you give to others does not look good.
> > > Kevent can be really faster than epoll, but if you post broken benchmarks 
> > > (that can be unreliable HTTP loaders, broken server implementations, 
> > > etc.) and make claims based on that, the only effect you have is to 
> > > lose your point.
> >  
> > So, I only said that kevent is superior to epoll because (and
> > it is the _main_ issue) of its ability to handle essentially any kind
> > of event with very small overhead (the same as epoll has in struct
> > file - a list and a spinlock) and without the significant price of
> > binding a struct file to each event.
> 
> You've to excuse me if my memory is bad, but IIRC the whole discussion 
> and long benchmark feast was born with you throwing a benchmark at Ingo 
> (with kevent showing a 1.9x performance boost WRT epoll), not with you 
> making any other point.

So, how does it sound?
"Threadlets are bad for IO because kevent is 2 times faster than epoll?"

I said threadlets are bad for IO (and we agreed that both approaches
should be used for the maximum performance) because of rescheduling overhead -
tasks are quite heavy structures to move around - even a pt_regs copy
takes more than an event structure - but not because there is something in one
galaxy which might work faster than another something in another galaxy.
It would be stupid to even think that.

> As far as epoll not being able to handle other events: said who? Of 
> course, with zero modifications, you can handle zero additional events. 
> With modifications, you can handle other events. But let's talk about those 
> other events. The *only* kind of event that people (and being the epoll 
> maintainer I tend to receive those requests) missed in epoll was AIO 
> events. That's the *only* thing that was missed by real-life application 
> developers. And if something like threadlets/syslets proves effective, 
> the gap is closed WRT that requirement.
> Epoll already handles the whole class of pollable devices inside the 
> kernel, and if you exclude block AIO, that's a pretty wide class already. 
> The *existing* f_op->poll subsystem can be used to deliver events at the 
> poll-head wakeup time (by using the "key" member of the poll callback), so 
> that you don't even need the extra f_op->poll call to fetch events.
> And if you really feel raw about the single O(nready) loop that epoll 
> currently does, a new epoll_wait2 (or whatever) API could be used to 
> deliver the event directly into a userspace buffer [1], directly from the 
> poll callback, w/out extra delivery loops (IRQ/event->epoll_callback->event_buffer).

Signals, futexes, timers and userspace events are what I was requested to
add into kevent; so far only futexes are missing, because I was asked to
freeze development so other hackers could check the project.

> 
> [1] From the epoll callback, we cannot sleep, so it's gonna be either an 
>     mlocked userspace buffer, or some kernel pages mapped to userspace.

Callbacks never sleep - they add the event into a list just like the current
implementation (maybe some lock must be changed from a mutex to a spinlock,
I do not remember); the main problem is the binding to the file structure,
which is heavy.

> - Davide
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03  7:19                                                                                     ` Ingo Molnar
  2007-03-03  7:20                                                                                       ` Ingo Molnar
@ 2007-03-03  9:02                                                                                       ` Davide Libenzi
  1 sibling, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-03  9:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nicholas Miell, Linux Kernel Mailing List, Arjan van de Ven,
	Linus Torvalds

On Sat, 3 Mar 2007, Ingo Molnar wrote:

> * Davide Libenzi <davidel@xmailserver.org> wrote:
> 
> > [...] Status word and control bits should not be changed from 
> > underneath userspace AFAIK. [...]
> 
> Note that the control bits do not just magically change during normal 
> FPU use. It's a bit like sys_setsid()/iopl/etc., it makes little sense 
> to change those per-thread anyway. This is a non-issue anyway - what is 
> important is that the big bulk of 512 (or more) bytes of FPU state /are/ 
> callee-saved (both on 32-bit and on 64-bit), hence there's no need to 
> unlazy anything or to do expensive FPU state saves or other FPU juggling 
> around threadlet (or even syslet) use.

Well, the unlazy/sync happens in any case later when we switch (given 
TS_USEDFPU is set). We'd avoid a copy of it given the above conditions are 
true. Wouldn't it make sense to carry over only the status word and the 
control bits eventually?
Also, if the caller saves the whole context, and if we're scheduled while 
inside a system call (not a totally infrequent case), can't we implement a 
smarter unlazy_fpu that avoids the fxsave during schedule-out and the frstor 
after schedule-in (and does not do stts in this condition, so the newly 
scheduled task doesn't get a fault at all)? If the above conditions are true 
(no context-copy needed for the new head in async_exec), this should be 
possible too.


- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03  7:19                                                                                     ` Ingo Molnar
@ 2007-03-03  7:20                                                                                       ` Ingo Molnar
  2007-03-03  9:02                                                                                       ` Davide Libenzi
  1 sibling, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-03  7:20 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Nicholas Miell, Linux Kernel Mailing List, Arjan van de Ven,
	Linus Torvalds


* Ingo Molnar <mingo@elte.hu> wrote:

> Note that the control bits do not just magically change during normal 
> FPU use. It's a bit like sys_setsid()/iopl/etc., it makes little sense 
> to change those per-thread anyway. This is a non-issue anyway - what is 
> important is that the big bulk of 512 (or more) bytes of FPU state /are/ 
> callee-saved (both on 32-bit and on 64-bit), hence there's no need to 
     ^---- caller-saved
> unlazy anything or to do expensive FPU state saves or other FPU juggling 
> around threadlet (or even syslet) use.


^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03  2:19                                                                                   ` Davide Libenzi
@ 2007-03-03  7:19                                                                                     ` Ingo Molnar
  2007-03-03  7:20                                                                                       ` Ingo Molnar
  2007-03-03  9:02                                                                                       ` Davide Libenzi
  0 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-03  7:19 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Nicholas Miell, Linux Kernel Mailing List, Arjan van de Ven,
	Linus Torvalds


* Davide Libenzi <davidel@xmailserver.org> wrote:

> [...] Status word and control bits should not be changed from 
> underneath userspace AFAIK. [...]

Note that the control bits do not just magically change during normal 
FPU use. It's a bit like sys_setsid()/iopl/etc., it makes little sense 
to change those per-thread anyway. This is a non-issue anyway - what is 
important is that the big bulk of 512 (or more) bytes of FPU state /are/ 
callee-saved (both on 32-bit and on 64-bit), hence there's no need to 
unlazy anything or to do expensive FPU state saves or other FPU juggling 
around threadlet (or even syslet) use.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03  1:36                                                                                 ` Nicholas Miell
  2007-03-03  1:48                                                                                   ` Benjamin LaHaise
@ 2007-03-03  2:19                                                                                   ` Davide Libenzi
  2007-03-03  7:19                                                                                     ` Ingo Molnar
  1 sibling, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-03  2:19 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds

On Fri, 2 Mar 2007, Nicholas Miell wrote:

> On Fri, 2007-03-02 at 16:52 -0800, Davide Libenzi wrote:
> > On Fri, 2 Mar 2007, Nicholas Miell wrote:
> > 
> > > The point Ingo was making is that the x86 ABI already requires the FPU
> > > context to be saved before *all* function calls.
> > 
> > I've not seen that among Ingo's points, but yeah some status is caller 
> > saved. But, aren't things like status word and control bits callee saved? 
> > If that's the case, it might require proper handling.
> > 
> 
> Ingo mentioned it in one of the parts you cut out of your reply:
> 
> > and here is where thinking about threadlets as a function call and not 
> > as an asynchronous context helps a lot: the classic gcc convention for 
> > FPU use & function calls should apply: gcc does not call an external 
> > function with an in-use FPU stack/register, it always neatly unuses it, 
> > as no FPU register is callee-saved, all are caller-saved.
> 
> The i386 psABI is ancient (i.e. it predates SSE, so no mention of the
> XMM or MXCSR registers) and a bit vague (no mention at all of the FP
> status word), but I'm fairly certain that Ingo is right.

I'm not sure if that's the case. I'd be happy if it were, but I'm afraid 
it's not. The status word and control bits should not be changed from 
underneath userspace AFAIK. The ABI I remember tells me that those are 
callee-saved. A quick gcc asm test tells me that too.
And assuming that's the case, why don't we have a smarter unlazy_fpu() 
then, one that avoids the FPU context sync if we're scheduled while inside 
a syscall (this is no different than an enter inside sys_async_exec - 
userspace should have taken care of it)?
IMO a syscall enter should not assume that userspace took care of saving 
the whole FPU context.



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03  1:36                                                                                 ` Nicholas Miell
@ 2007-03-03  1:48                                                                                   ` Benjamin LaHaise
  2007-03-03  2:19                                                                                   ` Davide Libenzi
  1 sibling, 0 replies; 277+ messages in thread
From: Benjamin LaHaise @ 2007-03-03  1:48 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: Davide Libenzi, Ingo Molnar, Linux Kernel Mailing List,
	Arjan van de Ven, Linus Torvalds

On Fri, Mar 02, 2007 at 05:36:01PM -0800, Nicholas Miell wrote:
> > as an asynchronous context helps alot: the classic gcc convention for 
> > FPU use & function calls should apply: gcc does not call an external 
> > function with an in-use FPU stack/register, it always neatly unuses it, 
> > as no FPU register is callee-saved, all are caller-saved.
> 
> The i386 psABI is ancient (i.e. it predates SSE, so no mention of the
> XMM or MXCSR registers) and a bit vague (no mention at all of the FP
> status word), but I'm fairly certain that Ingo is right.

The FPU status word *must* be saved, as the rounding behaviour and error mode 
bits are assumed to be preserved. IOW, yes, there is state which is required.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <zyntrop@kvack.org>.

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-03  0:52                                                                               ` Davide Libenzi
@ 2007-03-03  1:36                                                                                 ` Nicholas Miell
  2007-03-03  1:48                                                                                   ` Benjamin LaHaise
  2007-03-03  2:19                                                                                   ` Davide Libenzi
  0 siblings, 2 replies; 277+ messages in thread
From: Nicholas Miell @ 2007-03-03  1:36 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds

On Fri, 2007-03-02 at 16:52 -0800, Davide Libenzi wrote:
> On Fri, 2 Mar 2007, Nicholas Miell wrote:
> 
> > The point Ingo was making is that the x86 ABI already requires the FPU
> > context to be saved before *all* function calls.
> 
> I've not seen that among Ingo's points, but yeah some status is caller 
> saved. But, aren't things like status word and control bits callee saved? 
> If that's the case, it might require proper handling.
> 

Ingo mentioned it in one of the parts you cut out of your reply:

> and here is where thinking about threadlets as a function call and not 
> as an asynchronous context helps a lot: the classic gcc convention for 
> FPU use & function calls should apply: gcc does not call an external 
> function with an in-use FPU stack/register, it always neatly unuses it, 
> as no FPU register is callee-saved, all are caller-saved.

The i386 psABI is ancient (i.e. it predates SSE, so no mention of the
XMM or MXCSR registers) and a bit vague (no mention at all of the FP
status word), but I'm fairly certain that Ingo is right.


-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 21:43                                                                             ` Nicholas Miell
@ 2007-03-03  0:52                                                                               ` Davide Libenzi
  2007-03-03  1:36                                                                                 ` Nicholas Miell
  0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-03  0:52 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds

On Fri, 2 Mar 2007, Nicholas Miell wrote:

> The point Ingo was making is that the x86 ABI already requires the FPU
> context to be saved before *all* function calls.

I've not seen that among Ingo's points, but yeah some status is caller 
saved. But, aren't things like status word and control bits callee saved? 
If that's the case, it might require proper handling.



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 20:53                                                                           ` Davide Libenzi
  2007-03-02 21:21                                                                             ` Michael K. Edwards
@ 2007-03-02 21:43                                                                             ` Nicholas Miell
  2007-03-03  0:52                                                                               ` Davide Libenzi
  1 sibling, 1 reply; 277+ messages in thread
From: Nicholas Miell @ 2007-03-02 21:43 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds

On Fri, 2007-03-02 at 12:53 -0800, Davide Libenzi wrote:
> On Fri, 2 Mar 2007, Ingo Molnar wrote:
> 
> > 
> > * Davide Libenzi <davidel@xmailserver.org> wrote:
> > 
> > > I think that the "dirty" FPU context must, at least, follow the new 
> > > head. That's what the userspace sees, and you don't want an async_exec 
> > > to re-emerge with a different FPU context.
> > 
> > well. I think there's some confusion about terminology, so please let me 
> > describe everything in detail. This is how execution goes:
> > 
> >   outer loop() {
> >       call_threadlet();
> >   }
> > 
> > this all runs in the 'head' context. call_threadlet() always switches to 
> > the 'threadlet stack'. The 'outer context' runs in the 'head stack'. If, 
> > while executing the threadlet function, we block, then the 
> > threadlet-thread gets to keep the task (the threadlet stack and also the 
> > FPU), and blocks - and we pick a 'new head' from the thread pool and 
> > continue executing in that context - right after the call_threadlet() 
> > function, in the 'old' head's stack. I.e. it's as if we returned 
> > immediately from call_threadlet(), with a return code that signals that 
> > the 'threadlet went async'.
> > 
> > now, the FPU state that was live when the threadlet blocked is totally 
> > meaningless to the 'new head' - that FPU state is from the middle of the 
> > threadlet execution.
> 
> For threadlets, it might be. Now think about a task wanting to dispatch N 
> parallel AIO requests as N independent syslets.
> Think about this task having USEDFPU set, so the FPU context is dirty.
> When it returns from async_exec, with one of the requests becoming 
> sleepy, it needs to have the same FPU context it had when it entered, 
> otherwise it probably won't be happy.
> For the same reason a schedule() must preserve/sync the "prev" FPU 
> context, to be reloaded at the next FPU fault.

The point Ingo was making is that the x86 ABI already requires the FPU
context to be saved before *all* function calls.

Unfortunately, this isn't true of other ABIs -- looking over the psABI
specs I have lying around, AMD64, PPC64, and MIPS require at least part
of the FPU state to be preserved across function calls, and I'm sure
this is also true of others.

Then there are the other nasty details of new thread creation --
thankfully, the contents of the TLS aren't inherited from the parent
thread, but it still needs to be initialized; not to mention all the
other details involved in pthread creation and destruction.

I don't see any way around the pthread issues other than making a libc
upcall on return from the first system call that blocked.

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 20:53                                                                           ` Davide Libenzi
@ 2007-03-02 21:21                                                                             ` Michael K. Edwards
  2007-03-02 21:43                                                                             ` Nicholas Miell
  1 sibling, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-03-02 21:21 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds

On 3/2/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> For threadlets, it might be. Now think about a task wanting to dispatch N
> parallel AIO requests as N independent syslets.
> Think about this task having USEDFPU set, so the FPU context is dirty.
> When it returns from async_exec, with one of the requests becoming
> sleepy, it needs to have the same FPU context it had when it entered,
> otherwise it probably won't be happy.
> For the same reason a schedule() must preserve/sync the "prev" FPU
> context, to be reloaded at the next FPU fault.

And if you actually think this through, I think you will arrive at (a
subset of) the conclusions I did a week ago: to keep the threadlets
lightweight enough to schedule and migrate cheaply, they can't be
allowed to "own" their own FPU and TLS context.  They have to be
allowed to _use_ the FPU (or they're useless) and to _use_ TLS (or
they can't use any glibc wrapper around a syscall, since they
practically all set the thread-local errno).  But they have to
"quiesce" the FPU and stash any thread-local state they want to keep
on their stack before entering the next syscall, or else it'll get
clobbered.

Keep thinking, especially about FPU flags, and you'll see why
threadlets spawned from the _same_ threadlet entrypoint should all run
in the same pool of threads, one per CPU, while threadlets from
_different_ entrypoints should never run in the same thread (FPU/TLS
context).  You'll see why threadlets in the same pool shouldn't be
permitted to preempt one another except at syscalls that block, and
the cost of preempting the real thread associated with one threadlet
pool with another real thread associated with a different threadlet
pool is the same as any other thread switch.  At which point,
threadlet pools are themselves first-class objects (to use the snake
oil phrase), and might as well be enhanced to a data structure that
has efficient operations for reprioritization, bulk cancellation, and
all that jazz.

Did I mention that there is actually quite a bit of prior art in this
area, which makes a much better guide to the design of round wheels
than micro-benchmarks do?

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 20:29                                                                         ` Ingo Molnar
@ 2007-03-02 20:53                                                                           ` Davide Libenzi
  2007-03-02 21:21                                                                             ` Michael K. Edwards
  2007-03-02 21:43                                                                             ` Nicholas Miell
  0 siblings, 2 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-02 20:53 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds

On Fri, 2 Mar 2007, Ingo Molnar wrote:

> 
> * Davide Libenzi <davidel@xmailserver.org> wrote:
> 
> > I think that the "dirty" FPU context must, at least, follow the new 
> > head. That's what the userspace sees, and you don't want an async_exec 
> > to re-emerge with a different FPU context.
> 
> well. I think there's some confusion about terminology, so please let me 
> describe everything in detail. This is how execution goes:
> 
>   outer loop() {
>       call_threadlet();
>   }
> 
> this all runs in the 'head' context. call_threadlet() always switches to 
> the 'threadlet stack'. The 'outer context' runs in the 'head stack'. If, 
> while executing the threadlet function, we block, then the 
> threadlet-thread gets to keep the task (the threadlet stack and also the 
> FPU), and blocks - and we pick a 'new head' from the thread pool and 
> continue executing in that context - right after the call_threadlet() 
> function, in the 'old' head's stack. I.e. it's as if we returned 
> immediately from call_threadlet(), with a return code that signals that 
> the 'threadlet went async'.
> 
> now, the FPU state that was live when the threadlet blocked is totally 
> meaningless to the 'new head' - that FPU state is from the middle of the 
> threadlet execution.

For threadlets, it might be. Now think about a task wanting to dispatch N 
parallel AIO requests as N independent syslets.
Think about this task having USEDFPU set, so the FPU context is dirty.
When it returns from async_exec, with one of the requests becoming 
sleepy, it needs to have the same FPU context it had when it entered, 
otherwise it probably won't be happy.
For the same reason a schedule() must preserve/sync the "prev" FPU 
context, to be reloaded at the next FPU fault.




> > So, IMO, if the USEDFPU bit is set, we need to sync the dirty FPU 
> > context with an early unlazy_fpu(), *and* copy the sync'd FPU context 
> > to the new head. This should really be a fork of the dirty FPU context 
> > IMO, and should only happen if the USEDFPU bit is set.
> 
> why? The only effect this will have is a slowdown :) The FPU context 
> from the middle of the threadlet function is totally meaningless to the 
> 'new head'. It might be anything. (although in practice system calls are 
> almost never called with a truly in-use FPU.)

See above ;)



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 20:18                                                                       ` Davide Libenzi
@ 2007-03-02 20:29                                                                         ` Ingo Molnar
  2007-03-02 20:53                                                                           ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-02 20:29 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds


* Davide Libenzi <davidel@xmailserver.org> wrote:

> I think that the "dirty" FPU context must, at least, follow the new 
> head. That's what the userspace sees, and you don't want an async_exec 
> to re-emerge with a different FPU context.

well. I think there's some confusion about terminology, so please let me 
describe everything in detail. This is how execution goes:

  outer loop() {
      call_threadlet();
  }

this all runs in the 'head' context. call_threadlet() always switches to 
the 'threadlet stack'. The 'outer context' runs in the 'head stack'. If, 
while executing the threadlet function, we block, then the 
threadlet-thread gets to keep the task (the threadlet stack and also the 
FPU), and blocks - and we pick a 'new head' from the thread pool and 
continue executing in that context - right after the call_threadlet() 
function, in the 'old' head's stack. I.e. it's as if we returned 
immediately from call_threadlet(), with a return code that signals that 
the 'threadlet went async'.

now, the FPU state that was live when the threadlet blocked is totally 
meaningless to the 'new head' - that FPU state is from the middle of the 
threadlet execution.

and here is where thinking about threadlets as a function call and not 
as an asynchronous context helps a lot: the classic gcc convention for 
FPU use & function calls should apply: gcc does not call an external 
function with an in-use FPU stack/register, it always neatly unuses it, 
as no FPU register is callee-saved, all are caller-saved.

> So, IMO, if the USEDFPU bit is set, we need to sync the dirty FPU 
> context with an early unlazy_fpu(), *and* copy the sync'd FPU context 
> to the new head. This should really be a fork of the dirty FPU context 
> IMO, and should only happen if the USEDFPU bit is set.

why? The only effect this will have is a slowdown :) The FPU context 
from the middle of the threadlet function is totally meaningless to the 
'new head'. It might be anything. (although in practice system calls are 
almost never called with a truly in-use FPU.)

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 19:39                                                                     ` Ingo Molnar
@ 2007-03-02 20:18                                                                       ` Davide Libenzi
  2007-03-02 20:29                                                                         ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-02 20:18 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds

On Fri, 2 Mar 2007, Ingo Molnar wrote:

> 
> * Davide Libenzi <davidel@xmailserver.org> wrote:
> 
> > [...] We're still missing proper FPU context switch in the 
> > move_user_context(). [...]
> 
> yeah - i'm starting to be of the opinion that the FPU context should 
> stay with the threadlet, exclusively. I.e. when calling a threadlet, the 
> 'outer loop' (the event loop) should not leak FPU context into the 
> threadlet and then expect it to be replicated from whatever random point 
> the threadlet ended up sleeping at. It would be possible, but it just 
> makes no sense. What makes most sense is to just keep the FPU context 
> with the threadlet, and to let the 'new head' use an initial (unused) 
> FPU context. And it's in fact the threadlet that will most likely have 
> an active FPU context across a system call, not the outer loop. In other 
> words: no special FPU support needed at all for threadlets (i.e. no 
> flipping needed even) - this behavior just naturally happens in the 
> current implementation. Hm?

I think that the "dirty" FPU context must, at least, follow the new head. 
That's what the userspace sees, and you don't want an async_exec to 
re-emerge with a different FPU context.
I think it should also follow the async thread (old, going-to-sleep, 
thread), since a threadlet might have that dirtied, and as a consequence 
it'll want to find it back when it's re-scheduled.
So, IMO, if the USEDFPU bit is set, we need to sync the dirty FPU context 
with an early unlazy_fpu(), *and* copy the sync'd FPU context to the new head.
This should really be a fork of the dirty FPU context IMO, and should only 
happen if the USEDFPU bit is set.



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 17:32                                                                   ` Davide Libenzi
@ 2007-03-02 19:39                                                                     ` Ingo Molnar
  2007-03-02 20:18                                                                       ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-02 19:39 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: linux-kernel, Arjan van de Ven, Linus Torvalds


* Davide Libenzi <davidel@xmailserver.org> wrote:

> [...] We're still missing proper FPU context switch in the 
> move_user_context(). [...]

yeah - i'm starting to be of the opinion that the FPU context should 
stay with the threadlet, exclusively. I.e. when calling a threadlet, the 
'outer loop' (the event loop) should not leak FPU context into the 
threadlet and then expect it to be replicated from whatever random point 
the threadlet ended up sleeping at. It would be possible, but it just 
makes no sense. What makes most sense is to just keep the FPU context 
with the threadlet, and to let the 'new head' use an initial (unused) 
FPU context. And it's in fact the threadlet that will most likely have 
an active FPU context across a system call, not the outer loop. In other 
words: no special FPU support needed at all for threadlets (i.e. no 
flipping needed even) - this behavior just naturally happens in the 
current implementation. Hm?

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 17:13                                                                 ` Davide Libenzi
@ 2007-03-02 19:13                                                                   ` Davide Libenzi
  2007-03-03 10:06                                                                   ` Evgeniy Polyakov
  1 sibling, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-02 19:13 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
	Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Fri, 2 Mar 2007, Davide Libenzi wrote:

> And if you really feel raw about the single O(nready) loop that epoll 
> currently does, a new epoll_wait2 (or whatever) API could be used to 
> deliver the event directly into a userspace buffer [1], directly from the 
> poll callback, w/out extra delivery loops (IRQ/event->epoll_callback->event_buffer).

And if you ever wonder where the "epoll" name came from, it came from the 
old /dev/epoll. The epoll predecessor, /dev/epoll, added plugs 
wherever events were needed and delivered those events in O(1) 
*directly* into a user visible (mmap'd) buffer, in a zero-copy fashion.
The old /dev/epoll was faster than the current epoll, but the latter was 
chosen because, despite being slightly slower, it had support for every 
pollable device, *without* adding more plugs into the existing code.
Performance and code maintenance are not to be taken disjointly whenever 
you evaluate a solution. That's the reason I got excited about this new 
generic AIO solution.



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 10:57                                                                 ` Ingo Molnar
  2007-03-02 11:48                                                                   ` Evgeniy Polyakov
@ 2007-03-02 17:32                                                                   ` Davide Libenzi
  2007-03-02 19:39                                                                     ` Ingo Molnar
  1 sibling, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-02 17:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, Eric Dumazet, Pavel Machek, Theodore Tso,
	Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Fri, 2 Mar 2007, Ingo Molnar wrote:

> > After your changes epoll increased to 5k.
> 
> Can we please stop this pointless episode of benchmarketing, where every 
> mail of yours shows different results and you even deny having said 
> something which you clearly said just a few days ago? At this point i 
> simply cannot trust the numbers you are posting, nor is the discussion 
> style you are following productive in any way in my opinion.

Agreed. Can we focus on the topic here? We're still missing proper FPU 
context switch in the move_user_context(). In v6?


- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 11:08                                           ` Evgeniy Polyakov
@ 2007-03-02 17:28                                             ` Davide Libenzi
  2007-03-03 10:27                                               ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-02 17:28 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Fri, 2 Mar 2007, Evgeniy Polyakov wrote:

> do we really want to have per process signalfs, timerfs and so on - each 
> simple structure must be bound to a file, which becomes too costly.

I may be old school, but if you ask me, and if you *really* want those 
events, yes. Reason? Unix's everything-is-a-file rule, and being able to 
use them with *existing* POSIX poll/select. Remember, not every app 
requires huge scalability efforts, so working with simpler and familiar 
APIs is always welcome.
The *only* thing that was not practical to have as an fd was block requests. 
But maybe threadlets/syslets will handle those just fine, and close the gap.



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02  8:10                                                               ` Evgeniy Polyakov
@ 2007-03-02 17:13                                                                 ` Davide Libenzi
  2007-03-02 19:13                                                                   ` Davide Libenzi
  2007-03-03 10:06                                                                   ` Evgeniy Polyakov
  0 siblings, 2 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-02 17:13 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
	Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Fri, 2 Mar 2007, Evgeniy Polyakov wrote:

> On Thu, Mar 01, 2007 at 11:31:14AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:
> > 
> > > Ingo, do you really think I will send mails with faked benchmarks? :))
> > 
> > I don't think he ever implied that. He was only suggesting that when you 
> > post benchmarks, and even more when you make claims based on benchmarks, 
> > you need to be extra careful about what you measure. Otherwise the 
> > external view that you give to others does not look good.
> > Kevent can be really faster than epoll, but if you post broken benchmarks 
> > (that can be unreliable HTTP loaders, broken server implementations, 
> > etc..) and make claims based on that, the only effect that you have is to 
> > lose your point.
>  
> So, I only said that kevent is superior compared to epoll because (and
> it is the _main_ issue) of its ability to handle essentially any kind of
> events with very small overhead (the same as epoll has in struct file -
> list and spinlock) and without significant price of struct file binding
> to event.

You've to excuse me if my memory is bad, but IIRC the whole discussion 
and long benchmark feast was born with you throwing a benchmark at Ingo 
(with kevent showing a 1.9x performance boost WRT epoll), not with you 
making any other point.
As far as epoll not being able to handle other events. Said who? Of 
course, with zero modifications, you can handle zero additional events. 
With modifications, you can handle other events. But lets talk about those 
other events. The *only* kind of event that ppl (and being the epoll 
maintainer I tend to receive those requests) missed in epoll was AIO 
events. That's the *only* thing that was missed by real life application 
developers. And if something like threadlets/syslets will prove effective, 
the gap is closed WRT that requirement.
Epoll already handles the whole class of pollable devices inside the 
kernel, and if you exclude block AIO, that's a pretty wide class already. 
The *existing* f_op->poll subsystem can be used to deliver events at the 
poll-head wakeup time (by using the "key" member of the poll callback), so 
that you don't even need the extra f_op->poll call to fetch events.
And if you really feel raw about the single O(nready) loop that epoll 
currently does, a new epoll_wait2 (or whatever) API could be used to 
deliver the event directly into a userspace buffer [1], directly from the 
poll callback, w/out extra delivery loops (IRQ/event->epoll_callback->event_buffer).


[1] From the epoll callback, we cannot sleep, so it's gonna be either an 
    mlocked userspace buffer, or some kernel pages mapped to userspace.


- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 10:57                                                                 ` Ingo Molnar
@ 2007-03-02 11:48                                                                   ` Evgeniy Polyakov
  2007-03-02 17:32                                                                   ` Davide Libenzi
  1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-02 11:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Fri, Mar 02, 2007 at 11:57:13AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > > > > [...] The numbers are still highly suspect - and we are already 
> > > > > down from the prior claim of kevent being almost twice as fast 
> > > > > to a 25% difference.
> > > >
> > > > Btw, there was never an almost-twofold performance increase - epoll in 
> > > > my tests always showed 4-5 thousand requests per second, kevent - 
> > > > up to 7 thousand.
> > > 
> > > i'm referring to your claim in this mail of yours from 4 days ago 
> > > for example:
> > > 
> > >   http://lkml.org/lkml/2007/2/25/116
> > > 
> > >  "But note, that on my athlon64 3500 test machine kevent is about 7900
> > >   requests per second compared to 4000+ epoll, so expect a challenge."
> > > 
> > > no matter how i look at it, but 7900 is 1.9 times 4000 - which is 
> > > "almost twice".
> > 
> > After your changes epoll increased to 5k.
> 
> Can we please stop this pointless episode of benchmarketing, where every 
> mail of yours shows different results and you even deny having said 
> something which you clearly said just a few days ago? At this point i 
> simply cannot trust the numbers you are posting, nor is the discussion 
> style you are following productive in any way in my opinion.

I just show what I see in tests - I do not perform deep analysis of
that, since I do not see why it should be done - it is not fake, it is
not fantasy - it is the real behaviour observed on my test machine, and
if it suddenly changes I will report it.
Btw, I showed cases when epoll behaved better than kevent and
performance was an unbeatable 9k requests per second - I do not know why
it happened - maybe some cache related issues, other processes all asleep
at once, increased radiation or a strong wind that blew away my bad aura -
it is not reproducible on demand either.

> (you are never ever wrong, and if you are proven wrong on topic A you 
> claim it is an irrelevant topic (without even admitting you were wrong 
> about it) and you point to topic B claiming it's the /real/ topic you 
> talked about all along. And along the way you are slandering other 
> projects like epoll and threadlets, distorting the discussion. This kind 
> of keep-the-ball-moving discussion style is effective in politics but 
> IMO it's a waste of time when developing a kernel.)

Heh - that is why I'm not subscribed to lkml@ - it too frequently ends
up with politics :)

What are we talking about - trying to insult each other over something
that was said as part of a theoretical mental exercise? I can only laugh
at that :)

Ingo, I never ever tried to show that something is broken - that is
fantasy based on the bare words, not on the real intention.

I never said epoll is broken. Absolutely.

I never said threadlet is broken. Absolutely.

I just showed that it is not (in my opinion) the right decision to use
threadlets for an IO model instead of an event driven one - it is not based
on kevent performance (I _never_ stated it as a main factor - kevent was
only an example of the event driven model; you confused it with kevent
AIO, which is a different beast), but instead on experience with nptl
threads and linuxthreads, and the related rescheduling overhead compared to 
the userspace one.

I showed kevent as a possible usage scenario - since it does support its
own AIO. And you started to fight against it in every detail, since you
think kevent is not a good idea for handling the AIO model - well, that can
be perfectly correct. I showed kevent AIO (please do not think that kevent
and kevent AIO are the same - the latter is just one of the possible
users I implemented; it only uses kevent to deliver the completion event to 
userspace) as a possible AIO implementation, but not _kevent_ itself.

But somehow we ended up with words I never said, and ideas I never based
my assumptions on, being attributed to me... I do not really think you even
remotely wanted to make anything personal out of what we had
discussed.

We even concluded that a perfect IO model should use both approaches to
really scale - both threadlets with their on-demand-only rescheduling,
and an event driven ring.
You stated your opinion on kevents - well, I cannot agree with it, but
it is your right not to like something.

Let's not continue the bad practice of kicking each other just because
there were some problematic roots which no one even remembers correctly -
let's not make the mistake of reading something personal into trivial
bits - if you are in Russia any time soon I will happily buy you a beer
or whatever you prefer :)

So, let's just draw a line:
kevent was shown to people, and its performance, although flaky, is a
bit faster than epoll's. Threadlets bound to an event driven ring do not
show any performance degradation in a network driven setup with a small
number of reschedulings, with all the advantages of simpler programming.
So, repeating myself, both models (not kevent and threadlets as such, but
event driven and thread based) should be used to achieve maximum
performance.

So, let me make yet another request for peace, fun and interesting
technical discussion. Peace?

Anyway, that was an interesting discussion; thanks a lot to all
participants :)

> Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 10:56                                         ` Ingo Molnar
@ 2007-03-02 11:08                                           ` Evgeniy Polyakov
  2007-03-02 17:28                                             ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-02 11:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Fri, Mar 02, 2007 at 11:56:18AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > Even if kevent has the same speed, it still allows handling _any_
> > kind of event without any major surgery - a very tiny structure of a
> > lock and a list head, and you can process your own kernel events in
> > userspace - timers, signals, io events, private userspace events and
> > others - without races or the invention of different hacks for
> > different types - _this_ is the main point.
> 
> did it ever occur to you to ... extend epoll? To speed it up? To add a 
> new wait syscall to it? Instead of introducing a whole new parallel 
> framework?

Yes, I thought about extending it more than a year ago, before starting
kevent, but epoll() is entirely based on the file structure and its
file_operations poll method, so it is quite impossible to use it with
sockets to implement network AIO. Eventually it would have gathered a lot
of other subsystems - do we really want per-process signalfs, timerfs
and so on - where each simple structure must be bound to a file? That
becomes too costly.

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 15:36                                                               ` Evgeniy Polyakov
@ 2007-03-02 10:57                                                                 ` Ingo Molnar
  2007-03-02 11:48                                                                   ` Evgeniy Polyakov
  2007-03-02 17:32                                                                   ` Davide Libenzi
  0 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-02 10:57 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > > > [...] The numbers are still highly suspect - and we are already 
> > > > down from the prior claim of kevent being almost twice as fast 
> > > > to a 25% difference.
> > >
> > > Btw, there was never an almost-twofold performance increase - epoll in
> > > my tests always showed 4-5 thousand requests per second, kevent -
> > > up to 7 thousand.
> > 
> > i'm referring to your claim in this mail of yours from 4 days ago 
> > for example:
> > 
> >   http://lkml.org/lkml/2007/2/25/116
> > 
> >  "But note, that on my athlon64 3500 test machine kevent is about 7900
> >   requests per second compared to 4000+ epoll, so expect a challenge."
> > 
> > no matter how i look at it, but 7900 is 1.9 times 4000 - which is 
> > "almost twice".
> 
> After your changes epoll increased to 5k.

Can we please stop this pointless episode of benchmarketing, where every 
mail of yours shows different results and you even deny having said 
something which you clearly said just a few days ago? At this point i 
simply cannot trust the numbers you are posting, nor is the discussion 
style you are following productive in any way in my opinion.

(you are never ever wrong, and if you are proven wrong on topic A you 
claim it is an irrelevant topic (without even admitting you were wrong 
about it) and you point to topic B claiming it's the /real/ topic you 
talked about all along. And along the way you are slandering other 
projects like epoll and threadlets, distorting the discussion. This kind 
of keep-the-ball-moving discussion style is effective in politics but 
IMO it's a waste of time when developing a kernel.)

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 10:37                                       ` Evgeniy Polyakov
@ 2007-03-02 10:56                                         ` Ingo Molnar
  2007-03-02 11:08                                           ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-02 10:56 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> Even if kevent has the same speed, it still allows handling _any_
> kind of event without any major surgery - a very tiny structure of a
> lock and a list head, and you can process your own kernel events in
> userspace - timers, signals, io events, private userspace events and
> others - without races or the invention of different hacks for
> different types - _this_ is the main point.

did it ever occur to you to ... extend epoll? To speed it up? To add a 
new wait syscall to it? Instead of introducing a whole new parallel 
framework?

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-02 10:27                                     ` Pavel Machek
@ 2007-03-02 10:37                                       ` Evgeniy Polyakov
  2007-03-02 10:56                                         ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-02 10:37 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Fri, Mar 02, 2007 at 11:27:14AM +0100, Pavel Machek (pavel@ucw.cz) wrote:
> Maybe. It is not up to me to decide. But "it is faster" is _not_ the
> only merge criterion.

Of course not!
Even if kevent has the same speed, it still allows handling _any_ kind
of event without any major surgery - a very tiny structure of a lock and
a list head, and you can process your own kernel events in userspace -
timers, signals, io events, private userspace events and others -
without races or the invention of different hacks for different types -
_this_ is the main point.

> 									Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 11:18                                   ` Evgeniy Polyakov
@ 2007-03-02 10:27                                     ` Pavel Machek
  2007-03-02 10:37                                       ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Pavel Machek @ 2007-03-02 10:27 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

Hi!

> > > > If you can replace them with something simpler, and no worse than 10%
> > > > slower in worst case, then go ahead. (We actually tried to do that at
> > > > some point, only to realize that efence stresses the vm subsystem in a
> > > > very unexpected/unfriendly way).
> > > 
> > > Agh, only 10% in the worst case.
> > > I think you cannot even imagine what tricks the network stack uses to
> > > get at least an additional 1% out of the box.
> > 
> > Yep? Feel free to rewrite networking to assembly on Eugenix. That
> > should get you a 1% improvement. If you reserve a few registers to be only
> > used by the kernel (not allowed for userspace), you can speed up networking
> > 5%, too. Oh, and you could turn off the MMU, that is a sure way to get a few
> > more percent improvement in your networking case.
> 
> It is not _my_ networking, but the one you use every day in every Linux
> box. Notice which tricks are used to remove a single byte from
> sk_buff.

Ok, so tricks were worth it in sk_buff case.

> It is called optimization, and if it gives us even a single plus it must
> be implemented. Not all people have a magical fear of new things.

But that does not mean "every optimization must be
implemented". Only optimizations that are "worth it" are...

> > > Using such logic you can just abandon any further development, since it
> > works as is right now.
> > 
> > Stop trying to pervert my logic.
> 
> Ugh? :)
> I just put into simple words your 'we do not need something if it adds 10%,
> but is complex to understand'.

Yes... but that does not mean "stop development". You are still free
to clean up the code _while_ making it faster.

> > If your code is so complex that it is almost impossible to use from
> > userspace, that is good enough reason not to be merged. "But it is 3%
> > faster if..." is not a good-enough argument.
> 
> Is it enough for you?
> 
> epoll           4794.23 req/sec
> kevent          6468.95 req/sec

Maybe. It is not up to me to decide. But "it is faster" is _not_ the
only merge criterion.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 19:31                                                             ` Davide Libenzi
@ 2007-03-02  8:10                                                               ` Evgeniy Polyakov
  2007-03-02 17:13                                                                 ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-02  8:10 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
	Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Thu, Mar 01, 2007 at 11:31:14AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:
> 
> > Ingo, do you really think I will send mails with faked benchmarks? :))
> 
> I don't think he ever implied that. He was only suggesting that when you 
> post benchmarks, and even more when you make claims based on benchmarks, 
> you need to be extra careful about what you measure. Otherwise the 
> external view that you give to others does not look good.
> Kevent can be really faster than epoll, but if you post broken benchmarks 
> (that can be: unreliable HTTP loaders, broken server implementations, 
> etc..) and make claims based on that, the only effect that you have is to 
> lose your point.
 
We seem to have moved far away from the original topic - I never built
any assumptions on top of kevent _performance_ - kevent is a logical
extrapolation of epoll. I only argued that the event driven model can be
fast and that it outperforms the threadlet one - after the topic changed
we were unable to actually test threadlets in a networking environment,
since the only test I ran showed that threadlets do not reschedule at
all, and Ingo's tests showed a small number of reschedulings.

So, I only claimed that kevent is superior to epoll because (and this is
the _main_ issue) of its ability to handle essentially any kind of event
with very small overhead (the same as epoll has in struct file - a list
and a spinlock) and without the significant price of binding a struct
file to each event.

I did not want and do not want to hurt anyone (even Ingo, although he is
against kevent :), but my opinion is that the thread moved from a nice
discussion about threads and events, with jokes and fun, into quite angry
word throwing, and that is not good - let's make it fun again.
I'm not a native English speaker (and do not use a dictionary), so it is
quite possible that some of my phrases were not exactly nice, but it was
unintentional (at least not very) :)

Peace?

> - Davide
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 23:12                 ` Ingo Molnar
  2007-03-01  1:33                   ` Andrea Arcangeli
@ 2007-03-01 21:27                   ` Linus Torvalds
  1 sibling, 0 replies; 277+ messages in thread
From: Linus Torvalds @ 2007-03-01 21:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner



On Thu, 1 Mar 2007, Ingo Molnar wrote:
> 
> wrt. one-shot syscalls, the user-space stack footprint would still 
> probably be there, because even async contexts that only do single-shot 
> processing need to drop out of kernel mode to handle signals.

Why?

The easiest thing to do with signals is to just not pick them up. If the 
signal was to that *particular* threadlet (ie a "cancel"), then we just 
want to kill the threadlet. And if the signal was to the thread group, 
there is no reason why the threadlet should pick it up.

In neither case is there *any* reason to handle the signal in the 
threadlet, afaik.

And having to have a stack allocation for each threadlet certainly means 
that you complicate things a lot. Suddenly you have allocations that can't 
just go away. Again, I'm pointing to the problems I already pointed out 
with the allocations of the atom structures - quite often you do *not* 
want to keep track of anything specific for completion time, and that 
means that you MUST NOT have to de-allocate anything either.

Again, think aio_read(). With the *exact* current binary interface. 
PLEASE. If you cannot emulate that with threadlets, then threadlets are 
*pointless*. One of the major reasons for the whole exercise was to get 
rid of the special code in fs/aio.c.

So I repeat: if you cannot do that, and remain binary compatible, don't 
even bother.

		Linus

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 19:37                               ` David Lang
@ 2007-03-01 20:34                                 ` Johann Borck
  0 siblings, 0 replies; 277+ messages in thread
From: Johann Borck @ 2007-03-01 20:34 UTC (permalink / raw)
  To: David Lang; +Cc: linux-kernel

David Lang wrote:
>
> On Thu, 1 Mar 2007, Johann Borck wrote:
>
>> I reported this a while ago and suggested to have the number of 
>> pending accepts reported with the event to save that last syscall.
>> I  created an ab replacement based on kevent, and at least with my 
>> machines, which are comparable to each other, the load on client 
>> dropped from 100% to 2% or something. ab just doesn't give meaningful 
>> results  (if the client is not way more powerful). With that new 
>> client I get very similar results for epoll and kevent, from 1000 
>> through to 26000 concurrent requests, the results have been posted on 
>> kevent-homepage in october, I just checked it with new version, but 
>> there's no significant difference.
>>
>> this is the benchmark with kevent-based client:
>> http://tservice.net.ru/~s0mbre/blog/2006/10/11#2006_10_11
>> btw, each result is average over 1,000,000 requests
>>
>> and just for comparison, this is on the same machines using ab:
>> http://tservice.net.ru/~s0mbre/blog/2006/10/08#2006_10_08
>
> is this client available? and what patches need to be added to the 
> kernel to use it?
>
It's based on an older version of kevent, so I'll have to adapt it a bit 
for use with the recent patch; no patches other than kevent are necessary. 
I'll post a link when it's cleaned up, if you want.

Johann
> David Lang
>


^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 15:32                                                     ` Eric Dumazet
  2007-03-01 15:41                                                       ` Eric Dumazet
  2007-03-01 15:47                                                       ` Evgeniy Polyakov
@ 2007-03-01 19:47                                                       ` Davide Libenzi
  2 siblings, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-01 19:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Evgeniy Polyakov, Ingo Molnar, Pavel Machek, Theodore Tso,
	Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner



Oh boy, wasn't this thread supposed to focus on syslets/threadlets ... :)



On Thu, 1 Mar 2007, Eric Dumazet wrote:

> On Thursday 01 March 2007 16:23, Evgeniy Polyakov wrote:
> > They are there, since ab runs only 50k requests.
> > If I change it to something noticeably more than 50/80k, ab crashes:
> > # ab -c8000 -t 600 -n800000000 http://192.168.0.48/
> > This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> > Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> > Copyright 2006 The Apache Software Foundation, http://www.apache.org/
> >
> > Benchmarking 192.168.0.48 (be patient)
> > Segmentation fault
> >
> > Is there any other tool suitable for such loads?
> > I only tested httperf (which is worse, since it uses poll/select) and
> > 'ab'.
> >
> > Btw, the host machine runs at 100% too, so it is possible that the
> > client side is broken (too).
> 
> I have similar problems here, the ab test just doesn't complete...
> 
> I am still investigating with strace and tcpdump.
> 
> In the meantime you could just rewrite it (based on epoll please :) ), since 
> it should be quite easy to do (the reverse of evserver_epoll)

I have a simple one based on coroutines and epoll. You need libpcl and 
coronet. Debian has a package named libpcl1-dev for libpcl; otherwise:

http://www.xmailserver.org/libpcl.html

and 'configure --prefix=/usr && sudo make install'.
Coronet is here:

http://www.xmailserver.org/coronet-lib.html

here just 'configure && make'.
Inside the "test" directory there is a simple loader named cnhttpload:

  cnhttpload -s HOST -n NCON [-p PORT (80)] [-r NREQS (1)] [-S STKSIZE (8192)]
             [-M MAXCONNS] [-t TMUPD (1000)] [-a NACTIVE] [-T TMSAMP (200)]
             [-h] URL ...

HOST      = Target host
PORT      = Target host port
NCON      = Number of connections to the server
NACTIVE   = Number of active (live) connections
STKSIZE   = Stack size for coroutines
NREQS     = Number of requests done for each connection (better be 1 if 
            your server does not support keep-alive)
MAXCONNS  = Maximum number of total connections done to the server. If not 
            set, the test will continue forever (well, till a ^C)
TMUPD     = Millisec time of stats update
TMSAMP    = Millisec internal average-update time
URL       = Target doc (not http:// or host, just doc path)

So for the particular test my inbox was flooded with :), you'd use:

cnhttpload -s HOST -n 80000 -a 8000 -S 4096




- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 19:24                             ` Johann Borck
@ 2007-03-01 19:37                               ` David Lang
  2007-03-01 20:34                                 ` Johann Borck
  0 siblings, 1 reply; 277+ messages in thread
From: David Lang @ 2007-03-01 19:37 UTC (permalink / raw)
  To: Johann Borck; +Cc: linux-kernel


On Thu, 1 Mar 2007, Johann Borck wrote:

> I reported this a while ago and suggested to have the number of pending 
> accepts reported with the event to save that last syscall.
> I  created an ab replacement based on kevent, and at least with my machines, 
> which are comparable to each other, the load on client dropped from 100% to 
> 2% or something. ab just doesn't give meaningful results  (if the client is 
> not way more powerful). With that new client I get very similar results for 
> epoll and kevent, from 1000 through to 26000 concurrent requests, the results 
> have been posted on kevent-homepage in october, I just checked it with new 
> version, but there's no significant difference.
>
> this is the benchmark with kevent-based client:
> http://tservice.net.ru/~s0mbre/blog/2006/10/11#2006_10_11
> btw, each result is average over 1,000,000 requests
>
> and just for comparison, this is on the same machines using ab:
> http://tservice.net.ru/~s0mbre/blog/2006/10/08#2006_10_08

is this client available? and what patches need to be added to the kernel to use 
it?

David Lang

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 14:54                                                           ` Evgeniy Polyakov
  2007-03-01 15:09                                                             ` Ingo Molnar
@ 2007-03-01 19:31                                                             ` Davide Libenzi
  2007-03-02  8:10                                                               ` Evgeniy Polyakov
  1 sibling, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-01 19:31 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
	Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:

> Ingo, do you really think I will send mails with faked benchmarks? :))

I don't think he ever implied that. He was only suggesting that when you 
post benchmarks, and even more when you make claims based on benchmarks, 
you need to be extra careful about what you measure. Otherwise the 
external view that you give to others does not look good.
Kevent can be really faster than epoll, but if you post broken benchmarks 
(that can be: unreliable HTTP loaders, broken server implementations, 
etc..) and make claims based on that, the only effect that you have is to 
lose your point.



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01  8:18                           ` Evgeniy Polyakov
  2007-03-01  9:26                             ` Pavel Machek
@ 2007-03-01 19:24                             ` Johann Borck
  2007-03-01 19:37                               ` David Lang
  1 sibling, 1 reply; 277+ messages in thread
From: Johann Borck @ 2007-03-01 19:24 UTC (permalink / raw)
  To: linux-kernel

On Thu, Mar 01, 2007 at 04:41:27PM +0100, Eric Dumazet wrote:
>> 
>> I had to loop on accept() :
>> 
>>         for (i=0; i<num; ++i) {
>>                 if (event[i].data.fd == main_server_s) {
>>                         do {
>>                                 err = evtest_callback_main(event[i].data.fd);
>>                                 } while (err != -1);
>>                         }
>>                 else
>>                         err = evtest_callback_client(event[i].data.fd);
>>         }
>> 
>> Or else we can miss an event forever...
On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:
>
> The same here - I would just enable a debug to find it.
>


I reported this a while ago and suggested to have the number of 
pending accepts reported with the event to save that last syscall.
I created an ab replacement based on kevent, and at least with my 
machines, which are comparable to each other, the load on the client 
dropped from 100% to 2% or so. ab just doesn't give meaningful 
results (if the client is not way more powerful). With that new client 
I get very similar results for epoll and kevent, from 1000 through to 
26000 concurrent requests; the results were posted on the kevent 
homepage in October. I just checked with the new version, and 
there's no significant difference.

this is the benchmark with kevent-based client:
http://tservice.net.ru/~s0mbre/blog/2006/10/11#2006_10_11
btw, each result is average over 1,000,000 requests

and just for comparison, this is on the same machines using ab:
http://tservice.net.ru/~s0mbre/blog/2006/10/08#2006_10_08


^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01  9:54                                 ` Ingo Molnar
  2007-03-01 10:59                                   ` Evgeniy Polyakov
@ 2007-03-01 19:19                                   ` Davide Libenzi
  1 sibling, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-01 19:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, Linux Kernel Mailing List, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Thu, 1 Mar 2007, Ingo Molnar wrote:

> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > I posted kevent/epoll benchmarks and related design issues too many 
> > times both with handmade applications (which might be broken as hell) 
> > and popular open-source servers to repeat them again.
> 
> numbers are crucial here - and given the epoll bugs in the evserver code 
> that we found, do you have updated evserver benchmark results that 
> compare epoll to kevent? I'm wondering why epoll has half the speed of 
> kevent in those measurements - i suspect some possible benchmarking bug. 
> The queueing model of epoll and kevent is roughly comparable, both do 
> only a constant number of steps to serve one particular request, 
> regardless of how many pending connections/requests there are. What is 
> the CPU utilization of the server system during an epoll test, and what 
> is the CPU utilization during a kevent test? 100% utilized in both 
> cases?

With 8K concurrent (live) connections, we may also want to try with the v3 
version of the epoll-event-loops-diet patch ;)



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 17:56                                       ` Evgeniy Polyakov
@ 2007-03-01 18:41                                         ` David Lang
  0 siblings, 0 replies; 277+ messages in thread
From: David Lang @ 2007-03-01 18:41 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:

> On Thu, Mar 01, 2007 at 08:56:28AM -0800, David Lang (david.lang@digitalinsight.com) wrote:
>> the ab numbers below do not seem that impressive to me, especially for such
>> stripped down server processes.
> ...
>> client and server are dual opteron 252 with 8G of ram, running debian in 64
>> bit mode
>
> Decrease your hardware setup by 2-4 times, leave only one apache process
> and try to get the same - we are not talking about how to create a
> perfect web server; instead we try to focus on possible problems in
> epoll/kevent event driven logic.

for apache I agree that the target box was maxed out, so if you only had a 
single core on your AMD64 box that would be about half. however, thttpd is 
only using ~1 of the CPUs (OS overhead is using just a smidge of the second, 
but overall the box is 45-48% idle).

if the amount of ram is an issue then you are swapping in your tests (or at 
least throwing out cache that you need) and so would not be testing what you 
think you are.

> Vanilla (epoll) lighttpd shows 4000-5000 requests per second in my setup (no logs).
> Default mpm-apache2 with bunch of threads - about 8k req/s.
> Default thttpd (disabled logging) - about 2k req/s
>
> Btw, all your tests are network bound, try to decrease
> html page size to get the actual event processing speed out of those machines.

same test retrieving a ~128b file: the server never gets below 51% idle (so it's 
only using one CPU)

Server Software:        thttpd/2.23beta1
Server Hostname:        208.2.188.5
Server Port:            81

Document Path:          /128b
Document Length:        136 bytes

Concurrency Level:      8000
Time taken for tests:   9.372902 seconds
Complete requests:      80000
Failed requests:        0
Write errors:           0
Total transferred:      30762842 bytes
HTML transferred:       10952216 bytes
Requests per second:    8535.24 [#/sec] (mean)
Time per request:       937.290 [ms] (mean)
Time per request:       0.117 [ms] (mean, across all concurrent requests)
Transfer rate:          3205.09 [Kbytes/sec] received

Connection Times (ms)
               min  mean[+/-sd] median   max
Connect:       36  287 1125.6     73    9109
Processing:    49   89  19.8     87     339
Waiting:       17   62  16.4     62     292
Total:         92  376 1137.4    159    9262

Percentage of the requests served within a certain time (ms)
   50%    159
   66%    164
   75%    165
   80%    165
   90%    203
   95%    260
   98%   3233
   99%   9201
  100%   9262 (longest request)

note that this is showing the slowdown from the large concurrency level; if I 
reduce the concurrency level to 500 I get

Document Path:          /128b
Document Length:        136 bytes

Concurrency Level:      500
Time taken for tests:   4.215025 seconds
Complete requests:      80000
Failed requests:        0
Write errors:           0
Total transferred:      30565348 bytes
HTML transferred:       10881904 bytes
Requests per second:    18979.72 [#/sec] (mean)
Time per request:       26.344 [ms] (mean)
Time per request:       0.053 [ms] (mean, across all concurrent requests)
Transfer rate:          7081.33 [Kbytes/sec] received

Connection Times (ms)
               min  mean[+/-sd] median   max
Connect:        0   15 206.3      1    3006
Processing:     2    7   6.4      6     224
Waiting:        1    6   6.4      5     224
Total:          3   22 208.4      6    3229

Percentage of the requests served within a certain time (ms)
   50%      6
   66%      8
   75%     10
   80%     12
   90%     16
   95%     17
   98%     21
   99%     24
  100%   3229 (longest request)
loadtest2:/proc/sys#

again with >50% idle on the server box

also, ab appears to only use a single cpu so the fact that there are two on the 
client box should not make a difference.

I will reboot these boxes into a UP kernel if you think that this is still a 
significant difference. Based on what I'm seeing, I don't think it will make much 
of a difference (except for the apache test)

David Lang

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 16:56                                     ` David Lang
@ 2007-03-01 17:56                                       ` Evgeniy Polyakov
  2007-03-01 18:41                                         ` David Lang
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 17:56 UTC (permalink / raw)
  To: David Lang
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 08:56:28AM -0800, David Lang (david.lang@digitalinsight.com) wrote:
> the ab numbers below do not seem that impressive to me, especially for such 
> stripped down server processes.
...
> client and server are dual opteron 252 with 8G of ram, running debian in 64 
> bit mode

Scale your hardware setup down by a factor of 2-4, leave only one apache process,
and try to get the same numbers - we are not talking about how to create a
perfect web server; instead we are trying to focus on possible problems in the
epoll/kevent event-driven logic.

Vanilla (epoll) lighttpd shows 4000-5000 requests per second in my setup (no logs).
Default mpm-apache2 with bunch of threads - about 8k req/s.
Default thttpd (disabled logging) - about 2k req/s

Btw, all your tests are network bound, try to decrease the
html page size to get the actual event processing speed out of those machines.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 10:59                                   ` Evgeniy Polyakov
                                                       ` (2 preceding siblings ...)
  2007-03-01 12:34                                     ` Ingo Molnar
@ 2007-03-01 16:56                                     ` David Lang
  2007-03-01 17:56                                       ` Evgeniy Polyakov
  3 siblings, 1 reply; 277+ messages in thread
From: David Lang @ 2007-03-01 16:56 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

the ab numbers below do not seem that impressive to me, especially for such 
stripped down server processes.

here are some numbers from a set of test boxes I've got in my lab. I've been 
using them to test firewalls, and I've been getting throughput similar to what 
is listed below when going through a proxy that does a full fork for each 
connection, and then makes a new connection to the webserver on the other side! 
the first few sets of numbers are going directly from test client to test 
server, the final set is going through the proxy.

client and server are dual opteron 252 with 8G of ram, running debian in 64 bit 
mode

this is with apache2 MPM as the destination (relatively untuned except for 
tinkering with the child count settings). this should be about as bad as you can 
get for a server

loadtest2:/proc/sys# ab -c 8000 -n 80000 http://208.2.188.5:80/4k
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking 208.2.188.5 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests


Server Software:        Apache/1.3.33
Server Hostname:        208.2.188.5
Server Port:            80

Document Path:          /4k
Document Length:        4352 bytes

Concurrency Level:      8000
Time taken for tests:   10.992838 seconds
Complete requests:      80000
Failed requests:        0
Write errors:           0
Total transferred:      386192835 bytes
HTML transferred:       362612992 bytes
Requests per second:    7277.47 [#/sec] (mean)
Time per request:       1099.284 [ms] (mean)
Time per request:       0.137 [ms] (mean, across all concurrent requests)
Transfer rate:          34307.88 [Kbytes/sec] received

Connection Times (ms)
               min  mean[+/-sd] median   max
Connect:        8  497 1398.3     71    9072
Processing:    17  236 346.9    103    2995
Waiting:        8   91 131.6     65    1692
Total:         26  734 1435.5    187    9786

Percentage of the requests served within a certain time (ms)
   50%    187
   66%    288
   75%    564
   80%    754
   90%   3085
   95%   3163
   98%   4316
   99%   9186
  100%   9786 (longest request)
loadtest2:/proc/sys# ab -c 8000 -n 80000 http://208.2.188.5:80/8k
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking 208.2.188.5 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests


Server Software:        Apache/1.3.33
Server Hostname:        208.2.188.5
Server Port:            80

Document Path:          /8k
Document Length:        8704 bytes

Concurrency Level:      8000
Time taken for tests:   11.355031 seconds
Complete requests:      80000
Failed requests:        0
Write errors:           0
Total transferred:      736949141 bytes
HTML transferred:       713733802 bytes
Requests per second:    7045.34 [#/sec] (mean)
Time per request:       1135.503 [ms] (mean)
Time per request:       0.142 [ms] (mean, across all concurrent requests)
Transfer rate:          63379.48 [Kbytes/sec] received

Connection Times (ms)
               min  mean[+/-sd] median   max
Connect:       36  495 1297.1     76    9056
Processing:    81  317 529.5    161    3448
Waiting:       25   89  75.1     76    1610
Total:        124  812 1401.5    250   11011

Percentage of the requests served within a certain time (ms)
   50%    250
   66%    304
   75%    497
   80%    705
   90%   3171
   95%   3251
   98%   3455
   99%   9160
  100%  11011 (longest request)

for both of these tests the server had its CPU maxed out (<5% idle)

switching to thttpd instead of apache and I get

loadtest2:/proc/sys# ab -c 8000 -n 80000 http://208.2.188.5:81/4k
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking 208.2.188.5 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests


Server Software:        thttpd/2.23beta1
Server Hostname:        208.2.188.5
Server Port:            81

Document Path:          /4k
Document Length:        4352 bytes

Concurrency Level:      8000
Time taken for tests:   9.944605 seconds
Complete requests:      80000
Failed requests:        0
Write errors:           0
Total transferred:      372748950 bytes
HTML transferred:       352729600 bytes
Requests per second:    8044.56 [#/sec] (mean)
Time per request:       994.461 [ms] (mean)
Time per request:       0.124 [ms] (mean, across all concurrent requests)
Transfer rate:          36603.97 [Kbytes/sec] received

Connection Times (ms)
               min  mean[+/-sd] median   max
Connect:       50  324 1274.4     70    9124
Processing:    68   98  33.3     90     781
Waiting:       22   69  26.9     63     729
Total:        125  423 1291.9    161    9324

Percentage of the requests served within a certain time (ms)
   50%    161
   66%    175
   75%    188
   80%    203
   90%    246
   95%    307
   98%   3243
   99%   9272
  100%   9324 (longest request)
loadtest2:/proc/sys# ab -c 8000 -n 80000 http://208.2.188.5:81/8k
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking 208.2.188.5 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests


Server Software:        thttpd/2.23beta1
Server Hostname:        208.2.188.5
Server Port:            81

Document Path:          /8k
Document Length:        8704 bytes

Concurrency Level:      8000
Time taken for tests:   13.502031 seconds
Complete requests:      80000
Failed requests:        0
Write errors:           0
Total transferred:      729161888 bytes
HTML transferred:       709030153 bytes
Requests per second:    5925.03 [#/sec] (mean)
Time per request:       1350.203 [ms] (mean)
Time per request:       0.169 [ms] (mean, across all concurrent requests)
Transfer rate:          52738.14 [Kbytes/sec] received

Connection Times (ms)
               min  mean[+/-sd] median   max
Connect:       46  338 1189.7    106    9145
Processing:    79  197  52.8    195     670
Waiting:       39   92  28.4     94     577
Total:        140  536 1208.3    293    9424

Percentage of the requests served within a certain time (ms)
   50%    293
   66%    350
   75%    355
   80%    369
   90%    388
   95%   3293
   98%   3388
   99%   9392
  100%   9424 (longest request)

for these the CPU is ~45% idle on the server box.

now if I go through a box in the middle running a proxy that forks for every 
request (so you have two separate TCP connections, plus a fork for each 
request, plus two writes to syslog). the proxy box is a dual opteron 252 with 4G 
of ram running debian 32 bit

loadtest2:/proc/sys# ab -c 8000 -n 80000 http://192.168.254.2:8080/8k
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.254.2 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests


Server Software:        Apache/1.3.33
Server Hostname:        192.168.254.2
Server Port:            8080

Document Path:          /8k
Document Length:        8704 bytes

Concurrency Level:      8000
Time taken for tests:   21.101321 seconds
Complete requests:      80000
Failed requests:        0
Write errors:           0
Total transferred:      721092650 bytes
HTML transferred:       698383315 bytes
Requests per second:    3791.23 [#/sec] (mean)
Time per request:       2110.132 [ms] (mean)
Time per request:       0.264 [ms] (mean, across all concurrent requests)
Transfer rate:          33371.94 [Kbytes/sec] received

Connection Times (ms)
               min  mean[+/-sd] median   max
Connect:        9  621 1652.3     20    9036
Processing:    28   81 195.5     50    6652
Waiting:        9   51 195.5     19    6620
Total:         38  703 1683.2     70   12291

Percentage of the requests served within a certain time (ms)
   50%     70
   66%     80
   75%     83
   80%    101
   90%   3075
   95%   3088
   98%   9073
   99%   9087
  100%  12291 (longest request)

David Lang

On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:

> On Thu, Mar 01, 2007 at 10:54:02AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>>
>> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>>
>>> I posted kevent/epoll benchmarks and related design issues too many
>>> times both with handmade applications (which might be broken as hell)
>>> and popular open-source servers to repeat them again.
>>
>> numbers are crutial here - and given the epoll bugs in the evserver code
>> that we found, do you have updated evserver benchmark results that
>> compare epoll to kevent? I'm wondering why epoll has half the speed of
>> kevent in those measurements - i suspect some possible benchmarking bug.
>> The queueing model of epoll and kevent is roughly comparable, both do
>> only a constant number of steps to serve one particular request,
>> regardless of how many pending connections/requests there are. What is
>> the CPU utilization of the server system during an epoll test, and what
>> is the CPU utilization during a kevent test? 100% utilized in both
>> cases?
>
> Yes, it is about 98-100% in both cases.
> I've just re-run tests on my amd64 test machine without debug options:
>
> epoll		4794.23
> kevent		6468.95
>
> here are full client 'ab' outputs for epoll and kevent servers (epoll
> does not contain EPOLLET as you requested, but it does not look like
> it change performance in my case).
>
> epoll ab aoutput:
> # ab -c8000 -n80000 http://192.168.0.48/
> This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> Copyright 2006 The Apache Software Foundation, http://www.apache.org/
>
> Benchmarking 192.168.0.48 (be patient)
> Completed 8000 requests
> Completed 16000 requests
> Completed 24000 requests
> Completed 32000 requests
> Completed 40000 requests
> Completed 48000 requests
> Completed 56000 requests
> Completed 64000 requests
> Completed 72000 requests
> Finished 80000 requests
>
>
> Server Software:        Apache/1.3.27
> Server Hostname:        192.168.0.48
> Server Port:            80
>
> Document Path:          /
> Document Length:        3521 bytes
>
> Concurrency Level:      8000
> Time taken for tests:   16.686737 seconds
> Complete requests:      80000
> Failed requests:        0
> Write errors:           0
> Total transferred:      309760000 bytes
> HTML transferred:       281680000 bytes
> Requests per second:    4794.23 [#/sec] (mean)
> Time per request:       1668.674 [ms] (mean)
> Time per request:       0.209 [ms] (mean, across all concurrent
> requests)
> Transfer rate:          18128.17 [Kbytes/sec] received
>
> Connection Times (ms)
>              min  mean[+/-sd] median   max
> Connect:      159  779 110.1    799     921
> Processing:   468  866  77.4    869     988
> Waiting:       63  426 212.3    425     921
> Total:       1145 1646 115.6   1660    1873
>
> Percentage of the requests served within a certain time (ms)
> 50%   1660
> 66%   1661
> 75%   1662
> 80%   1663
> 90%   1806
> 95%   1830
> 98%   1833
> 99%   1834
> 100%   1873 (longest request)
>
> kevent ab output:
> # ab -c8000 -n80000 http://192.168.0.48/
> This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> Copyright 2006 The Apache Software Foundation, http://www.apache.org/
>
> Benchmarking 192.168.0.48 (be patient)
> Completed 8000 requests
> Completed 16000 requests
> Completed 24000 requests
> Completed 32000 requests
> Completed 40000 requests
> Completed 48000 requests
> Completed 56000 requests
> Completed 64000 requests
> Completed 72000 requests
> Finished 80000 requests
>
>
> Server Software:        Apache/1.3.27
> Server Hostname:        192.168.0.48
> Server Port:            80
>
> Document Path:          /
> Document Length:        3521 bytes
>
> Concurrency Level:      8000
> Time taken for tests:   12.366775 seconds
> Complete requests:      80000
> Failed requests:        0
> Write errors:           0
> Total transferred:      317047104 bytes
> HTML transferred:       288306522 bytes
> Requests per second:    6468.95 [#/sec] (mean)
> Time per request:       1236.677 [ms] (mean)
> Time per request:       0.155 [ms] (mean, across all concurrent
> requests)
> Transfer rate:          25036.12 [Kbytes/sec] received
>
> Connection Times (ms)
>              min  mean[+/-sd] median   max
> Connect:      130  364 871.1    275    9347
> Processing:   178  298  42.5    296     580
> Waiting:       31  202  65.8    210     369
> Total:        411  663 887.0    572    9722
>
> Percentage of the requests served within a certain time (ms)
> 50%    572
> 66%    573
> 75%    618
> 80%    640
> 90%    684
> 95%    709
> 98%    721
> 99%   3455
> 100%   9722 (longest request)
>
> Notice how percentage of the requests served within a certain time
> differs for kevent and epoll. And this server does not include
> ready-on-submission kevent optimization.
>
>> 	Ingo
>
>

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 15:41                                                       ` Eric Dumazet
@ 2007-03-01 15:51                                                         ` Evgeniy Polyakov
  0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 15:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 04:41:27PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Thursday 01 March 2007 16:32, Eric Dumazet wrote:
> > On Thursday 01 March 2007 16:23, Evgeniy Polyakov wrote:
> > > They are there, since ab runs only 50k requests.
> > > If I change it to something noticebly more than 50/80k, ab crashes:
> > > # ab -c8000 -t 600 -n800000000 http://192.168.0.48/
> > > This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> > > Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> > > Copyright 2006 The Apache Software Foundation, http://www.apache.org/
> > >
> > > Benchmarking 192.168.0.48 (be patient)
> > > Segmentation fault
> > >
> > > Are there any other tool suitable for such loads?
> > > I only tested httperf (which is worse, since it uses poll/select) and
> > > 'ab'.
> > >
> > > Btw, host machine runs 100% too, so it is possible that client side is
> > > broken (too).
> >
> > I have similar problems here, ab test just doesnt complete...
> >
> > I am still investigating with strace and tcpdump.
> 
> OK... I found it.
> 
> I had to loop on accept() :
> 
>         for (i=0; i<num; ++i) {
>                 if (event[i].data.fd == main_server_s) {
>                         do {
>                                 err = evtest_callback_main(event[i].data.fd);
>                                 } while (err != -1);
>                         }
>                 else
>                         err = evtest_callback_client(event[i].data.fd);
>         }
> 
> Or else we can miss an event forever...

The same here - I would just enable a debug to find it.

# ab -c8000 -n80000 http://192.168.0.48/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.0.48 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests


Server Software:        Apache/1.3.27
Server Hostname:        192.168.0.48
Server Port:            80

Document Path:          /
Document Length:        3521 bytes

Concurrency Level:      8000
Time taken for tests:   18.250921 seconds
Complete requests:      80000
Failed requests:        0
Write errors:           0
Total transferred:      315691904 bytes
HTML transferred:       287074172 bytes
Requests per second:    4383.34 [#/sec] (mean)
Time per request:       1825.092 [ms] (mean)
Time per request:       0.228 [ms] (mean, across all concurrent
requests)
Transfer rate:          16891.86 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      137  884 481.1    920    3602
Processing:   567  888 163.6    985     997
Waiting:       47  455 238.2    439     921
Total:        765 1772 566.6   1911    4556

Percentage of the requests served within a certain time
(ms)
50%   1911
66%   1911
75%   1912
80%   1913
90%   1913
95%   1914
98%   4438
99%   4497
100%   4556 (longest request)
kano:~#


-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 15:32                                                     ` Eric Dumazet
  2007-03-01 15:41                                                       ` Eric Dumazet
@ 2007-03-01 15:47                                                       ` Evgeniy Polyakov
  2007-03-01 19:47                                                       ` Davide Libenzi
  2 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 15:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 04:32:37PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Thursday 01 March 2007 16:23, Evgeniy Polyakov wrote:
> > They are there, since ab runs only 50k requests.
> > If I change it to something noticebly more than 50/80k, ab crashes:
> > # ab -c8000 -t 600 -n800000000 http://192.168.0.48/
> > This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> > Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> > Copyright 2006 The Apache Software Foundation, http://www.apache.org/
> >
> > Benchmarking 192.168.0.48 (be patient)
> > Segmentation fault
> >
> > Are there any other tool suitable for such loads?
> > I only tested httperf (which is worse, since it uses poll/select) and
> > 'ab'.
> >
> > Btw, host machine runs 100% too, so it is possible that client side is
> > broken (too).
> 
> I have similar problems here, ab test just doesnt complete...
> 
> I am still investigating with strace and tcpdump.
> 
> In the meantime you could just rewrite it (based on epoll please :) ), since 
> it should be quite easy to do this (reverse of evserver_epoll)

Rewriting 'ab' with pure epoll instead of the APR lib is like
dandruff treatment on a guillotine.

I will try to cook up something of my own - a simple client (based on epoll) -
tomorrow/weekend; now I need to work for money :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 15:32                                                     ` Eric Dumazet
@ 2007-03-01 15:41                                                       ` Eric Dumazet
  2007-03-01 15:51                                                         ` Evgeniy Polyakov
  2007-03-01 15:47                                                       ` Evgeniy Polyakov
  2007-03-01 19:47                                                       ` Davide Libenzi
  2 siblings, 1 reply; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 15:41 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thursday 01 March 2007 16:32, Eric Dumazet wrote:
> On Thursday 01 March 2007 16:23, Evgeniy Polyakov wrote:
> > They are there, since ab runs only 50k requests.
> > If I change it to something noticebly more than 50/80k, ab crashes:
> > # ab -c8000 -t 600 -n800000000 http://192.168.0.48/
> > This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> > Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> > Copyright 2006 The Apache Software Foundation, http://www.apache.org/
> >
> > Benchmarking 192.168.0.48 (be patient)
> > Segmentation fault
> >
> > Are there any other tool suitable for such loads?
> > I only tested httperf (which is worse, since it uses poll/select) and
> > 'ab'.
> >
> > Btw, host machine runs 100% too, so it is possible that client side is
> > broken (too).
>
> I have similar problems here, ab test just doesnt complete...
>
> I am still investigating with strace and tcpdump.

OK... I found it.

I had to loop on accept() :

        for (i = 0; i < num; ++i) {
                if (event[i].data.fd == main_server_s) {
                        /* drain every pending connection, not just one */
                        do {
                                err = evtest_callback_main(event[i].data.fd);
                        } while (err != -1);
                } else {
                        err = evtest_callback_client(event[i].data.fd);
                }
        }

Or else we can miss an event forever...

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 15:09                                                             ` Ingo Molnar
@ 2007-03-01 15:36                                                               ` Evgeniy Polyakov
  2007-03-02 10:57                                                                 ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 15:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 04:09:42PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > > > I can tell you that the problem (at least on my machine) comes from :
> > > > 
> > > > gettimeofday(&tm, NULL);
> > > > 
> > > > in evserver_epoll.c
> > > 
> > > yeah, that's another difference - especially if it's something like 
> > > an Athlon64 and gettimeofday falls back to pm-timer, that could 
> > > explain the performance difference. That's why i repeatedly asked 
> > > Evgeniy to use the /very same/ client function for both the epoll 
> > > and the kevent test and redo the measurements. The numbers are still 
> > > highly suspect - and we are already down from the prior claim of 
> > > kevent being almost twice as fast to a 25% difference.
> > 
> > There is no gettimeofday() in the running code anymore, and it was 
> > placed not in common server processing code btw.
> > 
> > Ingo, do you really think I will send mails with faked benchmarks? :))
> 
> no, i'd not be in this discussion anymore if i thought that. But i do 
> think that your benchmark results are extremely sloppy, that make your 
> conclusions on them essentially useless.
>
> you were hurling quite colorful and strong assertions into this 
> discussion, backed up by these numbers, so you should expect at least 
> some minimal amount of scrutiny of those numbers.

This discussion was about event driven vs. thread driven IO models, and
threadlets only behave like event driven code because in my tests there was
exactly one threadlet rescheduling per several thousand clients.

Kevent is just a logical interpolation of the performance of the event
driven model.

My assumptions were based not on kevent performance, but on the fact that
event delivery is much faster and simpler than thread handling.

Ugh, I'm starting that stupid talk again - let's just jump to the end -
I agree that in real-life high-performance systems both models must be
used.

Peace? :)

> > > [...] The numbers are still highly suspect - and we are already down 
> > > from the prior claim of kevent being almost twice as fast to a 25% 
> > > difference.
> >
> > Btw, there were never almost twice perfromance increase - epoll in my 
> > tests always showed 4-5 thousands requests per second, kevent - up to 
> > 7 thausands.
> 
> i'm referring to your claim in this mail of yours from 4 days ago for 
> example:
> 
>   http://lkml.org/lkml/2007/2/25/116
> 
>  "But note, that on my athlon64 3500 test machine kevent is about 7900
>   requests per second compared to 4000+ epoll, so expect a challenge."
> 
> no matter how i look at it, but 7900 is 1.9 times 4000 - which is 
> "almost twice".

After your changes epoll increased to 5k.
I can easily reproduce a 6300/4300 split, but cannot get more than 7k for
kevent (with oprofile/idle=poll at least).

I've completed an 800k run:
kevent 4800
epoll 4450

with tons of overflows in 'ab':

Write errors:           0
Total transferred:      -1197367296 bytes
HTML transferred:       -1478167296 bytes
Requests per second:    4440.67 [#/sec] (mean)
Time per request:       1801.529 [ms] (mean)
Time per request:       0.225 [ms] (mean, across all concurrent
requests)
Transfer rate:          -6490.62 [Kbytes/sec] received

Any other bench?

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 15:23                                                   ` Evgeniy Polyakov
@ 2007-03-01 15:32                                                     ` Eric Dumazet
  2007-03-01 15:41                                                       ` Eric Dumazet
                                                                         ` (2 more replies)
  0 siblings, 3 replies; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 15:32 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thursday 01 March 2007 16:23, Evgeniy Polyakov wrote:
> They are there, since ab runs only 50k requests.
> If I change it to something noticebly more than 50/80k, ab crashes:
> # ab -c8000 -t 600 -n800000000 http://192.168.0.48/
> This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> Copyright 2006 The Apache Software Foundation, http://www.apache.org/
>
> Benchmarking 192.168.0.48 (be patient)
> Segmentation fault
>
> Is there any other tool suitable for such loads?
> I only tested httperf (which is worse, since it uses poll/select) and
> 'ab'.
>
> Btw, the host machine runs at 100% too, so it is possible that the client
> side is broken (too).

I have similar problems here, the ab test just doesn't complete...

I am still investigating with strace and tcpdump.

In the meantime you could just rewrite it (based on epoll please :) ), since 
it should be quite easy to do (the reverse of evserver_epoll)



* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 14:47                                                 ` Ingo Molnar
@ 2007-03-01 15:23                                                   ` Evgeniy Polyakov
  2007-03-01 15:32                                                     ` Eric Dumazet
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 15:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 2362 bytes --]

On Thu, Mar 01, 2007 at 03:47:17PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > CPU: AMD64 processors, speed 2210.08 MHz (estimated)
> > Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
> > samples  %        symbol name
> > 195750   67.3097  cpu_idle
> > 14111     4.8521  enter_idle
> > 4979      1.7121  IRQ0x51_interrupt
> > 4765      1.6385  tcp_v4_rcv
> 
> the pretty much only meaningful way to measure this is to:
> 
> - start a really long 'ab' testrun. Something like "ab -c 8000 -t 600".
> - let the system get into 'steady state': i.e. CPU load at 100%
> - reset the oprofile counters, then start an oprofile run for 60 
>   seconds.
> - stop the oprofile run.
> - stop the test.
> 
> this way there won't be that many 'cpu_idle' entries in your profiles, 
> and the profiles between the two event delivery mechanisms will be 
> directly comparable.

They are there, since ab runs only 50k requests.
If I change it to something noticeably more than 50/80k, ab crashes:
# ab -c8000 -t 600 -n800000000 http://192.168.0.48/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.0.48 (be patient)
Segmentation fault

Is there any other tool suitable for such loads?
I only tested httperf (which is worse, since it uses poll/select) and
'ab'.

Btw, the host machine runs at 100% too, so it is possible that the client
side is broken (too).

> > In that tests I got epoll perf about 4400 req/s, kevent was about 
> > 5300.
> 
> So we are now up to epoll being 83% of kevent's performance - while the 
> noise in the numbers seen today alone is around 100% ... Could you update 
> the files at the two URLs that you posted before with the code that you 
> used for the above numbers:

And a couple of moments ago I resent a profile with 6100 r/s; the one now
attached is with 6300.

>    http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
>    http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c

Plus http://tservice.net.ru/~s0mbre/archive/kevent/evserver_common.c,
which contains the common request-handling function.

> thanks,
> 
> 	Ingo

-- 
	Evgeniy Polyakov

[-- Attachment #2: profile.kevent --]
[-- Type: text/plain, Size: 13546 bytes --]

CPU: AMD64 processors, speed 2210.08 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
168753   65.1189  cpu_idle
12451     4.8046  enter_idle
4814      1.8576  tcp_v4_rcv
3980      1.5358  IRQ0x51_interrupt
3142      1.2124  tcp_ack
2738      1.0565  kmem_cache_free
2346      0.9053  kfree
2341      0.9034  memset_c
1927      0.7436  csum_partial_copy_generic
1723      0.6649  ip_route_input
1650      0.6367  dev_queue_xmit
1452      0.5603  ip_output
1416      0.5464  handle_IRQ_event
1335      0.5152  ip_rcv
1326      0.5117  tcp_rcv_state_process
1069      0.4125  schedule
960       0.3704  __do_softirq
943       0.3639  tcp_sendmsg
915       0.3531  ip_queue_xmit
907       0.3500  tcp_v4_do_rcv
897       0.3461  fget
894       0.3450  system_call
890       0.3434  csum_partial
877       0.3384  tcp_transmit_skb
845       0.3261  netif_receive_skb
822       0.3172  ip_local_deliver
812       0.3133  kmem_cache_alloc
788       0.3041  local_bh_enable
773       0.2983  __alloc_skb
771       0.2975  kfree_skbmem
764       0.2948  __d_lookup
757       0.2921  __tcp_push_pending_frames
734       0.2832  pfifo_fast_enqueue
720       0.2778  copy_user_generic_string
627       0.2419  net_rx_action
603       0.2327  pfifo_fast_dequeue
586       0.2261  ret_from_intr
562       0.2169  __link_path_walk
561       0.2165  sock_wfree
549       0.2118  __fput
547       0.2111  __kfree_skb
543       0.2095  get_unused_fd
534       0.2061  number
527       0.2034  sysret_check
516       0.1991  preempt_schedule
508       0.1960  skb_clone
496       0.1914  tcp_parse_options
487       0.1879  _atomic_dec_and_lock
470       0.1814  tcp_poll
469       0.1810  __ip_route_output_key
466       0.1798  rt_hash_code
464       0.1790  tcp_recvmsg
421       0.1625  dput
420       0.1621  tcp_rcv_established
412       0.1590  __tcp_select_window
407       0.1571  exit_idle
394       0.1520  rb_erase
381       0.1470  sys_close
375       0.1447  __mod_timer
365       0.1408  d_alloc
363       0.1401  mask_and_ack_8259A
335       0.1293  lock_timer_base
315       0.1216  cache_alloc_refill
307       0.1185  ret_from_sys_call
300       0.1158  do_path_lookup
299       0.1154  eth_type_trans
298       0.1150  find_next_zero_bit
294       0.1134  tcp_data_queue
286       0.1104  dentry_iput
285       0.1100  ip_append_data
263       0.1015  thread_return
257       0.0992  __dentry_open
255       0.0984  sock_recvmsg
255       0.0984  tcp_rtt_estimator
252       0.0972  sys_fcntl
250       0.0965  tcp_current_mss
248       0.0957  sk_stream_mem_schedule
240       0.0926  call_softirq
233       0.0899  sys_recvfrom
229       0.0884  cache_grow
221       0.0853  vsnprintf
215       0.0830  tcp_send_fin
214       0.0826  do_generic_mapping_read
213       0.0822  call_rcu
213       0.0822  common_interrupt
203       0.0783  do_lookup
196       0.0756  inotify_dentry_parent_queue_event
191       0.0737  memcpy_c
188       0.0725  filp_close
180       0.0695  release_sock
180       0.0695  sock_def_readable
178       0.0687  get_page_from_freelist
177       0.0683  do_sys_open
174       0.0671  restore_args
172       0.0664  strncpy_from_user
167       0.0644  fget_light
162       0.0625  clear_inode
161       0.0621  link_path_walk
159       0.0614  generic_drop_inode
157       0.0606  get_empty_filp
156       0.0602  __skb_checksum_complete
153       0.0590  del_timer
152       0.0587  update_send_head
151       0.0583  percpu_counter_mod
150       0.0579  current_fs_time
150       0.0579  schedule_timeout
150       0.0579  skb_checksum
149       0.0575  fd_install
145       0.0560  sock_close
143       0.0552  try_to_wake_up
142       0.0548  generic_permission
135       0.0521  __put_unused_fd
133       0.0513  new_inode
132       0.0509  half_md4_transform
131       0.0506  alloc_inode
130       0.0502  bictcp_cong_avoid
130       0.0502  memcmp
127       0.0490  tcp_init_tso_segs
126       0.0486  tcp_sync_mss
125       0.0482  __do_page_cache_readahead
125       0.0482  find_get_page
123       0.0475  lookup_mnt
117       0.0451  rb_insert_color
114       0.0440  tcp_v4_send_check
112       0.0432  mod_timer
109       0.0421  page_cache_readahead
101       0.0390  __path_lookup_intent_open
100       0.0386  __wake_up_bit
100       0.0386  may_open
100       0.0386  tcp_snd_test
97        0.0374  tcp_check_space
96        0.0370  expand_files
96        0.0370  skb_copy_datagram_iovec
95        0.0367  getname
95        0.0367  igrab
94        0.0363  open_namei
93        0.0359  groups_search
92        0.0355  dnotify_flush
91        0.0351  locks_remove_posix
91        0.0351  memmove
90        0.0347  sk_reset_timer
89        0.0343  tcp_send_ack
88        0.0340  copy_page_c
88        0.0340  tcp_select_initial_window
87        0.0336  sock_common_recvmsg
85        0.0328  sock_release
83        0.0320  IRQ0x20_interrupt
83        0.0320  file_free_rcu
80        0.0309  rw_verify_area
79        0.0305  d_instantiate
79        0.0305  permission
79        0.0305  put_page
77        0.0297  cond_resched
77        0.0297  get_task_mm
77        0.0297  touch_atime
75        0.0289  __follow_mount
75        0.0289  inotify_inode_queue_event
74        0.0286  file_move
73        0.0282  copy_to_user
71        0.0274  tcp_v4_tw_remember_stamp
69        0.0266  wake_up_inode
65        0.0251  prepare_to_wait
65        0.0251  sockfd_lookup
65        0.0251  tcp_event_data_recv
64        0.0247  file_kill
64        0.0247  fput
62        0.0239  __handle_mm_fault
62        0.0239  tcp_setsockopt
61        0.0235  sock_sendmsg
60        0.0232  __wake_up
59        0.0228  page_fault
57        0.0220  locks_remove_flock
54        0.0208  sk_stream_rfree
54        0.0208  sprintf
53        0.0205  inet_sendmsg
53        0.0205  retint_kernel
51        0.0197  iret_label
51        0.0197  tcp_cwnd_validate
49        0.0189  tcp_rcv_space_adjust
48        0.0185  inode_init_once
48        0.0185  mutex_unlock
45        0.0174  finish_wait
45        0.0174  mntput_no_expire
44        0.0170  __delay
44        0.0170  __tcp_ack_snd_check
43        0.0166  inet_sock_destruct
42        0.0162  try_to_del_timer_sync
41        0.0158  free_hot_cold_page
41        0.0158  memset
40        0.0154  __rb_rotate_left
40        0.0154  init_once
40        0.0154  sys_open
40        0.0154  tcp_unhash
39        0.0150  generic_file_open
39        0.0150  tcp_cong_avoid
38        0.0147  __lookup_mnt
38        0.0147  bit_waitqueue
36        0.0139  clear_page_c
36        0.0139  iput
33        0.0127  in_group_p
33        0.0127  inet_sk_rebuild_header
33        0.0127  sock_fasync
33        0.0127  tcp_init_cwnd
32        0.0123  memcpy_toiovec
32        0.0123  sk_stop_timer
32        0.0123  unmap_vmas
31        0.0120  blockable_page_cache_readahead
30        0.0116  _spin_lock_bh
30        0.0116  inet_getname
29        0.0112  __put_user_8
28        0.0108  copy_from_user
27        0.0104  do_filp_open
26        0.0100  tcp_v4_destroy_sock
25        0.0096  __alloc_pages
24        0.0093  apic_timer_interrupt
24        0.0093  do_page_fault
24        0.0093  file_ra_state_init
24        0.0093  hrtimer_run_queues
24        0.0093  vfs_permission
23        0.0089  tcp_slow_start
23        0.0089  zone_watermark_ok
22        0.0085  mutex_lock
21        0.0081  destroy_inode
21        0.0081  init_timer
21        0.0081  invalidate_inode_buffers
21        0.0081  sk_alloc
19        0.0073  exit_intr
16        0.0062  copy_page_range
15        0.0058  find_vma
14        0.0054  retint_swapgs
14        0.0054  wake_up_bit
13        0.0050  __down_read
12        0.0046  do_wp_page
12        0.0046  mark_page_accessed
11        0.0042  __get_user_4
11        0.0042  __tcp_checksum_complete_user
10        0.0039  vm_normal_page
9         0.0035  __up_read
8         0.0031  rcu_start_batch
8         0.0031  retint_restore_args
8         0.0031  timespec_trunc
7         0.0027  __find_get_block
7         0.0027  error_exit
7         0.0027  flush_tlb_page
6         0.0023  _write_lock_bh
6         0.0023  copy_process
6         0.0023  free_hot_page
6         0.0023  inode_has_buffers
6         0.0023  kmem_flagcheck
6         0.0023  retint_check
5         0.0019  __rb_rotate_right
5         0.0019  del_timer_sync
5         0.0019  nameidata_to_filp
4         0.0015  __down_read_trylock
4         0.0015  __getblk
4         0.0015  __mutex_init
4         0.0015  __set_page_dirty_nobuffers
4         0.0015  _read_lock_irqsave
4         0.0015  do_mmap_pgoff
4         0.0015  error_sti
4         0.0015  free_pgd_range
4         0.0015  load_elf_binary
4         0.0015  mmput
3         0.0012  __iget
3         0.0012  _spin_lock_irqsave
3         0.0012  bio_endio
3         0.0012  copy_strings
3         0.0012  cpuset_update_task_memory_state
3         0.0012  d_lookup
3         0.0012  exit_itimers
3         0.0012  filemap_nopage
3         0.0012  find_vma_prepare
3         0.0012  free_pages
3         0.0012  generic_fillattr
3         0.0012  generic_make_request
3         0.0012  memcpy
3         0.0012  prio_tree_insert
3         0.0012  put_unused_fd
3         0.0012  run_local_timers
3         0.0012  unmap_region
3         0.0012  vma_prio_tree_remove
2        7.7e-04  __clear_user
2        7.7e-04  __find_get_block_slow
2        7.7e-04  __strnlen_user
2        7.7e-04  __vm_enough_memory
2        7.7e-04  add_to_page_cache
2        7.7e-04  add_wait_queue
2        7.7e-04  alloc_pages_current
2        7.7e-04  anon_vma_prepare
2        7.7e-04  dnotify_parent
2        7.7e-04  do_exit
2        7.7e-04  do_munmap
2        7.7e-04  do_select
2        7.7e-04  dup_fd
2        7.7e-04  exit_mmap
2        7.7e-04  find_mergeable_anon_vma
2        7.7e-04  free_pgtables
2        7.7e-04  lru_cache_add_active
2        7.7e-04  mempool_free
2        7.7e-04  page_add_file_rmap
2        7.7e-04  page_remove_rmap
2        7.7e-04  page_waitqueue
2        7.7e-04  path_release
2        7.7e-04  prio_tree_replace
2        7.7e-04  pty_chars_in_buffer
2        7.7e-04  remove_vma
2        7.7e-04  retint_with_reschedule
2        7.7e-04  run_workqueue
2        7.7e-04  seq_puts
2        7.7e-04  split_vma
2        7.7e-04  sys_mmap
2        7.7e-04  sys_mprotect
2        7.7e-04  sys_rt_sigprocmask
2        7.7e-04  truncate_inode_pages_range
2        7.7e-04  vfs_lstat_fd
2        7.7e-04  vm_acct_memory
2        7.7e-04  vma_adjust
2        7.7e-04  vma_link
1        3.9e-04  __block_prepare_write
1        3.9e-04  __bread
1        3.9e-04  __d_path
1        3.9e-04  __down_write
1        3.9e-04  __down_write_nested
1        3.9e-04  __end_that_request_first
1        3.9e-04  __free_pages
1        3.9e-04  __generic_file_aio_write_nolock
1        3.9e-04  __make_request
1        3.9e-04  __page_set_anon_rmap
1        3.9e-04  __pagevec_lru_add_active
1        3.9e-04  __put_user_4
1        3.9e-04  __remove_shared_vm_struct
1        3.9e-04  __up_write
1        3.9e-04  __vma_link_rb
1        3.9e-04  _read_lock_bh
1        3.9e-04  activate_page
1        3.9e-04  anon_vma_unlink
1        3.9e-04  block_write_full_page
1        3.9e-04  cap_bprm_apply_creds
1        3.9e-04  cap_vm_enough_memory
1        3.9e-04  clear_page_dirty_for_io
1        3.9e-04  cond_resched_lock
1        3.9e-04  cp_new_stat
1        3.9e-04  create_empty_buffers
1        3.9e-04  dentry_unhash
1        3.9e-04  do_brk
1        3.9e-04  do_mpage_readpage
1        3.9e-04  do_mremap
1        3.9e-04  do_sigaction
1        3.9e-04  do_wait
1        3.9e-04  drop_buffers
1        3.9e-04  eligible_child
1        3.9e-04  exit_sem
1        3.9e-04  file_read_actor
1        3.9e-04  filldir64
1        3.9e-04  flush_signal_handlers
1        3.9e-04  generic_block_bmap
1        3.9e-04  generic_file_llseek
1        3.9e-04  generic_file_mmap
1        3.9e-04  get_index
1        3.9e-04  get_signal_to_deliver
1        3.9e-04  get_stack
1        3.9e-04  get_unmapped_area
1        3.9e-04  get_vma_policy
1        3.9e-04  init_request_from_bio
1        3.9e-04  inode_setattr
1        3.9e-04  is_bad_inode
1        3.9e-04  kref_put
1        3.9e-04  lru_add_drain
1        3.9e-04  may_delete
1        3.9e-04  mm_release
1        3.9e-04  n_tty_ioctl
1        3.9e-04  nr_blockdev_pages
1        3.9e-04  page_add_new_anon_rmap
1        3.9e-04  pipe_poll
1        3.9e-04  preempt_schedule_irq
1        3.9e-04  proc_lookup
1        3.9e-04  ptregscall_common
1        3.9e-04  put_files_struct
1        3.9e-04  radix_tree_tag_clear
1        3.9e-04  rb_prev
1        3.9e-04  recalc_bh_state
1        3.9e-04  release_pages
1        3.9e-04  sched_exec
1        3.9e-04  sched_fork
1        3.9e-04  seq_escape
1        3.9e-04  set_close_on_exec
1        3.9e-04  set_fs_pwd
1        3.9e-04  set_personality_64bit
1        3.9e-04  shrink_dcache_parent
1        3.9e-04  sock_map_fd
1        3.9e-04  strchr
1        3.9e-04  submit_bio
1        3.9e-04  sys_dup2
1        3.9e-04  sys_faccessat
1        3.9e-04  sys_munmap
1        3.9e-04  sys_rt_sigaction
1        3.9e-04  tty_write
1        3.9e-04  unlink_file_vma
1        3.9e-04  unlock_page
1        3.9e-04  vfs_read
1        3.9e-04  vfs_readdir
1        3.9e-04  vma_merge
1        3.9e-04  vma_prio_tree_insert
1        3.9e-04  wake_up_new_task
1        3.9e-04  writeback_inodes
1        3.9e-04  zonelist_policy


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 14:54                                                           ` Evgeniy Polyakov
@ 2007-03-01 15:09                                                             ` Ingo Molnar
  2007-03-01 15:36                                                               ` Evgeniy Polyakov
  2007-03-01 19:31                                                             ` Davide Libenzi
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 15:09 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > > I can tell you that the problem (at least on my machine) comes from :
> > > 
> > > gettimeofday(&tm, NULL);
> > > 
> > > in evserver_epoll.c
> > 
> > yeah, that's another difference - especially if it's something like 
> > an Athlon64 and gettimeofday falls back to pm-timer, that could 
> > explain the performance difference. That's why i repeatedly asked 
> > Evgeniy to use the /very same/ client function for both the epoll 
> > and the kevent test and redo the measurements. The numbers are still 
> > highly suspect - and we are already down from the prior claim of 
> > kevent being almost twice as fast to a 25% difference.
> 
> There is no gettimeofday() in the running code anymore, and it was 
> placed not in common server processing code btw.
> 
> Ingo, do you really think I will send mails with faked benchmarks? :))

no, i'd not be in this discussion anymore if i thought that. But i do 
think that your benchmark results are extremely sloppy, which makes the 
conclusions you draw from them essentially useless.

you were hurling quite colorful and strong assertions into this 
discussion, backed up by these numbers, so you should expect at least 
some minimal amount of scrutiny of those numbers.

> > [...] The numbers are still highly suspect - and we are already down 
> > from the prior claim of kevent being almost twice as fast to a 25% 
> > difference.
>
> Btw, there was never an almost-twofold performance difference - epoll in 
> my tests always showed 4-5 thousand requests per second, kevent up to 
> 7 thousand.

i'm referring to your claim in this mail of yours from 4 days ago for 
example:

  http://lkml.org/lkml/2007/2/25/116

 "But note, that on my athlon64 3500 test machine kevent is about 7900
  requests per second compared to 4000+ epoll, so expect a challenge."

no matter how i look at it, 7900 is 1.9 times 4000 - which is 
"almost twice".

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 14:43                                               ` Evgeniy Polyakov
  2007-03-01 14:47                                                 ` Ingo Molnar
@ 2007-03-01 14:57                                                 ` Evgeniy Polyakov
  1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 14:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1130 bytes --]

On Thu, Mar 01, 2007 at 05:43:50PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> On Thu, Mar 01, 2007 at 02:12:50PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > On Thursday 01 March 2007 12:47, Evgeniy Polyakov wrote:
> > >
> > > Could you provide at least remote way to find it?
> > >
> > 
> > Sure :)
> > 
> > > I only found the same problem at
> > > http://lkml.org/lkml/2006/10/27/3
> > >
> > > but without any hits to solve the problem.
> > >
> > > I will try CVS oprofile, if it works I will provide details of course.
> > >
> > 
> > # cat CVS/Root
> > CVS/Root::pserver:anonymous@oprofile.cvs.sourceforge.net:/cvsroot/oprofile
> > 
> > # cvs diff >/tmp/oprofile.diff
> > 
> > Hope it helps
> 
> One of the hunks failed, since it was in CVS already.
> After setting up some mirrors, I've installed all bits needed for
> oprofile.
> Attached kevent and epoll profiles.
> 
> In that tests I got epoll perf about 4400 req/s, kevent was about 5300.

Attached kevent profile with 6100 req/sec.
They all look exactly the same to me - there are no kevent or epoll
functions in the profiles at all.

-- 
	Evgeniy Polyakov

[-- Attachment #2: profile.kevent --]
[-- Type: text/plain, Size: 12427 bytes --]

CPU: AMD64 processors, speed 2210.08 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
103425   55.0868  cpu_idle
8214      4.3750  enter_idle
4712      2.5097  tcp_v4_rcv
3805      2.0266  IRQ0x51_interrupt
3154      1.6799  tcp_ack
2777      1.4791  kmem_cache_free
2286      1.2176  kfree
2155      1.1478  memset_c
1747      0.9305  csum_partial_copy_generic
1710      0.9108  ip_output
1620      0.8629  dev_queue_xmit
1551      0.8261  handle_IRQ_event
1391      0.7409  schedule
1373      0.7313  tcp_rcv_state_process
1337      0.7121  ip_rcv
1100      0.5859  ip_queue_xmit
965       0.5140  ip_route_input
939       0.5001  tcp_sendmsg
935       0.4980  __do_softirq
923       0.4916  ip_local_deliver
916       0.4879  csum_partial
905       0.4820  system_call
889       0.4735  tcp_transmit_skb
884       0.4708  tcp_v4_do_rcv
812       0.4325  netif_receive_skb
778       0.4144  __d_lookup
760       0.4048  __alloc_skb
747       0.3979  local_bh_enable
737       0.3925  __tcp_push_pending_frames
702       0.3739  kfree_skbmem
698       0.3718  pfifo_fast_enqueue
678       0.3611  kmem_cache_alloc
651       0.3467  fget
640       0.3409  pfifo_fast_dequeue
637       0.3393  net_rx_action
629       0.3350  __link_path_walk
602       0.3206  preempt_schedule
599       0.3190  __fput
594       0.3164  sock_wfree
589       0.3137  copy_user_generic_string
579       0.3084  ret_from_intr
559       0.2977  _atomic_dec_and_lock
552       0.2940  __kfree_skb
549       0.2924  skb_clone
514       0.2738  number
494       0.2631  rt_hash_code
473       0.2519  dput
466       0.2482  tcp_parse_options
446       0.2376  tcp_rcv_established
433       0.2306  tcp_recvmsg
431       0.2296  tcp_poll
417       0.2221  get_unused_fd
417       0.2221  sysret_check
377       0.2008  rb_erase
364       0.1939  __tcp_select_window
363       0.1933  lock_timer_base
347       0.1848  __mod_timer
329       0.1752  ip_append_data
326       0.1736  exit_idle
325       0.1731  ret_from_sys_call
317       0.1688  d_alloc
302       0.1609  do_path_lookup
295       0.1571  __ip_route_output_key
290       0.1545  eth_type_trans
285       0.1518  sys_close
283       0.1507  cache_alloc_refill
282       0.1502  mask_and_ack_8259A
275       0.1465  thread_return
267       0.1422  call_softirq
265       0.1411  tcp_rtt_estimator
260       0.1385  tcp_data_queue
258       0.1374  __dentry_open
258       0.1374  vsnprintf
255       0.1358  dentry_iput
255       0.1358  tcp_current_mss
250       0.1332  sk_stream_mem_schedule
239       0.1273  find_next_zero_bit
233       0.1241  cache_grow
233       0.1241  tcp_send_fin
222       0.1182  try_to_wake_up
219       0.1166  sock_recvmsg
216       0.1150  do_generic_mapping_read
211       0.1124  sys_fcntl
209       0.1113  get_empty_filp
207       0.1103  call_rcu
206       0.1097  strncpy_from_user
195       0.1039  sock_def_readable
190       0.1012  generic_drop_inode
190       0.1012  restore_args
184       0.0980  get_page_from_freelist
182       0.0969  sys_recvfrom
176       0.0937  do_lookup
174       0.0927  common_interrupt
171       0.0911  fget_light
167       0.0889  new_inode
167       0.0889  percpu_counter_mod
166       0.0884  link_path_walk
166       0.0884  skb_checksum
160       0.0852  fput
160       0.0852  release_sock
159       0.0847  memcpy_c
158       0.0842  memcmp
157       0.0836  __skb_checksum_complete
157       0.0836  tcp_init_tso_segs
148       0.0788  half_md4_transform
144       0.0767  tcp_v4_send_check
142       0.0756  del_timer
139       0.0740  current_fs_time
135       0.0719  update_send_head
129       0.0687  do_sys_open
126       0.0671  rb_insert_color
125       0.0666  bictcp_cong_avoid
124       0.0660  __put_unused_fd
123       0.0655  schedule_timeout
121       0.0644  clear_inode
118       0.0628  sock_close
116       0.0618  __do_page_cache_readahead
115       0.0613  alloc_inode
115       0.0613  lookup_mnt
114       0.0607  tcp_snd_test
113       0.0602  mod_timer
112       0.0597  generic_permission
109       0.0581  tcp_select_initial_window
101       0.0538  locks_remove_posix
98        0.0522  fd_install
97        0.0517  find_get_page
97        0.0517  sk_reset_timer
94        0.0501  try_to_del_timer_sync
93        0.0495  __follow_mount
92        0.0490  igrab
91        0.0485  page_cache_readahead
90        0.0479  dnotify_flush
90        0.0479  prepare_to_wait
90        0.0479  put_page
89        0.0474  expand_files
89        0.0474  getname
88        0.0469  inotify_dentry_parent_queue_event
88        0.0469  tcp_sync_mss
87        0.0463  __path_lookup_intent_open
86        0.0458  file_free_rcu
85        0.0453  may_open
85        0.0453  skb_copy_datagram_iovec
84        0.0447  IRQ0x20_interrupt
82        0.0437  tcp_cwnd_validate
81        0.0431  copy_page_c
81        0.0431  d_instantiate
81        0.0431  groups_search
80        0.0426  permission
79        0.0421  __handle_mm_fault
79        0.0421  file_kill
79        0.0421  get_task_mm
76        0.0405  rw_verify_area
74        0.0394  copy_to_user
73        0.0389  __wake_up_bit
72        0.0383  __wake_up
72        0.0383  cond_resched
72        0.0383  mntput_no_expire
69        0.0368  memmove
69        0.0368  sock_sendmsg
69        0.0368  tcp_setsockopt
68        0.0362  open_namei
68        0.0362  retint_kernel
67        0.0357  wake_up_inode
66        0.0352  inet_sendmsg
66        0.0352  tcp_event_data_recv
65        0.0346  generic_file_open
64        0.0341  touch_atime
63        0.0336  sock_release
63        0.0336  tcp_send_ack
62        0.0330  file_move
62        0.0330  filp_close
57        0.0304  mutex_unlock
55        0.0293  inet_sk_rebuild_header
55        0.0293  page_fault
55        0.0293  sockfd_lookup
54        0.0288  memset
54        0.0288  sk_stream_rfree
52        0.0277  __tcp_ack_snd_check
52        0.0277  inode_init_once
52        0.0277  sock_common_recvmsg
52        0.0277  tcp_check_space
51        0.0272  sys_open
49        0.0261  iret_label
47        0.0250  locks_remove_flock
46        0.0245  __rb_rotate_left
46        0.0245  tcp_v4_tw_remember_stamp
46        0.0245  unmap_vmas
45        0.0240  finish_wait
44        0.0234  inet_sock_destruct
44        0.0234  sprintf
43        0.0229  tcp_cong_avoid
42        0.0224  inotify_inode_queue_event
41        0.0218  __alloc_pages
41        0.0218  __lookup_mnt
41        0.0218  _spin_lock_bh
41        0.0218  tcp_init_cwnd
38        0.0202  clear_page_c
38        0.0202  tcp_unhash
37        0.0197  bit_waitqueue
37        0.0197  memcpy_toiovec
36        0.0192  iput
35        0.0186  do_filp_open
35        0.0186  init_timer
35        0.0186  sock_fasync
31        0.0165  __delay
31        0.0165  exit_intr
31        0.0165  vfs_permission
30        0.0160  sk_alloc
29        0.0154  copy_from_user
28        0.0149  free_hot_cold_page
27        0.0144  __put_user_8
27        0.0144  del_timer_sync
27        0.0144  hrtimer_run_queues
26        0.0138  init_once
26        0.0138  sk_stop_timer
25        0.0133  tcp_rcv_space_adjust
25        0.0133  tcp_v4_destroy_sock
23        0.0123  copy_page_range
23        0.0123  find_vma
22        0.0117  blockable_page_cache_readahead
22        0.0117  invalidate_inode_buffers
21        0.0112  do_page_fault
21        0.0112  do_wp_page
20        0.0107  in_group_p
20        0.0107  inet_getname
20        0.0107  mutex_lock
20        0.0107  tcp_slow_start
20        0.0107  zone_watermark_ok
18        0.0096  file_ra_state_init
17        0.0091  mark_page_accessed
16        0.0085  __find_get_block
15        0.0080  rt_run_flush
14        0.0075  __down_read
14        0.0075  __up_read
14        0.0075  apic_timer_interrupt
11        0.0059  destroy_inode
11        0.0059  flush_tlb_page
11        0.0059  vm_normal_page
10        0.0053  error_exit
10        0.0053  memcpy
9         0.0048  __get_user_4
9         0.0048  retint_restore_args
9         0.0048  retint_swapgs
8         0.0043  __rb_rotate_right
8         0.0043  run_local_timers
7         0.0037  error_sti
7         0.0037  inode_has_buffers
7         0.0037  nameidata_to_filp
7         0.0037  timespec_trunc
6         0.0032  __tcp_checksum_complete_user
6         0.0032  _spin_lock_irqsave
6         0.0032  _write_lock_bh
6         0.0032  filemap_nopage
6         0.0032  wake_up_bit
5         0.0027  __getblk
5         0.0027  __iget
5         0.0027  _read_lock_irqsave
5         0.0027  find_vma_prepare
5         0.0027  mmput
4         0.0021  __free_pages
4         0.0021  __mutex_init
4         0.0021  do_mmap_pgoff
4         0.0021  lru_cache_add_active
3         0.0016  __down_read_trylock
3         0.0016  __find_get_block_slow
3         0.0016  __make_request
3         0.0016  __mark_inode_dirty
3         0.0016  __remove_shared_vm_struct
3         0.0016  copy_strings
3         0.0016  do_notify_resume
3         0.0016  free_hot_page
3         0.0016  generic_file_aio_read
3         0.0016  generic_make_request
3         0.0016  load_elf_binary
3         0.0016  page_waitqueue
3         0.0016  prio_tree_insert
3         0.0016  prio_tree_replace
3         0.0016  rcu_start_batch
3         0.0016  unlock_page
3         0.0016  vma_link
3         0.0016  vma_merge
3         0.0016  zonelist_policy
2         0.0011  __block_prepare_write
2         0.0011  __pagevec_lru_add_active
2         0.0011  __put_user_4
2         0.0011  __vma_link_rb
2         0.0011  _read_lock_bh
2         0.0011  alloc_pages_current
2         0.0011  anon_vma_unlink
2         0.0011  copy_process
2         0.0011  d_lookup
2         0.0011  dentry_unhash
2         0.0011  do_mpage_readpage
2         0.0011  do_sigaction
2         0.0011  error_entry
2         0.0011  flush_old_exec
2         0.0011  kmem_flagcheck
2         0.0011  mempool_free
2         0.0011  page_add_file_rmap
2         0.0011  run_workqueue
2         0.0011  sched_balance_self
2         0.0011  set_personality_64bit
2         0.0011  strnlen_user
2         0.0011  sys_mprotect
2         0.0011  sys_rt_sigaction
2         0.0011  vfs_lstat_fd
2         0.0011  worker_thread
1        5.3e-04  __brelse
1        5.3e-04  __d_path
1        5.3e-04  __down_write_nested
1        5.3e-04  __lookup_hash
1        5.3e-04  __page_set_anon_rmap
1        5.3e-04  __strnlen_user
1        5.3e-04  __up_write
1        5.3e-04  __vm_enough_memory
1        5.3e-04  add_timer_randomness
1        5.3e-04  anon_vma_link
1        5.3e-04  bio_alloc_bioset
1        5.3e-04  blk_recount_segments
1        5.3e-04  can_vma_merge_after
1        5.3e-04  cap_bprm_apply_creds
1        5.3e-04  cond_resched_softirq
1        5.3e-04  copy_from_read_buf
1        5.3e-04  cpuset_fork
1        5.3e-04  cpuset_update_task_memory_state
1        5.3e-04  dnotify_parent
1        5.3e-04  do_exit
1        5.3e-04  do_munmap
1        5.3e-04  do_select
1        5.3e-04  do_wait
1        5.3e-04  end_bio_bh_io_sync
1        5.3e-04  exit_sem
1        5.3e-04  file_read_actor
1        5.3e-04  filldir64
1        5.3e-04  find_or_create_page
1        5.3e-04  find_vma_prev
1        5.3e-04  flush_thread
1        5.3e-04  free_pgd_range
1        5.3e-04  free_pgtables
1        5.3e-04  generic_fillattr
1        5.3e-04  get_unmapped_area
1        5.3e-04  inode_sub_bytes
1        5.3e-04  kthread_should_stop
1        5.3e-04  mm_init
1        5.3e-04  path_release
1        5.3e-04  pipe_release
1        5.3e-04  prio_tree_remove
1        5.3e-04  profile_munmap
1        5.3e-04  remove_wait_queue
1        5.3e-04  retint_check
1        5.3e-04  seq_puts
1        5.3e-04  strchr
1        5.3e-04  sys_brk
1        5.3e-04  sys_execve
1        5.3e-04  sys_mmap
1        5.3e-04  sys_munmap
1        5.3e-04  sys_read
1        5.3e-04  sys_rmdir
1        5.3e-04  sys_rt_sigprocmask
1        5.3e-04  truncate_complete_page
1        5.3e-04  tty_ldisc_deref
1        5.3e-04  tty_ldisc_try
1        5.3e-04  tty_termios_baud_rate
1        5.3e-04  tty_write
1        5.3e-04  udp_rcv
1        5.3e-04  vfs_rmdir
1        5.3e-04  vfs_stat_fd
1        5.3e-04  vfs_write
1        5.3e-04  vm_acct_memory
1        5.3e-04  vm_stat_account
1        5.3e-04  vma_adjust
1        5.3e-04  vma_prio_tree_remove
1        5.3e-04  wake_up_new_task


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 14:16                                                         ` Ingo Molnar
  2007-03-01 14:31                                                           ` Eric Dumazet
@ 2007-03-01 14:54                                                           ` Evgeniy Polyakov
  2007-03-01 15:09                                                             ` Ingo Molnar
  2007-03-01 19:31                                                             ` Davide Libenzi
  1 sibling, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 14:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 03:16:37PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
> > I can tell you that the problem (at least on my machine) comes from :
> > 
> > gettimeofday(&tm, NULL);
> > 
> > in evserver_epoll.c
> 
> yeah, that's another difference - especially if it's something like an 
> Athlon64 and gettimeofday falls back to pm-timer, that could explain the 
> performance difference. That's why i repeatedly asked Evgeniy to use the 
> /very same/ client function for both the epoll and the kevent test and 
> redo the measurements. The numbers are still highly suspect - and we are 
> already down from the prior claim of kevent being almost twice as fast 
> to a 25% difference.

There is no gettimeofday() in the running code anymore, and it was not
placed in the common server-processing code anyway, btw.

Ingo, do you really think I will send mails with faked benchmarks? :))

Btw, there was never an almost-twofold performance increase - epoll in my
tests always showed 4-5 thousand requests per second, kevent up to 7
thousand.

This is starting to look like ghost hunting, Ingo. You already said that
you do not see any need for kevent - have you changed your opinion on
that?

> 	Ingo

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 14:43                                               ` Evgeniy Polyakov
@ 2007-03-01 14:47                                                 ` Ingo Molnar
  2007-03-01 15:23                                                   ` Evgeniy Polyakov
  2007-03-01 14:57                                                 ` Evgeniy Polyakov
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 14:47 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> CPU: AMD64 processors, speed 2210.08 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
> samples  %        symbol name
> 195750   67.3097  cpu_idle
> 14111     4.8521  enter_idle
> 4979      1.7121  IRQ0x51_interrupt
> 4765      1.6385  tcp_v4_rcv

pretty much the only meaningful way to measure this is to:

- start a really long 'ab' testrun. Something like "ab -c 8000 -t 600".
- let the system get into 'steady state': i.e. CPU load at 100%
- reset the oprofile counters, then start an oprofile run for 60 
  seconds.
- stop the oprofile run.
- stop the test.

this way there won't be that many 'cpu_idle' entries in your profiles, 
and the profiles between the two event delivery mechanisms will be 
directly comparable.
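(For reference, the sequence above could be scripted roughly as below. This
is a dry-run sketch: the commands are echoed rather than executed, the
server URL is a placeholder, and the opcontrol flag names follow the legacy
oprofile 0.9.x control tool, so treat them as assumptions.)

```shell
# Dry-run sketch of the measurement sequence described above.  Nothing is
# executed against a real server; the steps are printed so they can be
# reviewed before running them for real.
STEPS='
ab -c 8000 -t 600 http://testserver/ &            # start a long ab load run
sleep 120                                         # wait for steady state: CPU load at 100%
opcontrol --reset                                 # zero the accumulated oprofile samples
opcontrol --start ; sleep 60 ; opcontrol --stop   # profile a 60-second window
kill %1                                           # stop the ab test
opreport --symbols                                # per-symbol report, comparable across runs
'
printf '%s\n' "$STEPS"
```

Resetting the counters only after the load is steady is the key step: it
keeps idle-time samples (cpu_idle, enter_idle) from dominating the profile.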

> In that tests I got epoll perf about 4400 req/s, kevent was about 
> 5300.

So we are now up to epoll being 83% of kevent's performance - while the 
noise in the numbers seen today alone is around 100% ... Could you update 
the files at the two URLs that you posted before with the code that you 
used for the above numbers:

   http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
   http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c

thanks,

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 13:12                                             ` Eric Dumazet
@ 2007-03-01 14:43                                               ` Evgeniy Polyakov
  2007-03-01 14:47                                                 ` Ingo Molnar
  2007-03-01 14:57                                                 ` Evgeniy Polyakov
  0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 14:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 890 bytes --]

On Thu, Mar 01, 2007 at 02:12:50PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Thursday 01 March 2007 12:47, Evgeniy Polyakov wrote:
> >
> > Could you provide at least remote way to find it?
> >
> 
> Sure :)
> 
> > I only found the same problem at
> > http://lkml.org/lkml/2006/10/27/3
> >
> > but without any hits to solve the problem.
> >
> > I will try CVS oprofile, if it works I will provide details of course.
> >
> 
> # cat CVS/Root
> CVS/Root::pserver:anonymous@oprofile.cvs.sourceforge.net:/cvsroot/oprofile
> 
> # cvs diff >/tmp/oprofile.diff
> 
> Hope it helps

One of the hunks failed, since it was in CVS already.
After setting up some mirrors, I've installed all bits needed for
oprofile.
Attached kevent and epoll profiles.

In these tests I got epoll performance of about 4400 req/s; kevent was about 5300.

epoll does not contain a gettimeofday() call.

-- 
	Evgeniy Polyakov

[-- Attachment #2: profile.epoll --]
[-- Type: text/plain, Size: 14192 bytes --]

CPU: AMD64 processors, speed 2210.08 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
195750   67.3097  cpu_idle
14111     4.8521  enter_idle
4979      1.7121  IRQ0x51_interrupt
4765      1.6385  tcp_v4_rcv
3316      1.1402  tcp_ack
3138      1.0790  kmem_cache_free
2619      0.9006  kfree
2323      0.7988  memset_c
1747      0.6007  schedule
1682      0.5784  csum_partial_copy_generic
1646      0.5660  ip_output
1563      0.5374  tcp_rcv_state_process
1506      0.5178  dev_queue_xmit
1412      0.4855  handle_IRQ_event
1266      0.4353  ip_rcv
1026      0.3528  ip_queue_xmit
1004      0.3452  __do_softirq
1001      0.3442  ip_local_deliver
906       0.3115  csum_partial
902       0.3102  ip_route_input
889       0.3057  __d_lookup
880       0.3026  kmem_cache_alloc
847       0.2912  tcp_v4_do_rcv
841       0.2892  netif_receive_skb
830       0.2854  tcp_sendmsg
819       0.2816  system_call
788       0.2710  kfree_skbmem
780       0.2682  tcp_transmit_skb
742       0.2551  preempt_schedule
731       0.2514  __tcp_push_pending_frames
699       0.2404  __link_path_walk
687       0.2362  pfifo_fast_dequeue
672       0.2311  local_bh_enable
657       0.2259  __alloc_skb
650       0.2235  net_rx_action
627       0.2156  pfifo_fast_enqueue
583       0.2005  sock_wfree
571       0.1963  get_unused_fd
547       0.1881  tcp_parse_options
546       0.1877  copy_user_generic_string
533       0.1833  _atomic_dec_and_lock
529       0.1819  number
524       0.1802  ret_from_intr
509       0.1750  skb_clone
507       0.1743  fget
507       0.1743  tcp_rcv_established
498       0.1712  __kfree_skb
492       0.1692  tcp_poll
481       0.1654  rt_hash_code
471       0.1620  dput
454       0.1561  sock_def_readable
422       0.1451  mask_and_ack_8259A
421       0.1448  sysret_check
419       0.1441  __fput
413       0.1420  exit_idle
396       0.1362  ip_append_data
374       0.1286  sock_poll
371       0.1276  tcp_data_queue
359       0.1234  __tcp_select_window
356       0.1224  tcp_recvmsg
348       0.1197  lock_timer_base
340       0.1169  cache_alloc_refill
338       0.1162  thread_return
319       0.1097  sys_close
318       0.1093  do_path_lookup
318       0.1093  ret_from_sys_call
311       0.1069  vsnprintf
307       0.1056  eth_type_trans
303       0.1042  find_next_zero_bit
302       0.1038  __mod_timer
298       0.1025  d_alloc
296       0.1018  rb_erase
293       0.1007  call_softirq
290       0.0997  __dentry_open
276       0.0949  cache_grow
274       0.0942  __ip_route_output_key
273       0.0939  try_to_wake_up
258       0.0887  dentry_iput
258       0.0887  sk_stream_mem_schedule
257       0.0884  do_lookup
244       0.0839  strncpy_from_user
234       0.0805  do_generic_mapping_read
231       0.0794  memcmp
229       0.0787  tcp_current_mss
228       0.0784  tcp_rtt_estimator
214       0.0736  restore_args
205       0.0705  sys_recvfrom
204       0.0701  fput
203       0.0698  tcp_send_fin
200       0.0688  release_sock
193       0.0664  memcpy_c
191       0.0657  common_interrupt
189       0.0650  fget_light
185       0.0636  skb_checksum
182       0.0626  generic_drop_inode
180       0.0619  do_sys_open
174       0.0598  get_page_from_freelist
168       0.0578  link_path_walk
165       0.0567  schedule_timeout
163       0.0560  del_timer
162       0.0557  rb_insert_color
160       0.0550  percpu_counter_mod
159       0.0547  __up_read
155       0.0533  expand_files
154       0.0530  sys_fcntl
150       0.0516  tcp_v4_send_check
146       0.0502  fd_install
145       0.0499  bictcp_cong_avoid
143       0.0492  call_rcu
141       0.0485  __down_read
141       0.0485  sock_close
140       0.0481  copy_page_c
138       0.0475  __skb_checksum_complete
138       0.0475  lookup_mnt
137       0.0471  getname
132       0.0454  generic_permission
131       0.0450  find_get_page
130       0.0447  __do_page_cache_readahead
130       0.0447  update_send_head
127       0.0437  get_empty_filp
126       0.0433  __path_lookup_intent_open
124       0.0426  mod_timer
122       0.0420  half_md4_transform
121       0.0416  page_fault
121       0.0416  tcp_sync_mss
120       0.0413  __wake_up
120       0.0413  current_fs_time
119       0.0409  remove_wait_queue
118       0.0406  groups_search
116       0.0399  __handle_mm_fault
115       0.0395  tcp_send_ack
114       0.0392  get_task_mm
109       0.0375  tcp_snd_test
107       0.0368  new_inode
106       0.0364  sock_release
106       0.0364  tcp_init_tso_segs
104       0.0358  inotify_dentry_parent_queue_event
104       0.0358  try_to_del_timer_sync
102       0.0351  add_wait_queue
99        0.0340  cond_resched
97        0.0334  __follow_mount
96        0.0330  __put_unused_fd
95        0.0327  open_namei
94        0.0323  file_move
93        0.0320  clear_inode
93        0.0320  permission
92        0.0316  rw_verify_area
91        0.0313  may_open
91        0.0313  memmove
86        0.0296  __up_write
86        0.0296  dnotify_flush
86        0.0296  tcp_select_initial_window
84        0.0289  sock_common_recvmsg
83        0.0285  __wake_up_bit
83        0.0285  page_cache_readahead
82        0.0282  IRQ0x20_interrupt
81        0.0279  put_page
77        0.0265  inet_sendmsg
76        0.0261  skb_copy_datagram_iovec
76        0.0261  tcp_init_cwnd
75        0.0258  sock_recvmsg
75        0.0258  tcp_event_data_recv
74        0.0254  inet_sk_rebuild_header
74        0.0254  locks_remove_posix
73        0.0251  d_instantiate
68        0.0234  sock_sendmsg
66        0.0227  __rb_rotate_right
65        0.0224  __down_write_nested
65        0.0224  retint_kernel
64        0.0220  alloc_inode
64        0.0220  clear_page_c
64        0.0220  init_timer
63        0.0217  mntput_no_expire
63        0.0217  sk_reset_timer
63        0.0217  tcp_setsockopt
61        0.0210  tcp_check_space
61        0.0210  unmap_vmas
60        0.0206  filp_close
60        0.0206  tcp_v4_tw_remember_stamp
59        0.0203  tcp_rcv_space_adjust
56        0.0193  inotify_inode_queue_event
54        0.0186  __delay
54        0.0186  touch_atime
53        0.0182  sk_stream_rfree
53        0.0182  sys_open
53        0.0182  tcp_cwnd_validate
50        0.0172  copy_to_user
50        0.0172  file_kill
49        0.0168  generic_file_open
46        0.0158  tcp_cong_avoid
44        0.0151  do_filp_open
44        0.0151  inet_sock_destruct
43        0.0148  __rb_rotate_left
43        0.0148  __tcp_ack_snd_check
43        0.0148  inode_init_once
43        0.0148  sprintf
42        0.0144  exit_intr
42        0.0144  sock_fasync
42        0.0144  wake_up_inode
41        0.0141  memset
39        0.0134  __alloc_pages
39        0.0134  free_hot_cold_page
39        0.0134  vfs_permission
38        0.0131  bit_waitqueue
38        0.0131  do_page_fault
38        0.0131  locks_remove_flock
38        0.0131  tcp_unhash
36        0.0124  del_timer_sync
36        0.0124  iput
36        0.0124  iret_label
35        0.0120  file_free_rcu
35        0.0120  file_ra_state_init
35        0.0120  tcp_v4_destroy_sock
34        0.0117  sk_stop_timer
33        0.0113  __lookup_mnt
33        0.0113  inet_getname
33        0.0113  zone_watermark_ok
30        0.0103  copy_page_range
29        0.0100  blockable_page_cache_readahead
29        0.0100  do_wp_page
29        0.0100  hrtimer_run_queues
29        0.0100  sk_alloc
26        0.0089  _spin_lock_bh
26        0.0089  in_group_p
25        0.0086  __put_user_8
25        0.0086  __wake_up_locked
25        0.0086  find_vma
25        0.0086  init_once
24        0.0083  __lock_text_start
23        0.0079  destroy_inode
23        0.0079  tcp_slow_start
22        0.0076  mark_page_accessed
21        0.0072  flush_tlb_page
20        0.0069  _read_lock_irqsave
20        0.0069  memcpy_toiovec
18        0.0062  apic_timer_interrupt
18        0.0062  retint_restore_args
18        0.0062  vm_normal_page
14        0.0048  __down_write
14        0.0048  __tcp_checksum_complete_user
14        0.0048  invalidate_inode_buffers
13        0.0045  mutex_lock
11        0.0038  __get_user_4
11        0.0038  error_exit
11        0.0038  wake_up_bit
10        0.0034  __block_write_full_page
10        0.0034  __down_read_trylock
10        0.0034  copy_from_user
9         0.0031  copy_process
9         0.0031  timespec_trunc
7         0.0024  dup_fd
7         0.0024  filemap_nopage
7         0.0024  lru_cache_add_active
7         0.0024  memcpy
7         0.0024  nameidata_to_filp
7         0.0024  unlink_file_vma
6         0.0021  __find_get_block
6         0.0021  __mutex_init
6         0.0021  find_get_pages_tag
6         0.0021  inode_has_buffers
6         0.0021  mmput
6         0.0021  page_remove_rmap
6         0.0021  sys_mprotect
5         0.0017  __mark_inode_dirty
5         0.0017  do_mmap_pgoff
5         0.0017  error_sti
5         0.0017  free_hot_page
5         0.0017  load_elf_binary
5         0.0017  page_waitqueue
5         0.0017  radix_tree_tag_clear
5         0.0017  rcu_start_batch
5         0.0017  retint_swapgs
4         0.0014  _spin_lock_irqsave
4         0.0014  copy_strings
4         0.0014  free_pages
4         0.0014  free_pgd_range
4         0.0014  free_pgtables
4         0.0014  page_add_file_rmap
4         0.0014  run_local_timers
4         0.0014  sys_rt_sigprocmask
4         0.0014  unlock_page
4         0.0014  vma_link
3         0.0010  __getblk
3         0.0010  __pagevec_lru_add_active
3         0.0010  __strnlen_user
3         0.0010  _write_lock_bh
3         0.0010  anon_vma_prepare
3         0.0010  anon_vma_unlink
3         0.0010  bio_alloc_bioset
3         0.0010  dnotify_parent
3         0.0010  do_exit
3         0.0010  do_wait
3         0.0010  exit_mmap
3         0.0010  kthread_should_stop
3         0.0010  prio_tree_insert
3         0.0010  remove_vma
3         0.0010  retint_check
3         0.0010  vma_prio_tree_add
2        6.9e-04  __block_prepare_write
2        6.9e-04  __bread
2        6.9e-04  __end_that_request_first
2        6.9e-04  __find_get_block_slow
2        6.9e-04  __free_pages
2        6.9e-04  __vm_enough_memory
2        6.9e-04  alloc_page_vma
2        6.9e-04  arch_unmap_area
2        6.9e-04  bio_init
2        6.9e-04  blk_recount_segments
2        6.9e-04  block_write_full_page
2        6.9e-04  clear_page_dirty_for_io
2        6.9e-04  do_group_exit
2        6.9e-04  do_munmap
2        6.9e-04  do_notify_resume
2        6.9e-04  find_vma_prepare
2        6.9e-04  find_vma_prev
2        6.9e-04  generic_fillattr
2        6.9e-04  get_index
2        6.9e-04  get_unmapped_area
2        6.9e-04  get_vma_policy
2        6.9e-04  prio_tree_remove
2        6.9e-04  prio_tree_replace
2        6.9e-04  proc_get_inode
2        6.9e-04  proc_lookup
2        6.9e-04  release_pages
2        6.9e-04  show_vfsmnt
2        6.9e-04  strchr
2        6.9e-04  sys_dup2
2        6.9e-04  sys_faccessat
2        6.9e-04  sys_mmap
2        6.9e-04  test_set_page_writeback
2        6.9e-04  unmap_region
2        6.9e-04  vm_acct_memory
2        6.9e-04  vm_stat_account
2        6.9e-04  vma_adjust
2        6.9e-04  vma_merge
2        6.9e-04  vma_prio_tree_remove
2        6.9e-04  zonelist_policy
1        3.4e-04  __brelse
1        3.4e-04  __clear_user
1        3.4e-04  __d_path
1        3.4e-04  __lookup_hash
1        3.4e-04  __make_request
1        3.4e-04  __page_set_anon_rmap
1        3.4e-04  __pagevec_lru_add
1        3.4e-04  __pmd_alloc
1        3.4e-04  __pollwait
1        3.4e-04  __pte_alloc
1        3.4e-04  __pud_alloc
1        3.4e-04  __put_user_4
1        3.4e-04  add_to_page_cache
1        3.4e-04  alloc_pages_current
1        3.4e-04  blk_rq_map_sg
1        3.4e-04  can_share_swap_page
1        3.4e-04  can_vma_merge_after
1        3.4e-04  copy_semundo
1        3.4e-04  cp_new_stat
1        3.4e-04  cpuset_update_task_memory_state
1        3.4e-04  d_delete
1        3.4e-04  d_path
1        3.4e-04  dcache_readdir
1        3.4e-04  default_llseek
1        3.4e-04  do_arch_prctl
1        3.4e-04  do_brk
1        3.4e-04  do_fsync
1        3.4e-04  do_mpage_readpage
1        3.4e-04  do_sigaction
1        3.4e-04  do_sigaltstack
1        3.4e-04  dupfd
1        3.4e-04  elf_map
1        3.4e-04  end_page_writeback
1        3.4e-04  error_entry
1        3.4e-04  exit_sem
1        3.4e-04  fasync_helper
1        3.4e-04  file_permission
1        3.4e-04  file_read_actor
1        3.4e-04  flush_signal_handlers
1        3.4e-04  flush_thread
1        3.4e-04  generic_delete_inode
1        3.4e-04  generic_file_aio_write
1        3.4e-04  generic_file_buffered_write
1        3.4e-04  generic_file_mmap
1        3.4e-04  get_request_wait
1        3.4e-04  get_signal_to_deliver
1        3.4e-04  get_stack
1        3.4e-04  hrtimer_init
1        3.4e-04  kill_fasync
1        3.4e-04  kmem_flagcheck
1        3.4e-04  meminfo_read_proc
1        3.4e-04  mempool_free
1        3.4e-04  memscan
1        3.4e-04  mutex_unlock
1        3.4e-04  notify_change
1        3.4e-04  nr_blockdev_pages
1        3.4e-04  page_add_new_anon_rmap
1        3.4e-04  path_release
1        3.4e-04  pipe_ioctl
1        3.4e-04  pipe_iov_copy_from_user
1        3.4e-04  poll_freewait
1        3.4e-04  proc_file_read
1        3.4e-04  proc_pident_lookup
1        3.4e-04  ptregscall_common
1        3.4e-04  rb_next
1        3.4e-04  recalc_bh_state
1        3.4e-04  release_task
1        3.4e-04  retint_careful
1        3.4e-04  schedule_tail
1        3.4e-04  seq_escape
1        3.4e-04  set_brk
1        3.4e-04  set_normalized_timespec
1        3.4e-04  si_swapinfo
1        3.4e-04  split_vma
1        3.4e-04  stub_execve
1        3.4e-04  sync_sb_inodes
1        3.4e-04  sys_brk
1        3.4e-04  sys_lseek
1        3.4e-04  sys_munmap
1        3.4e-04  sys_newfstat
1        3.4e-04  sys_newstat
1        3.4e-04  sys_select
1        3.4e-04  sys_write
1        3.4e-04  truncate_complete_page
1        3.4e-04  tty_ioctl
1        3.4e-04  udp_rcv
1        3.4e-04  unlock_buffer
1        3.4e-04  vfs_readdir
1        3.4e-04  vma_prio_tree_insert
1        3.4e-04  worker_thread

[-- Attachment #3: profile.kevent --]
[-- Type: text/plain, Size: 12581 bytes --]

CPU: AMD64 processors, speed 2210.08 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
124605   59.0887  cpu_idle
13754     6.5223  enter_idle
4677      2.2179  tcp_v4_rcv
3888      1.8437  IRQ0x51_interrupt
3115      1.4772  tcp_ack
2571      1.2192  kmem_cache_free
2151      1.0200  kfree
2121      1.0058  memset_c
1632      0.7739  csum_partial_copy_generic
1611      0.7639  ip_output
1488      0.7056  schedule
1389      0.6587  dev_queue_xmit
1352      0.6411  tcp_rcv_state_process
1272      0.6032  ip_rcv
1184      0.5615  handle_IRQ_event
951       0.4510  ip_queue_xmit
916       0.4344  __do_softirq
900       0.4268  ip_route_input
888       0.4211  csum_partial
870       0.4126  tcp_v4_do_rcv
856       0.4059  ip_local_deliver
851       0.4036  system_call
829       0.3931  netif_receive_skb
812       0.3851  tcp_sendmsg
812       0.3851  tcp_transmit_skb
805       0.3817  __alloc_skb
683       0.3239  kmem_cache_alloc
683       0.3239  local_bh_enable
678       0.3215  __tcp_push_pending_frames
673       0.3191  fget
665       0.3153  __d_lookup
653       0.3097  pfifo_fast_enqueue
639       0.3030  copy_user_generic_string
607       0.2878  preempt_schedule
606       0.2874  net_rx_action
593       0.2812  pfifo_fast_dequeue
589       0.2793  kfree_skbmem
552       0.2618  __fput
550       0.2608  __link_path_walk
528       0.2504  skb_clone
523       0.2480  tcp_parse_options
515       0.2442  ret_from_intr
512       0.2428  number
496       0.2352  _atomic_dec_and_lock
493       0.2338  __kfree_skb
477       0.2262  rt_hash_code
476       0.2257  sock_wfree
459       0.2177  tcp_poll
449       0.2129  tcp_rcv_established
415       0.1968  dput
398       0.1887  sysret_check
366       0.1736  exit_idle
364       0.1726  __mod_timer
352       0.1669  tcp_recvmsg
352       0.1669  thread_return
346       0.1641  sys_close
341       0.1617  get_unused_fd
340       0.1612  rb_erase
337       0.1598  __tcp_select_window
331       0.1570  lock_timer_base
316       0.1498  mask_and_ack_8259A
299       0.1418  ip_append_data
290       0.1375  cache_alloc_refill
286       0.1356  d_alloc
269       0.1276  ret_from_sys_call
268       0.1271  do_path_lookup
267       0.1266  call_softirq
265       0.1257  eth_type_trans
262       0.1242  tcp_current_mss
262       0.1242  tcp_data_queue
262       0.1242  vsnprintf
261       0.1238  __ip_route_output_key
255       0.1209  sk_stream_mem_schedule
249       0.1181  cache_grow
247       0.1171  tcp_rtt_estimator
239       0.1133  __dentry_open
221       0.1048  sys_fcntl
220       0.1043  find_next_zero_bit
216       0.1024  dentry_iput
208       0.0986  try_to_wake_up
197       0.0934  call_rcu
196       0.0929  strncpy_from_user
195       0.0925  do_lookup
191       0.0906  sock_recvmsg
189       0.0896  tcp_send_fin
188       0.0892  sys_recvfrom
182       0.0863  sock_def_readable
178       0.0844  restore_args
176       0.0835  common_interrupt
174       0.0825  release_sock
171       0.0811  do_generic_mapping_read
167       0.0792  schedule_timeout
163       0.0773  generic_drop_inode
163       0.0773  get_page_from_freelist
163       0.0773  percpu_counter_mod
160       0.0759  skb_checksum
159       0.0754  del_timer
155       0.0735  do_sys_open
153       0.0726  memcpy_c
149       0.0707  memcmp
145       0.0688  __skb_checksum_complete
144       0.0683  link_path_walk
144       0.0683  tcp_v4_send_check
141       0.0669  fd_install
140       0.0664  get_empty_filp
139       0.0659  fget_light
131       0.0621  current_fs_time
126       0.0598  mod_timer
125       0.0593  bictcp_cong_avoid
123       0.0583  tcp_init_tso_segs
121       0.0574  update_send_head
120       0.0569  __put_unused_fd
119       0.0564  __do_page_cache_readahead
119       0.0564  alloc_inode
118       0.0560  rb_insert_color
115       0.0545  prepare_to_wait
112       0.0531  locks_remove_posix
112       0.0531  new_inode
111       0.0526  half_md4_transform
109       0.0517  fput
108       0.0512  tcp_sync_mss
105       0.0498  get_task_mm
104       0.0493  clear_inode
102       0.0484  find_get_page
102       0.0484  tcp_select_initial_window
101       0.0479  lookup_mnt
101       0.0479  tcp_send_ack
99        0.0469  sock_close
98        0.0465  try_to_del_timer_sync
97        0.0460  page_cache_readahead
97        0.0460  tcp_snd_test
94        0.0446  generic_permission
94        0.0446  getname
92        0.0436  may_open
91        0.0432  tcp_event_data_recv
90        0.0427  open_namei
89        0.0422  IRQ0x20_interrupt
88        0.0417  inotify_dentry_parent_queue_event
87        0.0413  put_page
85        0.0403  copy_page_c
84        0.0398  d_instantiate
81        0.0384  finish_wait
80        0.0379  __wake_up
79        0.0375  expand_files
77        0.0365  groups_search
75        0.0356  inet_sendmsg
75        0.0356  skb_copy_datagram_iovec
73        0.0346  file_free_rcu
72        0.0341  __path_lookup_intent_open
72        0.0341  permission
71        0.0337  __follow_mount
71        0.0337  memmove
70        0.0332  dnotify_flush
69        0.0327  cond_resched
69        0.0327  igrab
68        0.0322  sk_reset_timer
66        0.0313  sockfd_lookup
65        0.0308  tcp_setsockopt
64        0.0303  sock_sendmsg
62        0.0294  __wake_up_bit
62        0.0294  file_move
62        0.0294  inotify_inode_queue_event
62        0.0294  retint_kernel
62        0.0294  tcp_v4_tw_remember_stamp
60        0.0285  sprintf
56        0.0266  copy_to_user
56        0.0266  sock_common_recvmsg
56        0.0266  tcp_check_space
56        0.0266  tcp_cwnd_validate
55        0.0261  file_kill
55        0.0261  inet_sk_rebuild_header
55        0.0261  tcp_init_cwnd
54        0.0256  __handle_mm_fault
54        0.0256  init_timer
52        0.0247  filp_close
52        0.0247  mutex_unlock
52        0.0247  rw_verify_area
51        0.0242  sock_release
48        0.0228  __tcp_ack_snd_check
48        0.0228  sys_open
47        0.0223  inode_init_once
47        0.0223  locks_remove_flock
47        0.0223  mntput_no_expire
47        0.0223  touch_atime
46        0.0218  sk_stream_rfree
45        0.0213  page_fault
45        0.0213  unmap_vmas
45        0.0213  wake_up_inode
44        0.0209  clear_page_c
44        0.0209  memset
42        0.0199  __delay
42        0.0199  tcp_cong_avoid
41        0.0194  __rb_rotate_left
39        0.0185  do_filp_open
39        0.0185  memcpy_toiovec
39        0.0185  sock_fasync
38        0.0180  __put_user_8
38        0.0180  exit_intr
35        0.0166  zone_watermark_ok
34        0.0161  iput
33        0.0156  iret_label
32        0.0152  free_hot_cold_page
32        0.0152  tcp_unhash
31        0.0147  __alloc_pages
31        0.0147  sk_alloc
30        0.0142  __lookup_mnt
30        0.0142  mutex_lock
29        0.0138  generic_file_open
28        0.0133  _spin_lock_bh
28        0.0133  bit_waitqueue
28        0.0133  tcp_rcv_space_adjust
27        0.0128  inet_sock_destruct
26        0.0123  sk_stop_timer
26        0.0123  tcp_slow_start
24        0.0114  hrtimer_run_queues
24        0.0114  inet_getname
22        0.0104  file_ra_state_init
22        0.0104  vfs_permission
21        0.0100  blockable_page_cache_readahead
21        0.0100  tcp_v4_destroy_sock
20        0.0095  init_once
19        0.0090  do_wp_page
18        0.0085  destroy_inode
16        0.0076  apic_timer_interrupt
16        0.0076  copy_from_user
16        0.0076  del_timer_sync
15        0.0071  in_group_p
15        0.0071  invalidate_inode_buffers
14        0.0066  do_page_fault
13        0.0062  mark_page_accessed
13        0.0062  retint_restore_args
12        0.0057  __get_user_4
11        0.0052  copy_page_range
11        0.0052  find_vma
11        0.0052  wake_up_bit
10        0.0047  __down_read
9         0.0043  rcu_start_batch
8         0.0038  __tcp_checksum_complete_user
8         0.0038  __up_read
8         0.0038  flush_tlb_page
8         0.0038  free_pages
8         0.0038  timespec_trunc
6         0.0028  __find_get_block
6         0.0028  __rb_rotate_right
6         0.0028  error_sti
6         0.0028  kmem_flagcheck
6         0.0028  retint_swapgs
5         0.0024  __down_read_trylock
5         0.0024  _spin_lock_irqsave
5         0.0024  inode_has_buffers
5         0.0024  retint_check
4         0.0019  __mutex_init
4         0.0019  nameidata_to_filp
4         0.0019  prio_tree_insert
4         0.0019  proc_lookup
4         0.0019  run_workqueue
3         0.0014  __getblk
3         0.0014  __iget
3         0.0014  __pagevec_lru_add_active
3         0.0014  __strnlen_user
3         0.0014  _write_lock_bh
3         0.0014  copy_process
3         0.0014  error_exit
3         0.0014  filemap_nopage
3         0.0014  mmput
3         0.0014  sys_mprotect
3         0.0014  unlink_file_vma
3         0.0014  vma_adjust
2        9.5e-04  __clear_user
2        9.5e-04  __end_that_request_first
2        9.5e-04  __pte_alloc
2        9.5e-04  __set_page_dirty_nobuffers
2        9.5e-04  _read_lock_bh
2        9.5e-04  _read_lock_irqsave
2        9.5e-04  add_to_page_cache
2        9.5e-04  anon_vma_prepare
2        9.5e-04  anon_vma_unlink
2        9.5e-04  d_lookup
2        9.5e-04  do_exit
2        9.5e-04  do_mmap_pgoff
2        9.5e-04  do_munmap
2        9.5e-04  do_wait
2        9.5e-04  dup_fd
2        9.5e-04  flush_signal_handlers
2        9.5e-04  free_hot_page
2        9.5e-04  generic_file_aio_read
2        9.5e-04  generic_fillattr
2        9.5e-04  generic_make_request
2        9.5e-04  kthread_should_stop
2        9.5e-04  load_elf_binary
2        9.5e-04  lru_add_drain
2        9.5e-04  lru_cache_add_active
2        9.5e-04  memcpy
2        9.5e-04  mempool_free
2        9.5e-04  mm_init
2        9.5e-04  page_remove_rmap
2        9.5e-04  put_unused_fd
2        9.5e-04  radix_tree_tag_clear
2        9.5e-04  rb_next
2        9.5e-04  retint_with_reschedule
2        9.5e-04  run_local_timers
2        9.5e-04  strchr
2        9.5e-04  vfs_ioctl
2        9.5e-04  vma_link
1        4.7e-04  __add_entropy_words
1        4.7e-04  __anon_vma_link
1        4.7e-04  __brelse
1        4.7e-04  __d_path
1        4.7e-04  __down_write
1        4.7e-04  __find_get_block_slow
1        4.7e-04  __free_pages
1        4.7e-04  __generic_file_aio_write_nolock
1        4.7e-04  __lookup_hash
1        4.7e-04  __make_request
1        4.7e-04  __mark_inode_dirty
1        4.7e-04  __page_set_anon_rmap
1        4.7e-04  __pagevec_lru_add
1        4.7e-04  __pagevec_release
1        4.7e-04  __remove_shared_vm_struct
1        4.7e-04  __up_write
1        4.7e-04  __user_walk_fd
1        4.7e-04  __vm_enough_memory
1        4.7e-04  __writeback_single_inode
1        4.7e-04  alloc_buffer_head
1        4.7e-04  alloc_page_buffers
1        4.7e-04  alloc_pages_current
1        4.7e-04  bio_alloc_bioset
1        4.7e-04  blk_done_softirq
1        4.7e-04  blk_recount_segments
1        4.7e-04  blk_rq_map_sg
1        4.7e-04  can_vma_merge_before
1        4.7e-04  cfq_set_request
1        4.7e-04  copy_strings
1        4.7e-04  cpuset_update_task_memory_state
1        4.7e-04  create_empty_buffers
1        4.7e-04  deny_write_access
1        4.7e-04  do_arch_prctl
1        4.7e-04  do_mpage_readpage
1        4.7e-04  do_select
1        4.7e-04  do_sigaction
1        4.7e-04  exit_mm
1        4.7e-04  exit_mmap
1        4.7e-04  find_vma_prepare
1        4.7e-04  find_vma_prev
1        4.7e-04  free_pgtables
1        4.7e-04  get_locked_pte
1        4.7e-04  get_request_wait
1        4.7e-04  get_signal_to_deliver
1        4.7e-04  get_unmapped_area
1        4.7e-04  init_once
1        4.7e-04  init_request_from_bio
1        4.7e-04  inode_setattr
1        4.7e-04  ll_rw_block
1        4.7e-04  make_ahead_window
1        4.7e-04  mapping_tagged
1        4.7e-04  mempool_alloc
1        4.7e-04  n_tty_write_wakeup
1        4.7e-04  notify_change
1        4.7e-04  page_add_file_rmap
1        4.7e-04  pipe_iov_copy_from_user
1        4.7e-04  pipe_write
1        4.7e-04  preempt_schedule_irq
1        4.7e-04  proc_file_read
1        4.7e-04  remove_vma
1        4.7e-04  sched_balance_self
1        4.7e-04  sched_fork
1        4.7e-04  search_binary_handler
1        4.7e-04  sigprocmask
1        4.7e-04  stub_execve
1        4.7e-04  sync_dirty_buffer
1        4.7e-04  sys_execve
1        4.7e-04  sys_ftruncate
1        4.7e-04  tty_ldisc_ref_wait
1        4.7e-04  vfs_write
1        4.7e-04  writeback_inodes

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 14:16                                                         ` Ingo Molnar
@ 2007-03-01 14:31                                                           ` Eric Dumazet
  2007-03-01 14:27                                                             ` Ingo Molnar
  2007-03-01 14:54                                                           ` Evgeniy Polyakov
  1 sibling, 1 reply; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 14:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thursday 01 March 2007 15:16, Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> > I can tell you that the problem (at least on my machine) comes from :
> >
> > gettimeofday(&tm, NULL);
> >
> > in evserver_epoll.c
>
> yeah, that's another difference - especially if it's something like an
> Athlon64 and gettimeofday falls back to pm-timer, that could explain the
> performance difference. That's why i repeatedly asked Evgeniy to use the
> /very same/ client function for both the epoll and the kevent test and
> redo the measurements. The numbers are still highly suspect - and we are
> already down from the prior claim of kevent being almost twice as fast
> to a 25% difference.

Also, ab is quite lame... Maybe we could use an epoll-based 'stresser'.

On my machines (again ...), ab is the slow thing... not the 'server'

Small differences in behavior could have a big impact on ab (and make you 
think there is a problem on the remote side)


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 14:31                                                           ` Eric Dumazet
@ 2007-03-01 14:27                                                             ` Ingo Molnar
  0 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 14:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Evgeniy Polyakov, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> On my machines (again ...), ab is the slow thing... not the 'server'

Evgeniy said that both in the epoll and the kevent case the server side 
CPU was 98%-100% busy - so inefficiencies on the client side do not 
matter that much - the server is saturated. It's that "kevent is 25% 
faster than epoll" claim that i'm probing mainly.

	Ingo

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 13:32                                         ` Ingo Molnar
@ 2007-03-01 14:24                                           ` Evgeniy Polyakov
  0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 14:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 02:32:42PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > [...] that is why the number for kevent is higher - it uses an edge-triggered 
> > handler (which you asked to remove from epoll), [...]
> 
> no - i did not 'ask' it to be removed from epoll, i only pointed out 
> that with edge-triggered the results were highly unreliable here and 
> that with level-triggered it worked better. Just to make sure: if you 
> put back edge-triggered into evserver_epoll.c, do you get the same 
> numbers, and is CPU utilization still the same 98-100%?

No.
_Now_ it is about 1500-2000 req/sec with 10-20% CPU utilization. 
The 'Total transferred' and 'HTML transferred' numbers do not equal
80000 multiplied by the size of the page.

These tests are actually strange - I managed to get 9000 requests per
second from the epoll server (only once!) and 8900 from kevent (only
twice); sometimes they both drop down to 2300-2700 req/s.

> 	Ingo

-- 
	Evgeniy Polyakov

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 13:30                                                     ` Evgeniy Polyakov
@ 2007-03-01 14:19                                                       ` Eric Dumazet
  2007-03-01 14:16                                                         ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 14:19 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thursday 01 March 2007 14:30, Evgeniy Polyakov wrote:
> On Thu, Mar 01, 2007 at 02:11:18PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > ok?
>
> I understood you a couple of mails ago.
> No problem, I can put processing into the same function called from
> different servers :)
>
> > Btw., am i correct that in this particular 'ab' test, the 'immediately'
> > flag is always zero, i.e. kweb_kevent_remove() is always called?
>
> Yes.
>
> > 	Ingo

I can tell you that the problem (at least on my machine) comes from :

gettimeofday(&tm, NULL);

in evserver_epoll.c



* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 14:19                                                       ` Eric Dumazet
@ 2007-03-01 14:16                                                         ` Ingo Molnar
  2007-03-01 14:31                                                           ` Eric Dumazet
  2007-03-01 14:54                                                           ` Evgeniy Polyakov
  0 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 14:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Evgeniy Polyakov, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> I can tell you that the problem (at least on my machine) comes from :
> 
> gettimeofday(&tm, NULL);
> 
> in evserver_epoll.c

yeah, that's another difference - especially if it's something like an 
Athlon64 and gettimeofday falls back to pm-timer, that could explain the 
performance difference. That's why i repeatedly asked Evgeniy to use the 
/very same/ client function for both the epoll and the kevent test and 
redo the measurements. The numbers are still highly suspect - and we are 
already down from the prior claim of kevent being almost twice as fast 
to a 25% difference.

	Ingo

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 13:26                                       ` Evgeniy Polyakov
@ 2007-03-01 13:32                                         ` Ingo Molnar
  2007-03-01 14:24                                           ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 13:32 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> [...] that is why the number for kevent is higher - it uses an edge-triggered 
> handler (which you asked to remove from epoll), [...]

no - i did not 'ask' it to be removed from epoll, i only pointed out 
that with edge-triggered the results were highly unreliable here and 
that with level-triggered it worked better. Just to make sure: if you 
put back edge-triggered into evserver_epoll.c, do you get the same 
numbers, and is CPU utilization still the same 98-100%?

	Ingo
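
The edge- vs level-triggered difference being probed here can be shown in a
few lines; a minimal sketch (assuming Linux epoll, with a pipe standing in for
the client socket - not taken from evserver_epoll.c) of why an edge-triggered
handler stops reporting an fd that still holds unread data:

```c
/* Counts how many of two successive non-blocking epoll_wait() calls report
 * the read end of a pipe as ready when only part of the pending data is
 * drained in between.  Level-triggered reports readiness both times;
 * edge-triggered reports it only once, since no new data arrives to
 * generate a new edge. */
#include <sys/epoll.h>
#include <unistd.h>

static int readiness_reports(int edge_triggered)
{
	int pfd[2];
	struct epoll_event ev, out;
	char buf[4];
	int ep, hits = 0;

	if (pipe(pfd) < 0)
		return -1;
	ep = epoll_create1(0);

	ev.events = EPOLLIN | (edge_triggered ? EPOLLET : 0);
	ev.data.fd = pfd[0];
	epoll_ctl(ep, EPOLL_CTL_ADD, pfd[0], &ev);

	write(pfd[1], "12345678", 8);            /* 8 bytes pending */

	if (epoll_wait(ep, &out, 1, 0) == 1) {   /* both modes fire here */
		hits++;
		read(pfd[0], buf, sizeof(buf));  /* drain only 4 of 8 bytes */
	}
	if (epoll_wait(ep, &out, 1, 0) == 1)     /* only level-triggered re-fires */
		hits++;

	close(pfd[0]);
	close(pfd[1]);
	close(ep);
	return hits;
}
```

This is also why an edge-triggered server must fully drain the socket on each
wakeup - otherwise it never hears about the fd again - and why the two modes
can yield both different byte counts and unreliable timings.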

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 13:11                                                   ` Ingo Molnar
@ 2007-03-01 13:30                                                     ` Evgeniy Polyakov
  2007-03-01 14:19                                                       ` Eric Dumazet
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 13:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 02:11:18PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> ok?

I understood you a couple of mails ago.
No problem, I can put processing into the same function called from
different servers :)

> Btw., am i correct that in this particular 'ab' test, the 'immediately' 
> flag is always zero, i.e. kweb_kevent_remove() is always called?

Yes.

> 	Ingo

-- 
	Evgeniy Polyakov

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 12:34                                     ` Ingo Molnar
@ 2007-03-01 13:26                                       ` Evgeniy Polyakov
  2007-03-01 13:32                                         ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 13:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 01:34:23PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > Document Length:        3521 bytes
> 
> > Concurrency Level:      8000
> > Time taken for tests:   16.686737 seconds
> > Complete requests:      80000
> > Failed requests:        0
> > Write errors:           0
> > Total transferred:      309760000 bytes
> > HTML transferred:       281680000 bytes
> > Requests per second:    4794.23 [#/sec] (mean)
> 
> > Concurrency Level:      8000
> > Time taken for tests:   12.366775 seconds
> > Complete requests:      80000
> > Failed requests:        0
> > Write errors:           0
> > Total transferred:      317047104 bytes
> > HTML transferred:       288306522 bytes
> > Requests per second:    6468.95 [#/sec] (mean)
> 
> i'm wondering - how can the 'Total transferred' and 'HTML transferred' 
> numbers be different?
>
> Since document length is 3521, and the number of requests is 80000, the 
> correct 'HTML transferred' is 281680000 - which is the epoll result. The 
> kevent result shows more bytes transferred, which suggests that the 
> kevent loop is probably incorrect somewhere.
> 
> this might be some benign thing, but the /first/ thing you /have to/ do 
> before claiming that 'kevent is 25% faster than epoll' is to make sure 
> the results are totally reliable.

Kevent sent an additional 525 pages ((311792800-309760000)/3872) - that is why 
the number for kevent is higher - it uses an edge-triggered handler (which you 
asked to remove from epoll), which can produce false positives; for an exact 
result in that case, ret_data must be checked where the poll flags were 
stored (before). 'ab' does not count the additional data as new requests 
and does not include it in 'requests per second'.
Even if it could do so, an additional 500 requests cannot account for a 35%
higher rate.

For example, lighttpd results are the same for kevent and epoll, and the
'Total transferred' and 'HTML transferred' numbers change between runs for 
both epoll and kevent.

> 	Ingo

-- 
	Evgeniy Polyakov

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 11:47                                           ` Evgeniy Polyakov
@ 2007-03-01 13:12                                             ` Eric Dumazet
  2007-03-01 14:43                                               ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 13:12 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 456 bytes --]

On Thursday 01 March 2007 12:47, Evgeniy Polyakov wrote:
>
> Could you provide at least remote way to find it?
>

Sure :)

> I only found the same problem at
> http://lkml.org/lkml/2006/10/27/3
>
> but without any hits to solve the problem.
>
> I will try CVS oprofile, if it works I will provide details of course.
>

# cat CVS/Root
CVS/Root::pserver:anonymous@oprofile.cvs.sourceforge.net:/cvsroot/oprofile

# cvs diff >/tmp/oprofile.diff

Hope it helps

[-- Attachment #2: oprofile.diff --]
[-- Type: text/x-diff, Size: 1663 bytes --]

Index: libop/op_alloc_counter.c
===================================================================
RCS file: /cvsroot/oprofile/oprofile/libop/op_alloc_counter.c,v
retrieving revision 1.8
diff -r1.8 op_alloc_counter.c
14a15,16
> #include <ctype.h>
> #include <dirent.h>
133c135
< 			return 0;
---
> 			continue;
145a148,183
> /* determine which directories are counter directories
>  */
> static int perfcounterdir(const struct dirent * entry)
> {
> 	return (isdigit(entry->d_name[0]));
> }
> 
> 
> /**
>  * @param mask pointer where to place bit mask of unavailable counters
>  *
>  * return >= 0 number of counters that are available
>  *        < 0  could not determine number of counters
>  *
>  */
> static int op_get_counter_mask(u32 * mask)
> {
> 	struct dirent **counterlist;
> 	int count, i;
> 	/* assume nothing is available */
> 	u32 available=0;
> 
> 	count = scandir("/dev/oprofile", &counterlist, perfcounterdir,
> 			alphasort);
> 	if (count < 0)
> 		/* unable to determine bit mask */
> 		return -1;
> 	/* convert to bit map (0 where counter exists) */
> 	for (i=0; i<count; ++i) {
> 		available |= 1 << atoi(counterlist[i]->d_name);
> 		free(counterlist[i]);
> 	}
> 	*mask=~available;
> 	free(counterlist);
> 	return count;
> }
152a191
> 	u32 unavailable_counters = 0;
154c193,195
< 	nr_counters = op_get_nr_counters(cpu_type);
---
> 	nr_counters = op_get_counter_mask(&unavailable_counters);
> 	if (nr_counters < 0) 
> 		nr_counters = op_get_nr_counters(cpu_type);
162c203,204
< 	if (!allocate_counter(ctr_arc, nr_events, 0, 0, counter_map)) {
---
> 	if (!allocate_counter(ctr_arc, nr_events, 0, unavailable_counters,
> 			      counter_map)) {

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 13:01                                                 ` Evgeniy Polyakov
@ 2007-03-01 13:11                                                   ` Ingo Molnar
  2007-03-01 13:30                                                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 13:11 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > i dont care whether they are separate or not - but you have not 
> > replied to the request that there be a handle_web_request() function 
> > in /both/ files, which is precisely the same function. I didnt ask 
> > you to merge the two files - i only asked for the two web handling 
> > functions to be one and the same function.
> 
> They are not the same in general - if a kevent is ready immediately, it is 
> not removed from the kevent tree; but the current kevent server always takes 
> the not-immediately path for the lighttpd tests - so the functions are the same:
> open()
> sendfile()
> cork_off
> close(fd)
> close(s)
> remove_event_from_the_kernel
> 
> with the same parameters.

you /STILL/ dont understand. I'm only talking about evserver_epoll.c and 
evserver_kevent.c. Not about lighttpd. Not about historic reasons. I 
simply suggested a common-sense change:

| | Would it be so hard to introduce a single handle_web_request() 
| | function that is exactly the same in the two tests? All the queueing 
| | details (which are of course different in the epoll and the kevent 
| | case) should be in the client function, which calls 
| | handle_web_request().

i.e. put remove_event_from_the_kernel() (kweb_kevent_remove()) and 
evtest_remove()) into a SEPARATE client function, which calls the 
/common/ handle_web_request(sock) function. You can do the 
immediate-removal in that separate, kevent-specific client function - 
but the socket function, handle_web_request(sock) should be /perfectly 
identical/ in the two files.

I.e.:

static inline int handle_web_request(int s)
{
	int err, fd, on = 0;
	off_t offset = 0;
	int count = 40960;
	char path[] = "/tmp/index.html";
	char buf[4096];
		
	err = recv(s, buf, sizeof(buf), 0);
	if (err <= 0)
		return err;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return fd;

	err = sendfile(s, fd, &offset, count);
	if (err < 0) {
		ulog_err("Failed send %d bytes: fd=%d.\n", count, s);
		close(fd);
		return err;
	}

	setsockopt(s, SOL_TCP, TCP_CORK, &on, sizeof(on));
	close(fd);
	close(s); /* No keepalive */

	return 0;
}

And in evserver_epoll.c do this:

static int evtest_callback_client(int s)
{
	int err = handle_web_request(s);
	if (err)
		evtest_remove(s);
	return err;
}

and in evserver_kevent.c do this:

static int kweb_callback_client(struct ukevent *e, int im)
{
	int err = handle_web_request(e->id.raw[0]);
	if (err || !im)
		kweb_kevent_remove(e);
	return err;
}

ok?

Btw., am i correct that in this particular 'ab' test, the 'immediately' 
flag is always zero, i.e. kweb_kevent_remove() is always called?

	Ingo

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 12:43                                               ` Ingo Molnar
@ 2007-03-01 13:01                                                 ` Evgeniy Polyakov
  2007-03-01 13:11                                                   ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 13:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 01:43:36PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> I separated the epoll and kevent servers, since originally the kevent server 
> included additional kevent features, but then new ones were added and 
> I gradually moved it closer to the epoll case.
> 
> i dont care whether they are separate or not - but you have not replied 
> to the request that there be a handle_web_request() function in /both/ 
> files, which is precisely the same function. I didnt ask you to merge 
> the two files - i only asked for the two web handling functions to be 
> one and the same function.

They are not the same in general - if a kevent is ready immediately, it is 
not removed from the kevent tree; but the current kevent server always takes 
the not-immediately path for the lighttpd tests - so the functions are the same:
open()
sendfile()
cork_off
close(fd)
close(s)
remove_event_from_the_kernel

with the same parameters.

> 	Ingo

-- 
	Evgeniy Polyakov

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 12:10                                             ` Evgeniy Polyakov
@ 2007-03-01 12:43                                               ` Ingo Molnar
  2007-03-01 13:01                                                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 12:43 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> I separated the epoll and kevent servers, since originally the kevent server 
> included additional kevent features, but then new ones were added and 
> I gradually moved it closer to the epoll case.

i dont care whether they are separate or not - but you have not replied 
to the request that there be a handle_web_request() function in /both/ 
files, which is precisely the same function. I didnt ask you to merge 
the two files - i only asked for the two web handling functions to be 
one and the same function.

	Ingo

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 10:59                                   ` Evgeniy Polyakov
  2007-03-01 11:00                                     ` Ingo Molnar
  2007-03-01 11:14                                     ` Eric Dumazet
@ 2007-03-01 12:34                                     ` Ingo Molnar
  2007-03-01 13:26                                       ` Evgeniy Polyakov
  2007-03-01 16:56                                     ` David Lang
  3 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 12:34 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> Document Length:        3521 bytes

> Concurrency Level:      8000
> Time taken for tests:   16.686737 seconds
> Complete requests:      80000
> Failed requests:        0
> Write errors:           0
> Total transferred:      309760000 bytes
> HTML transferred:       281680000 bytes
> Requests per second:    4794.23 [#/sec] (mean)

> Concurrency Level:      8000
> Time taken for tests:   12.366775 seconds
> Complete requests:      80000
> Failed requests:        0
> Write errors:           0
> Total transferred:      317047104 bytes
> HTML transferred:       288306522 bytes
> Requests per second:    6468.95 [#/sec] (mean)

i'm wondering - how can the 'Total transferred' and 'HTML transferred' 
numbers be different?

Since document length is 3521, and the number of requests is 80000, the 
correct 'HTML transferred' is 281680000 - which is the epoll result. The 
kevent result shows more bytes transferred, which suggests that the 
kevent loop is probably incorrect somewhere.

this might be some benign thing, but the /first/ thing you /have to/ do 
before claiming that 'kevent is 25% faster than epoll' is to make sure 
the results are totally reliable.

	Ingo

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 11:28                                         ` Eric Dumazet
  2007-03-01 11:47                                           ` Evgeniy Polyakov
@ 2007-03-01 12:19                                           ` Evgeniy Polyakov
  1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 12:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 12:28:00PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> I used the CVS version of oprofile plus a patch you can find in the mailing 
> list archives. Dont remember exactly, since I hit this some months ago

Ugh, I started - but compiling from CVS requires about 40mb of additional
libs (according to debian testing dependencies on my very light
installation), so with my miserable 1-1.6 kb/sec link, do not expect it today :)

-- 
	Evgeniy Polyakov

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 11:47                                           ` Ingo Molnar
@ 2007-03-01 12:10                                             ` Evgeniy Polyakov
  2007-03-01 12:43                                               ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 12:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 12:47:35PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > I also changed client socket to nonblocking mode with the same result 
> > > in epoll server. If you will find it broken, please send me corrected 
> > > to test too.
> > 
> > this line in evserver_kevent.c looks a bit fishy:
> 
> this one in evserver_kevent.c looks problematic too:
> 
>         shutdown(s, SHUT_RDWR);
>         close(s);
> 
> as evserver_epoll.c only does:
> 
>         close(s);
> 
> again, there might be TCP control flow differences due to this. [ Or the 
> removal of this shutdown() call might be a small speedup for the kevent 
> case ;) ]

:)

> Also, the order of fd and socket close() is different in the two cases. 
> It shouldnt make any difference - but that too just makes the results 
> harder to trust. Would it be so hard to introduce a single 
> handle_web_request() function that is exactly the same in the two tests? 
> All the queueing details (which are of course different in the epoll and 
> the kevent case) should be in the client function, which calls 
> handle_web_request().

I've removed the shutdown - things are the same.

Sometimes kevent performance drops to lower numbers, and its graph of
times needed to handle events has high plateaus (with and without the
shutdown - it was always like that), like this:

Percentage of the requests served within a certain time (ms)
50%    128
66%    486
75%    505
80%    507
90%    732
95%   3087	// something is wrong at this point
98%   9058
99%   9072
100%  15562 (longest request)

It is possible that there are some other bugs in the server though,
which prevent sockets from being quickly closed and thus increase
processing time - I do not know for sure the root cause of that behavior.

I separated the epoll and kevent servers, since originally the kevent
server included additional kevent features, but as new ones were added
I slowly moved it towards the epoll-like case.

The current version of the server was a pre-test one for the lighttpd
patches, so essentially it should behave like the epoll one except for
minor details.

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 11:41                                         ` Ingo Molnar
  2007-03-01 11:47                                           ` Ingo Molnar
@ 2007-03-01 12:01                                           ` Evgeniy Polyakov
  1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 12:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 12:41:37PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > I also changed the client socket to nonblocking mode, with the same 
> > result in the epoll server. If you find it broken, please send me a 
> > corrected version to test too.
> 
> this line in evserver_kevent.c looks a bit fishy:
> 
>         err = recv(s, buf, 100, 0);
> 
> because on the evserver_epoll.c side the following is done:
> 
>         err = recv(s, buf, 4096, 0);
> 
> now, for 'ab', the request size is 76 bytes, so it should fit fine 
> functionality-wise. But the TCP stack might decide differently about 
> whether to return with a partial packet depending on how much data is 
> requested. I don't know whether it actually makes a difference in the 
> TCP flow decisions, or whether it makes a performance difference in your 
> test, but the safest would be to use 4096 in both cases.

Well, that would be quite strange - as far as I know the Linux network
stack (for which kevent was originally created, to support network AIO),
there should not be any difference.

Anyway, I've rerun the test with the same values:

# ab -c8000 -n80000 http://192.168.0.48/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.0.48 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests


Server Software:        Apache/1.3.27
Server Hostname:        192.168.0.48
Server Port:            80

Document Path:          /
Document Length:        3521 bytes

Concurrency Level:      8000
Time taken for tests:   18.398381 seconds
Complete requests:      80000
Failed requests:        0
Write errors:           0
Total transferred:      338738048 bytes
HTML transferred:       308031164 bytes
Requests per second:    4348.21 [#/sec] (mean)
Time per request:       1839.838 [ms] (mean)
Time per request:       0.230 [ms] (mean, across all concurrent
requests)
Transfer rate:          17979.73 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      148  795 196.9    808    3599
Processing:   824  882  39.7    878     986
Waiting:       59  426 212.6    423     914
Total:       1073 1678 200.8   1673    4579

Percentage of the requests served within a certain time (ms)
50%   1673
66%   1674
75%   1678
80%   1686
90%   1852
95%   1861
98%   1864
99%   1865
100%   4579 (longest request)

Essentially the same result (within the limits of some inaccuracy).

> in general, please make sure the exact same system calls are done in the 
> client function. (except of course for the event queueing syscalls 
> themselves)

Yes, that should be done, of course.
I even plan to build the same binary for both, but I also plan to turn
on a kevent optimization (mainly readiness-on-submit: when a requested
event (recv/send/anything) is ready immediately, kevent can return that
event from the submission syscall itself, without the additional
overhead of reading it from the ring or queue).

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 11:41                                         ` Ingo Molnar
@ 2007-03-01 11:47                                           ` Ingo Molnar
  2007-03-01 12:10                                             ` Evgeniy Polyakov
  2007-03-01 12:01                                           ` Evgeniy Polyakov
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 11:47 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Ingo Molnar <mingo@elte.hu> wrote:

> > I also changed the client socket to nonblocking mode, with the same 
> > result in the epoll server. If you find it broken, please send me a 
> > corrected version to test too.
> 
> this line in evserver_kevent.c looks a bit fishy:

this one in evserver_kevent.c looks problematic too:

        shutdown(s, SHUT_RDWR);
        close(s);

as evserver_epoll.c only does:

        close(s);

again, there might be TCP control flow differences due to this. [ Or the 
removal of this shutdown() call might be a small speedup for the kevent 
case ;) ]

Also, the order of fd and socket close() is different in the two cases. 
It shouldn't make any difference - but that too just makes the results 
harder to trust. Would it be so hard to introduce a single 
handle_web_request() function that is exactly the same in the two tests? 
All the queueing details (which are of course different in the epoll and 
the kevent case) should be in the client function, which calls 
handle_web_request().
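
As a hedged illustration of that suggestion (the function name matches Ingo's proposal, but the reply body and the single-read assumption are invented here, not taken from the actual evserver sources), the shared handler could look like:

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical shared handler: both evserver_epoll.c and evserver_kevent.c
 * would call this, so the only remaining difference between the two
 * binaries is the event-queueing syscall itself. */
static const char reply[] =
    "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok";

int handle_web_request(int s)
{
    char buf[4096];                              /* same size in both tests */
    ssize_t n = recv(s, buf, sizeof(buf), 0);
    if (n <= 0)
        return -1;
    if (send(s, reply, sizeof(reply) - 1, 0) < 0)
        return -1;
    close(s);                                    /* same teardown order too */
    return 0;
}
```

With this, any throughput difference between the two servers can only come from the queueing path, not from the per-request syscall sequence.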

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 11:28                                         ` Eric Dumazet
@ 2007-03-01 11:47                                           ` Evgeniy Polyakov
  2007-03-01 13:12                                             ` Eric Dumazet
  2007-03-01 12:19                                           ` Evgeniy Polyakov
  1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 11:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 12:28:00PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Thursday 01 March 2007 12:20, Evgeniy Polyakov wrote:
> > On Thu, Mar 01, 2007 at 12:14:44PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > > On Thursday 01 March 2007 11:59, Evgeniy Polyakov wrote:
> > > > Yes, it is about 98-100% in both cases.
> > > > I've just re-run tests on my amd64 test machine without debug options:
> > > >
> > > > epoll		4794.23
> > > > kevent		6468.95
> > >
> > > It would be valuable if you could post oprofile results
> > > (CPU_CLK_UNHALTED) for both tests.
> >
> > I can't - oprofile does not work on this x86_64 machine:
> >
> 
> Yes, this is a known problem, but you can make it work, as I did.
> 
> Please :)

I can not resist :)

> I used the CVS version of oprofile plus a patch you can find in the mailing 
> list archives. Don't remember exactly, since I hit this some months ago

Could you provide at least a remote hint of how to find it?

I only found the same problem at 
http://lkml.org/lkml/2006/10/27/3

but without any hints to solve the problem.

I will try CVS oprofile, if it works I will provide details of course.

My tree is based on rc1 and has this latest commit:
commit b5bf28cde894b3bb3bd25c13a7647020562f9ea0
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date:   Wed Feb 21 11:21:44 2007 -0800

There are no commits after that date with the word 'oprofile' in
git-whatchanged, at least.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 11:16                                       ` Evgeniy Polyakov
  2007-03-01 11:27                                         ` Ingo Molnar
@ 2007-03-01 11:41                                         ` Ingo Molnar
  2007-03-01 11:47                                           ` Ingo Molnar
  2007-03-01 12:01                                           ` Evgeniy Polyakov
  1 sibling, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 11:41 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> I also changed the client socket to nonblocking mode, with the same 
> result in the epoll server. If you find it broken, please send me a 
> corrected version to test too.

this line in evserver_kevent.c looks a bit fishy:

        err = recv(s, buf, 100, 0);

because on the evserver_epoll.c side the following is done:

        err = recv(s, buf, 4096, 0);

now, for 'ab', the request size is 76 bytes, so it should fit fine 
functionality-wise. But the TCP stack might decide differently about 
whether to return with a partial packet depending on how much data is 
requested. I don't know whether it actually makes a difference in the 
TCP flow decisions, or whether it makes a performance difference in your 
test, but the safest would be to use 4096 in both cases.
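
The safest variant goes one step further than a shared 4096-byte size: loop on recv() until the header terminator arrives, so neither server depends on how the stack chunks the 76-byte request. A sketch (the function name is an assumption, not from the posted servers):

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/* Read until the HTTP header terminator "\r\n\r\n" or the buffer fills.
 * Returns bytes read so far, 0 on EOF before a full header, -1 on error;
 * EAGAIN on a nonblocking socket just returns what we have so far. */
ssize_t read_http_request(int s, char *buf, size_t len)
{
    size_t have = 0;
    while (have < len - 1) {
        ssize_t n = recv(s, buf + have, len - 1 - have, 0);
        if (n < 0)
            return (errno == EAGAIN || errno == EWOULDBLOCK)
                   ? (ssize_t)have : -1;
        if (n == 0)
            return 0;                  /* peer closed before full header */
        have += (size_t)n;
        buf[have] = '\0';
        if (strstr(buf, "\r\n\r\n"))   /* complete request header */
            break;
    }
    return (ssize_t)have;
}
```

Calling this from both servers would make the benchmark insensitive to partial-packet delivery differences.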

in general, please make sure the exact same system calls are done in the 
client function. (except of course for the event queueing syscalls 
themselves)

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 11:27                                         ` Ingo Molnar
@ 2007-03-01 11:36                                           ` Evgeniy Polyakov
  0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 11:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 12:27:00PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > I've uploaded them to:
> > 
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> 
> thanks.
> 
> > I also changed the client socket to nonblocking mode, with the same 
> > result in the epoll server. [...]
> 
> what does this mean exactly? Did you change this line in 
> evserver_epoll.c:
> 
>         //fcntl(cs, F_SETFL, O_NONBLOCK);
> 
> to:
> 
>         fcntl(cs, F_SETFL, O_NONBLOCK);
> 
> and the result was the same?

Yep.

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 11:20                                       ` Evgeniy Polyakov
@ 2007-03-01 11:28                                         ` Eric Dumazet
  2007-03-01 11:47                                           ` Evgeniy Polyakov
  2007-03-01 12:19                                           ` Evgeniy Polyakov
  0 siblings, 2 replies; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 11:28 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thursday 01 March 2007 12:20, Evgeniy Polyakov wrote:
> On Thu, Mar 01, 2007 at 12:14:44PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > On Thursday 01 March 2007 11:59, Evgeniy Polyakov wrote:
> > > Yes, it is about 98-100% in both cases.
> > > I've just re-run tests on my amd64 test machine without debug options:
> > >
> > > epoll		4794.23
> > > kevent		6468.95
> >
> > It would be valuable if you could post oprofile results
> > (CPU_CLK_UNHALTED) for both tests.
>
> I can't - oprofile does not work on this x86_64 machine:
>

Yes, this is a known problem, but you can make it work, as I did.

Please :)

I used the CVS version of oprofile plus a patch you can find in the mailing 
list archives. Don't remember exactly, since I hit this some months ago

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 11:16                                       ` Evgeniy Polyakov
@ 2007-03-01 11:27                                         ` Ingo Molnar
  2007-03-01 11:36                                           ` Evgeniy Polyakov
  2007-03-01 11:41                                         ` Ingo Molnar
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 11:27 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> I've uploaded them to:
> 
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c

thanks.

> I also changed the client socket to nonblocking mode, with the same 
> result in the epoll server. [...]

what does this mean exactly? Did you change this line in 
evserver_epoll.c:

        //fcntl(cs, F_SETFL, O_NONBLOCK);

to:

        fcntl(cs, F_SETFL, O_NONBLOCK);

and the result was the same?
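
One nit on that line: F_SETFL with a bare O_NONBLOCK replaces all file status flags. The usual idiom reads the current flags first - a small sketch (the helper name is invented):

```c
#include <fcntl.h>

/* Set O_NONBLOCK without clobbering other status flags
 * (a bare F_SETFL with only O_NONBLOCK would drop e.g. O_APPEND). */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```

For this benchmark the difference is unlikely to matter, but it keeps the two servers' socket state identical.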

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 11:14                                     ` Eric Dumazet
@ 2007-03-01 11:20                                       ` Evgeniy Polyakov
  2007-03-01 11:28                                         ` Eric Dumazet
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 11:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 12:14:44PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Thursday 01 March 2007 11:59, Evgeniy Polyakov wrote:
> 
> > Yes, it is about 98-100% in both cases.
> > I've just re-run tests on my amd64 test machine without debug options:
> >
> > epoll		4794.23
> > kevent		6468.95
> >
> 
> It would be valuable if you could post oprofile results (CPU_CLK_UNHALTED) for 
> both tests.

I can't - oprofile does not work on this x86_64 machine:

#opcontrol --setup --vmlinux=/home/s0mbre/aWork/git/linux-2.6.kevent/vmlinux
# opcontrol --start
Using default event: CPU_CLK_UNHALTED:100000:0:1:1
/usr/bin/opcontrol: line 994: /dev/oprofile/0/enabled: No such file or
directory
/usr/bin/opcontrol: line 994: /dev/oprofile/0/event: No such file or
directory
/usr/bin/opcontrol: line 994: /dev/oprofile/0/count: No such file or
directory
/usr/bin/opcontrol: line 994: /dev/oprofile/0/kernel: No such file or
directory
/usr/bin/opcontrol: line 994: /dev/oprofile/0/user: No such file or
directory
/usr/bin/opcontrol: line 994: /dev/oprofile/0/unit_mask: No such file or
directory

# ls -l /dev/oprofile/
total 0
drwxr-xr-x 1 root root 0 2007-03-01 09:41 1
drwxr-xr-x 1 root root 0 2007-03-01 09:41 2
drwxr-xr-x 1 root root 0 2007-03-01 09:41 3
-rw-r--r-- 1 root root 0 2007-03-01 09:41 backtrace_depth
-rw-r--r-- 1 root root 0 2007-03-01 09:41 buffer
-rw-r--r-- 1 root root 0 2007-03-01 09:41 buffer_size
-rw-r--r-- 1 root root 0 2007-03-01 09:41 buffer_watershed
-rw-r--r-- 1 root root 0 2007-03-01 09:41 cpu_buffer_size
-rw-r--r-- 1 root root 0 2007-03-01 09:41 cpu_type
-rw-rw-rw- 1 root root 0 2007-03-01 09:41 dump
-rw-r--r-- 1 root root 0 2007-03-01 09:41 enable
-rw-r--r-- 1 root root 0 2007-03-01 09:41 pointer_size
drwxr-xr-x 1 root root 0 2007-03-01 09:41 stats

> Thank you

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 10:11                                 ` Pavel Machek
  2007-03-01 10:19                                   ` Ingo Molnar
@ 2007-03-01 11:18                                   ` Evgeniy Polyakov
  2007-03-02 10:27                                     ` Pavel Machek
  1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 11:18 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 11:11:02AM +0100, Pavel Machek (pavel@ucw.cz) wrote:
> > > > > 10% gain in speed is NOT worth major complexity increase.
> > > > 
> > > > Should I create a patch to remove rb-tree implementation?
> > > 
> > > If you can replace them with something simpler, and no worse than 10%
> > > slower in the worst case, then go ahead. (We actually tried to do that at
> > > some point, only to realize that efence stresses the vm subsystem in a very
> > > unexpected/unfriendly way).
> > 
> > Agh, only 10% in the worst case.
> > I think you cannot even imagine what tricks the network stack uses to
> > get at least an additional 1% out of the box.
> 
> Yep? Feel free to rewrite networking in assembly on Eugenix. That
> should get you 1% improvement. If you reserve a few registers to be only
> used by the kernel (not allowed to userspace), you can speed up networking
> 5%, too. Ouch, and you could turn off the MMU - that is a sure way to get a
> few more percent improvement in your networking case.

It is not _my_ networking, but the one you use every day in every Linux
box. Notice which tricks are used to remove a single byte from sk_buff.
It is called optimization, and if it gives us even a single plus, it must
be implemented. Not all people have a magical fear of new things.

> > Using such logic you can just abandon any further development, since it
> > works as-is right now.
> 
> Stop trying to pervert my logic.

Ugh? :)
I am just restating, in simple words, your 'we do not need something
that adds 10% but is complex to understand'.

> > > > That practice is stupid IMO.
> > > 
> > > Too bad. Now you can start Linux fork called Eugenix.
> > > 
> > > (But really, Linux is not "maximum performance at any cost". Linux is
> > > "how fast can we get that while keeping it maintainable?").
> > 
> > Should I read it like: we do not understand what it is and thus we do
> > not want it?
> 
> Actually, yes, that's a concern. If your code is so crappy that we
> can't understand it, guess what, it is not going to be merged. Notice
> that someone will have to maintain your code if you get hit by bus.
> 
> If your code is so complex that it is almost impossible to use from
> userspace, that is good enough reason not to be merged. "But it is 3%
> faster if..." is not a good-enough argument.

Is it enough for you?

epoll           4794.23 req/sec
kevent          6468.95 req/sec

And we are not even talking about other kevent features, like the
ability to deliver essentially any event through its queue or shared
ring (and some of its ideas are slowly being implemented in the
syslet/threadlet code, btw).

Even if kevent were only as fast as epoll, it allows working with any
kind of event (signals, timers, aio completions, io events and any
others you like) with one queue/ring, which removes races and does
_simplify_ development, since there is no need to create different
models to handle different events.

> > > That is why, while arguing syslets vs. kevents, you need to argue
> > > not "kevents are faster because they avoid context switch overhead",
> > > but "kevents are _so much_ faster that it is worth the added
> > > complexity". And Ingo seems to be showing you they are not _so much_
> > > faster.
> > 
> > Threadlets behave much worse without an event-driven model, and events
> > can behave worse without backing threads; they are mutually compensating.
> 
> I think Ingo demonstrated unoptimized threadlets to be within 5% of
> the speed of kevent. Demonstrate that kevents are twice as fast as
> syslets on a reasonable test case, and I guess we'll listen...

That was compared to epoll, not kevent.

But I repeat again - kevent is not only epoll; it can do a lot of other
things which improve performance and simplify development - did you see
the terrible hacks in libevent to handle signals without a race in the
polling loop? That is not needed anymore at all - one event loop, one
event structure, a completely unified interface for all operations.
Some kevent features are slowly being implemented in the syslet/threadlet
async code too, and it looks like I see where things will end up :), but
likely I do not care about a new 'kevent'; I just wish that had been said
half a year ago, when I started resending it again, but Ingo has already
said his definitive word :)

> 								Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 11:00                                     ` Ingo Molnar
@ 2007-03-01 11:16                                       ` Evgeniy Polyakov
  2007-03-01 11:27                                         ` Ingo Molnar
  2007-03-01 11:41                                         ` Ingo Molnar
  0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 11:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 12:00:22PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > I've just re-run tests on my amd64 test machine without debug options:
> > 
> > epoll        4794.23
> > kevent       6468.95
> 
> could you please post the two URLs for the exact evserver code used for 
> these measurements? (even if you did so already in the past - best to 
> have them always together with the numbers) Thanks!

I've uploaded them to:

http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c

I also changed the client socket to nonblocking mode, with the same result
in the epoll server. If you find it broken, please send me a corrected
version to test too.

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 10:59                                   ` Evgeniy Polyakov
  2007-03-01 11:00                                     ` Ingo Molnar
@ 2007-03-01 11:14                                     ` Eric Dumazet
  2007-03-01 11:20                                       ` Evgeniy Polyakov
  2007-03-01 12:34                                     ` Ingo Molnar
  2007-03-01 16:56                                     ` David Lang
  3 siblings, 1 reply; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 11:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thursday 01 March 2007 11:59, Evgeniy Polyakov wrote:

> Yes, it is about 98-100% in both cases.
> I've just re-run tests on my amd64 test machine without debug options:
>
> epoll		4794.23
> kevent		6468.95
>

It would be valuable if you could post oprofile results (CPU_CLK_UNHALTED) for 
both tests.

Thank you

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 10:59                                   ` Evgeniy Polyakov
@ 2007-03-01 11:00                                     ` Ingo Molnar
  2007-03-01 11:16                                       ` Evgeniy Polyakov
  2007-03-01 11:14                                     ` Eric Dumazet
                                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 11:00 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> I've just re-run tests on my amd64 test machine without debug options:
> 
> epoll        4794.23
> kevent       6468.95

could you please post the two URLs for the exact evserver code used for 
these measurements? (even if you did so already in the past - best to 
have them always together with the numbers) Thanks!

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01  9:54                                 ` Ingo Molnar
@ 2007-03-01 10:59                                   ` Evgeniy Polyakov
  2007-03-01 11:00                                     ` Ingo Molnar
                                                       ` (3 more replies)
  2007-03-01 19:19                                   ` Davide Libenzi
  1 sibling, 4 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 10:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 10:54:02AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > I posted kevent/epoll benchmarks and related design issues too many 
> > times both with handmade applications (which might be broken as hell) 
> > and popular open-source servers to repeat them again.
> 
> numbers are crucial here - and given the epoll bugs in the evserver code 
> that we found, do you have updated evserver benchmark results that 
> compare epoll to kevent? I'm wondering why epoll has half the speed of 
> kevent in those measurements - i suspect some possible benchmarking bug. 
> The queueing model of epoll and kevent is roughly comparable, both do 
> only a constant number of steps to serve one particular request, 
> regardless of how many pending connections/requests there are. What is 
> the CPU utilization of the server system during an epoll test, and what 
> is the CPU utilization during a kevent test? 100% utilized in both 
> cases?

Yes, it is about 98-100% in both cases.
I've just re-run tests on my amd64 test machine without debug options:

epoll		4794.23
kevent		6468.95

here are full client 'ab' outputs for the epoll and kevent servers (epoll 
does not contain EPOLLET as you requested, but that does not look like 
it changes performance in my case).

epoll ab output:
# ab -c8000 -n80000 http://192.168.0.48/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.0.48 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests


Server Software:        Apache/1.3.27
Server Hostname:        192.168.0.48
Server Port:            80

Document Path:          /
Document Length:        3521 bytes

Concurrency Level:      8000
Time taken for tests:   16.686737 seconds
Complete requests:      80000
Failed requests:        0
Write errors:           0
Total transferred:      309760000 bytes
HTML transferred:       281680000 bytes
Requests per second:    4794.23 [#/sec] (mean)
Time per request:       1668.674 [ms] (mean)
Time per request:       0.209 [ms] (mean, across all concurrent
requests)
Transfer rate:          18128.17 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      159  779 110.1    799     921
Processing:   468  866  77.4    869     988
Waiting:       63  426 212.3    425     921
Total:       1145 1646 115.6   1660    1873

Percentage of the requests served within a certain time (ms)
50%   1660
66%   1661
75%   1662
80%   1663
90%   1806
95%   1830
98%   1833
99%   1834
100%   1873 (longest request)

kevent ab output:
# ab -c8000 -n80000 http://192.168.0.48/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.0.48 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests


Server Software:        Apache/1.3.27
Server Hostname:        192.168.0.48
Server Port:            80

Document Path:          /
Document Length:        3521 bytes

Concurrency Level:      8000
Time taken for tests:   12.366775 seconds
Complete requests:      80000
Failed requests:        0
Write errors:           0
Total transferred:      317047104 bytes
HTML transferred:       288306522 bytes
Requests per second:    6468.95 [#/sec] (mean)
Time per request:       1236.677 [ms] (mean)
Time per request:       0.155 [ms] (mean, across all concurrent
requests)
Transfer rate:          25036.12 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      130  364 871.1    275    9347
Processing:   178  298  42.5    296     580
Waiting:       31  202  65.8    210     369
Total:        411  663 887.0    572    9722

Percentage of the requests served within a certain time (ms)
50%    572
66%    573
75%    618
80%    640
90%    684
95%    709
98%    721
99%   3455
100%   9722 (longest request)

Notice how the percentage of requests served within a certain time
differs between kevent and epoll. And this server does not include the
ready-on-submission kevent optimization.

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 14:15 Al Boldi
  2007-02-27 19:22 ` Theodore Tso
@ 2007-03-01 10:21 ` Pavel Machek
  1 sibling, 0 replies; 277+ messages in thread
From: Pavel Machek @ 2007-03-01 10:21 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel

Hi!

> > But it is true that for most kernel programmers, threaded
> > programming is much easier to understand, and we need to engineer the
> > kernel for what will be maintainable for the majority of the kernel
> > development community.
> 
> What's probably true is that, for a kernel to stay competitive, you need 
> two distinct traits:
> 
> 1. Stability
> 2. Performance
> 
> And you can't get that, by arguing that the kernel development community 
> doesn't have the brains to code for performance, which I dearly doubt.
> 
> So, instead of using intimidating language to force one's opinion thru, 
> especially when it comes from those in control, why not have a democratic 
> vote?

Linus cast his vote, and as that is the only vote that matters, can we
end this discussion? (Eugeny is not willing to listen to Linus, but I
guess that's Eugeny's problem at this point).
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01 10:11                                 ` Pavel Machek
@ 2007-03-01 10:19                                   ` Ingo Molnar
  2007-03-01 11:18                                   ` Evgeniy Polyakov
  1 sibling, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 10:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Evgeniy Polyakov, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Pavel Machek <pavel@ucw.cz> wrote:

> > Threadlets behave much worse without an event driven model, events can 
> > behave worse without backing threads, they are mutually compensating.
> 
> I think Ingo demonstrated unoptimized threadlets to be within 5% of 
> the speed of kevent. [...]

that was epoll not kevent, but yeah. To me the biggest question is, how 
much improvement does kevent bring relative to epoll? Epoll is a pretty 
good event queueing API, and it covers all the event sources today. It 
is also O(1) throughout, so i dont really understand how it could only 
achieve half the speed of kevent in the evserver_epoll/kevent.c 
benchmark. We need to understand that first.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01  9:47                               ` Evgeniy Polyakov
  2007-03-01  9:54                                 ` Ingo Molnar
@ 2007-03-01 10:11                                 ` Pavel Machek
  2007-03-01 10:19                                   ` Ingo Molnar
  2007-03-01 11:18                                   ` Evgeniy Polyakov
  1 sibling, 2 replies; 277+ messages in thread
From: Pavel Machek @ 2007-03-01 10:11 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

Hi!

> > > > 10% gain in speed is NOT worth major complexity increase.
> > > 
> > > Should I create a patch to remove rb-tree implementation?
> > 
> > If you can replace them with something simpler, and no worse than 10%
> > slower in the worst case, then go ahead. (We actually tried to do that
> > at some point, only to realize that efence stresses the vm subsystem
> > in a very unexpected/unfriendly way).
> 
> Agh, only 10% in the worst case.
> I think you can not even imagine what tricks the network stack uses
> to get at least an additional 1% out of the box.

Yep? Feel free to rewrite networking in assembly on Eugenix. That
should get you a 1% improvement. If you reserve a few registers to be
only used by the kernel (not allowed for userspace), you can speed up
networking 5%, too. Oh, and you could turn off the MMU; that is a sure
way to get a few more percent improvement in your networking case.

> Using such logic you can just abandon any further development, since
> it works as is right now.

Stop trying to pervert my logic.

> > > That practice is stupid IMO.
> > 
> > Too bad. Now you can start Linux fork called Eugenix.
> > 
> > (But really, Linux is not "maximum performance at any cost". Linux is
> > "how fast can we get that while keeping it maintainable?").
> 
> Should I read it like: we do not understand what it is and thus we do
> not want it?

Actually, yes, that's a concern. If your code is so crappy that we
can't understand it, guess what, it is not going to be merged. Notice
that someone will have to maintain your code if you get hit by a bus.

If your code is so complex that it is almost impossible to use from
userspace, that is good enough reason not to be merged. "But it is 3%
faster if..." is not a good-enough argument.

> > That is why, while arguing syslets vs. kevents, you need to argue
> > not "kevents are faster because they avoid context switch overhead",
> > but "kevents are _so much_ faster that it is worth the added
> > complexity". And Ingo seems to be showing you they are not _so much_
> > faster.
> 
> Threadlets behave much worse without an event driven model, events can
> behave worse without backing threads, they are mutually compensating.

I think Ingo demonstrated unoptimized threadlets to be within 5% of
the speed of kevent. Demonstrate that kevents are twice as fast as
syslets on a reasonable test case, and I guess we'll listen...
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01  9:47                               ` Evgeniy Polyakov
@ 2007-03-01  9:54                                 ` Ingo Molnar
  2007-03-01 10:59                                   ` Evgeniy Polyakov
  2007-03-01 19:19                                   ` Davide Libenzi
  2007-03-01 10:11                                 ` Pavel Machek
  1 sibling, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01  9:54 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> I posted kevent/epoll benchmarks and related design issues too many 
> times both with handmade applications (which might be broken as hell) 
> and popular open-source servers to repeat them again.

numbers are crucial here - and given the epoll bugs in the evserver code 
that we found, do you have updated evserver benchmark results that 
compare epoll to kevent? I'm wondering why epoll has half the speed of 
kevent in those measurements - i suspect some possible benchmarking bug. 
The queueing model of epoll and kevent is roughly comparable, both do 
only a constant number of steps to serve one particular request, 
regardless of how many pending connections/requests there are. What is 
the CPU utilization of the server system during an epoll test, and what 
is the CPU utilization during a kevent test? 100% utilized in both 
cases?

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01  9:26                             ` Pavel Machek
@ 2007-03-01  9:47                               ` Evgeniy Polyakov
  2007-03-01  9:54                                 ` Ingo Molnar
  2007-03-01 10:11                                 ` Pavel Machek
  0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01  9:47 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 10:26:34AM +0100, Pavel Machek (pavel@ucw.cz) wrote:
> > > 10% gain in speed is NOT worth major complexity increase.
> > 
> > Should I create a patch to remove rb-tree implementation?
> 
> If you can replace them with something simpler, and no worse than 10%
> slower in the worst case, then go ahead. (We actually tried to do that
> at some point, only to realize that efence stresses the vm subsystem
> in a very unexpected/unfriendly way).

Agh, only 10% in the worst case.
I think you can not even imagine what tricks the network stack uses to
get at least an additional 1% out of the box.
Using such logic you can just abandon any further development, since it
works as is right now.

> > That practice is stupid IMO.
> 
> Too bad. Now you can start Linux fork called Eugenix.
> 
> (But really, Linux is not "maximum performance at any cost". Linux is
> "how fast can we get that while keeping it maintainable?").

Should I read it like: we do not understand what it is and thus we do
not want it?

> That is why, while arguing syslets vs. kevents, you need to argue
> not "kevents are faster because they avoid context switch overhead",
> but "kevents are _so much_ faster that it is worth the added
> complexity". And Ingo seems to be showing you they are not _so much_
> faster.

Threadlets behave much worse without an event driven model, events can
behave worse without backing threads, they are mutually compensating.

I posted kevent/epoll benchmarks and related design issues too many 
times both with handmade applications (which might be broken as hell)
and popular open-source servers to repeat them again.

> 									Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01  8:38                           ` Evgeniy Polyakov
@ 2007-03-01  9:28                             ` Evgeniy Polyakov
  0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01  9:28 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Chris Friesen, Linus Torvalds, Ingo Molnar, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 11:38:06AM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > struct async_syscall {
> > 	long *result;
> > 	unsigned long asynid;
> > 	unsigned long nr_sysc;
> > 	unsigned long params[8];
> > };
> 
> Having result pointer as NULL might imply that result is not interested
> in and thus it can be discarded and event async syscall will not be
> returned through aync_wait().
> More flexible is having request flags field in the structure.

Ugh, that is already implemented in v5.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01  8:18                           ` Evgeniy Polyakov
@ 2007-03-01  9:26                             ` Pavel Machek
  2007-03-01  9:47                               ` Evgeniy Polyakov
  2007-03-01 19:24                             ` Johann Borck
  1 sibling, 1 reply; 277+ messages in thread
From: Pavel Machek @ 2007-03-01  9:26 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

Hi!

> > > I understand that - and I totally agree.
> > > But when more complex, more bug-prone code results in higher performance
> > > - that must be used. We have linked lists and binary trees - the latter
> > 
> > No-o. Kernel is not designed like that.
> > 
> > Often, more complex and slightly faster code exists, and we simply use
> > slower variant, because it is fast enough.
> > 
> > 10% gain in speed is NOT worth major complexity increase.
> 
> Should I create a patch to remove rb-tree implementation?

If you can replace them with something simpler, and no worse than 10%
slower in the worst case, then go ahead. (We actually tried to do that
at some point, only to realize that efence stresses the vm subsystem in
a very unexpected/unfriendly way).

> That practice is stupid IMO.

Too bad. Now you can start Linux fork called Eugenix.

(But really, Linux is not "maximum performance at any cost". Linux is
"how fast can we get that while keeping it maintainable?").

That is why, while arguing syslets vs. kevents, you need to argue
not "kevents are faster because they avoid context switch overhead",
but "kevents are _so much_ faster that it is worth the added
complexity". And Ingo seems to be showing you they are not _so much_
faster.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-03-01  1:33                   ` Andrea Arcangeli
@ 2007-03-01  9:15                     ` Evgeniy Polyakov
  0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01  9:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Linus Torvalds, Davide Libenzi, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Thu, Mar 01, 2007 at 02:33:01AM +0100, Andrea Arcangeli (andrea@suse.de) wrote:
> On Thu, Mar 01, 2007 at 12:12:28AM +0100, Ingo Molnar wrote:
> > more capable by providing more special system calls like sys_upcall() to 
> > execute a user-space function. (that way a syslet could still execute 
> > user-space code without having to exit out of kernel mode too 
> > frequently) Or perhaps a sys_x86_bytecode() call, that would execute a 
> > pre-verified, kernel-stored sequence of simplified x86 bytecode, using 
> > the kernel stack.
> 
> Which means the userspace code would then run with kernel privilege
> level somehow (after security verifier, whatever). You remember I
> think it's a plain crazy idea...

Syslets/threadlets do not execute userspace code in the kernel - they
behave similarly to threads. sys_upcall() would be a wrapper for quite
complex threadlet machinery.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 19:42                         ` Davide Libenzi
@ 2007-03-01  8:38                           ` Evgeniy Polyakov
  2007-03-01  9:28                             ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01  8:38 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Chris Friesen, Linus Torvalds, Ingo Molnar, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Wed, Feb 28, 2007 at 11:42:24AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Wed, 28 Feb 2007, Chris Friesen wrote:
> 
> > Davide Libenzi wrote:
> > 
> > > struct async_syscall {
> > > 	unsigned long nr_sysc;
> > > 	unsigned long params[8];
> > > 	long *result;
> > > };
> > > 
> > > And what would async_wait() return bak? Pointers to "struct async_syscall"
> > > or pointers to "result"?
> > 
> > Either one has downsides.  Pointer to struct async_syscall requires that the
> > caller keep the struct around.  Pointer to result requires that the caller
> > always reserve a location for the result.
> > 
> > Does the kernel care about the (possibly rare) case of callers that don't want
> > to pay attention to result?  If so, what about adding some kind of
> > caller-specified handle to struct async_syscall, and having async_wait()
> > return the handle?  In the case where the caller does care about the result,
> > the handle could just be the address of result.
> 
> Something like this (with async_wait() returning asynid's)?
> 
> struct async_syscall {
> 	long *result;
> 	unsigned long asynid;
> 	unsigned long nr_sysc;
> 	unsigned long params[8];
> };

Having the result pointer as NULL might imply that the result is not
of interest, and thus it can be discarded and the event for the async
syscall will not be returned through async_wait().
More flexible is having a request flags field in the structure.
 
> - Davide
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 16:14                         ` Pavel Machek
@ 2007-03-01  8:18                           ` Evgeniy Polyakov
  2007-03-01  9:26                             ` Pavel Machek
  2007-03-01 19:24                             ` Johann Borck
  0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01  8:18 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Wed, Feb 28, 2007 at 04:14:14PM +0000, Pavel Machek (pavel@ucw.cz) wrote:
> Hi!
> 
> > > I think what you are not hearing, and what everyone else is saying
> > > (INCLUDING Linus), is that for most programmers, state machines are
> > > much, much harder to program, understand, and debug compared to
> > > multi-threaded code.  You may disagree (were you a MacOS 9 programmer
> > > in another life?), and it may not even be true for you if you happen
> > > to be one of those folks more at home with Scheme continuations, for
> > > example.  But it is true that for most kernel programmers, threaded
> > > programming is much easier to understand, and we need to engineer the
> > > kernel for what will be maintainable for the majority of the kernel
> > > development community.
> > 
> > I understand that - and I totally agree.
> > But when more complex, more bug-prone code results in higher performance
> > - that must be used. We have linked lists and binary trees - the latter
> 
> No-o. Kernel is not designed like that.
> 
> Often, more complex and slightly faster code exists, and we simply use
> slower variant, because it is fast enough.
> 
> 10% gain in speed is NOT worth major complexity increase.

Should I create a patch to remove rb-tree implementation?
That practice is stupid IMO.

> 							Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 23:12                 ` Ingo Molnar
@ 2007-03-01  1:33                   ` Andrea Arcangeli
  2007-03-01  9:15                     ` Evgeniy Polyakov
  2007-03-01 21:27                   ` Linus Torvalds
  1 sibling, 1 reply; 277+ messages in thread
From: Andrea Arcangeli @ 2007-03-01  1:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Davide Libenzi, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, Evgeniy Polyakov,
	David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Thu, Mar 01, 2007 at 12:12:28AM +0100, Ingo Molnar wrote:
> more capable by providing more special system calls like sys_upcall() to 
> execute a user-space function. (that way a syslet could still execute 
> user-space code without having to exit out of kernel mode too 
> frequently) Or perhaps a sys_x86_bytecode() call, that would execute a 
> pre-verified, kernel-stored sequence of simplified x86 bytecode, using 
> the kernel stack.

Which means the userspace code would then run with kernel privilege
level somehow (after security verifier, whatever). You remember I
think it's a plain crazy idea...

I don't want to argue about syslets, threadlets, whatever async or
syscall-merging mechanism here, I'm just focusing on this idea of yours
of running userland code in kernel space somehow (I had hoped you'd
given up on it by now). Fixing the greatest syslets limitation is going
to open a can of worms as far as security is concerned.

The fact that userland code must not run with kernel privilege level,
is the reason why syslets aren't very useful (but again: focusing on
the syslets vs async-syscalls isn't my interest).

Frankly I think this idea of running userland code with kernel
privileges fits in the same category of porting linux to segmentation
to avoid the cost of pagetables to gain some bit of performance
despite losing in many other areas. Nobody in real life will want to
make that trade, for such an incredibly small performance
improvement.

For things that can be frequently combined, it's much simpler and
cheaper to create a "merged" syscall (i.e. sys_spawn =
sys_fork+sys_exec) than to invent a way to upload userland-generated
bytecodes to kernel space to do that.

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 16:42               ` Linus Torvalds
  2007-02-28 17:26                 ` Ingo Molnar
  2007-02-28 18:22                 ` Davide Libenzi
@ 2007-02-28 23:12                 ` Ingo Molnar
  2007-03-01  1:33                   ` Andrea Arcangeli
  2007-03-01 21:27                   ` Linus Torvalds
  2 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-28 23:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So I would repeat my call for getting rid of the atoms, and instead 
> just do a "single submission" at a time. Do the linking by running a 
> threadlet that has user space code (and the stack overhead), which is 
> MUCH more flexible. And do nonlinked single system calls without 
> *either* atoms *or* a user-space stack footprint.

I agree that threadlets are much more flexible - and they might in fact 
win in the long run due to that.

i'll add a one-shot syscall API in v6 and then we'll be able to see them 
side by side. (wanted to do that in v5 but it got delayed by x86_64 
issues, x86_64's entry code is certainly ... tricky wrt. ptregs saving)

wrt. one-shot syscalls, the user-space stack footprint would still 
probably be there, because even async contexts that only do single-shot 
processing need to drop out of kernel mode to handle signals. We could 
probably hack the signal routing code to never deliver to such threads 
(but bounce it over to the head context, which is always available) but 
i think that would be a bit messy. (i dont exclude it though)

I think syslets might also act as a prototyping platform for new system 
calls. If any particular syslet atom string comes up more frequently 
(and we could even automate the profiling of that within the kernel), 
then it's a good candidate for a standalone syscall. Currently we dont 
have such information in any structured way: the connection between 
streams of syscalls done by applications is totally opaque to the 
kernel.

Also, i genuinely believe that to be competitive (performance-wise) with 
fully in-kernel queueing solutions, we need syslets - the syslet NULL 
overhead is 20 cycles (this includes copying, engine overhead, etc.), 
the syscall NULL overhead is 280-300 cycles. It could probably be made 
more capable by providing more special system calls like sys_upcall() to 
execute a user-space function. (that way a syslet could still execute 
user-space code without having to exit out of kernel mode too 
frequently) Or perhaps a sys_x86_bytecode() call, that would execute a 
pre-verified, kernel-stored sequence of simplified x86 bytecode, using 
the kernel stack.

My fear is that if we force all these things over to one-shot syscalls 
or threadlets then this will become another second-tier mechanism. By 
providing syslets we give the message: "sure, come on and play within 
the kernel if you want to, but it's not easy".

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 22:22                       ` Ingo Molnar
@ 2007-02-28 22:47                         ` Davide Libenzi
  0 siblings, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 22:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Wed, 28 Feb 2007, Ingo Molnar wrote:

> > Or with a simple/parallel async submission, coupled with threadlets, 
> > we can cover a pretty broad range of real life use cases?
> 
> sure, if we debate its virtualization driven market penetration via self 
> promoting technologies that also drive customer satisfaction, then we'll 
> be able to increase shareholder value by improving the user experience 
> and we'll also succeed in turning this vision into a supply/demand 
> marketplace. Or not?

Okkey then, I guess it's good to go as is :)



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 21:46                     ` Davide Libenzi
@ 2007-02-28 22:22                       ` Ingo Molnar
  2007-02-28 22:47                         ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-28 22:22 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Davide Libenzi <davidel@xmailserver.org> wrote:

> On Wed, 28 Feb 2007, Ingo Molnar wrote:
> 
> > * Davide Libenzi <davidel@xmailserver.org> wrote:
> > 
> > > Did you hide all the complexity of the userspace atom decoding inside 
> > > another function? :)
> > 
> > no, i made the 64-bit and 32-bit structures layout-compatible. This 
> > makes the 32-bit structure as large as the 64-bit ones, but that's not a 
> > big issue, compared to the simplifications it brings.
> 
> Do you have a new version to review?

yep, i've just released -v5.

> How about this, with async_wait returning asynid's back to a userspace 
> ring buffer?
> 
> struct syslet_utaom {
>         long *result;
>         unsigned long asynid;
>         unsigned long nr_sysc;
>         unsigned long params[8];
> };

we talked about the parameters at length: if they are pointers the 
layout is significantly more flexible and more capable. It's a pretty 
similar argument to the return-pointer thing. For example take a look at 
how the IO syslet atoms in Jens' FIO engine share the same fd. Even if 
there's 20000 of them. And they are fully cacheable in constructed 
state. The same goes for the webserving examples i've got in the 
async-test userspace sample code. I can pick up a cached request and 
only update req->fd, i dont have to reinit the atoms at all. It stays 
nicely in the cache, is not re-dirtied, etc.

furthermore, having the parameters as pointers is also an optimization: 
look at the copy_uatom() x86 assembly code i did - it can do a simple 
jump out of the parameter fetching code. I actually tried /both/ of 
these variants in assembly (as i mentioned it in a previous reply, in 
the v1 thread) and the speed difference between a pointer and 
non-pointer variant was negligible. (even with 6 parameters filled in)

but yes ... another two more small changes and your layout will be 
awfully similar to the current uatom layout =B-)

> My problem with the syslets in their current form is, do we have a 
> real use for them that justifies the extra complexity inside the kernel?

i call bullshit. really. I have just gone out and wasted some time 
cutting & pasting all the syslet engine code: it is 153 lines total, 
plus 51 lines of comments. The total patchset in comparison is:

 35 files changed, 1890 insertions(+), 71 deletions(-)

(and this over-estimates it because if this got removed then we'd still 
have to add an async execution syscall.) And the code is pretty compact 
and self-contained. Threadlets share much of the infrastructure with 
syslets: for example the completion ring code is _100%_ shared, the 
async execution code is 98% shared.

You are free to not like it though, and i'm willing to change any aspect 
of the API to make it more intuitive and more useful, but calling it 
'complexity' at this point is just handwaving. And believe it or not, a 
good number of people actually find syslets pretty cool.

> Or with a simple/parallel async submission, coupled with threadlets, 
> we can cover a pretty broad range of real life use cases?

sure, if we debate its virtualization driven market penetration via self 
promoting technologies that also drive customer satisfaction, then we'll 
be able to increase shareholder value by improving the user experience 
and we'll also succeed in turning this vision into a supply/demand 
marketplace. Or not?

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 21:23                   ` Ingo Molnar
@ 2007-02-28 21:46                     ` Davide Libenzi
  2007-02-28 22:22                       ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 21:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Wed, 28 Feb 2007, Ingo Molnar wrote:

> * Davide Libenzi <davidel@xmailserver.org> wrote:
> 
> > Did you hide all the complexity of the userspace atom decoding inside 
> > another function? :)
> 
> no, i made the 64-bit and 32-bit structures layout-compatible. This 
> makes the 32-bit structure as large as the 64-bit ones, but that's not a 
> big issue, compared to the simplifications it brings.

Do you have a new version to review?



> > > But i'm happy to change the syslet API in any sane way, and did so 
> > > based on feedback from Jens who is actually using them.
> > 
> > Wouldn't you agree on a simple/parallel execution engine [...]
> 
> the thing is, there's almost zero overhead from having those basic 
> things like conditions and the ->next link, and they make it so much 
> more capable. As usual my biggest problem is that you are not trying to 
> use syslets at all - you are only trying to get rid of them ;-) My 
> purpose with syslets is to enable a syslet to do almost anything that 
> user-space could do too, as simply as possible. Syslets could even 
> allocate user-space memory and then use it (i dont think we actually 
> want to do that though). That doesnt mean arbitrary complex code 
> /should/ be done via syslets, or that it wont be significantly slower 
> than what user-space can do, but i'd not like to artificially dumb the 
> engine down. I'm totally willing to simplify/shrink the vectoring of 
> arguments and just about anything else, but your proposals so far (such 
> as your return-value-embedded-in-atom suggestion) all kill important 
> aspects of the engine.

Ok, we're past the error code in the atom, as Linus pointed out ;)
How about this, with async_wait returning asynids back to a userspace 
ring buffer?

struct syslet_uatom {
        long *result;
        unsigned long asynid;
        unsigned long nr_sysc;
        unsigned long params[8];
};
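The completion side of that proposal can be sketched in plain userspace C. None of this API exists; the ring below just simulates what async_wait() would do, with the kernel side pushing asynids of finished requests and the application draining them:

```c
#include <assert.h>

/* Hypothetical completion ring: the kernel side would push the asynids
 * of finished requests; the application drains them via async_wait().
 * Both sides are simulated in userspace here. */
#define RING_SIZE 16  /* power of two so masking replaces modulo */

struct completion_ring {
	unsigned long slots[RING_SIZE];
	unsigned int head;	/* next slot the producer writes */
	unsigned int tail;	/* next slot the consumer reads */
};

/* Producer side: record that request `asynid` completed. */
static int ring_push(struct completion_ring *r, unsigned long asynid)
{
	if (r->head - r->tail == RING_SIZE)
		return -1;	/* ring full */
	r->slots[r->head++ & (RING_SIZE - 1)] = asynid;
	return 0;
}

/* Consumer side: what a non-blocking async_wait() would hand back. */
static int ring_pop(struct completion_ring *r, unsigned long *asynid)
{
	if (r->head == r->tail)
		return -1;	/* nothing completed yet */
	*asynid = r->slots[r->tail++ & (RING_SIZE - 1)];
	return 0;
}
```

The application maps each returned asynid back to its own request bookkeeping, so the kernel never has to keep the submission record alive after completion.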

My problem with the syslets in their current form is, do we have a real 
use for them that justifies the extra complexity inside the kernel? Or, 
with a simple/parallel async submission coupled with threadlets, can we 
cover a pretty broad range of real-life use cases?



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 21:09                 ` Davide Libenzi
@ 2007-02-28 21:23                   ` Ingo Molnar
  2007-02-28 21:46                     ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-28 21:23 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Davide Libenzi <davidel@xmailserver.org> wrote:

> On Wed, 28 Feb 2007, Ingo Molnar wrote:
> 
> > 
> > * Davide Libenzi <davidel@xmailserver.org> wrote:
> > 
> > > My point is, the syslet infrastructure is expensive for the kernel in 
> > > terms of compat, [...]
> > 
> > it is not. Today i've implemented 64-bit syslets on x86_64 and 
> > 32-bit-on-64-bit compat syslets. Both the 64-bit and the 32-bit syslet 
> > (and threadlet) binaries work just fine on a 64-bit kernel, and they 
> > share 99% of the infrastructure. There's only a single #ifdef 
> > CONFIG_COMPAT in kernel/async.c:
> > 
> > #ifdef CONFIG_COMPAT
> > 
> > asmlinkage struct syslet_uatom __user *
> > compat_sys_async_exec(struct syslet_uatom __user *uatom,
> >                       struct async_head_user __user *ahu)
> > {
> >         return __sys_async_exec(uatom, ahu, &compat_sys_call_table,
> >                                 compat_NR_syscalls);
> > }
> > 
> > #endif
> 
> Did you hide all the complexity of the userspace atom decoding inside 
> another function? :)

no, i made the 64-bit and 32-bit structures layout-compatible. This 
makes the 32-bit structure as large as the 64-bit ones, but that's not a 
big issue, compared to the simplifications it brings.
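The layout-compatibility trick can be illustrated with a minimal sketch (field names are hypothetical, not the actual syslet structure): make every member a fixed-width 64-bit integer, so the structure is byte-for-byte identical whether 32-bit or 64-bit userspace built it, with pointers stored zero-extended:

```c
#include <assert.h>
#include <stdint.h>

/* Layout-compatible atom sketch: every member is a fixed-width u64, so
 * the structure has the same size and offsets for 32-bit and 64-bit
 * userspace, and the kernel needs no per-field compat translation. */
struct uatom_compat {
	uint64_t nr;		/* syscall number */
	uint64_t arg_ptr;	/* user pointer, zero-extended on 32-bit */
	uint64_t ret_ptr;	/* where the result gets stored */
	uint64_t next;		/* pointer to the next atom, or 0 */
};

/* Store a native pointer into a u64 slot (zero-extends on 32-bit). */
static void uatom_set_ptr(uint64_t *slot, void *p)
{
	*slot = (uint64_t)(uintptr_t)p;
}
```

The price is exactly what Ingo notes: the 32-bit structure grows to the 64-bit size, in exchange for collapsing the compat path to a single #ifdef.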

> > But i'm happy to change the syslet API in any sane way, and did so 
> > based on feedback from Jens who is actually using them.
> 
> Wouldn't you agree on a simple/parallel execution engine [...]

the thing is, there's almost zero overhead from having those basic 
things like conditions and the ->next link, and they make it so much 
more capable. As usual my biggest problem is that you are not trying to 
use syslets at all - you are only trying to get rid of them ;-) My 
purpose with syslets is to enable a syslet to do almost anything that 
user-space could do too, as simply as possible. Syslets could even 
allocate user-space memory and then use it (i dont think we actually 
want to do that though). That doesnt mean arbitrary complex code 
/should/ be done via syslets, or that it wont be significantly slower 
than what user-space can do, but i'd not like to artificially dumb the 
engine down. I'm totally willing to simplify/shrink the vectoring of 
arguments and just about anything else, but your proposals so far (such 
as your return-value-embedded-in-atom suggestion) all kill important 
aspects of the engine.

All the existing syslet features were purpose-driven: i actually coded 
up a sample syslet, trying to do something that makes sense, and added 
these features based on that. The engine core takes up maybe 50 lines of 
code.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 20:21               ` Ingo Molnar
@ 2007-02-28 21:09                 ` Davide Libenzi
  2007-02-28 21:23                   ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 21:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Wed, 28 Feb 2007, Ingo Molnar wrote:

> 
> * Davide Libenzi <davidel@xmailserver.org> wrote:
> 
> > My point is, the syslet infrastructure is expensive for the kernel in 
> > terms of compat, [...]
> 
> it is not. Today i've implemented 64-bit syslets on x86_64 and 
> 32-bit-on-64-bit compat syslets. Both the 64-bit and the 32-bit syslet 
> (and threadlet) binaries work just fine on a 64-bit kernel, and they 
> share 99% of the infrastructure. There's only a single #ifdef 
> CONFIG_COMPAT in kernel/async.c:
> 
> #ifdef CONFIG_COMPAT
> 
> asmlinkage struct syslet_uatom __user *
> compat_sys_async_exec(struct syslet_uatom __user *uatom,
>                       struct async_head_user __user *ahu)
> {
>         return __sys_async_exec(uatom, ahu, &compat_sys_call_table,
>                                 compat_NR_syscalls);
> }
> 
> #endif

Did you hide all the complexity of the userspace atom decoding inside 
another function? :)
How much code would go away, in case we pick a simple/parallel 
sys_async_exec engine? Atom decoding, special userspace variable access 
for loops, the jumps/cond/... VM engine.



> Even mixed-mode syslets should work (although i havent specifically 
> tested them), where the head switches between 64-bit and 32-bit mode and 
> submits syslets from both 64-bit and from 32-bit mode, and at the same 
> time there might be both 64-bit and 32-bit syslets 'in flight'.
> 
> But i'm happy to change the syslet API in any sane way, and did so based 
> on feedback from Jens who is actually using them.

Wouldn't you agree on a simple/parallel execution engine like me and Linus 
are proposing (and threadlets, of course)?



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 16:17             ` Davide Libenzi
  2007-02-28 16:42               ` Linus Torvalds
@ 2007-02-28 20:21               ` Ingo Molnar
  2007-02-28 21:09                 ` Davide Libenzi
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-28 20:21 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Davide Libenzi <davidel@xmailserver.org> wrote:

> My point is, the syslet infrastructure is expensive for the kernel in 
> terms of compat, [...]

it is not. Today i've implemented 64-bit syslets on x86_64 and 
32-bit-on-64-bit compat syslets. Both the 64-bit and the 32-bit syslet 
(and threadlet) binaries work just fine on a 64-bit kernel, and they 
share 99% of the infrastructure. There's only a single #ifdef 
CONFIG_COMPAT in kernel/async.c:

#ifdef CONFIG_COMPAT

asmlinkage struct syslet_uatom __user *
compat_sys_async_exec(struct syslet_uatom __user *uatom,
                      struct async_head_user __user *ahu)
{
        return __sys_async_exec(uatom, ahu, &compat_sys_call_table,
                                compat_NR_syscalls);
}

#endif

Even mixed-mode syslets should work (although i havent specifically 
tested them), where the head switches between 64-bit and 32-bit mode and 
submits syslets from both 64-bit and from 32-bit mode, and at the same 
time there might be both 64-bit and 32-bit syslets 'in flight'.

But i'm happy to change the syslet API in any sane way, and did so based 
on feedback from Jens who is actually using them.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 19:03                       ` Chris Friesen
@ 2007-02-28 19:42                         ` Davide Libenzi
  2007-03-01  8:38                           ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 19:42 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Linus Torvalds, Ingo Molnar, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, Evgeniy Polyakov,
	David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Wed, 28 Feb 2007, Chris Friesen wrote:

> Davide Libenzi wrote:
> 
> > struct async_syscall {
> > 	unsigned long nr_sysc;
> > 	unsigned long params[8];
> > 	long *result;
> > };
> > 
> > And what would async_wait() return back? Pointers to "struct async_syscall"
> > or pointers to "result"?
> 
> Either one has downsides.  Pointer to struct async_syscall requires that the
> caller keep the struct around.  Pointer to result requires that the caller
> always reserve a location for the result.
> 
> Does the kernel care about the (possibly rare) case of callers that don't want
> to pay attention to result?  If so, what about adding some kind of
> caller-specified handle to struct async_syscall, and having async_wait()
> return the handle?  In the case where the caller does care about the result,
> the handle could just be the address of result.

Something like this (with async_wait() returning asynids)?

struct async_syscall {
	long *result;
	unsigned long asynid;
	unsigned long nr_sysc;
	unsigned long params[8];
};



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 18:50                     ` Davide Libenzi
@ 2007-02-28 19:03                       ` Chris Friesen
  2007-02-28 19:42                         ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Chris Friesen @ 2007-02-28 19:03 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linus Torvalds, Ingo Molnar, Ulrich Drepper,
	Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, Evgeniy Polyakov,
	David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

Davide Libenzi wrote:

> struct async_syscall {
> 	unsigned long nr_sysc;
> 	unsigned long params[8];
> 	long *result;
> };
> 
> And what would async_wait() return back? Pointers to "struct async_syscall"
> or pointers to "result"?

Either one has downsides.  Pointer to struct async_syscall requires that 
the caller keep the struct around.  Pointer to result requires that the 
caller always reserve a location for the result.

Does the kernel care about the (possibly rare) case of callers that 
don't want to pay attention to result?  If so, what about adding some 
kind of caller-specified handle to struct async_syscall, and having 
async_wait() return the handle?  In the case where the caller does care 
about the result, the handle could just be the address of result.

Chris




^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 18:42                   ` Linus Torvalds
@ 2007-02-28 18:50                     ` Davide Libenzi
  2007-02-28 19:03                       ` Chris Friesen
  0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 18:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Wed, 28 Feb 2007, Linus Torvalds wrote:

> On Wed, 28 Feb 2007, Davide Libenzi wrote:
> > 
> > Here we very much agree. The way I'd like it:
> > 
> > struct async_syscall {
> > 	unsigned long nr_sysc;
> > 	unsigned long params[8];
> > 	long result;
> > };
> 
> No, the "result" needs to go somewhere else. The caller may be totally 
> uninterested in keeping the system call number or parameters around until 
> the operation completes, but if you put them in the same structure with 
> the result, you obviously cannot sanely get rid of them.
> 
> I also don't much like read-write interfaces (which the above would be: 
> the kernel would read most of the structure, and then write one member of 
> the structure). 
> 
> It's entirely possible, for example, that the operation we submit is some 
> legacy "aio_read()", which has soem other structure layout than the new 
> one (but one field will be the result code).

Ok, makes sense. Something like this then?

struct async_syscall {
	unsigned long nr_sysc;
	unsigned long params[8];
	long *result;
};

And what would async_wait() return back? Pointers to "struct async_syscall"
or pointers to "result"?



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 18:22                 ` Davide Libenzi
@ 2007-02-28 18:42                   ` Linus Torvalds
  2007-02-28 18:50                     ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Linus Torvalds @ 2007-02-28 18:42 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner



On Wed, 28 Feb 2007, Davide Libenzi wrote:
> 
> Here we very much agree. The way I'd like it:
> 
> struct async_syscall {
> 	unsigned long nr_sysc;
> 	unsigned long params[8];
> 	long result;
> };

No, the "result" needs to go somewhere else. The caller may be totally 
uninterested in keeping the system call number or parameters around until 
the operation completes, but if you put them in the same structure with 
the result, you obviously cannot sanely get rid of them.

I also don't much like read-write interfaces (which the above would be: 
the kernel would read most of the structure, and then write one member of 
the structure). 

It's entirely possible, for example, that the operation we submit is some 
legacy "aio_read()", which has soem other structure layout than the new 
one (but one field will be the result code).

		Linus

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 16:42               ` Linus Torvalds
  2007-02-28 17:26                 ` Ingo Molnar
@ 2007-02-28 18:22                 ` Davide Libenzi
  2007-02-28 18:42                   ` Linus Torvalds
  2007-02-28 23:12                 ` Ingo Molnar
  2 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 18:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Wed, 28 Feb 2007, Linus Torvalds wrote:

> On Wed, 28 Feb 2007, Davide Libenzi wrote:
> > 
> > At this point, given how threadlets can be easily/effectively dispatched 
> > from userspace, I'd question the presence of either single/parallel or 
> > syslet submission altogether. Threadlets allow you to code chains *way* 
> > more naturally than syslets, and since they basically are like function 
> > calls in the fast path, they can be used even for single/parallel submissions. 
> 
> Well, I agree, except for one thing:
>  - user space execution is *inherently* more expensive.
> 
> Why? Stack. Stack. Stack.
> 
> If you support threadlets with user space code, it means that you need a 
> separate user-space stack for each threadlet. That's a potentially *big* 
> cost to bear, both from a setup standpoint and from simply a memory 
> allocation standpoint.

Right, point taken.



> In short - the only thing I *don't* think is a great idea are those linked 
> lists of atoms. I still think it's a pretty horrible interface, and I 
> still don't think it really buys us very much. The only way it would buy 
> us a lot is to change the linked lists dynamically (ie add new events at 
> the end while old events are still executing), but quite frankly, that 
> just makes the whole interface *even*worse* and just makes me have 
> debugging nightmares (I'm also not even convinced it really would help 
> us: we might avoid some costs of adding new events, but it would only 
> avoid them for serial execution, and if the whole point of this is to 
> execute things in parallel, that's a stupid thing to do).
> 
> So I would repeat my call for getting rid of the atoms, and instead just 
> do a "single submission" at a time. Do the linking by running a threadlet 
> that has user space code (and the stack overhead), which is MUCH more 
> flexible. And do nonlinked single system calls without *either* atoms *or* 
> a user-space stack footprint.

Here we very much agree. The way I'd like it:

struct async_syscall {
	unsigned long nr_sysc;
	unsigned long params[8];
	long result;
};

int async_exec(struct async_syscall *a, int n);

or:

int async_exec(struct async_syscall **a, int n);

At this point I'm ok even with the userspace ring buffer, returning 
back pointers to "struct async_syscall".
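The calling convention Davide is proposing can be exercised with a stand-in. Since sys_async_exec never existed, the sketch below fakes it by dispatching each entry synchronously via syscall(2), purely to show the intended semantics (the result is written through the caller-supplied pointer, so the submission record can be thrown away):

```c
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Mirrors the proposed submission record: syscall number, up to eight
 * parameters, and a pointer telling the kernel where to put the result. */
struct async_syscall {
	unsigned long nr_sysc;
	unsigned long params[8];
	long *result;
};

/* Stand-in for the proposed sys_async_exec(): each entry is simply run
 * synchronously here, so the result-delivery semantics can be shown
 * without kernel support.  A real implementation would queue the entries
 * and complete them asynchronously. */
static int fake_async_exec(struct async_syscall *a, int n)
{
	int i;

	for (i = 0; i < n; i++)
		*a[i].result = syscall(a[i].nr_sysc,
				       a[i].params[0], a[i].params[1],
				       a[i].params[2], a[i].params[3],
				       a[i].params[4], a[i].params[5]);
	return n;
}
```

Note how the read-write split Linus asked for falls out naturally: the kernel only reads the struct and only writes through ->result.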




- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 16:42               ` Linus Torvalds
@ 2007-02-28 17:26                 ` Ingo Molnar
  2007-02-28 18:22                 ` Davide Libenzi
  2007-02-28 23:12                 ` Ingo Molnar
  2 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-28 17:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [...] The only way it would buy us a lot is to change the linked lists 
> dynamically (ie add new events at the end while old events are still 
> executing), [...]

that's quite close to what Jens' FIO plugin for syslets 
(engines/syslet-rw.c) does currently: it builds lists of syslets as IO 
gets submitted, batches them up for some time and then sends them off. 
It is a natural next step to do this for in-flight syslets as well.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28  8:02                         ` Evgeniy Polyakov
@ 2007-02-28 17:01                           ` Michael K. Edwards
  0 siblings, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-28 17:01 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On 2/28/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 130 lines skipped...

Yeah, I edited it down a lot before sending it.  :-)

> I have only one question - wasn't it too lazy to write all that? :)

I'm pretty lazy all right.  But occasionally an interesting problem
(and revamping AIO is very interesting) makes me think, and what
little thinking I do is always accompanied by writing.  Once I've
thought something through to the point that I think I understand the
problem, I've even been known to attempt a solution.  Not always,
though; more often, I find a new interesting problem, or else I am
forcibly reminded that I should be spending my little store of insight
on revenue-producing activity.

In this instance, there didn't seem to be any harm in sending my
thoughts to LKML as I wrote them, on the off chance that Ingo or
Davide would get some value out of them in this design cycle (which
any code I eventually get around to producing will miss).  So far,
I've gotten some rather dismissive pushback from Ingo and Alan (who
seem to have no interest outside x86 and less understanding than I
would have thought of what real userspace code looks like), a "why
preach to people who know more than you do" from Davide, a brief aside
on the dominance of x86 from Oleg, and one off-list "keep up the good
work".  Not a very rich harvest from (IMHO) pretty good seeds.

In short, so far the "Linux kernel community" is upholding its
reputation for insularity, arrogance, coding without prior design,
lack of interest in userspace problems, and inability to learn from
the mistakes of others.  (None of these characterizations depends on
there being any real insight in anything I have written.)  Linus
himself has a very different reputation -- plenty of arrogance all
right, but genuine brilliance and hard work, and sincere (if cranky)
efforts to explain the "theory of operations" underlying central
design choices.  So far he hasn't commented directly on anything I
have had to say; it will be interesting to see whether he tells me to
stop annoying the pros and to go away until I have some code to
contribute.

Happy hacking,
- Michael

P. S.  I do think "threadlets" are brilliant, though, and reading
Ingo's patches gave me a much better idea of what would be involved in
prototyping Asynchronously Executed I/O Unit opcodes.

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28 16:17             ` Davide Libenzi
@ 2007-02-28 16:42               ` Linus Torvalds
  2007-02-28 17:26                 ` Ingo Molnar
                                   ` (2 more replies)
  2007-02-28 20:21               ` Ingo Molnar
  1 sibling, 3 replies; 277+ messages in thread
From: Linus Torvalds @ 2007-02-28 16:42 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner



On Wed, 28 Feb 2007, Davide Libenzi wrote:
> 
> At this point, given how threadlets can be easily/effectively dispatched 
> from userspace, I'd question the presence of either single/parallel or 
> syslet submission altogether. Threadlets allow you to code chains *way* 
> more naturally than syslets, and since they basically are like function 
> calls in the fast path, they can be used even for single/parallel submissions. 

Well, I agree, except for one thing:
 - user space execution is *inherently* more expensive.

Why? Stack. Stack. Stack.

If you support threadlets with user space code, it means that you need a 
separate user-space stack for each threadlet. That's a potentially *big* 
cost to bear, both from a setup standpoint and from simply a memory 
allocation standpoint.
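The cost being pointed at is easy to put rough numbers on. A back-of-envelope sketch (the 8 KiB stack and 4 KiB guard page are assumptions; glibc threads default to far more):

```c
#include <assert.h>

/* Back-of-envelope: memory/address space pinned by per-threadlet user
 * stacks.  Each blocked threadlet keeps its stack (plus a guard page)
 * alive until it completes. */
static unsigned long stack_footprint_bytes(unsigned long nr_threadlets,
					   unsigned long stack_bytes,
					   unsigned long guard_bytes)
{
	return nr_threadlets * (stack_bytes + guard_bytes);
}
```

Even with a tiny 8 KiB stack plus one 4 KiB guard page, 10,000 in-flight threadlets pin on the order of 117 MiB, which is the setup/memory cost a stackless single-submission path avoids entirely.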

Quite frankly, I think threadlets are a great idea, but I think the lack 
of user-level footprint is *also* a great idea, and you should support 
both.

In short - the only thing I *don't* think is a great idea are those linked 
lists of atoms. I still think it's a pretty horrible interface, and I 
still don't think it really buys us very much. The only way it would buy 
us a lot is to change the linked lists dynamically (ie add new events at 
the end while old events are still executing), but quite frankly, that 
just makes the whole interface *even*worse* and just makes me have 
debugging nightmares (I'm also not even convinced it really would help 
us: we might avoid some costs of adding new events, but it would only 
avoid them for serial execution, and if the whole point of this is to 
execute things in parallel, that's a stupid thing to do).

So I would repeat my call for getting rid of the atoms, and instead just 
do a "single submission" at a time. Do the linking by running a threadlet 
that has user space code (and the stack overhead), which is MUCH more 
flexible. And do nonlinked single system calls without *either* atoms *or* 
a user-space stack footprint.

Please? 

What am I missing?

		Linus

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28  3:03                       ` Michael K. Edwards
  2007-02-28  8:02                         ` Evgeniy Polyakov
@ 2007-02-28 16:38                         ` Phillip Susi
  1 sibling, 0 replies; 277+ messages in thread
From: Phillip Susi @ 2007-02-28 16:38 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Theodore Tso, Evgeniy Polyakov, Ingo Molnar, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan.van.de.Ven

Michael K. Edwards wrote:
> State machines are much harder to write without going through a real
> on-paper design phase first.  But multi-threaded code is much harder
> for a team of average working coders to write correctly, judging from
> the numerous train wrecks that I've been called in to salvage over the
> last ten years or so.

I have to agree; state machines are harder to design and read, but 
multithreaded programs are harder to write and debug _correctly_.

Another way of putting it is that the threadlet approach is easier to 
do, but harder to do _right_.


^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28  9:45           ` Ingo Molnar
@ 2007-02-28 16:17             ` Davide Libenzi
  2007-02-28 16:42               ` Linus Torvalds
  2007-02-28 20:21               ` Ingo Molnar
  0 siblings, 2 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 16:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Wed, 28 Feb 2007, Ingo Molnar wrote:

> 
> * Davide Libenzi <davidel@xmailserver.org> wrote:
> 
> > Why can't aio_* be implemented with *simple* (or parallel/unrelated) 
> > syscall submit w/out the burden of a complex, limiting and heavy API
> 
> there are so many variants of what people think 'asynchronous IO' should 
> look like - i'd not like to limit them. I agree that once a particular 
> syslet script becomes really popular, it might (and should) in fact be 
> pushed into a separate system call.
> 
> But i also agree that a one-shot-syscall sys_async() syscall could be 
> done too - for those uses where only a single system call is needed and 
> where the fetching of a single uatom would be small but nevertheless 
> unnecessary overhead. A one-shot async syscall needs to get /8/ 
> parameters (the syscall nr is the seventh parameter and the return code 
> of the nested syscall is the eighth). So at least two parameters will 
> have to be passed in indirectly and validated, and 32/64-bit compat 
> conversions added, etc. anyway!

At this point, given how threadlets can be easily/effectively dispatched 
from userspace, I'd question the presence of either single/parallel or 
syslet submission altogether. Threadlets allow you to code chains *way* 
more naturally than syslets, and since they basically are like function 
calls in the fast path, they can be used even for single/parallel submissions. 
No compat code required (ok, besides the trivial async_wait).
My point is, the syslet infrastructure is expensive for the kernel in 
terms of compat and of the extra code added to handle the cond/jumps/etc., 
and it is also non-trivial to use from userspace. Are the performance 
advantages big enough to justify its existence? I doubt that the price of 
a sysenter is a lot bigger than an atom decoding, but I'm looking forward 
to being proven wrong by real-life performance numbers ;)



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 12:11                       ` Evgeniy Polyakov
  2007-02-27 12:13                         ` Ingo Molnar
@ 2007-02-28 16:14                         ` Pavel Machek
  2007-03-01  8:18                           ` Evgeniy Polyakov
  1 sibling, 1 reply; 277+ messages in thread
From: Pavel Machek @ 2007-02-28 16:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

Hi!

> > I think what you are not hearing, and what everyone else is saying
> > (INCLUDING Linus), is that for most programmers, state machines are
> > much, much harder to program, understand, and debug compared to
> > multi-threaded code.  You may disagree (were you a MacOS 9 programmer
> > in another life?), and it may not even be true for you if you happen
> > to be one of those folks more at home with Scheme continuations, for
> > example.  But it is true that for most kernel programmers, threaded
> > programming is much easier to understand, and we need to engineer the
> > kernel for what will be maintainable for the majority of the kernel
> > development community.
> 
> I understand that - and I totally agree.
> But when more complex, more bug-prone code results in higher performance
> - that must be used. We have linked lists and binary trees - the latter

No-o. Kernel is not designed like that.

Often, more complex and slightly faster code exists, and we simply use
slower variant, because it is fast enough.

10% gain in speed is NOT worth major complexity increase.
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 19:38         ` Davide Libenzi
@ 2007-02-28  9:45           ` Ingo Molnar
  2007-02-28 16:17             ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-28  9:45 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Davide Libenzi <davidel@xmailserver.org> wrote:

> > my current thinking is that special-purpose (non-programmable, 
> > static) APIs like aio_*() and lio_*(), where every last cycle of 
> > performance matters, should be implemented using syslets - even if 
> > it is quite tricky to write syslets (which they no doubt are - just 
> > compare the size of syslet-test.c to threadlet-test.c). So i'd move 
> > syslets into the same category as raw syscalls: pieces of the raw 
> > infrastructure between the kernel and glibc, not an exposed API to 
> > apps. [and even if we keep them in that category they still need 
> > quite a bit of API work, to clean up the 32/64-bit issues, etc.]
> 
> Why can't aio_* be implemented with *simple* (or parallel/unrelated) 
> syscall submit w/out the burden of a complex, limiting and heavy API

there are so many variants of what people think 'asynchronous IO' should 
look like - i'd not like to limit them. I agree that once a particular 
syslet script becomes really popular, it might (and should) in fact be 
pushed into a separate system call.

But i also agree that a one-shot-syscall sys_async() syscall could be 
done too - for those uses where only a single system call is needed and 
where the fetching of a single uatom would be small but nevertheless 
unnecessary overhead. A one-shot async syscall needs to get /8/ 
parameters (the syscall nr is the seventh parameter and the return code 
of the nested syscall is the eighth). So at least two parameters will 
have to be passed in indirectly and validated, and 32/64-bit compat 
conversions added, etc. anyway!

The copy_uatom() assembly code i did is really fast so i doubt there 
would be much measurable performance difference between the two 
solutions. Plus, putting the uatom into user memory allows the caching 
of uatoms - further diluting the advantage of passing in the values per 
register. The whole difference should be on the order of 10 cycles, so 
this really isnt a high prio item in my view.

> Now that chains of syscalls can be way more easily handled with 
> clets^wthreadlets, why would we need the whole syslets crud inside?

no, threadlets dont really solve the basic issue of people wanting to 
'combine' syscalls, avoid the syscall entry overhead (even if that is 
small), and the desire to rely on kthread->kthread context switching 
which is even faster than uthread->uthread context-switching, etc. 
Furthermore, syslets dont really cause any new problem. They are almost 
totally orthogonal, isolated, and cause no wide infrastructure needs.

as long as syslets remain a syscall-level API, for the measured use of 
the likes of glibc and libaio (and not exposed in a programmable manner 
to user-space), i see no big problem with them at all. They can also be 
used without them having any classic pthread user-state (without linking 
to libpthread). Think of it like the raw use of clone(): possible and 
useful in some cases, but not something that a typical application would 
do. This is a 'raw syscall plugins' thing, to be used by those 
user-space entities that use raw syscalls: infrastructure libraries. Raw 
syscalls themselves are tied to the platform, are not easily used in 
some cases, thus almost no application uses them directly, but uses the 
generic functions glibc exposes.

in the long run, sys_syslet_exec(), were it not to establish itself as a 
widely used interface, could be implemented purely from user-space too 
(say from the VDSO, at much worse performance, but the kernel would stay 
backwards compatible with the syscall), so there's almost no risk here. 
You dont like it => dont use it. Meanwhile, i'll happily take any 
suggestion to make the syslet API more digestible.

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-28  3:03                       ` Michael K. Edwards
@ 2007-02-28  8:02                         ` Evgeniy Polyakov
  2007-02-28 17:01                           ` Michael K. Edwards
  2007-02-28 16:38                         ` Phillip Susi
  1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-28  8:02 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Tue, Feb 27, 2007 at 07:03:21PM -0800, Michael K. Edwards (medwards.linux@gmail.com) wrote:
> 
> State machines are much harder to write without going through a real
> on-paper design phase first.  But multi-threaded code is much harder
> for a team of average working coders to write correctly, judging from
> the numerous train wrecks that I've been called in to salvage over the
> last ten years or so.

130 lines skipped...

I have only one question - wasn't it too much work to write all that? :)

> Cheers,
> - Michael

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 11:52                     ` Theodore Tso
  2007-02-27 12:11                       ` Evgeniy Polyakov
  2007-02-27 12:34                       ` Ingo Molnar
@ 2007-02-28  3:03                       ` Michael K. Edwards
  2007-02-28  8:02                         ` Evgeniy Polyakov
  2007-02-28 16:38                         ` Phillip Susi
  2 siblings, 2 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-28  3:03 UTC (permalink / raw)
  To: Theodore Tso, Evgeniy Polyakov, Ingo Molnar, Linus Torvalds,
	Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On 2/27/07, Theodore Tso <tytso@mit.edu> wrote:
> I think what you are not hearing, and what everyone else is saying
> (INCLUDING Linus), is that for most programmers, state machines are
> much, much harder to program, understand, and debug compared to
> multi-threaded code.  You may disagree (were you a MacOS 9 programmer
> in another life?), and it may not even be true for you if you happen
> to be one of those folks more at home with Scheme continuations, for
> example.  But it is true that for most kernel programmers, threaded
> programming is much easier to understand, and we need to engineer the
> kernel for what will be maintainable for the majority of the kernel
> development community.

State machines are much harder to write without going through a real
on-paper design phase first.  But multi-threaded code is much harder
for a team of average working coders to write correctly, judging from
the numerous train wrecks that I've been called in to salvage over the
last ten years or so.

The typical 50-250KLoC multi-threaded C/C++/Java application, even if
it's been shipping to customers for several years, is littered with
locking constructs yet routinely corrupts shared data structures.
Change the number of threads in a pool, adjust the thread priorities,
or move a couple of lines of code around, and you're very likely to
get an unexplained deadlock.  God help you if you try to use a
debugger on it -- hundreds of latent race conditions will crop up that
didn't happen to trigger before because the thread didn't get
preempted there.

The only programming languages that I have seen widely used in US
industry (so Lisps and self-consciously functional languages are out)
in which mere mortals write passable multi-threaded applications are
Visual Basic and Python.  That's partly because programmers in these
languages are not in the habit of throwing pointers around; but if
that were all there was to it, Java programmers would be a lot more
successful than they are at actually writing threaded programs rather
than nibbling cautiously around the edges with EJB.  It also helps a
lot that strings are immutable; but again, Java shares this property.
No, the big difference is that VB and Python dicts and arrays are made
thread-safe by the language runtime, and Java collections are not.
So while there may be all sorts of pointless and dangerous
mis-locking, it's "protecting" somewhat higher-level data structures.

What does this have to do with the kernel?  Well, if you're going to
create Yet Another Micro^WLightweight-Threading Construct for AIO, it
would be mighty nice not to be slinging bare pointers around until the
IO is actually complete and the kernel isn't going to be touching the
data buffer any more.  It would also be mighty nice to have a
thread-safe "request pool" data structure on which actions like bulk
cancellation and iteration over a subset can operate.  (The iterator
returned by, say, a three-sided query on an RCU priority queue may
contain _stale_ information, but never _inconsistent_ information.)

I recognize that this is more object-oriented snake oil than kernel
programmers usually tolerate, but it would really help AIO-threaded
programs suck less.  It is also very much in the Unix tradition --
what are file descriptors and fd_sets if not object-oriented design?
And if following the socket model was good enough for epoll and
netlink, why not for threadlet pools?

In the best of all possible worlds, AIO would look just like the good
old socket-bind-listen-accept model, except that I/O is transacted on
the "listen" socket as long as it can be serviced from cache, and
accept() only gets a new connection when a delayed I/O arrives.  The
object hiding inside the fd returned by socket() would be the
"threadlet pool", and the object hiding inside each fd returned by
accept() would be a threadlet.  Only this time you do it right and
make errno(fd) be a vsyscall that returns a threadlet-local error
state, and you assign reasonable semantics to operations on an fd that
has already encountered an exception.  Much like IEEE 754, actually.

Anyway, like I said, good threaded code is quite rare.  On the other
hand, I have seen plenty of reasonably successful event-loop
programming in C and C++, mostly in MacOS and Windows and PalmOS GUIs
where the OS deals with event queues and event handler registration.
It's not the world's most CPU-efficient strategy because of all those
function pointers and virtual methods, but those costs are dwarfed by
the GUI bounding boxes and repaints and things anyway.  More to the
point, writing an event-loop framework for other people to use
involves extensive APIs that are stable in the tiniest details and
extensively documented.  Not, perhaps, Plan A for the Linux kernel
community.  :-)

Happily, a largely event-driven userspace framework can easily be
stacked on top of a threaded kernel -- as long as they're the right
kind of threads.  The right kind of threads do not proliferate malloc
arenas by allowing preemption in mid-malloc.  (They may need to
malloc(), and that may be a preemption point relative to _real_
threads, but you shouldn't switch or cancel threadlets there.)  The
right kind of threads do not leak locking primitives when cancelled,
because they don't have to take a lock in order to update the right
kind of data structure.  The right kind of threads can use floating
point safely as long as they don't expect FPU state to be preserved
across a syscall.

The right kind of threads, in short, work like coroutines or
MacOS/PalmOS "event handlers", with the added convenience of being
able to write them as if they were normal sequential code, with normal
access to a persistent stack frame and to process globals.  And if you
do them right, they're cheap to migrate, easy and harmless to throttle
and cancel in bulk, and easy to punt out to an "I/O coprocessor" in
the future.  The key is to move data into and out of the "I/O
registers" at well-defined points and not to break the encapsulation
in between.  Delay the extraction of results from the "I/O registers"
as long as possible, and the hypothetical AIO coprocessor can go chew
on them in parallel with the main "integer" code flow, which only
stalls when it can't go any further without the I/O result.

If you've got some time to kill, you can even analyze an existing,
well-documented flavor of I/O strings (I like James Antill's Vstr
library myself) and define a set of "AIO opcodes" that manipulate AIO
fds and AIO strings as primitive types, just like the FPU manipulates
floating-point numbers as primitive types.  Pick a CPU architecture
with a good range of unused trap opcodes (PPC, for instance) and
actually move the I/O strings into kernel space, mapping the AIO
operations and the I/O string API onto the free trap opcodes (really
no different from syscalls, except it's easier to inspect the assembly
that the compiler produces and see what's going on).  For extra
credit, implement most of the AIO opcodes in Verilog, bolt them onto
the PPC core inside a Virtex4 FX FPGA, refine the semantics to permit
efficient pipelining, and collect your Turing award.  :-)

Cheers,
- Michael


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 14:15 Al Boldi
@ 2007-02-27 19:22 ` Theodore Tso
  2007-03-01 10:21 ` Pavel Machek
  1 sibling, 0 replies; 277+ messages in thread
From: Theodore Tso @ 2007-02-27 19:22 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel

On Tue, Feb 27, 2007 at 05:15:31PM +0300, Al Boldi wrote:
> > You may disagree (were you a MacOS 9 programmer
> > in another life?), and it may not even be true for you if you happen
> > to be one of those folks more at home with Scheme continuations, for
> > example.
> 
> Personal attacks are really rather unhelpful/unscientific.

Just to be clear; this wasn't a personal attack.  I know a lot of
people who I greatly respect who were MacOS and Scheme programmers (I
went to school at MIT, after all, the birthplace of Scheme).  The
reality though is that most people don't program that way, and their
brains aren't wired in that fashion.  There is a reason why procedural
languages are far more common than purely functional models, and why
aside from (pre-version 10) MacOS, most OS's don't use an event driven
system call interface.

> So, instead of using intimidating language to force one's opinion thru, 
> especially when it comes from those in control, why not have a democratic 
> vote?

So far, I'd have to say the people arguing for an event driven model
are in the distinct minority...

And as far as voting is concerned, I prefer informed voters who can
explain, in their own words, why they are in favor of a particular
alternative.

Regards,

						- Ted


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 16:21                                         ` Evgeniy Polyakov
  2007-02-27 16:58                                           ` Eric Dumazet
@ 2007-02-27 19:20                                           ` Davide Libenzi
  1 sibling, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-27 19:20 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Tue, 27 Feb 2007, Evgeniy Polyakov wrote:

> I probably selected the wrong words to describe it; here in detail is how
> kevent differs from epoll.
> 
> The polling case needs to perform an additional check before an event can
> be copied to userspace, and that check must be done for each event being
> copied.  Kevent does not need that (it needs it only for poll emulation) -
> if an event is ready, then it is ready.

That could be changed too. The "void *key" doesn't need to be NULL. Wake 
ups to f_op->poll() waiters can use that to send ready events directly, 
avoiding an extra f_op->poll() to fetch them.
Infrastructure is already there, just need a big patch to do it everywhere ;)



> Kevent works slightly differently - it does not perform an additional
> readiness check (although it can, and it does for poll notifications): if
> an event is marked as ready, the thread parked in the waiting syscall is
> awakened and the event is copied to userspace.
> Also, the waiting syscall is awakened through a single queue - the event
> is added and wake_up() is called - while in epoll() there are two queues.

The really ancient version of epoll (called /dev/epoll at that time) was 
doing a very similar thing. It was adding custom plugs all over the places 
where we wanted to get events from, and was collecting them w/out 
resorting to extra f_op->poll(). Event masks were going straight through an 
event buffer.
The reason why the current design of epoll was chosen was because:

o It was not requiring custom plugs all over the places
o It was working with the current kernel abstractions as-is (through f_op->poll)




- Davide




* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 20:35                                   ` Ingo Molnar
  2007-02-26 22:06                                     ` Bill Huey
  2007-02-27 10:09                                     ` Evgeniy Polyakov
@ 2007-02-27 17:13                                     ` Pavel Machek
  2 siblings, 0 replies; 277+ messages in thread
From: Pavel Machek @ 2007-02-27 17:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

Hi!

> > If kernelspace rescheduling is that fast, then please explain to me why 
> > the userspace one always beats kernel/userspace?
> 
> because 'user space scheduling' makes no sense? I explained my thinking 
> about that in a past mail:
...
>  2) there has been an IO event. The thing is, for IO events we enter the
>     kernel no matter what - and we'll do so for the next 10 years at

..actually, at some point 3D acceleration was done by accessing hw
directly from userspace. OTOH I think we are moving away from that
model, so it is probably irrelevant here.
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 16:58                                           ` Eric Dumazet
@ 2007-02-27 17:06                                             ` Evgeniy Polyakov
  0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 17:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Davide Libenzi, Ingo Molnar, Ulrich Drepper,
	Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Tue, Feb 27, 2007 at 05:58:14PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> I believe one advantage of epoll is that it uses the standard mechanism 
> (mandated for poll()/select()), while kevent adds some glue and a 
> kevent_storage in some structures (struct inode, struct file, ...), thus 
> adding some extra code and extra storage in hot paths. Yes, there might be 
> a gain IF most users of these paths want kevent. But other users pay the 
> price (larger kernel code and data), one that you cannot easily benchmark.
> 
> Using or not using epoll has nearly zero cost over the standard kernel 
> (only struct file has some extra storage)

Well, that's the price - any subsystem which wants to support events needs
storage for them - kevent_storage is a list_head plus a spinlock and a
pointer to itself (with all current users that pointer can be removed and
access transferred to container_of()) - it is exactly like the epoll
storage: both were created with the smallest possible overhead in mind.

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 16:21                                         ` Evgeniy Polyakov
@ 2007-02-27 16:58                                           ` Eric Dumazet
  2007-02-27 17:06                                             ` Evgeniy Polyakov
  2007-02-27 19:20                                           ` Davide Libenzi
  1 sibling, 1 reply; 277+ messages in thread
From: Eric Dumazet @ 2007-02-27 16:58 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Davide Libenzi, Ingo Molnar, Ulrich Drepper,
	Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Tuesday 27 February 2007 17:21, Evgeniy Polyakov wrote:
> I probably selected the wrong words to describe it; here in detail is how
> kevent differs from epoll.
>
> The polling case needs to perform an additional check before an event can
> be copied to userspace, and that check must be done for each event being
> copied.  Kevent does not need that (it needs it only for poll emulation) -
> if an event is ready, then it is ready.
>
> sys_poll() creates a wait queue where different events (callbacks for
> them) are stored; when a driver calls wake_up(), the appropriate event is
> added to the ready list and wake_up() is called for that wait queue, which
> in turn calls ->poll for each event and transfers it to userspace if it
> is ready.
>
> Kevent works slightly differently - it does not perform an additional
> readiness check (although it can, and it does for poll notifications): if
> an event is marked as ready, the thread parked in the waiting syscall is
> awakened and the event is copied to userspace.
> Also, the waiting syscall is awakened through a single queue - the event
> is added and wake_up() is called - while in epoll() there are two queues.

Thank you Evgeniy for this comparison. poll()/select()/epoll() are tricky 
indeed.

I believe one advantage of epoll is that it uses the standard mechanism 
(mandated for poll()/select()), while kevent adds some glue and a 
kevent_storage in some structures (struct inode, struct file, ...), thus 
adding some extra code and extra storage in hot paths. Yes, there might be 
a gain IF most users of these paths want kevent. But other users pay the 
price (larger kernel code and data), one that you cannot easily benchmark.

Using or not using epoll has nearly zero cost over the standard kernel 
(only struct file has some extra storage)



* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 16:01                                       ` Davide Libenzi
@ 2007-02-27 16:21                                         ` Evgeniy Polyakov
  2007-02-27 16:58                                           ` Eric Dumazet
  2007-02-27 19:20                                           ` Davide Libenzi
  0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 16:21 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Tue, Feb 27, 2007 at 08:01:05AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> It does not matter if inside the loop you invert a 20Kx20K matrix or you 
> copy a byte, they both are O(num_ready).

I probably selected the wrong words to describe it; here in detail is how
kevent differs from epoll.

The polling case needs to perform an additional check before an event can
be copied to userspace, and that check must be done for each event being
copied.  Kevent does not need that (it needs it only for poll emulation) -
if an event is ready, then it is ready.

sys_poll() creates a wait queue where different events (callbacks for
them) are stored; when a driver calls wake_up(), the appropriate event is 
added to the ready list and wake_up() is called for that wait queue, which 
in turn calls ->poll for each event and transfers it to userspace if it is
ready.

Kevent works slightly differently - it does not perform an additional
readiness check (although it can, and it does for poll notifications): if
an event is marked as ready, the thread parked in the waiting syscall is
awakened and the event is copied to userspace.
Also, the waiting syscall is awakened through a single queue - the event
is added and wake_up() is called - while in epoll() there are two queues.

> - Davide
> 

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 10:13                                     ` Evgeniy Polyakov
@ 2007-02-27 16:01                                       ` Davide Libenzi
  2007-02-27 16:21                                         ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-27 16:01 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Tue, 27 Feb 2007, Evgeniy Polyakov wrote:

> On Mon, Feb 26, 2007 at 06:18:51PM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
> > 
> > > 2. its notifications do not go through the second loop, i.e. it is O(1),
> > > not O(ready_num), and notifications happen directly from internals of
> > > the appropriate subsystem, which does not require special wakeup
> > > (although it can be done too).
> > 
> > Sorry if I do not read kevent code correctly, but in kevent_user_wait() 
> > there is a:
> > 
> >     while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
> >         ...
> >     }
> > 
> > loop, that make it O(ready_num). From a mathematical standpoint, they're 
> > both O(ready_num), but epoll is doing three passes over the ready set.
> > I always thought that if the number of ready events is so big that the extra 
> > passes over the ready set become relevant, probably the "work" done by 
> > userspace for each fetched event would make the extra cost irrelevant.
> > But that can be fixed by a patch that will follow on lkml ...
>  
> No, kevent_dequeue_ready() copies data to userspace, that is it.
> So it looks roughly following:

In all the books where I studied, the algorithms below would be classified 
as O(num_ready) ones:

[sys_kevent_wait]
+	for (i=0; i<num; ++i) {
+		k = kevent_dequeue_ready_ring(u);
+		if (!k)
+			break;
+		kevent_complete_ready(k);
+
+		if (k->event.ret_flags & KEVENT_RET_COPY_FAILED)
+			break;
+		kevent_stat_ring(u);
+		copied++;
+	}

[kevent_user_wait]
+	while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
+		if (copy_to_user(buf + num*sizeof(struct ukevent),
+					&k->event, sizeof(struct ukevent))) {
+			if (num == 0)
+				num = -EFAULT;
+			break;
+		}
+		kevent_complete_ready(k);
+		++num;
+		kevent_stat_wait(u);
+	}

It does not matter if inside the loop you invert a 20Kx20K matrix or you 
copy a byte, they both are O(num_ready).



- Davide




* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
@ 2007-02-27 14:15 Al Boldi
  0 siblings, 0 replies; 277+ messages in thread
From: Al Boldi @ 2007-02-27 14:15 UTC (permalink / raw)
  To: linux-kernel

Evgeniy Polyakov wrote:
> Ingo Molnar (mingo@elte.hu) wrote:
> > based servers. The measurements so far have shown that the absolute
> > worst-case threading server performance is at around 60% of that of
> > non-context-switching servers - and even that level is reached
> > gradually, leaving time for action for the server owner. While with
> > fully event based servers there are mostly only two modes of
> > performance: 100% performance and near-0% performance: total breakdown.
>
> Let's live in peace! :)
> I always agreed that they should be used together - event-based rings
> of IO requests; if a request happens to block (which should be avoided
> as much as possible), then it continues on behalf of a sleeping thread.

Agreed 100%.  Notify always when you can, thread only when you must.


Thanks!

--
Al



* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
@ 2007-02-27 14:15 Al Boldi
  2007-02-27 19:22 ` Theodore Tso
  2007-03-01 10:21 ` Pavel Machek
  0 siblings, 2 replies; 277+ messages in thread
From: Al Boldi @ 2007-02-27 14:15 UTC (permalink / raw)
  To: linux-kernel

Theodore Tso wrote:
> On Tue, Feb 27, 2007 at 01:28:32PM +0300, Evgeniy Polyakov wrote:
> > Obviously there are bugs, it is simply how things work.
> > And debugging state machine code has exactly the same complexity as
> > debugging multi-threading code - if not less...
>
> Evgeniy,
>
> I think what you are not hearing, and what everyone else is saying
> (INCLUDING Linus),

Excluding possibly many others.

> is that for most programmers, state machines are
> much, much harder to program, understand, and debug compared to
> multi-threaded code.

That's why you introduce an infrastructure that hides all the nitty-gritty 
plumbing, and makes it easy to use.

> You may disagree (were you a MacOS 9 programmer
> in another life?), and it may not even be true for you if you happen
> to be one of those folks more at home with Scheme continuations, for
> example.

Personal attacks are really rather unhelpful/unscientific.

> But it is true that for most kernel programmers, threaded
> programming is much easier to understand, and we need to engineer the
> kernel for what will be maintainable for the majority of the kernel
> development community.

What's probably true is that, for a kernel to stay competitive, you need two 
distinct traits:

1. Stability
2. Performance

And you can't get that by arguing that the kernel development community 
doesn't have the brains to code for performance, which I dearly doubt.

So, instead of using intimidating language to force one's opinion thru, 
especially when it comes from those in control, why not have a democratic 
vote?


Thanks!

--
Al



* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 12:34                       ` Ingo Molnar
  2007-02-27 13:14                         ` Evgeniy Polyakov
@ 2007-02-27 13:32                         ` Avi Kivity
  1 sibling, 0 replies; 277+ messages in thread
From: Avi Kivity @ 2007-02-27 13:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Theodore Tso, Evgeniy Polyakov, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

Ingo Molnar wrote:
> * Theodore Tso <tytso@mit.edu> wrote:
>
>   
>> I think what you are not hearing, and what everyone else is saying 
>> (INCLUDING Linus), is that for most programmers, state machines are 
>> much, much harder to program, understand, and debug compared to 
>> multi-threaded code. [...]
>>     
>
> btw., another crucial thing that i think Evgeniy is missing is that 
> threadlets /enable/ event loops to be used in practice! Right now the 
> epoll/kevent programming model requires a total 100% avoidance of all 
> context-switching in the 'main' event handler context while handling a 
> request. If just 1% of all requests happen to block it might cause a 
> /complete/ breakdown of an event loop's performance - it can easily 
> cause a 10x drop in performance or worse!
>
> So context-switching has to be avoided in 100% of the code that runs 
> while handling requests, file descriptors have to be set to nonblocking 
> (causing extra system calls), and all the syscalls that might return 
> incomplete with either -EAGAIN or with a short read/write have to be 
> converted into a state machine. (or in the alternative, user-space 
> threading has to be used, which opens up another hornet's nest)
>
> /That/ is the main inhibiting factor of the measured use of event loops 
> within Linux! It has zero integration capabilities with 'usual' coding 
> techniques - driving the costs of its application up in the sky, and 
> pushing event based servers into niches.
>
>   

Having written such a niche event based server, I can 100% confirm what 
Ingo is saying here.  We had a single process drive I/O to the kernel 
through an event model (based on kernel aio extended with IO_CMD_POLL), 
and user level threads managed by a custom scheduler that managed I/O, 
timeouts, and thread scheduling.

We once considered dropping from a user-level thread model to a state 
machine model, but the effort was astronomical and we wouldn't see the 
rewards until it was all done, so naturally we didn't do it.

> With threadlets the picture changes dramatically: all we have to 
> concentrate on to get the performance of "100% event based servers" is 
> to handle 'most' rescheduling events in the event loop. A 10-20% context 
> switching ratio does not hurt at all. (it causes ~1% of throughput 
> loss.)
>
> Furthermore, even if a particular configuration or module of the server 
> (say Apache) happens to trigger a high rate of scheduling, the 
> performance breakdown model of threadlets is /vastly/ superior to event 
> based servers. The measurements so far have shown that the absolute 
> worst-case threading server performance is at around 60% of that of 
> non-context-switching servers - and even that level is reached 
> gradually, leaving time for action for the server owner. While with 
> fully event based servers there are mostly only two modes of 
> performance: 100% performance and near-0% performance: total breakdown.
>   

Yes.  Threadlets as the default aio solution (easy to use, acceptable 
performance even in worst cases), with specialized solutions where 
applicable (epoll for networking, aio for O_DIRECT disk) look like a 
good mix of performance and sanity.



-- 
error compiling committee.c: too many arguments to function



* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 12:34                       ` Ingo Molnar
@ 2007-02-27 13:14                         ` Evgeniy Polyakov
  2007-02-27 13:32                         ` Avi Kivity
  1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 13:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Theodore Tso, Linus Torvalds, Ulrich Drepper, linux-kernel,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Tue, Feb 27, 2007 at 01:34:21PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> based servers. The measurements so far have shown that the absolute 
> worst-case threading server performance is at around 60% of that of 
> non-context-switching servers - and even that level is reached 
> gradually, leaving time for action for the server owner. While with 
> fully event based servers there are mostly only two modes of 
> performance: 100% performance and near-0% performance: total breakdown.

Let's live in peace! :)
I always agreed that the two should be used together - event-based rings
of IO requests, and if a request happens to block (which should be
avoided as much as possible), it continues on behalf of a sleeping
thread.

> 	Ingo

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 12:13                         ` Ingo Molnar
@ 2007-02-27 12:40                           ` Evgeniy Polyakov
  0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 12:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Theodore Tso, Linus Torvalds, Ulrich Drepper, linux-kernel,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Tue, Feb 27, 2007 at 01:13:28PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > > [...]  But it is true that for most kernel programmers, threaded 
> > > programming is much easier to understand, and we need to engineer 
> > > the kernel for what will be maintainable for the majority of the 
> > > kernel development community.
> > 
> > I understand that - and I totally agree.
> 
> why did you then write, just one mail ago, the exact opposite:
> 
> > And debugging state machine code has exactly the same complexity as 
> > debugging multi-threading code - if not less...

Because the thread machinery is much more complex than the event one -
just compare the amounts of code: kernel/sched.c alone is about the same
size as the whole of kevent :)

> the kernel /IS/ multi-threaded code.
> 
> which statement of yours is also patently, absurdly untrue. 
> Multithreaded code is harder to debug than process based code, but it is 
> still a breeze compared to complex state-machines...

It seems that we are talking about different levels.

The model I propose to use in userspace involves very simple events,
mostly about completion of a request - they are simple to use and simple
to debug. It can be slightly harder to debug than the simplest threading
model (one thread - one logical entity, which would never interact with
others), though.

From the userspace point of view it is about the same complexity to check
why an event is not marked as ready as why some thread never got
scheduled...
And that does not take into account the synchronization needed to run
multithreaded code without problems.

> 	Ingo

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 11:52                     ` Theodore Tso
  2007-02-27 12:11                       ` Evgeniy Polyakov
@ 2007-02-27 12:34                       ` Ingo Molnar
  2007-02-27 13:14                         ` Evgeniy Polyakov
  2007-02-27 13:32                         ` Avi Kivity
  2007-02-28  3:03                       ` Michael K. Edwards
  2 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-27 12:34 UTC (permalink / raw)
  To: Theodore Tso, Evgeniy Polyakov, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
	Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Theodore Tso <tytso@mit.edu> wrote:

> I think what you are not hearing, and what everyone else is saying 
> (INCLUDING Linus), is that for most programmers, state machines are 
> much, much harder to program, understand, and debug compared to 
> multi-threaded code. [...]

btw., another crucial thing that i think Evgeniy is missing is that 
threadlets /enable/ event loops to be used in practice! Right now the 
epoll/kevent programming model requires a total 100% avoidance of all 
context-switching in the 'main' event handler context while handling a 
request. If just 1% of all requests happen to block it might cause a 
/complete/ breakdown of an event loop's performance - it can easily 
cause a 10x drop in performance or worse!

So context-switching has to be avoided in 100% of the code that runs 
while handling requests, file descriptors have to be set to nonblocking 
(causing extra system calls), and all the syscalls that might return 
incomplete with either -EAGAIN or with a short read/write have to be 
converted into a state machine. (or in the alternative, user-space 
threading has to be used, which opens up another hornet's nest)

/That/ is the main inhibiting factor of the measured use of event loops 
within Linux! It has zero integration capabilities with 'usual' coding 
techniques - driving the costs of its application up in the sky, and 
pushing event based servers into niches.

With threadlets the picture changes dramatically: all we have to 
concentrate on to get the performance of "100% event based servers" is 
to handle 'most' rescheduling events in the event loop. A 10-20% context 
switching ratio does not hurt at all. (it causes ~1% of throughput 
loss.)

Furthermore, even if a particular configuration or module of the server 
(say Apache) happens to trigger a high rate of scheduling, the 
performance breakdown model of threadlets is /vastly/ superior to event 
based servers. The measurements so far have shown that the absolute 
worst-case threading server performance is at around 60% of that of 
non-context-switching servers - and even that level is reached 
gradually, leaving time for action for the server owner. While with 
fully event based servers there are mostly only two modes of 
performance: 100% performance and near-0% performance: total breakdown.

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 12:11                       ` Evgeniy Polyakov
@ 2007-02-27 12:13                         ` Ingo Molnar
  2007-02-27 12:40                           ` Evgeniy Polyakov
  2007-02-28 16:14                         ` Pavel Machek
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-27 12:13 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Theodore Tso, Linus Torvalds, Ulrich Drepper, linux-kernel,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > [...]  But it is true that for most kernel programmers, threaded 
> > programming is much easier to understand, and we need to engineer 
> > the kernel for what will be maintainable for the majority of the 
> > kernel development community.
> 
> I understand that - and I totally agree.

why did you then write, just one mail ago, the exact opposite:

> And debugging state machine code has exactly the same complexity as 
> debugging multi-threading code - if not less...

the kernel /IS/ multi-threaded code.

which statement of yours is also patently, absurdly untrue. 
Multithreaded code is harder to debug than process based code, but it is 
still a breeze compared to complex state-machines...

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 11:52                     ` Theodore Tso
@ 2007-02-27 12:11                       ` Evgeniy Polyakov
  2007-02-27 12:13                         ` Ingo Molnar
  2007-02-28 16:14                         ` Pavel Machek
  2007-02-27 12:34                       ` Ingo Molnar
  2007-02-28  3:03                       ` Michael K. Edwards
  2 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 12:11 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ingo Molnar, Linus Torvalds, Ulrich Drepper, linux-kernel,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Tue, Feb 27, 2007 at 06:52:22AM -0500, Theodore Tso (tytso@mit.edu) wrote:
> On Tue, Feb 27, 2007 at 01:28:32PM +0300, Evgeniy Polyakov wrote:
> > Obviously there are bugs, it is simply how things work.
> > And debugging state machine code has exactly the same complexity as
> > debugging multi-threading code - if not less...
> 
> Evgeniy,

Hi Ted.

> I think what you are not hearing, and what everyone else is saying
> (INCLUDING Linus), is that for most programmers, state machines are
> much, much harder to program, understand, and debug compared to
> multi-threaded code.  You may disagree (were you a MacOS 9 programmer
> in another life?), and it may not even be true for you if you happen
> to be one of those folks more at home with Scheme continuations, for
> example.  But it is true that for most kernel programmers, threaded
> programming is much easier to understand, and we need to engineer the
> kernel for what will be maintainable for the majority of the kernel
> development community.

I understand that - and I totally agree.
But when more complex, more bug-prone code results in higher performance
- it must be used. We have linked lists and binary trees - the latter
are quite complex structures, but they give higher performance in search
operations, so we use them.

The same applies to state machines - yes, in some cases they are hard to
program, but when things are already implemented and wrapped into a
nice (non-POSIX) aio_read(), there is absolutely no usage complexity.

Even if it is up to the programmer to program a state machine based on
generated events, such higher-layer state machines are not complex.

Let's take the simple case of (aio_)read() from a file descriptor: if
the page is in the cache, no readpage() method will be called, so we do
not need to create any kind of event - we just copy the data. If there
is no page, or the page is not uptodate, we allocate a bio and do not
wait until the buffers are read - we return to userspace and start
another read. When the bio is completed and its end_io callback is
called, we mark the pages as uptodate, copy the data to userspace, and
mark the event bound to the above (aio_)read() as completed.
(That is how kevent AIO works, btw.)
The userspace programmer just calls
cookie = aio_read();
aio_wait(cookie);
or something like that.

It is simple and straightforward, especially if the data read must then
be used somewhere else - in that case the processing thread will need to
cooperate with the main one, which is simple in the event model, since
there is one place where events of _all_ types are gathered.

> Regards,
> 
> 						- Ted

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 10:28                   ` Evgeniy Polyakov
@ 2007-02-27 11:52                     ` Theodore Tso
  2007-02-27 12:11                       ` Evgeniy Polyakov
                                         ` (2 more replies)
  0 siblings, 3 replies; 277+ messages in thread
From: Theodore Tso @ 2007-02-27 11:52 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Linus Torvalds, Ulrich Drepper, linux-kernel,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Tue, Feb 27, 2007 at 01:28:32PM +0300, Evgeniy Polyakov wrote:
> Obviously there are bugs, it is simply how things work.
> And debugging state machine code has exactly the same complexity as
> debugging multi-threading code - if not less...

Evgeniy,

I think what you are not hearing, and what everyone else is saying
(INCLUDING Linus), is that for most programmers, state machines are
much, much harder to program, understand, and debug compared to
multi-threaded code.  You may disagree (were you a MacOS 9 programmer
in another life?), and it may not even be true for you if you happen
to be one of those folks more at home with Scheme continuations, for
example.  But it is true that for most kernel programmers, threaded
programming is much easier to understand, and we need to engineer the
kernel for what will be maintainable for the majority of the kernel
development community.

Regards,

						- Ted


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27 10:41                                         ` Evgeniy Polyakov
@ 2007-02-27 10:49                                           ` Ingo Molnar
  0 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-27 10:49 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > does that work for you?
> 
> Yes, -fomit-frame-pointer did the trick.
> 
> On average, the threadlet runs as fast as epoll.

yeah.

> simply because there is _no_ rescheduling in that case.

in my test it was 'little', not 'no'. But yes, that's exactly my point: 
we can remove the nonblock hackeries from event loops and just 
concentrate on making it schedule in less than 10-20% of the cases. Even 
a relatively high 10-20% rescheduling rate is hardly measurable with 
threadlets, while it gives a 10%-20% regression (and possibly bad 
latencies) for the pure epoll/kevent server.

and such a mixed technique is simply not possible with ordinary 
user-space threads, because there it's an all-or-nothing affair: either 
you go fully to threads (at which point we are again back to a fully 
threaded design, now also saddled with event loop overhead), or you try 
to do user-space threads, which Just Make Little Sense (tm).

so threadlets remove the biggest headache from event loops: they dont 
have to be '100% nonblocking' anymore. No O_NONBLOCK overhead, no 
complex state machines - just handle the most common event type via an 
outer event loop and keep the other 99% of server complexity in plain 
procedural programming. 1% of state-machine code is perfectly 
acceptable.

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27  6:24                                       ` Ingo Molnar
@ 2007-02-27 10:41                                         ` Evgeniy Polyakov
  2007-02-27 10:49                                           ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 10:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Tue, Feb 27, 2007 at 07:24:19AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > On Mon, Feb 26, 2007 at 01:51:23PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > > 
> > > * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > > 
> > > > Even having main dispatcher as epoll/kevent loop, the _whole_ 
> > > > threadlet model is absolutely micro-thread in nature and not state 
> > > > machine/event.
> > > 
> > > Evgeniy, i'm not sure how many different ways to tell this to you, but 
> > > you are not listening, you are not learning and you are still not 
> > > getting it at all.
> > > 
> > > The scheduler /IS/ a generic work/event queue. And it's pretty damn 
> > > fast. No amount of badmouthing will change that basic fact. Not exactly 
> > > as fast as a special-purpose queueing system (for all the reasons i 
> > > outlined to you, and which you ignored), but it gets pretty damn close 
> > > even for the web workload /you/ identified, and offers a user-space 
> > > programming model that is about 1000 times more useful than 
> > > state-machines.
> > 
> > Meanwhile, on the practical side:
> > via epia kevent/epoll/threadlet:
> > 
> > client: ab -c500 -n5000 $url
> > 
> > kevent:		849.72
> > epoll:		538.16
> > threadlet:
> >  gcc ./evserver_epoll_threadlet.c -o ./evserver_epoll_threadlet
> >  In file included from ./evserver_epoll_threadlet.c:30:
> >  ./threadlet.h: In function ‘threadlet_exec’:
> >  ./threadlet.h:46: error: can't find a register in class ‘GENERAL_REGS’
> >  while reloading ‘asm’
> > 
> > That particular asm optimization fails to compile.
> 
> it's not really an asm optimization but an API glue. I'm using:
> 
>  gcc -O2 -g -Wall -o evserver_epoll_threadlet evserver_epoll_threadlet.c -fomit-frame-pointer
> 
> does that work for you?

Yes, -fomit-frame-pointer did the trick.

On average, the threadlet runs as fast as epoll,
simply because there is _no_ rescheduling in that case.

I added a printk into __async_schedule() and started
ab -c7000 -n20000 http://192.168.4.80/ 
against the slow via epia box, and got only one of them.
The client is the latest
../async-test-v4/evserver_epoll_threadlet


Btw, I have to admit that I have a totally broken kevent tree there - it
does not work on that machine under higher loads, so I'm investigating
that problem now.

> 	Ingo

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 19:54                 ` Ingo Molnar
@ 2007-02-27 10:28                   ` Evgeniy Polyakov
  2007-02-27 11:52                     ` Theodore Tso
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 10:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 08:54:16PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > > Reading from the disk is _exactly_ the same - the same waiting for 
> > > buffer_heads/pages, and (since it is bigger) it can be easily 
> > > transferred to event driven model. Ugh, wait, it not only _can_ be 
> > > transferred, it is already done in kevent AIO, and it shows faster 
> > > speeds (though I only tested sending them over the net).
> > 
> > It would be absolutely horrible to program for. Try anything more 
> > complex than read/write (which is the simplest case, but even that is 
> > nasty).
> 
> note that even for something as 'simple and straightforward' as TCP 
> sockets, the 25-50 lines of evserver code i worked on today had 3 
> separate bugs, is known to be fundamentally incorrect and one of the 
> bugs (the lost event problem) showed up as a subtle epoll performance 
> problem and it took me more than an hour to track down. And that matches 
> my Tux experience as well: event based models are horribly hard to debug 
> BECAUSE there is /no procedural state associated with requests/. Hence 
> there is no real /proof of progress/. Not much to use for debugging - 
> except huge logs of execution, which, if one is unlucky (which i often 
> was with Tux) would just make the problem go away.

I'm glad you found a bug in my epoll processing code (which does not
exist in the kevent part, though) - it took more than a year after
kevent's introduction to get someone to look into it.

Obviously there are bugs; that is simply how things work.
And debugging state-machine code has exactly the same complexity as
debugging multi-threaded code - if not less...

> Furthermore, with a 'retry' model, who guarantees that the retry wont be 
> an infinite retry where none of the retries ever progresses the state of 
> the system enough to return the data we are interested in? The moment we 
> have to /retry/, depending on the 'depth' of how deep the retry kicked 
> in, we've got to reach that 'depth' of code again and execute it.
> 
> plus, 'there is not much state' is not even completely true to begin 
> with, even in the most simple, TCP socket case! There /is/ quite a bit 
> of state constructed on the kernel stack: user parameters have been 
> evaluated/converted, the socket has been looked up, its state has been 
> validated, etc. With a 'retry' model - but even with a pure 'event 
> queueing' model we redo all those things /both/ at request submission 
> and at event generation time, again and again - while with a synchronous 
> syscall you do it just once and upon event completion a piece of that 
> data is already on the kernel stack. I'd much rather spend time and 
> effort on simplifying the scheduler and reducing the cache footprint of 
> the kernel thread context switch path, etc., to make it more useful even 
> in more extreme, highly prallel '100% context-switching' case, because i 
> have first-hand experience about how fragile and inflexible event based 
> servers are. I do think that event interfaces for raw, external physical 
> events make sense in some circumstances, but for any more complex 
> 'derived' event type it's less and less clear whether we want a direct 
> interface to it. For something like the VFS it's outright horrible to 
> even think about.

Ingo, you paint the picture too black.
No one will create a purely event-based model for socket or VFS
processing - an event is the completion of a request: data is in the
buffer, data was sent, and so on. It is also possible to add events in
the middle of the processing, especially if the operation can be
logically separated - like sendfile.

> 	Ingo

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-27  2:18                                   ` Davide Libenzi
@ 2007-02-27 10:13                                     ` Evgeniy Polyakov
  2007-02-27 16:01                                       ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 10:13 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 06:18:51PM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
> 
> > 2. its notifications do not go through the second loop, i.e. it is O(1),
> > not O(ready_num), and notifications happens directly from internals of
> > the appropriate subsystem, which does not require special wakeup
> > (although it can be done too).
> 
> Sorry if I do not read kevent code correctly, but in kevent_user_wait() 
> there is a:
> 
>     while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
>         ...
>     }
> 
> loop, that make it O(ready_num). From a mathematical standpoint, they're 
> both O(ready_num), but epoll is doing three passes over the ready set.
> I always though that if the number of ready events is so big that the more 
> passes over the ready set becomes relevant, probably the "work" done by 
> userspace for each fetched event would make the extra cost irrelevant.
> But that can be fixed by a patch that will follow on lkml ...
 
No, kevent_dequeue_ready() copies data to userspace, that is it.
So it looks roughly like the following:

storage is ready: -> kevent_requeue() - ends up adding the event to the
end of the queue (list add under spinlock)

kevent_wait() -> copy first, second, ...

The kevent poll model (as well as epoll) requires an _additional_ check
in userspace context before the event is copied, so we end up checking
the full ready queue again - that is what I referred to as O(ready_num);
the O() covers the price of copying to userspace, list_add and so on.
 
> - Davide
> 

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 20:35                                   ` Ingo Molnar
  2007-02-26 22:06                                     ` Bill Huey
@ 2007-02-27 10:09                                     ` Evgeniy Polyakov
  2007-02-27 17:13                                     ` Pavel Machek
  2 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 10:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 09:35:43PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > If kernelspace rescheduling is that fast, then please explain me why 
> > userspace one always beats kernel/userspace?
> 
> because 'user space scheduling' makes no sense? I explained my thinking 
> about that in a past mail:
> 
> -------------------------->
> One often repeated (because pretty much only) performance advantage of 
> 'light threads' is context-switch performance between user-space 
> threads. But reality is, nobody /cares/ about being able to 
> context-switch between "light user-space threads"! Why? Because there 
> are only two reasons why such a high-performance context-switch would 
> occur:
> 
>  1) there's contention between those two tasks. Wonderful: now two
>     artificial threads are running on the /same/ CPU and they are even
>     contending each other. Why not run a single context on a single CPU
>     instead and only get contended if /another/ CPU runs a conflicting
>     context?? While this makes for nice "pthread locking benchmarks",
>     it is not really useful for anything real.

Obviously there must be several real threads, each of which can
reschedule userspace threads, because that is fast and scalable.
So: one CPU - one real thread.

>  2) there has been an IO event. The thing is, for IO events we enter the
>     kernel no matter what - and we'll do so for the next 10 years at
>     minimum. We want to abstract away the hardware, we want to do
>     reliable resource accounting, we want to share hardware resources,
>     we want to rate-limit, etc., etc. While in /theory/ you could handle
>     IO purely from user-space, in practice you dont want to do that. And
>     if we accept the premise that we'll enter the kernel anyway, there's
>     zero performance difference between scheduling right there in the
>     kernel, or returning back to user-space to schedule there. (in fact
>     i submit that the former is faster). Or if we accept the theoretical
>     possibility of 'perfect IO hardware' that implements /all/ the
>     features that the kernel wants (in a secure and generic way, and
>     mind you, such IO hardware does not exist yet), then /at most/ the
>     performance advantage of user-space doing the scheduling is the
>     overhead of a null syscall entry. Which is a whopping 100 nsecs on
>     modern CPUs! That's roughly the latency of a /single/ DRAM access!
> ....

And here we start our discussion again from the beginning :)
We entered the kernel, of course, but then the kernel decides to move the
thread away and put another one in its place under the hardware sun - so
the kernel starts to move registers, change some state and so on.

While in the case of userspace threads we have already returned to
userspace (having paid the overhead described above) and start doing the
new task - move registers, change some state and so on.

And somehow it happens that doing it in userspace (for example with
setjmp/longjmp) is faster than in the kernel. I do not know why - I never
investigated that case, but that is it.

> <-----------
> 
> (see http://lwn.net/Articles/219958/)
> 
> btw., the words that follow that section are quite interesting in 
> retrospect:
> 
> | Furthermore, 'light thread' concepts can no way approach the 
> | performance of #2 state-machines: if you /know/ what the structure of 
> | your context is, and you can program it in a specialized state-machine 
> | way, there's just so many shortcuts possible that it's not even funny.
> 
> [ oops! ;-) ]
> 
> i severely under-estimated the kind of performance one can reach even 
> with pure procedural concepts. Btw., when i wrote this mail was when i 
> started thinking about "is it really true that the only way to get good 
> performance are 100% event-based servers and nonblocking designs?", and 
> started coding syslets and then threadlets.

:-)

Threads happen to be really fast and easy to program, but maximum
performance still cannot be achieved with them, simply because event
handling is faster - which costs less, reading one cacheline from the
ring or queue, or rescheduling threads?

Anyway, what are we talking about, Ingo?
I understand your point, I also understand that you are not going to
change it (yes, that's it, but I need to confirm that I'm guilty too - I
doubt I will ever think that thousands of threads doing IO is a win), so
we can close the discussion at this point.

My main point in participating in it was the kevent introduction - although
I created kevent AIO as a state machine, I never wanted to promote
kevent as AIO - kevent is an event processing mechanism; one of its usage
models is AIO, others are signals, file events, timers, whatever...
You have drawn a line - kevent is not needed - that is your point.
I wanted to hear definitive words half a year ago, but the community kept
silent. Eventually things are done.

Thanks.

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 20:04                       ` Linus Torvalds
@ 2007-02-27  8:09                         ` Evgeniy Polyakov
  0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27  8:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 12:04:51PM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
> > 
> > Will you argue that people do things like
> > num = epoll_wait()
> > for (i=0; i<num; ++i) {
> > 	process(event[i]);
> > }
> 
> I have several times told you that I argue for a *combination* of 
> event-based interfaces and thread-like code. And that the choice depends 
> on which is more natural. Sometimes you might have just one or the other. 
> Sometimes you have both. And sometimes you have neither (although, 
> strictly speaking, normal single-threaded code is certainly "thread-like" 
> - it's a serial execution, the same way threadlets are serial executions - 
> it's just not running in parallel with anything else).
> 
> > Will you spawn thread per IO?
> 
> Depending on what the IO is, yes. 
> 
> Is that _really_ so hard to understand? There is no "yes" or "no". There's 
> a "depends on what the problem is, and what the solution looks like".
> 
> Sometimes the best way to do parallelism is through explicit threads. 
> Sometimes it is through whatever "threadlets" or other that gets out of 
> this whole development discussion. Sometimes it's an event loop.
> 
> So get over it. The world is not a black and white, either-or kind of 
> place. It's full of grayscales, and colors, and mixing things 
> appropriately. And the choices are often done on whims and on whatever 
> feels most comfortable to the person doing the choice. Not on some strict 
> "you must always use things in an event-driven main loop" or "you must 
> always use threads for parallelism".
> 
> The world is simply _richer_ than that kind of either-or thing.

It seems you like to repeat that white is white and is not black :)
Did you see what I wrote in my previous e-mail to you?

No, you did not.
I wrote that both models should co-exist, and the boundary between the two
is not clear, but for the described workloads IMO the event-driven model
provides far higher performance.

That is it, Linus - no one wants to say that you are saying something
stupid - just read what others write to you.

Thanks.


> 		Linus

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 16:46                                     ` Evgeniy Polyakov
@ 2007-02-27  6:24                                       ` Ingo Molnar
  2007-02-27 10:41                                         ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-27  6:24 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Mon, Feb 26, 2007 at 01:51:23PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > 
> > * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > 
> > > Even having main dispatcher as epoll/kevent loop, the _whole_ 
> > > threadlet model is absolutely micro-thread in nature and not state 
> > > machine/event.
> > 
> > Evgeniy, i'm not sure how many different ways to tell this to you, but 
> > you are not listening, you are not learning and you are still not 
> > getting it at all.
> > 
> > The scheduler /IS/ a generic work/event queue. And it's pretty damn 
> > fast. No amount of badmouthing will change that basic fact. Not exactly 
> > as fast as a special-purpose queueing system (for all the reasons i 
> > outlined to you, and which you ignored), but it gets pretty damn close 
> > even for the web workload /you/ identified, and offers a user-space 
> > programming model that is about 1000 times more useful than 
> > state-machines.
> 
> Meanwhile on the practical side:
> VIA EPIA, kevent/epoll/threadlet:
> 
> client: ab -c500 -n5000 $url
> 
> kevent:		849.72
> epoll:		538.16
> threadlet:
>  gcc ./evserver_epoll_threadlet.c -o ./evserver_epoll_threadlet
>  In file included from ./evserver_epoll_threadlet.c:30:
>  ./threadlet.h: In function ‘threadlet_exec’:
>  ./threadlet.h:46: error: can't find a register in class ‘GENERAL_REGS’
>  while reloading ‘asm’
> 
> That particular asm optimization fails to compile.

it's not really an asm optimization but API glue. I'm using:

 gcc -O2 -g -Wall -o evserver_epoll_threadlet evserver_epoll_threadlet.c -fomit-frame-pointer

does that work for you?

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 16:55                                 ` Evgeniy Polyakov
  2007-02-26 20:35                                   ` Ingo Molnar
@ 2007-02-27  2:18                                   ` Davide Libenzi
  2007-02-27 10:13                                     ` Evgeniy Polyakov
  1 sibling, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-27  2:18 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:

> 2. its notifications do not go through the second loop, i.e. it is O(1),
> not O(ready_num), and notifications happens directly from internals of
> the appropriate subsystem, which does not require special wakeup
> (although it can be done too).

Sorry if I do not read kevent code correctly, but in kevent_user_wait() 
there is a:

    while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
        ...
    }

loop that makes it O(ready_num). From a mathematical standpoint, they're 
both O(ready_num), but epoll is doing three passes over the ready set.
I always thought that if the number of ready events is so big that the 
extra passes over the ready set become relevant, probably the "work" done 
by userspace for each fetched event would make the extra cost irrelevant.
But that can be fixed by a patch that will follow on lkml ...



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 20:35                                   ` Ingo Molnar
@ 2007-02-26 22:06                                     ` Bill Huey
  2007-02-27 10:09                                     ` Evgeniy Polyakov
  2007-02-27 17:13                                     ` Pavel Machek
  2 siblings, 0 replies; 277+ messages in thread
From: Bill Huey @ 2007-02-26 22:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner, Bill Huey (hui)

On Mon, Feb 26, 2007 at 09:35:43PM +0100, Ingo Molnar wrote:
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > If kernelspace rescheduling is that fast, then please explain me why 
> > userspace one always beats kernel/userspace?
> 
> because 'user space scheduling' makes no sense? I explained my thinking 
> about that in a past mail:
> 
> -------------------------->
> One often repeated (because pretty much only) performance advantage of 
> 'light threads' is context-switch performance between user-space 
> threads. But reality is, nobody /cares/ about being able to 
> context-switch between "light user-space threads"! Why? Because there 
> are only two reasons why such a high-performance context-switch would 
> occur:
... 
>  2) there has been an IO event. The thing is, for IO events we enter the
>     kernel no matter what - and we'll do so for the next 10 years at
>     minimum. We want to abstract away the hardware, we want to do
>     reliable resource accounting, we want to share hardware resources,
>     we want to rate-limit, etc., etc. While in /theory/ you could handle
>     IO purely from user-space, in practice you dont want to do that. And
>     if we accept the premise that we'll enter the kernel anyway, there's
>     zero performance difference between scheduling right there in the
>     kernel, or returning back to user-space to schedule there. (in fact
>     i submit that the former is faster). Or if we accept the theoretical
>     possibility of 'perfect IO hardware' that implements /all/ the
>     features that the kernel wants (in a secure and generic way, and
>     mind you, such IO hardware does not exist yet), then /at most/ the
>     performance advantage of user-space doing the scheduling is the
>     overhead of a null syscall entry. Which is a whopping 100 nsecs on
>     modern CPUs! That's roughly the latency of a /single/ DRAM access!

Ingo and Evgeniy,

I was trying to avoid getting into this discussion, but whatever. M:N
threading systems also require just about all of the threading semantics
that are inside the kernel to be available in userspace. Implementations
of the userspace scheduler side of things must be able to turn off
preemption to do per-CPU local storage, report blocking/preemption
(via upcall or a mailbox) and do other scheduler-ish things in a reliable
way, so the complexity of a system like that ends up not being worth it -
it is often monstrously large to implement and debug. That's why
Solaris 10 removed their scheduler activations framework and went with
1:1 like in Linux, since the scheduler activations model is so difficult
to control. The slowness of the futex stuff might be compounded by some
VM mapping issues that Bill Irwin and Peter Zijlstra have pointed out in
the past, if I understand correctly.

Bryan Cantrill of Solaris 10/dtrace fame can comment on that if you ask
him sometime.

For an exercise, think about all of the things you need to either migrate
a task or to do a cross-CPU wake of one. It goes to hell in complexity
really quickly. Erlang and other language-based concurrency systems get
their regularities by indirectly oversimplifying what threading is from
what kernel folks are used to. Try doing a cross-CPU wake quickly in a
system like that - good luck. Now think about how to do an IPI in
userspace? Good luck.

That's all :)

bill


^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 16:55                                 ` Evgeniy Polyakov
@ 2007-02-26 20:35                                   ` Ingo Molnar
  2007-02-26 22:06                                     ` Bill Huey
                                                       ` (2 more replies)
  2007-02-27  2:18                                   ` Davide Libenzi
  1 sibling, 3 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 20:35 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> If kernelspace rescheduling is that fast, then please explain me why 
> userspace one always beats kernel/userspace?

because 'user space scheduling' makes no sense? I explained my thinking 
about that in a past mail:

-------------------------->
One often repeated (because pretty much only) performance advantage of 
'light threads' is context-switch performance between user-space 
threads. But reality is, nobody /cares/ about being able to 
context-switch between "light user-space threads"! Why? Because there 
are only two reasons why such a high-performance context-switch would 
occur:

 1) there's contention between those two tasks. Wonderful: now two
    artificial threads are running on the /same/ CPU and they are even
    contending each other. Why not run a single context on a single CPU
    instead and only get contended if /another/ CPU runs a conflicting
    context?? While this makes for nice "pthread locking benchmarks",
    it is not really useful for anything real.

 2) there has been an IO event. The thing is, for IO events we enter the
    kernel no matter what - and we'll do so for the next 10 years at
    minimum. We want to abstract away the hardware, we want to do
    reliable resource accounting, we want to share hardware resources,
    we want to rate-limit, etc., etc. While in /theory/ you could handle
    IO purely from user-space, in practice you dont want to do that. And
    if we accept the premise that we'll enter the kernel anyway, there's
    zero performance difference between scheduling right there in the
    kernel, or returning back to user-space to schedule there. (in fact
    i submit that the former is faster). Or if we accept the theoretical
    possibility of 'perfect IO hardware' that implements /all/ the
    features that the kernel wants (in a secure and generic way, and
    mind you, such IO hardware does not exist yet), then /at most/ the
    performance advantage of user-space doing the scheduling is the
    overhead of a null syscall entry. Which is a whopping 100 nsecs on
    modern CPUs! That's roughly the latency of a /single/ DRAM access!
....
<-----------

(see http://lwn.net/Articles/219958/)

btw., the words that follow that section are quite interesting in 
retrospect:

| Furthermore, 'light thread' concepts can no way approach the 
| performance of #2 state-machines: if you /know/ what the structure of 
| your context is, and you can program it in a specialized state-machine 
| way, there's just so many shortcuts possible that it's not even funny.

[ oops! ;-) ]

i severely under-estimated the kind of performance one can reach even 
with pure procedural concepts. Btw., when i wrote this mail was when i 
started thinking about "is it really true that the only way to get good 
performance are 100% event-based servers and nonblocking designs?", and 
started coding syslets and then threadlets.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 19:30                     ` Evgeniy Polyakov
@ 2007-02-26 20:04                       ` Linus Torvalds
  2007-02-27  8:09                         ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Linus Torvalds @ 2007-02-26 20:04 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner



On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
> 
> Will you argue that people do things like
> num = epoll_wait()
> for (i=0; i<num; ++i) {
> 	process(event[i]);
> }

I have several times told you that I argue for a *combination* of 
event-based interfaces and thread-like code. And that the choice depends 
on which is more natural. Sometimes you might have just one or the other. 
Sometimes you have both. And sometimes you have neither (although, 
strictly speaking, normal single-threaded code is certainly "thread-like" 
- it's a serial execution, the same way threadlets are serial executions - 
it's just not running in parallel with anything else).

> Will you spawn thread per IO?

Depending on what the IO is, yes. 

Is that _really_ so hard to understand? There is no "yes" or "no". There's 
a "depends on what the problem is, and what the solution looks like".

Sometimes the best way to do parallelism is through explicit threads. 
Sometimes it is through whatever "threadlets" or other that gets out of 
this whole development discussion. Sometimes it's an event loop.

So get over it. The world is not a black and white, either-or kind of 
place. It's full of grayscales, and colors, and mixing things 
appropriately. And the choices are often done on whims and on whatever 
feels most comfortable to the person doing the choice. Not on some strict 
"you must always use things in an event-driven main loop" or "you must 
always use threads for parallelism".

The world is simply _richer_ than that kind of either-or thing.

		Linus

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 10:31                             ` Ingo Molnar
  2007-02-26 10:43                               ` Evgeniy Polyakov
@ 2007-02-26 20:02                               ` Davide Libenzi
  1 sibling, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-26 20:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Mon, 26 Feb 2007, Ingo Molnar wrote:

> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > please also try evserver_epoll_threadlet.c that i've attached below - 
> > it uses epoll as the main event mechanism but does threadlets for 
> > request handling.
> 
> find updated code below - your evserver_epoll.c spuriously missed event 
> edges - so i changed it back to level-triggered. While that is not as 
> fast as edge-triggered, it does not result in spurious hangs and 
> workflow 'hiccups' during the test.
> 
> Could this be the reason why in your testing kevents outperformed epoll?

This is how I handle a read (write/accept/connect, same thing) inside 
coronet (coroutine+epoll async library - http://www.xmailserver.org/coronet-lib.html ).


static int conet_read_ll(struct sk_conn *conn, char *buf, int nbyte) {
        int n;
        
        while ((n = read(conn->sfd, buf, nbyte)) < 0) {
                if (errno == EINTR)
                        continue;
                if (errno != EAGAIN && errno != EWOULDBLOCK)
                        return -1;
                if (!(conn->events & EPOLLIN)) {
                        conn->events = EPOLLIN;
                        if (conet_mod_conn(conn, conn->events) < 0)
                                return -1;
                }
                if (conet_yield(conn) < 0)
                        return -1;
        }
                        
        return n;
}

I use EPOLLET, and you don't change the interest set until you actually 
get an EAGAIN. *Many* read/write mode changes in the usage will simply 
happen without an epoll_ctl() being needed. The conet_mod_conn() function 
does the actual epoll_ctl() and adds EPOLLET to the specified event set. 
The conet_yield() function ends up calling libpcl's co_resume(), which is 
basically a switch-to-next-coroutine-until-fd-becomes-ready (it maps 
directly to a swapcontext).
That cuts 50+% of the epoll_ctl(EPOLL_CTL_MOD) calls.




- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 17:57               ` Linus Torvalds
  2007-02-26 18:32                 ` Evgeniy Polyakov
@ 2007-02-26 19:54                 ` Ingo Molnar
  2007-02-27 10:28                   ` Evgeniy Polyakov
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 19:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Reading from the disk is _exactly_ the same - the same waiting for 
> > buffer_heads/pages, and (since it is bigger) it can be easily 
> > transferred to event driven model. Ugh, wait, it not only _can_ be 
> > transferred, it is already done in kevent AIO, and it shows faster 
> > speeds (though I only tested sending them over the net).
> 
> It would be absolutely horrible to program for. Try anything more 
> complex than read/write (which is the simplest case, but even that is 
> nasty).

note that even for something as 'simple and straightforward' as TCP 
sockets, the 25-50 lines of evserver code i worked on today had 3 
separate bugs, is known to be fundamentally incorrect, and one of the 
bugs (the lost event problem) showed up as a subtle epoll performance 
problem that took me more than an hour to track down. And that matches 
my Tux experience as well: event-based models are horribly hard to debug 
BECAUSE there is /no procedural state associated with requests/. Hence 
there is no real /proof of progress/. Not much to use for debugging - 
except huge logs of execution which, if one is unlucky (which i often 
was with Tux), would just make the problem go away.

Furthermore, with a 'retry' model, who guarantees that the retry won't be 
an infinite retry where none of the retries ever progresses the state of 
the system enough to return the data we are interested in? The moment we 
have to /retry/, depending on the 'depth' at which the retry kicked 
in, we've got to reach that 'depth' of code again and execute it.

Plus, 'there is not much state' is not even completely true to begin 
with, even in the most simple TCP socket case! There /is/ quite a bit 
of state constructed on the kernel stack: user parameters have been 
evaluated/converted, the socket has been looked up, its state has been 
validated, etc. With a 'retry' model - and even with a pure 'event 
queueing' model - we redo all those things /both/ at request submission 
and at event generation time, again and again - while with a synchronous 
syscall you do it just once, and upon event completion a piece of that 
data is already on the kernel stack. I'd much rather spend time and 
effort on simplifying the scheduler and reducing the cache footprint of 
the kernel thread context-switch path, etc., to make it more useful even 
in the more extreme, highly parallel '100% context-switching' case, 
because i have first-hand experience of how fragile and inflexible 
event-based servers are. I do think that event interfaces for raw, 
external physical events make sense in some circumstances, but for any 
more complex 'derived' event type it's less and less clear whether we 
want a direct interface to it. For something like the VFS it's outright 
horrible to even think about.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 19:42                         ` Evgeniy Polyakov
  2007-02-25 20:38                           ` Ingo Molnar
  2007-02-26 12:39                           ` Ingo Molnar
@ 2007-02-26 19:47                           ` Davide Libenzi
  2 siblings, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-26 19:47 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Sun, 25 Feb 2007, Evgeniy Polyakov wrote:

> Why is userspace rescheduling on the order of tens of times faster than
> kernel/user?

About 50 times on my Opteron 254, actually. That's libpcl's (swapcontext 
based) cobench against lat_ctx.



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 19:22                   ` Linus Torvalds
@ 2007-02-26 19:30                     ` Evgeniy Polyakov
  2007-02-26 20:04                       ` Linus Torvalds
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 19:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 11:22:46AM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> See? Stop blathering about how everything is an event. THAT'S NOT 
> RELEVANT. I've told you a hundred times - they may be "logically 
> equivalent", but that doesn't change ANYTHING. Event-based programming 
> simply isn't suitable for 99% of all stuff, and for the 1% where it *is* 
> suitable, it actually tends to be a very specific subset of the code that 
> you actually use events for (ie accept and read/write on pure streams).

Will you argue that people do things like
num = epoll_wait();
for (i=0; i<num; ++i) {
	process(event[i]);
}

Will you spawn a thread per IO?

Stop writing the same thing again and again - I perfectly understand that
not everything can be easily covered by events, but covering everything
with threads is an even more stupid idea.

High-performance IO requires as small an overhead as possible; dispatching
events from a per-CPU ring buffer or queue is the smallest one, not
spawning a thread per read.

> 		Linus

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 18:32                 ` Evgeniy Polyakov
@ 2007-02-26 19:22                   ` Linus Torvalds
  2007-02-26 19:30                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Linus Torvalds @ 2007-02-26 19:22 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner



On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
> 
> I want to say that read() consists of tons of events, but the programmer
> needs only one - data is ready in the requested buffer. The programmer might
> not even know what the object behind the provided file descriptor is.
> One only wants the data in the buffer.

You're not following the discussion.

First off, I already *told* you that "read()" is the absolute simplest 
case, and yes, we could make it an event if you also passed in the "which 
range of the file do we care about" information (we could consider it 
"f_pos", which the kernel already knows about, but that doesn't handle 
pread()/pwrite(), so it's not very good for many cases).

But that's not THE ISSUE.

The issue is that it's a horrible interface from a user's standpoint.  
It's a lot better to program certain things as a thread. Why do you argue 
against that, when it is just obviously true?

There's a reason that people write code that is functional, rather than 
write code as a state machine.

We simply don't write code like

	for (;;) {
		switch (state) {
		case Open:
			fd = open();
			if (fd < 0)
				break;
			state = Stat;
		case Stat:
			if (fstat(fd, &stat) < 0)
				break;
			state = Read;
		case Read:
			count = read(fd, buf + pos, size - pos);
			if (count < 0)
				break;
			pos += count;
			if (!count || pos == size)
				state = Close;
			continue;
		case Close:
			if (close(fd) < 0)
				break;
			state = Done;
			return 0;
		}
		/* Returning 1 means wait in the event loop .. */
		if (errno == EAGAIN || errno == EINTR)
			return 1;
		/* Else we had a real error */
		state = Error;
		return -1;
	}

and instead we write it as

	fd = open(..);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	.. 

and if you cannot see the *reason* why people don't use event-based 
programming for everything, I don't see the point of continuing this 
discussion.

See? Stop blathering about how everything is an event. THAT'S NOT 
RELEVANT. I've told you a hundred times - they may be "logically 
equivalent", but that doesn't change ANYTHING. Event-based programming 
simply isn't suitable for 99% of all stuff, and for the 1% where it *is* 
suitable, it actually tends to be a very specific subset of the code that 
you actually use events for (ie accept and read/write on pure streams).

		Linus

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 18:56                     ` Chris Friesen
@ 2007-02-26 19:20                       ` Evgeniy Polyakov
  0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 19:20 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Arjan van de Ven, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 12:56:33PM -0600, Chris Friesen (cfriesen@nortel.com) wrote:
> Evgeniy Polyakov wrote:
> 
> >I never ever tried to say _everything_ must be driven by events.
> >IO must be driven, it is a must IMO.
> 
> Do you disagree with Linus' post about the difficulty of treating 
> open(), fstat(), page faults, etc. as events?  Or do you not consider 
> them to be IO?

From a practical point of view - yes, some of those processes are complex
enough not to attract attention as an async usage model.

But I'm absolutely for the scenario where several operations are
performed asynchronously, like open+stat+fadvise+sendfile.

By IO I meant something which has an end result, and that result must be
enough to start async processing - data in the buffer, for example.
Async open I would combine with the actual data processing - those can
be one event.

> Chris

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 18:38                   ` Evgeniy Polyakov
@ 2007-02-26 18:56                     ` Chris Friesen
  2007-02-26 19:20                       ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Chris Friesen @ 2007-02-26 18:56 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Arjan van de Ven, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
	linux-kernel, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

Evgeniy Polyakov wrote:

> I never ever tried to say _everything_ must be driven by events.
> IO must be driven, it is a must IMO.

Do you disagree with Linus' post about the difficulty of treating 
open(), fstat(), page faults, etc. as events?  Or do you not consider 
them to be IO?

Chris

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 18:19                 ` Arjan van de Ven
@ 2007-02-26 18:38                   ` Evgeniy Polyakov
  2007-02-26 18:56                     ` Chris Friesen
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 18:38 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Linus Torvalds, Ulrich Drepper, linux-kernel,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 10:19:03AM -0800, Arjan van de Ven (arjan@infradead.org) wrote:
> On Mon, 2007-02-26 at 20:37 +0300, Evgeniy Polyakov wrote:
> 
> > I tend to agree.
> > Yes, some loads require event driven model, other can be done using
> > threads.
> 
> event driven model is really complex though. For event driven to work
> well you basically can't tolerate blocking calls at all ...
> open() blocks on files. read() may block on metadata reads (say indirect
> blockgroups) not just on datareads etc etc. It gets really hairy really
> quick that way.

I never ever tried to say _everything_ must be driven by events.
IO must be driven, it is a must IMO.
Some (and a lot of) things can easily be programmed by threads.
Threads are essentially for those tasks which cannot be easily
programmed with events.
But keep in mind that thread execution _is_ an event itself:
its completion is an event.

> -- 
> if you want to mail me at work (you don't), use arjan (at) linux.intel.com
> Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 17:57               ` Linus Torvalds
@ 2007-02-26 18:32                 ` Evgeniy Polyakov
  2007-02-26 19:22                   ` Linus Torvalds
  2007-02-26 19:54                 ` Ingo Molnar
  1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 18:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 09:57:00AM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> Similarly, even for a simple "read()" on a filesystem, there is no way to 
> just say "block until data is available" like there is for a socket, 
> because on a filesystem, the data may be available, BUT AT THE WRONG 
> OFFSET. So while a socket or a pipe are both simple "streaming interfaces" 
> as far as a "read()" is concerned, a file access is *not* a simple 
> streaming interface.
> 
> Notice? For a read()/recvmsg() call on a socket or a pipe, there is no 
> "position" involved. The "event" is clear: it's always the head of the 
> streaming interface that is relevant, and the event is "is there room" or 
> "is there data". It's an event-based thing.
> 
> But for a read() on a file, it's no longer a streaming interface, and 
> there is no longer a simple "is there data" event. You'd have to make the 
> event be a much more complex "is there data at position X through Y" kind 
> of thing.
> 
> And "read()" on a filesystem is the _simple_ case. Sure, we could add 
> support for those kinds of ranges, and have an event interface for that. 
> But the "open a filename" is much more complicated, and doesn't even have 
> a file descriptor available to it (since we're trying to _create_ one), so 
> you'd have to do something even more complex to have the event "that 
> filename can now be opened without blocking".
> 
> See? Even if you could make those kinds of events, it would be absolutely 
> HORRIBLE to code for. And it would suck horribly performance-wise for most 
> code too.

I think I see the main trouble in our discussion.

I never ever tried to say that every bit in read() should be
non-blocking - that would be even more stupid than you expect it to be.
But you and Ingo do not want to see what I'm trying to say, because it
is cozier to read one's own ideas than someone else's.

I want to say that read() consists of tons of events, but the programmer
needs only one - data is ready in the requested buffer. The programmer
might not even know what object is behind the provided file descriptor.
One only wants data in the buffer.

That is an event.

It is also possible to have other events inside the complex read()
machinery - for example waiting for a page obtained via ->readpages().

> THAT is what I'm saying. There's a *difference* between event-based and 
> thread-based programming. It makes no sense to try to turn one into the 
> other. But it often makes sense to *combine* the two approaches.

Threads are simple just because there is an interface.
Change the interface, and no one will ever want to use them.

Provide a nice aio_read() (forget about POSIX) and everyone will use it.

I always said that combining the two models is a must, I fully agree,
but it seems that we still do not agree on where the boundary should be
drawn.

> > Userspace wants to open a file, so it needs some file-related (inode,
> > direntry and others) structures in the mem, they should be read from
> > disk. Eventually it will be reading some blocks from the disk 
> > (for example ext3_lookup->ext3_find_entry->ext3_getblk/ll_rw_block) and
> > we will wait for them (wait_on_bit()) - we will wait for event.
> > 
> > But I agree, it was a brainfscking example, but nevertheless, it can be
> > easily done using event driven model.
> > 
> > Reading from the disk is _exactly_ the same - the same waiting for
> > buffer_heads/pages, and (since it is bigger) it can be easily
> > transferred to event driven model.
> > Ugh, wait, it not only _can_ be transferred, it is already done in
> > kevent AIO, and it shows faster speeds (though I only tested sending
> > them over the net).
> 
> It would be absolutely horrible to program for. Try anything more complex 
> than read/write (which is the simplest case, but even that is nasty).
> 
> Try imagining yourself in the shoes of a database server (or just about 
> anything else). Imagine what kind of code you want to write. You probably 
> do *not* want to have everything be one big event loop, and having to make 
> different "states" for "I'm trying to open the file", "I opened the file, 
> am now doing 'fstat()' to figure out how big it is", "I'm now reading the 
> file and have read X bytes of the total Y bytes I want to read", "I took a 
> page fault in the middle" etc etc.
> 
> I pretty much can *guarantee* you that you'll never see anybody do that. 
> Page faults in user space are particularly hard to handle in a state 
> machine, since they basically require saving the whole thread state, as 
> they can happen on any random access. So yeah, you could do them as a 
> state machine, but in reality it would just become a "user-level thread 
> library" in the end, just to handle those.
>
> In contrast, if you start using thread-like programming to begin with, you 
> have none of those issues. Sure, some thread may block because you got a 
> page fault, or because an inode needed to be brought into memory, but from 
> a user-level programming interface standpoint, the thread library just 
> takes care of the "state machine" on its own, so it's much simpler, and in 
> the end more efficient.
> 
> And *THAT* is what I'm trying to say. Some simple obvious events are 
> better handled and seen as "events" in user space. But other things are so 
> intertwined, and have basically random state associated with them, that 
> they are better seen as threads.

With threads you still need to do exactly the same, since stat() cannot
be done before open(). So you still maintain some state (actually an
ordering of steps which will not change) - the function body encodes
exactly that.
Then you start execution - and that can perfectly well be done using
threads.
The real objection is only that it is inconvenient to program with
state machines when there is no appropriate interface.

But there are other operations.
Consider a database or whatever you like, which wants to read thousands
of blocks from disk; each access potentially blocks, and blocking on a
mutex is nothing compared to blocking while waiting for the storage to
be ready to provide data (network, disk, whatever). If the storage is
not optimized (or has a small cache, or something else) we end up with
blocked contexts which sleep - thousands of contexts.
And this number will not decrease.

So we end up with enormous overhead just because we do not have a
simple enough aio_read() and aio_wait().

> Yes, from a "turing machine" kind of viewpoint, the two are 100% logically 
> equivalent. But "logical equivalence" does NOT translate into "can 
> practically speaking be implemented".

You have finally said it!
"can practically speaking be implemented".

I see your and Ingo's point - kevent is a 'merde' (pardon my French,
but I am even somehow glad Ingo (almost) said that - I now have plenty
of time for other interesting projects), we want threads.
Ok, you have threads, but you need to wait on (some of) them.
So you will reinvent the wheel - some subsystem which will wait for
threads (likely waiting on futexes), then for other events (probably).
Eventually it will be the same thing from a different point of view :)

Anyway, I _do_ hope we understood each other: both events and
threads can co-exist, although the boundary was not drawn.

My point is that many things can be done using events just because they
are faster and smaller - and IO (any IO, without limit) is one of the
areas where events are unbeatable.
Threading in turn can serve as a higher-layer abstraction model - a
thread can wrap the whole transaction with all related data flows, but
not a thread per IO.

> 			Linus

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 17:37               ` Evgeniy Polyakov
@ 2007-02-26 18:19                 ` Arjan van de Ven
  2007-02-26 18:38                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Arjan van de Ven @ 2007-02-26 18:19 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Linus Torvalds, Ulrich Drepper, linux-kernel,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Mon, 2007-02-26 at 20:37 +0300, Evgeniy Polyakov wrote:

> I tend to agree.
> Yes, some loads require event driven model, other can be done using
> threads.

event driven model is really complex though. For event driven to work
well you basically can't tolerate blocking calls at all ...
open() blocks on files. read() may block on metadata reads (say indirect
blockgroups) not just on datareads etc etc. It gets really hairy really
quick that way.

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 17:28             ` Evgeniy Polyakov
@ 2007-02-26 17:57               ` Linus Torvalds
  2007-02-26 18:32                 ` Evgeniy Polyakov
  2007-02-26 19:54                 ` Ingo Molnar
  0 siblings, 2 replies; 277+ messages in thread
From: Linus Torvalds @ 2007-02-26 17:57 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner



On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
> 
> Linus, you made your point clearly - generic AIO should not be used for
> the cases, when it is supposed to block 90% of the time - only when it
> almost never blocks, like in case of buffered IO.

I don't think it's quite that simple.

EVEN *IF* it were to block 100% of the time, it depends on other things 
than just "blockingness".

For example, let's look at something like

	fd = open(filename, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	.. do something ..

and please realize that EVEN IF YOU KNOW WITH 100% CERTAINTY that the 
above open (or fstat()) is going to block, because you know that your 
working set is bigger than the available memory for caching, YOU SIMPLY 
CANNOT SANELY WRITE THAT AS AN EVENT-BASED STATE MACHINE!

It's really that simple. Some things block "in the middle". The reason 
UNIX made non-blocking reads available for networking, but not for 
filesystem accesses is not because one blocks and the other doesn't. No, 
it's really much more fundamental than that!

When you do a "recvmsg()", there is a clear event-based model: you can 
return -EAGAIN if the data simply isn't there, because a network 
connection is a simple stream of data, and there is a clear event on "ok, 
data arrived" without any state what-so-ever.

The same is simply not true for "open a file descriptor". There is no sane 
way to turn the "filename lookup blocked" into an event model with a 
select- or kevent-based interface.

Similarly, even for a simple "read()" on a filesystem, there is no way to 
just say "block until data is available" like there is for a socket, 
because on a filesystem, the data may be available, BUT AT THE WRONG 
OFFSET. So while a socket or a pipe are both simple "streaming interfaces" 
as far as a "read()" is concerned, a file access is *not* a simple 
streaming interface.

Notice? For a read()/recvmsg() call on a socket or a pipe, there is no 
"position" involved. The "event" is clear: it's always the head of the 
streaming interface that is relevant, and the event is "is there room" or 
"is there data". It's an event-based thing.

But for a read() on a file, it's no longer a streaming interface, and 
there is no longer a simple "is there data" event. You'd have to make the 
event be a much more complex "is there data at position X through Y" kind 
of thing.

And "read()" on a filesystem is the _simple_ case. Sure, we could add 
support for those kinds of ranges, and have an event interface for that. 
But the "open a filename" is much more complicated, and doesn't even have 
a file descriptor available to it (since we're trying to _create_ one), so 
you'd have to do something even more complex to have the event "that 
filename can now be opened without blocking".

See? Even if you could make those kinds of events, it would be absolutely 
HORRIBLE to code for. And it would suck horribly performance-wise for most 
code too.

THAT is what I'm saying. There's a *difference* between event-based and 
thread-based programming. It makes no sense to try to turn one into the 
other. But it often makes sense to *combine* the two approaches.

> Userspace wants to open a file, so it needs some file-related (inode,
> direntry and others) structures in the mem, they should be read from
> disk. Eventually it will be reading some blocks from the disk 
> (for example ext3_lookup->ext3_find_entry->ext3_getblk/ll_rw_block) and
> we will wait for them (wait_on_bit()) - we will wait for event.
> 
> But I agree, it was a brainfscking example, but nevertheless, it can be
> easily done using event driven model.
> 
> Reading from the disk is _exactly_ the same - the same waiting for
> buffer_heads/pages, and (since it is bigger) it can be easily
> transferred to event driven model.
> Ugh, wait, it not only _can_ be transferred, it is already done in
> kevent AIO, and it shows faster speeds (though I only tested sending
> them over the net).

It would be absolutely horrible to program for. Try anything more complex 
than read/write (which is the simplest case, but even that is nasty).

Try imagining yourself in the shoes of a database server (or just about 
anything else). Imagine what kind of code you want to write. You probably 
do *not* want to have everything be one big event loop, and having to make 
different "states" for "I'm trying to open the file", "I opened the file, 
am now doing 'fstat()' to figure out how big it is", "I'm now reading the 
file and have read X bytes of the total Y bytes I want to read", "I took a 
page fault in the middle" etc etc.

I pretty much can *guarantee* you that you'll never see anybody do that. 
Page faults in user space are particularly hard to handle in a state 
machine, since they basically require saving the whole thread state, as 
they can happen on any random access. So yeah, you could do them as a 
state machine, but in reality it would just become a "user-level thread 
library" in the end, just to handle those.

In contrast, if you start using thread-like programming to begin with, you 
have none of those issues. Sure, some thread may block because you got a 
page fault, or because an inode needed to be brought into memory, but from 
a user-level programming interface standpoint, the thread library just 
takes care of the "state machine" on its own, so it's much simpler, and in 
the end more efficient.

And *THAT* is what I'm trying to say. Some simple obvious events are 
better handled and seen as "events" in user space. But other things are so 
intertwined, and have basically random state associated with them, that 
they are better seen as threads.

Yes, from a "turing machine" kind of viewpoint, the two are 100% logically 
equivalent. But "logical equivalence" does NOT translate into "can 
practically speaking be implemented".

			Linus

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 13:11             ` Ingo Molnar
@ 2007-02-26 17:37               ` Evgeniy Polyakov
  2007-02-26 18:19                 ` Arjan van de Ven
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 17:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 02:11:33PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > > My tests show that with 4k connections per second (8k concurrency) 
> > > more than 20k connections of 80k total block in tcp_sendmsg() over 
> > > gigabit lan between quite fast machines.
> > 
> > Why do people *keep* taking this up as an issue?
> > 
> > Use select/poll/epoll/kevent/whatever for event mechanisms. STOP 
> > CLAIMING that you'd use threadlets/syslets/aio for that. It's been 
> > pointed out over and over and over again, and yet you continue to make 
> > the same mistake, Evgeniy.
> > 
> > So please read that sentence ten times, and then don't continue to 
> > make that same mistake. PLEASE.
> > 
> > Event mechanisms are *superior* for events. But they *suck* for things 
> > that aren't events, but are actual code execution with random places 
> > that can block. THE TWO THINGS ARE TOTALLY AND UTTERLY INDEPENDENT!
> 
> Note that even for something tasks are supposed to suck at, and even if 
> used in extremely stupid ways, they perform reasonably well in practice 
> ;-)
> 
> And i fully agree: specialization based on knowledge about frequency of 
> blocking will always be useful - if not /forced/ on the workflow 
> architecture and if not overdone. On the other hand, fully event-driven 
> servers based on 'nonblocking' calls, which Evgeniy is advocating and 
> which the kevent model is forcing upon userspace, is pure madness.
> 
> We very much can and should use things like epoll for events that we 
> expect to happen asynchronously 100% of the time - it just makes no 
> sense for those events to take up 4-5K of RAM apiece, when they could 
> also be only using up the 32 bytes that say a pending timer takes. I've 
> posted the code for that, how to do an 'outer' epoll loop around an 
> internal threadlep iterator. But those will always be very narrow event 
> sources, and likely wont (and shouldnt) cover 'request-internal' 
> processing.
> 
> but otherwise, there is no real difference between a task that is 
> scheduled and a request that is queued, 'other' than the size of the 
> request (the task takes 4-5K of RAM), and the register context (64-128 
> bytes on most CPUs, the loading of which is optimized to death).
> 
> Which difference can still be significant for certain workloads, so we 
> certainly dont want to prohibit specialized event interfaces and force 
> generic threads on everything. But for anything that isnt a raw and 
> natural external event source (time, network, disk, user-generated) 
> there shouldnt be much of an event queueing abstraction i believe (other 
> than what we get 'for free' within epoll, from having poll()-able files) 
> - and even for those event sources threadlets offer a pretty good run 
> for the money.

I tend to agree.
Yes, some loads require the event driven model, others can be done
using threads. The only reason kevent was created is to allow
processing any type of event in exactly the same way in the same
processing loop; its event registration structure was optimized to be
smaller than a cache line.

What I cannot agree with is the claim that IO is thread-based stuff.

> one can always find the point and workload where say 40,000 threads 
> start trashing the L2 cache, but where 40,000 queued special requests 
> are still fully in cache, and produce spectacular numbers.
> 
> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 22:44           ` Linus Torvalds
  2007-02-26 13:11             ` Ingo Molnar
@ 2007-02-26 17:28             ` Evgeniy Polyakov
  2007-02-26 17:57               ` Linus Torvalds
  1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 17:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Sun, Feb 25, 2007 at 02:44:11PM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Thu, 22 Feb 2007, Evgeniy Polyakov wrote:
> > 
> > My tests show that with 4k connections per second (8k concurrency) more 
> > than 20k connections of 80k total block in tcp_sendmsg() over gigabit 
> > lan between quite fast machines.
> 
> Why do people *keep* taking this up as an issue?

Let's warm our brains a little with this pseudo-technical word-throwing :)

> Use select/poll/epoll/kevent/whatever for event mechanisms. STOP CLAIMING 
> that you'd use threadlets/syslets/aio for that. It's been pointed out over 
> and over and over again, and yet you continue to make the same mistake, 
> Evgeniy.
>
> So please read that sentence ten times, and then don't continue to make 
> that same mistake. PLEASE.
> 
> Event mechanisms are *superior* for events. But they *suck* for things 
> that aren't events, but are actual code execution with random places that 
> can block. THE TWO THINGS ARE TOTALLY AND UTTERLY INDEPENDENT!
> 
> Examples of events:
>  - packet arrives
>  - timer happens
> 
> Examples of things that are *not* "events":
>  - filesystem lookup.
>  - page faults
> 
> So the basic point is: for events, you use an event-based thing. For code 
> execution, you use a thread-based thing. It's really that simple.

Linus, you made your point clearly - generic AIO should not be used for
the cases when it is supposed to block 90% of the time - only when it
almost never blocks, as in the case of buffered IO.
Otherwise, when the nature of the load requires blocking almost always,
each block is eventually lifted by something which calls wake_up();
that something is an event, which we were supposed to wait for, but
instead we were rescheduled and waited there. So we just added an
additional layer of indirection - we were scheduled away, did some
work, then were awakened, instead of doing some work and getting the
event.

I do not even dispute that the micro-threading model is way simpler to
program, but from the above example it is clear that it adds additional
overhead, which in turn can be high or noticeable.

> And yes, the two different things can usually be translated (at a very 
> high cost in complexity *and* performance) into each other, so people who 
> look at it as purely a theoretical exercise may think that "events" and 
> "code execution" are equivalent. That's a very very silly and stupid way 
> of looking at things in real life, though.
> 
> Yes, you can turn things that are better seen as threaded execution into 
> an event-based thing by turning it into a state machine. And usually that 
> is a TOTAL DISASTER, and the end result is fragile and impossible to 
> maintain.
> 
> And yes, you can often (more easily) turn an event-based mechanism into a 
> thread-based one, and usually the end result is a TOTAL DISASTER because 
> it doesn't scale very well, and while it may actually result in somewhat 
> simpler code, the overhead of managing ten thousand outstanding threads is 
> just too high, when you compare to managing just a list of ten thousand 
> outstanding events.
>
> And yes, people have done both of those mistakes. Java, for example, 
> largely did the latter mistake ("we don't need anything like 'select', 
> because we'll just use threads for everything" - what a totally moronic 
> thing to do!)

I can only say that I'm fully agree. Absolutely. No jokes.

> So Evgeniy, threadlets/syslets/aio is *not* a replacement for event 
> queues. It's a TOTALLY DIFFERENT MECHANISM, and one that is hugely 
> superior to event queues for certain kinds of things. Anybody who thinks 
> they want to do pathname and inode lookup as a series of events is likely 
> a moron. It's really that simple.

Hmmm... let me describe that process a bit:

Userspace wants to open a file, so it needs some file-related (inode,
dentry and other) structures in memory, and they must be read from
disk. Eventually that means reading some blocks from the disk
(for example ext3_lookup->ext3_find_entry->ext3_getblk/ll_rw_block) and
we will wait for them (wait_on_bit()) - we will wait for an event.

But I agree, it was a brainfscking example, but nevertheless, it can be
easily done using event driven model.

Reading from the disk is _exactly_ the same - the same waiting for
buffer_heads/pages, and (since it is bigger) it can easily be
transferred to an event driven model.
Ugh, wait - it not only _can_ be transferred, it has already been done
in kevent AIO, and it shows faster speeds (though I have only tested
sending the data over the net).

> In a complex server (say, a database), you'd use both. You'd probably use 
> events for doing the things you *already* use events for (whether it be 
> select/poll/epoll or whatever): probably things like the client network 
> connection handling.
> 
> But you'd *in*addition* use threadlets to be able to do the actual 
> database IO in a threaded manner, so that you can scale the things that 
> are not easily handled as events (usually because they have internal 
> kernel state that the user cannot even see, and *must*not* see because of 
> security issues).

An event is data readiness - no more, no less.
It has nothing to do with internal kernel structures - you just wait
until data is ready in the requested buffer (disk, net, whatever).
The internal mechanism of moving data to the destination point can use
an event driven model too, but that is another question.

Eventually threads wait for the same events - but there is additional
overhead created to manage those objects called threads.
Ingo says that it is fast to manage them, but it cannot be faster than
a properly created event driven abstraction, because, as you noticed
yourself, the mutual transformations are complex.

Threadlets are simpler to program, but they do not add a gain compared
to a properly created single-threaded model (or one thread per CPU)
with the right event processing mechanism.

Waiting for any IO is waiting for an event; other tasks can be made
events too, but I agree, it is simpler to use different models just
because they already exist.

> So please. Stop this "kevents are better". The only thing you show by 
> trying to go down that avenue is that you don't understand the 
> *difference* between an event model and a thread model. They are both 
> perfectly fine models and they ARE NOT THE SAME! They aren't even mutually 
> incompatible - quite the reverse.
> 
> The thing people want to remove with threadlets is the internal overhead 
> of maintaining special-purpose code like aio_read() inside the kernel, 
> that doesn't even do all that people want it to do, and that really does 
> need a fair amount of internal complexity that we could hopefully do with 
> a more generic (and hopefully *simpler*) model.

Let me rephrase that stuff:

the thing people want to remove with linked lists is the internal
overhead of maintaining special-purpose code like RB trees.

It carries a similar share of idiocy.

If additional code provides faster processing, it should be used, fear
of 'the internal overhead of maintaining special-purpose code'
notwithstanding.

> 		Linus

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 14:15                               ` Ingo Molnar
@ 2007-02-26 16:55                                 ` Evgeniy Polyakov
  2007-02-26 20:35                                   ` Ingo Molnar
  2007-02-27  2:18                                   ` Davide Libenzi
  0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 16:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 03:15:18PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > > your whole reasoning seems to be faith-based:
> > > 
> > > [...] Anyway, kevents are very small, threads are very big, [...]
> > > 
> > > How about following the scientific method instead?
> > 
> > Those are only rhetorical words, as you have understood I bet; I meant 
> > that the whole process of getting a readiness notification from kevent 
> > is way faster than the rescheduling of a new process/thread to 
> > handle that IO.
> > 
> > The whole process of switching from one process to another can be as 
> > fast as bloody hell, but all the other details just kill the thing.
> 
> for our primary abstractions there /IS NO OTHER DETAIL/ but wakeup and 
> context-switching! The "event notification" of a sys_read() /IS/ the 
> wakeup and context-switching that we do - or the epoll/kevent enqueueing 
> as an alternative.
> 
> yes, the two are still different in a number of ways, and yes, it's 
> still stupid to do a pool of thousands of threads and thus we can always 
> optimize queuing, RAM and cache footprint via specialization, but your 
> whole foundation seems to be constructed around the false notion that 
> queueing and scheduling a task by the scheduler is somehow magically 
> expensive and different from queueing and scheduling other type of 
> requests. Please reconsider that foundation and open up a bit more to a 
> slightly different world view: scheduling is really just another, more 
> generic (and thus certainly more expensive) type of 'request queueing', 
> and user-space, most of the time, is much better off if it handles its 
> 'requests' and 'events' via tasks. (Especially if many of those 'events' 
> turn out to be non-events at all, so to speak.)

If kernelspace rescheduling is that fast, then please explain to me why
userspace rescheduling always beats the kernel/userspace kind?

And you showed that threadlets without a polling accept still do not
scale well - if it is the same fast queueing of events, then why doesn't
it work?

Actually it does not matter: if that narrow place exists (like the
kernel/user transition, or register copying, or something else), it can
be eliminated in a different model - kevent is that model - it does not
require a lot of things to be changed to get a notification and start
working, so it scales better.

It is very similar to epoll, but there are at least two significant
differences:
1. it can work with _any_ type of event with minimal overhead (this can
not be even remotely compared with the 'file' binding, which is required
to be pollable).
2. its notifications do not go through a second loop, i.e. it is O(1),
not O(ready_num), and notification happens directly from the internals of
the appropriate subsystem, which does not require a special wakeup
(although that can be done too).

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 12:51                                   ` Ingo Molnar
@ 2007-02-26 16:46                                     ` Evgeniy Polyakov
  2007-02-27  6:24                                       ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 16:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 01:51:23PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > Even having the main dispatcher as an epoll/kevent loop, the _whole_ 
> > threadlet model is absolutely micro-thread in nature, not state 
> > machine/event.
> 
> Evgeniy, i'm not sure how many different ways to tell this to you, but 
> you are not listening, you are not learning and you are still not 
> getting it at all.
> 
> The scheduler /IS/ a generic work/event queue. And it's pretty damn 
> fast. No amount of badmouthing will change that basic fact. Not exactly 
> as fast as a special-purpose queueing system (for all the reasons i 
> outlined to you, and which you ignored), but it gets pretty damn close 
> even for the web workload /you/ identified, and offers a user-space 
> programming model that is about 1000 times more useful than 
> state-machines.

Meanwhile, on the practical side:
on a VIA EPIA box, kevent/epoll/threadlet:

client: ab -c500 -n5000 $url

kevent:		849.72
epoll:		538.16
threadlet:
 gcc ./evserver_epoll_threadlet.c -o ./evserver_epoll_threadlet
 In file included from ./evserver_epoll_threadlet.c:30:
 ./threadlet.h: In function ‘threadlet_exec’:
 ./threadlet.h:46: error: can't find a register in class ‘GENERAL_REGS’
 while reloading ‘asm’

That particular asm optimization fails to compile.

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 14:05                             ` Evgeniy Polyakov
@ 2007-02-26 14:15                               ` Ingo Molnar
  2007-02-26 16:55                                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 14:15 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > your whole reasoning seems to be faith-based:
> > 
> > [...] Anyway, kevents are very small, threads are very big, [...]
> > 
> > How about following the scientific method instead?
> 
> Those are only rhetorical words, as you have understood, I bet; I meant 
> that the whole process of getting a readiness notification from kevent 
> is far faster than rescheduling a new process/thread to 
> handle that IO.
> 
> The whole process of switching from one process to another can be as 
> fast as bloody hell, but all other details just kill the thing.

for our primary abstractions there /IS NO OTHER DETAIL/ but wakeup and 
context-switching! The "event notification" of a sys_read() /IS/ the 
wakeup and context-switching that we do - or the epoll/kevent enqueueing 
as an alternative.

yes, the two are still different in a number of ways, and yes, it's 
still stupid to do a pool of thousands of threads and thus we can always 
optimize queuing, RAM and cache footprint via specialization, but your 
whole foundation seems to be constructed around the false notion that 
queueing and scheduling a task by the scheduler is somehow magically 
expensive and different from queueing and scheduling other type of 
requests. Please reconsider that foundation and open up a bit more to a 
slightly different world view: scheduling is really just another, more 
generic (and thus certainly more expensive) type of 'request queueing', 
and user-space, most of the time, is much better off if it handles its 
'requests' and 'events' via tasks. (Especially if many of those 'events' 
turn out to be non-events at all, so to speak.)

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 12:39                           ` Ingo Molnar
@ 2007-02-26 14:05                             ` Evgeniy Polyakov
  2007-02-26 14:15                               ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 14:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 01:39:23PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > > > Kevent is a _very_ small entity and there is _no_ cost of 
> > > > requeueing (well, there is list_add guarded by lock) - after it is 
> > > > done, process can start real work. With rescheduling there are 
> > > > _too_ many things to be done before we can start new work. [...]
> > > 
> > > actually, no. For example a wakeup too is fundamentally a list_add 
> > > guarded by a lock. Take a look at try_to_wake_up(). The rest you see 
> > > there is just extra frills that relate to things like 
> > > 'load-balancing the requests over multiple CPUs [which i'm sure 
> > > kevent users would request in the future too]'.
> > 
> > wake_up() as a call is pretty simple and fast, but its result is 
> > slow. [...]
> 
> You are still very much wrong, and now you refuse to even /read/ what i 
> wrote. Your only reply to my detailed analysis is: "it is slow, because 
> it is slow and heavy". I told you how fast it is, i told you what 
> happens on a context switch and why, i told you that you can measure if 
> you want.

Ingo, you likely will not believe it, but your mails are among the few
which I always read several times to get every bit of them :)

I clearly understand your point of view; it is absolutely clear to me.
But... I cannot agree with it.

Both for theoretical reasons and for practical ones concerning my
measurements. It is not pure speculation, as one might expect, but a real
life comparison of kernel/user scheduling with pure userspace scheduling
(as in my own M:N threading lib or a concurrent programming language like
Erlang). For me (and probably _only_ for me), a lib showing 10 times
faster rescheduling is enough reason to start developing my own, so I
pointed to it in the discussion.

> > [...] I did not run rescheduling tests with kernel threads, but posix 
> > threads (they do look like a kernel thread) have significant overhead 
> > there.
> 
> You are wrong. Let me show you some more numbers. This is a 
> hackbench_pth.c run:
> 
>   $ ./hackbench_pth 500
>   Time: 14.371
> 
> this uses 20,000 real threads and during this test the runqueue length 
> is extreme - up to over ten thousand threads. (hackbench_pth.c was 
> posted to lkml recently.)
> 
> The same run with hackbench.c (20,000 forked processes):
> 
>   $ ./hackbench 500
>   Time: 14.632
> 
> so the TLB overhead from using processes is 1.8%.
> 
> > [...] In the early development days of my M:N threading library I tested 
> > rescheduling performance of the POSIX threads - I created pool of 
> > threads and 'sent' a message using futex wait/wake - such performance 
> > of the userspace threading library (I tested erlang) was 10 times 
> > slower.
> 
> how much would it take for you to actually re-measure it and interpret 
> the results you are seeing? You've apparently built a whole mental house 
> of cards on the flawed proposition that tasks are 'super-heavy' and that 
> context-switching them is 'slow'. You are unwilling to explain /how/ 
> they are slow, and all the numbers i post are contrary to that 
> proposition of yours.
> 
> your whole reasoning seems to be faith-based:
> 
> [...] Anyway, kevents are very small, threads are very big, [...]
> 
> How about following the scientific method instead?

Those are only rhetorical words, as you have understood, I bet; I meant that
the whole process of getting a readiness notification from kevent is far
faster than rescheduling a new process/thread to handle that IO.

The whole process of switching from one process to another can be as fast
as bloody hell, but all other details just kill the thing.

I do not know which exact line ends up being the problem, but having
thousands of threads, the rescheduling itself results in slower performance
than userspace rescheduling. Then I extrapolate that to our IO test case.

> > [...] and both are the way they are exactly on purpose - threads serve 
> > for processing of any generic code, kevents are used for event waiting 
> > - IO is such an event, it does not require a lot of infrastructure to 
> > handle, it only needs some simple bits, so it can be optimized to be 
> > extremely fast; with huge infrastructure behind each IO (as in the case 
> > when it is a separate thread) it can not be done effectively.
> 
> you are wrong, and i have pointed it out to you in my previous replies 
> why you are wrong. Your only coherent specific thought on this topic was 
> your incorrect assumption that the scheduler somehow saves registers 
> and that this makes it heavy. I pointed out to you in the mail you 
> reply to that /every/ system call saves user registers. You've not 
> even replied to that point of mine, you are ignoring it completely and 
> you are still repeating your same old, incorrect argument. If it is 
> heavy, /why/ do you think it is heavy? Where is that magic pixie dust 
> piece of scheduler code that miraculously turns the runqueue into a 
> molasses-slow, heavy piece of thing?

I do not argue that I'm absolutely right.
I just point out that I tested some cases, and those tests ended up with
completely broken behaviour for the micro-thread design (even besides the
case of thousands of new thread creations/reuses per second, which itself
does not look perfect).

I do not even try to say that threadlets suck (although I do believe
that they do in some cases, at least for now :) ), I just point out that
the rescheduling overhead happens to be too big in the
benchmarks I ran (to which you never replied, but that does not matter
after all :).

It can come down to a (handwaving) broken syscall wrapper implementation,
or to anything else. Absolutely.

I never ever tried to say that the scheduler's code is broken - I just
showed my own tests, which resulted in a situation where many working
threads end up with worse timings than some other cases.

Registers/tlb/whatever is just speculation about a _possible_ root of
the problem. I did not investigate the problem enough - I just decided to
implement a different library. Shame on me for that, since I never showed 
what exactly the root of the problem is, but for _me_ it was enough, so I'm
trying to share it with you and other developers.

> Or put in another way: your test-code does ~6 syscalls per every event. 
> So if what you said would be true (which it isnt), a kevent based 
> request would have to be just as slow as a thread-based request ...

I can neither confirm nor object to this sentence.

> > > i think you are really, really mistaken if you believe that the fact 
> > > that whole tasks/threads or processes can be 'monster structures', 
> > > somehow has any relevance to scheduling/task-queueing performance 
> > > and scalability. It does not matter how large a task's address space 
> > > is - scheduling only relates to the minimal context that is in the 
> > > CPU. And most of that context we save upon /every system call 
> > > entry/, and restore it upon every system call return. If it's so 
> > > expensive to manipulate, why can the Linux kernel do a full system 
> > > call in ~150 cycles? That's cheaper than the access latency to a 
> > > single DRAM page.
> > 
> > I meant not its size, but the whole infrastructure, which surrounds 
> > task. [...]
> 
> /what/ infrastructure do you mean? sched.c? Most of that never runs in 
> the scheduler hotpath.
>
> > [...] If it is that lightweight, why don't we have posix thread per 
> > IO? [...]
> 
> because it would be pretty stupid to do that?
> 
> But more importantly: because many people still believe 'the scheduler 
> is slow and context-switching is evil'? The FIO AIO syslet code from 
> Jens is an intelligent mix of queueing combined with async execution. I 
> expect that model to prevail.

Suparna showed its problems - although on an older version. 
Let's see other tests.

> > [...] One question is that mmap/allocation of the stack is too slow 
> > (and it is very slow indeed, that is why glibc and M:N threading lib 
> > caches allocated stacks), another one is kernel/userspace boundary 
> > crossing, next one are tlb flushes, then copies.
> 
> now you come up again with creation overhead but nobody is talking about 
> context creation overhead. (btw., you are also wrong if you think that 
> mmap() is all that slow - try measuring it one day) We were talking 
> about context /switching/.

Ugh, Ingo, do not think in absolutes...
I did measure it. And it is slow.
http://tservice.net.ru/~s0mbre/blog/2007/01/15#2007_01_15

> > Why userspace rescheduling is in order of tens times faster than 
> > kernel/user?
> 
> what on earth does this have to do with the topic of whether context 
> switches are fast enough? Or if you like random info, just let me throw 
> in a random piece of information as well:
> 
> user-space function calls are more than /two/ orders of magnitude faster 
> than system calls. Still you are using 6 ... SIX system calls in the 
> sample kevent request handling hotpath.

I can only laugh on that, Ingo :)
If you will be in Moscow, I will buy you a beer, just drop me a mail.

What are we talking about, Ingo: kevent and IO in thread contexts, 
or userspace vs. kernelspace scheduling?

Kevent can be broken as hell, it can be a stupid application which does
not work at all - that is one of the possible theories.
Practice, however, shows that it is not true.

Anyway, if we are talking about kevents and micro-threads, that is one
topic; if we are talking about the possible overhead of rescheduling, that
is another.

> > > for the same reason has it no relevance that the full kevent-based 
> > > webserver is a 'monster structure' - still a single request's basic 
> > > queueing operation is cheap. The same is true to tasks/threads.
> > 
> To move those tasks too many steps must be done, and although each 
> one can be quite fast, the whole process of rescheduling in the case 
> of thousands of running threads creates too big an overhead per task, 
> dropping performance.
> 
> again, please come up with specifics! I certainly came up with enough 
> specifics.

I thought I showed it several times already.
Anyway: http://tservice.net.ru/~s0mbre/blog/2006/11/09#2006_11_09

That is an initial step, which shows that rescheduling of threads (I DO
NOT claim problems in sched.c, Ingo, that would be somehow stupid,
although it can be right) has some overhead compared to userspace
rescheduling. If so, it can be eliminated or reduced.

Second (COMPLETELY DIFFERENT STARTING POINT):
if rescheduling has some overhead, is it possible to reduce it using a
different model for IO? So I created kevent (as you likely do not know,
the original idea was a bit different, network AIO, but the result is
quite good).

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 22:44           ` Linus Torvalds
@ 2007-02-26 13:11             ` Ingo Molnar
  2007-02-26 17:37               ` Evgeniy Polyakov
  2007-02-26 17:28             ` Evgeniy Polyakov
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 13:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > My tests show that with 4k connections per second (8k concurrency) 
> > more than 20k connections of 80k total block in tcp_sendmsg() over 
> > gigabit lan between quite fast machines.
> 
> Why do people *keep* taking this up as an issue?
> 
> Use select/poll/epoll/kevent/whatever for event mechanisms. STOP 
> CLAIMING that you'd use threadlets/syslets/aio for that. It's been 
> pointed out over and over and over again, and yet you continue to make 
> the same mistake, Evgeniy.
> 
> So please read that sentence ten times, and then don't continue to 
> make that same mistake. PLEASE.
> 
> Event mechanisms are *superior* for events. But they *suck* for things 
> that aren't events, but are actual code execution with random places 
> that can block. THE TWO THINGS ARE TOTALLY AND UTTERLY INDEPENDENT!

Note that even for something tasks are supposed to suck at, and even if 
used in extremely stupid ways, they perform reasonably well in practice 
;-)

And i fully agree: specialization based on knowledge about frequency of 
blocking will always be useful - if not /forced/ on the workflow 
architecture and if not overdone. On the other hand, fully event-driven 
servers based on 'nonblocking' calls, which Evgeniy is advocating and 
which the kevent model is forcing upon userspace, are pure madness.

We very much can and should use things like epoll for events that we 
expect to happen asynchronously 100% of the time - it just makes no 
sense for those events to take up 4-5K of RAM apiece, when they could 
also be only using up the 32 bytes that say a pending timer takes. I've 
posted the code for that, how to do an 'outer' epoll loop around an 
internal threadlet iterator. But those will always be very narrow event 
sources, and likely wont (and shouldnt) cover 'request-internal' 
processing.

but otherwise, there is no real difference between a task that is 
scheduled and a request that is queued, 'other' than the size of the 
request (the task takes 4-5K of RAM), and the register context (64-128 
bytes on most CPUs, the loading of which is optimized to death).

Which difference can still be significant for certain workloads, so we 
certainly dont want to prohibit specialized event interfaces and force 
generic threads on everything. But for anything that isnt a raw and 
natural external event source (time, network, disk, user-generated) 
there shouldnt be much of an event queueing abstraction i believe (other 
than what we get 'for free' within epoll, from having poll()-able files) 
- and even for those event sources threadlets offer a pretty good run 
for the money.

one can always find the point and workload where say 40,000 threads 
start trashing the L2 cache, but where 40,000 queued special requests 
are still fully in cache, and produce spectacular numbers.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 10:47                                 ` Evgeniy Polyakov
@ 2007-02-26 12:51                                   ` Ingo Molnar
  2007-02-26 16:46                                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 12:51 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> Even having the main dispatcher as an epoll/kevent loop, the _whole_ 
> threadlet model is absolutely micro-thread in nature, not state 
> machine/event.

Evgeniy, i'm not sure how many different ways to tell this to you, but 
you are not listening, you are not learning and you are still not 
getting it at all.

The scheduler /IS/ a generic work/event queue. And it's pretty damn 
fast. No amount of badmouthing will change that basic fact. Not exactly 
as fast as a special-purpose queueing system (for all the reasons i 
outlined to you, and which you ignored), but it gets pretty damn close 
even for the web workload /you/ identified, and offers a user-space 
programming model that is about 1000 times more useful than 
state-machines.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 19:42                         ` Evgeniy Polyakov
  2007-02-25 20:38                           ` Ingo Molnar
@ 2007-02-26 12:39                           ` Ingo Molnar
  2007-02-26 14:05                             ` Evgeniy Polyakov
  2007-02-26 19:47                           ` Davide Libenzi
  2 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 12:39 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > > Kevent is a _very_ small entity and there is _no_ cost of 
> > > requeueing (well, there is list_add guarded by lock) - after it is 
> > > done, process can start real work. With rescheduling there are 
> > > _too_ many things to be done before we can start new work. [...]
> > 
> > actually, no. For example a wakeup too is fundamentally a list_add 
> > guarded by a lock. Take a look at try_to_wake_up(). The rest you see 
> > there is just extra frills that relate to things like 
> > 'load-balancing the requests over multiple CPUs [which i'm sure 
> > kevent users would request in the future too]'.
> 
> wake_up() as a call is pretty simple and fast, but its result is 
> slow. [...]

You are still very much wrong, and now you refuse to even /read/ what i 
wrote. Your only reply to my detailed analysis is: "it is slow, because 
it is slow and heavy". I told you how fast it is, i told you what 
happens on a context switch and why, i told you that you can measure if 
you want.

> [...] I did not run rescheduling tests with kernel threads, but posix 
> threads (they do look like a kernel thread) have significant overhead 
> there.

You are wrong. Let me show you some more numbers. This is a 
hackbench_pth.c run:

  $ ./hackbench_pth 500
  Time: 14.371

this uses 20,000 real threads and during this test the runqueue length 
is extreme - up to over ten thousand threads. (hackbench_pth.c was 
posted to lkml recently.)

The same run with hackbench.c (20,000 forked processes):

  $ ./hackbench 500
  Time: 14.632

so the TLB overhead from using processes is 1.8%.

> [...] In the early development days of my M:N threading library I tested 
> rescheduling performance of the POSIX threads - I created pool of 
> threads and 'sent' a message using futex wait/wake - such performance 
> of the userspace threading library (I tested erlang) was 10 times 
> slower.

how much would it take for you to actually re-measure it and interpret 
the results you are seeing? You've apparently built a whole mental house 
of cards on the flawed proposition that tasks are 'super-heavy' and that 
context-switching them is 'slow'. You are unwilling to explain /how/ 
they are slow, and all the numbers i post are contrary to that 
proposition of yours.

your whole reasoning seems to be faith-based:

[...] Anyway, kevents are very small, threads are very big, [...]

How about following the scientific method instead?

> [...] and both are the way they are exactly on purpose - threads serve 
> for processing of any generic code, kevents are used for event waiting 
> - IO is such an event, it does not require a lot of infrastructure to 
> handle, it only needs some simple bits, so it can be optimized to be 
> extremely fast; with huge infrastructure behind each IO (as in the case 
> when it is a separate thread) it can not be done effectively.

you are wrong, and i have pointed it out to you in my previous replies 
why you are wrong. Your only coherent specific thought on this topic was 
your incorrect assumption that the scheduler somehow saves registers 
and that this makes it heavy. I pointed out to you in the mail you 
reply to that /every/ system call saves user registers. You've not 
even replied to that point of mine, you are ignoring it completely and 
you are still repeating your same old, incorrect argument. If it is 
heavy, /why/ do you think it is heavy? Where is that magic pixie dust 
piece of scheduler code that miraculously turns the runqueue into a 
molasses-slow, heavy piece of thing?

Or put in another way: your test-code does ~6 syscalls per every event. 
So if what you said would be true (which it isnt), a kevent based 
request would have to be just as slow as a thread-based request ...

> > i think you are really, really mistaken if you believe that the fact 
> > that whole tasks/threads or processes can be 'monster structures', 
> > somehow has any relevance to scheduling/task-queueing performance 
> > and scalability. It does not matter how large a task's address space 
> > is - scheduling only relates to the minimal context that is in the 
> > CPU. And most of that context we save upon /every system call 
> > entry/, and restore it upon every system call return. If it's so 
> > expensive to manipulate, why can the Linux kernel do a full system 
> > call in ~150 cycles? That's cheaper than the access latency to a 
> > single DRAM page.
> 
> I meant not its size, but the whole infrastructure, which surrounds 
> task. [...]

/what/ infrastructure do you mean? sched.c? Most of that never runs in 
the scheduler hotpath.

> [...] If it is that lightweight, why don't we have posix thread per 
> IO? [...]

because it would be pretty stupid to do that?

But more importantly: because many people still believe 'the scheduler 
is slow and context-switching is evil'? The FIO AIO syslet code from 
Jens is an intelligent mix of queueing combined with async execution. I 
expect that model to prevail.

> [...] One question is that mmap/allocation of the stack is too slow 
> (and it is very slow indeed, that is why glibc and M:N threading lib 
> caches allocated stacks), another one is kernel/userspace boundary 
> crossing, next one are tlb flushes, then copies.

now you come up again with creation overhead but nobody is talking about 
context creation overhead. (btw., you are also wrong if you think that 
mmap() is all that slow - try measuring it one day) We were talking 
about context /switching/.

> Why userspace rescheduling is in order of tens times faster than 
> kernel/user?

what on earth does this have to do with the topic of whether context 
switches are fast enough? Or if you like random info, just let me throw 
in a random piece of information as well:

user-space function calls are more than /two/ orders of magnitude faster 
than system calls. Still you are using 6 ... SIX system calls in the 
sample kevent request handling hotpath.

> > for the same reason has it no relevance that the full kevent-based 
> > webserver is a 'monster structure' - still a single request's basic 
> > queueing operation is cheap. The same is true to tasks/threads.
> 
> To move those tasks too many steps must be done, and although each 
> one can be quite fast, the whole process of rescheduling in the case 
> of thousands of running threads creates too big an overhead per task, 
> dropping performance.

again, please come up with specifics! I certainly came up with enough 
specifics.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 10:35                               ` Ingo Molnar
@ 2007-02-26 10:47                                 ` Evgeniy Polyakov
  2007-02-26 12:51                                   ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 10:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 11:35:17AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > Btw, 'evserver' in the name means 'event server', so you might think 
> > about changing the name :)
> 
> why should i change the name? The 'outer' loop, which feeds requests to 
> threadlets, is an epoll based event loop. The inner loop, where all the 
> application complexity resides, is a threadlet. This is the "more 
> intelligent queueing" model i talked about in my reply to David 4 days 
> ago:
> 
>   http://lkml.org/lkml/2007/2/22/180
>   http://lkml.org/lkml/2007/2/22/191

:)
Ingo, of course it was a joke.

Even having the main dispatcher as an epoll/kevent loop, the _whole_ threadlet
model is absolutely micro-thread in nature, not state machine/event.
So it does not have events at all, especially with the speculation about
removing completion notifications (a fire-and-forget model).

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 10:31                             ` Ingo Molnar
@ 2007-02-26 10:43                               ` Evgeniy Polyakov
  2007-02-26 20:02                               ` Davide Libenzi
  1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 10:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 11:31:17AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > please also try evserver_epoll_threadlet.c that i've attached below - 
> > it uses epoll as the main event mechanism but does threadlets for 
> > request handling.
> 
> find updated code below - your evserver_epoll.c spuriously missed event 
> edges - so i changed it back to level-triggered. While that is not as 
> fast as edge-triggered, it does not result in spurious hangs and 
> workflow 'hiccups' during the test.

Hmm, exactly the same evserver_epoll.c you downloaded works OK for me,
although yes, it is buggy in that it does not close the socket when the
data has been transferred.

> Could this be the reason why in your testing kevents outperformed epoll?

I will try to check. In theory, without _ET it should perform much worse,
but in practice its performance is essentially the same (the same applies 
to kevent without the KEVENT_REQ_ET flag - since the same socket is almost 
never used several times, having that flag set or not is pure zero 
overhead).

> Also, i have removed the set-nonblocking calls because they are not 
> needed under threadlets.
> 
> [ to build this code, copy it into the async-test/ directory and build
>   it there - or copy the *.h files from async-test/ directory into your
>   build directory. ]

Ok, right now I'm compiling kevent/threadlet tree on my test machines.

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26 10:33                             ` Evgeniy Polyakov
@ 2007-02-26 10:35                               ` Ingo Molnar
  2007-02-26 10:47                                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 10:35 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> Btw, 'evserver' in the name means 'event server', so you might think 
> about changing the name :)

why should i change the name? The 'outer' loop, which feeds requests to 
threadlets, is an epoll based event loop. The inner loop, where all the 
application complexity resides, is a threadlet. This is the "more 
intelligent queueing" model i talked about in my reply to David 4 days 
ago:

  http://lkml.org/lkml/2007/2/22/180
  http://lkml.org/lkml/2007/2/22/191

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26  9:55                           ` Ingo Molnar
  2007-02-26 10:31                             ` Ingo Molnar
@ 2007-02-26 10:33                             ` Evgeniy Polyakov
  2007-02-26 10:35                               ` Ingo Molnar
  1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 10:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 10:55:47AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > I will use Ingo's evserver_threadlet server along with evserver_epoll 
> > (with fixed closing) and evserver_kevent.c.
> 
> please also try evserver_epoll_threadlet.c that i've attached below - it 
> uses epoll as the main event mechanism but does threadlets for request 
> handling.
> 
> This is a one step more intelligent threadlet queueing model than 
> 'thousands of threads' - although obviously epoll alone should do well 
> too with this trivial workload.

No problem.
If I complete the setup today before I go climbing (I need to do some 
paid job too), I will post the results here and in my blog (without
political correctness).

Btw, 'evserver' in the name means 'event server', so you might think
about changing the name :)

Stay tuned.

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26  9:55                           ` Ingo Molnar
@ 2007-02-26 10:31                             ` Ingo Molnar
  2007-02-26 10:43                               ` Evgeniy Polyakov
  2007-02-26 20:02                               ` Davide Libenzi
  2007-02-26 10:33                             ` Evgeniy Polyakov
  1 sibling, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 10:31 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Ingo Molnar <mingo@elte.hu> wrote:

> please also try evserver_epoll_threadlet.c that i've attached below - 
> it uses epoll as the main event mechanism but does threadlets for 
> request handling.

find updated code below - your evserver_epoll.c spuriously missed event 
edges - so i changed it back to level-triggered. While that is not as 
fast as edge-triggered, it does not result in spurious hangs and 
workflow 'hiccups' during the test.

Could this be the reason why in your testing kevents outperformed epoll?

Also, i have removed the set-nonblocking calls because they are not 
needed under threadlets.

[ to build this code, copy it into the async-test/ directory and build
  it there - or copy the *.h files from async-test/ directory into your
  build directory. ]

	Ingo

-------{ evserver_epoll_threadlet.c }-------------->
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/poll.h>
#include <sys/sendfile.h>
#include <sys/epoll.h>

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <time.h>
#include <ctype.h>
#include <netdb.h>

#define DEBUG           0

#include "syslet.h"
#include "sys.h"
#include "threadlet.h"

struct request {
	struct request *next_free;
	/*
	 * The threadlet stack is part of the request structure
	 * and is thus reused as threadlets complete:
	 */
	unsigned long threadlet_stack;

	/*
	 * These are all the request-specific parameters:
	 */
	long sock;
};

/*
 * Freelist to recycle requests:
 */
static struct request *freelist;

/*
 * Allocate a request and set up its syslet atoms:
 */
static struct request *alloc_req(void)
{
	struct request *req;

	/*
	 * Occasionally we have to refill the new-thread stack
	 * entry:
	 */
	if (!async_head.new_thread_stack) {
		async_head.new_thread_stack = thread_stack_alloc();
		pr("allocated new thread stack: %08lx\n",
			async_head.new_thread_stack);
	}

	if (freelist) {
		req = freelist;
		pr("reusing req %p, threadlet stack %08lx\n",
			req, req->threadlet_stack);
		freelist = freelist->next_free;
		req->next_free = NULL;
		return req;
	}

	req = calloc(1, sizeof(struct request));
	if (!req)
		return NULL;
	pr("allocated req %p\n", req);
	req->threadlet_stack = thread_stack_alloc();
	pr("allocated thread stack %08lx\n", req->threadlet_stack);

	return req;
}

/*
 * Check whether there are any completions queued for user-space
 * to finish up:
 */
static unsigned long complete(void)
{
	unsigned long completed = 0;
	struct request *req;

	for (;;) {
		req = (void *)completion_ring[async_head.user_ring_idx];
		if (!req)
			return completed;
		completed++;
		pr("completed req %p (threadlet stack %08lx)\n",
			req, req->threadlet_stack);

		req->next_free = freelist;
		freelist = req;

		/*
		 * Clear the completion pointer. To make sure the
		 * kernel never stomps upon still unhandled completions
		 * in the ring the kernel only writes to a NULL entry,
		 * so user-space has to clear it explicitly:
		 */
		completion_ring[async_head.user_ring_idx] = NULL;
		async_head.user_ring_idx++;
		if (async_head.user_ring_idx == MAX_PENDING)
			async_head.user_ring_idx = 0;
	}
}

static unsigned int pending_requests;

/*
 * Handle a request that has just been submitted (either it has
 * already been executed, or we have to account it as pending):
 */
static void handle_submitted_request(struct request *req, long done)
{
	unsigned int nr;

	if (done) {
		/*
		 * This is the cached case - free the request:
		 */
		pr("cache completed req %p (threadlet stack %08lx)\n",
			req, req->threadlet_stack);
		req->next_free = freelist;
		freelist = req;
		return;
	}
	/*
	 * 'cachemiss' case - the syslet is not finished
	 * yet. We will be notified about its completion
	 * via the completion ring:
	 */
	assert(pending_requests < MAX_PENDING-1);

	pending_requests++;
	pr("req %p is pending. %d reqs pending.\n", req, pending_requests);
	/*
	 * Attempt to complete requests - this is a fast
	 * check if there's no completions:
	 */
	nr = complete();
	pending_requests -= nr;

	/*
	 * If the ring is full then wait a bit:
	 */
	while (pending_requests == MAX_PENDING-1) {
		pr("sys_async_wait()");
		/*
		 * Wait for 4 events - to batch things a bit:
		 */
		sys_async_wait(4, async_head.user_ring_idx, &async_head);
		nr = complete();
		pending_requests -= nr;
		pr("after wait: completed %d requests - still pending: %d\n",
			nr, pending_requests);
	}
}

#include <linux/types.h>

//#define ulog(f, a...) fprintf(stderr, f, ##a)
#define ulog(f, a...)
#define ulog_err(f, a...) printf(f ": %s [%d].\n", ##a, strerror(errno), errno)


static int kevent_ctl_fd, main_server_s;

static void usage(char *p)
{
	ulog("Usage: %s -a addr -p port -f kevent_path -t timeout -w wait_num\n", p);
}

static int evtest_server_init(char *addr, unsigned short port)
{
	struct hostent *h;
	int s, on;
	struct sockaddr_in sa;

	if (!addr) {
		ulog("%s: Bind address cannot be NULL.\n", __func__);
		return -1;
	}

	h = gethostbyname(addr);
	if (!h) {
		ulog_err("%s: Failed to get address of %s.\n", __func__, addr);
		return -1;
	}

	s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	if (s == -1) {
		ulog_err("%s: Failed to create server socket", __func__);
		return -1;
	}
//	fcntl(s, F_SETFL, O_NONBLOCK);

	memcpy(&(sa.sin_addr.s_addr), h->h_addr_list[0], 4);
	sa.sin_port = htons(port);
	sa.sin_family = AF_INET;

	on = 1;
	setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, 4);

	if (bind(s, (struct sockaddr *)&sa, sizeof(struct sockaddr_in)) == -1) {
		ulog_err("%s: Failed to bind to %s", __func__, addr);
		close(s);
		return -1;
	}

	if (listen(s, 30000) == -1) {
		ulog_err("%s: Failed to listen on %s", __func__, addr);
		close(s);
		return -1;
	}

	return s;
}

#define EPOLL_EVENT_MASK (EPOLLIN | EPOLLERR | EPOLLPRI)

static int evtest_kevent_remove(int fd)
{
	int err;
	struct epoll_event event;

	event.events = EPOLL_EVENT_MASK;
	event.data.fd = fd;

	err = epoll_ctl(kevent_ctl_fd, EPOLL_CTL_DEL, fd, &event);
	if (err < 0) {
		ulog_err("Failed to perform control REMOVE operation");
		return err;
	}

	return err;
}

static int evtest_kevent_init(int fd)
{
	int err;
	struct timeval tm;
	struct epoll_event event;

	event.events = EPOLL_EVENT_MASK;
	event.data.fd = fd;

	err = epoll_ctl(kevent_ctl_fd, EPOLL_CTL_ADD, fd, &event);
	gettimeofday(&tm, NULL);
	ulog("%08lu:%06lu: fd=%3d, err=%1d.\n", tm.tv_sec, tm.tv_usec, fd, err);
	if (err < 0) {
		ulog_err("Failed to perform control ADD operation: fd=%d, events=%08x", fd, event.events);
		return err;
	}

	return err;
}

#define MAX_FILES 1000000

/*
 * Debug check:
 */
static struct request *fd_to_req[MAX_FILES];

static long handle_request(void *__req)
{
	struct request *req = __req;
	int s = req->sock, err, fd;
	off_t offset;
	int count;
	char path[] = "/tmp/index.html";
	char buf[4096];
	struct timeval tm;

	if (!fd_to_req[s])
		ulog_err("Bad: no request to fd?");

	count = 40960;
	offset = 0;

	err = recv(s, buf, sizeof(buf), 0);
	if (err < 0) {
		ulog_err("Failed to read data from s=%d", s);
		goto err_out_remove;
	}
	if (err == 0) {
		gettimeofday(&tm, NULL);
		ulog("%08lu:%06lu: Client exited: fd=%d.\n", tm.tv_sec, tm.tv_usec, s);
		goto err_out_remove;
	}

	fd = open(path, O_RDONLY);
	if (fd == -1) {
		ulog_err("Failed to open '%s'", path);
		err = -1;
		goto err_out_remove;
	}
#if 0
	do {
		err = read(fd, buf, sizeof(buf));
		if (err <= 0)
			break;
		err = send(s, buf, err, 0);
		if (err <= 0)
			break;
	} while (1);
#endif
	err = sendfile(s, fd, &offset, count);
	{
		int on = 0;
		setsockopt(s, SOL_TCP, TCP_CORK, &on, sizeof(on));
	}

	close(fd);
	if (err < 0) {
		ulog_err("Failed to send %d bytes: fd=%d", count, s);
		goto err_out_remove;
	}

	gettimeofday(&tm, NULL);
	ulog("%08lu:%06lu: %d bytes have been sent to client fd=%d.\n", tm.tv_sec, tm.tv_usec, err, s);

	close(s);
	fd_to_req[s] = NULL;

	return complete_threadlet_fn(req, &async_head);

err_out_remove:
	evtest_kevent_remove(s);
	close(s);
	fd_to_req[s] = NULL;

	return complete_threadlet_fn(req, &async_head);
}

static int evtest_callback_client(int sock)
{
	struct request *req;
	long done;

	if (fd_to_req[sock]) {
		ulog_err("Bad: request overlap?");
		return 0;
	}

	req = alloc_req();
	if (!req) {
		ulog_err("Bad: no req\n");
		evtest_kevent_remove(sock);
		return -ENOMEM;
	}

	req->sock = sock;
	fd_to_req[sock] = req;
	done = threadlet_exec(handle_request, req,
			req->threadlet_stack, &async_head);

	handle_submitted_request(req, done);

	return 0;
}

static int evtest_callback_main(int s)
{
	int cs, err;
	struct sockaddr_in csa;
	socklen_t addrlen = sizeof(struct sockaddr_in);
	struct timeval tm;

	memset(&csa, 0, sizeof(csa));

	if ((cs = accept(s, (struct sockaddr *)&csa, &addrlen)) == -1) {
		ulog_err("Failed to accept client");
		return -1;
	}
//	fcntl(cs, F_SETFL, O_NONBLOCK);

	gettimeofday(&tm, NULL);

	ulog("%08lu:%06lu: Accepted connect from %s:%d.\n",
		tm.tv_sec, tm.tv_usec,
		inet_ntoa(csa.sin_addr), ntohs(csa.sin_port));

	err = evtest_kevent_init(cs);
	if (err < 0) {
		close(cs);
		return -1;
	}

	return 0;
}

static int evtest_kevent_wait(unsigned int timeout, unsigned int wait_num)
{
	int num, err;
	struct timeval tm;
	struct epoll_event event[256];
	int i;

	err = epoll_wait(kevent_ctl_fd, event, 256, -1);
	if (err < 0) {
		ulog_err("Failed to perform control operation");
		return err;
	}

	gettimeofday(&tm, NULL);

	num = err;
	ulog("%08lu.%06lu: Wait: num=%d.\n", tm.tv_sec, tm.tv_usec, num);
	for (i=0; i<num; ++i) {
		if (event[i].data.fd == main_server_s)
			err = evtest_callback_main(event[i].data.fd);
		else
			err = evtest_callback_client(event[i].data.fd);
	}

	return err;
}

int main(int argc, char *argv[])
{
	int ch, err;
	char *addr;
	unsigned short port;
	unsigned int timeout, wait_num;

	addr = "0.0.0.0";
	port = 8080;
	timeout = 1000;
	wait_num = 1;

	async_head_init();

	while ((ch = getopt(argc, argv, "f:n:t:a:p:h")) > 0) {
		switch (ch) {
			case 't':
				timeout = atoi(optarg);
				break;
			case 'n':
				wait_num = atoi(optarg);
				break;
			case 'a':
				addr = optarg;
				break;
			case 'p':
				port = atoi(optarg);
				break;
			case 'f':
				break;
			default:
				usage(argv[0]);
				return -1;
		}
	}

	kevent_ctl_fd = epoll_create(10);
	if (kevent_ctl_fd == -1) {
		ulog_err("Failed to create epoll descriptor");
		return -1;
	}

	main_server_s = evtest_server_init(addr, port);
	if (main_server_s < 0)
		return main_server_s;

	err = evtest_kevent_init(main_server_s);
	if (err < 0)
		goto err_out_exit;

	while (1) {
		err = evtest_kevent_wait(timeout, wait_num);
	}

err_out_exit:
	close(kevent_ctl_fd);

	async_head_exit();

	return 0;
}

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26  9:25                         ` Evgeniy Polyakov
@ 2007-02-26  9:55                           ` Ingo Molnar
  2007-02-26 10:31                             ` Ingo Molnar
  2007-02-26 10:33                             ` Evgeniy Polyakov
  0 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26  9:55 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> I will use Ingo's evserver_threadlet server along with evserver_epoll 
> (with fixed closing) and evserver_kevent.c.

please also try evserver_epoll_threadlet.c that i've attached below - it 
uses epoll as the main event mechanism but does threadlets for request 
handling.

This is a one step more intelligent threadlet queueing model than 
'thousands of threads' - although obviously epoll alone should do well 
too with this trivial workload.

	Ingo

---------------------------->
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/poll.h>
#include <sys/sendfile.h>
#include <sys/epoll.h>

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <time.h>
#include <ctype.h>
#include <netdb.h>

#define DEBUG           0

#include "syslet.h"
#include "sys.h"
#include "threadlet.h"

struct request {
	struct request *next_free;
	/*
	 * The threadlet stack is part of the request structure
	 * and is thus reused as threadlets complete:
	 */
	unsigned long threadlet_stack;

	/*
	 * These are all the request-specific parameters:
	 */
	long sock;
};

/*
 * Freelist to recycle requests:
 */
static struct request *freelist;

/*
 * Allocate a request and set up its syslet atoms:
 */
static struct request *alloc_req(void)
{
	struct request *req;

	/*
	 * Occasionally we have to refill the new-thread stack
	 * entry:
	 */
	if (!async_head.new_thread_stack) {
		async_head.new_thread_stack = thread_stack_alloc();
		pr("allocated new thread stack: %08lx\n",
			async_head.new_thread_stack);
	}

	if (freelist) {
		req = freelist;
		pr("reusing req %p, threadlet stack %08lx\n",
			req, req->threadlet_stack);
		freelist = freelist->next_free;
		req->next_free = NULL;
		return req;
	}

	req = calloc(1, sizeof(struct request));
	if (!req)
		return NULL;
	pr("allocated req %p\n", req);
	req->threadlet_stack = thread_stack_alloc();
	pr("allocated thread stack %08lx\n", req->threadlet_stack);

	return req;
}

/*
 * Check whether there are any completions queued for user-space
 * to finish up:
 */
static unsigned long complete(void)
{
	unsigned long completed = 0;
	struct request *req;

	for (;;) {
		req = (void *)completion_ring[async_head.user_ring_idx];
		if (!req)
			return completed;
		completed++;
		pr("completed req %p (threadlet stack %08lx)\n",
			req, req->threadlet_stack);

		req->next_free = freelist;
		freelist = req;

		/*
		 * Clear the completion pointer. To make sure the
		 * kernel never stomps upon still unhandled completions
		 * in the ring the kernel only writes to a NULL entry,
		 * so user-space has to clear it explicitly:
		 */
		completion_ring[async_head.user_ring_idx] = NULL;
		async_head.user_ring_idx++;
		if (async_head.user_ring_idx == MAX_PENDING)
			async_head.user_ring_idx = 0;
	}
}

static unsigned int pending_requests;

/*
 * Handle a request that has just been submitted (either it has
 * already been executed, or we have to account it as pending):
 */
static void handle_submitted_request(struct request *req, long done)
{
	unsigned int nr;

	if (done) {
		/*
		 * This is the cached case - free the request:
		 */
		pr("cache completed req %p (threadlet stack %08lx)\n",
			req, req->threadlet_stack);
		req->next_free = freelist;
		freelist = req;
		return;
	}
	/*
	 * 'cachemiss' case - the syslet is not finished
	 * yet. We will be notified about its completion
	 * via the completion ring:
	 */
	assert(pending_requests < MAX_PENDING-1);

	pending_requests++;
	pr("req %p is pending. %d reqs pending.\n", req, pending_requests);
	/*
	 * Attempt to complete requests - this is a fast
	 * check if there's no completions:
	 */
	nr = complete();
	pending_requests -= nr;

	/*
	 * If the ring is full then wait a bit:
	 */
	while (pending_requests == MAX_PENDING-1) {
		pr("sys_async_wait()");
		/*
		 * Wait for 4 events - to batch things a bit:
		 */
		sys_async_wait(4, async_head.user_ring_idx, &async_head);
		nr = complete();
		pending_requests -= nr;
		pr("after wait: completed %d requests - still pending: %d\n",
			nr, pending_requests);
	}
}

#include <linux/types.h>

//#define ulog(f, a...) fprintf(stderr, f, ##a)
#define ulog(f, a...)
#define ulog_err(f, a...) printf(f ": %s [%d].\n", ##a, strerror(errno), errno)


static int kevent_ctl_fd, main_server_s;

static void usage(char *p)
{
	ulog("Usage: %s -a addr -p port -f kevent_path -t timeout -w wait_num\n", p);
}

static int evtest_server_init(char *addr, unsigned short port)
{
	struct hostent *h;
	int s, on;
	struct sockaddr_in sa;

	if (!addr) {
		ulog("%s: Bind address cannot be NULL.\n", __func__);
		return -1;
	}

	h = gethostbyname(addr);
	if (!h) {
		ulog_err("%s: Failed to get address of %s.\n", __func__, addr);
		return -1;
	}

	s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	if (s == -1) {
		ulog_err("%s: Failed to create server socket", __func__);
		return -1;
	}
	fcntl(s, F_SETFL, O_NONBLOCK);

	memcpy(&(sa.sin_addr.s_addr), h->h_addr_list[0], 4);
	sa.sin_port = htons(port);
	sa.sin_family = AF_INET;

	on = 1;
	setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, 4);

	if (bind(s, (struct sockaddr *)&sa, sizeof(struct sockaddr_in)) == -1) {
		ulog_err("%s: Failed to bind to %s", __func__, addr);
		close(s);
		return -1;
	}

	if (listen(s, 30000) == -1) {
		ulog_err("%s: Failed to listen on %s", __func__, addr);
		close(s);
		return -1;
	}

	return s;
}

static int evtest_kevent_remove(int fd)
{
	int err;
	struct epoll_event event;

	event.events = EPOLLIN | EPOLLET;
	event.data.fd = fd;

	err = epoll_ctl(kevent_ctl_fd, EPOLL_CTL_DEL, fd, &event);
	if (err < 0) {
		ulog_err("Failed to perform control REMOVE operation");
		return err;
	}

	return err;
}

static int evtest_kevent_init(int fd)
{
	int err;
	struct timeval tm;
	struct epoll_event event;

	event.events = EPOLLIN | EPOLLET;
	event.data.fd = fd;

	err = epoll_ctl(kevent_ctl_fd, EPOLL_CTL_ADD, fd, &event);
	gettimeofday(&tm, NULL);
	ulog("%08lu:%06lu: fd=%3d, err=%1d.\n", tm.tv_sec, tm.tv_usec, fd, err);
	if (err < 0) {
		ulog_err("Failed to perform control ADD operation: fd=%d, events=%08x", fd, event.events);
		return err;
	}

	return err;
}

static long handle_request(void *__req)
{
	struct request *req = __req;
	int s = req->sock, err, fd;
	off_t offset;
	int count;
	char path[] = "/tmp/index.html";
	char buf[4096];
	struct timeval tm;

	count = 40960;
	offset = 0;

	err = recv(s, buf, sizeof(buf), 0);
	if (err < 0) {
		ulog_err("Failed to read data from s=%d", s);
		goto err_out_remove;
	}
	if (err == 0) {
		gettimeofday(&tm, NULL);
		ulog("%08lu:%06lu: Client exited: fd=%d.\n", tm.tv_sec, tm.tv_usec, s);
		goto err_out_remove;
	}

	fd = open(path, O_RDONLY);
	if (fd == -1) {
		ulog_err("Failed to open '%s'", path);
		err = -1;
		goto err_out_remove;
	}
#if 0
	do {
		err = read(fd, buf, sizeof(buf));
		if (err <= 0)
			break;
		err = send(s, buf, err, 0);
		if (err <= 0)
			break;
	} while (1);
#endif
	err = sendfile(s, fd, &offset, count);
	{
		int on = 0;
		setsockopt(s, SOL_TCP, TCP_CORK, &on, sizeof(on));
	}

	close(fd);
	if (err < 0) {
		ulog_err("Failed to send %d bytes: fd=%d", count, s);
		goto err_out_remove;
	}

	gettimeofday(&tm, NULL);
	ulog("%08lu:%06lu: %d bytes have been sent to client fd=%d.\n", tm.tv_sec, tm.tv_usec, err, s);

	close(s);

	return complete_threadlet_fn(req, &async_head);

err_out_remove:
	evtest_kevent_remove(s);
	close(s);

	return complete_threadlet_fn(req, &async_head);
}

static int evtest_callback_client(int sock)
{
	struct request *req;
	long done;

	req = alloc_req();
	if (!req) {
		printf("no req\n");
		evtest_kevent_remove(sock);
		return -ENOMEM;
	}

	req->sock = sock;
	done = threadlet_exec(handle_request, req,
			req->threadlet_stack, &async_head);

	handle_submitted_request(req, done);

	return 0;
}

static int evtest_callback_main(int s)
{
	int cs, err;
	struct sockaddr_in csa;
	socklen_t addrlen = sizeof(struct sockaddr_in);
	struct timeval tm;

	memset(&csa, 0, sizeof(csa));

	if ((cs = accept(s, (struct sockaddr *)&csa, &addrlen)) == -1) {
		ulog_err("Failed to accept client");
		return -1;
	}
	fcntl(cs, F_SETFL, O_NONBLOCK);

	gettimeofday(&tm, NULL);

	ulog("%08lu:%06lu: Accepted connect from %s:%d.\n",
		tm.tv_sec, tm.tv_usec,
		inet_ntoa(csa.sin_addr), ntohs(csa.sin_port));

	err = evtest_kevent_init(cs);
	if (err < 0) {
		close(cs);
		return -1;
	}

	return 0;
}

static int evtest_kevent_wait(unsigned int timeout, unsigned int wait_num)
{
	int num, err;
	struct timeval tm;
	struct epoll_event event[256];
	int i;

	err = epoll_wait(kevent_ctl_fd, event, 256, -1);
	if (err < 0) {
		ulog_err("Failed to perform control operation");
		return err;
	}

	gettimeofday(&tm, NULL);

	num = err;
	ulog("%08lu.%06lu: Wait: num=%d.\n", tm.tv_sec, tm.tv_usec, num);
	for (i=0; i<num; ++i) {
		if (event[i].data.fd == main_server_s)
			err = evtest_callback_main(event[i].data.fd);
		else
			err = evtest_callback_client(event[i].data.fd);
	}

	return err;
}

int main(int argc, char *argv[])
{
	int ch, err;
	char *addr;
	unsigned short port;
	unsigned int timeout, wait_num;

	addr = "0.0.0.0";
	port = 8080;
	timeout = 1000;
	wait_num = 1;

	async_head_init();

	while ((ch = getopt(argc, argv, "f:n:t:a:p:h")) > 0) {
		switch (ch) {
			case 't':
				timeout = atoi(optarg);
				break;
			case 'n':
				wait_num = atoi(optarg);
				break;
			case 'a':
				addr = optarg;
				break;
			case 'p':
				port = atoi(optarg);
				break;
			case 'f':
				break;
			default:
				usage(argv[0]);
				return -1;
		}
	}

	kevent_ctl_fd = epoll_create(10);
	if (kevent_ctl_fd == -1) {
		ulog_err("Failed to create epoll descriptor");
		return -1;
	}

	main_server_s = evtest_server_init(addr, port);
	if (main_server_s < 0)
		return main_server_s;

	err = evtest_kevent_init(main_server_s);
	if (err < 0)
		goto err_out_exit;

	while (1) {
		err = evtest_kevent_wait(timeout, wait_num);
	}

err_out_exit:
	close(kevent_ctl_fd);

	async_head_exit();

	return 0;
}

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-26  8:16                       ` Ingo Molnar
@ 2007-02-26  9:25                         ` Evgeniy Polyakov
  2007-02-26  9:55                           ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26  9:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Mon, Feb 26, 2007 at 09:16:56AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > Also, the evtest_kevent_remove call is superfluous with epoll.
> 
> it's only used in the error path AFAICS.
> 
> but you are right about evserver_epoll/kevent.c incorrectly assuming 
> that things wont block in evtest_callback_client(), which, after 
> receiving the "there's stuff on the input socket" event does:
> 
> 	recvmsg(sock),
> 	fd = open();
> 	sendfile(sock, fd)
> 	close(fd);
> 
> while evserver_threadlet.c, even this naive implementation, does not 
> assume that we wont block in that function.
> 
> > In any case, comparing epoll/kevent with 100K active sessions, against 
> > threadlets, is not exactly a fair/appropriate test for it.
> 
> fully agreed.

Hi.

I will highlight several items in this mail:
1. evserver_epoll.c is broken in that it does not close the socket - I
tried to make it work with keepalive, but failed. So closing the socket
is a must, as in evserver_kevent.c.
2. keepalive is not supported - it is a hacked-up server, after all.
3. this test does not assume whether the above snippet blocks or not - it
is the _typical_ case of a web server with one working thread (per cpu) -
every op can block, so there is no problem - a threadlet will reschedule,
an event-based server will block (which is bad for it).
4. the benchmark does not cover all possible cases - the initial goal of
those servers was to show how fast/slow _event_ generation/processing is
in the epoll/kevent case, not to create a real-life web server.
lighttpd, for example, cannot be used as a good benchmark either, since
its architecture does not support some kevent extensions (which do not
exist in epoll), and, looking at the number of comments in the kevent 
threads, I'm not motivated to change it at all.

So, drawing a line: evserver_* is a simple event-driven server; it does
have disadvantages, but the same approach only favours the threadlet model.
Having millions or thousands of connections works against threadlets,
but we are comparing extreme cases - it is one of the possible tests.

So...
I'm cooking up a git tree with kevents and threadlets, which I will test
on a VIA EPIA (1 GHz, 256 MB of RAM, 100 Mbit LAN) and an Intel(R) Core(TM)2
CPU 6600 @ 2.40 GHz (2 GB of RAM, 1 Gbit LAN) later today with
kevent/epoll/threadlets, if the wine does not run out suddenly.
I will use Ingo's evserver_threadlet server along with evserver_epoll
(with fixed closing) and evserver_kevent.c.
Eventually I may move all of them into one file.

The client is 'ab' on my desktop, a 3.7 GHz Core Duo. The machines are
connected over a gigabit D-Link DGS-1216T switch (which freezes on
slightly broken dhcp and tftp packets, btw).

Stay tuned.

P.S. Linus, if you do not mind, I will postpone the scholastic masturbation 
about event vs process context in IO. One sentence only - page fault and
filename lookup both wait - they wait until the new page is ready or the
inode is read from the disk; eventually they wake up, and the thing that
wakes them up _is_ an event.

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
       [not found]                     ` <Pine.LNX.4.64.0702251232350.6011@alien.or.mcafeemobile.com>
@ 2007-02-26  8:16                       ` Ingo Molnar
  2007-02-26  9:25                         ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26  8:16 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Evgeniy Polyakov, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Ulrich Drepper, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner


* Davide Libenzi <davidel@xmailserver.org> wrote:

> Also, the evtest_kevent_remove call is superfluous with epoll.

it's only used in the error path AFAICS.

but you are right about evserver_epoll/kevent.c incorrectly assuming 
that things wont block in evtest_callback_client(), which, after 
receiving the "there's stuff on the input socket" event does:

	recvmsg(sock),
	fd = open();
	sendfile(sock, fd)
	close(fd);

while evserver_threadlet.c, even this naive implementation, does not 
assume that we wont block in that function.

> In any case, comparing epoll/kevent with 100K active sessions, against 
> threadlets, is not exactly a fair/appropriate test for it.

fully agreed.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 19:04                       ` Ingo Molnar
  2007-02-25 19:42                         ` Evgeniy Polyakov
@ 2007-02-25 23:14                         ` Michael K. Edwards
  1 sibling, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-25 23:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On 2/25/07, Ingo Molnar <mingo@elte.hu> wrote:
> Fundamentally a kernel thread is just its
> EIP/ESP [on x86, similar on other architectures] - which can be
> saved/restored in near zero time.

That's because the kernel address space is identical in every
process's MMU context, so the MMU doesn't have to be touched _at_all_.
 Also, the kernel very rarely touches FPU state, and even when it
does, the FXSAVE/FXRSTOR pair is highly optimized for the "save state
just long enough to move some memory around with XMM instructions"
case.  (I know you know this; this is for the benefit of less
experienced readers.)  If your threadlet model shares the FPU state
and TLS arena among all threadlets running on the same CPU, and
threadlets are scheduled in bursts belonging to the same process (and
preferably the same threadlet entrypoint), then you will get similarly
efficient userspace threadlet-to-threadlet transitions.  If not, not.

> scheduling only relates to the minimal context that is in the CPU. And
> most of that context we save upon /every system call entry/, and restore
> it upon every system call return. If it's so expensive to manipulate,
> why can the Linux kernel do a full system call in ~150 cycles? That's
> cheaper than the access latency to a single DRAM page.

That would be the magic of shadow register files.  When the software
does things that hardware expects it to do, everybody wins.  When the
software tries to get clever based on micro-benchmarks, everybody
loses.

Cheers,
- Michael

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 11:31         ` Evgeniy Polyakov
  2007-02-22 11:52           ` Arjan van de Ven
  2007-02-22 12:59           ` Ingo Molnar
@ 2007-02-25 22:44           ` Linus Torvalds
  2007-02-26 13:11             ` Ingo Molnar
  2007-02-26 17:28             ` Evgeniy Polyakov
  2 siblings, 2 replies; 277+ messages in thread
From: Linus Torvalds @ 2007-02-25 22:44 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner



On Thu, 22 Feb 2007, Evgeniy Polyakov wrote:
> 
> My tests show that with 4k connections per second (8k concurrency) more 
> than 20k connections of 80k total block in tcp_sendmsg() over gigabit 
> lan between quite fast machines.

Why do people *keep* taking this up as an issue?

Use select/poll/epoll/kevent/whatever for event mechanisms. STOP CLAIMING 
that you'd use threadlets/syslets/aio for that. It's been pointed out over 
and over and over again, and yet you continue to make the same mistake, 
Evgeniy.

So please read that sentence ten times, and then don't continue to make 
that same mistake. PLEASE.

Event mechanisms are *superior* for events. But they *suck* for things 
that aren't events, but are actual code execution with random places that 
can block. THE TWO THINGS ARE TOTALLY AND UTTERLY INDEPENDENT!

Examples of events:
 - packet arrives
 - timer happens

Examples of things that are *not* "events":
 - filesystem lookup.
 - page faults

So the basic point is: for events, you use an event-based thing. For code 
execution, you use a thread-based thing. It's really that simple.

And yes, the two different things can usually be translated (at a very 
high cost in complexity *and* performance) into each other, so people who 
look at it as purely a theoretical exercise may think that "events" and 
"code execution" are equivalent. That's a very very silly and stupid way 
of looking at things in real life, though.

Yes, you can turn things that are better seen as threaded execution into 
an event-based thing by turning it into a state machine. And usually that 
is a TOTAL DISASTER, and the end result is fragile and impossible to 
maintain.

And yes, you can often (more easily) turn an event-based mechanism into a 
thread-based one, and usually the end result is a TOTAL DISASTER because 
it doesn't scale very well, and while it may actually result in somewhat 
simpler code, the overhead of managing ten thousand outstanding threads is 
just too high, when you compare to managing just a list of ten thousand 
outstanding events.

And yes, people have done both of those mistakes. Java, for example, 
largely did the latter mistake ("we don't need anything like 'select', 
because we'll just use threads for everything" - what a totally moronic 
thing to do!)

So Evgeniy, threadlets/syslets/aio is *not* a replacement for event 
queues. It's a TOTALLY DIFFERENT MECHANISM, and one that is hugely 
superior to event queues for certain kinds of things. Anybody who thinks 
they want to do pathname and inode lookup as a series of events is likely 
a moron. It's really that simple.

In a complex server (say, a database), you'd use both. You'd probably use 
events for doing the things you *already* use events for (whether it be 
select/poll/epoll or whatever): probably things like the client network 
connection handling.

But you'd *in addition* use threadlets to be able to do the actual 
database IO in a threaded manner, so that you can scale the things that 
are not easily handled as events (usually because they have internal 
kernel state that the user cannot even see, and *must*not* see because of 
security issues).

So please. Stop this "kevents are better". The only thing you show by 
trying to go down that avenue is that you don't understand the 
*difference* between an event model and a thread model. They are both 
perfectly fine models and they ARE NOT THE SAME! They aren't even mutually 
incompatible - quite the reverse.

The thing people want to remove with threadlets is the internal overhead 
of maintaining special-purpose code like aio_read() inside the kernel, 
that doesn't even do all that people want it to do, and that really does 
need a fair amount of internal complexity that we could hopefully do with 
a more generic (and hopefully *simpler*) model.

		Linus

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 19:42                         ` Evgeniy Polyakov
@ 2007-02-25 20:38                           ` Ingo Molnar
  2007-02-26 12:39                           ` Ingo Molnar
  2007-02-26 19:47                           ` Davide Libenzi
  2 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 20:38 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> Interesting discussion - that will be very fun if kevent loses 
> badly :)

with your keepalive test no way can it lose against 80,000 sync 
threadlets - it's pretty much the worst-case thing for threadlets while 
it's the best-case for kevents. Try a non-keepalive test perhaps?

	Ingo

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 18:34               ` Ingo Molnar
@ 2007-02-25 20:01                 ` Frederik Deweerdt
  0 siblings, 0 replies; 277+ messages in thread
From: Frederik Deweerdt @ 2007-02-25 20:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Sun, Feb 25, 2007 at 07:34:38PM +0100, Ingo Molnar wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > > thx - i guess i should just run them without any options and they 
> > > bind themselves to port 80? What 'ab' options are you using 
> > > typically to measure them?
> > 
> > Yes, but they require /tmp/index.html to have http header and actual 
> > data page. They do not parse http request :)
> 
> ok. When i connect to the epoll server via "telnet myserver 80", and 
> enter a 'request', i get back the content - but the socket connection is 
> not closed. Every time i type enter i get a new content back. Why is 
> that so - the code seems to contain a close(fd).
> 
I'd say a close(s); is missing just before return 0; in
evtest_callback_client() ?

Regards,
Frederik

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 19:04                       ` Ingo Molnar
@ 2007-02-25 19:42                         ` Evgeniy Polyakov
  2007-02-25 20:38                           ` Ingo Molnar
                                             ` (2 more replies)
  2007-02-25 23:14                         ` Michael K. Edwards
  1 sibling, 3 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-25 19:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Sun, Feb 25, 2007 at 08:04:15PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > Kevent is a _very_ small entity and there is _no_ cost of requeueing 
> > (well, there is list_add guarded by lock) - after it is done, process 
> > can start real work. With rescheduling there are _too_ many things to 
> > be done before we can start new work. [...]
> 
> actually, no. For example a wakeup too is fundamentally a list_add 
> guarded by a lock. Take a look at try_to_wake_up(). The rest you see 
> there is just extra frills that relate to things like 'load-balancing 
> the requests over multiple CPUs [which i'm sure kevent users would 
> request in the future too]'.

wake_up() as a call is pretty simple and fast, but its result is slow. I
did not run rescheduling tests with kernel threads, but POSIX threads
(which do look like kernel threads) have significant overhead there. In
the early development days of an M:N threading library I tested the
rescheduling performance of POSIX threads - I created a pool of threads
and 'sent' a message using futex wait/wake - and compared to the
userspace threading library (I tested erlang), POSIX threads were about
10 times slower.

> > [...] We have to change registers, change address space, various tlb 
> > bits and so on - we have to do it, since task describes very heavy 
> > entity - the whole process. [...]
> 
> but ... 'threadlets' are called thread-lets because they are not full 
> processes, they are threads. There's no TLB state in that case. There's 
> indeed register state associated with them, and currently there can 
> certainly be quite a bit of overhead in a context switch - but not in 
> register saving. We do user-space register saving not in the scheduler 
> but upon /every system call/. Fundamentally a kernel thread is just its 
> EIP/ESP [on x86, similar on other architectures] - which can be 
> saved/restored in near zero time. All the rest is something we added for 
> good /work queueing/ reasons - and those same extras should either be 
> eliminated if they turn out to be not so good reasons after all, or they 
> will be wanted for kevents too eventually, once it matures as a work 
> queueing solution.

If something decreases performance noticeably, it is a bad thing, but it
is a matter of taste. Anyway, kevents are very small, threads are very
big, and both are the way they are exactly on purpose - threads serve for
processing of any generic code, while kevents are used for event waiting -
IO is such an event. It does not require a lot of infrastructure to
handle, only some simple bits, so it can be optimized to be extremely
fast; with huge infrastructure behind each IO (as in the case when it is
a separate thread) that can not be done effectively.

> > I think it is _too_ heavy to have such a monster structure like 
> > task(thread/process) and related overhead just to do an IO.
> 
> i think you are really, really mistaken if you believe that the fact 
> that whole tasks/threads or processes can be 'monster structures', 
> somehow has any relevance to scheduling/task-queueing performance and 
> scalability. It does not matter how large a task's address space is - 
> scheduling only relates to the minimal context that is in the CPU. And 
> most of that context we save upon /every system call entry/, and restore 
> it upon every system call return. If it's so expensive to manipulate, 
> why can the Linux kernel do a full system call in ~150 cycles? That's 
> cheaper than the access latency to a single DRAM page.

I meant not its size, but the whole infrastructure which surrounds a
task. If it is that lightweight, why don't we have a POSIX thread per IO?
One issue is that mmap/allocation of the stack is too slow (and it is
very slow indeed, which is why glibc and M:N threading libraries cache
allocated stacks), another is the kernel/userspace boundary crossing,
then TLB flushes, then copies.

Why is userspace rescheduling on the order of tens of times faster than
kernel/user rescheduling?

> for the same reason it has no relevance that the full kevent-based 
> webserver is a 'monster structure' - still a single request's basic 
> queueing operation is cheap. The same is true of tasks/threads.

To move those tasks too many steps must be taken, and although each one
can be quite fast, the whole process of rescheduling, in the case of
thousands of running threads, creates enough overhead per task to drop
performance.

> Really, you dont even have to know or assume anything about the 
> scheduler, just lets do some elementary math here:
> 
> the reqs/sec your sendfile+kevent based webserver can do is 7900 per 
> sec. Lets assume you will write further great kevent code which will 
> optimize it further and it goes up to 10,100 reqs per sec (100 usecs per 
> request), ok? Then also try how many reschedules/sec can your Athlon64 
> 3500 box do. My guess is: about a million per second (1 usec per 
> reschedule), perhaps a bit more.

Let's calculate: disk bandwidth is about gigabytes per second (cached
case), so to transfer a 10k file we need about 10 usec - 10% of that will
be spent in rescheduling (if there is only one, if any).
The network is an order of 10 slower (1gbit for example), but there are
many more blockings, so to transfer 10k we will have, let's say, 5 blocks,
i.e. 5 reschedulings - another 5% - so we have wasted 15% of our time in
rescheduling.

An event in turn is a 30-byte copy (plus of course its own overhead, but
it is still faster - faster simply because the ukevent size is smaller
than pt_regs :).

Interesting discussion - that will be very fun if kevent loses badly :)

> Now lets assume that a threadlet based server would have to 
> context-switch for /every single/ request served. That's totally 
> over-estimating it, even with lots of slow clients, but lets assume it, 
> to judge the worst-case impact.
> 
> So if you had to schedule once per every request served, you'd have to 
> add 1 usec to your 100 usecs cost, making it 101 usecs. That would bring 
> your 10,100 requests per sec to 10,000 requests/sec, under a threadlet 
> model of operation. Put differently: it will cost you only 1% in 
> performance to schedule once for every request. Or lets assume the task 
> is totally cache-cold and you'd have to add 4 usecs for its scheduling - 
> that'd still only be 4%. So where is the fat?

I need to move home or I will sleep on the street; otherwise I would
already have run a test and started to eat a hat (present me a red one
like Alan Cox had), or watched you do it :)

Give me several hours.

> 	Ingo

-- 
	Evgeniy Polyakov

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 18:37             ` Evgeniy Polyakov
  2007-02-25 18:34               ` Ingo Molnar
@ 2007-02-25 19:21               ` Ingo Molnar
       [not found]                 ` <20070225194645.GB1353@2ka.mipt.ru>
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 19:21 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > thx - i guess i should just run them without any options and they 
> > bind themselves to port 80? What 'ab' options are you using 
> > typically to measure them?
> 
> Yes, but they require /tmp/index.html to have http header and actual 
> data page. They do not parse http request :)
> 
> For athlon 3500 I used
> ab -c8000 -n80000 $url

how do the header portions of your /tmp/index.html data page look like?

	Ingo

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
@ 2007-02-25 19:10 Al Boldi
  0 siblings, 0 replies; 277+ messages in thread
From: Al Boldi @ 2007-02-25 19:10 UTC (permalink / raw)
  To: linux-kernel

Ingo Molnar wrote:
> if you create a threadlet based test-webserver, could you please do a
> comparable kevents implementation as well? I.e. same HTTP parser (or
> non-parser, as usually the case is with prototypes ;). Best would be
> something that one could trigger between threadlet and kevent mode,
> using the same binary :-)

Now, why would you want this?

Is there some performance issue with separately loaded binaries?


Thanks!

--
Al


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 18:09                     ` Evgeniy Polyakov
@ 2007-02-25 19:04                       ` Ingo Molnar
  2007-02-25 19:42                         ` Evgeniy Polyakov
  2007-02-25 23:14                         ` Michael K. Edwards
  0 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 19:04 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> Kevent is a _very_ small entity and there is _no_ cost of requeueing 
> (well, there is list_add guarded by lock) - after it is done, process 
> can start real work. With rescheduling there are _too_ many things to 
> be done before we can start new work. [...]

actually, no. For example a wakeup too is fundamentally a list_add 
guarded by a lock. Take a look at try_to_wake_up(). The rest you see 
there is just extra frills that relate to things like 'load-balancing 
the requests over multiple CPUs [which i'm sure kevent users would 
request in the future too]'.

> [...] We have to change registers, change address space, various tlb 
> bits and so on - we have to do it, since task describes very heavy 
> entity - the whole process. [...]

but ... 'threadlets' are called thread-lets because they are not full 
processes, they are threads. There's no TLB state in that case. There's 
indeed register state associated with them, and currently there can 
certainly be quite a bit of overhead in a context switch - but not in 
register saving. We do user-space register saving not in the scheduler 
but upon /every system call/. Fundamentally a kernel thread is just its 
EIP/ESP [on x86, similar on other architectures] - which can be 
saved/restored in near zero time. All the rest is something we added for 
good /work queueing/ reasons - and those same extras should either be 
eliminated if they turn out to be not so good reasons after all, or they 
will be wanted for kevents too eventually, once it matures as a work 
queueing solution.

> I think it is _too_ heavy to have such a monster structure like 
> task(thread/process) and related overhead just to do an IO.

i think you are really, really mistaken if you believe that the fact 
that whole tasks/threads or processes can be 'monster structures', 
somehow has any relevance to scheduling/task-queueing performance and 
scalability. It does not matter how large a task's address space is - 
scheduling only relates to the minimal context that is in the CPU. And 
most of that context we save upon /every system call entry/, and restore 
it upon every system call return. If it's so expensive to manipulate, 
why can the Linux kernel do a full system call in ~150 cycles? That's 
cheaper than the access latency to a single DRAM page.

for the same reason it has no relevance that the full kevent-based 
webserver is a 'monster structure' - still a single request's basic 
queueing operation is cheap. The same is true of tasks/threads.

Really, you dont even have to know or assume anything about the 
scheduler, just lets do some elementary math here:

the reqs/sec your sendfile+kevent based webserver can do is 7900 per 
sec. Lets assume you will write further great kevent code which will 
optimize it further and it goes up to 10,100 reqs per sec (100 usecs per 
request), ok? Then also try how many reschedules/sec can your Athlon64 
3500 box do. My guess is: about a million per second (1 usec per 
reschedule), perhaps a bit more.

Now lets assume that a threadlet based server would have to 
context-switch for /every single/ request served. That's totally 
over-estimating it, even with lots of slow clients, but lets assume it, 
to judge the worst-case impact.

So if you had to schedule once per every request served, you'd have to 
add 1 usec to your 100 usecs cost, making it 101 usecs. That would bring 
your 10,100 requests per sec to 10,000 requests/sec, under a threadlet 
model of operation. Put differently: it will cost you only 1% in 
performance to schedule once for every request. Or lets assume the task 
is totally cache-cold and you'd have to add 4 usecs for its scheduling - 
that'd still only be 4%. So where is the fat?

	Ingo

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 18:22           ` Ingo Molnar
@ 2007-02-25 18:37             ` Evgeniy Polyakov
  2007-02-25 18:34               ` Ingo Molnar
  2007-02-25 19:21               ` Ingo Molnar
  0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-25 18:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Sun, Feb 25, 2007 at 07:22:30PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > > Do you have any link where i could check the type of HTTP parsing 
> > > and send transport you are (or will be) using? What type of http 
> > > client are you using to measure, with precisely what options?
> > 
> > For example this ones (essentially the same, except that epoll and
> > kevent are used):
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
> 
> thx - i guess i should just run them without any options and they bind 
> themselves to port 80? What 'ab' options are you using typically to 
> measure them?

Yes, but they require /tmp/index.html to have http header and actual
data page. They do not parse http request :)

For athlon 3500 I used
ab -c8000 -n80000 $url

for the VIA Epia likely two or three times less.

> 	Ingo

-- 
	Evgeniy Polyakov

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 18:37             ` Evgeniy Polyakov
@ 2007-02-25 18:34               ` Ingo Molnar
  2007-02-25 20:01                 ` Frederik Deweerdt
  2007-02-25 19:21               ` Ingo Molnar
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 18:34 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > thx - i guess i should just run them without any options and they 
> > bind themselves to port 80? What 'ab' options are you using 
> > typically to measure them?
> 
> Yes, but they require /tmp/index.html to have http header and actual 
> data page. They do not parse http request :)

ok. When i connect to the epoll server via "telnet myserver 80", and 
enter a 'request', i get back the content - but the socket connection is 
not closed. Every time i type enter i get a new content back. Why is 
that so - the code seems to contain a close(fd).

	Ingo

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 18:21         ` Evgeniy Polyakov
  2007-02-25 18:22           ` Ingo Molnar
@ 2007-02-25 18:25           ` Evgeniy Polyakov
  2007-02-25 18:24             ` Ingo Molnar
  1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-25 18:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Sun, Feb 25, 2007 at 09:21:35PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > Do you have any link where i could check the type of HTTP parsing and 
> > send transport you are (or will be) using? What type of http client are 
> > you using to measure, with precisely what options?
> 
> For example this ones (essentially the same, except that epoll and
> kevent are used):
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c

Client is 'ab' with high number of connections and concurrency.
For example for athlon64 3500 I used concurrency of 8000 connections 
and 80k total.

-- 
	Evgeniy Polyakov

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 18:25           ` Evgeniy Polyakov
@ 2007-02-25 18:24             ` Ingo Molnar
  0 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 18:24 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > > Do you have any link where i could check the type of HTTP parsing 
> > > and send transport you are (or will be) using? What type of http 
> > > client are you using to measure, with precisely what options?
> > 
> > For example this ones (essentially the same, except that epoll and
> > kevent are used):
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
> 
> Client is 'ab' with high number of connections and concurrency. For 
> example for athlon64 3500 I used concurrency of 8000 connections and 
> 80k total.

ok. So it's:

   ab -c 8000 -n 80000 http://yourserver/tmp/index.html

right? How large is index.html typically - the default 40960 bytes, and 
with a constructed HTTP reply header already included in the file?

	Ingo

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 18:21         ` Evgeniy Polyakov
@ 2007-02-25 18:22           ` Ingo Molnar
  2007-02-25 18:37             ` Evgeniy Polyakov
  2007-02-25 18:25           ` Evgeniy Polyakov
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 18:22 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > Do you have any link where i could check the type of HTTP parsing 
> > and send transport you are (or will be) using? What type of http 
> > client are you using to measure, with precisely what options?
> 
> For example this ones (essentially the same, except that epoll and
> kevent are used):
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c

thx - i guess i should just run them without any options and they bind 
themselves to port 80? What 'ab' options are you using typically to 
measure them?

	Ingo

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 17:54       ` Ingo Molnar
@ 2007-02-25 18:21         ` Evgeniy Polyakov
  2007-02-25 18:22           ` Ingo Molnar
  2007-02-25 18:25           ` Evgeniy Polyakov
  0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-25 18:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Sun, Feb 25, 2007 at 06:54:37PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > > hm, what tree are you using as a base? The syslet patches are 
> > > against v2.6.20 at the moment. (the x86 PDA changes will probably 
> > > interfere with it on v2.6.21-rc1-ish kernels) Note that otherwise 
> > > the syslet/threadlet patches are for x86 only at the moment (as i 
> > > mentioned in the announcement), and the generic code itself contains 
> > > some occasional x86-ishms as well. (None of the concepts are 
> > > x86-specific though - multi-stack architectures should work just as 
> > > well as RISC-ish CPUs.)
> > 
> > It is rc1 - and crashes.
> 
> yeah. I'm not surprised. The PDA is not set up in create_async_thread() 
> for example.

Ok, I will roll back to vanilla 2.6.20 tomorrow.

> > > if you create a threadlet based test-webserver, could you please do 
> > > a comparable kevents implementation as well? I.e. same HTTP parser 
> > > (or non-parser, as usually the case is with prototypes ;). Best 
> > > would be something that one could trigger between threadlet and 
> > > kevent mode, using the same binary :-)
> > 
> > Ok, I will create such a monster tomorrow :)
> > 
> > I will use the same base for threadlet as for kevent/epoll - there is 
> > no parser, just sendfile() of the static file which contains http 
> > header and actual page.
> > 
> > threadlet1 {
> > 	accept() 
> > 	create threadlet2 {
> > 		send data
> > 	}
> > }
> > 
> > Is above scheme correct for threadlet scenario?
> 
> yep, this is a good first cut. Doing this after the listen() is useful:
> 
>         int one = 1;
> 
>         ret = setsockopt(listen_sock_fd, SOL_SOCKET, SO_REUSEADDR,
> 			 (char *)&one, sizeof(int));
> 
> and i'd suggest to do this after every accept()-ed socket:
> 
>         int flag = 1;
> 
>         setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY,
>                             (char *) &flag, sizeof(int));
> 
> Do you have any link where i could check the type of HTTP parsing and 
> send transport you are (or will be) using? What type of http client are 
> you using to measure, with precisely what options?

For example these two (essentially the same, except that one uses epoll
and the other kevent):
http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
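
For readers without those links handy, the structure Evgeniy describes (no HTTP parsing, just sendfile() of one prebuilt file that already contains the HTTP header plus the page) can be sketched roughly as below. This is a sketch in the spirit of evserver_epoll.c, not its actual code: the response path, buffer size, and one-shot connection handling are illustrative assumptions.

```c
#include <sys/epoll.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <fcntl.h>
#include <unistd.h>

/* Serve one accepted connection: drain the request without parsing it,
 * then push the prebuilt response file (HTTP header plus page in one
 * file) with a single sendfile(). */
ssize_t serve_file(int sock, const char *path)
{
	char req[4096];
	struct stat st;
	off_t off = 0;
	ssize_t sent;
	int file;

	(void)read(sock, req, sizeof(req));	/* request is ignored */
	file = open(path, O_RDONLY);
	if (file < 0)
		return -1;
	if (fstat(file, &st) < 0) {
		close(file);
		return -1;
	}
	sent = sendfile(sock, file, &off, st.st_size);
	close(file);
	return sent;
}

/* One epoll set drives both the listening socket and the clients. */
void event_loop(int lsock, const char *path)
{
	int efd = epoll_create(1024), one = 1;
	struct epoll_event ev = { .events = EPOLLIN }, evs[64];

	ev.data.fd = lsock;
	epoll_ctl(efd, EPOLL_CTL_ADD, lsock, &ev);
	for (;;) {
		int i, n = epoll_wait(efd, evs, 64, -1);

		for (i = 0; i < n; i++) {
			int fd = evs[i].data.fd;

			if (fd == lsock) {	/* new connection */
				int c = accept(lsock, NULL, NULL);

				setsockopt(c, IPPROTO_TCP, TCP_NODELAY,
					   &one, sizeof(one));
				ev.events = EPOLLIN;
				ev.data.fd = c;
				epoll_ctl(efd, EPOLL_CTL_ADD, c, &ev);
			} else {		/* request arrived */
				serve_file(fd, path);
				epoll_ctl(efd, EPOLL_CTL_DEL, fd, NULL);
				close(fd);
			}
		}
	}
}
```

The kevent variant would differ only in the event-wait and registration calls; the serving path is identical.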

> > But note, that on my athlon64 3500 test machine kevent is about 7900 
> > requests per second compared to 4000+ epoll, so expect a challenge.
> 
> single-core CPU i suspect?

Yep.

> > lighttpd is about the same 4000 requests per second though, since it 
> > cannot be easily optimized for kevents.
> 
> mean question: do you promise to post the results even if they are not 
> unfavorable to threadlets? ;-)

If they are too good, I will first search for bugs and tune my code,
but eventually of course yes.
In my blog I will post them in 'real time', even if kevent turns out to
suck unbelievably.

> if i want to test kevents on a v2.6.20 kernel base, do you have an URL 
> for me to try?

I have a git tree at (based on rc1 as requested by Andrew Morton):
http://tservice.net.ru/~s0mbre/archive/kevent/kevent.git/

Or patches at kevent homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Direct link to the latest patchset:
http://tservice.net.ru/~s0mbre/archive/kevent/kevent-37/

(order is insignificant as far as I recall, except 'compile-fix',
which must be applied last).

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 17:45                   ` Ingo Molnar
@ 2007-02-25 18:09                     ` Evgeniy Polyakov
  2007-02-25 19:04                       ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-25 18:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Sun, Feb 25, 2007 at 06:45:05PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > My main concern was only about the situation when we end up with a 
> > truly blocking context (like the network), and this results in 
> > having thousands of threads doing the work - even with most of them 
> > sleeping, there is a problem with memory overhead and context 
> > switching. It is a usable situation, but when all of them become 
> > ready at once, context switching will kill a machine even with the 
> > O(1) scheduler, which made the situation much better than before but 
> > is not a cure for the problem.
> 
> yes. This is why in the original fibril discussion i concentrated so 
> much on scheduling performance.
> 
> to me the picture is this: conceptually the scheduler runqueue is a 
> queue of work. You get items queued upon certain events, and they can 
> unqueue themselves. (there is also register context but that is already 
> optimized to death by hardware) So whatever scheduling overhead we have, 
> it's a pure software thing. It's because we have priorities attached. 
> It's because we have some legacies. Etc., etc. - it's all stuff /we/ 
> wanted to add, but nothing truly fundamental on top of the basic 'work 
> queueing' model.
> 
> now look at kevents as the queueing model. It does not queue 'tasks', it 
> lets user-space queue requests in essence, in various states. But it's 
> still the same conceptual thing: a memory buffer with some state 
> associated to it. Yes, it has no legacies, it has no priorities and 
> other queueing concepts attached to it ... yet. If kevents got 
> mainstream, it would get the same kind of pressure to grow 'more 
> advanced' event queueing and event scheduling capabilities. 
> Prioritization would be needed, etc.
> 
> So my fundamental claim is: a kernel thread /is/ our main request 
> structure. We've got tons of really good system calls that queue these 
> 'requests' around the place and offer functionality around this concept. 
> Plus there's a 1.2+ billion lines of Linux userspace code that works 
> well with this abstraction - while there's nary a few thousand lines of 
> event-based user-space code.
> 
> I also say that you'll likely get kevents outperform threadlets. Maybe 
> even significantly so under the right conditions. But i very much 
> believe we want to get similar kind of performance out of thread/task 
> scheduling, and not introduce a parallel framework to do request 
> scheduling the hard way ... just because our task concept and scheduling 
> implementation got too fat. For the same reason i didnt really like 
> fibrils: they are nice, and Zach's core idea i think nicely survived in 
> the syslet/threadlet model too, but they are more limited than true 
> threads. So doing that parallel infrastructure, which really just 
> implements the same, and is only faster because it skips features, would 
> just be hiding the problem with our primary abstraction. Ok?

Kevent is a _very_ small entity and there is _no_ cost to requeueing
(well, there is a list_add guarded by a lock) - after it is done, the
process can start real work. With rescheduling there are _too_ many
things to be done before we can start new work. We have to change
registers, change the address space, various TLB bits and so on - we
have to do all of that because a task describes a very heavy entity -
the whole process.
IO in turn is a very small subset of what a process is (and can do), so
there is no need to change the whole picture; it is enough to have one
process which does the work.

Threads are a bit lighter than processes, but still too heavy to have
one per IO - so we have pools, which decreases rescheduling overhead
but limits parallelism.

I think it is _too_ heavy to have such a monster structure as a
task (thread/process) and its related overhead just to do IO.
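
The rescheduling cost Evgeniy is weighing against a list_add can be felt directly with the classic pipe ping-pong microbenchmark: two processes bounce one byte back and forth, forcing roughly two context switches per round trip on a single CPU. This is only a rough sketch, and the per-round figure also includes read()/write() syscall overhead, so it bounds the switch cost from above.

```c
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ping-pong one byte between parent and child over a pipe pair and
 * return the average microseconds per round trip. */
double pingpong_usec(int rounds)
{
	int p1[2], p2[2], i;
	char c = 0;
	pid_t pid;
	struct timeval t0, t1;

	if (pipe(p1) < 0 || pipe(p2) < 0)
		return -1.0;
	pid = fork();
	if (pid < 0)
		return -1.0;
	if (pid == 0) {			/* child: echo every byte back */
		for (i = 0; i < rounds; i++) {
			if (read(p1[0], &c, 1) != 1)
				break;
			(void)write(p2[1], &c, 1);
		}
		_exit(0);
	}
	gettimeofday(&t0, NULL);
	for (i = 0; i < rounds; i++) {
		(void)write(p1[1], &c, 1);
		(void)read(p2[0], &c, 1);
	}
	gettimeofday(&t1, NULL);
	waitpid(pid, NULL, 0);
	return ((t1.tv_sec - t0.tv_sec) * 1e6 +
		(t1.tv_usec - t0.tv_usec)) / rounds;
}
```

On the single-core hardware discussed in this thread, numbers in the low microseconds per switch were typical; the kevent requeue path is a handful of instructions by comparison.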

> 	Ingo

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 17:44     ` Evgeniy Polyakov
@ 2007-02-25 17:54       ` Ingo Molnar
  2007-02-25 18:21         ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 17:54 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> > hm, what tree are you using as a base? The syslet patches are 
> > against v2.6.20 at the moment. (the x86 PDA changes will probably 
> > interfere with it on v2.6.21-rc1-ish kernels) Note that otherwise 
> > the syslet/threadlet patches are for x86 only at the moment (as i 
> > mentioned in the announcement), and the generic code itself contains 
> > some occasional x86-ishms as well. (None of the concepts are 
> > x86-specific though - multi-stack architectures should work just as 
> > well as RISC-ish CPUs.)
> 
> It is rc1 - and crashes.

yeah. I'm not surprised. The PDA is not set up in create_async_thread() 
for example.

> > if you create a threadlet based test-webserver, could you please do 
> > a comparable kevents implementation as well? I.e. same HTTP parser 
> > (or non-parser, as usually the case is with prototypes ;). Best 
> > would be something that one could trigger between threadlet and 
> > kevent mode, using the same binary :-)
> 
> Ok, I will create such a monster tomorrow :)
> 
> I will use the same base for threadlet as for kevent/epoll - there is 
> no parser, just sendfile() of the static file which contains http 
> header and actual page.
> 
> threadlet1 {
> 	accept() 
> 	create threadlet2 {
> 		send data
> 	}
> }
> 
> Is above scheme correct for threadlet scenario?

yep, this is a good first cut. Doing this after the listen() is useful:

        int one = 1;

        ret = setsockopt(listen_sock_fd, SOL_SOCKET, SO_REUSEADDR,
			 (char *)&one, sizeof(int));

and i'd suggest to do this after every accept()-ed socket:

        int flag = 1;

        setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY,
                            (char *) &flag, sizeof(int));

Do you have any link where i could check the type of HTTP parsing and 
send transport you are (or will be) using? What type of http client are 
you using to measure, with precisely what options?

> But note, that on my athlon64 3500 test machine kevent is about 7900 
> requests per second compared to 4000+ epoll, so expect a challenge.

single-core CPU i suspect?

> lighttpd is about the same 4000 requests per second though, since it 
> cannot be easily optimized for kevents.

mean question: do you promise to post the results even if they are not 
unfavorable to threadlets? ;-)

if i want to test kevents on a v2.6.20 kernel base, do you have an URL 
for me to try?

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-23 12:22                 ` Evgeniy Polyakov
  2007-02-23 12:41                   ` Evgeniy Polyakov
@ 2007-02-25 17:45                   ` Ingo Molnar
  2007-02-25 18:09                     ` Evgeniy Polyakov
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 17:45 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> My main concern was only about the situation when we end up with a 
> truly blocking context (like the network), and this results in having 
> thousands of threads doing the work - even with most of them 
> sleeping, there is a problem with memory overhead and context 
> switching. It is a usable situation, but when all of them become 
> ready at once, context switching will kill a machine even with the 
> O(1) scheduler, which made the situation much better than before but 
> is not a cure for the problem.

yes. This is why in the original fibril discussion i concentrated so 
much on scheduling performance.

to me the picture is this: conceptually the scheduler runqueue is a 
queue of work. You get items queued upon certain events, and they can 
unqueue themselves. (there is also register context but that is already 
optimized to death by hardware) So whatever scheduling overhead we have, 
it's a pure software thing. It's because we have priorities attached. 
It's because we have some legacies. Etc., etc. - it's all stuff /we/ 
wanted to add, but nothing truly fundamental on top of the basic 'work 
queueing' model.

now look at kevents as the queueing model. It does not queue 'tasks', it 
lets user-space queue requests in essence, in various states. But it's 
still the same conceptual thing: a memory buffer with some state 
associated to it. Yes, it has no legacies, it has no priorities and 
other queueing concepts attached to it ... yet. If kevents got 
mainstream, it would get the same kind of pressure to grow 'more 
advanced' event queueing and event scheduling capabilities. 
Prioritization would be needed, etc.

So my fundamental claim is: a kernel thread /is/ our main request 
structure. We've got tons of really good system calls that queue these 
'requests' around the place and offer functionality around this concept. 
Plus there's a 1.2+ billion lines of Linux userspace code that works 
well with this abstraction - while there's nary a few thousand lines of 
event-based user-space code.

I also say that you'll likely get kevents outperform threadlets. Maybe 
even significantly so under the right conditions. But i very much 
believe we want to get similar kind of performance out of thread/task 
scheduling, and not introduce a parallel framework to do request 
scheduling the hard way ... just because our task concept and scheduling 
implementation got too fat. For the same reason i didnt really like 
fibrils: they are nice, and Zach's core idea i think nicely survived in 
the syslet/threadlet model too, but they are more limited than true 
threads. So doing that parallel infrastructure, which really just 
implements the same, and is only faster because it skips features, would 
just be hiding the problem with our primary abstraction. Ok?

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-25 17:23   ` Ingo Molnar
@ 2007-02-25 17:44     ` Evgeniy Polyakov
  2007-02-25 17:54       ` Ingo Molnar
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-25 17:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Sun, Feb 25, 2007 at 06:23:38PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > On Wed, Feb 21, 2007 at 10:13:55PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > > this is the v3 release of the syslet/threadlet subsystem:
> > > 
> > >    http://redhat.com/~mingo/syslet-patches/
> > 
> > There is no %xgs.
> > 
> > --- ./arch/i386/kernel/process.c~	2007-02-24 22:56:14.000000000 +0300
> > +++ ./arch/i386/kernel/process.c	2007-02-24 22:53:19.000000000 +0300
> > @@ -426,7 +426,6 @@
> >  
> >  	regs.xds = __USER_DS;
> >  	regs.xes = __USER_DS;
> > -	regs.xgs = __KERNEL_PDA;
> 
> hm, what tree are you using as a base? The syslet patches are against 
> v2.6.20 at the moment. (the x86 PDA changes will probably interfere with 
> it on v2.6.21-rc1-ish kernels) Note that otherwise the syslet/threadlet 
> patches are for x86 only at the moment (as i mentioned in the 
> announcement), and the generic code itself contains some occasional 
> x86-ishms as well. (None of the concepts are x86-specific though - 
> multi-stack architectures should work just as well as RISC-ish CPUs.)

It is rc1 - and crashes.
I test on i386 via epia (the only machine which runs x86 right now).

If there are no new patches, I will create a 2.6.20 test tree
tomorrow.

> if you create a threadlet based test-webserver, could you please do a 
> comparable kevents implementation as well? I.e. same HTTP parser (or 
> non-parser, as usually the case is with prototypes ;). Best would be 
> something that one could trigger between threadlet and kevent mode, 
> using the same binary :-)

Ok, I will create such a monster tomorrow :)

I will use the same base for threadlet as for kevent/epoll - there is no
parser, just sendfile() of the static file which contains http header
and actual page.

threadlet1 {
	accept() 
	create threadlet2 {
		send data
	}
}

Is above scheme correct for threadlet scenario?
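
For comparison, the same two-level structure written with plain POSIX threads looks like the sketch below. The threadlet version would replace pthread_create() with the much cheaper async thread-fork primitive from Ingo's patches, which is exactly the overhead under debate; the response text here is a placeholder, not taken from Evgeniy's test code.

```c
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

/* "threadlet2": push a canned response down one connection. */
void *conn_fn(void *arg)
{
	int fd = (int)(long)arg;
	static const char resp[] = "HTTP/1.0 200 OK\r\n\r\nhello";

	(void)write(fd, resp, sizeof(resp) - 1);
	close(fd);
	return NULL;
}

/* "threadlet1": accept loop, spawning one worker per connection. */
void *accept_fn(void *arg)
{
	int lsock = (int)(long)arg;

	for (;;) {
		pthread_t t;
		int c = accept(lsock, NULL, NULL);

		if (c < 0)
			break;
		pthread_create(&t, NULL, conn_fn, (void *)(long)c);
		pthread_detach(t);
	}
	return NULL;
}
```

The structural point is that each connection gets its own blocking context; whether creating that context costs a pthread_create() or an async fork is what separates the two approaches.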

But note, that on my athlon64 3500 test machine kevent is about 7900
requests per second compared to 4000+ epoll, so expect a challenge.
lighttpd is about the same 4000 requests per second though, since it
cannot be easily optimized for kevents.

> 	Ingo

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-24 18:34 ` Evgeniy Polyakov
@ 2007-02-25 17:23   ` Ingo Molnar
  2007-02-25 17:44     ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 17:23 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Wed, Feb 21, 2007 at 10:13:55PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > this is the v3 release of the syslet/threadlet subsystem:
> > 
> >    http://redhat.com/~mingo/syslet-patches/
> 
> There is no %xgs.
> 
> --- ./arch/i386/kernel/process.c~	2007-02-24 22:56:14.000000000 +0300
> +++ ./arch/i386/kernel/process.c	2007-02-24 22:53:19.000000000 +0300
> @@ -426,7 +426,6 @@
>  
>  	regs.xds = __USER_DS;
>  	regs.xes = __USER_DS;
> -	regs.xgs = __KERNEL_PDA;

hm, what tree are you using as a base? The syslet patches are against 
v2.6.20 at the moment. (the x86 PDA changes will probably interfere with 
it on v2.6.21-rc1-ish kernels) Note that otherwise the syslet/threadlet 
patches are for x86 only at the moment (as i mentioned in the 
announcement), and the generic code itself contains some occasional 
x86-ishms as well. (None of the concepts are x86-specific though - 
multi-stack architectures should work just as well as RISC-ish CPUs.)

if you create a threadlet based test-webserver, could you please do a 
comparable kevents implementation as well? I.e. same HTTP parser (or 
non-parser, as usually the case is with prototypes ;). Best would be 
something that one could trigger between threadlet and kevent mode, 
using the same binary :-)

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-24 21:04                   ` Davide Libenzi
@ 2007-02-24 23:01                     ` Michael K. Edwards
  0 siblings, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-24 23:01 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper,
	Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On 2/24/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> Ok, roger that. But why are you playing "Google & Preach" games to Ingo,
> that ate bread and CPUs for the last 15 years?

Sure I used Google -- for clickable references so that lurkers can
tell I'm not making these things up as I go along.  Ingo and Alan have
obviously forgotten more about x86en than I will ever know, and I'm
carrying coals to Newcastle when I comment on pros and cons of XMM
memcpy.

But although the latest edition of the threadlet patches actually has
quite good internal documentation and makes most of its intent clear
even to a reader (me) who is unfamiliar with the code being patched,
it lacks "theory of operations".  How is an arch maintainer supposed
to adapt this interface to a completely different CPU, with different
stuff in pt_regs and different cost profiles for blown pipelines and
reloaded coprocessor state?  What are the hidden costs of this
particular style of M:N microthreading, and will they explode when
this model escapes out of the microbenchmarks and people who don't
know CPUs inside and out start using it?  What standard
thread-pool-management use cases are being glossed over at kernel
level and left to Ulrich (or implementors of JVMs and other bytecode
machines) to sort out?

At some level, I'm just along for the ride; nobody with any sense is
going to pay me to design this sort of thing, and the level of effort
involved in coding an alternate AIO implementation is not something I
can afford to expend on non-revenue-producing activities even if I did
have the skill.  Maybe half of my quibbles are sheer stupidity and
four out of five of the rest are things that Ingo has already taken
into account in v4 of his patch set.  But that would leave one quibble in
ten that has some substance, which might save some nasty rework down
the line.  Even if everything I ask about has a simple explanation,
and for Alan and Ingo to waste time spelling it out for me would
result in nothing but an accelerated "theory of operation" document,
would that be a bad thing?

Now I know very little about x86_64 other than that 64-bit code not
only has double-size integer registers to work with, it has twice as
many of them.  So for all I know the transition to pure-64-bit 2-4
core x 2-4 thread/core systems, which is going to be 90% or more of
the revenue-generating Linux market over the next few years, makes all
of my concerns moot for Ingo's purposes.  After all, as long as Linux
stays good enough to keep Oracle from losing confidence and switching
to Darwin or something, the 100 or so people who earn invites to the
kernel summit have cush jobs for life.

The rest of us would perhaps like for major proposed kernel overhauls
to be accompanied by some kind of analysis of their impact on arches
that live elsewhere in CPU parameter space.  That analysis might
suggest small design refinements that make Linux AIO scale well on the
class of processors I'm interested, too.  And I personally would like
to see Ingo get that Turing award for designing AIO semantics that are
as big an advance over the past as IEEE 754 was over its predecessors.
He'd have to earn it, though.

Cheers,
- Michael


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-24 19:52                 ` Michael K. Edwards
@ 2007-02-24 21:04                   ` Davide Libenzi
  2007-02-24 23:01                     ` Michael K. Edwards
  0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-24 21:04 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper,
	Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Jens Axboe,
	Thomas Gleixner

On Sat, 24 Feb 2007, Michael K. Edwards wrote:

> The preceding may contain errors in detail -- I am neither a CPU
> architect nor an x86 compiler writer nor even a serious kernel hacker.

Ok, roger that. But why are you playing "Google & Preach" games to Ingo, 
that ate bread and CPUs for the last 15 years?



- Davide




* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-23 12:17               ` Ingo Molnar
@ 2007-02-24 19:52                 ` Michael K. Edwards
  2007-02-24 21:04                   ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-24 19:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On 2/23/07, Ingo Molnar <mingo@elte.hu> wrote:
> > This is a fundamental misconception. [...]
>
> > The scheduler, on the other hand, has to blow and reload all of the
> > hidden state associated with force-loading the PC and wherever your
> > architecture keeps its TLS (maybe not the whole TLB, but not nothing,
> > either). [...]
>
> please read up a bit more about how the Linux scheduler works. Maybe
> even read the code if in doubt? In any case, please direct kernel newbie
> questions to http://kernelnewbies.org/, not linux-kernel@vger.kernel.org.

This is not the first kernel I've swum around in, and I've been
mucking with the Linux kernel since early 2.2 and coding assembly for
heavily pipelined processors on and off since 1990.  So I may be a
newbie to your lingo, and I may even be a loud-mouthed idiot, but I'm
not a wet-behind-the-ears undergrad, OK?

Now, I've addressed the non-free-ness of a TLS swap elsewhere; what
about function pointers in state machines (with or without flipping
"supervisor mode" bits)?  Just because loading the PC from a data
register is one opcode in the instruction stream does not mean that it
is not quite expensive in terms of blown pipeline state and I-cache
stalls.  Really fast state machines exploit PC-relative branches that
really smart CPUs can speculatively execute past (after a few
traversals) because there are a small number of branch targets
actually hit.  The instruction prefetch / scheduler unit actually
keeps a table of PC-relative jump instructions found in I-cache, with
a little histogram of destinations eventually branched to, and
speculatively executes down the top branch or two.  (Intel Pentiums
have a fairly primitive but effective variant of this; see
http://www.x86.org/articles/branch/branchprediction.htm.)

More general mechanisms are called "branch target buffers" and US
Patent 6609194 is a good hook into the literature.  A sufficiently
smart CPU designer may have figured out how to do something similar
with computed jumps (add pc, pc, foo), but odds are high that it cuts
out when you throw function pointers around.  Syscall dispatch is a
special and heavily optimized case, though -- so it's quite
conceivable that a well designed userland switch/case state machine
that makes syscalls will outperform an in-kernel state machine data
structure traversal.  If this doesn't happen to be true on today's
desktop, it may be on tomorrow's desktop or today's NUMA monstrosity
or embedded mega-multi-MIPS.
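
The contrast Michael is drawing can be made concrete with a toy state machine written both ways: a switch/case version, which compiles to a PC-relative jump the predictor can learn, and a function-pointer version where each state handler returns the next handler. Whether the indirect-call version actually loses depends entirely on the CPU's branch target buffer, so treat this as an illustration of the two dispatch styles, not a benchmark.

```c
/* Toy state machine: count words separated by spaces, written with
 * switch/case dispatch and with function-pointer dispatch. */
enum state { OUT_WORD, IN_WORD };

/* switch/case style: one hot loop, PC-relative dispatch */
int count_words_switch(const char *s)
{
	enum state st = OUT_WORD;
	int words = 0;

	for (; *s; s++) {
		switch (st) {
		case OUT_WORD:
			if (*s != ' ') {
				words++;
				st = IN_WORD;
			}
			break;
		case IN_WORD:
			if (*s == ' ')
				st = OUT_WORD;
			break;
		}
	}
	return words;
}

/* function-pointer style: each handler returns the next handler
 * (returned as void * so the two handler types can reference each
 * other; GCC accepts the function-pointer/void * round trip). */
struct ctx { int words; };
typedef void *(*handler_t)(struct ctx *, char);

static void *in_word(struct ctx *c, char ch);

static void *out_word(struct ctx *c, char ch)
{
	if (ch != ' ') {
		c->words++;
		return (void *)in_word;
	}
	return (void *)out_word;
}

static void *in_word(struct ctx *c, char ch)
{
	(void)c;
	return (void *)(ch == ' ' ? out_word : in_word);
}

int count_words_fptr(const char *s)
{
	struct ctx c = { 0 };
	handler_t h = out_word;

	for (; *s; s++)
		h = (handler_t)h(&c, *s);	/* indirect call per char */
	return c.words;
}
```

Both functions compute the same thing; the difference the thread is arguing about is purely in how the CPU predicts the per-character dispatch.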

There can also be other reasons why tabulated PC-relative jumps and
immediate PC loads are faster than PC loads from data registers.
Take, for instance, the Transmeta Crusoe, which (AIUI) used a trick
similar to the FX!32 x86 emulation on Alpha/NT.  If you're going to
"translate" CISC to RISC on the fly, you're going to recognize
switch/case idioms (including tabulated PC-relative branches), and fix
up the translated branch table to contain offsets to the
RISC-translated branch targets.  So the state transitions are just as
cheap as if they had been compiled to RISC in the first place.  Do it
with function pointers, and the the execution machine is going to have
to stall while it looks up the text location to see if it has it
translated in I-cache somewhere.  Guess what:  the PIV works the same
way (http://www.karbosguide.com/books/pcarchitecture/chapter12.htm).

Are you starting to get the picture that syslets -- clever as they
might have been on a VAX -- defeat many of the mechanisms that CPU and
compiler architects have negotiated over decades for accelerating real
code?  Especially now that we have hyper-threaded CPUs (parallel
instruction decode/issue units sharing almost all of their cache
hierarchy), you can almost treat the kernel as if it were microcode
for a syscall coprocessor.  If you try to migrate application code
across the syscall boundary, you may perform well on micro-benchmarks
but you're storing up trouble for the future.

If you don't think this kind of fallout is real, talk to whoever had
the bright idea of hijacking FPU registers to implement memcpy in
1996.  The PIII designers rolled over and added XMM so
micro-optimizers would get their dirty mitts off the FPU, which it
appears that Doug Ledford and Jim Blandy duly acted on in 1999.  Yes,
you still need to use FXSAVE/FXRSTOR when you want to mess with the
XMM stuff, but the CPU is smart enough to keep a shadow copy of all
the microstate that the flag states represent.  So if all you do
between FXSAVE and FXRSTOR is shlep bytes around with MOVAPS, the
FXRSTOR costs you little or nothing.  What hurts is an FXRSTOR from a
location that isn't the last location you FXSAVEd to, or an FXRSTOR
after actual FP arithmetic instructions have altered status flags.

The preceding may contain errors in detail -- I am neither a CPU
architect nor an x86 compiler writer nor even a serious kernel hacker.
 But hopefully it's at least food for thought.  If not, you know where
the "ignore this prolix nitwit" key is to be found on your keyboard.

Cheers,
- Michael


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-21 21:13 Ingo Molnar
  2007-02-21 22:46 ` Michael K. Edwards
  2007-02-22 10:01 ` Suparna Bhattacharya
@ 2007-02-24 18:34 ` Evgeniy Polyakov
  2007-02-25 17:23   ` Ingo Molnar
  2 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-24 18:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Wed, Feb 21, 2007 at 10:13:55PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> this is the v3 release of the syslet/threadlet subsystem:
> 
>    http://redhat.com/~mingo/syslet-patches/

There is no %xgs.

--- ./arch/i386/kernel/process.c~	2007-02-24 22:56:14.000000000 +0300
+++ ./arch/i386/kernel/process.c	2007-02-24 22:53:19.000000000 +0300
@@ -426,7 +426,6 @@
 
 	regs.xds = __USER_DS;
 	regs.xes = __USER_DS;
-	regs.xgs = __KERNEL_PDA;
 	regs.orig_eax = -1;
 	regs.eip = (unsigned long) async_thread_helper;
 	regs.xcs = __KERNEL_CS | get_kernel_rpl();


-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-24  0:51                         ` Michael K. Edwards
  2007-02-24  2:17                           ` Michael K. Edwards
@ 2007-02-24  3:25                           ` Michael K. Edwards
  1 sibling, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-24  3:25 UTC (permalink / raw)
  To: Alan
  Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On 2/23/07, Michael K. Edwards <medwards.linux@gmail.com> wrote:
> which costs you a D-cache stall.)  Now put an sprintf with a %d in it
> between a couple of the syscalls, and _your_ arch is hurting.  ...

er, that would be a %f.  :-)

Cheers,
- Michael


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-24  0:51                         ` Michael K. Edwards
@ 2007-02-24  2:17                           ` Michael K. Edwards
  2007-02-24  3:25                           ` Michael K. Edwards
  1 sibling, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-24  2:17 UTC (permalink / raw)
  To: Alan
  Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

I wrote:
> (On a pre-EABI ARM, there is even a substantial
> cache-related penalty for encoding the syscall number in the syscall
> opcode, because you have to peek back at the text segment to see it,
> which costs you a D-cache stall.)

Before you say it, I'm aware that this is not directly relevant to TLS
switch costs, except insofar as the "arch-dependent syscalls"
introduced for certain parts of ARM TLS handling carry the same
overhead as any other syscall.  My point is that the system impact of
seemingly benign operations is not always predictable even to the arch
experts, and therefore one should be "parsimonious" (to use Kahan's
word) in defining what semantics programmers may rely on in
performance-critical situations.

If you arrange things so that threadlets are scheduled as much as
possible in bursts that share the same processor context (process
context, location in program text, TLS arena, FPU state -- basically
everything other than stack and integer registers), you are giving
yourself and future designers the maximum opportunity for exploiting
hardware optimizations.  This would be a good thing if you want
threadlets to be performance-competitive with state machine designs.

If you still allow application programmers to _use_ shared processor
state, in the knowledge that it will be clobbered on threadlet switch,
then threadlets can use most of the coding style with which
programmers of event-driven frameworks are familiar.  This would be a
good thing if you want threadlets to get wider use than the innards of
three or four databases and web servers.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-23 23:49                     ` Michael K. Edwards
@ 2007-02-24  1:08                       ` Alan
  2007-02-24  0:51                         ` Michael K. Edwards
  0 siblings, 1 reply; 277+ messages in thread
From: Alan @ 2007-02-24  1:08 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

> long my_threadlet_fn(void *data)
> {
>        char *name = data;
>        int fd;
> 
>        fd = open(name, O_RDONLY);
>        if (fd < 0)
>                goto out;
> 
>        fstat(fd, &stat);
        read(fd, buf, count);
>        ...
> 
> out:
>        return threadlet_complete();
> }
> 
> You're telling me that runs entirely in kernel space when open()
> blocks, and doesn't touch errno if fstat() fails?  Now who hasn't read
> the code?

That example touches back into user space, but doesn't involve MMU changes,
cache flushes, TLB flushes, or floating point.

errno is thread-specific if you use it, but errno is, as I said before,
entirely a C library detail that you don't have to suffer if you don't
want to. Avoiding it saves a segment register load - which isn't too
costly, but isn't free.

Alan

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-24  1:08                       ` Alan
@ 2007-02-24  0:51                         ` Michael K. Edwards
  2007-02-24  2:17                           ` Michael K. Edwards
  2007-02-24  3:25                           ` Michael K. Edwards
  0 siblings, 2 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-24  0:51 UTC (permalink / raw)
  To: Alan
  Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

Thanks for taking me at least minimally seriously, Alan.  Pretty
generous of you, all things considered.

On 2/23/07, Alan <alan@lxorguk.ukuu.org.uk> wrote:
> That example touches back into user space, but doesn't involve MMU changes,
> cache flushes, TLB flushes, or floating point.

True -- on an architecture where a change of TLS does not
substantially affect the TLB and cache, which (AIUI) it does on most
or all ARMs.  (On a pre-EABI ARM, there is even a substantial
cache-related penalty for encoding the syscall number in the syscall
opcode, because you have to peek back at the text segment to see it,
which costs you a D-cache stall.)  Now put an sprintf with a %d in it
between a couple of the syscalls, and _your_ arch is hurting.  Deny
the userspace programmer the use of the FPU in threadlets, and they
become a lot less widely applicable -- and a lot flakier in a
non-wizard's hands, given that people often cheat around the small
number of x86 integer registers by using FP registers when copying
memory in bulk.

> errno is thread-specific if you use it, but errno is, as I said before,
> entirely a C library detail that you don't have to suffer if you don't
> want to. Avoiding it saves a segment register load - which isn't too
> costly, but isn't free.

On your arch, it's a segment register -- and another
who-knows-how-many pages to migrate along with the stack and pt_regs.
On ARM, it's a coprocessor register that is incorrectly emulated by
most JTAG emulators (so bye-bye JTAG-assisted debugging and
profiling), or possibly a register stolen from the general purpose
register set.  On some MIPSes I have known you probably can't
implement TLS safely without a cache flush.

If you tell people up front not to touch TLS in threadlets -- which
means not to use routines from <stdlib.h> and <stdio.h> -- then
implementors may have enough flexibility to make them perform well on
a wide range of architectures.  Alternately, if there are some things
that threadlet users will genuinely need TLS for, you can tell them
that all of the threadlets belonging to process X on CPU Y share a TLS
context, and therefore things like errno can't be trusted across a
syscall -- but then you had better make fairly sure that threadlets
aren't preempted by other threadlets in between syscalls.  Similar
arguments apply to FPU state.

IEEE 754.  Harp, harp.  :-)

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-23 12:37                   ` Alan
@ 2007-02-23 23:49                     ` Michael K. Edwards
  2007-02-24  1:08                       ` Alan
  0 siblings, 1 reply; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-23 23:49 UTC (permalink / raw)
  To: Alan
  Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On 2/23/07, Alan <alan@lxorguk.ukuu.org.uk> wrote:
> > Do you not understand that real user code touches FPU state at
> > unpredictable (to the kernel) junctures?  Maybe not in a database or a
>
> We don't care. We don't have to care. The kernel threadlets don't execute
> in user space and don't do FP.

Blocked threadlets go back out to userspace, as new threads, after the
first blocking syscall completes.  That's how Ingo described them in
plain English, that's how his threadlet example would have to work,
and that appears to be what his patches actually do.

> > web server, but in the GUIs and web-based monitoring applications that
> > are 99% of the potential customers for kernel AIO?  I have no idea
> > what a %cr3 is, but if you don't fence off thread-local stuff from the
>
> How about you go read the intel architecture manuals then you might know
> more.

Y'know, there's more to life than x86.  I'm no MMU expert, but I know
enough about ARM TLS and ptrace to have fixed ltrace -- not that that
took any special wizardry, just a need for it to work and some basic
forensic skill.  If you want me to go away completely or not follow up
henceforth on anything you write, say so, and I'll decide what to do
in response.  Otherwise, you might consider evaluating whether there's
a way to interpret my comments so that they reflect a perspective that
does not overlap 100% with yours rather than total idiocy.

> Last time I checked, glibc was in userspace, and the interface for kernel
> AIO is a matter for the kernel, so errno is irrelevant; plus any
> threadlets doing system calls will only be living in kernel space anyway.

Ingo's original example code:

long my_threadlet_fn(void *data)
{
       char *name = data;
       int fd;

       fd = open(name, O_RDONLY);
       if (fd < 0)
               goto out;

       fstat(fd, &stat);
       read(fd, buf, count);
       ...

out:
       return threadlet_complete();
}

You're telling me that runs entirely in kernel space when open()
blocks, and doesn't touch errno if fstat() fails?  Now who hasn't read
the code?

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-23 18:01                     ` Evgeniy Polyakov
@ 2007-02-23 20:43                       ` Davide Libenzi
  0 siblings, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-23 20:43 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Fri, 23 Feb 2007, Evgeniy Polyakov wrote:

> I was not clear - I meant why do we need to do that when we can run the
> same code in userspace? And better if we can have non-blocking dataflows
> and number of threads equal to number of processors...

I have a userspace library that does exactly that (GUASI - GPL code available
if you want, but no man page written yet). It uses a pool of threads and
queues requests. I have a benchmark that crawls through a directory and reads
files. The sync-versus-async performance sucks. You can't do the cachehit
optimization in userspace. With network stuff it could probably do better
(since network is more heavily skewed towards async), but still.




> I started a week of writing without a Russian-English dictionary, so
> expect some trouble communicating with me :)
> 
> I said that about the kernel design - when we have thousand(s) of threads
> doing the work - if the number of context switches is small (i.e. when
> operations mostly do not block), then it is ok (although 'ps' output
> with that many threads can scare a grandma).
> It is also ok to say - 'hey, Linux has such an easy AIO model that
> everyone should switch and start using it and not care about the problems
> associated with multi-threaded programming with high concurrency',
> but, in my opinion, both of those cases cannot cover all (or even most of)
> the usage cases.
> 
> To eat my hat (or force others to do the same) I'm preparing a tree for
> a threadlet test - I plan to write a trivial web server
> (accept/recv/send/sendfile in one threadlet function) and give it a try
> soon.

Funny, I lazily started doing the same thing last weekend (then I had to
stop, since the real job kicked in ;). I wanted to compare a fully MT trivial
HTTP server:

http://www.xmailserver.org/thrhttp.c

with one that is event-driven (epoll) and coroutine based. The latter will
only be compared for memory-content delivery, since it has no async vfs
capabilities. They both support the special "/mem-XXXX" url, which allows
an HTTP loader to request a given content size.
I also have an epoll+coroutine HTTP loader (that works around httperf
limitations).
Then, I wanted to compare the above with one that is epoll+GUASI+coroutine
based (basically a userspace-only thingy).
I have the code for all the above.
Finally, with one that is epoll+syslet+coroutine based (no code for this
yet - but it should be an easy port from the GUASI one).
Keep in mind, though, that a threadlet solution doing accept/recv/send/sendfile
becomes blazingly similar to a full MT solution.
I can only imagine the thunders and flames that Larry would throw at us
for using all those threads :D




- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-23 17:43                   ` Davide Libenzi
@ 2007-02-23 18:01                     ` Evgeniy Polyakov
  2007-02-23 20:43                       ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-23 18:01 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Fri, Feb 23, 2007 at 09:43:14AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Fri, 23 Feb 2007, Evgeniy Polyakov wrote:
> 
> > On Thu, Feb 22, 2007 at 11:46:48AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > > 
> > > A dynamic pool will smooth thread creation/freeing up by a lot.
> > > And, in my box a *pthread* create/free takes ~10us, at 1000/s is 10ms, 1%. 
> > > Bad, but not so awful ;)
> > > Look, I'm *definitely* not trying to advocate the use of async syscalls for 
> > > network here, just pointing out that when we're talking about threads, 
> > > Linux does a pretty good job.
> >  
> > If we are going to create 1000 threads each second, then it is better to
> > preallocate them and queue work to that pool - like syslets did with
> > syscalls - rather than ultimately create a new thread just because it is
> > not that slow.
> 
> We do create a pool indeed, as I said in the opening of my answer. The
> numbers I posted were just to show that thread creation/destruction is
> pretty fast, but that does not justify it as a design choice.
 
I was not clear - I meant why do we need to do that when we can run the
same code in userspace? And better if we can have non-blocking dataflows
and number of threads equal to number of processors...
 
> All such micro-thread designs are especially good when:
> 1. switching is _rare_ (very)
> 2. the programmer does not want to create a complex model to achieve
> maximum performance
> 
> Disk (cached) IO definitely hits the first case, and the second one is
> there for advertisement and fast deployment, but overall usage of the
> asynchronous IO model is not limited to the above scenario, so
> micro-threads definitely fill their own niche, but they cannot cover all
> usage cases.
> 
> You know, I read this a few times, but I still don't get what your point 
> is here ;) Are you talking about micro-thread design in the kernel as for 
> kthreads usage for AIO, or about userspace?
 
I started a week of writing without a Russian-English dictionary, so
expect some trouble communicating with me :)

I said that about the kernel design - when we have thousand(s) of threads
doing the work - if the number of context switches is small (i.e. when
operations mostly do not block), then it is ok (although 'ps' output
with that many threads can scare a grandma).
It is also ok to say - 'hey, Linux has such an easy AIO model that
everyone should switch and start using it and not care about the problems
associated with multi-threaded programming with high concurrency',
but, in my opinion, both of those cases cannot cover all (or even most of)
the usage cases.

To eat my hat (or force others to do the same) I'm preparing a tree for
a threadlet test - I plan to write a trivial web server
(accept/recv/send/sendfile in one threadlet function) and give it a try
soon.

> - Davide
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-23 12:15                 ` Evgeniy Polyakov
@ 2007-02-23 17:43                   ` Davide Libenzi
  2007-02-23 18:01                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-23 17:43 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Fri, 23 Feb 2007, Evgeniy Polyakov wrote:

> On Thu, Feb 22, 2007 at 11:46:48AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > 
> > A dynamic pool will smooth thread creation/freeing up by a lot.
> > And, in my box a *pthread* create/free takes ~10us, at 1000/s is 10ms, 1%. 
> > Bad, but not so awful ;)
> > Look, I'm *definitely* not trying to advocate the use of async syscalls for 
> > network here, just pointing out that when we're talking about threads, 
> > Linux does a pretty good job.
>  
> If we are going to create 1000 threads each second, then it is better to
> preallocate them and queue work to that pool - like syslets did with
> syscalls - rather than ultimately create a new thread just because it is
> not that slow.

We do create a pool indeed, as I said in the opening of my answer. The
numbers I posted were just to show that thread creation/destruction is
pretty fast, but that does not justify it as a design choice.



> All such micro-thread designs are especially good when:
> 1. switching is _rare_ (very)
> 2. the programmer does not want to create a complex model to achieve
> maximum performance
> 
> Disk (cached) IO definitely hits the first case, and the second one is
> there for advertisement and fast deployment, but overall usage of the
> asynchronous IO model is not limited to the above scenario, so
> micro-threads definitely fill their own niche, but they cannot cover all
> usage cases.

You know, I read this a few times, but I still don't get what your point 
is here ;) Are you talking about micro-thread design in the kernel as for 
kthreads usage for AIO, or about userspace?



- Davide



^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 14:36               ` Ingo Molnar
@ 2007-02-23 14:23                 ` Suparna Bhattacharya
  0 siblings, 0 replies; 277+ messages in thread
From: Suparna Bhattacharya @ 2007-02-23 14:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Davide Libenzi, Jens Axboe,
	Thomas Gleixner

On Thu, Feb 22, 2007 at 03:36:58PM +0100, Ingo Molnar wrote:
> 
> * Suparna Bhattacharya <suparna@in.ibm.com> wrote:
> 
> > > maybe it will, maybe it won't. Let's try? There is no true difference
> > > between having a 'request structure' that represents the current 
> > > state of the HTTP connection plus a statemachine that moves that 
> > > request between various queues, and a 'kernel stack' that goes in 
> > > and out of runnable state and carries its processing state in its 
> > > stack - other than the amount of RAM they take. (the kernel stack is 
> > > 4K at a minimum - so with a million outstanding requests they would 
> > > use up 4 GB of RAM. With 20k outstanding requests it's 80 MB of RAM 
> > > - that's acceptable.)
> > 
> > At what point are the cachemiss threads destroyed ? In other words how 
> > well does this adapt to load variations ? For example, would this 80MB 
> > of RAM continue to be locked down even during periods of lighter loads 
> > thereafter ?
> 
> you can destroy them at will from user-space too - just start a slow 
> timer that zaps them if load goes down. I can add a 
> sys_async_thread_exit(nr_threads) API to be able to drive this without 
> knowing the TIDs of those threads, and/or i can add a kernel-internal 
> mechanism to zap inactive threads. It would be rather easy and 
> low-overhead - the v2 code already had a max_nr_threads tunable, i can 
> reintroduce it. So the size of the pool of contexts does not have to be 
> permanent at all.

If you can find a way to do this without an additional tunables burden on
the administrator, that would certainly help! IIRC, performance problems
linked to having too many or too few AIO kernel threads have been a commonly
reported issue elsewhere - it would be nice to be able to avoid repeating
the crux of that (mistake) in Linux. To me, any need to manually tune the
number has always seemed to defeat the very benefit of adaptability to
varying loads that AIO intrinsically provides.

Regards
Suparna

> 
> 	Ingo

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-23 12:22                 ` Evgeniy Polyakov
@ 2007-02-23 12:41                   ` Evgeniy Polyakov
  2007-02-25 17:45                   ` Ingo Molnar
  1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-23 12:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Fri, Feb 23, 2007 at 03:22:25PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> I meant that we end up with one thread per IO - they were
> preallocated, but that does not matter. And what about your idea of
> switching userspace threads to cachemiss threads?
> 
> My main concern was only about the situation when we end up with a truly
> blocking context (like network), and this results in thousands of
> threads doing the work - even with most of them sleeping, there is a
> problem with memory overhead and context switching, although it is a usable
> situation, but when all of them are ready immediately - context switching 
                                            simultaneously
> will kill a machine even with O(1) scheduler which made situation damn
> better than before, but it is not a cure for the problem.

The week of no-dictionary writing is starting to beat me.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-23  2:47                 ` Michael K. Edwards
  2007-02-23  8:31                   ` Michael K. Edwards
  2007-02-23 10:22                   ` Ingo Molnar
@ 2007-02-23 12:37                   ` Alan
  2007-02-23 23:49                     ` Michael K. Edwards
  2 siblings, 1 reply; 277+ messages in thread
From: Alan @ 2007-02-23 12:37 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

> Do you not understand that real user code touches FPU state at
> unpredictable (to the kernel) junctures?  Maybe not in a database or a

We don't care. We don't have to care. The kernel threadlets don't execute
in user space and don't do FP. 

> web server, but in the GUIs and web-based monitoring applications that
> are 99% of the potential customers for kernel AIO?  I have no idea
> what a %cr3 is, but if you don't fence off thread-local stuff from the

How about you go read the intel architecture manuals then you might know
more.

> > We don't have an errno in the kernel because its a stupid idea. Errno is
> > a user space hack for compatibility with 1970's bad design. So its not
> > relevant either.
> 
> Dude, it's thread-local, and the glibc wrapper around most synchronous

Last time I checked glibc was in userspace and the interface for kernel
AIO is a matter for the kernel so errno is irrelevant, plus any
threadlets doing system calls will only be living in kernel space anyway.


^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-23 11:51               ` Ingo Molnar
@ 2007-02-23 12:22                 ` Evgeniy Polyakov
  2007-02-23 12:41                   ` Evgeniy Polyakov
  2007-02-25 17:45                   ` Ingo Molnar
  0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-23 12:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Fri, Feb 23, 2007 at 12:51:52PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > [...] Those 20k blocked requests were created in about 20 seconds, so
> > roughly speaking we have 1k of thread creation/freeing per second - do
> > we want this?
> 
> i'm not sure why you mention thread creation and freeing. The 
> syslet/threadlet code reuses already created async threads, and that is 
> visible all around in both the kernel-space and in the user-space 
> syslet/threadlet code.
> 
> While Linux creates+destroys threads pretty damn fast (in about 10-15 
> usecs - which is roughly the cost of getting a single 1-byte packet 
> through a TCP socket from one process to another, on localhost), still 
> we don't want to create and destroy a thread per request.

I meant that we end up with one thread per IO - they were
preallocated, but that does not matter. And what about your idea of
switching userspace threads to cachemiss threads?

My main concern was only about the situation when we end up with a truly
blocking context (like network), and this results in thousands of
threads doing the work - even with most of them sleeping, there is a
problem with memory overhead and context switching. It is a usable
situation, but when all of them are ready immediately - context switching
will kill a machine, even with the O(1) scheduler, which made the situation
damn better than before but is not a cure for the problem.

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 21:24             ` Michael K. Edwards
  2007-02-23  0:30               ` Alan
@ 2007-02-23 12:17               ` Ingo Molnar
  2007-02-24 19:52                 ` Michael K. Edwards
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-23 12:17 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Michael K. Edwards <medwards.linux@gmail.com> wrote:

> On 2/22/07, Ingo Molnar <mingo@elte.hu> wrote:

> > maybe it will, maybe it won't. Let's try? There is no true difference
> > between having a 'request structure' that represents the current 
> > state of the HTTP connection plus a statemachine that moves that 
> > request between various queues, and a 'kernel stack' that goes in 
> > and out of runnable state and carries its processing state in its 
> > stack - other than the amount of RAM they take. (the kernel stack is 
> > 4K at a minimum - so with a million outstanding requests they would 
> > use up 4 GB of RAM. With 20k outstanding requests it's 80 MB of RAM 
> > - that's acceptable.)
> 
> This is a fundamental misconception. [...]

> The scheduler, on the other hand, has to blow and reload all of the 
> hidden state associated with force-loading the PC and wherever your 
> architecture keeps its TLS (maybe not the whole TLB, but not nothing, 
> either). [...]

please read up a bit more about how the Linux scheduler works. Maybe 
even read the code if in doubt? In any case, please direct kernel newbie 
questions to http://kernelnewbies.org/, not linux-kernel@vger.kernel.org.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 19:46               ` Davide Libenzi
@ 2007-02-23 12:15                 ` Evgeniy Polyakov
  2007-02-23 17:43                   ` Davide Libenzi
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-23 12:15 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Thu, Feb 22, 2007 at 11:46:48AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > I tried already :) - I just made allocations atomic in tcp_sendmsg() and
> > ended up with 1/4 of the sends blocking (I counted both allocation
> > failure and socket queue overflow). Those 20k blocked requests were
> > created in about 20 seconds, so roughly speaking we have 1k of thread
> > creation/freeing per second - do we want this?
> 
> A dynamic pool will smooth thread creation/freeing up by a lot.
> And, in my box a *pthread* create/free takes ~10us, at 1000/s is 10ms, 1%. 
> Bad, but not so awful ;)
> Look, I'm *definitely* not trying to advocate the use of async syscalls for 
> network here, just pointing out that when we're talking about threads, 
> Linux does a pretty good job.
 
If we are going to create 1000 threads each second, then it is better to
preallocate them and queue work to that pool - like syslets did with
syscalls - rather than ultimately create a new thread just because it is
not that slow.

All such micro-thread designs are especially good when:
1. switching is _rare_ (very)
2. the programmer does not want to create a complex model to achieve
maximum performance

Disk (cached) IO definitely hits the first case, and the second one is
there for advertisement and fast deployment, but overall usage of the
asynchronous IO model is not limited to the above scenario, so
micro-threads definitely fill their own niche, but they cannot cover all
usage cases.
 
> 
> - Davide
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 13:32             ` Evgeniy Polyakov
  2007-02-22 19:46               ` Davide Libenzi
@ 2007-02-23 11:51               ` Ingo Molnar
  2007-02-23 12:22                 ` Evgeniy Polyakov
  1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-23 11:51 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> [...] Those 20k blocked requests were created in about 20 seconds, so
> roughly speaking we have 1k of thread creation/freeing per second - do
> we want this?

i'm not sure why you mention thread creation and freeing. The 
syslet/threadlet code reuses already created async threads, and that is 
visible all around in both the kernel-space and in the user-space 
syslet/threadlet code.

While Linux creates+destroys threads pretty damn fast (in about 10-15 
usecs - which is roughly the cost of getting a single 1-byte packet 
through a TCP socket from one process to another, on localhost), still 
we don't want to create and destroy a thread per request.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 17:17                       ` David Miller
@ 2007-02-23 11:12                         ` Ingo Molnar
  0 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-23 11:12 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
	zach.brown, suparna, davidel, jens.axboe, tglx


* David Miller <davem@davemloft.net> wrote:

> > furthermore, in a real webserver there's a whole lot of other stuff 
> > happening too: VFS blocking, mutex/lock blocking, memory pressure 
> > blocking, filesystem blocking, etc., etc. Threadlets/syslets cover 
> > them /all/ and never hold up the primary context: as long as there's 
> > more requests to process, they will be processed. Plus other 
> important networked workloads, like fileservers, are typically on
> > fast LANs and those requests are very much a fire-and-forget matter 
> > most of the time.
> 
> I expect clients of a fileserver to cause the server to block in 
> places such as tcp_sendmsg() as much if not more so than a webserver
> :-)

yeah, true. But ... i'd still like to mildly disagree with the
characterisation that, because blocking is the norm in networking, this
goes against the concept of syslets/threadlets.

Networking has a 'work cache' too, in an abstract sense, which works in 
favor of syslets: the socket buffers. If there is a reasonably sized 
output queue for sockets (not extremely small like 4k per socket, but 
at least something like 16k), then user-space can chunk up its workload 
along pretty reasonable lines without assuming too much, and turn one 
such 'chunk' into one atomic step done in the 'syslet/threadlet 
request'. In the rare cases where blocking happens in an unexpected 
way, the syslet/threadlet 'goes async', but that only holds up that 
particular request, and only for that chunk - not the main loop of 
processing - and it doesn't force the request into an async thread 
forever.

the kevent model is very much about: /never ever/ block the main 
thread. If you ever block, performance goes down the drain.

the syslet performance model is: if you block less than, say, 10-20% of 
the time, you are basically as fast as the most extreme kevents-based 
model. Syslets/threadlets also avoid, in a natural way, the fundamental 
workflow problem that nonblocking designs have: 'how do we guarantee 
that the system progresses forward?' - the problem that makes 
nonblocking code quite fragile.

another property is that the performance curve is a lot less sensitive 
to blocking in the syslet model - and real user-space servers will 
always have unexpected points of blockage - unless all of userspace 
code is perfectly converted into state machines, which is not 
reasonable. So with syslets we are not forced to program everything as 
state-machines in user-space; such techniques are only needed to reduce 
the amount of context-switching and the RAM footprint - they won't 
change fundamental scalability.

plus there's the hidden advantage of having a 'constructed state' on 
the kernel stack: a thread that blocks in the middle of tcp_sendmsg() 
has quite some state: the socket has been looked up in the hash(es), 
all input parameters have been validated, the timer has been set, skbs 
have been allocated ahead, etc. Those things do add up, especially 
since after a long wait all those things would otherwise be scattered 
around the memory map cache-cold - not nicely composed into a single 
block of memory on the stack (generating only a single cachemiss, in 
essence).

All in all, i'm cautiously optimistic that even a totally naive, 
blocks-itself-for-every-request syslet application would perform pretty 
close to a Tux/kevents type of nonblock+event-queueing based 
application - with a vastly higher generic utility benefit. So i'm not 
dismissing this possibility, and i'm not writing off syslet performance 
just because syslets do context-switches :-)

	Ingo

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-23  2:47                 ` Michael K. Edwards
  2007-02-23  8:31                   ` Michael K. Edwards
@ 2007-02-23 10:22                   ` Ingo Molnar
  2007-02-23 12:37                   ` Alan
  2 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-23 10:22 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Alan, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner


* Michael K. Edwards <medwards.linux@gmail.com> wrote:

> On 2/22/07, Alan <alan@lxorguk.ukuu.org.uk> wrote:

> > We don't use the FPU in the kernel except in very weird cases where 
> > it makes an enormous performance difference. The threadlets also 
> > have the same page tables so they have the same %cr3 so its very 
> > cheap to switch, basically a predicted jump and some register loads
> 
> Do you not understand that real user code touches FPU state at 
> unpredictable (to the kernel) junctures?  Maybe not in a database or a 
> web server, but in the GUIs and web-based monitoring applications that 
> are 99% of the potential customers for kernel AIO?
> I have no idea what a %cr3 is, [...]

then please stop wasting Alan's time ...

	Ingo

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-23  2:47                 ` Michael K. Edwards
@ 2007-02-23  8:31                   ` Michael K. Edwards
  2007-02-23 10:22                   ` Ingo Molnar
  2007-02-23 12:37                   ` Alan
  2 siblings, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-23  8:31 UTC (permalink / raw)
  To: Alan
  Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

OK, having skimmed through Ingo's code once now, I can already see I
have some crow to eat.  But I still have some marginally less stupid
questions.

Cachemiss threads are created with CLONE_VM | CLONE_FS | CLONE_FILES |
CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM.  Does that mean they
share thread-local storage with the userspace thread, have
thread-local storage of their own, or have no thread-local storage
until NPTL asks for it?

When the kernel zeroes the userspace stack pointer in
cachemiss_thread(), presumably the allocation of a new userspace stack
page is postponed until that thread needs to resume userspace
execution (after completion of the first I/O that missed cache).  When
do you copy the contents of the threadlet function's stack frame into
this new stack page?

Is there anything in a struct pt_regs that is expensive to restore
(perhaps because it flushes a pipeline or cache that wasn't already
flushed on syscall entry)?  Is there any reason why the FPU context
has to differ among threadlets that have blocked while executing the
same userspace function with different stacks?  If the TLS pointer
isn't in either of these, where is it, and why doesn't
move_user_context() swap it?

If you set out to cancel one of these threadlets, how are you going to
ensure that it isn't holding any locks?  Is there any reasonable way
to implement a userland finally { } block so that you can release
malloc'd memory and clean up application data structures?

If you want to migrate a threadlet to another CPU on syscall entry
and/or exit, what has to travel other than the userspace stack and the
struct pt_regs?  (I am assuming a quiesced FPU and thread(s) at the
destination with compatible FPU flags.)  Does it make sense for the
userspace stack page to have space reserved for a struct pt_regs
before the threadlet stack frame, so that the entire userspace
threadlet state migrates as one page?

I now see that an effort is already made to schedule threadlets in
bursts, grouped by PID, when several have unblocked since the last
timeslice.  What is the transition cost from one threadlet to another?
Can that transition cost be made lower by reducing the amount of
state that belongs to the individual threadlet vs. the pool of
cachemiss threads associated with that threadlet entrypoint?

Generally, is there a "contract" that could be made between the
threadlet application programmer and the implementation which would
allow, perhaps in future hardware, the kind of invisible pipelined
coprocessing for AIO that has been so successful for FP?

I apologize for having adopted a hostile tone in a couple of previous
messages in this thread; remind me in the future not to alternate
between thinking about code and about the FSF.  :-)  I do really like
a lot of things about the threadlet model, and would rather not see it
given up on for network I/O and NUMA systems.  So I'm going to
reiterate again -- more politely this time -- the need for a
data-structure-centric threadlet pool abstraction that supports
request throttling, reprioritization, bulk cancellation, and migration
of individual threadlets to the node nearest the relevant I/O port.

I'm still not sold on syslets as anything userspace-visible, but I
could imagine them enabling a sort of functional syntax for chaining
I/O operations, with most failures handled as inline "Not-a-Pointer"
values or as "AEIOU" (asynchronously executed I/O unit?) exceptions
instead of syscall-test-branch-syscall-test-branch.  Actually working
out the semantics and getting them adopted as an IEEE standard could
even win someone a Turing award.  :-)

Cheers,
- Michael

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-23  0:30               ` Alan
@ 2007-02-23  2:47                 ` Michael K. Edwards
  2007-02-23  8:31                   ` Michael K. Edwards
                                     ` (2 more replies)
  0 siblings, 3 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-23  2:47 UTC (permalink / raw)
  To: Alan
  Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On 2/22/07, Alan <alan@lxorguk.ukuu.org.uk> wrote:
> > to do anything but chase pointers through cache.  Done right, it
> > hardly even branches (although the branch misprediction penalty is a
> > lot less of a worry on current x86_64 than it was in the
> > mega-superscalar-out-of-order-speculative-execution days).  It's damn
>
> Actually it costs a lot more on at least one vendors processor because
> you stall very long pipelines.

You're right; I overreached there.  I haven't measured branch
misprediction penalties in dog's years (I focus more on system latency
issues these days), so I'm just going on rumor.  If your CPU vendor is
still playing the tune-for-SpecINT-at-the-expense-of-real-code game
(*cough* Itanic *cough*), get another CPU vendor -- while you still
can.

> > threadlets promise that they will not touch anything thread-local, and
> > that when the FPU is handed to them in a specific, known state, they
> > leave it in that same state.  (Some of the flags can be
>
> We don't use the FPU in the kernel except in very weird cases where it
> makes an enormous performance difference. The threadlets also have the
> same page tables so they have the same %cr3 so its very cheap to switch,
> basically a predicted jump and some register loads

Do you not understand that real user code touches FPU state at
unpredictable (to the kernel) junctures?  Maybe not in a database or a
web server, but in the GUIs and web-based monitoring applications that
are 99% of the potential customers for kernel AIO?  I have no idea
what a %cr3 is, but if you don't fence off thread-local stuff from the
threadlets you are just begging for end-user Heisenbugs and a place in
the dustheap of history next to Symbolics LISP.

> > Do me a favor.  Do some floating point math and a memcpy() in between
> > syscalls in the threadlet.  Actually fiddle with errno and the FPU
>
> We don't have an errno in the kernel because its a stupid idea. Errno is
> a user space hack for compatibility with 1970's bad design. So its not
> relevant either.

Dude, it's thread-local, and the glibc wrapper around most synchronous
syscalls touches it.  If you don't instantiate a new TLS context (or
whatever the right lingo for that is) for every threadlet, you are
TOAST -- if you let the user call stuff out of <stdlib.h> (let alone
<stdio.h>) from within the threadlet.

Cheers,
- Michael

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 21:24             ` Michael K. Edwards
@ 2007-02-23  0:30               ` Alan
  2007-02-23  2:47                 ` Michael K. Edwards
  2007-02-23 12:17               ` Ingo Molnar
  1 sibling, 1 reply; 277+ messages in thread
From: Alan @ 2007-02-23  0:30 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

> to do anything but chase pointers through cache.  Done right, it
> hardly even branches (although the branch misprediction penalty is a
> lot less of a worry on current x86_64 than it was in the
> mega-superscalar-out-of-order-speculative-execution days).  It's damn

Actually it costs a lot more on at least one vendors processor because
you stall very long pipelines.

> threadlets promise that they will not touch anything thread-local, and
> that when the FPU is handed to them in a specific, known state, they
> leave it in that same state.  (Some of the flags can be

We don't use the FPU in the kernel except in very weird cases where it
makes an enormous performance difference. The threadlets also have the
same page tables, so they have the same %cr3, so it's very cheap to
switch: basically a predicted jump and some register loads.

> Do me a favor.  Do some floating point math and a memcpy() in between
> syscalls in the threadlet.  Actually fiddle with errno and the FPU

We don't have an errno in the kernel because it's a stupid idea. Errno
is a user-space hack for compatibility with 1970s bad design. So it's
not relevant either.

Alan

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 13:40 ` David Miller
@ 2007-02-22 23:52   ` linux
  0 siblings, 0 replies; 277+ messages in thread
From: linux @ 2007-02-22 23:52 UTC (permalink / raw)
  To: davem, linux; +Cc: linux-kernel, mingo

> It's brilliant for disk I/O, not for networking for which
> blocking is the norm not the exception.
> 
> So people will have to likely do something like divide their
> applications into handling for I/O to files and I/O to networking.
> So beautiful. :-)
>
> Nobody has proposed anything yet which scales well and handles both
> cases.

The truly brilliant thing about the whole "create a thread on blocking"
approach is that you immediately make *every* system call
asynchronous-capable, including the thousands of obscure ioctls, without
having to boil the ocean rewriting 5/6 of the kernel from implicit
(stack-based) to explicit state machines.

You're right that it doesn't solve everything, but it's a big step
forward while keeping a reasonably clean interface.


Now, we have some portions of the kernel (to be precise, those that
currently support poll() and select()) that are written as explicit
state machines and can block on a much smaller context structure.

In truth, the division you assume above isn't so terrible.
My applications are *already* written like that.  It's just "poll()
until I accumulate a whole request, then fork a thread to handle it."

The only way to avoid allocating a kernel stack is to have the entire
handling code path, including the return to user space, written in
explicit state machine style.  (Once you get to user space, you can have
a threading library there if you like.) All the flaming about different
ways to implement completion notification is precisely because not much
is known about the best way to do it; there aren't a lot of applications
that work that way.

(Certainly that's because it wasn't possible before, but it's clearly
an area that requires research, so not committing to an implementation
is A Good Thing.)

But once that is solved, and "system call complete" can be reported
without returning to a user-space thread (which is basically an alternate
system call submission interface, *independent* of the fibril/threadlet
non-blocking implementation), then you can find the hot paths in the
kernel and special-case them to avoid creating a whole thread.

To use a networking analogy, this is a cleanly layered protocol design,
with an optimized fast path *implementation* that blurs the boundaries.


As for the overhead of threading, there are basically three parts:
1) System call (user/kernel boundary crossing) costs.  These depend only
   on the total number of system calls and not on the number of threads
   making them.  They can be mitigated *if necessary* with a syslet-like
   "macro syscall" mechanism to increase the work per boundary crossing.

   The only place threading might increase these numbers is thread
   synchronization, and futexes already solve that pretty well.

2) Register and stack swapping.  These (and associated cache issues)
   are basically unavoidable, and are the bare minimum that longjmp()
   does.  Nothing thread-based is going to reduce this.  (Actually,
   the kernel can do better than user space because it can do lazy FPU
   state swapping.)

3) MMU context switch costs.  These are the big ones, particularly on
   x86 without TLB context IDs.  However, these fall into a few
   categories:
   - Mandatory switches because the entire application is blocked.
     I don't see how this can be avoided; these are the cases where
     even a user-space longjmp-based thread library would context
     switch.
   - Context switches between threads in an application.  The Linux
     kernel already optimizes out the MMU context switch in this case,
     and the scheduler already knows that such context switches are
     cheaper and preferred.

   The one further optimization that's possible is if you have a system
   call that (in a common case) blocks multiple times *without accessing
   user memory*.  This is not a read() or write(), but could be
   something like fsync() or ftruncate().  In this case, you could
   temporarily mark the thread as a "kernel thread" that can run in any
   MMU context, and then fix it explicitly when you unmark it on the
   return path.

I can see the space overhead of 1:1 threading, but I really don't think
there's much time overhead.

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 21:32           ` Benjamin LaHaise
@ 2007-02-22 21:44             ` Zach Brown
  0 siblings, 0 replies; 277+ messages in thread
From: Zach Brown @ 2007-02-22 21:44 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

> direct-io.c is evil.  Ridiculously.

You will have a hard time finding someone to defend it, I predict :).

There is good news on that front, too.  Chris (Mason) is making  
progress on getting rid of the worst of the Magical Locking that  
makes buffered races with O_DIRECT ops so awful.

I'm not holding my breath for a page cache so fine grained that it  
could pin and reference 512B granular user regions and build bios  
from them, though that sure would be nice :).

>> As an experiment, I'm working on backing the sys_io_*() calls with
>> syslets.  It's looking very promising so far.
>
> Great, I'd love to see the comparisons.

I'm out for data.  If it sucks, well, we'll know just how much.  I'm  
pretty hopeful that it won't :).

> One other implementation to consider is actually using kernel threads
> compared to how syslets perform.  Direct IO for one always blocks, so
> there shouldn't be much of a performance difference compared to  
> syslets,
> with the bonus that no arch specific code is needed.

Yeah, I'm starting with raw kernel threads so we can get some numbers  
before moving to syslets.

One of the benefits syslets bring is the ability to start processing  
the next pending request the moment current request processing  
blocks.  In the concurrent O_DIRECT write case that avoids releasing  
a ton of kernel threads which all just run to serialize on i_mutex  
(potentially bouncing it around cache domains) as the O_DIRECT ops  
are built and sent.

-z 

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 14:31                 ` Ingo Molnar
  2007-02-22 14:47                   ` David Miller
  2007-02-22 14:59                   ` Ingo Molnar
@ 2007-02-22 21:42                   ` Michael K. Edwards
  2 siblings, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-22 21:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, johnpol, arjan, drepper, linux-kernel, torvalds,
	hch, akpm, alan, zach.brown, suparna, davidel, jens.axboe, tglx

On 2/22/07, Ingo Molnar <mingo@elte.hu> wrote:
> Secondly, even assuming lots of pending requests/async-threads and a
> naive queueing model, an open request will eat up resources on the
> server no matter what.

Another fundamental misconception.  Kernel AIO is not for servers.
One programmer in a hundred is working on a server codebase, and one
in a thousand dares to touch server plumbing.  Kernel AIO is for
clients, especially when mated to GUIs with an event delivery
mechanism.  Ask yourself why the one and only thing that Windows NT
has ever gotten right about networking is I/O completion ports.

Cheers,
- Michael

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 21:23         ` Zach Brown
@ 2007-02-22 21:32           ` Benjamin LaHaise
  2007-02-22 21:44             ` Zach Brown
  0 siblings, 1 reply; 277+ messages in thread
From: Benjamin LaHaise @ 2007-02-22 21:32 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On Thu, Feb 22, 2007 at 01:23:57PM -0800, Zach Brown wrote:
> As one of the poor suckers who has been fixing bugs in fs/aio.c and  
> fs/direct-io.c, I really want everyone to read Ingo's paragraph a few  
> times.  Have it printed on a t-shirt.

direct-io.c is evil.  Ridiculously.

> Amen.
> 
> As an experiment, I'm working on backing the sys_io_*() calls with  
> syslets.  It's looking very promising so far.

Great, I'd love to see the comparisons.

> > So all in one, i used to think that AIO state-machines have a
> > long-term place within the kernel, but with syslets i think i've
> > proven myself embarrassingly wrong =B-)
> 
> Welcome to the party :).

Well, there are always the 2.4 patches, which are properly state driven
and reasonably simple.  Retry was born out of a need to come up with a
mechanism that had less impact on the core kernel code, and yes, it
seems to be a failure and in dire need of replacement.

One other implementation to consider is actually using kernel threads 
compared to how syslets perform.  Direct IO for one always blocks, so 
there shouldn't be much of a performance difference compared to syslets, 
with the bonus that no arch specific code is needed.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <zyntrop@kvack.org>.

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 14:47                   ` David Miller
                                       ` (2 preceding siblings ...)
  2007-02-22 20:13                     ` Davide Libenzi
@ 2007-02-22 21:30                     ` Zach Brown
  3 siblings, 0 replies; 277+ messages in thread
From: Zach Brown @ 2007-02-22 21:30 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, johnpol, arjan, drepper, linux-kernel, torvalds, hch,
	akpm, alan, suparna, davidel, jens.axboe, tglx

> The more I think about it, a reasonable solution might actually be to
> use threadlets for disk I/O and pure event based processing for
> networking.  It is two different handling paths and non-unified,
> but that might be the price for good performance :-)

I generally agree, with some comments.

If we come to the decision that there are some message rates that are  
better suited to delivery into a user-read ring (10gige rx to kevent,  
say) then it doesn't seem like it would be much of a stretch to add a  
facility where syslet completion could be funneled into that channel  
as well.

I also wonder if there isn't some opportunity to cut down the number
of syscalls per op in networking land.  Is it madness to think of a
call like recvmsgv(), which could provide a vector of msghdrs?  It
might not make sense, but it might cut down on the per-op overhead
for loads that know they're going to be heavy enough to get a decent
amount of batching without fatally harming latency.  Maybe those
loads are rare...
- z 

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 12:59           ` Ingo Molnar
  2007-02-22 13:32             ` Evgeniy Polyakov
  2007-02-22 14:17             ` Suparna Bhattacharya
@ 2007-02-22 21:24             ` Michael K. Edwards
  2007-02-23  0:30               ` Alan
  2007-02-23 12:17               ` Ingo Molnar
  2 siblings, 2 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-22 21:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Suparna Bhattacharya,
	Davide Libenzi, Jens Axboe, Thomas Gleixner

On 2/22/07, Ingo Molnar <mingo@elte.hu> wrote:
> > It is not a TUX anymore - you had 1024 threads, and all of them will
> > be consumed by tcp_sendmsg() for slow clients - rescheduling will kill
> > a machine.
>
> maybe it will, maybe it wont. Lets try? There is no true difference
> between having a 'request structure' that represents the current state
> of the HTTP connection plus a statemachine that moves that request
> between various queues, and a 'kernel stack' that goes in and out of
> runnable state and carries its processing state in its stack - other
> than the amount of RAM they take. (the kernel stack is 4K at a minimum -
> so with a million outstanding requests they would use up 4 GB of RAM.
> With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)

This is a fundamental misconception.  The state machine doesn't have
to do anything but chase pointers through cache.  Done right, it
hardly even branches (although the branch misprediction penalty is a
lot less of a worry on current x86_64 than it was in the
mega-superscalar-out-of-order-speculative-execution days).  It's damn
near free -- but it's a pain in the butt to code, and it has to be
done either in-kernel or in per-CPU OS-atop-the-OS dispatch threads.

The scheduler, on the other hand, has to blow and reload all of the
hidden state associated with force-loading the PC and wherever your
architecture keeps its TLS (maybe not the whole TLB, but not nothing,
either).  The only way around this that I can think of is to make
threadlets promise that they will not touch anything thread-local, and
that when the FPU is handed to them in a specific, known state, they
leave it in that same state.  (Some of the flags can be
unspecified-but-don't-touch-me.)  Then you can schedule threadlets in
bursts with negligible transition cost from one to the next.

There is, however, a substantial setup cost for a burst, because you
have to put the FPU in that known state and lock out TLS access (this
is user code, after all).  If the wrong process is in foreground, you
also need to switch process context at the start of a burst; no
fandangos on other processes' core, please, and to be remotely useful
the threadlets need access to process-global data structures and
synchronization primitives anyway.  That's why you need for threadlets
to have a separate SCHED_THREADLET priority and at least a weak
ordering by PID.  At which point you are outside the feature set of
the O(1) scheduler as I understand it, and you might as well schedule
them from the next tasklet following the softirq dispatcher.

> > My tests show that with 4k connections per second (8k concurrency)
> > more than 20k connections of 80k total block in tcp_sendmsg() over
> > gigabit lan between quite fast machines.
>
> yeah. Note that you can have a million sleeping threads if you want, the
> scheduler wont care. What matters more is the amount of true concurrency
> that is present at any given time. But yes, i agree that overscheduling
> can be a problem.

What matters is that a burst of I/O responses be scheduled efficiently
without taking down the rest of the box.  That, and the ability to
cancel no-longer-interesting I/O requests in bulk, without leaking
memory and synchronization primitives all over the place.  If you
don't have that, this scheme is UNUSABLE for network I/O.

> btw., what is the measurement utility you are using with kevents ('ab'
> perhaps, with a high -c concurrency count?), and which webserver are you
> using? (light-httpd?)

Do me a favor.  Do some floating point math and a memcpy() in between
syscalls in the threadlet.  Actually fiddle with errno and the FPU
rounding flags.  Watch it slow to a crawl and/or break floating point
arithmetic horribly.  Understand why no one with half a brain uses
Java, or any other language which cuts FP corners for the sake of
cheap threads, for calculations that have to be correct.  (Note that
Kahan received the Turing award for contributions to IEEE 754.  If his
polemic is too thick, read
http://www-128.ibm.com/developerworks/java/library/j-jtp0114/.)

Cheers,
- Michael

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22  7:40       ` Ingo Molnar
  2007-02-22 11:31         ` Evgeniy Polyakov
  2007-02-22 19:38         ` Davide Libenzi
@ 2007-02-22 21:23         ` Zach Brown
  2007-02-22 21:32           ` Benjamin LaHaise
  2 siblings, 1 reply; 277+ messages in thread
From: Zach Brown @ 2007-02-22 21:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Evgeniy Polyakov,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

> Plus there's the fundamental killer that KAIO is a /lot/ harder to
> implement (and to maintain) on the kernel side: it has to be
> implemented for every IO discipline, and even for the IO disciplines
> it supports at the moment, it is not truly asynchronous for things
> like metadata blocking or VFS blocking. To handle things like
> metadata blocking it has to resort to non-statemachine techniques
> like retries - which are bad for performance.

Yes, yes, yes.

As one of the poor suckers who has been fixing bugs in fs/aio.c and  
fs/direct-io.c, I really want everyone to read Ingo's paragraph a few  
times.  Have it printed on a t-shirt.

Look at the number of man-years that have gone into fs/aio.c and
fs/direct-io.c.  After all that effort it *barely* supports non-blocking
O_DIRECT IO.

The maintenance overhead of those two files, above all else, is what  
pushed me to finally try that nutty fibril attempt.

> Syslets/threadlets on the other hand, once the core is implemented,
> have near zero ongoing maintenance cost (compared to KAIO pushed
> into every IO subsystem) and cover all IO disciplines and API
> variants immediately, and they are as perfectly asynchronous as it
> gets.

Amen.

As an experiment, I'm working on backing the sys_io_*() calls with  
syslets.  It's looking very promising so far.

> > So all in one, i used to think that AIO state-machines have a
> > long-term place within the kernel, but with syslets i think i've
> > proven myself embarrassingly wrong =B-)

Welcome to the party :).

- z

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 14:47                   ` David Miller
  2007-02-22 15:02                     ` Evgeniy Polyakov
  2007-02-22 15:15                     ` Ingo Molnar
@ 2007-02-22 20:13                     ` Davide Libenzi
  2007-02-22 21:30                     ` Zach Brown
  3 siblings, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-22 20:13 UTC (permalink / raw)
  To: David Miller
  Cc: Ingo Molnar, johnpol, Arjan Van de Ven, Ulrich Drepper,
	Linux Kernel Mailing List, Linus Torvalds, hch, akpm, Alan Cox,
	zach.brown, suparna, jens.axboe, tglx

On Thu, 22 Feb 2007, David Miller wrote:

> The more I think about it, a reasonable solution might actually be to
> use threadlets for disk I/O and pure event based processing for
> networking.  It is two different handling paths and non-unified,
> but that might be the price for good performance :-)

Well, it takes 20 lines of userspace C code to bring *unification* to the 
universe ;)


- Davide




* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 13:32             ` Evgeniy Polyakov
@ 2007-02-22 19:46               ` Davide Libenzi
  2007-02-23 12:15                 ` Evgeniy Polyakov
  2007-02-23 11:51               ` Ingo Molnar
  1 sibling, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-22 19:46 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
	Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
	Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Thu, 22 Feb 2007, Evgeniy Polyakov wrote:

> > maybe it will, maybe it wont. Lets try? There is no true difference 
> > between having a 'request structure' that represents the current state 
> > of the HTTP connection plus a statemachine that moves that request 
> > between various queues, and a 'kernel stack' that goes in and out of 
> > runnable state and carries its processing state in its stack - other 
> > than the amount of RAM they take. (the kernel stack is 4K at a minimum - 
> > so with a million outstanding requests they would use up 4 GB of RAM. 
> > With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)
> 
> I tried already :) - I just made allocations atomic in tcp_sendmsg() and
> ended up with 1/4 of the sends blocking (I counted both allocation
> failures and socket queue overflows). Those 20k blocked requests were
> created in about 20 seconds, so roughly speaking we have 1k thread
> creations/freeings per second - do we want this?

A dynamic pool will smooth thread creation/freeing up by a lot.
And, on my box a *pthread* create/free takes ~10us; at 1000/s that's 10ms, 
or 1%.  Bad, but not so awful ;)
Look, I'm *definitely* not trying to advocate the use of async syscalls for 
network here, just pointing out that when we're talking about threads, 
Linux does a pretty good job.




- Davide




* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22  7:40       ` Ingo Molnar
  2007-02-22 11:31         ` Evgeniy Polyakov
@ 2007-02-22 19:38         ` Davide Libenzi
  2007-02-28  9:45           ` Ingo Molnar
  2007-02-22 21:23         ` Zach Brown
  2 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-22 19:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, Evgeniy Polyakov, David S. Miller,
	Suparna Bhattacharya, Jens Axboe, Thomas Gleixner

On Thu, 22 Feb 2007, Ingo Molnar wrote:

> 
> * Ulrich Drepper <drepper@redhat.com> wrote:
> 
> > Ingo Molnar wrote:
> > > in terms of AIO, the best queueing model is i think what the kernel uses 
> > > internally: freely ordered, with barrier support.
> > 
> > Speaking of AIO, how do you imagine lio_listio is implemented?  If 
> > there is no asynchronous syscall it would mean creating a threadlet 
> > for each request but this means either waiting or creating 
> > several/many threads.
> 
> my current thinking is that special-purpose (non-programmable, static) 
> APIs like aio_*() and lio_*(), where every last cycle of performance 
> matters, should be implemented using syslets - even if it is quite 
> tricky to write syslets (which they no doubt are - just compare the size 
> of syslet-test.c to threadlet-test.c). So i'd move syslets into the same 
> category as raw syscalls: pieces of the raw infrastructure between the 
> kernel and glibc, not an exposed API to apps. [and even if we keep them 
> in that category they still need quite a bit of API work, to clean up 
> the 32/64-bit issues, etc.]

Now that chains of syscalls can be way more easily handled with clets^wthreadlets,
why would we need the whole syslets crud inside?
Why can't aio_* be implemented with *simple* (or parallel/unrelated) 
syscall submission, w/out the burden of a complex, limiting and heavy API 
(I won't list all the points against syslets, because I already did it 
enough times)?  The compat layer alone is so bad it's not even funny.
Look at the code.  Only removing the syslets crud would prolly cut 40% of 
it.  And we did not even touch the compat code yet.



- Davide



* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 15:15                     ` Ingo Molnar
  2007-02-22 15:29                       ` Ingo Molnar
@ 2007-02-22 17:17                       ` David Miller
  2007-02-23 11:12                         ` Ingo Molnar
  1 sibling, 1 reply; 277+ messages in thread
From: David Miller @ 2007-02-22 17:17 UTC (permalink / raw)
  To: mingo
  Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
	zach.brown, suparna, davidel, jens.axboe, tglx

From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 22 Feb 2007 16:15:09 +0100

> furthermore, in a real webserver there's a whole lot of other stuff 
> happening too: VFS blocking, mutex/lock blocking, memory pressure 
> blocking, filesystem blocking, etc., etc. Threadlets/syslets cover them 
> /all/ and never hold up the primary context: as long as there's more 
> requests to process, they will be processed. Plus other important 
> networked workloads, like fileservers are typically on fast LANs and 
> those requests are very much a fire-and-forget matter most of the time.

I expect clients of a fileserver to cause the server to block in
places such as tcp_sendmsg() as much if not more so than a webserver
:-)

But yes, it should all be tested, for sure.


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 15:15                     ` Ingo Molnar
@ 2007-02-22 15:29                       ` Ingo Molnar
  2007-02-22 17:17                       ` David Miller
  1 sibling, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 15:29 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
	zach.brown, suparna, davidel, jens.axboe, tglx


* Ingo Molnar <mingo@elte.hu> wrote:

> > The pushback to the primary thread you speak of is just extra work 
> > in my mind, for networking.  Better to just begin operations and sit 
> > in the primary thread(s) waiting for events, and when they arrive 
> > push the operations further along using non-blocking writes, reads, 
> > and accept() calls.  There is no blocking context really needed for 
> > these kinds of things, so a mechanism that tries to provide one is a 
> > waste.
> 
> one question is, what is cheaper, to block out of a read and a write and 
                                         ^-------to back out
> to set up the event notification and then to return to the user 
> context, or to just stay right in there with all the context already 
> constructed and on the stack, and schedule away and then come back and 
> queue back to the primary thread once the condition the thread is 
> waiting for is done? The latter isnt all that unattractive in my mind, 
> because it always does forward progress, with minimal 'backout' costs.

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 14:47                   ` David Miller
  2007-02-22 15:02                     ` Evgeniy Polyakov
@ 2007-02-22 15:15                     ` Ingo Molnar
  2007-02-22 15:29                       ` Ingo Molnar
  2007-02-22 17:17                       ` David Miller
  2007-02-22 20:13                     ` Davide Libenzi
  2007-02-22 21:30                     ` Zach Brown
  3 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 15:15 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
	zach.brown, suparna, davidel, jens.axboe, tglx


* David Miller <davem@davemloft.net> wrote:

> The pushback to the primary thread you speak of is just extra work in 
> my mind, for networking.  Better to just begin operations and sit in 
> the primary thread(s) waiting for events, and when they arrive push 
> the operations further along using non-blocking writes, reads, and 
> accept() calls.  There is no blocking context really needed for these 
> kinds of things, so a mechanism that tries to provide one is a waste.

one question is, what is cheaper, to block out of a read and a write and 
to set up the event notification and then to return to the user context, 
or to just stay right in there with all the context already constructed 
and on the stack, and schedule away and then come back and queue back to 
the primary thread once the condition the thread is waiting for is done? 
The latter isnt all that unattractive in my mind, because it always does 
forward progress, with minimal 'backout' costs.

furthermore, in a real webserver there's a whole lot of other stuff 
happening too: VFS blocking, mutex/lock blocking, memory pressure 
blocking, filesystem blocking, etc., etc. Threadlets/syslets cover them 
/all/ and never hold up the primary context: as long as there's more 
requests to process, they will be processed. Plus other important 
networked workloads, like fileservers, are typically on fast LANs and 
those requests are very much a fire-and-forget matter most of the time.

in any case, this definitely needs to be measured.

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 14:47                   ` David Miller
@ 2007-02-22 15:02                     ` Evgeniy Polyakov
  2007-02-22 15:15                     ` Ingo Molnar
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-22 15:02 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
	zach.brown, suparna, davidel, jens.axboe, tglx

On Thu, Feb 22, 2007 at 06:47:04AM -0800, David Miller (davem@davemloft.net) wrote:
> As a side note although Evgeniy likes M:N threading model ideas, they
> are a mine field wrt. signal semantics.  Solaris guys took several
> years to get it right, just grep through the Solaris kernel patch
> readme files over the years to get an idea of how bad it can be.  I
> would therefore never advocate such an approach.

I have fully synchronous kevent signal delivery for that purpose :)
Having all events synchronous allows trivial handling of them -
including signals.

> The more I think about it, a reasonable solution might actually be to
> use threadlets for disk I/O and pure event based processing for
> networking.  It is two different handling paths and non-unified,
> but that might be the price for good performance :-)

Hmm, yes, for such a scenario we need some kind of event delivery
mechanism which would allow waiting on different kinds of events.

In the above sentence I see some painfully familiar letters - 
letter k
letter e
letter v
letter e
letter n
letter t

Or the more modern trend - async_wait(epoll).

-- 
	Evgeniy Polyakov


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 14:31                 ` Ingo Molnar
  2007-02-22 14:47                   ` David Miller
@ 2007-02-22 14:59                   ` Ingo Molnar
  2007-02-22 21:42                   ` Michael K. Edwards
  2 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 14:59 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
	zach.brown, suparna, davidel, jens.axboe, tglx


* Ingo Molnar <mingo@elte.hu> wrote:

> Firstly, i dont think you are fully applying the syslet/threadlet 
> model. There is no reason why an 'idle' client would have to use up a 
> full thread! It all depends on how you use syslets/threadlets, and how 
> (frequently) you queue back requests from cachemiss threads back to 
> the primary thread. It is only the simplest queueing model where there 
> is one thread per request that is currently blocked. 
> Syslets/threadlets do /not/ force request processing to be performed 
> in the async context forever - the async thread could very much queue 
> it back to the primary context. (That's in essence what Tux did.) So 
> the same state-machine techniques can be applied on both the syslet 
> and the threadlet model, but in much more natural (and thus lower 
> overhead) points: /between/ system calls and not in the middle of 
> them. There are a number of measures that can be used to keep the 
> number of parallel threads down.

i think the best model here is to use kevents or epoll to discover 
accept()-able or recv()-able keepalive sockets, and to do the main 
request loop via syslets/threadlets, with a 'queue back to the main 
context if we went async and if the request is done' feedback mechanism 
that keeps the size of the pool down.

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 13:41               ` David Miller
  2007-02-22 14:31                 ` Ingo Molnar
@ 2007-02-22 14:53                 ` Avi Kivity
  1 sibling, 0 replies; 277+ messages in thread
From: Avi Kivity @ 2007-02-22 14:53 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, arjan, mingo, drepper, linux-kernel, torvalds, hch,
	akpm, alan, zach.brown, suparna, davidel, jens.axboe, tglx

David Miller wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Thu, 22 Feb 2007 15:39:30 +0300
>
>   
>> It does not matter - even with threads, the cost of having thousands of
>> threads is _too_ expensive. So, IMO, it is wrong to have to create 
>> 20k threads for a simple web server which only sends one index page to
>> 80k connections at a rate of 4k connections per second.
>>
>> Just have that example in mind - more than 20k blocks in 80k connections 
>> over gigabit lan, and that is likely an optimistic result, when designing 
>> a new type of AIO.
>>     
>
> I totally agree with Evgeniy on these points.
>
> Using things like syslets and threadlets for networking I/O
> is not a very good idea.  Blocking is more the norm than the
> exception for networking I/O.
>   

And for O_DIRECT, and for large storage systems which overwhelm caches.
The optimize-for-the-nonblocking-case approach does not fit all
workloads.  And of course we have to be able to mix mostly-nonblocking
threadlets and mostly-blocking O_DIRECT and networking.


-- 
error compiling committee.c: too many arguments to function



* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 14:31                 ` Ingo Molnar
@ 2007-02-22 14:47                   ` David Miller
  2007-02-22 15:02                     ` Evgeniy Polyakov
                                       ` (3 more replies)
  2007-02-22 14:59                   ` Ingo Molnar
  2007-02-22 21:42                   ` Michael K. Edwards
  2 siblings, 4 replies; 277+ messages in thread
From: David Miller @ 2007-02-22 14:47 UTC (permalink / raw)
  To: mingo
  Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
	zach.brown, suparna, davidel, jens.axboe, tglx

From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 22 Feb 2007 15:31:45 +0100

> Firstly, i dont think you are fully applying the syslet/threadlet model. 
> There is no reason why an 'idle' client would have to use up a full 
> thread! It all depends on how you use syslets/threadlets, and how 
> (frequently) you queue back requests from cachemiss threads back to the 
> primary thread. It is only the simplest queueing model where there is 
> one thread per request that is currently blocked. Syslets/threadlets do 
> /not/ force request processing to be performed in the async context 
> forever - the async thread could very much queue it back to the primary 
> context. (That's in essence what Tux did.) So the same state-machine 
> techniques can be applied on both the syslet and the threadlet model, 
> but in much more natural (and thus lower overhead) points: /between/ 
> system calls and not in the middle of them. There are a number of 
> measures that can be used to keep the number of parallel threads down.

Ok.

> Secondly, even assuming lots of pending requests/async-threads and a 
> naive queueing model, an open request will eat up resources on the 
> server no matter what. So if your point is that "+4K of kernel stack 
> pinned down per open, blocked request makes syslets and threadlets not a 
> very good idea", then i'd like to disagree with that: while it wont be 
> zero-cost (4K does cost you 400MB of RAM per 100,000 outstanding 
> threads), it's often comparable to the other RAM costs that are already 
> attached to an open connection.

The 400MB is extra, and it's in no way commensurate with the cost
of the TCP socket itself even including the application specific
state being used for that connection.

Even if it were _equal_, we would be doubling the memory
requirements for such a scenario.

This is why I dislike the threadlet model, when used in that way.

The pushback to the primary thread you speak of is just extra work in
my mind, for networking.  Better to just begin operations and sit in
the primary thread(s) waiting for events, and when they arrive push
the operations further along using non-blocking writes, reads, and
accept() calls.  There is no blocking context really needed for these
kinds of things, so a mechanism that tries to provide one is a waste.

As a side note although Evgeniy likes M:N threading model ideas, they
are a mine field wrt. signal semantics.  Solaris guys took several
years to get it right, just grep through the Solaris kernel patch
readme files over the years to get an idea of how bad it can be.  I
would therefore never advocate such an approach.

The more I think about it, a reasonable solution might actually be to
use threadlets for disk I/O and pure event based processing for
networking.  It is two different handling paths and non-unified,
but that might be the price for good performance :-)


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 14:17             ` Suparna Bhattacharya
@ 2007-02-22 14:36               ` Ingo Molnar
  2007-02-23 14:23                 ` Suparna Bhattacharya
  0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 14:36 UTC (permalink / raw)
  To: Suparna Bhattacharya
  Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Davide Libenzi, Jens Axboe,
	Thomas Gleixner


* Suparna Bhattacharya <suparna@in.ibm.com> wrote:

> > maybe it will, maybe it wont. Lets try? There is no true difference 
> > between having a 'request structure' that represents the current 
> > state of the HTTP connection plus a statemachine that moves that 
> > request between various queues, and a 'kernel stack' that goes in 
> > and out of runnable state and carries its processing state in its 
> > stack - other than the amount of RAM they take. (the kernel stack is 
> > 4K at a minimum - so with a million outstanding requests they would 
> > use up 4 GB of RAM. With 20k outstanding requests it's 80 MB of RAM 
> > - that's acceptable.)
> 
> At what point are the cachemiss threads destroyed ? In other words how 
> well does this adapt to load variations ? For example, would this 80MB 
> of RAM continue to be locked down even during periods of lighter loads 
> thereafter ?

you can destroy them at will from user-space too - just start a slow 
timer that zaps them if load goes down. I can add a 
sys_async_thread_exit(nr_threads) API to be able to drive this without 
knowing the TIDs of those threads, and/or i can add a kernel-internal 
mechanism to zap inactive threads. It would be rather easy and 
low-overhead - the v2 code already had a max_nr_threads tunable, i can 
reintroduce it. So the size of the pool of contexts does not have to be 
permanent at all.

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 13:41               ` David Miller
@ 2007-02-22 14:31                 ` Ingo Molnar
  2007-02-22 14:47                   ` David Miller
                                     ` (2 more replies)
  2007-02-22 14:53                 ` Avi Kivity
  1 sibling, 3 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 14:31 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
	zach.brown, suparna, davidel, jens.axboe, tglx


* David Miller <davem@davemloft.net> wrote:

> If used for networking one could easily make this new interface create 
> an arbitrary number of threads by just opening up that many 
> connections to such a server and just sitting there not reading 
> anything from any of the client sockets.  And this happens 
> non-maliciously for slow clients, whether that is due to application 
> blockage or the characteristics of the network path.

there are two issues on which i'd like to disagree.

Firstly, i dont think you are fully applying the syslet/threadlet model. 
There is no reason why an 'idle' client would have to use up a full 
thread! It all depends on how you use syslets/threadlets, and how 
(frequently) you queue back requests from cachemiss threads back to the 
primary thread. It is only the simplest queueing model where there is 
one thread per request that is currently blocked. Syslets/threadlets do 
/not/ force request processing to be performed in the async context 
forever - the async thread could very much queue it back to the primary 
context. (That's in essence what Tux did.) So the same state-machine 
techniques can be applied on both the syslet and the threadlet model, 
but in much more natural (and thus lower overhead) points: /between/ 
system calls and not in the middle of them. There are a number of 
measures that can be used to keep the number of parallel threads down.

Secondly, even assuming lots of pending requests/async-threads and a 
naive queueing model, an open request will eat up resources on the 
server no matter what. So if your point is that "+4K of kernel stack 
pinned down per open, blocked request makes syslets and threadlets not a 
very good idea", then i'd like to disagree with that: while it wont be 
zero-cost (4K does cost you 400MB of RAM per 100,000 outstanding 
threads), it's often comparable to the other RAM costs that are already 
attached to an open connection.

(let me know if i misunderstood your point.)

	Ingo


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 12:59           ` Ingo Molnar
  2007-02-22 13:32             ` Evgeniy Polyakov
@ 2007-02-22 14:17             ` Suparna Bhattacharya
  2007-02-22 14:36               ` Ingo Molnar
  2007-02-22 21:24             ` Michael K. Edwards
  2 siblings, 1 reply; 277+ messages in thread
From: Suparna Bhattacharya @ 2007-02-22 14:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
	Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
	Zach Brown, David S. Miller, Davide Libenzi, Jens Axboe,
	Thomas Gleixner

On Thu, Feb 22, 2007 at 01:59:31PM +0100, Ingo Molnar wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > It is not a TUX anymore - you had 1024 threads, and all of them will 
> > be consumed by tcp_sendmsg() for slow clients - rescheduling will kill 
> > a machine.
> 
> maybe it will, maybe it wont. Lets try? There is no true difference 
> between having a 'request structure' that represents the current state 
> of the HTTP connection plus a statemachine that moves that request 
> between various queues, and a 'kernel stack' that goes in and out of 
> runnable state and carries its processing state in its stack - other 
> than the amount of RAM they take. (the kernel stack is 4K at a minimum - 
> so with a million outstanding requests they would use up 4 GB of RAM. 
> With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)

At what point are the cachemiss threads destroyed ? In other words how well
does this adapt to load variations ? For example, would this 80MB of RAM 
continue to be locked down even during periods of lighter loads thereafter ?

Regards
Suparna

> 
> > My tests show that with 4k connections per second (8k concurrency) 
> > more than 20k connections of 80k total block in tcp_sendmsg() over 
> > gigabit lan between quite fast machines.
> 
> yeah. Note that you can have a million sleeping threads if you want, the 
> scheduler wont care. What matters more is the amount of true concurrency 
> that is present at any given time. But yes, i agree that overscheduling 
> can be a problem.
> 
> btw., what is the measurement utility you are using with kevents ('ab' 
> perhaps, with a high -c concurrency count?), and which webserver are you 
> using? (light-httpd?)
> 
> 	Ingo

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India



* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 12:39             ` Evgeniy Polyakov
@ 2007-02-22 13:41               ` David Miller
  2007-02-22 14:31                 ` Ingo Molnar
  2007-02-22 14:53                 ` Avi Kivity
  0 siblings, 2 replies; 277+ messages in thread
From: David Miller @ 2007-02-22 13:41 UTC (permalink / raw)
  To: johnpol
  Cc: arjan, mingo, drepper, linux-kernel, torvalds, hch, akpm, alan,
	zach.brown, suparna, davidel, jens.axboe, tglx

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Thu, 22 Feb 2007 15:39:30 +0300

> It does not matter - even with threads, the cost of having thousands of
> threads is _too_ expensive. So, IMO, it is wrong to have to create 
> 20k threads for a simple web server which only sends one index page to
> 80k connections at a rate of 4k connections per second.
> 
> Just have that example in mind - more than 20k blocks in 80k connections 
> over gigabit lan, and that is likely an optimistic result, when designing 
> a new type of AIO.

I totally agree with Evgeniy on these points.

Using things like syslets and threadlets for networking I/O
is not a very good idea.  Blocking is more the norm than the
exception for networking I/O.



* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 12:27 linux
@ 2007-02-22 13:40 ` David Miller
  2007-02-22 23:52   ` linux
  0 siblings, 1 reply; 277+ messages in thread
From: David Miller @ 2007-02-22 13:40 UTC (permalink / raw)
  To: linux; +Cc: mingo, linux-kernel

From: linux@horizon.com
Date: 22 Feb 2007 07:27:21 -0500

> May I just say, that this is f***ing brilliant.

It's brilliant for disk I/O, not for networking for which
blocking is the norm not the exception.

So people will likely have to do something like divide their
applications into separate handling for file I/O and network I/O.
So beautiful. :-)

Nobody has proposed anything yet which scales well and handles both
cases.

It is one recurring point made by Evgeniy, and he is very right
about it.

If used for networking one could easily make this new interface create
an arbitrary number of threads by just opening up that many
connections to such a server and just sitting there not reading
anything from any of the client sockets.  And this happens
non-maliciously for slow clients, whether that is due to application
blockage or the characteristics of the network path.


* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 12:59           ` Ingo Molnar
@ 2007-02-22 13:32             ` Evgeniy Polyakov
  2007-02-22 19:46               ` Davide Libenzi
  2007-02-23 11:51               ` Ingo Molnar
  2007-02-22 14:17             ` Suparna Bhattacharya
  2007-02-22 21:24             ` Michael K. Edwards
  2 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-22 13:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thu, Feb 22, 2007 at 01:59:31PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > It is not a TUX anymore - you had 1024 threads, and all of them will 
> > be consumed by tcp_sendmsg() for slow clients - rescheduling will kill 
> > a machine.
> 
> maybe it will, maybe it wont. Lets try? There is no true difference 
> between having a 'request structure' that represents the current state 
> of the HTTP connection plus a statemachine that moves that request 
> between various queues, and a 'kernel stack' that goes in and out of 
> runnable state and carries its processing state in its stack - other 
> than the amount of RAM they take. (the kernel stack is 4K at a minimum - 
> so with a million outstanding requests they would use up 4 GB of RAM. 
> With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)

I tried already :) - I just made allocations atomic in tcp_sendmsg() and
ended up with 1/4 of the sends blocking (I counted both allocation
failures and socket queue overflows). Those 20k blocked requests were
created in about 20 seconds, so roughly speaking we have 1k thread
creations/freeings per second - do we want this?

> > My tests show that with 4k connections per second (8k concurrency) 
> > more than 20k connections of 80k total block in tcp_sendmsg() over 
> > gigabit lan between quite fast machines.
> 
> yeah. Note that you can have a million sleeping threads if you want, the 
> scheduler wont care. What matters more is the amount of true concurrency 
> that is present at any given time. But yes, i agree that overscheduling 
> can be a problem.

Before I started the M:N threading library implementation I checked the
threading performance of the current POSIX library - I created a simple
pool of threads and 'sent' a message between them using futex wait/wake
(sem_post/wait) one-by-one. The results are quite disappointing - given
that the number of sleeping threads was in the hundreds, kernel
rescheduling is about 10 times slower than the setjmp-based switching
(I think) used in Erlang.

The above example is not 100% correct, I understand, but the situation
with thread-like AIO is much worse - it is possible that several threads
will be ready simultaneously, so rescheduling between them will kill
performance.

> btw., what is the measurement utility you are using with kevents ('ab' 
> perhaps, with a high -c concurrency count?), and which webserver are you 
> using? (light-httpd?)

Yes, it is ab and lighttpd; before that it was httperf (unfair on high
load due to poll/select usage) and my own web server (evserver_kevent.c).

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 11:31         ` Evgeniy Polyakov
  2007-02-22 11:52           ` Arjan van de Ven
@ 2007-02-22 12:59           ` Ingo Molnar
  2007-02-22 13:32             ` Evgeniy Polyakov
                               ` (2 more replies)
  2007-02-25 22:44           ` Linus Torvalds
  2 siblings, 3 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 12:59 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> It is not a TUX anymore - you had 1024 threads, and all of them will 
> be consumed by tcp_sendmsg() for slow clients - rescheduling will kill 
> a machine.

maybe it will, maybe it won't. Let's try? There is no true difference 
between having a 'request structure' that represents the current state 
of the HTTP connection plus a statemachine that moves that request 
between various queues, and a 'kernel stack' that goes in and out of 
runnable state and carries its processing state in its stack - other 
than the amount of RAM they take. (the kernel stack is 4K at a minimum - 
so with a million outstanding requests they would use up 4 GB of RAM. 
With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)

> My tests show that with 4k connections per second (8k concurrency) 
> more than 20k connections of 80k total block in tcp_sendmsg() over 
> gigabit lan between quite fast machines.

yeah. Note that you can have a million sleeping threads if you want, the 
scheduler won't care. What matters more is the amount of true concurrency 
that is present at any given time. But yes, i agree that overscheduling 
can be a problem.

btw., what is the measurement utility you are using with kevents ('ab' 
perhaps, with a high -c concurrency count?), and which webserver are you 
using? (light-httpd?)

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 11:52           ` Arjan van de Ven
@ 2007-02-22 12:39             ` Evgeniy Polyakov
  2007-02-22 13:41               ` David Miller
  0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-22 12:39 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Linus Torvalds,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Thu, Feb 22, 2007 at 12:52:39PM +0100, Arjan van de Ven (arjan@infradead.org) wrote:
> 
> > It is not a TUX anymore - you had 1024 threads, and all of them will be
> > consumed by tcp_sendmsg() for slow clients - rescheduling will kill a
> > machine.
> 
> I think it's time to make a split in what "context switch" or
> "reschedule" means...
> 
> there are two types of context switch:
> 
> 1) To a different process. This means teardown of the TLB, going to a
> new MMU state, saving FPU state etc etc etc. This is obviously quite
> expensive
> 
> 2) To a thread of the same process. No TLB flush no new MMU state,
> effectively all it does is getting a new task struct on the kernel side,
> and a new ESP/EIP pair on the userspace side. If there is FPU code
> involved that gets saved as well.
> 
> Number 1 is very expensive and that is what is really worrying normally;
> number 2 is a LOT lighter weight, and while Linux is a bit heavy there,
> it can be made lighter... there's no fundamental reason for it to be
> really expensive.

It does not matter - even with threads, the cost of having thousands of
them is _too_ high. So, IMO, it is wrong to have to create 20k threads
for a simple web server which only sends one index page to 80k
connections at a rate of 4k connections per second.

Just keep that example in mind - more than 20k blocked out of 80k
connections over gigabit lan, and that is likely an optimistic result -
when designing a new type of AIO.

> -- 
> if you want to mail me at work (you don't), use arjan (at) linux.intel.com
> Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
@ 2007-02-22 12:27 linux
  2007-02-22 13:40 ` David Miller
  0 siblings, 1 reply; 277+ messages in thread
From: linux @ 2007-02-22 12:27 UTC (permalink / raw)
  To: mingo; +Cc: linux, linux-kernel

May I just say, that this is f***ing brilliant.
It completely separates the threadlet/fibril core from the (contentious)
completion notification debate, and allows you to use whatever mechanism
you like.  (fd, signal, kevent, futex, ...)

You can also add a "macro syscall" like the original syslet idea,
and it can be independent of the threadlet mechanism but provide the
same effects.

If the macros can be designed to always exit when done, with a guarantee
never to return to user space, then you can always recycle the stack
after threadlet_exec() returns, whether it blocked in the syscall or not,
and you have your original design.


May I just suggest, however, that the interface be:
	tid = threadlet_exec(...)
Where tid < 0 means error, tid == 0 means completed synchronously,
and tid > 0 identifies the child so it can be waited for?


Anyway, this is a really excellent user-space API.  (You might add some
sort of "am I synchronous?" query, or maybe you could just use gettid()
for the purpose.)

The one interesting question is, can you nest threadlet_exec() calls?
I think it's implementable, and I can definitely see the attraction
of being able to call libraries that use it internally (to do async
read-ahead or whatever) from a threadlet function.

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 11:31         ` Evgeniy Polyakov
@ 2007-02-22 11:52           ` Arjan van de Ven
  2007-02-22 12:39             ` Evgeniy Polyakov
  2007-02-22 12:59           ` Ingo Molnar
  2007-02-25 22:44           ` Linus Torvalds
  2 siblings, 1 reply; 277+ messages in thread
From: Arjan van de Ven @ 2007-02-22 11:52 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Linus Torvalds,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


> It is not a TUX anymore - you had 1024 threads, and all of them will be
> consumed by tcp_sendmsg() for slow clients - rescheduling will kill a
> machine.

I think it's time to make a split in what "context switch" or
"reschedule" means...

there are two types of context switch:

1) To a different process. This means teardown of the TLB, going to a
new MMU state, saving FPU state etc etc etc. This is obviously quite
expensive

2) To a thread of the same process. No TLB flush no new MMU state,
effectively all it does is getting a new task struct on the kernel side,
and a new ESP/EIP pair on the userspace side. If there is FPU code
involved that gets saved as well.

Number 1 is very expensive and that is what is really worrying normally;
number 2 is a LOT lighter weight, and while Linux is a bit heavy there,
it can be made lighter... there's no fundamental reason for it to be
really expensive.

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22  7:40       ` Ingo Molnar
@ 2007-02-22 11:31         ` Evgeniy Polyakov
  2007-02-22 11:52           ` Arjan van de Ven
                             ` (2 more replies)
  2007-02-22 19:38         ` Davide Libenzi
  2007-02-22 21:23         ` Zach Brown
  2 siblings, 3 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-22 11:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
	David S. Miller, Suparna Bhattacharya, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

Hi Ingo, developers.

On Thu, Feb 22, 2007 at 08:40:44AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> Syslets/threadlets on the other hand, once the core is implemented, have 
> near zero ongoing maintainance cost (compared to KAIO pushed into every 
> IO subsystem) and cover all IO disciplines and API variants immediately, 
> and they are as perfectly asynchronous as it gets.
> 
> So all in one, i used to think that AIO state-machines have a long-term 
> place within the kernel, but with syslets i think i've proven myself 
> embarrasingly wrong =B-)

Hmm...
Try running a heavily loaded network web server built on top of
syslets/threadlets.

It is not a TUX anymore - you had 1024 threads, and all of them will be
consumed by tcp_sendmsg() for slow clients - rescheduling will kill a
machine.

My tests show that with 4k connections per second (8k concurrency) more
than 20k connections of 80k total block in tcp_sendmsg() over gigabit
lan between quite fast machines.

Or should threadlet/syslet AIO not be used with networking either?

> 	Ingo

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-22 10:01 ` Suparna Bhattacharya
@ 2007-02-22 11:20   ` Ingo Molnar
  0 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 11:20 UTC (permalink / raw)
  To: Suparna Bhattacharya
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Davide Libenzi,
	Jens Axboe, Thomas Gleixner


* Suparna Bhattacharya <suparna@in.ibm.com> wrote:

> > Threadlets share much of the scheduling infrastructure with syslets.
> > 
> > Syslets (small, kernel-side, scripted "syscall plugins") are still 
> > supported - they are (much...) harder to program than threadlets but 
> > they allow the highest performance. Core infrastructure libraries 
> > like glibc/libaio are expected to use syslets. Jens Axboe's FIO tool 
> > already includes support for v2 syslets, and the following patch 
> > updates FIO to
> 
> Ah, glad to see that - I was wondering if it was worthwhile to try 
> adding syslet support to aio-stress to be able to perform some 
> comparisons. [...]

i think it would definitely be worth it.

> [...] Hopefully FIO should be able to generate a similar workload, but 
> I haven't tried it yet so am not sure. Are you planning to upload some 
> results (so I can compare it with patterns I am familiar with) ?

i had no time yet to do careful benchmarks. Right now my impression from 
quick testing is that libaio performance can be exceeded via syslets. So 
it would be very interesting if you could try this too, independently of 
me.

	Ingo

^ permalink raw reply	[flat|nested] 277+ messages in thread

* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
  2007-02-21 21:13 Ingo Molnar
  2007-02-21 22:46 ` Michael K. Edwards
@ 2007-02-22 10:01 ` Suparna Bhattacharya
  2007-02-22 11:20   ` Ingo Molnar
  2007-02-24 18:34 ` Evgeniy Polyakov
  2 siblings, 1 reply; 277+ messages in thread
From: Suparna Bhattacharya @ 2007-02-22 10:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
	Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
	Zach Brown, Evgeniy Polyakov, David S. Miller, Davide Libenzi,
	Jens Axboe, Thomas Gleixner

On Wed, Feb 21, 2007 at 10:13:55PM +0100, Ingo Molnar wrote:
> this is the v3 release of the syslet/threadlet subsystem:
> 
>    http://redhat.com/~mingo/syslet-patches/
> 
> This release came a few days later than i originally wanted, because 
> i've implemented many fundamental changes to the code. The biggest 
> highlights of v3 are:
> 
>  - "Threadlets": the introduction of the 'threadlet' execution concept.
> 
>  - syslets: multiple rings support with no kernel-side footprint, the 
>    elimination of mlock() pinning, no async_register/unregister() calls 
>    needed anymore and more.
> 
> "Threadlets" are basically the user-space equivalent of syslets: small 
> functions of execution that the kernel attempts to execute without 
> scheduling. If the threadlet blocks, the kernel creates a real thread 
> from it, and execution continues in that thread. The 'head' context (the 
> context that never blocks) returns to the original function that called 
> the threadlet. Threadlets are very easy to use:
> 
> long my_threadlet_fn(void *data)
> {
> 	char *name = data;
> 	int fd;
> 
> 	fd = open(name, O_RDONLY);
> 	if (fd < 0)
> 		goto out;
> 
> 	fstat(fd, &stat);
> 	read(fd, buf, count);
> 	...
> 
> out:
> 	return threadlet_complete();
> }
> 
> 
> main()
> {
> 	done = threadlet_exec(threadlet_fn, new_stack, &user_head);
> 	if (!done)
> 		reqs_queued++;
> }
> 
> There is no limitation whatsoever about how a threadlet function can 
> look like: it can use arbitrary system-calls and all execution will be 
> procedural. There is no 'registration' needed when running threadlets 
> either: the kernel will take care of all the details, user-space just 
> runs a threadlet without any preparation and that's it.
> 
> Completion of async threadlets can be done from user-space via any of 
> the existing APIs: in threadlet-test.c (see the async-test-v3.tar.gz 
> user-space examples at the URL above) i've for example used a futex 
> between the head and the async threads to do threadlet notification. But 
> select(), poll() or signals can be used too - whichever is most 
> convenient to the application writer.
> 
> Threadlets can also be thought of as 'optional threads': they execute in 
> the original context as long as they do not block,