LKML Archive on lore.kernel.org
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
@ 2007-02-22 12:27 linux
2007-02-22 13:40 ` David Miller
0 siblings, 1 reply; 277+ messages in thread
From: linux @ 2007-02-22 12:27 UTC (permalink / raw)
To: mingo; +Cc: linux, linux-kernel
May I just say, that this is f***ing brilliant.
It completely separates the threadlet/fibril core from the (contentious)
completion notification debate, and allows you to use whatever mechanism
you like. (fd, signal, kevent, futex, ...)
You can also add a "macro syscall" like the original syslet idea,
and it can be independent of the threadlet mechanism but provide the
same effects.
If the macros can be designed to always exit when done, guaranteed
never to return to user space, then you can always recycle the stack
after threadlet_exec() returns, whether it blocked in the syscall or not,
and you have your original design.
May I just suggest, however, that the interface be:
tid = threadlet_exec(...)
Where tid < 0 means error, tid == 0 means completed synchronously,
and tid > 0 identifies the child so it can be waited for?
Anyway, this is a really excellent user-space API. (You might add some
sort of "am I synchronous?" query, or maybe you could just use gettid()
for the purpose.)
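For illustration, a minimal sketch of a caller using that convention;
threadlet_exec() is the hypothetical interface under discussion, and
handle_request(), finish(), remember_child() and STACK_SIZE are made-up
placeholders for whatever the application defines:

/* hypothetical syscall being proposed, not an existing interface */
extern long threadlet_exec(void (*fn)(void *), void *arg,
                           void *stack, unsigned long stack_size);

static void submit(void *req, void *stack)
{
        long tid = threadlet_exec(handle_request, req, stack, STACK_SIZE);

        if (tid < 0)
                perror("threadlet_exec");   /* submission failed outright */
        else if (tid == 0)
                finish(req);                /* ran and completed synchronously */
        else
                remember_child(tid);        /* blocked; runs on, reap tid later */
}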
The one interesting question is, can you nest threadlet_exec() calls?
I think it's implementable, and I can definitely see the attraction
of being able to call libraries that use it internally (to do async
read-ahead or whatever) from a threadlet function.
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 12:27 [patch 00/13] Syslets, "Threadlets", generic AIO support, v3 linux
@ 2007-02-22 13:40 ` David Miller
2007-02-22 23:52 ` linux
0 siblings, 1 reply; 277+ messages in thread
From: David Miller @ 2007-02-22 13:40 UTC (permalink / raw)
To: linux; +Cc: mingo, linux-kernel
From: linux@horizon.com
Date: 22 Feb 2007 07:27:21 -0500
> May I just say, that this is f***ing brilliant.
It's brilliant for disk I/O, not for networking for which
blocking is the norm not the exception.
So people will have to likely do something like divide their
applications into handling for I/O to files and I/O to networking.
So beautiful. :-)
Nobody has proposed anything yet which scales well and handles both
cases.
It is a recurring point made by Evgeniy, and he is very right
about it.
If used for networking one could easily make this new interface create
an arbitrary number of threads by just opening up that many
connections to such a server and just sitting there not reading
anything from any of the client sockets. And this happens
non-maliciously for slow clients, whether that is due to application
blockage or the characteristics of the network path.
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 13:40 ` David Miller
@ 2007-02-22 23:52 ` linux
0 siblings, 0 replies; 277+ messages in thread
From: linux @ 2007-02-22 23:52 UTC (permalink / raw)
To: davem, linux; +Cc: linux-kernel, mingo
> It's brilliant for disk I/O, not for networking for which
> blocking is the norm not the exception.
>
> So people will have to likely do something like divide their
> applications into handling for I/O to files and I/O to networking.
> So beautiful. :-)
>
> Nobody has proposed anything yet which scales well and handles both
> cases.
The truly brilliant thing about the whole "create a thread on blocking"
is that you immediately make *every* system call asynchronous-capable,
including the thousands of obscure ioctls, without having to boil the
ocean rewriting 5/6 of the kernel from implicit (stack-based) to
explicit state machines.
You're right that it doesn't solve everything, but it's a big step
forward while keeping a reasonably clean interface.
Now, we have some portions of the kernel (to be precise, those that
currently support poll() and select()) that are written as explicit
state machines and can block on a much smaller context structure.
In truth, the division you assume above isn't so terrible.
My applications are *already* written like that. It's just "poll()
until I accumulate a whole request, then fork a thread to handle it."
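For concreteness, a rough sketch of that structure; poll(), read() and the
pthread calls are real, but request_complete() and the connection
bookkeeping are application placeholders:

#include <poll.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

struct conn { int fd; char buf[4096]; size_t len; };

extern int request_complete(struct conn *c);    /* application framing */

static void *handle_request(void *arg)
{
        struct conn *c = arg;
        /* ... blocking work: disk I/O, database lookups, etc. ... */
        close(c->fd);
        free(c);
        return NULL;
}

static void event_loop(struct pollfd *pfds, struct conn **conns, int n)
{
        for (;;) {
                poll(pfds, n, -1);
                for (int i = 0; i < n; i++) {
                        if (!(pfds[i].revents & POLLIN))
                                continue;
                        struct conn *c = conns[i];
                        ssize_t r = read(c->fd, c->buf + c->len,
                                         sizeof(c->buf) - c->len);
                        if (r <= 0)
                                continue;       /* error/EOF handling omitted */
                        c->len += r;
                        if (request_complete(c)) {
                                pthread_t t;
                                pthread_create(&t, NULL, handle_request, c);
                                pthread_detach(t);
                        }
                }
        }
}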
The only way to avoid allocating a kernel stack is to have the entire
handling code path, including the return to user space, written in
explicit state machine style. (Once you get to user space, you can have
a threading library there if you like.) All the flaming about different
ways to implement completion notification is precisely because not much
is known about the best way to do it; there aren't a lot of applications
that work that way.
(Certainly that's because it wasn't possible before, but it's clearly
an area that requires research, so not committing to an implementation
is A Good Thing.)
But once that is solved, and "system call complete" can be reported
without returning to a user-space thread (which is basically an alternate
system call submission interface, *independent* of the fibril/threadlet
non-blocking implementation), then you can find the hot paths in the
kernel and special-case them to avoid creating a whole thread.
To use a networking analogy, this is a cleanly layered protocol design,
with an optimized fast path *implementation* that blurs the boundaries.
As for the overhead of threading, there are basically three parts:
1) System call (user/kernel boundary crossing) costs. These depend only
on the total number of system calls and not on the number of threads
making them. They can be mitigated *if necessary* with a syslet-like
"macro syscall" mechanism to increase the work per boundary crossing.
The only place threading might increase these numbers is thread
synchronization, and futexes already solve that pretty well.
2) Register and stack swapping. These (and associated cache issues)
are basically unavoidable, and are the bare minimum that longjmp()
does. Nothing thread-based is going to reduce this. (Actually,
the kernel can do better than user space because it can do lazy FPU
state swapping.)
3) MMU context switch costs. These are the big ones, particularly on x86
without TLB context IDs. However, these fall into a few categories:
- Mandatory switches because the entire application is blocked.
I don't see how this can be avoided; these are the cases where
even a user-space longjmp-based thread library would context
switch.
- Context switches between threads in an application. The Linux
kernel already optimizes out the MMU context switch in this case,
and the scheduler already knows that such context switches are
cheaper and preferred.
The one further optimization that's possible is if you have a system
call that (in a common case) blocks multiple times *without accessing
user memory*. This is not a read() or write(), but could be
something like fsync() or ftruncate(). In this case, you could
temporarily mark the thread as a "kernel thread" that can run in any
MMU context, and then fix it explicitly when you unmark it on the
return path.
I can see the space overhead of 1:1 threading, but I really don't think
there's much time overhead.
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-07 18:21 ` Kirk Kuchov
2007-03-07 18:24 ` Jens Axboe
@ 2007-03-07 18:32 ` Evgeniy Polyakov
1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-07 18:32 UTC (permalink / raw)
To: Kirk Kuchov
Cc: Ingo Molnar, Pavel Machek, Davide Libenzi, Eric Dumazet,
Theodore Tso, Linus Torvalds, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Wed, Mar 07, 2007 at 03:21:19PM -0300, Kirk Kuchov (kirk.kuchov@gmail.com) wrote:
> On 3/7/07, Ingo Molnar <mingo@elte.hu> wrote:
> >
> >* Kirk Kuchov <kirk.kuchov@gmail.com> wrote:
> >
> >> I don't believe I'm wasting my time explaining this. They don't exist
> >> as /dev/null, they are just fucking _LINKS_.
> >[...]
> >> > Either stop flaming kernel developers or become one. It is that
> >> > simple.
> >>
> >> If I were to become a kernel developer I would stick with FreeBSD.
> >> [...]
> >
> >Hey, really, this is an excellent idea: what a boon you could become to
> >FreeBSD, again! How much they must be longing for your insightful
> >feedback, how much they must be missing your charming style and tactful
> >approach! I bet they'll want to print your mails out, frame them and
> >hang them over their fireplace, to remember the good old days on cold
> >snowy winter days, with warmth in their hearts! Please?
> >
>
> http://www.totallytom.com/thecureforgayness.html
Fonts are a bit bad in my browser :)
Kirk, I understand your frustration - yes, Linux is not the perfect
place to introduce new ideas, and yes, it lacks some features that modern
(or old) systems have supported for years, but things change with time.
I posted a patch which allows polling for signals; it can be trivially
adapted to support timers and essentially any other event.
Kevent did that too, but some things are just too radical for immediate
inclusion, especially when the majority of users does not require the
additional functionality.
People do work, and a lot of them do really good work, so there is no need
for rude talk about how bad things are. Things change - even I accept
that, although the way kevent was ignored should put me in the front line
with you :)
Be good, and be cool.
> --
> Kirk Kuchov
--
Evgeniy Polyakov
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-07 18:21 ` Kirk Kuchov
@ 2007-03-07 18:24 ` Jens Axboe
2007-03-07 18:32 ` Evgeniy Polyakov
1 sibling, 0 replies; 277+ messages in thread
From: Jens Axboe @ 2007-03-07 18:24 UTC (permalink / raw)
To: Kirk Kuchov
Cc: Ingo Molnar, Pavel Machek, Davide Libenzi, Evgeniy Polyakov,
Eric Dumazet, Theodore Tso, Linus Torvalds, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Thomas Gleixner
On Wed, Mar 07 2007, Kirk Kuchov wrote:
> On 3/7/07, Ingo Molnar <mingo@elte.hu> wrote:
> >
> >* Kirk Kuchov <kirk.kuchov@gmail.com> wrote:
> >
> >> I don't believe I'm wasting my time explaining this. They don't exist
> >> as /dev/null, they are just fucking _LINKS_.
> >[...]
> >> > Either stop flaming kernel developers or become one. It is that
> >> > simple.
> >>
> >> If I were to become a kernel developer I would stick with FreeBSD.
> >> [...]
> >
> >Hey, really, this is an excellent idea: what a boon you could become to
> >FreeBSD, again! How much they must be longing for your insightful
> >feedback, how much they must be missing your charming style and tactful
> >approach! I bet they'll want to print your mails out, frame them and
> >hang them over their fireplace, to remember the good old days on cold
> >snowy winter days, with warmth in their hearts! Please?
> >
>
> http://www.totallytom.com/thecureforgayness.html
Dude, get a life. But more importantly, go waste somebody else's time
instead of lkml's.
--
Jens Axboe, updating killfile
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-07 17:39 ` Ingo Molnar
@ 2007-03-07 18:21 ` Kirk Kuchov
2007-03-07 18:24 ` Jens Axboe
2007-03-07 18:32 ` Evgeniy Polyakov
0 siblings, 2 replies; 277+ messages in thread
From: Kirk Kuchov @ 2007-03-07 18:21 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pavel Machek, Davide Libenzi, Evgeniy Polyakov, Eric Dumazet,
Theodore Tso, Linus Torvalds, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On 3/7/07, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Kirk Kuchov <kirk.kuchov@gmail.com> wrote:
>
> > I don't believe I'm wasting my time explaining this. They don't exist
> > as /dev/null, they are just fucking _LINKS_.
> [...]
> > > Either stop flaming kernel developers or become one. It is that
> > > simple.
> >
> > If I were to become a kernel developer I would stick with FreeBSD.
> > [...]
>
> Hey, really, this is an excellent idea: what a boon you could become to
> FreeBSD, again! How much they must be longing for your insightful
> feedback, how much they must be missing your charming style and tactful
> approach! I bet they'll want to print your mails out, frame them and
> hang them over their fireplace, to remember the good old days on cold
> snowy winter days, with warmth in their hearts! Please?
>
http://www.totallytom.com/thecureforgayness.html
--
Kirk Kuchov
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-07 12:02 ` Kirk Kuchov
2007-03-07 17:26 ` Linus Torvalds
@ 2007-03-07 17:39 ` Ingo Molnar
2007-03-07 18:21 ` Kirk Kuchov
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-07 17:39 UTC (permalink / raw)
To: Kirk Kuchov
Cc: Pavel Machek, Davide Libenzi, Evgeniy Polyakov, Eric Dumazet,
Theodore Tso, Linus Torvalds, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Kirk Kuchov <kirk.kuchov@gmail.com> wrote:
> I don't believe I'm wasting my time explaining this. They don't exist
> as /dev/null, they are just fucking _LINKS_.
[...]
> > Either stop flaming kernel developers or become one. It is that
> > simple.
>
> If I were to become a kernel developer I would stick with FreeBSD.
> [...]
Hey, really, this is an excellent idea: what a boon you could become to
FreeBSD, again! How much they must be longing for your insightful
feedback, how much they must be missing your charming style and tactful
approach! I bet they'll want to print your mails out, frame them and
hang them over their fireplace, to remember the good old days on cold
snowy winter days, with warmth in their hearts! Please?
Ingo
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-07 12:02 ` Kirk Kuchov
@ 2007-03-07 17:26 ` Linus Torvalds
2007-03-07 17:39 ` Ingo Molnar
1 sibling, 0 replies; 277+ messages in thread
From: Linus Torvalds @ 2007-03-07 17:26 UTC (permalink / raw)
To: Kirk Kuchov
Cc: Pavel Machek, Davide Libenzi, Evgeniy Polyakov, Ingo Molnar,
Eric Dumazet, Theodore Tso, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Wed, 7 Mar 2007, Kirk Kuchov wrote:
>
> I don't believe I'm wasting my time explaining this. They don't exist
> as /dev/null, they are just fucking _LINKS_. I could even "ln -s
> /proc/self/fd/0 sucker". A real /dev/stdout can/could even exist, but
> that's not the point!
Actually, one large reason for /proc/self/ existing is exactly /dev/stdin
and friends.
And yes, /proc/self looks like a link too, but that doesn't change the
fact that it's a very special file. No different from /dev/null or
friends.
Linus
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-06 8:19 ` Pavel Machek
@ 2007-03-07 12:02 ` Kirk Kuchov
2007-03-07 17:26 ` Linus Torvalds
2007-03-07 17:39 ` Ingo Molnar
0 siblings, 2 replies; 277+ messages in thread
From: Kirk Kuchov @ 2007-03-07 12:02 UTC (permalink / raw)
To: Pavel Machek
Cc: Davide Libenzi, Evgeniy Polyakov, Ingo Molnar, Eric Dumazet,
Theodore Tso, Linus Torvalds, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On 3/6/07, Pavel Machek <pavel@ucw.cz> wrote:
> > >As for why common abstractions like file are a good thing, think about why
> > >having "/dev/null" is cleaner than having a special plug DEVNULL_FD fd
> > >value to be plugged everywhere,
> >
> > This is a stupid comparison. By your logic we should also have /dev/stdin,
> > /dev/stdout and /dev/stderr.
>
> Bzzt, wrong. We have them.
>
> pavel@amd:~$ ls -al /dev/std*
> lrwxrwxrwx 1 root root 4 Nov 12 2003 /dev/stderr -> fd/2
> lrwxrwxrwx 1 root root 4 Nov 12 2003 /dev/stdin -> fd/0
> lrwxrwxrwx 1 root root 4 Nov 12 2003 /dev/stdout -> fd/1
> pavel@amd:~$ ls -al /proc/self/fd
> total 0
> dr-x------ 2 pavel users 0 Mar 6 09:18 .
> dr-xr-xr-x 4 pavel users 0 Mar 6 09:18 ..
> lrwx------ 1 pavel users 64 Mar 6 09:18 0 -> /dev/ttyp2
> lrwx------ 1 pavel users 64 Mar 6 09:18 1 -> /dev/ttyp2
> lrwx------ 1 pavel users 64 Mar 6 09:18 2 -> /dev/ttyp2
> lr-x------ 1 pavel users 64 Mar 6 09:18 3 -> /proc/2299/fd
> pavel@amd:~$
I don't believe I'm wasting my time explaining this. They don't exist
as /dev/null, they are just fucking _LINKS_. I could even "ln -s
/proc/self/fd/0 sucker". A real /dev/stdout can/could even exist, but
that's not the point!
It remains a stupid comparison because /dev/stdin/stderr/whatever
"must" be plugged, else how could a process write to stdout/stderr
that it coud'nt open it ? The way things are is not because it's
cleaner to have it as a file but because it's the only sane way.
/dev/null is not a must have, it's mainly used for redirecting
purposes. A sys_nullify(fileno(stdout)) would rule out almost any use
of /dev/null.
> > >As for why common abstractions like file are a good thing, think about why
> > >having "/dev/null" is cleaner than having a special plug DEVNULL_FD fd
> > >value to be plugged everywhere,
> > >But here the list could be almost endless.
> > >And please don't start the, they don't scale or they need heavy file
> > >binding tossfeast. They scale as well as the interface that will receive
> > >them (poll, select, epoll). Heavy file binding what? 100 or so bytes for
> > >the struct file? How many signal/timer fd are you gonna have? Like 100K?
> > >Really moot argument when opposed to the benefit of being compatible with
> > >existing POSIX interfaces and being more Unix friendly.
> >
> > So why the HELL don't we have those yet? Why haven't you designed
> > epoll with those in mind? Why don't you back your claims with patches?
> > (I'm not a kernel developer.)
>
> Either stop flaming kernel developers or become one. It is that
> simple.
>
If I were to become a kernel developer I would stick with FreeBSD. At
least they have had kqueue for about seven years now.
--
Kirk Kuchov
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-04 22:49 ` Kirk Kuchov
2007-03-04 22:57 ` Davide Libenzi
2007-03-05 4:46 ` Magnus Naeslund(k)
@ 2007-03-06 8:19 ` Pavel Machek
2007-03-07 12:02 ` Kirk Kuchov
2 siblings, 1 reply; 277+ messages in thread
From: Pavel Machek @ 2007-03-06 8:19 UTC (permalink / raw)
To: Kirk Kuchov
Cc: Davide Libenzi, Evgeniy Polyakov, Ingo Molnar, Eric Dumazet,
Theodore Tso, Linus Torvalds, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
> >As for why common abstractions like file are a good thing, think about why
> >having "/dev/null" is cleaner than having a special plug DEVNULL_FD fd
> >value to be plugged everywhere,
>
> This is a stupid comparison. By your logic we should also have /dev/stdin,
> /dev/stdout and /dev/stderr.
Bzzt, wrong. We have them.
pavel@amd:~$ ls -al /dev/std*
lrwxrwxrwx 1 root root 4 Nov 12 2003 /dev/stderr -> fd/2
lrwxrwxrwx 1 root root 4 Nov 12 2003 /dev/stdin -> fd/0
lrwxrwxrwx 1 root root 4 Nov 12 2003 /dev/stdout -> fd/1
pavel@amd:~$ ls -al /proc/self/fd
total 0
dr-x------ 2 pavel users 0 Mar 6 09:18 .
dr-xr-xr-x 4 pavel users 0 Mar 6 09:18 ..
lrwx------ 1 pavel users 64 Mar 6 09:18 0 -> /dev/ttyp2
lrwx------ 1 pavel users 64 Mar 6 09:18 1 -> /dev/ttyp2
lrwx------ 1 pavel users 64 Mar 6 09:18 2 -> /dev/ttyp2
lr-x------ 1 pavel users 64 Mar 6 09:18 3 -> /proc/2299/fd
pavel@amd:~$
> >But here the list could be almost endless.
> >And please don't start the, they don't scale or they need heavy file
> >binding tossfeast. They scale as well as the interface that will receive
> >them (poll, select, epoll). Heavy file binding what? 100 or so bytes for
> >the struct file? How many signal/timer fd are you gonna have? Like 100K?
> >Really moot argument when opposed to the benefit of being compatible with
> >existing POSIX interfaces and being more Unix friendly.
>
> So why the HELL don't we have those yet? Why haven't you designed
> epoll with those in mind? Why don't you back your claims with patches?
> (I'm not a kernel developer.)
Either stop flaming kernel developers or become one. It is that
simple.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-04 17:46 ` Kyle Moffett
@ 2007-03-05 5:23 ` Michael K. Edwards
0 siblings, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-03-05 5:23 UTC (permalink / raw)
To: Kyle Moffett
Cc: Kirk Kuchov, Davide Libenzi, Evgeniy Polyakov, Ingo Molnar,
Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, Linux Kernel Mailing List, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On 3/4/07, Kyle Moffett <mrmacman_g4@mac.com> wrote:
> Well, even this far into 2.6, Linus' patch from 2003 still (mostly)
> applies; the maintenance cost for this kind of code is virtually
> zilch. If it matters that much to you clean it up and make it apply;
> add an alarmfd() syscall (another 100 lines of code at most?) and
> make a "read" return an architecture-independent siginfo-like
> structure and submit it for inclusion. Adding epoll() support for
> random objects is as simple as a 75-line object-filesystem and a 25-
> line syscall to return an FD to a new inode. Have fun! Go wild!
> Something this trivially simple could probably spend a week in -mm
> and go to linus for 2.6.22.
Or, if you want to do slightly more work and produce something a great
deal more useful, you could implement additional netlink address
families for additional "event" sources. The socket - setsockopt -
bind - sendmsg/recvmsg sequence is a well understood and well
documented UNIX paradigm for multiplexing non-blocking I/O to many
destinations over one socket. Everyone who has read Stevens is
familiar with the basic UDP and "fd open server" techniques, and if
you look at Linux's IP_PKTINFO and NETLINK_W1 (bravo, Evgeniy!) you'll
see how easily they could be extended to file AIO and other kinds of
event sources.
For file AIO, you might have the application open one AIO socket per
mount point, open files indirectly via the SCM_RIGHTS mechanism, and
submit/retire read/write requests via sendmsg/recvmsg with ancillary
data consisting of an lseek64 tuple and a user-provided cookie.
Although the process still has to have one fd open per actual open
file (because trying to authenticate file accesses without opening fds
is madness), the only fds it has to manipulate directly are those
representing entire pools of outstanding requests. This is usually a
small enough set that select() will do just fine, if you're careful
with fd allocation. (You can simply punt indirectly opened fds up to
a high numerical range, where they can't be accessed directly from
userspace but still make fine cookies for use in lseek64 tuples within
cmsg headers).
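As a concrete reference for the plumbing this relies on, here is the
standard Stevens-style SCM_RIGHTS pass of an open fd over a UNIX-domain
socket; the proposed AIO address family itself is hypothetical, this only
shows the sendmsg()/cmsg mechanics such a scheme would build on:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sock, int fd_to_pass)
{
        char dummy = 0;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
                struct cmsghdr align;
                char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;          /* kernel duplicates the fd */
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}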
The same basic approach will work for timers, signals, and just about
any other event source. Userspace is of course still stuck doing its
own state machines / thread scheduling / however you choose to think
of it. But all the important activity goes through socketcall(), and
the data and control parameters are all packaged up into a struct
msghdr instead of the bare buffer pointers of read/write. So if
someone else does come along later and design an ultralight threading
mechanism that isn't a total botch, the actual data paths won't need
much rework; the exception handling will just get a lot simpler.
Cheers,
- Michael
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-04 22:49 ` Kirk Kuchov
2007-03-04 22:57 ` Davide Libenzi
@ 2007-03-05 4:46 ` Magnus Naeslund(k)
2007-03-06 8:19 ` Pavel Machek
2 siblings, 0 replies; 277+ messages in thread
From: Magnus Naeslund(k) @ 2007-03-05 4:46 UTC (permalink / raw)
To: Kirk Kuchov; +Cc: linux-kernel
Kirk Kuchov wrote:
[snip]
>
> This is a stupid comparison. By your logic we should also have /dev/stdin,
> /dev/stdout and /dev/stderr.
>
Well, as a matter of fact (on my system):
# ls -l /dev/std*
lrwxrwxrwx 1 root root 4 Feb 1 2006 /dev/stderr -> fd/2
lrwxrwxrwx 1 root root 4 Feb 1 2006 /dev/stdin -> fd/0
lrwxrwxrwx 1 root root 4 Feb 1 2006 /dev/stdout -> fd/1
Please don't bother to respond to this mail, I just saw that you
apparently needed the info.
Magnus
P.S.: *PLONK*
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-04 22:49 ` Kirk Kuchov
@ 2007-03-04 22:57 ` Davide Libenzi
2007-03-05 4:46 ` Magnus Naeslund(k)
2007-03-06 8:19 ` Pavel Machek
2 siblings, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-04 22:57 UTC (permalink / raw)
To: Kirk Kuchov
Cc: Evgeniy Polyakov, Ingo Molnar, Eric Dumazet, Pavel Machek,
Theodore Tso, Linus Torvalds, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Sun, 4 Mar 2007, Kirk Kuchov wrote:
> I don't give a shit.
Here's another good use of /dev/null:
*PLONK*
- Davide
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-04 21:17 ` Davide Libenzi
@ 2007-03-04 22:49 ` Kirk Kuchov
2007-03-04 22:57 ` Davide Libenzi
` (2 more replies)
0 siblings, 3 replies; 277+ messages in thread
From: Kirk Kuchov @ 2007-03-04 22:49 UTC (permalink / raw)
To: Davide Libenzi
Cc: Evgeniy Polyakov, Ingo Molnar, Eric Dumazet, Pavel Machek,
Theodore Tso, Linus Torvalds, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On 3/4/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> On Sun, 4 Mar 2007, Kirk Kuchov wrote:
>
> > On 3/3/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> > > <snip>
> > >
> > > Those *other* (tons?!?) interfaces can be created *when* the need comes
> > > (see Linus signalfd [1] example to show how urgent that was). *When*
> > > the need comes, they will work with existing POSIX interfaces, without
> > > requiring your own just-another event interface. Those other interfaces
> > > could also be more easily adopted by other Unix cousins, because of
> > > the fact that they rely on existing POSIX interfaces.
> >
> > Please stop with this crap, this chicken or the egg argument of yours is utter
> > BULLSHIT!
>
> Wow, wow, fella! You _definitely_ cannot afford rudeness here.
I don't give a shit.
> You started bad, and you end even worse. By listing a some APIs that will
> work only with epoll. As I said already, and as it was listed in the
> thread I posted the link, something like:
>
> int signalfd(...); // Linus initial interface would be perfectly fine
> int timerfd(...); // Open ...
> int eventfd(...); // [1]
>
> Will work *even* with standard POSIX select/poll. 95% or more of the
> software does not have scalability issues, and select/poll are more
> portable and easy to use for simple stuff. On top of that, as I already
> said, they are *confined* interfaces that could be more easily adopted by
> other Unixes (if they are 100-200 lines on Linux, don't expect them to be
> a lot more on other Unixes) [2]. We *already* have the infrastructure
> inside Linux to deliver events (f_op->poll subsystem), how about we use
> that instead of just-another way? [3]
Man you're so full of shit, your eyes are brown. NOBODY cares about
select/poll or whether the interfaces are going to be adopted by other
Unixes. This issue was already solved by them YEARS ago.
What I want (and a ton of other users) is a SIMPLE and generic way to
receive events from _MULTIPLE_ sources. I don't care about
kernel-level portability, ease of implementation or whatever; the Linux
kernel developers are good at not knowing what their users want.
> As for why common abstractions like file are a good thing, think about why
> having "/dev/null" is cleaner that having a special plug DEVNULL_FD fd
> value to be plugged everywhere,
This is a stupid comparison. By your logic we should also have /dev/stdin,
/dev/stdout and /dev/stderr.
> or why I can use find/grep/cat/echo/... to
> look/edit at my configuration inside /proc, instead of using a frigging
> registry editor.
Yet another stupid comparison: /proc is a MESS! Almost as bad as
the registry. Linux now has three pieces of crap for
configuration/information: /proc, sysfs and sysctl. Nobody knows
exactly what should go into each one of those. Crap design at its
best.
> But here the list could be almost endless.
> And please don't start the, they don't scale or they need heavy file
> binding tossfeast. They scale as well as the interface that will receive
> them (poll, select, epoll). Heavy file binding what? 100 or so bytes for
> the struct file? How many signal/timer fd are you gonna have? Like 100K?
> Really moot argument when opposed to the benefit of being compatible with
> existing POSIX interfaces and being more Unix friendly.
So why the HELL don't we have those yet? Why haven't you designed
epoll with those in mind? Why don't you back your claims with patches?
(I'm not a kernel developer.)
> As for the AIO stuff, if threadlets/syslets will prove effective, you can
> host an epoll_wait over a syslet/threadlet. Or, if the 3 lines of
> userspace code needed to do that, fall inside your definition of "kludge",
> we can even find a way to bridge the two.
I don't care about threadlets in this context, I just want to wait for
EVENTS from MULTIPLE sources WITHOUT mixing signals and other crap.
Your arrogance is amusing, stop pushing narrow-minded beliefs down the
throats of all Linux users. Kqueue, event ports,
WaitForMultipleObjects, epoll with multiple sources. That's what users
want, not yet another syscall/whatever hack.
> Now, how about we focus on the topic of this thread?
>
> [1] This could be an idea. People already uses pipes for this, but pipes
> has some memory overhead inside the kernel (plus use two fds) that
> could, if really felt necessary, be avoided.
Yet another hack!! 64kiB of space just to push some user events
around. Great idea!
>
> [2] This is how those kind of interfaces should be designed. Modular,
> re-usable, file-based interfaces, whose acceptance is not linked into
> slurping-in a whole new interface with tenths of sub, interface-only,
> objects. And from this POV, epoll is the friendlier.
Who said I want yet another interface? I just fucking want to receive
events from MULTIPLE sources through epoll. With or without an fd! My
anger and frustration is that we can't get past this SIMPLE need!
> [3] Notice the similarity between threadlets/syslets and epoll? They
> enable pretty darn good scalability, with *existing* infrastructure,
> and w/out special ad-hoc code to be plugged everywhere. This translate
> directly in easier to maintain code.
Easier for kernel developers, of course. <sarcasm> Who cares? Fuck the
users. </sarcasm>
--
Kirk Kuchov
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-04 16:23 ` Kirk Kuchov
2007-03-04 17:46 ` Kyle Moffett
@ 2007-03-04 21:17 ` Davide Libenzi
2007-03-04 22:49 ` Kirk Kuchov
1 sibling, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-04 21:17 UTC (permalink / raw)
To: Kirk Kuchov
Cc: Evgeniy Polyakov, Ingo Molnar, Eric Dumazet, Pavel Machek,
Theodore Tso, Linus Torvalds, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Sun, 4 Mar 2007, Kirk Kuchov wrote:
> On 3/3/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> > <snip>
> >
> > Those *other* (tons?!?) interfaces can be created *when* the need comes
> > (see Linus signalfd [1] example to show how urgent that was). *When*
> > the need comes, they will work with existing POSIX interfaces, without
> > requiring your own just-another event interface. Those other interfaces
> > could also be more easily adopted by other Unix cousins, because of
> > the fact that they rely on existing POSIX interfaces.
>
> Please stop with this crap, this chicken or the egg argument of yours is utter
> BULLSHIT!
Wow, wow, fella! You _definitely_ cannot afford rudeness here.
You started bad, and you end even worse. By listing some APIs that will
work only with epoll. As I said already, and as it was listed in the
thread I posted the link, something like:
int signalfd(...); // Linus initial interface would be perfectly fine
int timerfd(...); // Open ...
int eventfd(...); // [1]
Will work *even* with standard POSIX select/poll. 95% or more of the
software does not have scalability issues, and select/poll are more
portable and easy to use for simple stuff. On top of that, as I already
said, they are *confined* interfaces that could be more easily adopted by
other Unixes (if they are 100-200 lines on Linux, don't expect them to be
a lot more on other Unixes) [2]. We *already* have the infrastructure
inside Linux to deliver events (f_op->poll subsystem), how about we use
that instead of just-another way? [3]
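A minimal sketch of what that buys you, assuming a signalfd-like call with
the glibc-style signature int signalfd(int fd, const sigset_t *mask, int
flags) (roughly the shape that was eventually merged; the prototype being
discussed in this thread differed in detail, so treat the call itself as an
assumption): once the signal source is just an fd, plain POSIX poll() or
select() can wait on it right next to the sockets.

#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <sys/signalfd.h>
#include <unistd.h>

static int wait_for_sigint(void)
{
        sigset_t mask;

        sigemptyset(&mask);
        sigaddset(&mask, SIGINT);
        sigprocmask(SIG_BLOCK, &mask, NULL);    /* stop async delivery */

        int sfd = signalfd(-1, &mask, 0);       /* the signal becomes an fd */
        struct pollfd pfd = { .fd = sfd, .events = POLLIN };

        poll(&pfd, 1, -1);                      /* same call you use for sockets */

        struct signalfd_siginfo si;
        read(sfd, &si, sizeof(si));
        printf("signal %u delivered via fd\n", si.ssi_signo);
        close(sfd);
        return 0;
}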
As for why common abstractions like file are a good thing, think about why
having "/dev/null" is cleaner that having a special plug DEVNULL_FD fd
value to be plugged everywhere, or why I can use find/grep/cat/echo/... to
look/edit at my configuration inside /proc, instead of using a frigging
registry editor. But here the list could be almost endless.
And please don't start the "they don't scale" or "they need heavy file
binding" tossfeast. They scale as well as the interface that will receive
them (poll, select, epoll). Heavy file binding what? 100 or so bytes for
the struct file? How many signal/timer fd are you gonna have? Like 100K?
Really moot argument when opposed to the benefit of being compatible with
existing POSIX interfaces and being more Unix friendly.
As for the AIO stuff, if threadlets/syslets prove effective, you can
host an epoll_wait over a syslet/threadlet. Or, if the 3 lines of
userspace code needed to do that fall inside your definition of "kludge",
we can even find a way to bridge the two.
Now, how about we focus on the topic of this thread?
[1] This could be an idea. People already use pipes for this, but pipes
have some memory overhead inside the kernel (plus they use two fds) that
could, if really felt necessary, be avoided.
[2] This is how those kinds of interfaces should be designed: modular,
re-usable, file-based interfaces whose acceptance is not tied to
slurping in a whole new interface with tens of sub-, interface-only,
objects. And from this POV, epoll is the friendliest.
[3] Notice the similarity between threadlets/syslets and epoll? They
enable pretty darn good scalability, with *existing* infrastructure,
and w/out special ad-hoc code to be plugged in everywhere. This translates
directly into easier-to-maintain code.
- Davide
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-04 16:23 ` Kirk Kuchov
@ 2007-03-04 17:46 ` Kyle Moffett
2007-03-05 5:23 ` Michael K. Edwards
2007-03-04 21:17 ` Davide Libenzi
1 sibling, 1 reply; 277+ messages in thread
From: Kyle Moffett @ 2007-03-04 17:46 UTC (permalink / raw)
To: Kirk Kuchov
Cc: Davide Libenzi, Evgeniy Polyakov, Ingo Molnar, Eric Dumazet,
Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Mar 04, 2007, at 11:23:37, Kirk Kuchov wrote:
> So here we are, 2007. epoll() works with files, pipes, sockets,
> inotify and anything pollable (file descriptors) but aio, timers,
> signals and user-defined event. Can we please get those working
> with epoll ? Something as simple as:
>
> [code snipped]
>
> Would this be acceptable? Can we finally move on?
Well, even this far into 2.6, Linus' patch from 2003 still (mostly)
applies; the maintenance cost for this kind of code is virtually
zilch. If it matters that much to you clean it up and make it apply;
add an alarmfd() syscall (another 100 lines of code at most?) and
make a "read" return an architecture-independent siginfo-like
structure and submit it for inclusion. Adding epoll() support for
random objects is as simple as a 75-line object-filesystem and a 25-
line syscall to return an FD to a new inode. Have fun! Go wild!
Something this trivially simple could probably spend a week in -mm
and go to linus for 2.6.22.
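For the flavour of it, a hedged kernel-side sketch of what such an
fd-returning object looks like; alarmfd_ctx, alarmfd_poll and the timer
arming are invented placeholders, sys_alarmfd is the hypothetical syscall,
and anon_inode_getfd() is the small helper that later carried
signalfd/timerfd/eventfd, not something present in the tree being discussed
here:

#include <linux/anon_inodes.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/linkage.h>
#include <linux/poll.h>
#include <linux/slab.h>
#include <linux/wait.h>

struct alarmfd_ctx {
        wait_queue_head_t wqh;
        int expired;
};

static unsigned int alarmfd_poll(struct file *file, poll_table *wait)
{
        struct alarmfd_ctx *ctx = file->private_data;

        poll_wait(file, &ctx->wqh, wait);
        return ctx->expired ? POLLIN : 0;
}

static const struct file_operations alarmfd_fops = {
        .poll = alarmfd_poll,
        /* .read and .release omitted */
};

asmlinkage long sys_alarmfd(unsigned long expiry_ms)
{
        struct alarmfd_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

        if (!ctx)
                return -ENOMEM;
        init_waitqueue_head(&ctx->wqh);
        /* arm a timer here that sets ctx->expired and wakes ctx->wqh
         * after expiry_ms; omitted for brevity */
        return anon_inode_getfd("[alarmfd]", &alarmfd_fops, ctx, O_RDONLY);
}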
Cheers,
Kyle Moffett
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 21:57 ` Davide Libenzi
2007-03-03 22:10 ` Davide Libenzi
@ 2007-03-04 16:23 ` Kirk Kuchov
2007-03-04 17:46 ` Kyle Moffett
2007-03-04 21:17 ` Davide Libenzi
1 sibling, 2 replies; 277+ messages in thread
From: Kirk Kuchov @ 2007-03-04 16:23 UTC (permalink / raw)
To: Davide Libenzi
Cc: Evgeniy Polyakov, Ingo Molnar, Eric Dumazet, Pavel Machek,
Theodore Tso, Linus Torvalds, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On 3/3/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> <snip>
>
> Those *other* (tons?!?) interfaces can be created *when* the need comes
> (see Linus signalfd [1] example to show how urgent that was). *When*
> the need comes, they will work with existing POSIX interfaces, without
> requiring your own just-another event interface. Those other interfaces
> could also be more easily adopted by other Unix cousins, because of
> the fact that they rely on existing POSIX interfaces.
Please stop with this crap, this chicken-or-the-egg argument of yours is
utter BULLSHIT! Just because Linux doesn't have a decent kernel event
notification mechanism does not mean that users don't need one. Nobody
cared about Linus's signalfd because it wasn't mainline.
Look at any of the event notification libraries out there; it makes me sick
how much kludge they have to go through to get near the same functionality
as kqueue on Linux.
Solaris has had the Event Ports mechanism since 2003. FreeBSD, NetBSD,
OpenBSD and Mac OS X have supported kqueue since around 2000. Windows has
had event notification for ages now. These _facilities_ are all widely
used, given the platforms' popularity.
So here we are, 2007. epoll() works with files, pipes, sockets, inotify
and anything pollable (file descriptors), but not aio, timers, signals or
user-defined events. Can we please get those working with epoll?
Something as simple as:
struct epoll_event ev;
ev.events = EV_TIMER | EPOLLONESHOT;
ev.data.u64 = 1000; /* timeout */
epoll_ctl(epfd, EPOLL_CTL_ADD, 0 /* ignored */, &ev);
or
struct sigevent ev;
ev.sigev_notify = SIGEV_EPOLL;
ev.sigev_signo = epfd;
ev.sigev_value = &ev;
timer_create(CLOCK_MONOTONIC, &ev, &timerid);
AIO:
struct sigevent ev;
int fd = io_setup(..); /* oh boy, I wish... but it works */
ev.events = EV_AIO | EPOLLONESHOT;
/* event.data.ptr returns pointer to the iocb */
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
or
struct iocb iocb;
iocb.aio_fildes = fileno(stdin);
iocb.aio_lio_opcode = IO_CMD_PREAD;
iocb.c.notify = IO_NOTIFY_EPOLL; /* __pad3/4 */
Would this be acceptable? Can we finally move on?
--
Kirk Kuchov
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 21:12 ` Ray Lee
2007-03-03 22:14 ` Ihar `Philips` Filipau
@ 2007-03-04 9:33 ` Michael K. Edwards
1 sibling, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-03-04 9:33 UTC (permalink / raw)
To: ray-gmail; +Cc: Ihar `Philips` Filipau, Evgeniy Polyakov, linux-kernel
Please don't take this the wrong way, Ray, but I don't think _you_
understand the problem space that people are (or should be) trying to
address here.
Servers want to always, always block. Not on a socket, not on a stat,
not on any _one_ thing, but in a condition where the optimum number of
concurrent I/O requests are outstanding (generally of several kinds
with widely varying expected latencies). I have an embedded server I
wrote that avoids forking internally for any reason, although it
watches the damn serial port signals in parallel with handling network
I/O, audio, and child processes that handle VoIP signaling protocols
(which are separate processes because it was more practical to write
them in a different language with mediocre embeddability). There's a
lot of things that can block out there, not just disk I/O, but the
only thing a genuinely scalable server process ever blocks on (apart
from the odd spinlock) is a wait-for-IO-from-somewhere mechanism like
select or epoll or kqueue (or even sleep() while awaiting SIGRT+n, or
if it genuinely doesn't suck, the thread scheduler).
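To make that single blocking point concrete, this is the skeleton such a
server reduces to; dispatch() is an application placeholder:

#include <sys/epoll.h>

extern void dispatch(void *ctx, unsigned int events);   /* application-defined */

static void run(int epfd)
{
        struct epoll_event ev[64];

        for (;;) {
                int n = epoll_wait(epfd, ev, 64, -1);   /* the one place we sleep */

                for (int i = 0; i < n; i++)
                        dispatch(ev[i].data.ptr, ev[i].events);
        }
}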
Furthermore, not only do servers want to block rather than shove more
I/O into the plumbing than it can handle without backing up, they also
want to throttle the concurrency of requests at the kernel level *for
the kernel's benefit*. In particular, a server wants to submit to the
kernel a ton of stats and I/O in parallel, far more than it makes
sense to actually issue concurrently, so that efficient sequencing of
these requests can be left to the kernel. But the server wants to
guide the kernel with regard to the ratios of concurrency appropriate
to the various classes and the relative urgency of the individual
requests within each class. The server also wants to be able to
reprioritize groups of requests or cancel them altogether based on new
information about hardware status and user behavior.
Finally, the biggest argument against syslets/threadlets AFAICS is
that -- if done incorrectly, as currently proposed -- they would unify
the AIO and normal IO paths in the kernel. This would shackle AIO to
the current semantics of synchronous syscalls, in which buffers are
passed as bare pointers and exceptional results are tangled up with
programming errors. This would, in turn, make it quite impossible for
future hardware to pipeline and speculatively execute chains of AIO
operations, leaving "syslets" to a few RDBMS programmers with time to
burn. The unimproved ease of long term maintenance on the kernel (not
to mention the complete failure to make the writing of _correct_,
performant server code any easier) makes them unworthy of
consideration for inclusion.
So, while everybody has been talking about cached and non-cached
cases, those are really total irrelevancies. The principal problem
that needs solving is to model the process's pool of in-flight I/O
requests, together with a much larger number of submitted but not yet
issued requests whose results are foreseeably likely to be needed
soon, using a data structure that efficiently supports _all_ of the
operations needed, including bulk cancellation, reprioritization, and
batch migration based on affinities among requests and locality to the
correct I/O resources. Memory footprint and gentle-on-real-hardware
scheduling are secondary, but also important, considerations. If you
happen to be able to service certain things directly from cache,
that's gravy -- but it's not very smart IMHO to put that central to
your design process.
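One possible shape for such a request pool, purely illustrative (the field
names and grouping scheme are not from any proposed patch):

enum req_class { REQ_STAT, REQ_READ, REQ_WRITE, REQ_FSYNC, REQ_NCLASSES };

struct io_req {
        enum req_class   class;        /* expected-latency class */
        int              prio;         /* urgency within the class */
        unsigned long    group;        /* handle for bulk cancel/reprioritize */
        int              resource;     /* device/queue affinity hint */
        void            *cookie;       /* handed back to the caller on completion */
        struct io_req   *next;         /* links requests within a group */
};

struct io_pool {
        struct io_req   *issued;                     /* in flight in the kernel */
        struct io_req   *pending;                    /* submitted, not yet issued */
        unsigned int     max_inflight[REQ_NCLASSES]; /* per-class concurrency caps */
};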
Cheers,
- Michael
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 22:14 ` Ihar `Philips` Filipau
@ 2007-03-03 23:54 ` Ray Lee
0 siblings, 0 replies; 277+ messages in thread
From: Ray Lee @ 2007-03-03 23:54 UTC (permalink / raw)
To: Ihar `Philips` Filipau; +Cc: Evgeniy Polyakov, linux-kernel
Ihar `Philips` Filipau wrote:
> On 3/3/07, Ray Lee <madrabbit@gmail.com> wrote:
>> On 3/3/07, Ihar `Philips` Filipau <thephilips@gmail.com> wrote:
>> > What I'm trying to get to: keep things simple. The proposed
>> > optimization by Ingo does nothing else but allowing AIO to probe file
>> > cache - if data there to go with fast path. So why not to implement
>> > what the people want - probing of cache? Because it sounds bad? But
>> > they are in fact proposing precisely that just masked with "fast
>> > threads".
>>
>>
>> Servers want to never, ever block. Not on a socket, not on a stat, not
>> on anything. (I have an embedded server I wrote that has to fork
>> internally just to watch the damn serial port signals in parallel with
>> handling network I/O, audio, and child processes that handle H323.)
>> There's a lot of things that can block out there, and it's not just
>> disk I/O.
>>
>
> Why select/poll/epoll/friends do not work? I have programmed on both
> sides - user-space network servers and in-kernel network protocols -
> and "never blocking" thing was implemented in *nix in the times I was
> walking under table.
>
Then you've never had to write something that watches serial port
signals. Google on TIOCMIWAIT to see what I'm talking about. The only
option for a userspace programmer to deal with that is to fork() or poll
the signals every so many milliseconds. There are probably more easy
examples, but that's the one off the top of my head that affected me.
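For reference, the call in question really does sleep inside the ioctl,
which is why there is nothing to hand to poll()/select():

#include <sys/ioctl.h>

/* Blocks until one of the selected modem-status lines (here carrier
 * detect or ring indicator) changes state on the given tty. */
static int wait_for_line_change(int tty_fd)
{
        return ioctl(tty_fd, TIOCMIWAIT, TIOCM_CD | TIOCM_RNG);
}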
In short, this isn't just about network IO, this isn't just about file IO.
> One can poll() more or less *any* device in system. With frigging
> exception of - right - files.
The problem is the "more or less." Say you're right, and 95% of the
system calls are either already asynchronous or non-blocking/poll()able.
One of the questions on the table is how to extend it to the last 5%.
> User-space-wise, check how squid (caching http proxy) does it: you
> have several (forked) instances to serve network requests and you have
> one/several disk I/O daemons. (So called "diskd storeio") Why? Because
> you cannot poll() file descriptors, but you can poll unix socket
> connected to diskd. If diskd blocks, squid still can serve requests.
> How threadlets are better then pool of diskd instances? All nastiness
> of shared memory set loose...
Samba/lighttpd/git want to issue dozens of stats in parallel so that the
kernel can have an opportunity to sort them better. Are you saying they
should fork() a process per stat that they want to issue in parallel?
> What I'm trying to get to. Threadlets wouldn't help existing
> single-threaded applications - what is about 95% of all applications.
Eh, I don't think that's right. Part of the reason threadlets and
syslets are on the table is that they may be a more efficient way to do
AIO. And the differences between the syslet API and the current kernel
Async IO API can be abstracted away by glibc, so that today's apps that
do AIO would immediately benefit.
> What's more, as having some limited experience of kernel programming,
> I fail to see what threadlets would simplify on kernel side.
You can yank the entire separate AIO path, and just treat them as
another blocking API that syslets makes nonblocking. Immediate reduction
of code, and everybody is now using the same code paths, which means
higher test coverage and reduced maintenance cost.
This last point is really important. Even if no extra functionality
eventually makes it to userspace, this last point would still be enough
to make the powers that be consider inclusion.
Ray
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 21:12 ` Ray Lee
@ 2007-03-03 22:14 ` Ihar `Philips` Filipau
2007-03-03 23:54 ` Ray Lee
2007-03-04 9:33 ` Michael K. Edwards
1 sibling, 1 reply; 277+ messages in thread
From: Ihar `Philips` Filipau @ 2007-03-03 22:14 UTC (permalink / raw)
To: ray-gmail; +Cc: Evgeniy Polyakov, linux-kernel
On 3/3/07, Ray Lee <madrabbit@gmail.com> wrote:
> On 3/3/07, Ihar `Philips` Filipau <thephilips@gmail.com> wrote:
> > What I'm trying to get to: keep things simple. The proposed
> > optimization by Ingo does nothing else but allowing AIO to probe file
> > cache - if data there to go with fast path. So why not to implement
> > what the people want - probing of cache? Because it sounds bad? But
> > they are in fact proposing precisely that just masked with "fast
> > threads".
>
>
> Servers want to never, ever block. Not on a socket, not on a stat, not
> on anything. (I have an embedded server I wrote that has to fork
> internally just to watch the damn serial port signals in parallel with
> handling network I/O, audio, and child processes that handle H323.)
> There's a lot of things that can block out there, and it's not just
> disk I/O.
>
Why do select/poll/epoll and friends not work? I have programmed on both
sides - user-space network servers and in-kernel network protocols -
and the "never blocking" thing was implemented in *nix back when I was
still walking under the table.
One can poll() more or less *any* device in the system. With the frigging
exception of - right - files. IOW, for 75% of I/O the problem doesn't
exist, since there is a proper interface - e.g. sockets - in place.
User-space-wise, check how squid (the caching http proxy) does it: you
have several (forked) instances to serve network requests and you have
one or several disk I/O daemons. (The so-called "diskd" storeio.) Why?
Because you cannot poll() file descriptors, but you can poll a unix socket
connected to diskd. If diskd blocks, squid can still serve requests.
How are threadlets better than a pool of diskd instances? All the
nastiness of shared memory set loose...
What I'm trying to get at: threadlets wouldn't help existing
single-threaded applications - which are about 95% of all applications.
And multi-threaded applications would gain little, because few real
applications create threads dynamically: creation needs resources and
can fail, uncontrollable thread spawning hurts overall manageability,
and additional care is needed to guard against deadlocks and lock
contention. (The category of applications which want the performance
gain is also the category which needs to ensure greater stability
over long non-stop runs. Uncontrollable dynamism helps nothing.)
Having implemented several "file servers" - daemons serving file I/O
to other daemons - I honestly see hardly any improvement. Today people
configure such file servers to issue e.g. 10 file operations
simultaneously - using a pool of 10 threads. What do threadlets change?
In the end, just to keep threadlets in check I would need to issue
pthread_join() after some number of threadlets have been created. And
the latter number is the former "e.g. 10". IOW, programmer-wise the
implementation remains the same - and all the limitations remain the
same. And all the overhead of user-space locking remains the same. (*)
What's more, having some limited experience of kernel programming,
I fail to see what threadlets would simplify on the kernel side. The end
result as I see it: user space becomes a bit more complicated because of
dynamic multi-threading, and kernel space also becomes more complicated
because of the same added dynamism.
(*) Hm... On the other hand, if the application could tell the kernel
to limit the number of issued threadlets to N, that might simplify the
job. The application tells the kernel "I need at most 10 blocking
threadlets, block me if there are more" and then dumbly throws I/O
threadlets at the kernel as they come in. The kernel would then put the
process to sleep if N+1 threadlets are blocking. That would definitely
simplify the job in user space: it wouldn't need to call
pthread_join(). But it is still no replacement for a poll()able file
descriptor or a truly async mmap().
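In the meantime the same throttle can be approximated in user space with an
existing primitive: a counting semaphore bounding the number of outstanding
workers, so the submitter sleeps in sem_wait() instead of hand-rolling the
pthread_join() bookkeeping. do_io() is a placeholder for the blocking work.

#include <pthread.h>
#include <semaphore.h>

extern void do_io(void *request);       /* application-defined blocking work */

static sem_t slots;                     /* set up with sem_init(&slots, 0, 10) */

static void *io_worker(void *arg)
{
        do_io(arg);
        sem_post(&slots);               /* free the slot once the work ends */
        return NULL;
}

static void submit(void *request)
{
        pthread_t t;

        sem_wait(&slots);               /* sleeps if N workers are already in flight */
        pthread_create(&t, NULL, io_worker, request);
        pthread_detach(t);
}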
--
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
-- Albert Camus (attributed to)
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 21:57 ` Davide Libenzi
@ 2007-03-03 22:10 ` Davide Libenzi
2007-03-04 16:23 ` Kirk Kuchov
1 sibling, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-03 22:10 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Sat, 3 Mar 2007, Davide Libenzi wrote:
> Those *other* (tons?!?) interfaces can be created *when* the need comes
> (see Linus signalfd [1] example to show how urgent that was). *When*
> the need comes, they will work with existing POSIX interfaces, without
> requiring your own just-another event interface. Those other interfaces
> could also be more easily adopted by other Unix cousins, because of
> the fact that they rely on existing POSIX interfaces. One of the reason
> about the Unix file abstraction interfaces, is that you do *not* have to
> plan and bloat interfaces before. As long as your new abstraction behave
> in a file-fashion, it can be automatically used with existing interfaces.
> And you create them *when* the need comes.
Now, if you don't mind, my spare time is really limited and I prefer to
spend it looking at the stuff the topic of this thread talks about.
Especially because the whole epoll/kevent discussion depends heavily on
whether syslets/threadlets will or will not turn out to be a viable method
for generic AIO. Savvy?
- Davide
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 20:31 ` Evgeniy Polyakov
@ 2007-03-03 21:57 ` Davide Libenzi
2007-03-03 22:10 ` Davide Libenzi
2007-03-04 16:23 ` Kirk Kuchov
0 siblings, 2 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-03 21:57 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Sat, 3 Mar 2007, Evgeniy Polyakov wrote:
> > I was referring to dropping an event directly to a userspace buffer, from
> > the poll callback. If pages are not there, you might sleep, and you can't
> > since the wakeup function holds a spinlock on the waitqueue head while
> > looping through the waiters to issue the wakeup. Also, you don't know from
> > where the poll wakeup is called.
>
> Ugh, no, that is a very limited solution - memory must be either pinned
> (which leads to DoS and a limited ring buffer), or the callback must sleep.
> Actually, in any case there _must_ exist a queue - if the ring buffer is
> full an event is not allowed to be dropped - it must be stored in some
> other place, for example in a queue from which entries will be read
> (copied) into the ring buffer as it gains room (that is how it is
> implemented in kevent at least).
I was not advocating for that, if you read carefully. The fact that epoll
does not do that should be a clear hint. The old /dev/epoll IIRC was only
10% faster than the current epoll under a *heavy* event-frequency
micro-bench like pipetest (and that version of epoll did not have the
single-pass-over-the-ready-set optimization). And /dev/epoll was
delivering events *directly* into userspace-visible (mmaped) memory in a
zero-copy fashion.
> > BTW, Linus made a signalfd sketch code time ago, to deliver signals to an
> > fd. Code remained there and nobody cared. Question: Was it because
> > 1) it had file bindings or 2) because nobody really cared to deliver
> > signals to an event collector?
> > And *if* later requirements come, you don't need to change the API by
> > adding an XXEVENT_SIGNAL_ADD or XXEVENT_TIMER_ADD, or creating a new
> > XXEVENT-only submission structure. You create an API that automatically
> > makes that new abstraction work with POSIX poll/select, and you get epoll
> > support for free. Without even changing a bit in the epoll API.
>
> Well, we get epoll support for free, but we need to create tons of other
> interfaces and infrastructure for kernel users, and we need to change
> userspace anyway.
Those *other* (tons?!?) interfaces can be created *when* the need comes
(see Linus' signalfd [1] example to show how urgent that was). *When*
the need comes, they will work with existing POSIX interfaces, without
requiring yet another event interface of your own. Those other interfaces
could also be more easily adopted by other Unix cousins, because they
rely on existing POSIX interfaces. One of the points of the Unix file
abstraction is that you do *not* have to plan and bloat interfaces up
front. As long as your new abstraction behaves in a file-like fashion,
it can automatically be used with existing interfaces. And you create
them *when* the need comes.
[1] That was like 100 lines of code or so. See here:
http://tinyurl.com/3yuna5
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 10:58 ` Ihar `Philips` Filipau
2007-03-03 11:03 ` Evgeniy Polyakov
@ 2007-03-03 21:12 ` Ray Lee
2007-03-03 22:14 ` Ihar `Philips` Filipau
2007-03-04 9:33 ` Michael K. Edwards
1 sibling, 2 replies; 277+ messages in thread
From: Ray Lee @ 2007-03-03 21:12 UTC (permalink / raw)
To: Ihar `Philips` Filipau; +Cc: Evgeniy Polyakov, linux-kernel
On 3/3/07, Ihar `Philips` Filipau <thephilips@gmail.com> wrote:
> What I'm trying to get to: keep things simple. The proposed
> optimization by Ingo does nothing else but allow AIO to probe the file
> cache - if the data is there, go with the fast path. So why not
> implement what people want - probing of the cache? Because it sounds
> bad? But they are in fact proposing precisely that, just masked as
> "fast threads".
Please don't take this the wrong way, but I don't think you understand
the problem space that people are trying to address here.
Servers want to never, ever block. Not on a socket, not on a stat, not
on anything. (I have an embedded server I wrote that has to fork
internally just to watch the damn serial port signals in parallel with
handling network I/O, audio, and child processes that handle H323.)
There's a lot of things that can block out there, and it's not just
disk I/O.
Further, not only do servers not want to block, they also want to cram
a lot more requests into the kernel at once *for the kernel's
benefit*. In particular, a server wants to issue a ton of stats and
I/O in parallel so that the kernel can optimize which order to handle
the requests.
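A minimal sketch of that "ton of stats in parallel" pattern as an
application has to write it today, with one full thread per potentially
blocking stat(); the paths and helper here are illustrative only, and
threadlets aim to make the cached case as cheap as a plain call instead:

#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>

#define NREQ 4

static const char *paths[NREQ] = { "/etc/passwd", "/etc/hosts", "/tmp", "/" };
static struct stat results[NREQ];

static void *do_stat(void *arg)
{
	long i = (long)arg;

	/* one kernel thread per request: exactly the overhead that could
	 * be skipped whenever the inode data is already cached */
	if (stat(paths[i], &results[i]) == 0)
		printf("%s: %lld bytes\n", paths[i],
		       (long long)results[i].st_size);
	return NULL;
}

int main(void)
{
	pthread_t tid[NREQ];
	long i;

	for (i = 0; i < NREQ; i++)
		pthread_create(&tid[i], NULL, do_stat, (void *)i);
	for (i = 0; i < NREQ; i++)
		pthread_join(tid[i], NULL);
	return 0;
}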
Finally, the biggest argument in favor of syslets/threadlets AFAICS is
that -- if done correctly -- it would unify the AIO and normal IO
paths in the kernel. The improved ease of long term maintenance on the
kernel (and more test coverage, and more directed optimization,
etc...) just for this point alone makes them worth considering for
inclusion.
So, while everybody has been talking about cached and non-cached
cases, those are really special cases of the entire package that the
rest of us want.
Ray
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 18:46 ` Davide Libenzi
@ 2007-03-03 20:31 ` Evgeniy Polyakov
2007-03-03 21:57 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-03 20:31 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Sat, Mar 03, 2007 at 10:46:59AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Sat, 3 Mar 2007, Evgeniy Polyakov wrote:
>
> > > You've to excuse me if my memory is bad, but IIRC the whole discussion
> > > and loong benchmark feast born with you throwing a benchmark at Ingo
> > > (with kevent showing a 1.9x performance boost WRT epoll), not with you
> > > making any other point.
> >
> > So, how does it sound?
> > "Threadlets are bad for IO because kevent is 2 times faster than epoll?"
> >
> I said threadlets are bad for IO (and we agreed that both approaches
> should be used for maximum performance) because of rescheduling overhead -
> tasks are quite heavy structures to move around - even a pt_regs copy
> takes more than an event structure; not because there is something in one
> galaxy which might work faster than another something in another galaxy.
> That was stupid even to think about.
>
> Evgeniy, other folks on this thread read what you said, so let's not drag
> this out.
Sure, I was wrong to start this again, but try to see my position - I'm
really tired of trying to prove that I'm not a camel just because we
had some misunderstanding at the start.
I do think that threadlets are a really cool solution and indeed a very
good approach for the majority of parallel processing, but my point is
still that they are not a perfect solution for all tasks.
Just to draw a line: the kevent example is an extrapolation of what can be
achieved with an event-driven model, but that does not mean it must be the
_only_ model used for AIO - threadlets _and_ an event-driven model (yes, I
accepted Ingo's point about its decline) together are the best solution.
> > > And if you really feel raw about the single O(nready) loop that epoll
> > > currently does, a new epoll_wait2 (or whatever) API could be used to
> > > deliver the event directly into a userspace buffer [1], directly from the
> > > poll callback, w/out extra delivery loops (IRQ/event->epoll_callback->event_buffer).
> >
> > > [1] From the epoll callback, we cannot sleep, so it's gonna be either an
> > > mlocked userspace buffer, or some kernel pages mapped to userspace.
> >
> Callbacks never sleep - they add the event to a list just like the current
> implementation (maybe some lock must be changed from a mutex to a spinlock,
> I do not remember); the main problem is binding to the file structure,
> which is heavy.
>
> I was referring to dropping an event directly to a userspace buffer, from
> the poll callback. If pages are not there, you might sleep, and you can't
> since the wakeup function holds a spinlock on the waitqueue head while
> looping through the waiters to issue the wakeup. Also, you don't know from
> where the poll wakeup is called.
Ugh, no, that is a very limited solution - memory must be either pinned
(which leads to DoS issues and a limited ring buffer), or the callback
must sleep. In any case there _must_ be a queue - if the ring buffer is
full, an event is not allowed to be dropped; it must be stored somewhere
else, for example in a queue from which entries are read (copied) into
the ring buffer as slots become available (that is how it is implemented
in kevent at least).
> File binding heavy? The first, and by *far* biggest, source of events
> inside an event collector, for someone who cares about scalability, is
> sockets. And those are already files. Second would be AIO, and those (if
> the performance figures agree) can be hosted inside syslets/threadlets.
> Then you fall into the no-care category, where the extra 100 bytes do not
> make a case against the ability of using it with an existing POSIX
> infrastructure (poll/select).
Well, sockets are indeed files, and sockets are already perfectly
handled by epoll - but there are other users of the potential interface -
and it must be designed to scale very well in _any_ situation.
Even if right now we do not have problems with some types of events, we
must scale with any new one.
> BTW, Linus made a signalfd sketch code time ago, to deliver signals to an
> fd. Code remained there and nobody cared. Question: Was it because
> 1) it had file bindings or 2) because nobody really cared to deliver
> signals to an event collector?
> And *if* later requirements come, you don't need to change the API by
> adding an XXEVENT_SIGNAL_ADD or XXEVENT_TIMER_ADD, or creating a new
> XXEVENT-only submission structure. You create an API that automatically
> makes that new abstraction work with POSIX poll/select, and you get epoll
> support for free. Without even changing a bit in the epoll API.
Well, we get epoll support for free, but we need to create tons of other
interfaces and infrastructure for kernel users, and we need to change
userspace anyway.
But epoll support requires quite heavy bindings to the file
structure, so why don't we design the new interface (since we need
to change userspace anyway) so that it can scale and be very
memory-optimized from the beginning?
> - Davide
>
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 10:06 ` Evgeniy Polyakov
@ 2007-03-03 18:46 ` Davide Libenzi
2007-03-03 20:31 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-03 18:46 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Sat, 3 Mar 2007, Evgeniy Polyakov wrote:
> > You've to excuse me if my memory is bad, but IIRC the whole discussion
> > and loong benchmark feast born with you throwing a benchmark at Ingo
> > (with kevent showing a 1.9x performance boost WRT epoll), not with you
> > making any other point.
>
> So, how does it sound?
> "Threadlets are bad for IO because kevent is 2 times faster than epoll?"
>
> I said threadlets are bad for IO (and we agreed that both approaches
> should be used for maximum performance) because of rescheduling overhead -
> tasks are quite heavy structures to move around - even a pt_regs copy
> takes more than an event structure; not because there is something in one
> galaxy which might work faster than another something in another galaxy.
> That was stupid even to think about.
Evgeniy, other folks on this thread read what you said, so let's not drag
this out.
> > And if you really feel raw about the single O(nready) loop that epoll
> > currently does, a new epoll_wait2 (or whatever) API could be used to
> > deliver the event directly into a userspace buffer [1], directly from the
> > poll callback, w/out extra delivery loops (IRQ/event->epoll_callback->event_buffer).
>
> > [1] From the epoll callback, we cannot sleep, so it's gonna be either an
> > mlocked userspace buffer, or some kernel pages mapped to userspace.
>
> Callbacks never sleep - they add the event to a list just like the current
> implementation (maybe some lock must be changed from a mutex to a spinlock,
> I do not remember); the main problem is binding to the file structure,
> which is heavy.
I was referring to dropping an event directly to a userspace buffer, from
the poll callback. If pages are not there, you might sleep, and you can't
since the wakeup function holds a spinlock on the waitqueue head while
looping through the waiters to issue the wakeup. Also, you don't know from
where the poll wakeup is called.
File binding heavy? The first, and by *far* biggest, source of events
inside an event collector, for someone who cares about scalability, is
sockets. And those are already files. Second would be AIO, and those (if
the performance figures agree) can be hosted inside syslets/threadlets.
Then you fall into the no-care category, where the extra 100 bytes do not
make a case against the ability of using it with an existing POSIX
infrastructure (poll/select).
BTW, Linus made a signalfd sketch code time ago, to deliver signals to an
fd. Code remained there and nobody cared. Question: Was it because
1) it had file bindings or 2) because nobody really cared to deliver
signals to an event collector?
And *if* later requirements come, you don't need to change the API by
adding an XXEVENT_SIGNAL_ADD or XXEVENT_TIMER_ADD, or creating a new
XXEVENT-only submission structure. You create an API that automatically
makes that new abstraction work with POSIX poll/select, and you get epoll
support for free. Without even changing a bit in the epoll API.
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 11:03 ` Evgeniy Polyakov
@ 2007-03-03 11:51 ` Ihar `Philips` Filipau
0 siblings, 0 replies; 277+ messages in thread
From: Ihar `Philips` Filipau @ 2007-03-03 11:51 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: linux-kernel
On 3/3/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > >Threadlets can work with any function as a base - if it would be
> > >recv-like it will limit possible case for parallel programming, so you
> > >can code anything in threadlets - it is not only about IO.
> >
> > What I'm trying to get to: keep things simple. The proposed
> > optimization by Ingo does nothing else but allow AIO to probe the file
> > cache - if the data is there, go with the fast path. So why not
> > implement what people want - probing of the cache? Because it sounds
> > bad? But they are in fact proposing precisely that, just masked as
> > "fast threads".
>
> There can be other parts than just plain recv/read syscalls - you can
> create a logical processing entity, and if it blocks (as a whole, no
> matter where), the whole processing continues as a new thread.
> And having a separate syscall to warm the cache can end up with the cache
> being flushed between the warming and the processing itself.
>
I'm not talking about cache warm-up. And if we were - and that's the whole
freaking point of AIO - Linux IIRC pins freshly loaded clean pages
anyway. So there would be a problem, but only under memory pressure. If
you are under memory pressure you have already lost the game and do not
care about performance or what threads you are using.
It's the whole "threadlets to threads on blocking" thing that doesn't
sound convincing. It sounds more like "premature optimization". But
anyway, it's not that I'm an AIO specialist. For networking it is totally
unnecessary, since most applications which care already have rate
control and buffer management built in. Network connections/sockets
allow a greater level of application control over what they do and how,
compared to blockdev's plain dumb read()/write() going through the global
cache. And it's not that (judging from the interface) AIO changes that
much - it is still a dumb read(), which IMHO makes no sense whatsoever
for mmap()-oriented Linux.
--
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
-- Albert Camus (attributed to)
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 10:58 ` Ihar `Philips` Filipau
@ 2007-03-03 11:03 ` Evgeniy Polyakov
2007-03-03 11:51 ` Ihar `Philips` Filipau
2007-03-03 21:12 ` Ray Lee
1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-03 11:03 UTC (permalink / raw)
To: Ihar `Philips` Filipau; +Cc: linux-kernel
On Sat, Mar 03, 2007 at 11:58:17AM +0100, Ihar `Philips` Filipau (thephilips@gmail.com) wrote:
> On 3/3/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> >On Fri, Mar 02, 2007 at 08:20:26PM +0100, Ihar `Philips` Filipau
> >(thephilips@gmail.com) wrote:
> >> I'm not well versed in modern kernel development discussions, and
> >> since you have put the thing into the networked context anyway, can
> >> you please ask on lkml why (if they want threadlets solely for AIO)
> >> not to implement analogue of recv(*filedes*, b, l, MSG_DONTWAIT).
> >> Developers already know the interface, socket infrastructure is already
> >> in kernel, etc. And it might do precisely what they want: access file
> >> in disk cache - just like in case of socket it does access recv buffer
> >> of socket. Why bother with implicit threads/waiting/etc - if all they
> >> want some way to probe cache?
> >
> >Threadlets can work with any function as a base - if it would be
> >recv-like it will limit possible case for parallel programming, so you
> >can code anything in threadlets - it is not only about IO.
> >
>
> Ingo defined them as "plain function calls as long as they do not block".
>
> But when/what function could block?
>
> (1) File descriptors. Read. If data are in cache it wouldn't block.
> Otherwise it would. Write. If there is space in cache it wouldn't
> block. Otherwise it would.
>
> (2) Network sockets. Recv. If data are in buffer they wouldn't block.
> Otherwise they would. Send. If there is space in send buffer it
> wouldn't block. Otherwise it would.
>
> (3) Pipes, fifos & unix sockets. Unfortunately gain nothing since the
> reliable local communication used mostly for control information
> passing. If you have to block on such socket it most likely important
> information anyway. (e.g. X server communication or sql query to SQL
> server). (Or even less important here case of shell pipes.) And most
> users here are all single threaded and I/O bound: they would gain
> nothing from multi-threading - only PITA of added locking.
>
> What I'm trying to get to: keep things simple. The proposed
> optimization by Ingo does nothing else but allow AIO to probe the file
> cache - if the data is there, go with the fast path. So why not
> implement what people want - probing of the cache? Because it sounds
> bad? But they are in fact proposing precisely that, just masked as
> "fast threads".
There can be other parts than just plain recv/read syscalls - you can
create a logical processing entity, and if it blocks (as a whole, no
matter where), the whole processing continues as a new thread.
And having a separate syscall to warm the cache can end up with the cache
being flushed between the warming and the processing itself.
> --
> Don't walk behind me, I may not lead.
> Don't walk in front of me, I may not follow.
> Just walk beside me and be my friend.
> -- Albert Camus (attributed to)
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
[not found] ` <20070303103427.GC22557@2ka.mipt.ru>
@ 2007-03-03 10:58 ` Ihar `Philips` Filipau
2007-03-03 11:03 ` Evgeniy Polyakov
2007-03-03 21:12 ` Ray Lee
0 siblings, 2 replies; 277+ messages in thread
From: Ihar `Philips` Filipau @ 2007-03-03 10:58 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: linux-kernel
On 3/3/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> On Fri, Mar 02, 2007 at 08:20:26PM +0100, Ihar `Philips` Filipau (thephilips@gmail.com) wrote:
> > I'm not well versed in modern kernel development discussions, and
> > since you have put the thing into the networked context anyway, can
> > you please ask on lkml why (if they want threadlets solely for AIO)
> > not to implement analogue of recv(*filedes*, b, l, MSG_DONTWAIT).
> > Developers already know the interface, socket infrastructure is already
> > in kernel, etc. And it might do precisely what they want: access file
> > in disk cache - just like in case of socket it does access recv buffer
> > of socket. Why bother with implicit threads/waiting/etc - if all they
> > want some way to probe cache?
>
> Threadlets can work with any function as a base - if it would be
> recv-like it will limit possible case for parallel programming, so you
> can code anything in threadlets - it is not only about IO.
>
Ingo defined them as "plain function calls as long as they do not block".
But when/what function could block?
(1) File descriptors. Read. If data are in cache it wouldn't block.
Otherwise it would. Write. If there is space in cache it wouldn't
block. Otherwise it would.
(2) Network sockets. Recv. If data are in buffer they wouldn't block.
Otherwise they would. Send. If there is space in send buffer it
wouldn't block. Otherwise it would.
(3) Pipes, fifos & unix sockets. These unfortunately gain nothing, since
reliable local communication is used mostly for passing control
information. If you have to block on such a socket it is most likely
important information anyway (e.g. X server communication or an SQL query
to an SQL server - or, even less important here, the case of shell pipes).
And most users here are single-threaded and I/O-bound: they would gain
nothing from multi-threading - only the PITA of added locking.
What I'm trying to get to: keep things simple. The proposed
optimization by Ingo does nothing else but allow AIO to probe the file
cache - if the data is there, go with the fast path. So why not
implement what people want - probing of the cache? Because it sounds
bad? But they are in fact proposing precisely that, just masked as
"fast threads".
--
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
-- Albert Camus (attributed to)
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 17:28 ` Davide Libenzi
@ 2007-03-03 10:27 ` Evgeniy Polyakov
0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-03 10:27 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, Linux Kernel Mailing List, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Fri, Mar 02, 2007 at 09:28:10AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Fri, 2 Mar 2007, Evgeniy Polyakov wrote:
>
> > do we really want to have per process signalfs, timerfs and so on - each
> > simple structure must be bound to a file, which becomes too costly.
>
> I may be old school, but if you ask me, and if you *really* want those
> events, yes. Reason? Unix's everything-is-a-file rule, and being able to
> use them with *existing* POSIX poll/select. Remember, not every app
> requires huge scalability efforts, so working with simpler and familiar
> APIs is always welcome.
> The *only* thing that was not practical to have as fd, was block requests.
> But maybe threadlets/syslets will handle those just fine, and close the gap.
That means that we bind a very small object like a timer or signal to the
whole file structure - yes, as I stated, it is doable, but do we really
have to create a file each time create_timer() or signal() is called?
Signals as a filesystem are limited in the sense that we need to
create additional structures to maintain the signal number <-> private
data relations.
I designed kevent to be as small as possible, so I dropped the file binding
idea first. I do not say it is wrong or that epoll (and threadlets) are
broken (fsck, I hope people do understand that), but as is it cannot handle
that scenario, so it must be extended and/or a lot of other stuff
written to be compatible with the epoll design. Kevent has a different
design (which allows working with the old one though - there is a patch to
implement epoll over kevent).
> - Davide
>
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 17:13 ` Davide Libenzi
2007-03-02 19:13 ` Davide Libenzi
@ 2007-03-03 10:06 ` Evgeniy Polyakov
2007-03-03 18:46 ` Davide Libenzi
1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-03 10:06 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Fri, Mar 02, 2007 at 09:13:40AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Fri, 2 Mar 2007, Evgeniy Polyakov wrote:
>
> > On Thu, Mar 01, 2007 at 11:31:14AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > > On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:
> > >
> > > > Ingo, do you really think I will send mails with faked benchmarks? :))
> > >
> > > I don't think he ever implied that. He was only suggesting that when you
> > > post benchmarks, and even more when you make claims based on benchmarks,
> > > you need to be extra careful about what you measure. Otherwise the
> > > external view that you give to others does not look good.
> > > Kevent can be really faster than epoll, but if you post broken benchmarks
> > > (that can be, unreliable HTTP loaders, broken server implementations,
> > > etc..) and make claims based on that, the only effect that you have is to
> > > lose your point.
> >
> > So, I only said that kevent is superior to epoll because (and this
> > is the _main_ issue) of its ability to handle essentially any kind of
> > event with very small overhead (the same as epoll has in struct file -
> > a list and a spinlock) and without the significant price of binding a
> > struct file to each event.
>
> You've to excuse me if my memory is bad, but IIRC the whole discussion
> and loong benchmark feast born with you throwing a benchmark at Ingo
> (with kevent showing a 1.9x performance boost WRT epoll), not with you
> making any other point.
So, how does it sound?
"Threadlets are bad for IO because kevent is 2 times faster than epoll?"
I said threadlets are bad for IO (and we agreed that both approaches
should be used for maximum performance) because of rescheduling overhead -
tasks are quite heavy structures to move around - even a pt_regs copy
takes more than an event structure; not because there is something in one
galaxy which might work faster than another something in another galaxy.
That was stupid even to think about.
> As far as epoll not being able to handle other events - says who? Of
> course, with zero modifications, you can handle zero additional events.
> With modifications, you can handle other events. But let's talk about those
> other events. The *only* kind of event that people (and being the epoll
> maintainer I tend to receive those requests) missed in epoll was AIO
> events. That's the *only* thing that was missed by real-life application
> developers. And if something like threadlets/syslets proves effective,
> the gap is closed WRT that requirement.
> Epoll already handles the whole class of pollable devices inside the
> kernel, and if you exclude block AIO, that's a pretty wide class already.
> The *existing* f_op->poll subsystem can be used to deliver events at the
> poll-head wakeup time (by using the "key" member of the poll callback), so
> that you don't even need the extra f_op->poll call to fetch events.
> And if you really feel raw about the single O(nready) loop that epoll
> currently does, a new epoll_wait2 (or whatever) API could be used to
> deliver the event directly into a userspace buffer [1], directly from the
> poll callback, w/out extra delivery loops (IRQ/event->epoll_callback->event_buffer).
Signals, futexes, timers and userspace events are what I was asked to add
to kevent; so far only futexes are missing, because I was asked to freeze
development so other hackers could review the project.
>
> [1] From the epoll callback, we cannot sleep, so it's gonna be either an
> mlocked userspace buffer, or some kernel pages mapped to userspace.
Callbacks never sleep - they add the event to a list just like the current
implementation (maybe some lock must be changed from a mutex to a spinlock,
I do not remember); the main problem is binding to the file structure,
which is heavy.
> - Davide
>
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 7:19 ` Ingo Molnar
2007-03-03 7:20 ` Ingo Molnar
@ 2007-03-03 9:02 ` Davide Libenzi
1 sibling, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-03 9:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nicholas Miell, Linux Kernel Mailing List, Arjan van de Ven,
Linus Torvalds
On Sat, 3 Mar 2007, Ingo Molnar wrote:
> * Davide Libenzi <davidel@xmailserver.org> wrote:
>
> > [...] Status word and control bits should not be changed from
> > underneath userspace AFAIK. [...]
>
> Note that the control bits do not just magically change during normal
> FPU use. It's a bit like sys_setsid()/iopl/etc., it makes little sense
> to change those per-thread anyway. This is a non-issue anyway - what is
> important is that the big bulk of 512 (or more) bytes of FPU state /are/
> callee-saved (both on 32-bit and on 64-bit), hence there's no need to
> unlazy anything or to do expensive FPU state saves or other FPU juggling
> around threadlet (or even syslet) use.
Well, the unlazy/sync happens in any case later, when we switch (given
TS_USEDFPU is set). We'd avoid a copy of it given the above conditions hold.
Wouldn't it make sense to carry over only the status word and the control
bits eventually?
Also, if the caller saves the whole context, and if we're scheduled while
inside a system call (not a totally infrequent case), can't we implement a
smarter unlazy_fpu() that avoids the fxsave during schedule-out and the
frstor after schedule-in (and does not do stts in this condition, so the
newly scheduled task doesn't get a fault at all)? If the above conditions
are true (no context copy needed for the new head in async_exec), this
should be possible too.
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 7:19 ` Ingo Molnar
@ 2007-03-03 7:20 ` Ingo Molnar
2007-03-03 9:02 ` Davide Libenzi
1 sibling, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-03 7:20 UTC (permalink / raw)
To: Davide Libenzi
Cc: Nicholas Miell, Linux Kernel Mailing List, Arjan van de Ven,
Linus Torvalds
* Ingo Molnar <mingo@elte.hu> wrote:
> Note that the control bits do not just magically change during normal
> FPU use. It's a bit like sys_setsid()/iopl/etc., it makes little sense
> to change those per-thread anyway. This is a non-issue anyway - what is
> important is that the big bulk of 512 (or more) bytes of FPU state /are/
> callee-saved (both on 32-bit and on 64-bit), hence there's no need to
^---- caller-saved
> unlazy anything or to do expensive FPU state saves or other FPU juggling
> around threadlet (or even syslet) use.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 2:19 ` Davide Libenzi
@ 2007-03-03 7:19 ` Ingo Molnar
2007-03-03 7:20 ` Ingo Molnar
2007-03-03 9:02 ` Davide Libenzi
0 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-03 7:19 UTC (permalink / raw)
To: Davide Libenzi
Cc: Nicholas Miell, Linux Kernel Mailing List, Arjan van de Ven,
Linus Torvalds
* Davide Libenzi <davidel@xmailserver.org> wrote:
> [...] Status word and control bits should not be changed from
> underneath userspace AFAIK. [...]
Note that the control bits do not just magically change during normal
FPU use. It's a bit like sys_setsid()/iopl/etc., it makes little sense
to change those per-thread anyway. This is a non-issue anyway - what is
important is that the big bulk of 512 (or more) bytes of FPU state /are/
callee-saved (both on 32-bit and on 64-bit), hence there's no need to
unlazy anything or to do expensive FPU state saves or other FPU juggling
around threadlet (or even syslet) use.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 1:36 ` Nicholas Miell
2007-03-03 1:48 ` Benjamin LaHaise
@ 2007-03-03 2:19 ` Davide Libenzi
2007-03-03 7:19 ` Ingo Molnar
1 sibling, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-03 2:19 UTC (permalink / raw)
To: Nicholas Miell
Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds
On Fri, 2 Mar 2007, Nicholas Miell wrote:
> On Fri, 2007-03-02 at 16:52 -0800, Davide Libenzi wrote:
> > On Fri, 2 Mar 2007, Nicholas Miell wrote:
> >
> > > The point Ingo was making is that the x86 ABI already requires the FPU
> > > context to be saved before *all* function calls.
> >
> > I've not seen that among Ingo's points, but yeah some status is caller
> > saved. But, aren't things like status word and control bits callee saved?
> > If that's the case, it might require proper handling.
> >
>
> Ingo mentioned it in one of the parts you cut out of your reply:
>
> > and here is where thinking about threadlets as a function call and not
> > as an asynchronous context helps a lot: the classic gcc convention for
> > FPU use & function calls should apply: gcc does not call an external
> > function with an in-use FPU stack/register, it always neatly unuses it,
> > as no FPU register is callee-saved, all are caller-saved.
>
> The i386 psABI is ancient (i.e. it predates SSE, so no mention of the
> XMM or MXCSR registers) and a bit vague (no mention at all of the FP
> status word), but I'm fairly certain that Ingo is right.
I'm not sure if that's the case. I'd be happy if it was, but I'm afraid
it's not. Status word and control bits should not be changed from
underneath userspace AFAIK. The ABI I remember tells me that those are
callee saved. A quick gcc asm test tells me that too.
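A quick sketch of the kind of test alluded to here, assuming nothing
beyond fenv.h and libm (compile with -lm): set a control bit in the
caller, call an FPU-using function, and check that the bit still holds,
as a callee-saved control word requires.

#include <fenv.h>
#include <math.h>
#include <stdio.h>

/* unrelated FP work; a conforming callee must leave the rounding-mode
 * control bits exactly as it found them */
static double work(double x)
{
	return sqrt(x) * 3.0;
}

int main(void)
{
	volatile double half = 0.5;	/* volatile keeps rint() at run time */

	fesetround(FE_UPWARD);		/* caller flips a control bit */
	work(2.0);			/* the call must not clobber it */

	/* with upward rounding still in effect, rint(0.5) yields 1.0 */
	printf("rint(0.5) = %g (%s)\n", rint(half),
	       rint(half) == 1.0 ? "control word preserved" : "clobbered");

	fesetround(FE_TONEAREST);
	return 0;
}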
And assuming that's the case, why don't we have a smarter unlazy_fpu()
then, that avoid FPU context sync if we're scheduled while inside a
syscall (this is no different than an enter inside sys_async_exec -
userspace should have taken care of it)?
IMO a syscall enter should not assume that userspace took care of saving
the whole FPU context.
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 1:36 ` Nicholas Miell
@ 2007-03-03 1:48 ` Benjamin LaHaise
2007-03-03 2:19 ` Davide Libenzi
1 sibling, 0 replies; 277+ messages in thread
From: Benjamin LaHaise @ 2007-03-03 1:48 UTC (permalink / raw)
To: Nicholas Miell
Cc: Davide Libenzi, Ingo Molnar, Linux Kernel Mailing List,
Arjan van de Ven, Linus Torvalds
On Fri, Mar 02, 2007 at 05:36:01PM -0800, Nicholas Miell wrote:
> > as an asynchronous context helps a lot: the classic gcc convention for
> > FPU use & function calls should apply: gcc does not call an external
> > function with an in-use FPU stack/register, it always neatly unuses it,
> > as no FPU register is callee-saved, all are caller-saved.
>
> The i386 psABI is ancient (i.e. it predates SSE, so no mention of the
> XMM or MXCSR registers) and a bit vague (no mention at all of the FP
> status word), but I'm fairly certain that Ingo is right.
The FPU status word *must* be saved, as the rounding behaviour and error mode
bits are assumed to be preserved. Iow, yes, there is state which is required.
-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <zyntrop@kvack.org>.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-03 0:52 ` Davide Libenzi
@ 2007-03-03 1:36 ` Nicholas Miell
2007-03-03 1:48 ` Benjamin LaHaise
2007-03-03 2:19 ` Davide Libenzi
0 siblings, 2 replies; 277+ messages in thread
From: Nicholas Miell @ 2007-03-03 1:36 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds
On Fri, 2007-03-02 at 16:52 -0800, Davide Libenzi wrote:
> On Fri, 2 Mar 2007, Nicholas Miell wrote:
>
> > The point Ingo was making is that the x86 ABI already requires the FPU
> > context to be saved before *all* function calls.
>
> I've not seen that among Ingo's points, but yeah some status is caller
> saved. But, aren't things like status word and control bits callee saved?
> If that's the case, it might require proper handling.
>
Ingo mentioned it in one of the parts you cut out of your reply:
> and here is where thinking about threadlets as a function call and not
> as an asynchronous context helps a lot: the classic gcc convention for
> FPU use & function calls should apply: gcc does not call an external
> function with an in-use FPU stack/register, it always neatly unuses it,
> as no FPU register is callee-saved, all are caller-saved.
The i386 psABI is ancient (i.e. it predates SSE, so no mention of the
XMM or MXCSR registers) and a bit vague (no mention at all of the FP
status word), but I'm fairly certain that Ingo is right.
--
Nicholas Miell <nmiell@comcast.net>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 21:43 ` Nicholas Miell
@ 2007-03-03 0:52 ` Davide Libenzi
2007-03-03 1:36 ` Nicholas Miell
0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-03 0:52 UTC (permalink / raw)
To: Nicholas Miell
Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds
On Fri, 2 Mar 2007, Nicholas Miell wrote:
> The point Ingo was making is that the x86 ABI already requires the FPU
> context to be saved before *all* function calls.
I've not seen that among Ingo's points, but yeah some status is caller
saved. But, aren't things like status word and control bits callee saved?
If that's the case, it might require proper handling.
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 20:53 ` Davide Libenzi
2007-03-02 21:21 ` Michael K. Edwards
@ 2007-03-02 21:43 ` Nicholas Miell
2007-03-03 0:52 ` Davide Libenzi
1 sibling, 1 reply; 277+ messages in thread
From: Nicholas Miell @ 2007-03-02 21:43 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds
On Fri, 2007-03-02 at 12:53 -0800, Davide Libenzi wrote:
> On Fri, 2 Mar 2007, Ingo Molnar wrote:
>
> >
> > * Davide Libenzi <davidel@xmailserver.org> wrote:
> >
> > > I think that the "dirty" FPU context must, at least, follow the new
> > > head. That's what the userspace sees, and you don't want an async_exec
> > > to re-emerge with a different FPU context.
> >
> > well. I think there's some confusion about terminology, so please let me
> > describe everything in detail. This is how execution goes:
> >
> > outer loop() {
> > call_threadlet();
> > }
> >
> > this all runs in the 'head' context. call_threadlet() always switches to
> > the 'threadlet stack'. The 'outer context' runs in the 'head stack'. If,
> > while executing the threadlet function, we block, then the
> > threadlet-thread gets to keep the task (the threadlet stack and also the
> > FPU), and blocks - and we pick a 'new head' from the thread pool and
> > continue executing in that context - right after the call_threadlet()
> > function, in the 'old' head's stack. I.e. it's as if we returned
> > immediately from call_threadlet(), with a return code that signals that
> > the 'threadlet went async'.
> >
> > now, the FPU state that was when the threadlet blocked is totally
> > meaningless to the 'new head' - that FPU state is from the middle of the
> > threadlet execution.
>
> For threadlets, it might be. Now think about a task wanting to dispatch N
> parallel AIO requests as N independent syslets.
> Think about this task having USEDFPU set, so the FPU context is dirty.
> When it returns from async_exec, with one of the requests having gone to
> sleep, it needs to have the same FPU context it had when it entered,
> otherwise it probably won't be happy.
> For the same reason a schedule() must preserve/sync the "prev" FPU
> context, to be reloaded at the next FPU fault.
The point Ingo was making is that the x86 ABI already requires the FPU
context to be saved before *all* function calls.
Unfortunately, this isn't true of other ABIs -- looking over the psABIs
specs I have laying around, AMD64, PPC64, and MIPS require at least part
of the FPU state to be preserved across function calls, and I'm sure
this is also true of others.
Then there's the other nasty details of new thread creation --
thankfully, the contents of the TLS isn't inherited from the parent
thread, but it still needs to be initialized; not to mention all the
other details involved in pthread creation and destruction.
I don't see any way around the pthread issues other than making a libc
upcall on return from the first system call that blocked.
--
Nicholas Miell <nmiell@comcast.net>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 20:53 ` Davide Libenzi
@ 2007-03-02 21:21 ` Michael K. Edwards
2007-03-02 21:43 ` Nicholas Miell
1 sibling, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-03-02 21:21 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds
On 3/2/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> For threadlets, it might be. Now think about a task wanting to dispatch N
> parallel AIO requests as N independent syslets.
> Think about this task having USEDFPU set, so the FPU context is dirty.
> When it returns from async_exec, with one of the requests having gone to
> sleep, it needs to have the same FPU context it had when it entered,
> otherwise it probably won't be happy.
> For the same reason a schedule() must preserve/sync the "prev" FPU
> context, to be reloaded at the next FPU fault.
And if you actually think this through, I think you will arrive at (a
subset of) the conclusions I did a week ago: to keep the threadlets
lightweight enough to schedule and migrate cheaply, they can't be
allowed to "own" their own FPU and TLS context. They have to be
allowed to _use_ the FPU (or they're useless) and to _use_ TLS (or
they can't use any glibc wrapper around a syscall, since they
practically all set the thread-local errno). But they have to
"quiesce" the FPU and stash any thread-local state they want to keep
on their stack before entering the next syscall, or else it'll get
clobbered.
Keep thinking, especially about FPU flags, and you'll see why
threadlets spawned from the _same_ threadlet entrypoint should all run
in the same pool of threads, one per CPU, while threadlets from
_different_ entrypoints should never run in the same thread (FPU/TLS
context). You'll see why threadlets in the same pool shouldn't be
permitted to preempt one another except at syscalls that block, and
the cost of preempting the real thread associated with one threadlet
pool with another real thread associated with a different threadlet
pool is the same as any other thread switch. At which point,
threadlet pools are themselves first-class objects (to use the snake
oil phrase), and might as well be enhanced to a data structure that
has efficient operations for reprioritization, bulk cancellation, and
all that jazz.
Did I mention that there is actually quite a bit of prior art in this
area, which makes a much better guide to the design of round wheels
than micro-benchmarks do?
Cheers,
- Michael
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 20:29 ` Ingo Molnar
@ 2007-03-02 20:53 ` Davide Libenzi
2007-03-02 21:21 ` Michael K. Edwards
2007-03-02 21:43 ` Nicholas Miell
0 siblings, 2 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-02 20:53 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds
On Fri, 2 Mar 2007, Ingo Molnar wrote:
>
> * Davide Libenzi <davidel@xmailserver.org> wrote:
>
> > I think that the "dirty" FPU context must, at least, follow the new
> > head. That's what the userspace sees, and you don't want an async_exec
> > to re-emerge with a different FPU context.
>
> well. I think there's some confusion about terminology, so please let me
> describe everything in detail. This is how execution goes:
>
> outer loop() {
> call_threadlet();
> }
>
> this all runs in the 'head' context. call_threadlet() always switches to
> the 'threadlet stack'. The 'outer context' runs in the 'head stack'. If,
> while executing the threadlet function, we block, then the
> threadlet-thread gets to keep the task (the threadlet stack and also the
> FPU), and blocks - and we pick a 'new head' from the thread pool and
> continue executing in that context - right after the call_threadlet()
> function, in the 'old' head's stack. I.e. it's as if we returned
> immediately from call_threadlet(), with a return code that signals that
> the 'threadlet went async'.
>
> now, the FPU state that was when the threadlet blocked is totally
> meaningless to the 'new head' - that FPU state is from the middle of the
> threadlet execution.
For threadlets, it might be. Now think about a task wanting to dispatch N
parallel AIO requests as N independent syslets.
Think about this task having USEDFPU set, so the FPU context is dirty.
When it returns from async_exec, with one of the requests having gone to
sleep, it needs to have the same FPU context it had when it entered,
otherwise it probably won't be happy.
For the same reason a schedule() must preserve/sync the "prev" FPU
context, to be reloaded at the next FPU fault.
> > So, IMO, if the USEDFPU bit is set, we need to sync the dirty FPU
> > context with an early unlazy_fpu(), *and* copy the sync'd FPU context
> > to the new head. This should really be a fork of the dirty FPU context
> > IMO, and should only happen if the USEDFPU bit is set.
>
> why? The only effect this will have is a slowdown :) The FPU context
> from the middle of the threadlet function is totally meaningless to the
> 'new head'. It might be anything. (although in practice system calls are
> almost never called with a truly in-use FPU.)
See above ;)
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 20:18 ` Davide Libenzi
@ 2007-03-02 20:29 ` Ingo Molnar
2007-03-02 20:53 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-02 20:29 UTC (permalink / raw)
To: Davide Libenzi
Cc: Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds
* Davide Libenzi <davidel@xmailserver.org> wrote:
> I think that the "dirty" FPU context must, at least, follow the new
> head. That's what the userspace sees, and you don't want an async_exec
> to re-emerge with a different FPU context.
well. I think there's some confusion about terminology, so please let me
describe everything in detail. This is how execution goes:
outer loop() {
call_threadlet();
}
this all runs in the 'head' context. call_threadlet() always switches to
the 'threadlet stack'. The 'outer context' runs in the 'head stack'. If,
while executing the threadlet function, we block, then the
threadlet-thread gets to keep the task (the threadlet stack and also the
FPU), and blocks - and we pick a 'new head' from the thread pool and
continue executing in that context - right after the call_threadlet()
function, in the 'old' head's stack. I.e. it's as if we returned
immediately from call_threadlet(), with a return code that signals that
the 'threadlet went async'.
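As a minimal sketch of that calling convention seen from userspace
(call_threadlet(), THREADLET_ASYNC and the request helpers below are
hypothetical names for illustration, not the actual API of the patches):

/* hypothetical declarations, for illustration only */
#define THREADLET_ASYNC  1L
extern long call_threadlet(long (*fn)(void *), void *arg);
extern void *next_request(void);
extern void complete_request(void *req, long ret);

/* the threadlet function: plain, straight-line C code that may block */
static long handle_request(void *req)
{
	(void)req;
	/* ... open/stat/read/send as usual ... */
	return 0;
}

static void event_loop(void)
{
	void *req;

	while ((req = next_request()) != NULL) {
		long ret = call_threadlet(handle_request, req);

		if (ret == THREADLET_ASYNC) {
			/* the threadlet blocked and kept its task; we now
			 * run as the 'new head' right after the call and
			 * move on. Completion arrives later via whatever
			 * notification mechanism is used. */
			continue;
		}
		/* fully synchronous (cached) case: ret is handle_request()'s
		 * return value, no context switch happened */
		complete_request(req, ret);
	}
}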
now, the FPU state as it was when the threadlet blocked is totally
meaningless to the 'new head' - that FPU state is from the middle of the
threadlet execution.
and here is where thinking about threadlets as a function call and not
as an asynchronous context helps a lot: the classic gcc convention for
FPU use & function calls should apply: gcc does not call an external
function with an in-use FPU stack/register, it always neatly unuses it,
as no FPU register is callee-saved, all are caller-saved.
> So, IMO, if the USEDFPU bit is set, we need to sync the dirty FPU
> context with an early unlazy_fpu(), *and* copy the sync'd FPU context
> to the new head. This should really be a fork of the dirty FPU context
> IMO, and should only happen if the USEDFPU bit is set.
why? The only effect this will have is a slowdown :) The FPU context
from the middle of the threadlet function is totally meaningless to the
'new head'. It might be anything. (although in practice system calls are
almost never called with a truly in-use FPU.)
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 19:39 ` Ingo Molnar
@ 2007-03-02 20:18 ` Davide Libenzi
2007-03-02 20:29 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-02 20:18 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linux Kernel Mailing List, Arjan van de Ven, Linus Torvalds
On Fri, 2 Mar 2007, Ingo Molnar wrote:
>
> * Davide Libenzi <davidel@xmailserver.org> wrote:
>
> > [...] We're still missing proper FPU context switch in the
> > move_user_context(). [...]
>
> yeah - i'm starting to be of the opinion that the FPU context should
> stay with the threadlet, exclusively. I.e. when calling a threadlet, the
> 'outer loop' (the event loop) should not leak FPU context into the
> threadlet and then expect it to be replicated from whatever random point
> the threadlet ended up sleeping at. It would be possible, but it just
> makes no sense. What makes most sense is to just keep the FPU context
> with the threadlet, and to let the 'new head' use an initial (unused)
> FPU context. And it's in fact the threadlet that will most likely have
> an active FPU context across a system call, not the outer loop. In other
> words: no special FPU support needed at all for threadlets (i.e. no
> flipping needed even) - this behavior just naturally happens in the
> current implementation. Hm?
I think that the "dirty" FPU context must, at least, follow the new head.
That's what the userspace sees, and you don't want an async_exec to
re-emerge with a different FPU context.
I think it should also follow the async thread (old, going-to-sleep,
thread), since a threadlet might have that dirtied, and as a consequence
it'll want to find it back when it's re-scheduled.
So, IMO, if the USEDFPU bit is set, we need to sync the dirty FPU context
with an early unlazy_fpu(), *and* copy the sync'd FPU context to the new head.
This should really be a fork of the dirty FPU context IMO, and should only
happen if the USEDFPU bit is set.
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 17:32 ` Davide Libenzi
@ 2007-03-02 19:39 ` Ingo Molnar
2007-03-02 20:18 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-02 19:39 UTC (permalink / raw)
To: Davide Libenzi; +Cc: linux-kernel, Arjan van de Ven, Linus Torvalds
* Davide Libenzi <davidel@xmailserver.org> wrote:
> [...] We're still missing proper FPU context switch in the
> move_user_context(). [...]
yeah - i'm starting to be of the opinion that the FPU context should
stay with the threadlet, exclusively. I.e. when calling a threadlet, the
'outer loop' (the event loop) should not leak FPU context into the
threadlet and then expect it to be replicated from whatever random point
the threadlet ended up sleeping at. It would be possible, but it just
makes no sense. What makes most sense is to just keep the FPU context
with the threadlet, and to let the 'new head' use an initial (unused)
FPU context. And it's in fact the threadlet that will most likely have
an active FPU context across a system call, not the outer loop. In other
words: no special FPU support needed at all for threadlets (i.e. no
flipping needed even) - this behavior just naturally happens in the
current implementation. Hm?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 17:13 ` Davide Libenzi
@ 2007-03-02 19:13 ` Davide Libenzi
2007-03-03 10:06 ` Evgeniy Polyakov
1 sibling, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-02 19:13 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Fri, 2 Mar 2007, Davide Libenzi wrote:
> And if you really feel raw about the single O(nready) loop that epoll
> currently does, a new epoll_wait2 (or whatever) API could be used to
> deliver the event directly into a userspace buffer [1], directly from the
> poll callback, w/out extra delivery loops (IRQ/event->epoll_callback->event_buffer).
And if you ever wonder where the "epoll" name came from, it came from the
old /dev/epoll. The epoll predecessor, /dev/epoll, added plugs
everywhere events were needed and delivered those events in O(1),
*directly* into a user-visible (mmap'd) buffer, in a zero-copy fashion.
The old /dev/epoll was faster than the current epoll, but the latter was
chosen because, despite being slightly slower, it had support for every
pollable device *without* adding more plugs into the existing code.
Performance and code maintenance are not to be taken disjointly when
you evaluate a solution. That's the reason I got excited about this new
generic AIO solution.
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 10:57 ` Ingo Molnar
2007-03-02 11:48 ` Evgeniy Polyakov
@ 2007-03-02 17:32 ` Davide Libenzi
2007-03-02 19:39 ` Ingo Molnar
1 sibling, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-02 17:32 UTC (permalink / raw)
To: Ingo Molnar
Cc: Evgeniy Polyakov, Eric Dumazet, Pavel Machek, Theodore Tso,
Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Fri, 2 Mar 2007, Ingo Molnar wrote:
> > After your changes epoll increased to 5k.
>
> Can we please stop this pointless episode of benchmarketing, where every
> mail of yours shows different results and you even deny having said
> something which you clearly said just a few days ago? At this point i
> simply cannot trust the numbers you are posting, nor is the discussion
> style you are following productive in any way in my opinion.
Agreed. Can we focus on the topic here? We're still missing proper FPU
context switch in the move_user_context(). In v6?
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 11:08 ` Evgeniy Polyakov
@ 2007-03-02 17:28 ` Davide Libenzi
2007-03-03 10:27 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-02 17:28 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, Linux Kernel Mailing List, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Fri, 2 Mar 2007, Evgeniy Polyakov wrote:
> do we really want to have per process signalfs, timerfs and so on - each
> simple structure must be bound to a file, which becomes too costly.
I may be old school, but if you ask me, and if you *really* want those
events, yes. Reason? Unix's everything-is-a-file rule, and being able to
use them with *existing* POSIX poll/select. Remember, not every app
requires huge scalability efforts, so working with simpler and familiar
APIs is always welcome.
The *only* thing that was not practical to have as fd, was block requests.
But maybe threadlets/syslets will handle those just fine, and close the gap.
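As an illustration of that composability, here is how the signalfd
interface (which later did become a real syscall) plugs a non-file event
source into plain POSIX poll(); a sketch only, error handling omitted:

#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <sys/signalfd.h>
#include <unistd.h>

int main(void)
{
	sigset_t mask;
	struct pollfd pfd;

	sigemptyset(&mask);
	sigaddset(&mask, SIGINT);
	sigprocmask(SIG_BLOCK, &mask, NULL);	/* stop default delivery */

	pfd.fd = signalfd(-1, &mask, 0);	/* signals become an fd ... */
	pfd.events = POLLIN;

	if (poll(&pfd, 1, -1) > 0) {		/* ... usable by poll/select/epoll */
		struct signalfd_siginfo si;

		if (read(pfd.fd, &si, sizeof(si)) == sizeof(si))
			printf("got signal %u via a plain fd\n", si.ssi_signo);
	}
	close(pfd.fd);
	return 0;
}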
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 8:10 ` Evgeniy Polyakov
@ 2007-03-02 17:13 ` Davide Libenzi
2007-03-02 19:13 ` Davide Libenzi
2007-03-03 10:06 ` Evgeniy Polyakov
0 siblings, 2 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-02 17:13 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Fri, 2 Mar 2007, Evgeniy Polyakov wrote:
> On Thu, Mar 01, 2007 at 11:31:14AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:
> >
> > > Ingo, do you really think I will send mails with faked benchmarks? :))
> >
> > I don't think he ever implied that. He was only suggesting that when you
> > post benchmarks, and even more when you make claims based on benchmarks,
> > you need to be extra careful about what you measure. Otherwise the
> > external view that you give to others does not look good.
> > Kevent can be really faster than epoll, but if you post broken benchmarks
> > (that can be unreliable HTTP loaders, broken server implementations,
> > etc..) and make claims based on that, the only effect that you have is to
> > lose your point.
>
> So, my only point was that kevent is superior to epoll because (and
> this is the _main_ issue) of its ability to handle essentially any kind of
> event with very small overhead (the same as epoll has in struct file -
> a list and a spinlock) and without the significant price of binding
> struct file to each event.
You'll have to excuse me if my memory is bad, but IIRC the whole discussion
and long benchmark feast was born with you throwing a benchmark at Ingo
(with kevent showing a 1.9x performance boost WRT epoll), not with you
making any other point.
As for epoll not being able to handle other events: says who? Of
course, with zero modifications, you can handle zero additional events.
With modifications, you can handle other events. But let's talk about those
other events. The *only* kind of event that people (and being the epoll
maintainer I tend to receive those requests) missed in epoll was AIO
events. That's the *only* thing that was missed by real-life application
developers. And if something like threadlets/syslets proves effective,
the gap is closed WRT that requirement.
Epoll already handles the whole class of pollable devices inside the
kernel, and if you exclude block AIO, that's a pretty wide class already.
The *existing* f_op->poll subsystem can be used to deliver events at the
poll-head wakeup time (by using the "key" member of the poll callback), so
that you don't even need the extra f_op->poll call to fetch events.
And if you really feel raw about the single O(nready) loop that epoll
currently does, a new epoll_wait2 (or whatever) API could be used to
deliver events directly into a userspace buffer [1], straight from the
poll callback, w/out extra delivery loops (IRQ/event->epoll_callback->event_buffer).
[1] From the epoll callback, we cannot sleep, so it's gonna be either an
mlocked userspace buffer, or some kernel pages mapped to userspace.
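(To make the shape of that concrete - epoll_wait2 does not exist, the
declaration below is only a sketch of the idea, with invented names:)

    /* Sketch only: events written by the wakeup callback straight into a
     * ring shared with userspace (mlocked user memory or kernel pages
     * mapped to userspace, since the callback cannot sleep). */
    struct epoll_uevent {
            uint32_t events;   /* EPOLLIN/EPOLLOUT/... taken from the poll "key" */
            uint64_t data;     /* user cookie, as in struct epoll_event */
    };

    /* Hypothetical syscall: returns how many events were placed in the
     * ring, without the per-event copy loop of epoll_wait(). */
    long epoll_wait2(int epfd, struct epoll_uevent *ring,
                     int ring_entries, int timeout_ms);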
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 10:57 ` Ingo Molnar
@ 2007-03-02 11:48 ` Evgeniy Polyakov
2007-03-02 17:32 ` Davide Libenzi
1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-02 11:48 UTC (permalink / raw)
To: Ingo Molnar
Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Fri, Mar 02, 2007 at 11:57:13AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > > > > [...] The numbers are still highly suspect - and we are already
> > > > > down from the prior claim of kevent being almost twice as fast
> > > > > to a 25% difference.
> > > >
> > > > Btw, there was never an almost-twofold performance increase - epoll in
> > > > my tests always showed 4-5 thousand requests per second, kevent -
> > > > up to 7 thousand.
> > >
> > > i'm referring to your claim in this mail of yours from 4 days ago
> > > for example:
> > >
> > > http://lkml.org/lkml/2007/2/25/116
> > >
> > > "But note, that on my athlon64 3500 test machine kevent is about 7900
> > > requests per second compared to 4000+ epoll, so expect a challenge."
> > >
> > > no matter how i look at it, but 7900 is 1.9 times 4000 - which is
> > > "almost twice".
> >
> > After your changes epoll increased to 5k.
>
> Can we please stop this pointless episode of benchmarketing, where every
> mail of yours shows different results and you even deny having said
> something which you clearly said just a few days ago? At this point i
> simply cannot trust the numbers you are posting, nor is the discussion
> style you are following productive in any way in my opinion.
I just show what I see in tests - I do not perform a deep analysis of
it, since I do not see why that should be done - it is not fake, it is
not fantasy - it is the real behaviour observed on my test machine, and if it
suddenly changes I will report it.
Btw, I showed cases where epoll behaved better than kevent and the
performance was an unbeatable 9k requests per second - I do not know why
it happened - maybe some cache-related issues, all other processes sleeping
at once, increased radiation or a strong wind blowing away my bad aura - and it is
not reproducible on demand either.
> (you are never ever wrong, and if you are proven wrong on topic A you
> claim it is an irrelevant topic (without even admitting you were wrong
> about it) and you point to topic B claiming it's the /real/ topic you
> talked about all along. And along the way you are slandering other
> projects like epoll and threadlets, distorting the discussion. This kind
> of keep-the-ball-moving discussion style is effective in politics but
> IMO it's a waste of time when developing a kernel.)
Heh - that is why I'm not subscribed to lkml@ - it too frequently ends
up in politics :)
What are we talking about - are we trying to insult each other over something
that was only said as part of an assumption in a theoretical mental
exercise? I can only laugh at that :)
Ingo, I never ever tried to show that something is broken - that is a
fantasy based on reading my words too literally, not on my real intention.
I never said epoll is broken. Absolutely.
I never said threadlet is broken. Absolutely.
I just showed that it is not (in my opinion) the right decision to use
threadlets for an IO model instead of an event-driven one - this is not based
on kevent performance (I _never_ stated it as a main factor - kevent was
only an example of an event-driven model, you confused it with kevent
AIO, which is a different beast), but instead on experience with nptl
threads and linuxthreads, and the related rescheduling overhead compared to
userspace scheduling.
I showed kevent as a possible usage scenario - since it does support its own
AIO. And you started to fight against it in every detail, since you
think kevent is not a good way to handle the AIO model - well, that may be
perfectly correct. I showed kevent AIO (please do not think that kevent
and kevent AIO are the same - the latter is just one of the possible
users I implemented, and it only uses kevent to deliver completion events to
userspace) as a possible AIO implementation, but not _kevent_ itself.
But somehow we ended up attributing to me words I never said and ideas
I never based my arguments on... I do not really think you even
remotely wanted to make anything personal out of what we had
discussed.
We even concluded that the perfect IO model should use both approaches to
really scale - both threadlets with their on-demand-only rescheduling, and
an event-driven ring.
You stated your opinion on kevents - well, I cannot agree with it, but
it is your right not to like something.
Let's not continue the bad practice of kicking each other just because there
were some problematic roots which no one even remembers correctly - let's
not make the mistake of taking something personal out of trivial bits
- if you are in Russia or nearby any time soon I will happily buy you
a beer or whatever you prefer :)
So, let's just draw a line:
kevent was shown to people, and its performance, although flaky, is a
bit better than epoll's. Threadlets bound to any event-driven ring do not
show any performance degradation in a network-driven setup with a small
number of reschedulings, with all the advantages of simpler programming.
So, repeating myself, both models (not kevent and threadlets, but
event-driven and thread-based) should be used to achieve the maximum
performance.
So, I will make yet another request for peace, fun and interesting
technical discussion. Peace?
Anyway, that was an interesting discussion - thanks a lot to all
participants :)
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 10:56 ` Ingo Molnar
@ 2007-03-02 11:08 ` Evgeniy Polyakov
2007-03-02 17:28 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-02 11:08 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Fri, Mar 02, 2007 at 11:56:18AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > Even if kevent has the same speed, it still allows handling _any_
> > kind of events without any major surgery - a very tiny structure of a
> > lock and a list head and you can process your own kernel events in
> > userspace with timers, signals, io events, private userspace events
> > and others, without races or the invention of different hacks for
> > different types - _this_ is the main point.
>
> did it ever occur to you to ... extend epoll? To speed it up? To add a
> new wait syscall to it? Instead of introducing a whole new parallel
> framework?
Yes, I thought about extending it more than a year ago before I started
kevent, but epoll() is absolutely based on the file structure and its
file_operations with the poll method, so it is quite impossible to work
with sockets to implement network AIO. Eventually it would have had to
gather a lot of other subsystems - do we really want to have per-process
signalfs, timerfs and so on - each simple structure must be bound to a file,
which becomes too costly.
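(Roughly the dependency in question - simplified pseudo-kernel code, not the
literal fs/eventpoll.c source:)

    /* epoll can only watch objects that are files with a ->poll method: */
    struct file *file = fget(fd);
    if (!file || !file->f_op || !file->f_op->poll)
            return -EPERM;          /* nothing for epoll to hook into */

    /* Registration works by calling ->poll(), which adds epoll's callback
     * to the file's wait queue: */
    revents = file->f_op->poll(file, &epq.pt);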
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 15:36 ` Evgeniy Polyakov
@ 2007-03-02 10:57 ` Ingo Molnar
2007-03-02 11:48 ` Evgeniy Polyakov
2007-03-02 17:32 ` Davide Libenzi
0 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-02 10:57 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > > > [...] The numbers are still highly suspect - and we are already
> > > > down from the prior claim of kevent being almost twice as fast
> > > > to a 25% difference.
> > >
> > > Btw, there was never an almost-twofold performance increase - epoll in
> > > my tests always showed 4-5 thousand requests per second, kevent -
> > > up to 7 thousand.
> >
> > i'm referring to your claim in this mail of yours from 4 days ago
> > for example:
> >
> > http://lkml.org/lkml/2007/2/25/116
> >
> > "But note, that on my athlon64 3500 test machine kevent is about 7900
> > requests per second compared to 4000+ epoll, so expect a challenge."
> >
> > no matter how i look at it, but 7900 is 1.9 times 4000 - which is
> > "almost twice".
>
> After your changes epoll increased to 5k.
Can we please stop this pointless episode of benchmarketing, where every
mail of yours shows different results and you even deny having said
something which you clearly said just a few days ago? At this point i
simply cannot trust the numbers you are posting, nor is the discussion
style you are following productive in any way in my opinion.
(you are never ever wrong, and if you are proven wrong on topic A you
claim it is an irrelevant topic (without even admitting you were wrong
about it) and you point to topic B claiming it's the /real/ topic you
talked about all along. And along the way you are slandering other
projects like epoll and threadlets, distorting the discussion. This kind
of keep-the-ball-moving discussion style is effective in politics but
IMO it's a waste of time when developing a kernel.)
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 10:37 ` Evgeniy Polyakov
@ 2007-03-02 10:56 ` Ingo Molnar
2007-03-02 11:08 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-02 10:56 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> Even if kevent has the same speed, it still allows handling _any_
> kind of events without any major surgery - a very tiny structure of a
> lock and a list head and you can process your own kernel events in
> userspace with timers, signals, io events, private userspace events
> and others, without races or the invention of different hacks for
> different types - _this_ is the main point.
did it ever occur to you to ... extend epoll? To speed it up? To add a
new wait syscall to it? Instead of introducing a whole new parallel
framework?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-02 10:27 ` Pavel Machek
@ 2007-03-02 10:37 ` Evgeniy Polyakov
2007-03-02 10:56 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-02 10:37 UTC (permalink / raw)
To: Pavel Machek
Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Fri, Mar 02, 2007 at 11:27:14AM +0100, Pavel Machek (pavel@ucw.cz) wrote:
> Maybe. It is not up to me to decide. But "it is faster" is _not_ the
> only merge criterion.
Of course not!
Even if kevent has the same speed, it still allows handling _any_ kind
of events without any major surgery - a very tiny structure of a lock and
a list head and you can process your own kernel events in userspace with
timers, signals, io events, private userspace events and others without
races or the invention of different hacks for different types -
_this_ is the main point.
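(Roughly the kind of per-object hook being described - the field and type
names here are only illustrative, not the actual kevent definitions:)

    /* Per-object origin of events: small enough to embed anywhere
     * (a socket, a timer, an inode, a private userspace object). */
    struct event_origin {
            spinlock_t       lock;
            struct list_head events;   /* events queued against this object */
    };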
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 11:18 ` Evgeniy Polyakov
@ 2007-03-02 10:27 ` Pavel Machek
2007-03-02 10:37 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Pavel Machek @ 2007-03-02 10:27 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
Hi!
> > > > If you can replace them with something simpler, and no worse than 10%
> > > > slower in the worst case, then go ahead. (We actually tried to do that at
> > > > some point, only to realize that efence stresses the vm subsystem in a very
> > > > unexpected/unfriendly way).
> > >
> > > Agh, only 10% in the worst case.
> > > I think you cannot even imagine what tricks the network stack uses to get at
> > > least an additional 1% out of the box.
> >
> > Yep? Feel free to rewrite networking in assembly on Eugenix. That
> > should get you a 1% improvement. If you reserve a few registers to be only
> > used by the kernel (not allowed for userspace), you can speed up networking
> > by 5%, too. Oh, and you could turn off the MMU, that is a sure way to get a few
> > more percent improvement in your networking case.
>
> It is not _my_ networking, but the one you use every day in every Linux
> box. Notice which tricks are used to remove a single byte from
> sk_buff.
Ok, so the tricks were worth it in the sk_buff case.
> It is called optimization, and if it gives us even a single plus it must be
> implemented. Not all people have a magical fear of new things.
But that does not mean "every optimization must be
implemented". Only optimizations that are "worth it" are...
> > > Using such logic you can just abandon any further development, since it
> > > works as-is right now.
> >
> > Stop trying to pervert my logic.
>
> Ugh? :)
> I am just restating in simple words your 'we do not need something if it adds 10%
> but is complex to understand'.
Yes... but that does not mean "stop development". You are still free
to clean up the code _while_ making it faster.
> > If your code is so complex that it is almost impossible to use from
> > userspace, that is a good enough reason for it not to be merged. "But it is 3%
> > faster if..." is not a good-enough argument.
>
> Is it enough for you?
>
> epoll 4794.23 req/sec
> kevent 6468.95 req/sec
Maybe. It is not up to me to decide. But "it is faster" is _not_ the
only merge criterion.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 19:31 ` Davide Libenzi
@ 2007-03-02 8:10 ` Evgeniy Polyakov
2007-03-02 17:13 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-02 8:10 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Thu, Mar 01, 2007 at 11:31:14AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:
>
> > Ingo, do you really think I will send mails with faked benchmarks? :))
>
> I don't think he ever implied that. He was only suggesting that when you
> post benchmarks, and even more when you make claims based on benchmarks,
> you need to be extra careful about what you measure. Otherwise the
> external view that you give to others does not look good.
> Kevent can be really faster than epoll, but if you post broken benchmarks
> (that can be unreliable HTTP loaders, broken server implementations,
> etc..) and make claims based on that, the only effect that you have is to
> lose your point.
We seem to have moved far away from the original topic - I never built any
arguments on top of kevent _performance_ - kevent is a logical
extrapolation of epoll, and I only showed that an event-driven model can be
fast and that it outperforms the threadlet one - after we changed topic we were
unable to actually test threadlets in a networking environment, since the
only test I ran showed that threadlets do not reschedule at all, and
Ingo's tests showed a small number of reschedulings.
So, my only point was that kevent is superior to epoll because (and
this is the _main_ issue) of its ability to handle essentially any kind of
event with very small overhead (the same as epoll has in struct file -
a list and a spinlock) and without the significant price of binding
struct file to each event.
I did not want and do not want to hurt anyone (even Ingo, although he is
against kevent :), but my opinion is that the thread moved from a nice
discussion about threads and events, with jokes and fun, into quite angry
word-throwing, and that is not good - let's make it fun again.
I'm not a native English speaker (and do not use a dictionary), so it is
quite possible that some of my phrases were not exactly nice, but it was
unintentional (at least not very) :)
Peace?
> - Davide
>
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 23:12 ` Ingo Molnar
2007-03-01 1:33 ` Andrea Arcangeli
@ 2007-03-01 21:27 ` Linus Torvalds
1 sibling, 0 replies; 277+ messages in thread
From: Linus Torvalds @ 2007-03-01 21:27 UTC (permalink / raw)
To: Ingo Molnar
Cc: Davide Libenzi, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Thu, 1 Mar 2007, Ingo Molnar wrote:
>
> wrt. one-shot syscalls, the user-space stack footprint would still
> probably be there, because even async contexts that only do single-shot
> processing need to drop out of kernel mode to handle signals.
Why?
The easiest thing to do with signals is to just not pick them up. If the
signal was to that *particular* threadlet (ie a "cancel"), then we just
want to kill the threadlet. And if the signal was to the thread group,
there is no reason why the threadlet should pick it up.
In neither case is there *any* reason to handle the signal in the
threadlet, afaik.
And having to have a stack allocation for each threadlet certainly means
that you complicate things a lot. Suddenly you have allocations that can't
just go away. Again, I'm pointing to the problems I already pointed out
with the allocations of the atom structures - quite often you do *not*
want to keep track of anything specific for completion time, and that
means that you MUST NOT have to de-allocate anything either.
Again, think aio_read(). With the *exact* current binary interface.
PLEASE. If you cannot emulate that with threadlets, then threadlets are
*pointless*. One of the major reasons for the whole exercise was to get
rid of the special code in fs/aio.c.
So I repeat: if you cannot do that, and remain binary compatible, don't
even bother.
Linus
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 19:37 ` David Lang
@ 2007-03-01 20:34 ` Johann Borck
0 siblings, 0 replies; 277+ messages in thread
From: Johann Borck @ 2007-03-01 20:34 UTC (permalink / raw)
To: David Lang; +Cc: linux-kernel
David Lang wrote:
>
> On Thu, 1 Mar 2007, Johann Borck wrote:
>
>> I reported this a while ago and suggested having the number of
>> pending accepts reported with the event to save that last syscall.
>> I created an ab replacement based on kevent, and at least with my
>> machines, which are comparable to each other, the load on the client
>> dropped from 100% to 2% or something. ab just doesn't give meaningful
>> results (if the client is not way more powerful). With that new
>> client I get very similar results for epoll and kevent, from 1000
>> through to 26000 concurrent requests; the results have been posted on
>> the kevent homepage in October. I just checked it with the new version, but
>> there's no significant difference.
>>
>> this is the benchmark with kevent-based client:
>> http://tservice.net.ru/~s0mbre/blog/2006/10/11#2006_10_11
>> btw, each result is average over 1,000,000 requests
>>
>> and just for comparison, this is on the same machines using ab:
>> http://tservice.net.ru/~s0mbre/blog/2006/10/08#2006_10_08
>
> is this client available? And what patches need to be added to the
> kernel to use it?
>
It's based on an older version of kevent, so I'll have to adapt it a bit
for use with the recent patch; nothing other than kevent is necessary. I'll post
a link when it's cleaned up, if you want.
Johann
> David Lang
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 15:32 ` Eric Dumazet
2007-03-01 15:41 ` Eric Dumazet
2007-03-01 15:47 ` Evgeniy Polyakov
@ 2007-03-01 19:47 ` Davide Libenzi
2 siblings, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-01 19:47 UTC (permalink / raw)
To: Eric Dumazet
Cc: Evgeniy Polyakov, Ingo Molnar, Pavel Machek, Theodore Tso,
Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
Oh boy, wasn't this thread supposed to focus on syslets/threadlets ... :)
On Thu, 1 Mar 2007, Eric Dumazet wrote:
> On Thursday 01 March 2007 16:23, Evgeniy Polyakov wrote:
> > They are there, since ab runs only 50k requests.
> > If I change it to something noticeably more than 50/80k, ab crashes:
> > # ab -c8000 -t 600 -n800000000 http://192.168.0.48/
> > This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> > Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> > Copyright 2006 The Apache Software Foundation, http://www.apache.org/
> >
> > Benchmarking 192.168.0.48 (be patient)
> > Segmentation fault
> >
> > Are there any other tools suitable for such loads?
> > I only tested httperf (which is worse, since it uses poll/select) and
> > 'ab'.
> >
> > Btw, the host machine runs at 100% too, so it is possible that the client side is
> > broken (too).
>
> I have similar problems here, the ab test just doesn't complete...
>
> I am still investigating with strace and tcpdump.
>
> In the meantime you could just rewrite it (based on epoll please :) ), since
> it should be quite easy to do this (reverse of evserver_epoll)
I have a simple one based on coroutines and epoll. You need libpcl and
coronet. Debian has a package named libpcl1-dev for libpcl, otherwise:
http://www.xmailserver.org/libpcl.html
and 'configure --prefix=/usr && sudo make install'.
Coronet is here:
http://www.xmailserver.org/coronet-lib.html
here just 'configure && make'.
Inside the "test" directory there a simple loader named cnhttpload:
cnhttpload -s HOST -n NCON [-p PORT (80)] [-r NREQS (1)] [-S STKSIZE (8192)]
[-M MAXCONNS] [-t TMUPD (1000)] [-a NACTIVE] [-T TMSAMP (200)]
[-h] URL ...
HOST = Target host
PORT = Target host port
NCON = Number of connections to the server
NACTIVE = Number of active (live) connections
STKSIZE = Stack size for coroutines
NREQS = Number of requests done for each connection (better be 1 if
your server does not support keep-alive)
MAXCONNS = Maximum number of total connections done to the server. If not
set, the test will continue forever (well, till a ^C)
TMUPD = Millisec time of stats update
TMSAMP = Millisec internal average-update time
URL = Target doc (not http:// or host, just doc path)
So for the particular test my inbox was flooded with :), you'd use:
cnhttpload -s HOST -n 80000 -a 8000 -S 4096
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 19:24 ` Johann Borck
@ 2007-03-01 19:37 ` David Lang
2007-03-01 20:34 ` Johann Borck
0 siblings, 1 reply; 277+ messages in thread
From: David Lang @ 2007-03-01 19:37 UTC (permalink / raw)
To: Johann Borck; +Cc: linux-kernel
On Thu, 1 Mar 2007, Johann Borck wrote:
> I reported this a while ago and suggested having the number of pending
> accepts reported with the event to save that last syscall.
> I created an ab replacement based on kevent, and at least with my machines,
> which are comparable to each other, the load on the client dropped from 100% to
> 2% or something. ab just doesn't give meaningful results (if the client is
> not way more powerful). With that new client I get very similar results for
> epoll and kevent, from 1000 through to 26000 concurrent requests; the results
> have been posted on the kevent homepage in October. I just checked it with the new
> version, but there's no significant difference.
>
> this is the benchmark with kevent-based client:
> http://tservice.net.ru/~s0mbre/blog/2006/10/11#2006_10_11
> btw, each result is average over 1,000,000 requests
>
> and just for comparison, this is on the same machines using ab:
> http://tservice.net.ru/~s0mbre/blog/2006/10/08#2006_10_08
is this client available? And what patches need to be added to the kernel to use
it?
David Lang
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 14:54 ` Evgeniy Polyakov
2007-03-01 15:09 ` Ingo Molnar
@ 2007-03-01 19:31 ` Davide Libenzi
2007-03-02 8:10 ` Evgeniy Polyakov
1 sibling, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-03-01 19:31 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Eric Dumazet, Pavel Machek, Theodore Tso,
Linus Torvalds, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:
> Ingo, do you really think I will send mails with faked benchmarks? :))
I don't think he ever implied that. He was only suggesting that when you
post benchmarks, and even more when you make claims based on benchmarks,
you need to be extra careful about what you measure. Otherwise the
external view that you give to others does not look good.
Kevent can be really faster than epoll, but if you post broken benchmarks
(that can be unreliable HTTP loaders, broken server implementations,
etc..) and make claims based on that, the only effect that you have is to
lose your point.
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 8:18 ` Evgeniy Polyakov
2007-03-01 9:26 ` Pavel Machek
@ 2007-03-01 19:24 ` Johann Borck
2007-03-01 19:37 ` David Lang
1 sibling, 1 reply; 277+ messages in thread
From: Johann Borck @ 2007-03-01 19:24 UTC (permalink / raw)
To: linux-kernel
On Thu, Mar 01, 2007 at 04:41:27PM +0100, Eric Dumazet wrote:
>>
>> I had to loop on accept() :
>>
>> for (i=0; i<num; ++i) {
>> if (event[i].data.fd == main_server_s) {
>> do {
>> err = evtest_callback_main(event[i].data.fd);
>> } while (err != -1);
>> }
>> else
>> err = evtest_callback_client(event[i].data.fd);
>> }
>>
>> Or else we can miss an event forever...
On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:
>
> The same here - I would just enable debugging to find it.
>
I reported this a while ago and suggested having the number of
pending accepts reported with the event to save that last syscall.
I created an ab replacement based on kevent, and at least with my
machines, which are comparable to each other, the load on the client
dropped from 100% to 2% or something. ab just doesn't give meaningful
results (if the client is not way more powerful). With that new client
I get very similar results for epoll and kevent, from 1000 through to
26000 concurrent requests; the results have been posted on the
kevent homepage in October. I just checked it with the new version, but
there's no significant difference.
this is the benchmark with kevent-based client:
http://tservice.net.ru/~s0mbre/blog/2006/10/11#2006_10_11
btw, each result is average over 1,000,000 requests
and just for comparison, this is on the same machines using ab:
http://tservice.net.ru/~s0mbre/blog/2006/10/08#2006_10_08
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 9:54 ` Ingo Molnar
2007-03-01 10:59 ` Evgeniy Polyakov
@ 2007-03-01 19:19 ` Davide Libenzi
1 sibling, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-03-01 19:19 UTC (permalink / raw)
To: Ingo Molnar
Cc: Evgeniy Polyakov, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, Linux Kernel Mailing List, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Thu, 1 Mar 2007, Ingo Molnar wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > I posted kevent/epoll benchmarks and related design issues too many
> > times both with handmade applications (which might be broken as hell)
> > and popular open-source servers to repeat them again.
>
> numbers are crucial here - and given the epoll bugs in the evserver code
> that we found, do you have updated evserver benchmark results that
> compare epoll to kevent? I'm wondering why epoll has half the speed of
> kevent in those measurements - i suspect some possible benchmarking bug.
> The queueing model of epoll and kevent is roughly comparable, both do
> only a constant number of steps to serve one particular request,
> regardless of how many pending connections/requests there are. What is
> the CPU utilization of the server system during an epoll test, and what
> is the CPU utilization during a kevent test? 100% utilized in both
> cases?
With 8K concurrent (live) connections, we may also want to try with the v3
version of the epoll-event-loops-diet patch ;)
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 17:56 ` Evgeniy Polyakov
@ 2007-03-01 18:41 ` David Lang
0 siblings, 0 replies; 277+ messages in thread
From: David Lang @ 2007-03-01 18:41 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:
> On Thu, Mar 01, 2007 at 08:56:28AM -0800, David Lang (david.lang@digitalinsight.com) wrote:
>> the ab numbers below do not seem that impressive to me, especially for such
>> stripped down server processes.
> ...
>> client and server are dual opteron 252 with 8G of ram, running debian in 64
>> bit mode
>
> Scale your hardware setup down by 2-4 times, leave only one apache process
> and try to get the same - we are not talking about how to create a
> perfect web server, instead we are trying to pin down possible problems in
> the epoll/kevent event-driven logic.
for apache I agree that the target box was maxed out, so if you only had a
single core on your AMD64 box that would be about half; however, thttpd is
only using ~1 of the CPUs (OS overhead is using just a smidge of the second,
but overall the box is 45-48% idle).
if the amount of ram is an issue then you are swapping in your tests (or at
least throwing out cache that you need) and so would not be testing what you
think you are.
> Vanilla (epoll) lighttpd shows 4000-5000 requests per second in my setup (no logs).
> Default mpm-apache2 with bunch of threads - about 8k req/s.
> Default thttpd (disabled logging) - about 2k req/s
>
> Btw, all your tests are network bound; try to decrease the
> html page size to get the actual event processing speed out of those machines.
in the same test retrieving a ~128b file the server never gets below 51% idle (so it's
only using one CPU)
Server Software: thttpd/2.23beta1
Server Hostname: 208.2.188.5
Server Port: 81
Document Path: /128b
Document Length: 136 bytes
Concurrency Level: 8000
Time taken for tests: 9.372902 seconds
Complete requests: 80000
Failed requests: 0
Write errors: 0
Total transferred: 30762842 bytes
HTML transferred: 10952216 bytes
Requests per second: 8535.24 [#/sec] (mean)
Time per request: 937.290 [ms] (mean)
Time per request: 0.117 [ms] (mean, across all concurrent requests)
Transfer rate: 3205.09 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 36 287 1125.6 73 9109
Processing: 49 89 19.8 87 339
Waiting: 17 62 16.4 62 292
Total: 92 376 1137.4 159 9262
Percentage of the requests served within a certain time (ms)
50% 159
66% 164
75% 165
80% 165
90% 203
95% 260
98% 3233
99% 9201
100% 9262 (longest request)
note that this is showing the slowdown from the large concurrency level; if I
reduce the concurrency level to 500 I get
Document Path: /128b
Document Length: 136 bytes
Concurrency Level: 500
Time taken for tests: 4.215025 seconds
Complete requests: 80000
Failed requests: 0
Write errors: 0
Total transferred: 30565348 bytes
HTML transferred: 10881904 bytes
Requests per second: 18979.72 [#/sec] (mean)
Time per request: 26.344 [ms] (mean)
Time per request: 0.053 [ms] (mean, across all concurrent requests)
Transfer rate: 7081.33 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 15 206.3 1 3006
Processing: 2 7 6.4 6 224
Waiting: 1 6 6.4 5 224
Total: 3 22 208.4 6 3229
Percentage of the requests served within a certain time (ms)
50% 6
66% 8
75% 10
80% 12
90% 16
95% 17
98% 21
99% 24
100% 3229 (longest request)
loadtest2:/proc/sys#
again with >50% idle on the server box
also, ab appears to only use a single CPU, so the fact that there are two on the
client box should not make a difference.
I will reboot these boxes into a UP kernel if you think that this still makes a
significant difference. Based on what I'm seeing I don't think it will make much
of a difference (except for the apache test)
David Lang
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 16:56 ` David Lang
@ 2007-03-01 17:56 ` Evgeniy Polyakov
2007-03-01 18:41 ` David Lang
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 17:56 UTC (permalink / raw)
To: David Lang
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 08:56:28AM -0800, David Lang (david.lang@digitalinsight.com) wrote:
> the ab numbers below do not seem that impressive to me, especially for such
> stripped down server processes.
...
> client and server are dual opteron 252 with 8G of ram, running debian in 64
> bit mode
Scale your hardware setup down by 2-4 times, leave only one apache process
and try to get the same - we are not talking about how to create a
perfect web server, instead we are trying to pin down possible problems in
the epoll/kevent event-driven logic.
Vanilla (epoll) lighttpd shows 4000-5000 requests per second in my setup (no logs).
Default mpm-apache2 with bunch of threads - about 8k req/s.
Default thttpd (disabled logging) - about 2k req/s
Btw, all your tests are network bound; try to decrease the
html page size to get the actual event processing speed out of those machines.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 10:59 ` Evgeniy Polyakov
` (2 preceding siblings ...)
2007-03-01 12:34 ` Ingo Molnar
@ 2007-03-01 16:56 ` David Lang
2007-03-01 17:56 ` Evgeniy Polyakov
3 siblings, 1 reply; 277+ messages in thread
From: David Lang @ 2007-03-01 16:56 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
the ab numbers below do not seem that impressive to me, especially for such
stripped down server processes.
here are some numbers from a set of test boxes I've got in my lab. I've been
using them to test firewalls, and I've been getting throughput similar to what
is listed below when going through a proxy that does a full fork for each
connection, and then makes a new connection to the webserver on the other side!
the first few sets of numbers are going directly from test client to test
server, the final set is going though the proxy.
client and server are dual opteron 252 with 8G of ram, running debian in 64 bit
mode
this is with apache2 MPM as the destination (relatively untuned except for
tinkering with the child count settings). This should be about as bad as you can
get for a server
loadtest2:/proc/sys# ab -c 8000 -n 80000 http://208.2.188.5:80/4k
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/
Benchmarking 208.2.188.5 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests
Server Software: Apache/1.3.33
Server Hostname: 208.2.188.5
Server Port: 80
Document Path: /4k
Document Length: 4352 bytes
Concurrency Level: 8000
Time taken for tests: 10.992838 seconds
Complete requests: 80000
Failed requests: 0
Write errors: 0
Total transferred: 386192835 bytes
HTML transferred: 362612992 bytes
Requests per second: 7277.47 [#/sec] (mean)
Time per request: 1099.284 [ms] (mean)
Time per request: 0.137 [ms] (mean, across all concurrent requests)
Transfer rate: 34307.88 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 8 497 1398.3 71 9072
Processing: 17 236 346.9 103 2995
Waiting: 8 91 131.6 65 1692
Total: 26 734 1435.5 187 9786
Percentage of the requests served within a certain time (ms)
50% 187
66% 288
75% 564
80% 754
90% 3085
95% 3163
98% 4316
99% 9186
100% 9786 (longest request)
loadtest2:/proc/sys# ab -c 8000 -n 80000 http://208.2.188.5:80/8k
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/
Benchmarking 208.2.188.5 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests
Server Software: Apache/1.3.33
Server Hostname: 208.2.188.5
Server Port: 80
Document Path: /8k
Document Length: 8704 bytes
Concurrency Level: 8000
Time taken for tests: 11.355031 seconds
Complete requests: 80000
Failed requests: 0
Write errors: 0
Total transferred: 736949141 bytes
HTML transferred: 713733802 bytes
Requests per second: 7045.34 [#/sec] (mean)
Time per request: 1135.503 [ms] (mean)
Time per request: 0.142 [ms] (mean, across all concurrent requests)
Transfer rate: 63379.48 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 36 495 1297.1 76 9056
Processing: 81 317 529.5 161 3448
Waiting: 25 89 75.1 76 1610
Total: 124 812 1401.5 250 11011
Percentage of the requests served within a certain time (ms)
50% 250
66% 304
75% 497
80% 705
90% 3171
95% 3251
98% 3455
99% 9160
100% 11011 (longest request)
for both of these tests the server had its CPU maxed out (<5% idle)
switching to thttpd instead of apache, I get
loadtest2:/proc/sys# ab -c 8000 -n 80000 http://208.2.188.5:81/4k
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/
Benchmarking 208.2.188.5 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests
Server Software: thttpd/2.23beta1
Server Hostname: 208.2.188.5
Server Port: 81
Document Path: /4k
Document Length: 4352 bytes
Concurrency Level: 8000
Time taken for tests: 9.944605 seconds
Complete requests: 80000
Failed requests: 0
Write errors: 0
Total transferred: 372748950 bytes
HTML transferred: 352729600 bytes
Requests per second: 8044.56 [#/sec] (mean)
Time per request: 994.461 [ms] (mean)
Time per request: 0.124 [ms] (mean, across all concurrent requests)
Transfer rate: 36603.97 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 50 324 1274.4 70 9124
Processing: 68 98 33.3 90 781
Waiting: 22 69 26.9 63 729
Total: 125 423 1291.9 161 9324
Percentage of the requests served within a certain time (ms)
50% 161
66% 175
75% 188
80% 203
90% 246
95% 307
98% 3243
99% 9272
100% 9324 (longest request)
loadtest2:/proc/sys# ab -c 8000 -n 80000 http://208.2.188.5:81/8k
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/
Benchmarking 208.2.188.5 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests
Server Software: thttpd/2.23beta1
Server Hostname: 208.2.188.5
Server Port: 81
Document Path: /8k
Document Length: 8704 bytes
Concurrency Level: 8000
Time taken for tests: 13.502031 seconds
Complete requests: 80000
Failed requests: 0
Write errors: 0
Total transferred: 729161888 bytes
HTML transferred: 709030153 bytes
Requests per second: 5925.03 [#/sec] (mean)
Time per request: 1350.203 [ms] (mean)
Time per request: 0.169 [ms] (mean, across all concurrent requests)
Transfer rate: 52738.14 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 46 338 1189.7 106 9145
Processing: 79 197 52.8 195 670
Waiting: 39 92 28.4 94 577
Total: 140 536 1208.3 293 9424
Percentage of the requests served within a certain time (ms)
50% 293
66% 350
75% 355
80% 369
90% 388
95% 3293
98% 3388
99% 9392
100% 9424 (longest request)
for these the CPU is ~45% idle on the server box.
now I go through a box in the middle running a proxy that forks for every
request (so you have two separate TCP connections, plus a fork for each
request, plus two writes to syslog). The proxy box is a dual opteron 252 with 4G
of RAM running debian 32 bit
loadtest2:/proc/sys# ab -c 8000 -n 80000 http://192.168.254.2:8080/8k
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.254.2 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests
Server Software: Apache/1.3.33
Server Hostname: 192.168.254.2
Server Port: 8080
Document Path: /8k
Document Length: 8704 bytes
Concurrency Level: 8000
Time taken for tests: 21.101321 seconds
Complete requests: 80000
Failed requests: 0
Write errors: 0
Total transferred: 721092650 bytes
HTML transferred: 698383315 bytes
Requests per second: 3791.23 [#/sec] (mean)
Time per request: 2110.132 [ms] (mean)
Time per request: 0.264 [ms] (mean, across all concurrent requests)
Transfer rate: 33371.94 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 9 621 1652.3 20 9036
Processing: 28 81 195.5 50 6652
Waiting: 9 51 195.5 19 6620
Total: 38 703 1683.2 70 12291
Percentage of the requests served within a certain time (ms)
50% 70
66% 80
75% 83
80% 101
90% 3075
95% 3088
98% 9073
99% 9087
100% 12291 (longest request)
David Lang
On Thu, 1 Mar 2007, Evgeniy Polyakov wrote:
> On Thu, Mar 01, 2007 at 10:54:02AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>>
>> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>>
>>> I posted kevent/epoll benchmarks and related design issues too many
>>> times both with handmade applications (which might be broken as hell)
>>> and popular open-source servers to repeat them again.
>>
>> numbers are crucial here - and given the epoll bugs in the evserver code
>> that we found, do you have updated evserver benchmark results that
>> compare epoll to kevent? I'm wondering why epoll has half the speed of
>> kevent in those measurements - i suspect some possible benchmarking bug.
>> The queueing model of epoll and kevent is roughly comparable, both do
>> only a constant number of steps to serve one particular request,
>> regardless of how many pending connections/requests there are. What is
>> the CPU utilization of the server system during an epoll test, and what
>> is the CPU utilization during a kevent test? 100% utilized in both
>> cases?
>
> Yes, it is about 98-100% in both cases.
> I've just re-run tests on my amd64 test machine without debug options:
>
> epoll 4794.23
> kevent 6468.95
>
> here are the full client 'ab' outputs for the epoll and kevent servers (epoll
> does not contain EPOLLET as you requested, but it does not look like
> that changes performance in my case).
>
> epoll ab output:
> # ab -c8000 -n80000 http://192.168.0.48/
> This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> Copyright 2006 The Apache Software Foundation, http://www.apache.org/
>
> Benchmarking 192.168.0.48 (be patient)
> Completed 8000 requests
> Completed 16000 requests
> Completed 24000 requests
> Completed 32000 requests
> Completed 40000 requests
> Completed 48000 requests
> Completed 56000 requests
> Completed 64000 requests
> Completed 72000 requests
> Finished 80000 requests
>
>
> Server Software: Apache/1.3.27
> Server Hostname: 192.168.0.48
> Server Port: 80
>
> Document Path: /
> Document Length: 3521 bytes
>
> Concurrency Level: 8000
> Time taken for tests: 16.686737 seconds
> Complete requests: 80000
> Failed requests: 0
> Write errors: 0
> Total transferred: 309760000 bytes
> HTML transferred: 281680000 bytes
> Requests per second: 4794.23 [#/sec] (mean)
> Time per request: 1668.674 [ms] (mean)
> Time per request: 0.209 [ms] (mean, across all concurrent requests)
> Transfer rate: 18128.17 [Kbytes/sec] received
>
> Connection Times (ms)
> min mean[+/-sd] median max
> Connect: 159 779 110.1 799 921
> Processing: 468 866 77.4 869 988
> Waiting: 63 426 212.3 425 921
> Total: 1145 1646 115.6 1660 1873
>
> Percentage of the requests served within a certain time (ms)
> 50% 1660
> 66% 1661
> 75% 1662
> 80% 1663
> 90% 1806
> 95% 1830
> 98% 1833
> 99% 1834
> 100% 1873 (longest request)
>
> kevent ab output:
> # ab -c8000 -n80000 http://192.168.0.48/
> This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> Copyright 2006 The Apache Software Foundation, http://www.apache.org/
>
> Benchmarking 192.168.0.48 (be patient)
> Completed 8000 requests
> Completed 16000 requests
> Completed 24000 requests
> Completed 32000 requests
> Completed 40000 requests
> Completed 48000 requests
> Completed 56000 requests
> Completed 64000 requests
> Completed 72000 requests
> Finished 80000 requests
>
>
> Server Software: Apache/1.3.27
> Server Hostname: 192.168.0.48
> Server Port: 80
>
> Document Path: /
> Document Length: 3521 bytes
>
> Concurrency Level: 8000
> Time taken for tests: 12.366775 seconds
> Complete requests: 80000
> Failed requests: 0
> Write errors: 0
> Total transferred: 317047104 bytes
> HTML transferred: 288306522 bytes
> Requests per second: 6468.95 [#/sec] (mean)
> Time per request: 1236.677 [ms] (mean)
> Time per request: 0.155 [ms] (mean, across all concurrent requests)
> Transfer rate: 25036.12 [Kbytes/sec] received
>
> Connection Times (ms)
> min mean[+/-sd] median max
> Connect: 130 364 871.1 275 9347
> Processing: 178 298 42.5 296 580
> Waiting: 31 202 65.8 210 369
> Total: 411 663 887.0 572 9722
>
> Percentage of the requests served within a certain time (ms)
> 50% 572
> 66% 573
> 75% 618
> 80% 640
> 90% 684
> 95% 709
> 98% 721
> 99% 3455
> 100% 9722 (longest request)
>
> Notice how the percentage of requests served within a certain time
> differs between kevent and epoll. And this server does not include the
> ready-on-submission kevent optimization.
>
>> Ingo
>
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 15:41 ` Eric Dumazet
@ 2007-03-01 15:51 ` Evgeniy Polyakov
0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 15:51 UTC (permalink / raw)
To: Eric Dumazet
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 04:41:27PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Thursday 01 March 2007 16:32, Eric Dumazet wrote:
> > On Thursday 01 March 2007 16:23, Evgeniy Polyakov wrote:
> > > They are there, since ab runs only 50k requests.
> > > If I change it to something noticeably more than 50/80k, ab crashes:
> > > # ab -c8000 -t 600 -n800000000 http://192.168.0.48/
> > > This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> > > Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> > > Copyright 2006 The Apache Software Foundation, http://www.apache.org/
> > >
> > > Benchmarking 192.168.0.48 (be patient)
> > > Segmentation fault
> > >
> > > Are there any other tools suitable for such loads?
> > > I only tested httperf (which is worse, since it uses poll/select) and
> > > 'ab'.
> > >
> > > Btw, the host machine runs at 100% too, so it is possible that the client side is
> > > broken (too).
> >
> > I have similar problems here, the ab test just doesn't complete...
> >
> > I am still investigating with strace and tcpdump.
>
> OK... I found it.
>
> I had to loop on accept() :
>
> for (i=0; i<num; ++i) {
> if (event[i].data.fd == main_server_s) {
> do {
> err = evtest_callback_main(event[i].data.fd);
> } while (err != -1);
> }
> else
> err = evtest_callback_client(event[i].data.fd);
> }
>
> Or else we can miss an event forever...
The same here - I would just enable debugging to find it.
# ab -c8000 -n80000 http://192.168.0.48/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.0.48 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests
Server Software: Apache/1.3.27
Server Hostname: 192.168.0.48
Server Port: 80
Document Path: /
Document Length: 3521 bytes
Concurrency Level: 8000
Time taken for tests: 18.250921 seconds
Complete requests: 80000
Failed requests: 0
Write errors: 0
Total transferred: 315691904 bytes
HTML transferred: 287074172 bytes
Requests per second: 4383.34 [#/sec] (mean)
Time per request: 1825.092 [ms] (mean)
Time per request: 0.228 [ms] (mean, across all concurrent requests)
Transfer rate: 16891.86 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 137 884 481.1 920 3602
Processing: 567 888 163.6 985 997
Waiting: 47 455 238.2 439 921
Total: 765 1772 566.6 1911 4556
Percentage of the requests served within a certain time (ms)
50% 1911
66% 1911
75% 1912
80% 1913
90% 1913
95% 1914
98% 4438
99% 4497
100% 4556 (longest request)
kano:~#
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 15:32 ` Eric Dumazet
2007-03-01 15:41 ` Eric Dumazet
@ 2007-03-01 15:47 ` Evgeniy Polyakov
2007-03-01 19:47 ` Davide Libenzi
2 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 15:47 UTC (permalink / raw)
To: Eric Dumazet
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 04:32:37PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Thursday 01 March 2007 16:23, Evgeniy Polyakov wrote:
> > They are there, since ab runs only 50k requests.
> > If I change it to something noticeably more than 50/80k, ab crashes:
> > # ab -c8000 -t 600 -n800000000 http://192.168.0.48/
> > This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> > Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> > Copyright 2006 The Apache Software Foundation, http://www.apache.org/
> >
> > Benchmarking 192.168.0.48 (be patient)
> > Segmentation fault
> >
> > Are there any other tools suitable for such loads?
> > I only tested httperf (which is worse, since it uses poll/select) and
> > 'ab'.
> >
> > Btw, the host machine runs at 100% too, so it is possible that the client side is
> > broken (too).
>
> I have similar problems here, the ab test just doesn't complete...
>
> I am still investigating with strace and tcpdump.
>
> In the meantime you could just rewrite it (based on epoll please :) ), since
> it should be quite easy to do this (reverse of evserver_epoll)
Rewriting 'ab' with pure epoll instead of the APR lib is like
treating dandruff with a guillotine.
I will try to cook up something of my own - a simple client (based on epoll) -
tomorrow/weekend; right now I need to work for money :)
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 15:32 ` Eric Dumazet
@ 2007-03-01 15:41 ` Eric Dumazet
2007-03-01 15:51 ` Evgeniy Polyakov
2007-03-01 15:47 ` Evgeniy Polyakov
2007-03-01 19:47 ` Davide Libenzi
2 siblings, 1 reply; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 15:41 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thursday 01 March 2007 16:32, Eric Dumazet wrote:
> On Thursday 01 March 2007 16:23, Evgeniy Polyakov wrote:
> > They are there, since ab runs only 50k requests.
> > If I change it to something noticeably more than 50/80k, ab crashes:
> > # ab -c8000 -t 600 -n800000000 http://192.168.0.48/
> > This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> > Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> > Copyright 2006 The Apache Software Foundation, http://www.apache.org/
> >
> > Benchmarking 192.168.0.48 (be patient)
> > Segmentation fault
> >
> > Are there any other tools suitable for such loads?
> > I only tested httperf (which is worse, since it uses poll/select) and
> > 'ab'.
> >
> > Btw, the host machine runs at 100% too, so it is possible that the client side is
> > broken (too).
>
> I have similar problems here, the ab test just doesn't complete...
>
> I am still investigating with strace and tcpdump.
OK... I found it.
I had to loop on accept() :
for (i=0; i<num; ++i) {
    if (event[i].data.fd == main_server_s) {
        do {
            err = evtest_callback_main(event[i].data.fd);
        } while (err != -1);
    }
    else
        err = evtest_callback_client(event[i].data.fd);
}
Or else we can miss an event forever...
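(A minimal self-contained sketch of the same idea, with illustrative
names - this is not Eric's actual evserver code. With an edge-triggered
(EPOLLET) listening socket, accept() has to be looped until it returns
EAGAIN, otherwise connections that arrived while the first one was being
handled never generate another readiness event:

#include <sys/epoll.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <errno.h>

static void drain_accept(int listen_fd, int epfd)
{
    struct epoll_event ev;
    int fd;

    for (;;) {
        fd = accept(listen_fd, NULL, NULL);
        if (fd < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;          /* listen backlog fully drained */
            if (errno == EINTR)
                continue;       /* interrupted, retry */
            break;              /* real error, give up */
        }

        /* make the accepted socket non-blocking */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

        /* register the new client fd with the same edge-triggered epoll set */
        ev.events = EPOLLIN | EPOLLET;
        ev.data.fd = fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }
}

The same drain-until-EAGAIN rule applies to read()/recv() on the client
sockets when they are registered with EPOLLET.)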
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 15:09 ` Ingo Molnar
@ 2007-03-01 15:36 ` Evgeniy Polyakov
2007-03-02 10:57 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 15:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 04:09:42PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > > > I can tell you that the problem (at least on my machine) comes from :
> > > >
> > > > gettimeofday(&tm, NULL);
> > > >
> > > > in evserver_epoll.c
> > >
> > > yeah, that's another difference - especially if it's something like
> > > an Athlon64 and gettimeofday falls back to pm-timer, that could
> > > explain the performance difference. That's why i repeatedly asked
> > > Evgeniy to use the /very same/ client function for both the epoll
> > > and the kevent test and redo the measurements. The numbers are still
> > > highly suspect - and we are already down from the prior claim of
> > > kevent being almost twice as fast to a 25% difference.
> >
> > There is no gettimeofday() in the running code anymore, and it was
> > not placed in the common server processing code, btw.
> >
> > Ingo, do you really think I will send mails with faked benchmarks? :))
>
> no, i'd not be in this discussion anymore if i thought that. But i do
> think that your benchmark results are extremely sloppy, which makes your
> conclusions based on them essentially useless.
>
> you were hurling quite colorful and strong assertions into this
> discussion, backed up by these numbers, so you should expect at least
> some minimal amount of scrutiny of those numbers.
This discussion was about event-driven vs. thread-driven IO models, and
the threadlet only behaves like an event-driven model because in my tests
there was exactly one threadlet rescheduling per several thousand clients.
Kevent is just a logical extrapolation of the performance of the
event-driven model.
My assumptions were based not on kevent performance, but on the fact
that event delivery is much faster and simpler than thread handling.
Ugh, I'm starting that stupid talk again - let's just jump to the end -
I agree that in real-life high-performance systems both models must be
used.
Peace? :)
> > > [...] The numbers are still highly suspect - and we are already down
> > > from the prior claim of kevent being almost twice as fast to a 25%
> > > difference.
> >
> > Btw, there was never an almost twofold performance increase - epoll in my
> > tests always showed 4-5 thousand requests per second, kevent - up to
> > 7 thousand.
>
> i'm referring to your claim in this mail of yours from 4 days ago for
> example:
>
> http://lkml.org/lkml/2007/2/25/116
>
> "But note, that on my athlon64 3500 test machine kevent is about 7900
> requests per second compared to 4000+ epoll, so expect a challenge."
>
> no matter how i look at it, 7900 is 1.9 times 4000 - which is
> "almost twice".
After your changes epoll increased to 5k.
I can easily reproduce a 6300/4300 split, but cannot get more than 7k for
kevent (with oprofile/idle=poll at least).
I've completed an 800k run:
kevent 4800
epoll 4450
with tons of overflows in 'ab':
Write errors: 0
Total transferred: -1197367296 bytes
HTML transferred: -1478167296 bytes
Requests per second: 4440.67 [#/sec] (mean)
Time per request: 1801.529 [ms] (mean)
Time per request: 0.225 [ms] (mean, across all concurrent requests)
Transfer rate: -6490.62 [Kbytes/sec] received
Any other bench?
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 15:23 ` Evgeniy Polyakov
@ 2007-03-01 15:32 ` Eric Dumazet
2007-03-01 15:41 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 15:32 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thursday 01 March 2007 16:23, Evgeniy Polyakov wrote:
> They are there, since ab runs only 50k requests.
> If I change it to something noticeably more than 50/80k, ab crashes:
> # ab -c8000 -t 600 -n800000000 http://192.168.0.48/
> This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> Copyright 2006 The Apache Software Foundation, http://www.apache.org/
>
> Benchmarking 192.168.0.48 (be patient)
> Segmentation fault
>
> Are there any other tools suitable for such loads?
> I only tested httperf (which is worse, since it uses poll/select) and
> 'ab'.
>
> Btw, the host machine runs at 100% too, so it is possible that the client side is
> broken (too).
I have similar problems here, the ab test just doesn't complete...
I am still investigating with strace and tcpdump.
In the meantime you could just rewrite it (based on epoll please :) ), since
it should be quite easy to do this (reverse of evserver_epoll)
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 14:47 ` Ingo Molnar
@ 2007-03-01 15:23 ` Evgeniy Polyakov
2007-03-01 15:32 ` Eric Dumazet
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 15:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
[-- Attachment #1: Type: text/plain, Size: 2362 bytes --]
On Thu, Mar 01, 2007 at 03:47:17PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > CPU: AMD64 processors, speed 2210.08 MHz (estimated)
> > Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
> > samples % symbol name
> > 195750 67.3097 cpu_idle
> > 14111 4.8521 enter_idle
> > 4979 1.7121 IRQ0x51_interrupt
> > 4765 1.6385 tcp_v4_rcv
>
> the pretty much only meaningful way to measure this is to:
>
> - start a really long 'ab' testrun. Something like "ab -c 8000 -t 600".
> - let the system get into 'steady state': i.e. CPU load at 100%
> - reset the oprofile counters, then start an oprofile run for 60
> seconds.
> - stop the oprofile run.
> - stop the test.
>
> this way there won't be that many 'cpu_idle' entries in your profiles,
> and the profiles between the two event delivery mechanisms will be
> directly comparable.
They are there, since ab runs only 50k requests.
If I change it to something noticeably more than 50/80k, ab crashes:
# ab -c8000 -t 600 -n800000000 http://192.168.0.48/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.0.48 (be patient)
Segmentation fault
Are there any other tools suitable for such loads?
I only tested httperf (which is worse, since it uses poll/select) and
'ab'.
Btw, the host machine runs at 100% too, so it is possible that the client side is
broken (too).
> > In those tests I got epoll perf of about 4400 req/s, kevent was about
> > 5300.
>
> So we are now up to epoll being 83% of kevent's performance - while the
> noise in the numbers seen today alone is around 100% ... Could you update
> the files at the two URLs that you posted before, with the code that you used
> for the above numbers:
And a couple of moments ago I resent a profile with 6100 r/s, and have now
attached one with 6300.
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
Plus http://tservice.net.ru/~s0mbre/archive/kevent/evserver_common.c
which contains the common request handling function
> thanks,
>
> Ingo
--
Evgeniy Polyakov
[-- Attachment #2: profile.kevent --]
[-- Type: text/plain, Size: 13546 bytes --]
CPU: AMD64 processors, speed 2210.08 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples % symbol name
168753 65.1189 cpu_idle
12451 4.8046 enter_idle
4814 1.8576 tcp_v4_rcv
3980 1.5358 IRQ0x51_interrupt
3142 1.2124 tcp_ack
2738 1.0565 kmem_cache_free
2346 0.9053 kfree
2341 0.9034 memset_c
1927 0.7436 csum_partial_copy_generic
1723 0.6649 ip_route_input
1650 0.6367 dev_queue_xmit
1452 0.5603 ip_output
1416 0.5464 handle_IRQ_event
1335 0.5152 ip_rcv
1326 0.5117 tcp_rcv_state_process
1069 0.4125 schedule
960 0.3704 __do_softirq
943 0.3639 tcp_sendmsg
915 0.3531 ip_queue_xmit
907 0.3500 tcp_v4_do_rcv
897 0.3461 fget
894 0.3450 system_call
890 0.3434 csum_partial
877 0.3384 tcp_transmit_skb
845 0.3261 netif_receive_skb
822 0.3172 ip_local_deliver
812 0.3133 kmem_cache_alloc
788 0.3041 local_bh_enable
773 0.2983 __alloc_skb
771 0.2975 kfree_skbmem
764 0.2948 __d_lookup
757 0.2921 __tcp_push_pending_frames
734 0.2832 pfifo_fast_enqueue
720 0.2778 copy_user_generic_string
627 0.2419 net_rx_action
603 0.2327 pfifo_fast_dequeue
586 0.2261 ret_from_intr
562 0.2169 __link_path_walk
561 0.2165 sock_wfree
549 0.2118 __fput
547 0.2111 __kfree_skb
543 0.2095 get_unused_fd
534 0.2061 number
527 0.2034 sysret_check
516 0.1991 preempt_schedule
508 0.1960 skb_clone
496 0.1914 tcp_parse_options
487 0.1879 _atomic_dec_and_lock
470 0.1814 tcp_poll
469 0.1810 __ip_route_output_key
466 0.1798 rt_hash_code
464 0.1790 tcp_recvmsg
421 0.1625 dput
420 0.1621 tcp_rcv_established
412 0.1590 __tcp_select_window
407 0.1571 exit_idle
394 0.1520 rb_erase
381 0.1470 sys_close
375 0.1447 __mod_timer
365 0.1408 d_alloc
363 0.1401 mask_and_ack_8259A
335 0.1293 lock_timer_base
315 0.1216 cache_alloc_refill
307 0.1185 ret_from_sys_call
300 0.1158 do_path_lookup
299 0.1154 eth_type_trans
298 0.1150 find_next_zero_bit
294 0.1134 tcp_data_queue
286 0.1104 dentry_iput
285 0.1100 ip_append_data
263 0.1015 thread_return
257 0.0992 __dentry_open
255 0.0984 sock_recvmsg
255 0.0984 tcp_rtt_estimator
252 0.0972 sys_fcntl
250 0.0965 tcp_current_mss
248 0.0957 sk_stream_mem_schedule
240 0.0926 call_softirq
233 0.0899 sys_recvfrom
229 0.0884 cache_grow
221 0.0853 vsnprintf
215 0.0830 tcp_send_fin
214 0.0826 do_generic_mapping_read
213 0.0822 call_rcu
213 0.0822 common_interrupt
203 0.0783 do_lookup
196 0.0756 inotify_dentry_parent_queue_event
191 0.0737 memcpy_c
188 0.0725 filp_close
180 0.0695 release_sock
180 0.0695 sock_def_readable
178 0.0687 get_page_from_freelist
177 0.0683 do_sys_open
174 0.0671 restore_args
172 0.0664 strncpy_from_user
167 0.0644 fget_light
162 0.0625 clear_inode
161 0.0621 link_path_walk
159 0.0614 generic_drop_inode
157 0.0606 get_empty_filp
156 0.0602 __skb_checksum_complete
153 0.0590 del_timer
152 0.0587 update_send_head
151 0.0583 percpu_counter_mod
150 0.0579 current_fs_time
150 0.0579 schedule_timeout
150 0.0579 skb_checksum
149 0.0575 fd_install
145 0.0560 sock_close
143 0.0552 try_to_wake_up
142 0.0548 generic_permission
135 0.0521 __put_unused_fd
133 0.0513 new_inode
132 0.0509 half_md4_transform
131 0.0506 alloc_inode
130 0.0502 bictcp_cong_avoid
130 0.0502 memcmp
127 0.0490 tcp_init_tso_segs
126 0.0486 tcp_sync_mss
125 0.0482 __do_page_cache_readahead
125 0.0482 find_get_page
123 0.0475 lookup_mnt
117 0.0451 rb_insert_color
114 0.0440 tcp_v4_send_check
112 0.0432 mod_timer
109 0.0421 page_cache_readahead
101 0.0390 __path_lookup_intent_open
100 0.0386 __wake_up_bit
100 0.0386 may_open
100 0.0386 tcp_snd_test
97 0.0374 tcp_check_space
96 0.0370 expand_files
96 0.0370 skb_copy_datagram_iovec
95 0.0367 getname
95 0.0367 igrab
94 0.0363 open_namei
93 0.0359 groups_search
92 0.0355 dnotify_flush
91 0.0351 locks_remove_posix
91 0.0351 memmove
90 0.0347 sk_reset_timer
89 0.0343 tcp_send_ack
88 0.0340 copy_page_c
88 0.0340 tcp_select_initial_window
87 0.0336 sock_common_recvmsg
85 0.0328 sock_release
83 0.0320 IRQ0x20_interrupt
83 0.0320 file_free_rcu
80 0.0309 rw_verify_area
79 0.0305 d_instantiate
79 0.0305 permission
79 0.0305 put_page
77 0.0297 cond_resched
77 0.0297 get_task_mm
77 0.0297 touch_atime
75 0.0289 __follow_mount
75 0.0289 inotify_inode_queue_event
74 0.0286 file_move
73 0.0282 copy_to_user
71 0.0274 tcp_v4_tw_remember_stamp
69 0.0266 wake_up_inode
65 0.0251 prepare_to_wait
65 0.0251 sockfd_lookup
65 0.0251 tcp_event_data_recv
64 0.0247 file_kill
64 0.0247 fput
62 0.0239 __handle_mm_fault
62 0.0239 tcp_setsockopt
61 0.0235 sock_sendmsg
60 0.0232 __wake_up
59 0.0228 page_fault
57 0.0220 locks_remove_flock
54 0.0208 sk_stream_rfree
54 0.0208 sprintf
53 0.0205 inet_sendmsg
53 0.0205 retint_kernel
51 0.0197 iret_label
51 0.0197 tcp_cwnd_validate
49 0.0189 tcp_rcv_space_adjust
48 0.0185 inode_init_once
48 0.0185 mutex_unlock
45 0.0174 finish_wait
45 0.0174 mntput_no_expire
44 0.0170 __delay
44 0.0170 __tcp_ack_snd_check
43 0.0166 inet_sock_destruct
42 0.0162 try_to_del_timer_sync
41 0.0158 free_hot_cold_page
41 0.0158 memset
40 0.0154 __rb_rotate_left
40 0.0154 init_once
40 0.0154 sys_open
40 0.0154 tcp_unhash
39 0.0150 generic_file_open
39 0.0150 tcp_cong_avoid
38 0.0147 __lookup_mnt
38 0.0147 bit_waitqueue
36 0.0139 clear_page_c
36 0.0139 iput
33 0.0127 in_group_p
33 0.0127 inet_sk_rebuild_header
33 0.0127 sock_fasync
33 0.0127 tcp_init_cwnd
32 0.0123 memcpy_toiovec
32 0.0123 sk_stop_timer
32 0.0123 unmap_vmas
31 0.0120 blockable_page_cache_readahead
30 0.0116 _spin_lock_bh
30 0.0116 inet_getname
29 0.0112 __put_user_8
28 0.0108 copy_from_user
27 0.0104 do_filp_open
26 0.0100 tcp_v4_destroy_sock
25 0.0096 __alloc_pages
24 0.0093 apic_timer_interrupt
24 0.0093 do_page_fault
24 0.0093 file_ra_state_init
24 0.0093 hrtimer_run_queues
24 0.0093 vfs_permission
23 0.0089 tcp_slow_start
23 0.0089 zone_watermark_ok
22 0.0085 mutex_lock
21 0.0081 destroy_inode
21 0.0081 init_timer
21 0.0081 invalidate_inode_buffers
21 0.0081 sk_alloc
19 0.0073 exit_intr
16 0.0062 copy_page_range
15 0.0058 find_vma
14 0.0054 retint_swapgs
14 0.0054 wake_up_bit
13 0.0050 __down_read
12 0.0046 do_wp_page
12 0.0046 mark_page_accessed
11 0.0042 __get_user_4
11 0.0042 __tcp_checksum_complete_user
10 0.0039 vm_normal_page
9 0.0035 __up_read
8 0.0031 rcu_start_batch
8 0.0031 retint_restore_args
8 0.0031 timespec_trunc
7 0.0027 __find_get_block
7 0.0027 error_exit
7 0.0027 flush_tlb_page
6 0.0023 _write_lock_bh
6 0.0023 copy_process
6 0.0023 free_hot_page
6 0.0023 inode_has_buffers
6 0.0023 kmem_flagcheck
6 0.0023 retint_check
5 0.0019 __rb_rotate_right
5 0.0019 del_timer_sync
5 0.0019 nameidata_to_filp
4 0.0015 __down_read_trylock
4 0.0015 __getblk
4 0.0015 __mutex_init
4 0.0015 __set_page_dirty_nobuffers
4 0.0015 _read_lock_irqsave
4 0.0015 do_mmap_pgoff
4 0.0015 error_sti
4 0.0015 free_pgd_range
4 0.0015 load_elf_binary
4 0.0015 mmput
3 0.0012 __iget
3 0.0012 _spin_lock_irqsave
3 0.0012 bio_endio
3 0.0012 copy_strings
3 0.0012 cpuset_update_task_memory_state
3 0.0012 d_lookup
3 0.0012 exit_itimers
3 0.0012 filemap_nopage
3 0.0012 find_vma_prepare
3 0.0012 free_pages
3 0.0012 generic_fillattr
3 0.0012 generic_make_request
3 0.0012 memcpy
3 0.0012 prio_tree_insert
3 0.0012 put_unused_fd
3 0.0012 run_local_timers
3 0.0012 unmap_region
3 0.0012 vma_prio_tree_remove
2 7.7e-04 __clear_user
2 7.7e-04 __find_get_block_slow
2 7.7e-04 __strnlen_user
2 7.7e-04 __vm_enough_memory
2 7.7e-04 add_to_page_cache
2 7.7e-04 add_wait_queue
2 7.7e-04 alloc_pages_current
2 7.7e-04 anon_vma_prepare
2 7.7e-04 dnotify_parent
2 7.7e-04 do_exit
2 7.7e-04 do_munmap
2 7.7e-04 do_select
2 7.7e-04 dup_fd
2 7.7e-04 exit_mmap
2 7.7e-04 find_mergeable_anon_vma
2 7.7e-04 free_pgtables
2 7.7e-04 lru_cache_add_active
2 7.7e-04 mempool_free
2 7.7e-04 page_add_file_rmap
2 7.7e-04 page_remove_rmap
2 7.7e-04 page_waitqueue
2 7.7e-04 path_release
2 7.7e-04 prio_tree_replace
2 7.7e-04 pty_chars_in_buffer
2 7.7e-04 remove_vma
2 7.7e-04 retint_with_reschedule
2 7.7e-04 run_workqueue
2 7.7e-04 seq_puts
2 7.7e-04 split_vma
2 7.7e-04 sys_mmap
2 7.7e-04 sys_mprotect
2 7.7e-04 sys_rt_sigprocmask
2 7.7e-04 truncate_inode_pages_range
2 7.7e-04 vfs_lstat_fd
2 7.7e-04 vm_acct_memory
2 7.7e-04 vma_adjust
2 7.7e-04 vma_link
1 3.9e-04 __block_prepare_write
1 3.9e-04 __bread
1 3.9e-04 __d_path
1 3.9e-04 __down_write
1 3.9e-04 __down_write_nested
1 3.9e-04 __end_that_request_first
1 3.9e-04 __free_pages
1 3.9e-04 __generic_file_aio_write_nolock
1 3.9e-04 __make_request
1 3.9e-04 __page_set_anon_rmap
1 3.9e-04 __pagevec_lru_add_active
1 3.9e-04 __put_user_4
1 3.9e-04 __remove_shared_vm_struct
1 3.9e-04 __up_write
1 3.9e-04 __vma_link_rb
1 3.9e-04 _read_lock_bh
1 3.9e-04 activate_page
1 3.9e-04 anon_vma_unlink
1 3.9e-04 block_write_full_page
1 3.9e-04 cap_bprm_apply_creds
1 3.9e-04 cap_vm_enough_memory
1 3.9e-04 clear_page_dirty_for_io
1 3.9e-04 cond_resched_lock
1 3.9e-04 cp_new_stat
1 3.9e-04 create_empty_buffers
1 3.9e-04 dentry_unhash
1 3.9e-04 do_brk
1 3.9e-04 do_mpage_readpage
1 3.9e-04 do_mremap
1 3.9e-04 do_sigaction
1 3.9e-04 do_wait
1 3.9e-04 drop_buffers
1 3.9e-04 eligible_child
1 3.9e-04 exit_sem
1 3.9e-04 file_read_actor
1 3.9e-04 filldir64
1 3.9e-04 flush_signal_handlers
1 3.9e-04 generic_block_bmap
1 3.9e-04 generic_file_llseek
1 3.9e-04 generic_file_mmap
1 3.9e-04 get_index
1 3.9e-04 get_signal_to_deliver
1 3.9e-04 get_stack
1 3.9e-04 get_unmapped_area
1 3.9e-04 get_vma_policy
1 3.9e-04 init_request_from_bio
1 3.9e-04 inode_setattr
1 3.9e-04 is_bad_inode
1 3.9e-04 kref_put
1 3.9e-04 lru_add_drain
1 3.9e-04 may_delete
1 3.9e-04 mm_release
1 3.9e-04 n_tty_ioctl
1 3.9e-04 nr_blockdev_pages
1 3.9e-04 page_add_new_anon_rmap
1 3.9e-04 pipe_poll
1 3.9e-04 preempt_schedule_irq
1 3.9e-04 proc_lookup
1 3.9e-04 ptregscall_common
1 3.9e-04 put_files_struct
1 3.9e-04 radix_tree_tag_clear
1 3.9e-04 rb_prev
1 3.9e-04 recalc_bh_state
1 3.9e-04 release_pages
1 3.9e-04 sched_exec
1 3.9e-04 sched_fork
1 3.9e-04 seq_escape
1 3.9e-04 set_close_on_exec
1 3.9e-04 set_fs_pwd
1 3.9e-04 set_personality_64bit
1 3.9e-04 shrink_dcache_parent
1 3.9e-04 sock_map_fd
1 3.9e-04 strchr
1 3.9e-04 submit_bio
1 3.9e-04 sys_dup2
1 3.9e-04 sys_faccessat
1 3.9e-04 sys_munmap
1 3.9e-04 sys_rt_sigaction
1 3.9e-04 tty_write
1 3.9e-04 unlink_file_vma
1 3.9e-04 unlock_page
1 3.9e-04 vfs_read
1 3.9e-04 vfs_readdir
1 3.9e-04 vma_merge
1 3.9e-04 vma_prio_tree_insert
1 3.9e-04 wake_up_new_task
1 3.9e-04 writeback_inodes
1 3.9e-04 zonelist_policy
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 14:54 ` Evgeniy Polyakov
@ 2007-03-01 15:09 ` Ingo Molnar
2007-03-01 15:36 ` Evgeniy Polyakov
2007-03-01 19:31 ` Davide Libenzi
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 15:09 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > > I can tell you that the problem (at least on my machine) comes from :
> > >
> > > gettimeofday(&tm, NULL);
> > >
> > > in evserver_epoll.c
> >
> > yeah, that's another difference - especially if it's something like
> > an Athlon64 and gettimeofday falls back to pm-timer, that could
> > explain the performance difference. That's why i repeatedly asked
> > Evgeniy to use the /very same/ client function for both the epoll
> > and the kevent test and redo the measurements. The numbers are still
> > highly suspect - and we are already down from the prior claim of
> > kevent being almost twice as fast to a 25% difference.
>
> There is no gettimeofday() in the running code anymore, and it was
> not placed in the common server processing code, btw.
>
> Ingo, do you really think I will send mails with faked benchmarks? :))
no, i'd not be in this discussion anymore if i thought that. But i do
think that your benchmark results are extremely sloppy, which makes your
conclusions based on them essentially useless.
you were hurling quite colorful and strong assertions into this
discussion, backed up by these numbers, so you should expect at least
some minimal amount of scrutiny of those numbers.
> > [...] The numbers are still highly suspect - and we are already down
> > from the prior claim of kevent being almost twice as fast to a 25%
> > difference.
>
> Btw, there was never an almost twofold performance increase - epoll in my
> tests always showed 4-5 thousand requests per second, kevent - up to
> 7 thousand.
i'm referring to your claim in this mail of yours from 4 days ago for
example:
http://lkml.org/lkml/2007/2/25/116
"But note, that on my athlon64 3500 test machine kevent is about 7900
requests per second compared to 4000+ epoll, so expect a challenge."
no matter how i look at it, 7900 is 1.9 times 4000 - which is
"almost twice".
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 14:43 ` Evgeniy Polyakov
2007-03-01 14:47 ` Ingo Molnar
@ 2007-03-01 14:57 ` Evgeniy Polyakov
1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 14:57 UTC (permalink / raw)
To: Eric Dumazet
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
[-- Attachment #1: Type: text/plain, Size: 1130 bytes --]
On Thu, Mar 01, 2007 at 05:43:50PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> On Thu, Mar 01, 2007 at 02:12:50PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > On Thursday 01 March 2007 12:47, Evgeniy Polyakov wrote:
> > >
> > > Could you provide at least a remote way to find it?
> > >
> >
> > Sure :)
> >
> > > I only found the same problem at
> > > http://lkml.org/lkml/2006/10/27/3
> > >
> > > but without any hints to solve the problem.
> > >
> > > I will try CVS oprofile, if it works I will provide details of course.
> > >
> >
> > # cat CVS/Root
> > CVS/Root::pserver:anonymous@oprofile.cvs.sourceforge.net:/cvsroot/oprofile
> >
> > # cvs diff >/tmp/oprofile.diff
> >
> > Hope it helps
>
> One of the hunks failed, since it was in CVS already.
> After setting up some mirrors, I've installed all bits needed for
> oprofile.
> Attached kevent and epoll profiles.
>
> In those tests I got epoll perf of about 4400 req/s, kevent was about 5300.
Attached kevent profile with 6100 req/sec.
They all look exactly the same to me - there are no kevent or epoll
functions in the profiles at all.
--
Evgeniy Polyakov
[-- Attachment #2: profile.kevent --]
[-- Type: text/plain, Size: 12427 bytes --]
CPU: AMD64 processors, speed 2210.08 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples % symbol name
103425 55.0868 cpu_idle
8214 4.3750 enter_idle
4712 2.5097 tcp_v4_rcv
3805 2.0266 IRQ0x51_interrupt
3154 1.6799 tcp_ack
2777 1.4791 kmem_cache_free
2286 1.2176 kfree
2155 1.1478 memset_c
1747 0.9305 csum_partial_copy_generic
1710 0.9108 ip_output
1620 0.8629 dev_queue_xmit
1551 0.8261 handle_IRQ_event
1391 0.7409 schedule
1373 0.7313 tcp_rcv_state_process
1337 0.7121 ip_rcv
1100 0.5859 ip_queue_xmit
965 0.5140 ip_route_input
939 0.5001 tcp_sendmsg
935 0.4980 __do_softirq
923 0.4916 ip_local_deliver
916 0.4879 csum_partial
905 0.4820 system_call
889 0.4735 tcp_transmit_skb
884 0.4708 tcp_v4_do_rcv
812 0.4325 netif_receive_skb
778 0.4144 __d_lookup
760 0.4048 __alloc_skb
747 0.3979 local_bh_enable
737 0.3925 __tcp_push_pending_frames
702 0.3739 kfree_skbmem
698 0.3718 pfifo_fast_enqueue
678 0.3611 kmem_cache_alloc
651 0.3467 fget
640 0.3409 pfifo_fast_dequeue
637 0.3393 net_rx_action
629 0.3350 __link_path_walk
602 0.3206 preempt_schedule
599 0.3190 __fput
594 0.3164 sock_wfree
589 0.3137 copy_user_generic_string
579 0.3084 ret_from_intr
559 0.2977 _atomic_dec_and_lock
552 0.2940 __kfree_skb
549 0.2924 skb_clone
514 0.2738 number
494 0.2631 rt_hash_code
473 0.2519 dput
466 0.2482 tcp_parse_options
446 0.2376 tcp_rcv_established
433 0.2306 tcp_recvmsg
431 0.2296 tcp_poll
417 0.2221 get_unused_fd
417 0.2221 sysret_check
377 0.2008 rb_erase
364 0.1939 __tcp_select_window
363 0.1933 lock_timer_base
347 0.1848 __mod_timer
329 0.1752 ip_append_data
326 0.1736 exit_idle
325 0.1731 ret_from_sys_call
317 0.1688 d_alloc
302 0.1609 do_path_lookup
295 0.1571 __ip_route_output_key
290 0.1545 eth_type_trans
285 0.1518 sys_close
283 0.1507 cache_alloc_refill
282 0.1502 mask_and_ack_8259A
275 0.1465 thread_return
267 0.1422 call_softirq
265 0.1411 tcp_rtt_estimator
260 0.1385 tcp_data_queue
258 0.1374 __dentry_open
258 0.1374 vsnprintf
255 0.1358 dentry_iput
255 0.1358 tcp_current_mss
250 0.1332 sk_stream_mem_schedule
239 0.1273 find_next_zero_bit
233 0.1241 cache_grow
233 0.1241 tcp_send_fin
222 0.1182 try_to_wake_up
219 0.1166 sock_recvmsg
216 0.1150 do_generic_mapping_read
211 0.1124 sys_fcntl
209 0.1113 get_empty_filp
207 0.1103 call_rcu
206 0.1097 strncpy_from_user
195 0.1039 sock_def_readable
190 0.1012 generic_drop_inode
190 0.1012 restore_args
184 0.0980 get_page_from_freelist
182 0.0969 sys_recvfrom
176 0.0937 do_lookup
174 0.0927 common_interrupt
171 0.0911 fget_light
167 0.0889 new_inode
167 0.0889 percpu_counter_mod
166 0.0884 link_path_walk
166 0.0884 skb_checksum
160 0.0852 fput
160 0.0852 release_sock
159 0.0847 memcpy_c
158 0.0842 memcmp
157 0.0836 __skb_checksum_complete
157 0.0836 tcp_init_tso_segs
148 0.0788 half_md4_transform
144 0.0767 tcp_v4_send_check
142 0.0756 del_timer
139 0.0740 current_fs_time
135 0.0719 update_send_head
129 0.0687 do_sys_open
126 0.0671 rb_insert_color
125 0.0666 bictcp_cong_avoid
124 0.0660 __put_unused_fd
123 0.0655 schedule_timeout
121 0.0644 clear_inode
118 0.0628 sock_close
116 0.0618 __do_page_cache_readahead
115 0.0613 alloc_inode
115 0.0613 lookup_mnt
114 0.0607 tcp_snd_test
113 0.0602 mod_timer
112 0.0597 generic_permission
109 0.0581 tcp_select_initial_window
101 0.0538 locks_remove_posix
98 0.0522 fd_install
97 0.0517 find_get_page
97 0.0517 sk_reset_timer
94 0.0501 try_to_del_timer_sync
93 0.0495 __follow_mount
92 0.0490 igrab
91 0.0485 page_cache_readahead
90 0.0479 dnotify_flush
90 0.0479 prepare_to_wait
90 0.0479 put_page
89 0.0474 expand_files
89 0.0474 getname
88 0.0469 inotify_dentry_parent_queue_event
88 0.0469 tcp_sync_mss
87 0.0463 __path_lookup_intent_open
86 0.0458 file_free_rcu
85 0.0453 may_open
85 0.0453 skb_copy_datagram_iovec
84 0.0447 IRQ0x20_interrupt
82 0.0437 tcp_cwnd_validate
81 0.0431 copy_page_c
81 0.0431 d_instantiate
81 0.0431 groups_search
80 0.0426 permission
79 0.0421 __handle_mm_fault
79 0.0421 file_kill
79 0.0421 get_task_mm
76 0.0405 rw_verify_area
74 0.0394 copy_to_user
73 0.0389 __wake_up_bit
72 0.0383 __wake_up
72 0.0383 cond_resched
72 0.0383 mntput_no_expire
69 0.0368 memmove
69 0.0368 sock_sendmsg
69 0.0368 tcp_setsockopt
68 0.0362 open_namei
68 0.0362 retint_kernel
67 0.0357 wake_up_inode
66 0.0352 inet_sendmsg
66 0.0352 tcp_event_data_recv
65 0.0346 generic_file_open
64 0.0341 touch_atime
63 0.0336 sock_release
63 0.0336 tcp_send_ack
62 0.0330 file_move
62 0.0330 filp_close
57 0.0304 mutex_unlock
55 0.0293 inet_sk_rebuild_header
55 0.0293 page_fault
55 0.0293 sockfd_lookup
54 0.0288 memset
54 0.0288 sk_stream_rfree
52 0.0277 __tcp_ack_snd_check
52 0.0277 inode_init_once
52 0.0277 sock_common_recvmsg
52 0.0277 tcp_check_space
51 0.0272 sys_open
49 0.0261 iret_label
47 0.0250 locks_remove_flock
46 0.0245 __rb_rotate_left
46 0.0245 tcp_v4_tw_remember_stamp
46 0.0245 unmap_vmas
45 0.0240 finish_wait
44 0.0234 inet_sock_destruct
44 0.0234 sprintf
43 0.0229 tcp_cong_avoid
42 0.0224 inotify_inode_queue_event
41 0.0218 __alloc_pages
41 0.0218 __lookup_mnt
41 0.0218 _spin_lock_bh
41 0.0218 tcp_init_cwnd
38 0.0202 clear_page_c
38 0.0202 tcp_unhash
37 0.0197 bit_waitqueue
37 0.0197 memcpy_toiovec
36 0.0192 iput
35 0.0186 do_filp_open
35 0.0186 init_timer
35 0.0186 sock_fasync
31 0.0165 __delay
31 0.0165 exit_intr
31 0.0165 vfs_permission
30 0.0160 sk_alloc
29 0.0154 copy_from_user
28 0.0149 free_hot_cold_page
27 0.0144 __put_user_8
27 0.0144 del_timer_sync
27 0.0144 hrtimer_run_queues
26 0.0138 init_once
26 0.0138 sk_stop_timer
25 0.0133 tcp_rcv_space_adjust
25 0.0133 tcp_v4_destroy_sock
23 0.0123 copy_page_range
23 0.0123 find_vma
22 0.0117 blockable_page_cache_readahead
22 0.0117 invalidate_inode_buffers
21 0.0112 do_page_fault
21 0.0112 do_wp_page
20 0.0107 in_group_p
20 0.0107 inet_getname
20 0.0107 mutex_lock
20 0.0107 tcp_slow_start
20 0.0107 zone_watermark_ok
18 0.0096 file_ra_state_init
17 0.0091 mark_page_accessed
16 0.0085 __find_get_block
15 0.0080 rt_run_flush
14 0.0075 __down_read
14 0.0075 __up_read
14 0.0075 apic_timer_interrupt
11 0.0059 destroy_inode
11 0.0059 flush_tlb_page
11 0.0059 vm_normal_page
10 0.0053 error_exit
10 0.0053 memcpy
9 0.0048 __get_user_4
9 0.0048 retint_restore_args
9 0.0048 retint_swapgs
8 0.0043 __rb_rotate_right
8 0.0043 run_local_timers
7 0.0037 error_sti
7 0.0037 inode_has_buffers
7 0.0037 nameidata_to_filp
7 0.0037 timespec_trunc
6 0.0032 __tcp_checksum_complete_user
6 0.0032 _spin_lock_irqsave
6 0.0032 _write_lock_bh
6 0.0032 filemap_nopage
6 0.0032 wake_up_bit
5 0.0027 __getblk
5 0.0027 __iget
5 0.0027 _read_lock_irqsave
5 0.0027 find_vma_prepare
5 0.0027 mmput
4 0.0021 __free_pages
4 0.0021 __mutex_init
4 0.0021 do_mmap_pgoff
4 0.0021 lru_cache_add_active
3 0.0016 __down_read_trylock
3 0.0016 __find_get_block_slow
3 0.0016 __make_request
3 0.0016 __mark_inode_dirty
3 0.0016 __remove_shared_vm_struct
3 0.0016 copy_strings
3 0.0016 do_notify_resume
3 0.0016 free_hot_page
3 0.0016 generic_file_aio_read
3 0.0016 generic_make_request
3 0.0016 load_elf_binary
3 0.0016 page_waitqueue
3 0.0016 prio_tree_insert
3 0.0016 prio_tree_replace
3 0.0016 rcu_start_batch
3 0.0016 unlock_page
3 0.0016 vma_link
3 0.0016 vma_merge
3 0.0016 zonelist_policy
2 0.0011 __block_prepare_write
2 0.0011 __pagevec_lru_add_active
2 0.0011 __put_user_4
2 0.0011 __vma_link_rb
2 0.0011 _read_lock_bh
2 0.0011 alloc_pages_current
2 0.0011 anon_vma_unlink
2 0.0011 copy_process
2 0.0011 d_lookup
2 0.0011 dentry_unhash
2 0.0011 do_mpage_readpage
2 0.0011 do_sigaction
2 0.0011 error_entry
2 0.0011 flush_old_exec
2 0.0011 kmem_flagcheck
2 0.0011 mempool_free
2 0.0011 page_add_file_rmap
2 0.0011 run_workqueue
2 0.0011 sched_balance_self
2 0.0011 set_personality_64bit
2 0.0011 strnlen_user
2 0.0011 sys_mprotect
2 0.0011 sys_rt_sigaction
2 0.0011 vfs_lstat_fd
2 0.0011 worker_thread
1 5.3e-04 __brelse
1 5.3e-04 __d_path
1 5.3e-04 __down_write_nested
1 5.3e-04 __lookup_hash
1 5.3e-04 __page_set_anon_rmap
1 5.3e-04 __strnlen_user
1 5.3e-04 __up_write
1 5.3e-04 __vm_enough_memory
1 5.3e-04 add_timer_randomness
1 5.3e-04 anon_vma_link
1 5.3e-04 bio_alloc_bioset
1 5.3e-04 blk_recount_segments
1 5.3e-04 can_vma_merge_after
1 5.3e-04 cap_bprm_apply_creds
1 5.3e-04 cond_resched_softirq
1 5.3e-04 copy_from_read_buf
1 5.3e-04 cpuset_fork
1 5.3e-04 cpuset_update_task_memory_state
1 5.3e-04 dnotify_parent
1 5.3e-04 do_exit
1 5.3e-04 do_munmap
1 5.3e-04 do_select
1 5.3e-04 do_wait
1 5.3e-04 end_bio_bh_io_sync
1 5.3e-04 exit_sem
1 5.3e-04 file_read_actor
1 5.3e-04 filldir64
1 5.3e-04 find_or_create_page
1 5.3e-04 find_vma_prev
1 5.3e-04 flush_thread
1 5.3e-04 free_pgd_range
1 5.3e-04 free_pgtables
1 5.3e-04 generic_fillattr
1 5.3e-04 get_unmapped_area
1 5.3e-04 inode_sub_bytes
1 5.3e-04 kthread_should_stop
1 5.3e-04 mm_init
1 5.3e-04 path_release
1 5.3e-04 pipe_release
1 5.3e-04 prio_tree_remove
1 5.3e-04 profile_munmap
1 5.3e-04 remove_wait_queue
1 5.3e-04 retint_check
1 5.3e-04 seq_puts
1 5.3e-04 strchr
1 5.3e-04 sys_brk
1 5.3e-04 sys_execve
1 5.3e-04 sys_mmap
1 5.3e-04 sys_munmap
1 5.3e-04 sys_read
1 5.3e-04 sys_rmdir
1 5.3e-04 sys_rt_sigprocmask
1 5.3e-04 truncate_complete_page
1 5.3e-04 tty_ldisc_deref
1 5.3e-04 tty_ldisc_try
1 5.3e-04 tty_termios_baud_rate
1 5.3e-04 tty_write
1 5.3e-04 udp_rcv
1 5.3e-04 vfs_rmdir
1 5.3e-04 vfs_stat_fd
1 5.3e-04 vfs_write
1 5.3e-04 vm_acct_memory
1 5.3e-04 vm_stat_account
1 5.3e-04 vma_adjust
1 5.3e-04 vma_prio_tree_remove
1 5.3e-04 wake_up_new_task
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 14:16 ` Ingo Molnar
2007-03-01 14:31 ` Eric Dumazet
@ 2007-03-01 14:54 ` Evgeniy Polyakov
2007-03-01 15:09 ` Ingo Molnar
2007-03-01 19:31 ` Davide Libenzi
1 sibling, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 14:54 UTC (permalink / raw)
To: Ingo Molnar
Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 03:16:37PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
>
> > I can tell you that the problem (at least on my machine) comes from :
> >
> > gettimeofday(&tm, NULL);
> >
> > in evserver_epoll.c
>
> yeah, that's another difference - especially if it's something like an
> Athlon64 and gettimeofday falls back to pm-timer, that could explain the
> performance difference. That's why i repeatedly asked Evgeniy to use the
> /very same/ client function for both the epoll and the kevent test and
> redo the measurements. The numbers are still highly suspect - and we are
> already down from the prior claim of kevent being almost twice as fast
> to a 25% difference.
There is no gettimeofday() in the running code anymore, and it was
not placed in the common server processing code, btw.
Ingo, do you really think I will send mails with faked benchmarks? :))
Btw, there was never an almost twofold performance increase - epoll in my
tests always showed 4-5 thousand requests per second, kevent - up to 7
thousand.
That is starting to look like ghost hunting, Ingo - you already said that
you do not see any need for kevent; have you changed your opinion on
that?
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 14:43 ` Evgeniy Polyakov
@ 2007-03-01 14:47 ` Ingo Molnar
2007-03-01 15:23 ` Evgeniy Polyakov
2007-03-01 14:57 ` Evgeniy Polyakov
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 14:47 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Eric Dumazet, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> CPU: AMD64 processors, speed 2210.08 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
> samples % symbol name
> 195750 67.3097 cpu_idle
> 14111 4.8521 enter_idle
> 4979 1.7121 IRQ0x51_interrupt
> 4765 1.6385 tcp_v4_rcv
the pretty much only meaningful way to measure this is to:
- start a really long 'ab' testrun. Something like "ab -c 8000 -t 600".
- let the system get into 'steady state': i.e. CPU load at 100%
- reset the oprofile counters, then start an oprofile run for 60
seconds.
- stop the oprofile run.
- stop the test.
this way there won't be that many 'cpu_idle' entries in your profiles,
and the profiles between the two event delivery mechanisms will be
directly comparable.
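(With the legacy opcontrol front end that sequence would look roughly
like the commands below; the flags and the vmlinux path are illustrative,
not a record of what was actually run:

# opcontrol --vmlinux=/path/to/vmlinux
# opcontrol --start
    ... start the long "ab -c 8000 -t 600" run, wait for 100% CPU ...
# opcontrol --reset
    ... let it run for ~60 seconds ...
# opcontrol --dump
# opcontrol --stop
# opreport -l

opreport -l then produces the symbol-level breakdown of the kind
attached elsewhere in this thread.)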
> In those tests I got epoll perf of about 4400 req/s, kevent was about
> 5300.
So we are now up to epoll being 83% of kevent's performance - while the
noise in the numbers seen today alone is around 100% ... Could you update
the files at the two URLs that you posted before, with the code that you used
for the above numbers:
http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
thanks,
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 13:12 ` Eric Dumazet
@ 2007-03-01 14:43 ` Evgeniy Polyakov
2007-03-01 14:47 ` Ingo Molnar
2007-03-01 14:57 ` Evgeniy Polyakov
0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 14:43 UTC (permalink / raw)
To: Eric Dumazet
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
[-- Attachment #1: Type: text/plain, Size: 890 bytes --]
On Thu, Mar 01, 2007 at 02:12:50PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Thursday 01 March 2007 12:47, Evgeniy Polyakov wrote:
> >
> > Could you provide at least a remote way to find it?
> >
>
> Sure :)
>
> > I only found the same problem at
> > http://lkml.org/lkml/2006/10/27/3
> >
> > but without any hints to solve the problem.
> >
> > I will try CVS oprofile, if it works I will provide details of course.
> >
>
> # cat CVS/Root
> CVS/Root::pserver:anonymous@oprofile.cvs.sourceforge.net:/cvsroot/oprofile
>
> # cvs diff >/tmp/oprofile.diff
>
> Hope it helps
One of the hunks failed, since it was in CVS already.
After setting up some mirrors, I've installed all bits needed for
oprofile.
Attached kevent and epoll profiles.
In those tests I got epoll perf of about 4400 req/s, kevent was about 5300.
epoll does not contain a gettimeofday() call.
--
Evgeniy Polyakov
[-- Attachment #2: profile.epoll --]
[-- Type: text/plain, Size: 14192 bytes --]
CPU: AMD64 processors, speed 2210.08 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples % symbol name
195750 67.3097 cpu_idle
14111 4.8521 enter_idle
4979 1.7121 IRQ0x51_interrupt
4765 1.6385 tcp_v4_rcv
3316 1.1402 tcp_ack
3138 1.0790 kmem_cache_free
2619 0.9006 kfree
2323 0.7988 memset_c
1747 0.6007 schedule
1682 0.5784 csum_partial_copy_generic
1646 0.5660 ip_output
1563 0.5374 tcp_rcv_state_process
1506 0.5178 dev_queue_xmit
1412 0.4855 handle_IRQ_event
1266 0.4353 ip_rcv
1026 0.3528 ip_queue_xmit
1004 0.3452 __do_softirq
1001 0.3442 ip_local_deliver
906 0.3115 csum_partial
902 0.3102 ip_route_input
889 0.3057 __d_lookup
880 0.3026 kmem_cache_alloc
847 0.2912 tcp_v4_do_rcv
841 0.2892 netif_receive_skb
830 0.2854 tcp_sendmsg
819 0.2816 system_call
788 0.2710 kfree_skbmem
780 0.2682 tcp_transmit_skb
742 0.2551 preempt_schedule
731 0.2514 __tcp_push_pending_frames
699 0.2404 __link_path_walk
687 0.2362 pfifo_fast_dequeue
672 0.2311 local_bh_enable
657 0.2259 __alloc_skb
650 0.2235 net_rx_action
627 0.2156 pfifo_fast_enqueue
583 0.2005 sock_wfree
571 0.1963 get_unused_fd
547 0.1881 tcp_parse_options
546 0.1877 copy_user_generic_string
533 0.1833 _atomic_dec_and_lock
529 0.1819 number
524 0.1802 ret_from_intr
509 0.1750 skb_clone
507 0.1743 fget
507 0.1743 tcp_rcv_established
498 0.1712 __kfree_skb
492 0.1692 tcp_poll
481 0.1654 rt_hash_code
471 0.1620 dput
454 0.1561 sock_def_readable
422 0.1451 mask_and_ack_8259A
421 0.1448 sysret_check
419 0.1441 __fput
413 0.1420 exit_idle
396 0.1362 ip_append_data
374 0.1286 sock_poll
371 0.1276 tcp_data_queue
359 0.1234 __tcp_select_window
356 0.1224 tcp_recvmsg
348 0.1197 lock_timer_base
340 0.1169 cache_alloc_refill
338 0.1162 thread_return
319 0.1097 sys_close
318 0.1093 do_path_lookup
318 0.1093 ret_from_sys_call
311 0.1069 vsnprintf
307 0.1056 eth_type_trans
303 0.1042 find_next_zero_bit
302 0.1038 __mod_timer
298 0.1025 d_alloc
296 0.1018 rb_erase
293 0.1007 call_softirq
290 0.0997 __dentry_open
276 0.0949 cache_grow
274 0.0942 __ip_route_output_key
273 0.0939 try_to_wake_up
258 0.0887 dentry_iput
258 0.0887 sk_stream_mem_schedule
257 0.0884 do_lookup
244 0.0839 strncpy_from_user
234 0.0805 do_generic_mapping_read
231 0.0794 memcmp
229 0.0787 tcp_current_mss
228 0.0784 tcp_rtt_estimator
214 0.0736 restore_args
205 0.0705 sys_recvfrom
204 0.0701 fput
203 0.0698 tcp_send_fin
200 0.0688 release_sock
193 0.0664 memcpy_c
191 0.0657 common_interrupt
189 0.0650 fget_light
185 0.0636 skb_checksum
182 0.0626 generic_drop_inode
180 0.0619 do_sys_open
174 0.0598 get_page_from_freelist
168 0.0578 link_path_walk
165 0.0567 schedule_timeout
163 0.0560 del_timer
162 0.0557 rb_insert_color
160 0.0550 percpu_counter_mod
159 0.0547 __up_read
155 0.0533 expand_files
154 0.0530 sys_fcntl
150 0.0516 tcp_v4_send_check
146 0.0502 fd_install
145 0.0499 bictcp_cong_avoid
143 0.0492 call_rcu
141 0.0485 __down_read
141 0.0485 sock_close
140 0.0481 copy_page_c
138 0.0475 __skb_checksum_complete
138 0.0475 lookup_mnt
137 0.0471 getname
132 0.0454 generic_permission
131 0.0450 find_get_page
130 0.0447 __do_page_cache_readahead
130 0.0447 update_send_head
127 0.0437 get_empty_filp
126 0.0433 __path_lookup_intent_open
124 0.0426 mod_timer
122 0.0420 half_md4_transform
121 0.0416 page_fault
121 0.0416 tcp_sync_mss
120 0.0413 __wake_up
120 0.0413 current_fs_time
119 0.0409 remove_wait_queue
118 0.0406 groups_search
116 0.0399 __handle_mm_fault
115 0.0395 tcp_send_ack
114 0.0392 get_task_mm
109 0.0375 tcp_snd_test
107 0.0368 new_inode
106 0.0364 sock_release
106 0.0364 tcp_init_tso_segs
104 0.0358 inotify_dentry_parent_queue_event
104 0.0358 try_to_del_timer_sync
102 0.0351 add_wait_queue
99 0.0340 cond_resched
97 0.0334 __follow_mount
96 0.0330 __put_unused_fd
95 0.0327 open_namei
94 0.0323 file_move
93 0.0320 clear_inode
93 0.0320 permission
92 0.0316 rw_verify_area
91 0.0313 may_open
91 0.0313 memmove
86 0.0296 __up_write
86 0.0296 dnotify_flush
86 0.0296 tcp_select_initial_window
84 0.0289 sock_common_recvmsg
83 0.0285 __wake_up_bit
83 0.0285 page_cache_readahead
82 0.0282 IRQ0x20_interrupt
81 0.0279 put_page
77 0.0265 inet_sendmsg
76 0.0261 skb_copy_datagram_iovec
76 0.0261 tcp_init_cwnd
75 0.0258 sock_recvmsg
75 0.0258 tcp_event_data_recv
74 0.0254 inet_sk_rebuild_header
74 0.0254 locks_remove_posix
73 0.0251 d_instantiate
68 0.0234 sock_sendmsg
66 0.0227 __rb_rotate_right
65 0.0224 __down_write_nested
65 0.0224 retint_kernel
64 0.0220 alloc_inode
64 0.0220 clear_page_c
64 0.0220 init_timer
63 0.0217 mntput_no_expire
63 0.0217 sk_reset_timer
63 0.0217 tcp_setsockopt
61 0.0210 tcp_check_space
61 0.0210 unmap_vmas
60 0.0206 filp_close
60 0.0206 tcp_v4_tw_remember_stamp
59 0.0203 tcp_rcv_space_adjust
56 0.0193 inotify_inode_queue_event
54 0.0186 __delay
54 0.0186 touch_atime
53 0.0182 sk_stream_rfree
53 0.0182 sys_open
53 0.0182 tcp_cwnd_validate
50 0.0172 copy_to_user
50 0.0172 file_kill
49 0.0168 generic_file_open
46 0.0158 tcp_cong_avoid
44 0.0151 do_filp_open
44 0.0151 inet_sock_destruct
43 0.0148 __rb_rotate_left
43 0.0148 __tcp_ack_snd_check
43 0.0148 inode_init_once
43 0.0148 sprintf
42 0.0144 exit_intr
42 0.0144 sock_fasync
42 0.0144 wake_up_inode
41 0.0141 memset
39 0.0134 __alloc_pages
39 0.0134 free_hot_cold_page
39 0.0134 vfs_permission
38 0.0131 bit_waitqueue
38 0.0131 do_page_fault
38 0.0131 locks_remove_flock
38 0.0131 tcp_unhash
36 0.0124 del_timer_sync
36 0.0124 iput
36 0.0124 iret_label
35 0.0120 file_free_rcu
35 0.0120 file_ra_state_init
35 0.0120 tcp_v4_destroy_sock
34 0.0117 sk_stop_timer
33 0.0113 __lookup_mnt
33 0.0113 inet_getname
33 0.0113 zone_watermark_ok
30 0.0103 copy_page_range
29 0.0100 blockable_page_cache_readahead
29 0.0100 do_wp_page
29 0.0100 hrtimer_run_queues
29 0.0100 sk_alloc
26 0.0089 _spin_lock_bh
26 0.0089 in_group_p
25 0.0086 __put_user_8
25 0.0086 __wake_up_locked
25 0.0086 find_vma
25 0.0086 init_once
24 0.0083 __lock_text_start
23 0.0079 destroy_inode
23 0.0079 tcp_slow_start
22 0.0076 mark_page_accessed
21 0.0072 flush_tlb_page
20 0.0069 _read_lock_irqsave
20 0.0069 memcpy_toiovec
18 0.0062 apic_timer_interrupt
18 0.0062 retint_restore_args
18 0.0062 vm_normal_page
14 0.0048 __down_write
14 0.0048 __tcp_checksum_complete_user
14 0.0048 invalidate_inode_buffers
13 0.0045 mutex_lock
11 0.0038 __get_user_4
11 0.0038 error_exit
11 0.0038 wake_up_bit
10 0.0034 __block_write_full_page
10 0.0034 __down_read_trylock
10 0.0034 copy_from_user
9 0.0031 copy_process
9 0.0031 timespec_trunc
7 0.0024 dup_fd
7 0.0024 filemap_nopage
7 0.0024 lru_cache_add_active
7 0.0024 memcpy
7 0.0024 nameidata_to_filp
7 0.0024 unlink_file_vma
6 0.0021 __find_get_block
6 0.0021 __mutex_init
6 0.0021 find_get_pages_tag
6 0.0021 inode_has_buffers
6 0.0021 mmput
6 0.0021 page_remove_rmap
6 0.0021 sys_mprotect
5 0.0017 __mark_inode_dirty
5 0.0017 do_mmap_pgoff
5 0.0017 error_sti
5 0.0017 free_hot_page
5 0.0017 load_elf_binary
5 0.0017 page_waitqueue
5 0.0017 radix_tree_tag_clear
5 0.0017 rcu_start_batch
5 0.0017 retint_swapgs
4 0.0014 _spin_lock_irqsave
4 0.0014 copy_strings
4 0.0014 free_pages
4 0.0014 free_pgd_range
4 0.0014 free_pgtables
4 0.0014 page_add_file_rmap
4 0.0014 run_local_timers
4 0.0014 sys_rt_sigprocmask
4 0.0014 unlock_page
4 0.0014 vma_link
3 0.0010 __getblk
3 0.0010 __pagevec_lru_add_active
3 0.0010 __strnlen_user
3 0.0010 _write_lock_bh
3 0.0010 anon_vma_prepare
3 0.0010 anon_vma_unlink
3 0.0010 bio_alloc_bioset
3 0.0010 dnotify_parent
3 0.0010 do_exit
3 0.0010 do_wait
3 0.0010 exit_mmap
3 0.0010 kthread_should_stop
3 0.0010 prio_tree_insert
3 0.0010 remove_vma
3 0.0010 retint_check
3 0.0010 vma_prio_tree_add
2 6.9e-04 __block_prepare_write
2 6.9e-04 __bread
2 6.9e-04 __end_that_request_first
2 6.9e-04 __find_get_block_slow
2 6.9e-04 __free_pages
2 6.9e-04 __vm_enough_memory
2 6.9e-04 alloc_page_vma
2 6.9e-04 arch_unmap_area
2 6.9e-04 bio_init
2 6.9e-04 blk_recount_segments
2 6.9e-04 block_write_full_page
2 6.9e-04 clear_page_dirty_for_io
2 6.9e-04 do_group_exit
2 6.9e-04 do_munmap
2 6.9e-04 do_notify_resume
2 6.9e-04 find_vma_prepare
2 6.9e-04 find_vma_prev
2 6.9e-04 generic_fillattr
2 6.9e-04 get_index
2 6.9e-04 get_unmapped_area
2 6.9e-04 get_vma_policy
2 6.9e-04 prio_tree_remove
2 6.9e-04 prio_tree_replace
2 6.9e-04 proc_get_inode
2 6.9e-04 proc_lookup
2 6.9e-04 release_pages
2 6.9e-04 show_vfsmnt
2 6.9e-04 strchr
2 6.9e-04 sys_dup2
2 6.9e-04 sys_faccessat
2 6.9e-04 sys_mmap
2 6.9e-04 test_set_page_writeback
2 6.9e-04 unmap_region
2 6.9e-04 vm_acct_memory
2 6.9e-04 vm_stat_account
2 6.9e-04 vma_adjust
2 6.9e-04 vma_merge
2 6.9e-04 vma_prio_tree_remove
2 6.9e-04 zonelist_policy
1 3.4e-04 __brelse
1 3.4e-04 __clear_user
1 3.4e-04 __d_path
1 3.4e-04 __lookup_hash
1 3.4e-04 __make_request
1 3.4e-04 __page_set_anon_rmap
1 3.4e-04 __pagevec_lru_add
1 3.4e-04 __pmd_alloc
1 3.4e-04 __pollwait
1 3.4e-04 __pte_alloc
1 3.4e-04 __pud_alloc
1 3.4e-04 __put_user_4
1 3.4e-04 add_to_page_cache
1 3.4e-04 alloc_pages_current
1 3.4e-04 blk_rq_map_sg
1 3.4e-04 can_share_swap_page
1 3.4e-04 can_vma_merge_after
1 3.4e-04 copy_semundo
1 3.4e-04 cp_new_stat
1 3.4e-04 cpuset_update_task_memory_state
1 3.4e-04 d_delete
1 3.4e-04 d_path
1 3.4e-04 dcache_readdir
1 3.4e-04 default_llseek
1 3.4e-04 do_arch_prctl
1 3.4e-04 do_brk
1 3.4e-04 do_fsync
1 3.4e-04 do_mpage_readpage
1 3.4e-04 do_sigaction
1 3.4e-04 do_sigaltstack
1 3.4e-04 dupfd
1 3.4e-04 elf_map
1 3.4e-04 end_page_writeback
1 3.4e-04 error_entry
1 3.4e-04 exit_sem
1 3.4e-04 fasync_helper
1 3.4e-04 file_permission
1 3.4e-04 file_read_actor
1 3.4e-04 flush_signal_handlers
1 3.4e-04 flush_thread
1 3.4e-04 generic_delete_inode
1 3.4e-04 generic_file_aio_write
1 3.4e-04 generic_file_buffered_write
1 3.4e-04 generic_file_mmap
1 3.4e-04 get_request_wait
1 3.4e-04 get_signal_to_deliver
1 3.4e-04 get_stack
1 3.4e-04 hrtimer_init
1 3.4e-04 kill_fasync
1 3.4e-04 kmem_flagcheck
1 3.4e-04 meminfo_read_proc
1 3.4e-04 mempool_free
1 3.4e-04 memscan
1 3.4e-04 mutex_unlock
1 3.4e-04 notify_change
1 3.4e-04 nr_blockdev_pages
1 3.4e-04 page_add_new_anon_rmap
1 3.4e-04 path_release
1 3.4e-04 pipe_ioctl
1 3.4e-04 pipe_iov_copy_from_user
1 3.4e-04 poll_freewait
1 3.4e-04 proc_file_read
1 3.4e-04 proc_pident_lookup
1 3.4e-04 ptregscall_common
1 3.4e-04 rb_next
1 3.4e-04 recalc_bh_state
1 3.4e-04 release_task
1 3.4e-04 retint_careful
1 3.4e-04 schedule_tail
1 3.4e-04 seq_escape
1 3.4e-04 set_brk
1 3.4e-04 set_normalized_timespec
1 3.4e-04 si_swapinfo
1 3.4e-04 split_vma
1 3.4e-04 stub_execve
1 3.4e-04 sync_sb_inodes
1 3.4e-04 sys_brk
1 3.4e-04 sys_lseek
1 3.4e-04 sys_munmap
1 3.4e-04 sys_newfstat
1 3.4e-04 sys_newstat
1 3.4e-04 sys_select
1 3.4e-04 sys_write
1 3.4e-04 truncate_complete_page
1 3.4e-04 tty_ioctl
1 3.4e-04 udp_rcv
1 3.4e-04 unlock_buffer
1 3.4e-04 vfs_readdir
1 3.4e-04 vma_prio_tree_insert
1 3.4e-04 worker_thread
[-- Attachment #3: profile.kevent --]
[-- Type: text/plain, Size: 12581 bytes --]
CPU: AMD64 processors, speed 2210.08 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples % symbol name
124605 59.0887 cpu_idle
13754 6.5223 enter_idle
4677 2.2179 tcp_v4_rcv
3888 1.8437 IRQ0x51_interrupt
3115 1.4772 tcp_ack
2571 1.2192 kmem_cache_free
2151 1.0200 kfree
2121 1.0058 memset_c
1632 0.7739 csum_partial_copy_generic
1611 0.7639 ip_output
1488 0.7056 schedule
1389 0.6587 dev_queue_xmit
1352 0.6411 tcp_rcv_state_process
1272 0.6032 ip_rcv
1184 0.5615 handle_IRQ_event
951 0.4510 ip_queue_xmit
916 0.4344 __do_softirq
900 0.4268 ip_route_input
888 0.4211 csum_partial
870 0.4126 tcp_v4_do_rcv
856 0.4059 ip_local_deliver
851 0.4036 system_call
829 0.3931 netif_receive_skb
812 0.3851 tcp_sendmsg
812 0.3851 tcp_transmit_skb
805 0.3817 __alloc_skb
683 0.3239 kmem_cache_alloc
683 0.3239 local_bh_enable
678 0.3215 __tcp_push_pending_frames
673 0.3191 fget
665 0.3153 __d_lookup
653 0.3097 pfifo_fast_enqueue
639 0.3030 copy_user_generic_string
607 0.2878 preempt_schedule
606 0.2874 net_rx_action
593 0.2812 pfifo_fast_dequeue
589 0.2793 kfree_skbmem
552 0.2618 __fput
550 0.2608 __link_path_walk
528 0.2504 skb_clone
523 0.2480 tcp_parse_options
515 0.2442 ret_from_intr
512 0.2428 number
496 0.2352 _atomic_dec_and_lock
493 0.2338 __kfree_skb
477 0.2262 rt_hash_code
476 0.2257 sock_wfree
459 0.2177 tcp_poll
449 0.2129 tcp_rcv_established
415 0.1968 dput
398 0.1887 sysret_check
366 0.1736 exit_idle
364 0.1726 __mod_timer
352 0.1669 tcp_recvmsg
352 0.1669 thread_return
346 0.1641 sys_close
341 0.1617 get_unused_fd
340 0.1612 rb_erase
337 0.1598 __tcp_select_window
331 0.1570 lock_timer_base
316 0.1498 mask_and_ack_8259A
299 0.1418 ip_append_data
290 0.1375 cache_alloc_refill
286 0.1356 d_alloc
269 0.1276 ret_from_sys_call
268 0.1271 do_path_lookup
267 0.1266 call_softirq
265 0.1257 eth_type_trans
262 0.1242 tcp_current_mss
262 0.1242 tcp_data_queue
262 0.1242 vsnprintf
261 0.1238 __ip_route_output_key
255 0.1209 sk_stream_mem_schedule
249 0.1181 cache_grow
247 0.1171 tcp_rtt_estimator
239 0.1133 __dentry_open
221 0.1048 sys_fcntl
220 0.1043 find_next_zero_bit
216 0.1024 dentry_iput
208 0.0986 try_to_wake_up
197 0.0934 call_rcu
196 0.0929 strncpy_from_user
195 0.0925 do_lookup
191 0.0906 sock_recvmsg
189 0.0896 tcp_send_fin
188 0.0892 sys_recvfrom
182 0.0863 sock_def_readable
178 0.0844 restore_args
176 0.0835 common_interrupt
174 0.0825 release_sock
171 0.0811 do_generic_mapping_read
167 0.0792 schedule_timeout
163 0.0773 generic_drop_inode
163 0.0773 get_page_from_freelist
163 0.0773 percpu_counter_mod
160 0.0759 skb_checksum
159 0.0754 del_timer
155 0.0735 do_sys_open
153 0.0726 memcpy_c
149 0.0707 memcmp
145 0.0688 __skb_checksum_complete
144 0.0683 link_path_walk
144 0.0683 tcp_v4_send_check
141 0.0669 fd_install
140 0.0664 get_empty_filp
139 0.0659 fget_light
131 0.0621 current_fs_time
126 0.0598 mod_timer
125 0.0593 bictcp_cong_avoid
123 0.0583 tcp_init_tso_segs
121 0.0574 update_send_head
120 0.0569 __put_unused_fd
119 0.0564 __do_page_cache_readahead
119 0.0564 alloc_inode
118 0.0560 rb_insert_color
115 0.0545 prepare_to_wait
112 0.0531 locks_remove_posix
112 0.0531 new_inode
111 0.0526 half_md4_transform
109 0.0517 fput
108 0.0512 tcp_sync_mss
105 0.0498 get_task_mm
104 0.0493 clear_inode
102 0.0484 find_get_page
102 0.0484 tcp_select_initial_window
101 0.0479 lookup_mnt
101 0.0479 tcp_send_ack
99 0.0469 sock_close
98 0.0465 try_to_del_timer_sync
97 0.0460 page_cache_readahead
97 0.0460 tcp_snd_test
94 0.0446 generic_permission
94 0.0446 getname
92 0.0436 may_open
91 0.0432 tcp_event_data_recv
90 0.0427 open_namei
89 0.0422 IRQ0x20_interrupt
88 0.0417 inotify_dentry_parent_queue_event
87 0.0413 put_page
85 0.0403 copy_page_c
84 0.0398 d_instantiate
81 0.0384 finish_wait
80 0.0379 __wake_up
79 0.0375 expand_files
77 0.0365 groups_search
75 0.0356 inet_sendmsg
75 0.0356 skb_copy_datagram_iovec
73 0.0346 file_free_rcu
72 0.0341 __path_lookup_intent_open
72 0.0341 permission
71 0.0337 __follow_mount
71 0.0337 memmove
70 0.0332 dnotify_flush
69 0.0327 cond_resched
69 0.0327 igrab
68 0.0322 sk_reset_timer
66 0.0313 sockfd_lookup
65 0.0308 tcp_setsockopt
64 0.0303 sock_sendmsg
62 0.0294 __wake_up_bit
62 0.0294 file_move
62 0.0294 inotify_inode_queue_event
62 0.0294 retint_kernel
62 0.0294 tcp_v4_tw_remember_stamp
60 0.0285 sprintf
56 0.0266 copy_to_user
56 0.0266 sock_common_recvmsg
56 0.0266 tcp_check_space
56 0.0266 tcp_cwnd_validate
55 0.0261 file_kill
55 0.0261 inet_sk_rebuild_header
55 0.0261 tcp_init_cwnd
54 0.0256 __handle_mm_fault
54 0.0256 init_timer
52 0.0247 filp_close
52 0.0247 mutex_unlock
52 0.0247 rw_verify_area
51 0.0242 sock_release
48 0.0228 __tcp_ack_snd_check
48 0.0228 sys_open
47 0.0223 inode_init_once
47 0.0223 locks_remove_flock
47 0.0223 mntput_no_expire
47 0.0223 touch_atime
46 0.0218 sk_stream_rfree
45 0.0213 page_fault
45 0.0213 unmap_vmas
45 0.0213 wake_up_inode
44 0.0209 clear_page_c
44 0.0209 memset
42 0.0199 __delay
42 0.0199 tcp_cong_avoid
41 0.0194 __rb_rotate_left
39 0.0185 do_filp_open
39 0.0185 memcpy_toiovec
39 0.0185 sock_fasync
38 0.0180 __put_user_8
38 0.0180 exit_intr
35 0.0166 zone_watermark_ok
34 0.0161 iput
33 0.0156 iret_label
32 0.0152 free_hot_cold_page
32 0.0152 tcp_unhash
31 0.0147 __alloc_pages
31 0.0147 sk_alloc
30 0.0142 __lookup_mnt
30 0.0142 mutex_lock
29 0.0138 generic_file_open
28 0.0133 _spin_lock_bh
28 0.0133 bit_waitqueue
28 0.0133 tcp_rcv_space_adjust
27 0.0128 inet_sock_destruct
26 0.0123 sk_stop_timer
26 0.0123 tcp_slow_start
24 0.0114 hrtimer_run_queues
24 0.0114 inet_getname
22 0.0104 file_ra_state_init
22 0.0104 vfs_permission
21 0.0100 blockable_page_cache_readahead
21 0.0100 tcp_v4_destroy_sock
20 0.0095 init_once
19 0.0090 do_wp_page
18 0.0085 destroy_inode
16 0.0076 apic_timer_interrupt
16 0.0076 copy_from_user
16 0.0076 del_timer_sync
15 0.0071 in_group_p
15 0.0071 invalidate_inode_buffers
14 0.0066 do_page_fault
13 0.0062 mark_page_accessed
13 0.0062 retint_restore_args
12 0.0057 __get_user_4
11 0.0052 copy_page_range
11 0.0052 find_vma
11 0.0052 wake_up_bit
10 0.0047 __down_read
9 0.0043 rcu_start_batch
8 0.0038 __tcp_checksum_complete_user
8 0.0038 __up_read
8 0.0038 flush_tlb_page
8 0.0038 free_pages
8 0.0038 timespec_trunc
6 0.0028 __find_get_block
6 0.0028 __rb_rotate_right
6 0.0028 error_sti
6 0.0028 kmem_flagcheck
6 0.0028 retint_swapgs
5 0.0024 __down_read_trylock
5 0.0024 _spin_lock_irqsave
5 0.0024 inode_has_buffers
5 0.0024 retint_check
4 0.0019 __mutex_init
4 0.0019 nameidata_to_filp
4 0.0019 prio_tree_insert
4 0.0019 proc_lookup
4 0.0019 run_workqueue
3 0.0014 __getblk
3 0.0014 __iget
3 0.0014 __pagevec_lru_add_active
3 0.0014 __strnlen_user
3 0.0014 _write_lock_bh
3 0.0014 copy_process
3 0.0014 error_exit
3 0.0014 filemap_nopage
3 0.0014 mmput
3 0.0014 sys_mprotect
3 0.0014 unlink_file_vma
3 0.0014 vma_adjust
2 9.5e-04 __clear_user
2 9.5e-04 __end_that_request_first
2 9.5e-04 __pte_alloc
2 9.5e-04 __set_page_dirty_nobuffers
2 9.5e-04 _read_lock_bh
2 9.5e-04 _read_lock_irqsave
2 9.5e-04 add_to_page_cache
2 9.5e-04 anon_vma_prepare
2 9.5e-04 anon_vma_unlink
2 9.5e-04 d_lookup
2 9.5e-04 do_exit
2 9.5e-04 do_mmap_pgoff
2 9.5e-04 do_munmap
2 9.5e-04 do_wait
2 9.5e-04 dup_fd
2 9.5e-04 flush_signal_handlers
2 9.5e-04 free_hot_page
2 9.5e-04 generic_file_aio_read
2 9.5e-04 generic_fillattr
2 9.5e-04 generic_make_request
2 9.5e-04 kthread_should_stop
2 9.5e-04 load_elf_binary
2 9.5e-04 lru_add_drain
2 9.5e-04 lru_cache_add_active
2 9.5e-04 memcpy
2 9.5e-04 mempool_free
2 9.5e-04 mm_init
2 9.5e-04 page_remove_rmap
2 9.5e-04 put_unused_fd
2 9.5e-04 radix_tree_tag_clear
2 9.5e-04 rb_next
2 9.5e-04 retint_with_reschedule
2 9.5e-04 run_local_timers
2 9.5e-04 strchr
2 9.5e-04 vfs_ioctl
2 9.5e-04 vma_link
1 4.7e-04 __add_entropy_words
1 4.7e-04 __anon_vma_link
1 4.7e-04 __brelse
1 4.7e-04 __d_path
1 4.7e-04 __down_write
1 4.7e-04 __find_get_block_slow
1 4.7e-04 __free_pages
1 4.7e-04 __generic_file_aio_write_nolock
1 4.7e-04 __lookup_hash
1 4.7e-04 __make_request
1 4.7e-04 __mark_inode_dirty
1 4.7e-04 __page_set_anon_rmap
1 4.7e-04 __pagevec_lru_add
1 4.7e-04 __pagevec_release
1 4.7e-04 __remove_shared_vm_struct
1 4.7e-04 __up_write
1 4.7e-04 __user_walk_fd
1 4.7e-04 __vm_enough_memory
1 4.7e-04 __writeback_single_inode
1 4.7e-04 alloc_buffer_head
1 4.7e-04 alloc_page_buffers
1 4.7e-04 alloc_pages_current
1 4.7e-04 bio_alloc_bioset
1 4.7e-04 blk_done_softirq
1 4.7e-04 blk_recount_segments
1 4.7e-04 blk_rq_map_sg
1 4.7e-04 can_vma_merge_before
1 4.7e-04 cfq_set_request
1 4.7e-04 copy_strings
1 4.7e-04 cpuset_update_task_memory_state
1 4.7e-04 create_empty_buffers
1 4.7e-04 deny_write_access
1 4.7e-04 do_arch_prctl
1 4.7e-04 do_mpage_readpage
1 4.7e-04 do_select
1 4.7e-04 do_sigaction
1 4.7e-04 exit_mm
1 4.7e-04 exit_mmap
1 4.7e-04 find_vma_prepare
1 4.7e-04 find_vma_prev
1 4.7e-04 free_pgtables
1 4.7e-04 get_locked_pte
1 4.7e-04 get_request_wait
1 4.7e-04 get_signal_to_deliver
1 4.7e-04 get_unmapped_area
1 4.7e-04 init_once
1 4.7e-04 init_request_from_bio
1 4.7e-04 inode_setattr
1 4.7e-04 ll_rw_block
1 4.7e-04 make_ahead_window
1 4.7e-04 mapping_tagged
1 4.7e-04 mempool_alloc
1 4.7e-04 n_tty_write_wakeup
1 4.7e-04 notify_change
1 4.7e-04 page_add_file_rmap
1 4.7e-04 pipe_iov_copy_from_user
1 4.7e-04 pipe_write
1 4.7e-04 preempt_schedule_irq
1 4.7e-04 proc_file_read
1 4.7e-04 remove_vma
1 4.7e-04 sched_balance_self
1 4.7e-04 sched_fork
1 4.7e-04 search_binary_handler
1 4.7e-04 sigprocmask
1 4.7e-04 stub_execve
1 4.7e-04 sync_dirty_buffer
1 4.7e-04 sys_execve
1 4.7e-04 sys_ftruncate
1 4.7e-04 tty_ldisc_ref_wait
1 4.7e-04 vfs_write
1 4.7e-04 writeback_inodes
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 14:16 ` Ingo Molnar
@ 2007-03-01 14:31 ` Eric Dumazet
2007-03-01 14:27 ` Ingo Molnar
2007-03-01 14:54 ` Evgeniy Polyakov
1 sibling, 1 reply; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 14:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Evgeniy Polyakov, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thursday 01 March 2007 15:16, Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> > I can tell you that the problem (at least on my machine) comes from :
> >
> > gettimeofday(&tm, NULL);
> >
> > in evserver_epoll.c
>
> yeah, that's another difference - especially if it's something like an
> Athlon64 and gettimeofday falls back to pm-timer, that could explain the
> performance difference. That's why i repeatedly asked Evgeniy to use the
> /very same/ client function for both the epoll and the kevent test and
> redo the measurements. The numbers are still highly suspect - and we are
> already down from the prior claim of kevent being almost twice as fast
> to a 25% difference.
Also, ab is quite lame... Maybe we could use an epoll based 'stresser'
On my machines (again ...), ab is the slow thing... not the 'server'
Some small differences in behavior could have a big impact on ab (and you
could think there is a problem on the remote side)
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 14:31 ` Eric Dumazet
@ 2007-03-01 14:27 ` Ingo Molnar
0 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 14:27 UTC (permalink / raw)
To: Eric Dumazet
Cc: Evgeniy Polyakov, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Eric Dumazet <dada1@cosmosbay.com> wrote:
> On my machines (again ...), ab is the slow thing... not the 'server'
Evgeniy said that both in the epoll and the kevent case the server side
CPU was 98%-100% busy - so inefficiencies on the client side do not
matter that much - the server is saturated. It's that "kevent is 25%
faster than epoll" claim that i'm probing mainly.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 13:32 ` Ingo Molnar
@ 2007-03-01 14:24 ` Evgeniy Polyakov
0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 14:24 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 02:32:42PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > [...] that is why number for kevent is higher - it uses edge-triggered
> > handler (which you asked to remove from epoll), [...]
>
> no - i did not 'ask' it to be removed from epoll, i only pointed out
> that with edge-triggered the results were highly unreliable here and
> that with level-triggered it worked better. Just to make sure: if you
> put back edge-triggered into evserver_epoll.c, do you get the same
> numbers, and is CPU utilization still the same 98-100%?
No.
_Now_ it is about 1500-2000 req/sec with 10-20% CPU utilization.
The numbers for 'Total transferred' and 'HTML transferred' do
not equal 80000 multiplied by the size of the page.
These are strange tests actually - I managed to get 9000 requests per
second from the epoll server (only once!) and 8900 from kevent (two times
only); sometimes they both drop down to 2300-2700 req/s.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 13:30 ` Evgeniy Polyakov
@ 2007-03-01 14:19 ` Eric Dumazet
2007-03-01 14:16 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 14:19 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thursday 01 March 2007 14:30, Evgeniy Polyakov wrote:
> On Thu, Mar 01, 2007 at 02:11:18PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > ok?
>
> I understood you a couple of mails ago.
> No problem, I can put processing into the same function called from
> different servers :)
>
> > Btw., am i correct that in this particular 'ab' test, the 'immediately'
> > flag is always zero, i.e. kweb_kevent_remove() is always called?
>
> Yes.
>
> > Ingo
I can tell you that the problem (at least on my machine) comes from :
gettimeofday(&tm, NULL);
in evserver_epoll.c
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 14:19 ` Eric Dumazet
@ 2007-03-01 14:16 ` Ingo Molnar
2007-03-01 14:31 ` Eric Dumazet
2007-03-01 14:54 ` Evgeniy Polyakov
0 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 14:16 UTC (permalink / raw)
To: Eric Dumazet
Cc: Evgeniy Polyakov, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Eric Dumazet <dada1@cosmosbay.com> wrote:
> I can tell you that the problem (at least on my machine) comes from :
>
> gettimeofday(&tm, NULL);
>
> in evserver_epoll.c
yeah, that's another difference - especially if it's something like an
Athlon64 and gettimeofday falls back to pm-timer, that could explain the
performance difference. That's why i repeatedly asked Evgeniy to use the
/very same/ client function for both the epoll and the kevent test and
redo the measurements. The numbers are still highly suspect - and we are
already down from the prior claim of kevent being almost twice as fast
to a 25% difference.
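[ As an illustration of where that cost goes - a minimal sketch of paying for
gettimeofday() once per batch of ready events instead of once per request.
This is only an illustration; handle_client() is a placeholder and none of
this is taken from evserver_epoll.c: ]

#include <sys/epoll.h>
#include <sys/time.h>

void handle_client(int fd, const struct timeval *now);  /* placeholder */

static void serve_loop(int epfd)
{
        struct epoll_event events[1024];
        struct timeval now;
        int i, nr;

        for (;;) {
                nr = epoll_wait(epfd, events, 1024, -1);
                if (nr <= 0)
                        continue;
                /* one gettimeofday() per batch of ready events,
                   not one per served request */
                gettimeofday(&now, NULL);
                for (i = 0; i < nr; i++)
                        handle_client(events[i].data.fd, &now);
        }
}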
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 13:26 ` Evgeniy Polyakov
@ 2007-03-01 13:32 ` Ingo Molnar
2007-03-01 14:24 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 13:32 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> [...] that is why number for kevent is higher - it uses edge-triggered
> handler (which you asked to remove from epoll), [...]
no - i did not 'ask' it to be removed from epoll, i only pointed out
that with edge-triggered the results were highly unreliable here and
that with level-triggered it worked better. Just to make sure: if you
put back edge-triggered into evserver_epoll.c, do you get the same
numbers, and is CPU utilization still the same 98-100%?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 13:11 ` Ingo Molnar
@ 2007-03-01 13:30 ` Evgeniy Polyakov
2007-03-01 14:19 ` Eric Dumazet
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 13:30 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 02:11:18PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> ok?
I understood you a couple of mails ago.
No problem, I can put processing into the same function called from
different servers :)
> Btw., am i correct that in this particular 'ab' test, the 'immediately'
> flag is always zero, i.e. kweb_kevent_remove() is always called?
Yes.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 12:34 ` Ingo Molnar
@ 2007-03-01 13:26 ` Evgeniy Polyakov
2007-03-01 13:32 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 13:26 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 01:34:23PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > Document Length: 3521 bytes
>
> > Concurrency Level: 8000
> > Time taken for tests: 16.686737 seconds
> > Complete requests: 80000
> > Failed requests: 0
> > Write errors: 0
> > Total transferred: 309760000 bytes
> > HTML transferred: 281680000 bytes
> > Requests per second: 4794.23 [#/sec] (mean)
>
> > Concurrency Level: 8000
> > Time taken for tests: 12.366775 seconds
> > Complete requests: 80000
> > Failed requests: 0
> > Write errors: 0
> > Total transferred: 317047104 bytes
> > HTML transferred: 288306522 bytes
> > Requests per second: 6468.95 [#/sec] (mean)
>
> i'm wondering - how can the 'Total transferred' and 'HTML transferred'
> numbers be different?
>
> Since document length is 3521, and the number of requests is 80000, the
> correct 'HTML transferred' is 281680000 - which is the epoll result. The
> kevent result shows more bytes transferred, which suggests that the
> kevent loop is probably incorrect somewhere.
>
> this might be some benign thing, but the /first/ thing you /have to/ do
> before claiming that 'kevent is 25% faster than epoll' is to make sure
> the results are totally reliable.
Kevent sent an additional 525 pages ((311792800-309760000)/3872) - that is
why the number for kevent is higher - it uses an edge-triggered handler
(which you asked to remove from epoll), and that can produce false
positives; for an exact result in that case ret_data, where the poll flags
were stored, must be checked (as before). 'ab' does not count the additional
data as new requests and does not count it in 'requests per second'.
Even if it could do so, an additional 500 requests cannot produce a 35%
higher rate.
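[ For reference, a generic sketch of what edge-triggered epoll usage requires
compared to level-triggered registration - the handler has to drain the
socket until EAGAIN or it can miss data. This is standard epoll usage, not
code taken from evserver_epoll.c: ]

#include <sys/types.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <errno.h>

static void add_edge_triggered(int epfd, int s)
{
        struct epoll_event ev;

        ev.events = EPOLLIN | EPOLLET;  /* EPOLLIN alone would be level-triggered */
        ev.data.fd = s;
        epoll_ctl(epfd, EPOLL_CTL_ADD, s, &ev);
}

static int drain_socket(int s, char *buf, size_t len)
{
        ssize_t err;

        for (;;) {
                err = recv(s, buf, len, 0);
                if (err < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                        return 0;       /* fully drained, wait for the next edge */
                if (err <= 0)
                        return -1;      /* error or peer closed the connection */
                /* process 'err' bytes here ... */
        }
}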
For example, lighttpd results are the same for kevent and epoll and
'Total transferred' and 'HTML transferred' numbers change between runs both
for epoll and kevent.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 11:47 ` Evgeniy Polyakov
@ 2007-03-01 13:12 ` Eric Dumazet
2007-03-01 14:43 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 13:12 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
[-- Attachment #1: Type: text/plain, Size: 456 bytes --]
On Thursday 01 March 2007 12:47, Evgeniy Polyakov wrote:
>
> Could you provide at least remote way to find it?
>
Sure :)
> I only found the same problem at
> http://lkml.org/lkml/2006/10/27/3
>
> but without any hits to solve the problem.
>
> I will try CVS oprofile, if it works I will provide details of course.
>
# cat CVS/Root
CVS/Root::pserver:anonymous@oprofile.cvs.sourceforge.net:/cvsroot/oprofile
# cvs diff >/tmp/oprofile.diff
Hope it helps
[-- Attachment #2: oprofile.diff --]
[-- Type: text/x-diff, Size: 1663 bytes --]
Index: libop/op_alloc_counter.c
===================================================================
RCS file: /cvsroot/oprofile/oprofile/libop/op_alloc_counter.c,v
retrieving revision 1.8
diff -r1.8 op_alloc_counter.c
14a15,16
> #include <ctype.h>
> #include <dirent.h>
133c135
< return 0;
---
> continue;
145a148,183
> /* determine which directories are counter directories
> */
> static int perfcounterdir(const struct dirent * entry)
> {
> return (isdigit(entry->d_name[0]));
> }
>
>
> /**
> * @param mask pointer where to place bit mask of unavailable counters
> *
> * return >= 0 number of counters that are available
> * < 0 could not determine number of counters
> *
> */
> static int op_get_counter_mask(u32 * mask)
> {
> struct dirent **counterlist;
> int count, i;
> /* assume nothing is available */
> u32 available=0;
>
> count = scandir("/dev/oprofile", &counterlist, perfcounterdir,
> alphasort);
> if (count < 0)
> /* unable to determine bit mask */
> return -1;
> /* convert to bit map (0 where counter exists) */
> for (i=0; i<count; ++i) {
> available |= 1 << atoi(counterlist[i]->d_name);
> free(counterlist[i]);
> }
> *mask=~available;
> free(counterlist);
> return count;
> }
152a191
> u32 unavailable_counters = 0;
154c193,195
< nr_counters = op_get_nr_counters(cpu_type);
---
> nr_counters = op_get_counter_mask(&unavailable_counters);
> if (nr_counters < 0)
> nr_counters = op_get_nr_counters(cpu_type);
162c203,204
< if (!allocate_counter(ctr_arc, nr_events, 0, 0, counter_map)) {
---
> if (!allocate_counter(ctr_arc, nr_events, 0, unavailable_counters,
> counter_map)) {
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 13:01 ` Evgeniy Polyakov
@ 2007-03-01 13:11 ` Ingo Molnar
2007-03-01 13:30 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 13:11 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > i dont care whether they are separate or not - but you have not
> > replied to the request that there be a handle_web_request() function
> > in /both/ files, which is precisely the same function. I didnt ask
> > you to merge the two files - i only asked for the two web handling
> > functions to be one and the same function.
>
> They are not the same in general - if a kevent is ready immediately, it is
> not removed from the kevent tree; but the current kevent server always takes
> the not-immediately path for the lighttpd tests - so the functions are the same:
> open()
> sendfile()
> cork_off
> close(fd)
> close(s)
> remove_event_from_the_kernel
>
> with the same parameters.
you /STILL/ dont understand. I'm only talking about evserver_epoll.c and
evserver_kevent.c. Not about lighttpd. Not about historic reasons. I
simply suggested a common-sense change:
| | Would it be so hard to introduce a single handle_web_request()
| | function that is exactly the same in the two tests? All the queueing
| | details (which are of course different in the epoll and the kevent
| | case) should be in the client function, which calls
| | handle_web_request().
i.e. put remove_event_from_the_kernel() (kweb_kevent_remove() and
evtest_remove()) into a SEPARATE client function, which calls the
/common/ handle_web_request(sock) function. You can do the
immediate-removal in that separate, kevent-specific client function -
but the socket function, handle_web_request(sock) should be /perfectly
identical/ in the two files.
I.e.:
static inline int handle_web_request(int s)
{
        int err, fd, on = 0;
        off_t offset = 0;
        int count = 40960;
        char path[] = "/tmp/index.html";
        char buf[4096];
        err = recv(s, buf, sizeof(buf), 0);
        if (err <= 0)
                return err;
        fd = open(path, O_RDONLY);
        if (fd == -1)
                return fd;
        err = sendfile(s, fd, &offset, count);
        if (err < 0) {
                ulog_err("Failed send %d bytes: fd=%d.\n", count, s);
                close(fd);
                return err;
        }
        setsockopt(s, SOL_TCP, TCP_CORK, &on, sizeof(on));
        close(fd);
        close(s); /* No keepalive */
        return 0;
}
And in evserver_epoll.c do this:
static int evtest_callback_client(int s)
{
        int err = handle_web_request(s);
        if (err)
                evtest_remove(s);
        return err;
}
and in evserver_kevent.c do this:
static int kweb_callback_client(struct ukevent *e, int im)
{
        int err = handle_web_request(e->id.raw[0]);
        if (err || !im)
                kweb_kevent_remove(e);
        return err;
}
ok?
Btw., am i correct that in this particular 'ab' test, the 'immediately'
flag is always zero, i.e. kweb_kevent_remove() is always called?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 12:43 ` Ingo Molnar
@ 2007-03-01 13:01 ` Evgeniy Polyakov
2007-03-01 13:11 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 13:01 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 01:43:36PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > I separated the epoll and kevent servers, since originally the kevent
> > server included additional kevent features, but then new ones were added
> > and I slowly moved towards something similar to the epoll case.
>
> i dont care whether they are separate or not - but you have not replied
> to the request that there be a handle_web_request() function in /both/
> files, which is precisely the same function. I didnt ask you to merge
> the two files - i only asked for the two web handling functions to be
> one and the same function.
They are not the same in general - if a kevent is ready immediately, it is
not removed from the kevent tree; but the current kevent server always takes
the not-immediately path for the lighttpd tests - so the functions are the same:
open()
sendfile()
cork_off
close(fd)
close(s)
remove_event_from_the_kernel
with the same parameters.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 12:10 ` Evgeniy Polyakov
@ 2007-03-01 12:43 ` Ingo Molnar
2007-03-01 13:01 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 12:43 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> I separated the epoll and kevent servers, since originally the kevent
> server included additional kevent features, but then new ones were added
> and I slowly moved towards something similar to the epoll case.
i dont care whether they are separate or not - but you have not replied
to the request that there be a handle_web_request() function in /both/
files, which is precisely the same function. I didnt ask you to merge
the two files - i only asked for the two web handling functions to be
one and the same function.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 10:59 ` Evgeniy Polyakov
2007-03-01 11:00 ` Ingo Molnar
2007-03-01 11:14 ` Eric Dumazet
@ 2007-03-01 12:34 ` Ingo Molnar
2007-03-01 13:26 ` Evgeniy Polyakov
2007-03-01 16:56 ` David Lang
3 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 12:34 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> Document Length: 3521 bytes
> Concurrency Level: 8000
> Time taken for tests: 16.686737 seconds
> Complete requests: 80000
> Failed requests: 0
> Write errors: 0
> Total transferred: 309760000 bytes
> HTML transferred: 281680000 bytes
> Requests per second: 4794.23 [#/sec] (mean)
> Concurrency Level: 8000
> Time taken for tests: 12.366775 seconds
> Complete requests: 80000
> Failed requests: 0
> Write errors: 0
> Total transferred: 317047104 bytes
> HTML transferred: 288306522 bytes
> Requests per second: 6468.95 [#/sec] (mean)
i'm wondering - how can the 'Total transferred' and 'HTML transferred'
numbers be different?
Since document length is 3521, and the number of requests is 80000, the
correct 'HTML transferred' is 281680000 - which is the epoll result. The
kevent result shows more bytes transferred, which suggests that the
kevent loop is probably incorrect somewhere.
this might be some benign thing, but the /first/ thing you /have to/ do
before claiming that 'kevent is 25% faster than epoll' is to make sure
the results are totally reliable.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 11:28 ` Eric Dumazet
2007-03-01 11:47 ` Evgeniy Polyakov
@ 2007-03-01 12:19 ` Evgeniy Polyakov
1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 12:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 12:28:00PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> I used the CVS version of oprofile plus a patch you can find in the mailing
> list archives. Dont remember exactly, since I hit this some months ago
Ugh, I started - but CVS compilation requires about 40 MB of additional
libs (according to Debian testing dependencies on my very light
installation), so with my miserable 1-1.6 KB/sec do not expect it today :)
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 11:47 ` Ingo Molnar
@ 2007-03-01 12:10 ` Evgeniy Polyakov
2007-03-01 12:43 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 12:10 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 12:47:35PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> > > I also changed the client socket to nonblocking mode with the same result
> > > in the epoll server. If you find it broken, please send me a corrected
> > > version to test too.
> >
> > this line in evserver_kevent.c looks a bit fishy:
>
> this one in evserver_kevent.c looks problematic too:
>
> shutdown(s, SHUT_RDWR);
> close(s);
>
> as evserver_epoll.c only does:
>
> close(s);
>
> again, there might be TCP control flow differences due to this. [ Or the
> removal of this shutdown() call might be a small speedup for the kevent
> case ;) ]
:)
> Also, the order of fd and socket close() is different in the two cases.
> It shouldnt make any difference - but that too just makes the results
> harder to trust. Would it be so hard to introduce a single
> handle_web_request() function that is exactly the same in the two tests?
> All the queueing details (which are of course different in the epoll and
> the kevent case) should be in the client function, which calls
> handle_web_request().
I've removed shutdown - things are the same.
Sometimes kevent performance drops to lower numbers and its graph of
times needed to handle events has high plateaus (with and without
shutdown - it was always like that), like this:
Percentage of the requests served within a certain time (ms)
50% 128
66% 486
75% 505
80% 507
90% 732
95% 3087 // something is wrong at this point
98% 9058
99% 9072
100% 15562 (longest request)
It is possible that there are some other bugs in the server though,
which prevent sockets from being quickly closed and thus increase its
processing time - I do not know for sure the root cause of that behaviour.
I separated the epoll and kevent servers, since originally the kevent
server included additional kevent features, but then new ones were added
and I slowly moved towards something similar to the epoll case.
The current version of the server was a pre-test one for the lighttpd
patches, so essentially it should be like the epoll one except for minor details.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 11:41 ` Ingo Molnar
2007-03-01 11:47 ` Ingo Molnar
@ 2007-03-01 12:01 ` Evgeniy Polyakov
1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 12:01 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 12:41:37PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > I also changed the client socket to nonblocking mode with the same result
> > in the epoll server. If you find it broken, please send me a corrected
> > version to test too.
>
> this line in evserver_kevent.c looks a bit fishy:
>
> err = recv(s, buf, 100, 0);
>
> because on the evserver_epoll.c side the following is done:
>
> err = recv(s, buf, 4096, 0);
>
> now, for 'ab', the request size is 76 bytes, so it should fit fine
> functionality-wise. But, the TCP stack might decide differently of
> whether to return with a partial packet depending on how much data is
> requested. I dont know whether it actually makes a difference in the TCP
> flow decisions, and whether it makes a performance difference in your
> test, but safest would be to use 4096 in both cases.
Well, that would be quite strange - as far as I know the Linux network
stack (for which kevent was originally created, to support network AIO),
there should not be any difference.
Anyway, I've reran the test with the same values:
# ab -c8000 -n80000 http://192.168.0.48/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.0.48 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests
Server Software: Apache/1.3.27
Server Hostname: 192.168.0.48
Server Port: 80
Document Path: /
Document Length: 3521 bytes
Concurrency Level: 8000
Time taken for tests: 18.398381 seconds
Complete requests: 80000
Failed requests: 0
Write errors: 0
Total transferred: 338738048 bytes
HTML transferred: 308031164 bytes
Requests per second: 4348.21 [#/sec] (mean)
Time per request: 1839.838 [ms] (mean)
Time per request: 0.230 [ms] (mean, across all concurrent
requests)
Transfer rate: 17979.73 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 148 795 196.9 808 3599
Processing: 824 882 39.7 878 986
Waiting: 59 426 212.6 423 914
Total: 1073 1678 200.8 1673 4579
Percentage of the requests served within a certain time (ms)
50% 1673
66% 1674
75% 1678
80% 1686
90% 1852
95% 1861
98% 1864
99% 1865
100% 4579 (longest request)
Essentially the same result (within the limits of some inaccuracy).
> in general, please make sure the exact same system calls are done in the
> client function. (except of course for the event queueing syscalls
> themselves)
Yes, that should be done of course.
I even have a plan to create the same binary for both, but I also plan to
turn on some kevent optimizations (mainly readiness-on-submit: when the
requested event (recv/send/anything) is ready immediately, kevent supports
returning that event from the submission syscall, without the additional
overhead of reading it from the ring or queue).
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 11:41 ` Ingo Molnar
@ 2007-03-01 11:47 ` Ingo Molnar
2007-03-01 12:10 ` Evgeniy Polyakov
2007-03-01 12:01 ` Evgeniy Polyakov
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 11:47 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Ingo Molnar <mingo@elte.hu> wrote:
> > I also changed the client socket to nonblocking mode with the same result
> > in the epoll server. If you find it broken, please send me a corrected
> > version to test too.
>
> this line in evserver_kevent.c looks a bit fishy:
this one in evserver_kevent.c looks problematic too:
shutdown(s, SHUT_RDWR);
close(s);
as evserver_epoll.c only does:
close(s);
again, there might be TCP control flow differences due to this. [ Or the
removal of this shutdown() call might be a small speedup for the kevent
case ;) ]
Also, the order of fd and socket close() is different in the two cases.
It shouldnt make any difference - but that too just makes the results
harder to trust. Would it be so hard to introduce a single
handle_web_request() function that is exactly the same in the two tests?
All the queueing details (which are of course different in the epoll and
the kevent case) should be in the client function, which calls
handle_web_request().
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 11:28 ` Eric Dumazet
@ 2007-03-01 11:47 ` Evgeniy Polyakov
2007-03-01 13:12 ` Eric Dumazet
2007-03-01 12:19 ` Evgeniy Polyakov
1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 11:47 UTC (permalink / raw)
To: Eric Dumazet
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 12:28:00PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Thursday 01 March 2007 12:20, Evgeniy Polyakov wrote:
> > On Thu, Mar 01, 2007 at 12:14:44PM +0100, Eric Dumazet (dada1@cosmosbay.com)
> wrote:
> > > On Thursday 01 March 2007 11:59, Evgeniy Polyakov wrote:
> > > > Yes, it is about 98-100% in both cases.
> > > > I've just re-run tests on my amd64 test machine without debug options:
> > > >
> > > > epoll 4794.23
> > > > kevent 6468.95
> > >
> > > It would be valuable if you could post oprofile results
> > > (CPU_CLK_UNHALTED) for both tests.
> >
> > I can't - oprofile does not work on this x86_64 machine:
> >
>
> Yes, this is a known problem, but you can make it work, as I did.
>
> Please :)
I can not resist :)
> I used the CVS version of oprofile plus a patch you can find in the mailing
> list archives. Dont remember exactly, since I hit this some months ago
Could you provide at least a remote way to find it?
I only found the same problem at
http://lkml.org/lkml/2006/10/27/3
but without any hints on how to solve the problem.
I will try CVS oprofile, if it works I will provide details of course.
My tree is based on rc1 and has this latest commit:
commit b5bf28cde894b3bb3bd25c13a7647020562f9ea0
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date: Wed Feb 21 11:21:44 2007 -0800
There are no commits after that date with the word 'oprofile' in
git-whatchanged, at least.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 11:16 ` Evgeniy Polyakov
2007-03-01 11:27 ` Ingo Molnar
@ 2007-03-01 11:41 ` Ingo Molnar
2007-03-01 11:47 ` Ingo Molnar
2007-03-01 12:01 ` Evgeniy Polyakov
1 sibling, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 11:41 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> I also changed the client socket to nonblocking mode with the same result
> in the epoll server. If you find it broken, please send me a corrected
> version to test too.
this line in evserver_kevent.c looks a bit fishy:
err = recv(s, buf, 100, 0);
because on the evserver_epoll.c side the following is done:
err = recv(s, buf, 4096, 0);
now, for 'ab', the request size is 76 bytes, so it should fit fine
functionality-wise. But, the TCP stack might decide differently of
whether to return with a partial packet depending on how much data is
requested. I dont know whether it actually makes a difference in the TCP
flow decisions, and whether it makes a performance difference in your
test, but safest would be to use 4096 in both cases.
in general, please make sure the exact same system calls are done in the
client function. (except of course for the event queueing syscalls
themselves)
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 11:27 ` Ingo Molnar
@ 2007-03-01 11:36 ` Evgeniy Polyakov
0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 11:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 12:27:00PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > I've uploaded them to:
> >
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
>
> thanks.
>
> > I also changed the client socket to nonblocking mode with the same result
> > in the epoll server. [...]
>
> what does this mean exactly? Did you change this line in
> evserver_epoll.c:
>
> //fcntl(cs, F_SETFL, O_NONBLOCK);
>
> to:
>
> fcntl(cs, F_SETFL, O_NONBLOCK);
>
> and the result was the same?
Yep.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 11:20 ` Evgeniy Polyakov
@ 2007-03-01 11:28 ` Eric Dumazet
2007-03-01 11:47 ` Evgeniy Polyakov
2007-03-01 12:19 ` Evgeniy Polyakov
0 siblings, 2 replies; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 11:28 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thursday 01 March 2007 12:20, Evgeniy Polyakov wrote:
> On Thu, Mar 01, 2007 at 12:14:44PM +0100, Eric Dumazet (dada1@cosmosbay.com)
wrote:
> > On Thursday 01 March 2007 11:59, Evgeniy Polyakov wrote:
> > > Yes, it is about 98-100% in both cases.
> > > I've just re-run tests on my amd64 test machine without debug options:
> > >
> > > epoll 4794.23
> > > kevent 6468.95
> >
> > It would be valuable if you could post oprofile results
> > (CPU_CLK_UNHALTED) for both tests.
>
> I can't - oprofile does not work on this x86_64 machine:
>
Yes, this is a known problem, but you can make it work, as I did.
Please :)
I used the CVS version of oprofile plus a patch you can find in the mailing
list archives. Dont remember exactly, since I hit this some months ago
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 11:16 ` Evgeniy Polyakov
@ 2007-03-01 11:27 ` Ingo Molnar
2007-03-01 11:36 ` Evgeniy Polyakov
2007-03-01 11:41 ` Ingo Molnar
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 11:27 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> I've uploaded them to:
>
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
thanks.
> I also changed the client socket to nonblocking mode with the same result
> in the epoll server. [...]
what does this mean exactly? Did you change this line in
evserver_epoll.c:
//fcntl(cs, F_SETFL, O_NONBLOCK);
to:
fcntl(cs, F_SETFL, O_NONBLOCK);
and the result was the same?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 11:14 ` Eric Dumazet
@ 2007-03-01 11:20 ` Evgeniy Polyakov
2007-03-01 11:28 ` Eric Dumazet
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 11:20 UTC (permalink / raw)
To: Eric Dumazet
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 12:14:44PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Thursday 01 March 2007 11:59, Evgeniy Polyakov wrote:
>
> > Yes, it is about 98-100% in both cases.
> > I've just re-run tests on my amd64 test machine without debug options:
> >
> > epoll 4794.23
> > kevent 6468.95
> >
>
> It would be valuable if you could post oprofile results (CPU_CLK_UNHALTED) for
> both tests.
I can't - oprofile does not work on this x86_64 machine:
#opcontrol --setup --vmlinux=/home/s0mbre/aWork/git/linux-2.6.kevent/vmlinux
# opcontrol --start
Using default event: CPU_CLK_UNHALTED:100000:0:1:1
/usr/bin/opcontrol: line 994: /dev/oprofile/0/enabled: No such file or
directory
/usr/bin/opcontrol: line 994: /dev/oprofile/0/event: No such file or
directory
/usr/bin/opcontrol: line 994: /dev/oprofile/0/count: No such file or
directory
/usr/bin/opcontrol: line 994: /dev/oprofile/0/kernel: No such file or
directory
/usr/bin/opcontrol: line 994: /dev/oprofile/0/user: No such file or
directory
/usr/bin/opcontrol: line 994: /dev/oprofile/0/unit_mask: No such file or
directory
# ls -l /dev/oprofile/
total 0
drwxr-xr-x 1 root root 0 2007-03-01 09:41 1
drwxr-xr-x 1 root root 0 2007-03-01 09:41 2
drwxr-xr-x 1 root root 0 2007-03-01 09:41 3
-rw-r--r-- 1 root root 0 2007-03-01 09:41 backtrace_depth
-rw-r--r-- 1 root root 0 2007-03-01 09:41 buffer
-rw-r--r-- 1 root root 0 2007-03-01 09:41 buffer_size
-rw-r--r-- 1 root root 0 2007-03-01 09:41 buffer_watershed
-rw-r--r-- 1 root root 0 2007-03-01 09:41 cpu_buffer_size
-rw-r--r-- 1 root root 0 2007-03-01 09:41 cpu_type
-rw-rw-rw- 1 root root 0 2007-03-01 09:41 dump
-rw-r--r-- 1 root root 0 2007-03-01 09:41 enable
-rw-r--r-- 1 root root 0 2007-03-01 09:41 pointer_size
drwxr-xr-x 1 root root 0 2007-03-01 09:41 stats
> Thank you
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 10:11 ` Pavel Machek
2007-03-01 10:19 ` Ingo Molnar
@ 2007-03-01 11:18 ` Evgeniy Polyakov
2007-03-02 10:27 ` Pavel Machek
1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 11:18 UTC (permalink / raw)
To: Pavel Machek
Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 11:11:02AM +0100, Pavel Machek (pavel@ucw.cz) wrote:
> > > > > 10% gain in speed is NOT worth major complexity increase.
> > > >
> > > > Should I create a patch to remove rb-tree implementation?
> > >
> > > If you can replace them with something simpler, and no worse than 10%
> > > slower in worst case, then go ahead. (We actually tried to do that at
> > > some point, only to realize that efence stresses vm subsystem in very
> > > unexpected/unfriendly way).
> >
> > Agh, only 10% in the worst case.
> > I think you cannot even imagine what tricks the network stack uses to
> > get at least an additional 1% out of the box.
>
> Yep? Feel free to rewrite networking to assembly on Eugenix. That
> should get you 1% improvement. If you reserve few registers to be only
> used by kernel (not allowed by userspace), you can speedup networking
> 5%, too. Ouch and you could turn off MMU, that is sure way to get few
> more percent improvement in your networking case.
It is not _my_ networking, but the one you use every day in every Linux
box. Notice which tricks are used to remove a single byte from sk_buff.
It is called optimization, and if it gives us even a single plus it must be
implemented. Not all people have a magical fear of new things.
> > Using such logic you can just abandon any further development, since it
> > works as is right now.
>
> Stop trying to pervert my logic.
Ugh? :)
I am just putting into simple words your 'we do not need something if it
adds 10% but is complex to understand'.
> > > > That practice is stupid IMO.
> > >
> > > Too bad. Now you can start Linux fork called Eugenix.
> > >
> > > (But really, Linux is not "maximum performance at any cost". Linux is
> > > "how fast can we get that while keeping it maintainable?").
> >
> > Should I read it like: we do not understand what it is and thus we do
> > not want it?
>
> Actually, yes, that's a concern. If your code is so crappy that we
> can't understand it, guess what, it is not going to be merged. Notice
> that someone will have to maintain your code if you get hit by bus.
>
> If your code is so complex that it is almost impossible to use from
> userspace, that is good enough reason not to be merged. "But it is 3%
> faster if..." is not a good-enough argument.
Is it enough for you?
epoll 4794.23 req/sec
kevent 6468.95 req/sec
And we are not even talking about other kevent features like the ability to
deliver essentially any event through its queue or shared ring (and some of
its ideas are slowly being implemented in the syslet/threadlet code, btw).
Even if kevent is just as fast as epoll, it allows working with any kind of
event (signals, timers, AIO completions, I/O events and any other you like)
through one queue/ring, which removes races and does _simplify_
development, since there is no need to create different models to handle
different events.
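To illustrate what that one-queue model looks like from the application
side, here is a rough sketch (wait_for_event(), the EV_* constants and the
handle_* helpers other than handle_web_request() are placeholders for
illustration, not the real kevent interface):

/* One queue, one structure, one loop - illustration only. */
enum { EV_SOCKET, EV_TIMER, EV_SIGNAL, EV_AIO };        /* placeholder types */

struct generic_event {
        int type;
        int fd;
        void *data;
};

int wait_for_event(struct generic_event *ev);           /* placeholder */
void handle_timeout(void *data);                        /* placeholder */
void handle_signal_event(int signo);                    /* placeholder */
void handle_aio_completion(void *data);                 /* placeholder */
int handle_web_request(int s);          /* the common handler discussed in this thread */

static void unified_loop(void)
{
        struct generic_event ev;

        while (wait_for_event(&ev) == 0) {
                switch (ev.type) {
                case EV_SOCKET:
                        handle_web_request(ev.fd);
                        break;
                case EV_TIMER:
                        handle_timeout(ev.data);
                        break;
                case EV_SIGNAL:
                        handle_signal_event(ev.fd);
                        break;
                case EV_AIO:
                        handle_aio_completion(ev.data);
                        break;
                }
        }
}

With separate mechanisms, each of those event sources would need its own
delivery path and its own synchronization with the main poll loop.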
> > > That is why, while arguing syslets vs. kevents, you need to argue
> > > not "kevents are faster because they avoid context switch overhead",
> > > but "kevents are _so much_ faster that it is worth the added
> > > complexity". And Ingo seems to showing you they are not _so much_
> > > faster.
> >
> > Threadlets behave much worse without an event driven model, and events can
> > behave worse without backing threads; they are mutually compensating.
>
> I think Ingo demonstrated unoptimized threadlets to be within 5% of
> the speed of kevent. Demonstrate that kevents are twice as fast as
> syslets on a reasonable test case, and I guess we'll listen...
That was compared to epoll, not kevent.
But I repeat again - kevent is not only epoll, it can do a lot of other
things which do improve performance and simplify development - did you
see the terrible hacks in libevent to handle signals without a race in the
polling loop? None of that is needed anymore - one event loop, one event
structure, a completely unified interface for all operations.
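(The classic workaround being referred to is, roughly, the 'self-pipe'
trick: the signal handler writes a byte into a pipe whose read end sits in
the poll/epoll set. A sketch for illustration, not libevent's actual code:)

#include <unistd.h>
#include <signal.h>
#include <fcntl.h>

static int sigpipe_fds[2];      /* [0] is added to the poll/epoll set */

static void signal_to_pipe(int signo)
{
        char c = (char)signo;

        /* async-signal-safe: just poke the event loop */
        (void)write(sigpipe_fds[1], &c, 1);
}

static int setup_signal_pipe(void)
{
        if (pipe(sigpipe_fds) < 0)
                return -1;
        fcntl(sigpipe_fds[0], F_SETFL, O_NONBLOCK);
        fcntl(sigpipe_fds[1], F_SETFL, O_NONBLOCK);
        signal(SIGINT, signal_to_pipe);
        return 0;
}

With a single unified queue the signal simply arrives as one more event, so
none of this plumbing is needed.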
Some kevent features are slowly being implemented in the syslet/threadlet
async code too, and it looks like I see where things will end up :), but I
likely do not care about a new 'kevent'; I just wanted that said half a year
ago, when I started resending it again, but Ingo already said his definitive
word :)
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 11:00 ` Ingo Molnar
@ 2007-03-01 11:16 ` Evgeniy Polyakov
2007-03-01 11:27 ` Ingo Molnar
2007-03-01 11:41 ` Ingo Molnar
0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 11:16 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 12:00:22PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > I've just re-run tests on my amd64 test machine without debug options:
> >
> > epoll 4794.23
> > kevent 6468.95
>
> could you please post the two URLs for the exact evserver code used for
> these measurements? (even if you did so already in the past - best to
> have them always together with the numbers) Thanks!
I've uploaded them to:
http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
I also changed the client socket to nonblocking mode with the same result in
the epoll server. If you find it broken, please send me a corrected version
to test too.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 10:59 ` Evgeniy Polyakov
2007-03-01 11:00 ` Ingo Molnar
@ 2007-03-01 11:14 ` Eric Dumazet
2007-03-01 11:20 ` Evgeniy Polyakov
2007-03-01 12:34 ` Ingo Molnar
2007-03-01 16:56 ` David Lang
3 siblings, 1 reply; 277+ messages in thread
From: Eric Dumazet @ 2007-03-01 11:14 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Pavel Machek, Theodore Tso, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thursday 01 March 2007 11:59, Evgeniy Polyakov wrote:
> Yes, it is about 98-100% in both cases.
> I've just re-run tests on my amd64 test machine without debug options:
>
> epoll 4794.23
> kevent 6468.95
>
It would be valuable if you could post oprofile results (CPU_CLK_UNHALTED) for
both tests.
Thank you
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 10:59 ` Evgeniy Polyakov
@ 2007-03-01 11:00 ` Ingo Molnar
2007-03-01 11:16 ` Evgeniy Polyakov
2007-03-01 11:14 ` Eric Dumazet
` (2 subsequent siblings)
3 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 11:00 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> I've just re-run tests on my amd64 test machine without debug options:
>
> epoll 4794.23
> kevent 6468.95
could you please post the two URLs for the exact evserver code used for
these measurements? (even if you did so already in the past - best to
have them always together with the numbers) Thanks!
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 9:54 ` Ingo Molnar
@ 2007-03-01 10:59 ` Evgeniy Polyakov
2007-03-01 11:00 ` Ingo Molnar
` (3 more replies)
2007-03-01 19:19 ` Davide Libenzi
1 sibling, 4 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 10:59 UTC (permalink / raw)
To: Ingo Molnar
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 10:54:02AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > I posted kevent/epoll benchmarks and related design issues too many
> > times both with handmade applications (which might be broken as hell)
> > and popular open-source servers to repeat them again.
>
> numbers are crucial here - and given the epoll bugs in the evserver code
> that we found, do you have updated evserver benchmark results that
> compare epoll to kevent? I'm wondering why epoll has half the speed of
> kevent in those measurements - i suspect some possible benchmarking bug.
> The queueing model of epoll and kevent is roughly comparable, both do
> only a constant number of steps to serve one particular request,
> regardless of how many pending connections/requests there are. What is
> the CPU utilization of the server system during an epoll test, and what
> is the CPU utilization during a kevent test? 100% utilized in both
> cases?
Yes, it is about 98-100% in both cases.
I've just re-run tests on my amd64 test machine without debug options:
epoll 4794.23
kevent 6468.95
Here are the full client 'ab' outputs for the epoll and kevent servers
(epoll does not contain EPOLLET, as you requested, but it does not look
like it changes performance in my case).
epoll ab output:
# ab -c8000 -n80000 http://192.168.0.48/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.0.48 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests
Server Software: Apache/1.3.27
Server Hostname: 192.168.0.48
Server Port: 80
Document Path: /
Document Length: 3521 bytes
Concurrency Level: 8000
Time taken for tests: 16.686737 seconds
Complete requests: 80000
Failed requests: 0
Write errors: 0
Total transferred: 309760000 bytes
HTML transferred: 281680000 bytes
Requests per second: 4794.23 [#/sec] (mean)
Time per request: 1668.674 [ms] (mean)
Time per request: 0.209 [ms] (mean, across all concurrent
requests)
Transfer rate: 18128.17 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 159 779 110.1 799 921
Processing: 468 866 77.4 869 988
Waiting: 63 426 212.3 425 921
Total: 1145 1646 115.6 1660 1873
Percentage of the requests served within a certain time (ms)
50% 1660
66% 1661
75% 1662
80% 1663
90% 1806
95% 1830
98% 1833
99% 1834
100% 1873 (longest request)
kevent ab output:
# ab -c8000 -n80000 http://192.168.0.48/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.0.48 (be patient)
Completed 8000 requests
Completed 16000 requests
Completed 24000 requests
Completed 32000 requests
Completed 40000 requests
Completed 48000 requests
Completed 56000 requests
Completed 64000 requests
Completed 72000 requests
Finished 80000 requests
Server Software: Apache/1.3.27
Server Hostname: 192.168.0.48
Server Port: 80
Document Path: /
Document Length: 3521 bytes
Concurrency Level: 8000
Time taken for tests: 12.366775 seconds
Complete requests: 80000
Failed requests: 0
Write errors: 0
Total transferred: 317047104 bytes
HTML transferred: 288306522 bytes
Requests per second: 6468.95 [#/sec] (mean)
Time per request: 1236.677 [ms] (mean)
Time per request: 0.155 [ms] (mean, across all concurrent requests)
Transfer rate: 25036.12 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 130 364 871.1 275 9347
Processing: 178 298 42.5 296 580
Waiting: 31 202 65.8 210 369
Total: 411 663 887.0 572 9722
Percentage of the requests served within a certain time (ms)
50% 572
66% 573
75% 618
80% 640
90% 684
95% 709
98% 721
99% 3455
100% 9722 (longest request)
Notice how the percentage of requests served within a certain time
differs between kevent and epoll. And this server does not include
the ready-on-submission kevent optimization.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 14:15 Al Boldi
2007-02-27 19:22 ` Theodore Tso
@ 2007-03-01 10:21 ` Pavel Machek
1 sibling, 0 replies; 277+ messages in thread
From: Pavel Machek @ 2007-03-01 10:21 UTC (permalink / raw)
To: Al Boldi; +Cc: linux-kernel
Hi!
> > But it is true that for most kernel programmers, threaded
> > programming is much easier to understand, and we need to engineer the
> > kernel for what will be maintainable for the majority of the kernel
> > development community.
>
> What's probably true is that, for a kernel to stay competitive you need two
> distinct traits:
>
> 1. Stability
> 2. Performance
>
> And you can't get that, by arguing that the kernel development community
> doesn't have the brains to code for performance, which I dearly doubt.
>
> So, instead of using intimidating language to force one's opinion thru,
> especially when it comes from those in control, why not have a democratic
> vote?
Linus cast his vote, and as that is the only vote that matters, can we end
this discussion? (Eugeny is not willing to listen to Linus, but I
guess that's Eugeny's problem at this point).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 10:11 ` Pavel Machek
@ 2007-03-01 10:19 ` Ingo Molnar
2007-03-01 11:18 ` Evgeniy Polyakov
1 sibling, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 10:19 UTC (permalink / raw)
To: Pavel Machek
Cc: Evgeniy Polyakov, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Pavel Machek <pavel@ucw.cz> wrote:
> > Threadlets behave much worse without an event-driven model, and events can
> > behave worse without backing threads; they are mutually compensating.
>
> I think Ingo demonstrated unoptimized threadlets to be within 5% of
> the speed of kevent. [...]
that was epoll not kevent, but yeah. To me the biggest question is, how
much improvement does kevent bring relative to epoll? Epoll is a pretty
good event queueing API, and it covers all the event sources today. It
is also O(1) throughout, so i dont really understand how it could only
achieve half the speed of kevent in the evserver_epoll/kevent.c
benchmark. We need to understand that first.
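For reference, the O(1) dispatch model being compared looks roughly like
this in an epoll based server (a minimal sketch, not the actual
evserver_epoll.c source; accept_new_connection() and serve_request() are
made-up helpers):

#include <sys/epoll.h>

void accept_new_connection(int epfd);	/* accept() + EPOLL_CTL_ADD */
void serve_request(int fd);

void event_loop(int epfd, int listen_fd)
{
	struct epoll_event events[64];

	for (;;) {
		/* work per iteration is bounded by the number of ready
		   events, not by the number of registered connections */
		int i, n = epoll_wait(epfd, events, 64, -1);

		for (i = 0; i < n; i++) {
			if (events[i].data.fd == listen_fd)
				accept_new_connection(epfd);
			else
				serve_request(events[i].data.fd);
		}
	}
}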
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 9:47 ` Evgeniy Polyakov
2007-03-01 9:54 ` Ingo Molnar
@ 2007-03-01 10:11 ` Pavel Machek
2007-03-01 10:19 ` Ingo Molnar
2007-03-01 11:18 ` Evgeniy Polyakov
1 sibling, 2 replies; 277+ messages in thread
From: Pavel Machek @ 2007-03-01 10:11 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
Hi!
> > > > 10% gain in speed is NOT worth major complexity increase.
> > >
> > > Should I create a patch to remove rb-tree implementation?
> >
> > If you can replace them with something simpler, and no worse than 10%
> > slower in worst case, then go ahead. (We actually tried to do that at
> > some point, only to realize that efence stresses vm subsystem in very
> > unexpected/unfriendly way).
>
> Agh, only 10% in the worst case.
> I think you can not even imagine what tricks networking uses to get at
> least an additional 1% out of the box.
Yep? Feel free to rewrite networking in assembly on Eugenix. That
should get you a 1% improvement. If you reserve a few registers to be used
only by the kernel (not allowed for userspace), you can speed up networking
5%, too. Oh, and you could turn off the MMU; that is a sure way to get a few
more percent improvement in your networking case.
> Using such logic you can just abandon any further development, since it
> works as is right now.
Stop trying to pervert my logic.
> > > That practice is stupid IMO.
> >
> > Too bad. Now you can start Linux fork called Eugenix.
> >
> > (But really, Linux is not "maximum performance at any cost". Linux is
> > "how fast can we get that while keeping it maintainable?").
>
> Should I read it like: we do not understand what it is and thus we do
> not want it?
Actually, yes, that's a concern. If your code is so crappy that we
can't understand it, guess what, it is not going to be merged. Notice
that someone will have to maintain your code if you get hit by a bus.
If your code is so complex that it is almost impossible to use from
userspace, that is good enough reason not to be merged. "But it is 3%
faster if..." is not a good-enough argument.
> > That is why, while arguing syslets vs. kevents, you need to argue
> > not "kevents are faster because they avoid context switch overhead",
> > but "kevents are _so much_ faster that it is worth the added
> > complexity". And Ingo seems to showing you they are not _so much_
> > faster.
>
> Threadlets behave much worse without an event-driven model, and events can
> behave worse without backing threads; they are mutually compensating.
I think Ingo demonstrated unoptimized threadlets to be within 5% of
the speed of kevent. Demonstrate that kevents are twice as fast as
syslets on a reasonable test case, and I guess we'll listen...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 9:47 ` Evgeniy Polyakov
@ 2007-03-01 9:54 ` Ingo Molnar
2007-03-01 10:59 ` Evgeniy Polyakov
2007-03-01 19:19 ` Davide Libenzi
2007-03-01 10:11 ` Pavel Machek
1 sibling, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-03-01 9:54 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Pavel Machek, Theodore Tso, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> I posted kevent/epoll benchmarks and related design issues too many
> times both with handmade applications (which might be broken as hell)
> and popular open-source servers to repeat them again.
numbers are crucial here - and given the epoll bugs in the evserver code
that we found, do you have updated evserver benchmark results that
compare epoll to kevent? I'm wondering why epoll has half the speed of
kevent in those measurements - i suspect some possible benchmarking bug.
The queueing model of epoll and kevent is roughly comparable, both do
only a constant number of steps to serve one particular request,
regardless of how many pending connections/requests there are. What is
the CPU utilization of the server system during an epoll test, and what
is the CPU utilization during a kevent test? 100% utilized in both
cases?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 9:26 ` Pavel Machek
@ 2007-03-01 9:47 ` Evgeniy Polyakov
2007-03-01 9:54 ` Ingo Molnar
2007-03-01 10:11 ` Pavel Machek
0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 9:47 UTC (permalink / raw)
To: Pavel Machek
Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 10:26:34AM +0100, Pavel Machek (pavel@ucw.cz) wrote:
> > > 10% gain in speed is NOT worth major complexity increase.
> >
> > Should I create a patch to remove rb-tree implementation?
>
> If you can replace them with something simpler, and no worse than 10%
> slower in worst case, then go ahead. (We actually tried to do that at
> some point, only to realize that efence stresses vm subsystem in very
> unexpected/unfriendly way).
Agh, only 10% in the worst case.
I think you can not even imagine what tricks networking uses to get at
least an additional 1% out of the box.
Using such logic you can just abandon any further development, since it
works as is right now.
> > That practice is stupid IMO.
>
> Too bad. Now you can start Linux fork called Eugenix.
>
> (But really, Linux is not "maximum performance at any cost". Linux is
> "how fast can we get that while keeping it maintainable?").
Should I read it like: we do not understand what it is and thus we do
not want it?
> That is why, while arguing syslets vs. kevents, you need to argue
> not "kevents are faster because they avoid context switch overhead",
> but "kevents are _so much_ faster that it is worth the added
> complexity". And Ingo seems to showing you they are not _so much_
> faster.
Threadlets behave much worse without an event-driven model, and events can
behave worse without backing threads; they are mutually compensating.
I posted kevent/epoll benchmarks and related design issues too many
times both with handmade applications (which might be broken as hell)
and popular open-source servers to repeat them again.
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 8:38 ` Evgeniy Polyakov
@ 2007-03-01 9:28 ` Evgeniy Polyakov
0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 9:28 UTC (permalink / raw)
To: Davide Libenzi
Cc: Chris Friesen, Linus Torvalds, Ingo Molnar, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 11:38:06AM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > struct async_syscall {
> > long *result;
> > unsigned long asynid;
> > unsigned long nr_sysc;
> > unsigned long params[8];
> > };
>
> Having the result pointer as NULL might imply that the result is not of
> interest and thus can be discarded, and even the async syscall itself will
> not be returned through async_wait().
> More flexible would be having a request flags field in the structure.
Ugh, that is already implemented in v5.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 8:18 ` Evgeniy Polyakov
@ 2007-03-01 9:26 ` Pavel Machek
2007-03-01 9:47 ` Evgeniy Polyakov
2007-03-01 19:24 ` Johann Borck
1 sibling, 1 reply; 277+ messages in thread
From: Pavel Machek @ 2007-03-01 9:26 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
Hi!
> > > I understand that - and I totally agree.
> > > But when more complex, more bug-prone code results in higher performance
> > > - that must be used. We have linked lists and binary trees - the latter
> >
> > No-o. Kernel is not designed like that.
> >
> > Often, more complex and slightly faster code exists, and we simply use
> > slower variant, because it is fast enough.
> >
> > 10% gain in speed is NOT worth major complexity increase.
>
> Should I create a patch to remove rb-tree implementation?
If you can replace them with something simpler, and no worse than 10%
slower in worst case, then go ahead. (We actually tried to do that at
some point, only to realize that efence stresses vm subsystem in very
unexpected/unfriendly way).
> That practice is stupid IMO.
Too bad. Now you can start Linux fork called Eugenix.
(But really, Linux is not "maximum performance at any cost". Linux is
"how fast can we get that while keeping it maintainable?").
That is why, while arguing syslets vs. kevents, you need to argue
not "kevents are faster because they avoid context switch overhead",
but "kevents are _so much_ faster that it is worth the added
complexity". And Ingo seems to showing you they are not _so much_
faster.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-03-01 1:33 ` Andrea Arcangeli
@ 2007-03-01 9:15 ` Evgeniy Polyakov
0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 9:15 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Ingo Molnar, Linus Torvalds, Davide Libenzi, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Thu, Mar 01, 2007 at 02:33:01AM +0100, Andrea Arcangeli (andrea@suse.de) wrote:
> On Thu, Mar 01, 2007 at 12:12:28AM +0100, Ingo Molnar wrote:
> > more capable by providing more special system calls like sys_upcall() to
> > execute a user-space function. (that way a syslet could still execute
> > user-space code without having to exit out of kernel mode too
> > frequently) Or perhaps a sys_x86_bytecode() call, that would execute a
> > pre-verified, kernel-stored sequence of simplified x86 bytecode, using
> > the kernel stack.
>
> Which means the userspace code would then run with kernel privilege
> level somehow (after security verifier, whatever). You remember I
> think it's a plain crazy idea...
Syslets/threadlets do not execute userspace code in the kernel - they behave
similarly to threads. sys_upcall() would be a wrapper for quite complex
threadlet machinery.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 19:42 ` Davide Libenzi
@ 2007-03-01 8:38 ` Evgeniy Polyakov
2007-03-01 9:28 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 8:38 UTC (permalink / raw)
To: Davide Libenzi
Cc: Chris Friesen, Linus Torvalds, Ingo Molnar, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Wed, Feb 28, 2007 at 11:42:24AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Wed, 28 Feb 2007, Chris Friesen wrote:
>
> > Davide Libenzi wrote:
> >
> > > struct async_syscall {
> > > unsigned long nr_sysc;
> > > unsigned long params[8];
> > > long *result;
> > > };
> > >
> > > And what would async_wait() return back? Pointers to "struct async_syscall"
> > > or pointers to "result"?
> >
> > Either one has downsides. Pointer to struct async_syscall requires that the
> > caller keep the struct around. Pointer to result requires that the caller
> > always reserve a location for the result.
> >
> > Does the kernel care about the (possibly rare) case of callers that don't want
> > to pay attention to result? If so, what about adding some kind of
> > caller-specified handle to struct async_syscall, and having async_wait()
> > return the handle? In the case where the caller does care about the result,
> > the handle could just be the address of result.
>
> Something like this (with async_wait() returning asynid's)?
>
> struct async_syscall {
> long *result;
> unsigned long asynid;
> unsigned long nr_sysc;
> unsigned long params[8];
> };
Having the result pointer as NULL might imply that the result is not of
interest and thus can be discarded, and even the async syscall itself will
not be returned through async_wait().
More flexible would be having a request flags field in the structure.
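Roughly something like this (the flags field and the flag name are
hypothetical, just to illustrate the idea):

struct async_syscall {
	long *result;		/* NULL: caller does not care about the result */
	unsigned long asynid;
	unsigned long flags;	/* e.g. a hypothetical ASYNC_SYSC_NOREPORT bit to
				   suppress completion delivery via async_wait() */
	unsigned long nr_sysc;
	unsigned long params[8];
};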
> - Davide
>
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 16:14 ` Pavel Machek
@ 2007-03-01 8:18 ` Evgeniy Polyakov
2007-03-01 9:26 ` Pavel Machek
2007-03-01 19:24 ` Johann Borck
0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-03-01 8:18 UTC (permalink / raw)
To: Pavel Machek
Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Wed, Feb 28, 2007 at 04:14:14PM +0000, Pavel Machek (pavel@ucw.cz) wrote:
> Hi!
>
> > > I think what you are not hearing, and what everyone else is saying
> > > (INCLUDING Linus), is that for most programmers, state machines are
> > > much, much harder to program, understand, and debug compared to
> > > multi-threaded code. You may disagree (were you a MacOS 9 programmer
> > > in another life?), and it may not even be true for you if you happen
> > > to be one of those folks more at home with Scheme continuations, for
> > > example. But it is true that for most kernel programmers, threaded
> > > programming is much easier to understand, and we need to engineer the
> > > kernel for what will be maintainable for the majority of the kernel
> > > development community.
> >
> > I understand that - and I totally agree.
> > But when more complex, more bug-prone code results in higher performance
> > - that must be used. We have linked lists and binary trees - the latter
>
> No-o. Kernel is not designed like that.
>
> Often, more complex and slightly faster code exists, and we simply use
> slower variant, because it is fast enough.
>
> 10% gain in speed is NOT worth major complexity increase.
Should I create a patch to remove rb-tree implementation?
That practice is stupid IMO.
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 23:12 ` Ingo Molnar
@ 2007-03-01 1:33 ` Andrea Arcangeli
2007-03-01 9:15 ` Evgeniy Polyakov
2007-03-01 21:27 ` Linus Torvalds
1 sibling, 1 reply; 277+ messages in thread
From: Andrea Arcangeli @ 2007-03-01 1:33 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Davide Libenzi, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, Evgeniy Polyakov,
David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Thu, Mar 01, 2007 at 12:12:28AM +0100, Ingo Molnar wrote:
> more capable by providing more special system calls like sys_upcall() to
> execute a user-space function. (that way a syslet could still execute
> user-space code without having to exit out of kernel mode too
> frequently) Or perhaps a sys_x86_bytecode() call, that would execute a
> pre-verified, kernel-stored sequence of simplified x86 bytecode, using
> the kernel stack.
Which means the userspace code would then run with kernel privilege
level somehow (after security verifier, whatever). You remember I
think it's a plain crazy idea...
I don't want to argue about syslets, threadlets, whatever async or
syscall-merging mechanism here, I'm just focusing on the idea of yours
of running userland code in kernel space somehow (I hoped you given up
on it by now). Fixing the greatest syslets limitation is going to open
a can of worms as far as security is concerned.
The fact that userland code must not run with kernel privilege level,
is the reason why syslets aren't very useful (but again: focusing on
the syslets vs async-syscalls isn't my interest).
Frankly I think this idea of running userland code with kernel
privileges fits in the same category as porting Linux to segmentation
to avoid the cost of pagetables to gain some bit of performance
despite losing in many other areas. Nobody in real life will want to
make that trade, for such an incredibly small performance
improvement.
For things that can be frequently combined, it's much simpler and
cheaper to create a "merged" syscall (i.e. sys_spawn =
sys_fork+sys_exec) than to invent a way to upload userland generated
bytecodes to kernel space to do that.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 16:42 ` Linus Torvalds
2007-02-28 17:26 ` Ingo Molnar
2007-02-28 18:22 ` Davide Libenzi
@ 2007-02-28 23:12 ` Ingo Molnar
2007-03-01 1:33 ` Andrea Arcangeli
2007-03-01 21:27 ` Linus Torvalds
2 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-28 23:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Davide Libenzi, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> So I would repeat my call for getting rid of the atoms, and instead
> just do a "single submission" at a time. Do the linking by running a
> threadlet that has user space code (and the stack overhead), which is
> MUCH more flexible. And do nonlinked single system calls without
> *either* atoms *or* a user-space stack footprint.
I agree that threadlets are much more flexible - and they might in fact
win in the long run due to that.
i'll add a one-shot syscall API in v6 and then we'll be able to see them
side by side. (wanted to do that in v5 but it got delayed by x86_64
issues, x86_64's entry code is certainly ... tricky wrt. ptregs saving)
wrt. one-shot syscalls, the user-space stack footprint would still
probably be there, because even async contexts that only do single-shot
processing need to drop out of kernel mode to handle signals. We could
probably hack the signal routing code to never deliver to such threads
(but bounce it over to the head context, which is always available) but
i think that would be a bit messy. (i dont exclude it though)
I think syslets might also act as a prototyping platform for new system
calls. If any particular syslet atom string comes up more frequently
(and we could even automate the profiling of that within the kernel),
then it's a good candidate for a standalone syscall. Currently we dont
have such information in any structured way: the connection between
streams of syscalls done by applications is totally opaque to the
kernel.
Also, i genuinely believe that to be competitive (performance-wise) with
fully in-kernel queueing solutions, we need syslets - the syslet NULL
overhead is 20 cycles (this includes copying, engine overhead, etc.),
the syscall NULL overhead is 280-300 cycles. It could probably be made
more capable by providing more special system calls like sys_upcall() to
execute a user-space function. (that way a syslet could still execute
user-space code without having to exit out of kernel mode too
frequently) Or perhaps a sys_x86_bytecode() call, that would execute a
pre-verified, kernel-stored sequence of simplified x86 bytecode, using
the kernel stack.
My fear is that if we force all these things over to one-shot syscalls
or threadlets then this will become another second-tier mechanism. By
providing syslets we give the message: "sure, come on and play within
the kernel if you want to, but it's not easy".
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 22:22 ` Ingo Molnar
@ 2007-02-28 22:47 ` Davide Libenzi
0 siblings, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 22:47 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Wed, 28 Feb 2007, Ingo Molnar wrote:
> > Or with a simple/parallel async submission, coupled with threadlets,
> > we can cover a pretty broad range of real life use cases?
>
> sure, if we debate its virtualization driven market penetration via self
> promoting technologies that also drive customer satisfaction, then we'll
> be able to increase shareholder value by improving the user experience
> and we'll also succeed in turning this vision into a supply/demand
> marketplace. Or not?
Okkey then, I guess it's good to go as is :)
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 21:46 ` Davide Libenzi
@ 2007-02-28 22:22 ` Ingo Molnar
2007-02-28 22:47 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-28 22:22 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Davide Libenzi <davidel@xmailserver.org> wrote:
> On Wed, 28 Feb 2007, Ingo Molnar wrote:
>
> > * Davide Libenzi <davidel@xmailserver.org> wrote:
> >
> > > Did you hide all the complexity of the userspace atom decoding inside
> > > another function? :)
> >
> > no, i made the 64-bit and 32-bit structures layout-compatible. This
> > makes the 32-bit structure as large as the 64-bit ones, but that's not a
> > big issue, compared to the simplifications it brings.
>
> Do you have a new version to review?
yep, i've just released -v5.
> How about this, with async_wait returning asynid's back to a userspace
> ring buffer?
>
> struct syslet_uatom {
> long *result;
> unsigned long asynid;
> unsigned long nr_sysc;
> unsigned long params[8];
> };
we talked about the parameters at length: if they are pointers the
layout is significantly more flexible and more capable. It's a pretty
similar argument to the return-pointer thing. For example take a look at
how the IO syslet atoms in Jens' FIO engine share the same fd. Even if
there's 20000 of them. And they are fully cacheable in constructed
state. The same goes for the webserving examples i've got in the
async-test userspace sample code. I can pick up a cached request and
only update req->fd, i dont have to reinit the atoms at all. It stays
nicely in the cache, is not re-dirtied, etc.
furthermore, having the parameters as pointers is also an optimization:
look at the copy_uatom() x86 assembly code i did - it can do a simple
jump out of the parameter fetching code. I actually tried /both/ of
these variants in assembly (as i mentioned it in a previous reply, in
the v1 thread) and the speed difference between a pointer and
non-pointer variant was negligible. (even with 6 parameters filled in)
but yes ... another two more small changes and your layout will be
awfully similar to the current uatom layout =B-)
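to make the pointer-parameters point concrete, a rough sketch (the layout
and names below are illustrative only, not the exact syslet_uatom
declaration from the patches):

#include <sys/syscall.h>	/* __NR_read */

struct uatom_sketch {
	unsigned long nr;	/* syscall number */
	long *ret_ptr;		/* where the return value gets written */
	unsigned long *args[6];	/* pointers to the argument words */
};

static long ret;
static unsigned long a_fd, a_buf, a_len;

static struct uatom_sketch read_atom = {	/* built once, cached */
	.nr = __NR_read, .ret_ptr = &ret,
	.args = { &a_fd, &a_buf, &a_len },
};

static void reuse_atom(int fd)
{
	/* per request only the pointed-to fd word is rewritten; the atom
	   itself is not re-dirtied and stays nicely in the cache */
	a_fd = fd;
}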
> My problem with the syslets in their current form is, do we have a
> real use for them that justifies the extra complexity inside the kernel?
i call bullshit. really. I have just gone out and wasted some time
cutting & pasting all the syslet engine code: it is 153 lines total,
plus 51 lines of comments. The total patchset in comparison is:
35 files changed, 1890 insertions(+), 71 deletions(-)
(and this over-estimates it because if this got removed then we'd still
have to add an async execution syscall.) And the code is pretty compact
and self-contained. Threadlets share much of the infrastructure with
syslets: for example the completion ring code is _100%_ shared, the
async execution code is 98% shared.
You are free to not like it though, and i'm willing to change any aspect
of the API to make it more intuitive and more useful, but calling it
'complexity' at this point is just handwaving. And believe it or not, a
good number of people actually find syslets pretty cool.
> Or with a simple/parallel async submission, coupled with threadlets,
> we can cover a pretty broad range of real life use cases?
sure, if we debate its virtualization driven market penetration via self
promoting technologies that also drive customer satisfaction, then we'll
be able to increase shareholder value by improving the user experience
and we'll also succeed in turning this vision into a supply/demand
marketplace. Or not?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 21:23 ` Ingo Molnar
@ 2007-02-28 21:46 ` Davide Libenzi
2007-02-28 22:22 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 21:46 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Wed, 28 Feb 2007, Ingo Molnar wrote:
> * Davide Libenzi <davidel@xmailserver.org> wrote:
>
> > Did you hide all the complexity of the userspace atom decoding inside
> > another function? :)
>
> no, i made the 64-bit and 32-bit structures layout-compatible. This
> makes the 32-bit structure as large as the 64-bit ones, but that's not a
> big issue, compared to the simplifications it brings.
Do you have a new version to review?
> > > But i'm happy to change the syslet API in any sane way, and did so
> > > based on feedback from Jens who is actually using them.
> >
> > Wouldn't you agree on a simple/parallel execution engine [...]
>
> the thing is, there's almost zero overhead from having those basic
> things like conditions and the ->next link, and they make it so much
> more capable. As usual my biggest problem is that you are not trying to
> use syslets at all - you are only trying to get rid of them ;-) My
> purpose with syslets is to enable a syslet to do almost anything that
> user-space could do too, as simply as possible. Syslets could even
> allocate user-space memory and then use it (i dont think we actually
> want to do that though). That doesnt mean arbitrarily complex code
> /should/ be done via syslets, or that it wont be significantly slower
> than what user-space can do, but i'd not like to artificially dumb the
> engine down. I'm totally willing to simplify/shrink the vectoring of
> arguments and just about anything else, but your proposals so far (such
> as your return-value-embedded-in-atom suggestion) all kill important
> aspects of the engine.
Ok, we're past the error code in the atom, as Linus pointed out ;)
How about this, with async_wait returning asynid's back to a userspace
ring buffer?
struct syslet_uatom {
long *result;
unsigned long asynid;
unsigned long nr_sysc;
unsigned long params[8];
};
My problem with the syslets in their current form is, do we have a real
use for them that justifies the extra complexity inside the kernel? Or with
a simple/parallel async submission, coupled with threadlets, we can cover
a pretty broad range of real life use cases?
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 21:09 ` Davide Libenzi
@ 2007-02-28 21:23 ` Ingo Molnar
2007-02-28 21:46 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-28 21:23 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Davide Libenzi <davidel@xmailserver.org> wrote:
> On Wed, 28 Feb 2007, Ingo Molnar wrote:
>
> >
> > * Davide Libenzi <davidel@xmailserver.org> wrote:
> >
> > > My point is, the syslet infrastructure is expensive for the kernel in
> > > terms of compat, [...]
> >
> > it is not. Today i've implemented 64-bit syslets on x86_64 and
> > 32-bit-on-64-bit compat syslets. Both the 64-bit and the 32-bit syslet
> > (and threadlet) binaries work just fine on a 64-bit kernel, and they
> > share 99% of the infrastructure. There's only a single #ifdef
> > CONFIG_COMPAT in kernel/async.c:
> >
> > #ifdef CONFIG_COMPAT
> >
> > asmlinkage struct syslet_uatom __user *
> > compat_sys_async_exec(struct syslet_uatom __user *uatom,
> > struct async_head_user __user *ahu)
> > {
> > return __sys_async_exec(uatom, ahu, &compat_sys_call_table,
> > compat_NR_syscalls);
> > }
> >
> > #endif
>
> Did you hide all the complexity of the userspace atom decoding inside
> another function? :)
no, i made the 64-bit and 32-bit structures layout-compatible. This
makes the 32-bit structure as large as the 64-bit ones, but that's not a
big issue, compared to the simplifications it brings.
> > But i'm happy to change the syslet API in any sane way, and did so
> > based on feedback from Jens who is actually using them.
>
> Wouldn't you agree on a simple/parallel execution engine [...]
the thing is, there's almost zero overhead from having those basic
things like conditions and the ->next link, and they make it so much
more capable. As usual my biggest problem is that you are not trying to
use syslets at all - you are only trying to get rid of them ;-) My
purpose with syslets is to enable a syslet to do almost anything that
user-space could do too, as simply as possible. Syslets could even
allocate user-space memory and then use it (i dont think we actually
want to do that though). That doesnt mean arbitrarily complex code
/should/ be done via syslets, or that it wont be significantly slower
than what user-space can do, but i'd not like to artificially dumb the
engine down. I'm totally willing to simplify/shrink the vectoring of
arguments and just about anything else, but your proposals so far (such
as your return-value-embedded-in-atom suggestion) all kill important
aspects of the engine.
All the existing syslet features were purpose-driven: i actually coded
up a sample syslet, trying to do something that makes sense, and added
these features based on that. The engine core takes up maybe 50 lines of
code.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 20:21 ` Ingo Molnar
@ 2007-02-28 21:09 ` Davide Libenzi
2007-02-28 21:23 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 21:09 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Wed, 28 Feb 2007, Ingo Molnar wrote:
>
> * Davide Libenzi <davidel@xmailserver.org> wrote:
>
> > My point is, the syslet infrastructure is expensive for the kernel in
> > terms of compat, [...]
>
> it is not. Today i've implemented 64-bit syslets on x86_64 and
> 32-bit-on-64-bit compat syslets. Both the 64-bit and the 32-bit syslet
> (and threadlet) binaries work just fine on a 64-bit kernel, and they
> share 99% of the infrastructure. There's only a single #ifdef
> CONFIG_COMPAT in kernel/async.c:
>
> #ifdef CONFIG_COMPAT
>
> asmlinkage struct syslet_uatom __user *
> compat_sys_async_exec(struct syslet_uatom __user *uatom,
> struct async_head_user __user *ahu)
> {
> return __sys_async_exec(uatom, ahu, &compat_sys_call_table,
> compat_NR_syscalls);
> }
>
> #endif
Did you hide all the complexity of the userspace atom decoding inside
another function? :)
How much code would go away, in case we pick a simple/parallel
sys_async_exec engine? Atoms decoding, special userspace variable access
for loops, jumps/cond/... VM engine.
> Even mixed-mode syslets should work (although i havent specifically
> tested them), where the head switches between 64-bit and 32-bit mode and
> submits syslets from both 64-bit and from 32-bit mode, and at the same
> time there might be both 64-bit and 32-bit syslets 'in flight'.
>
> But i'm happy to change the syslet API in any sane way, and did so based
> on feedback from Jens who is actually using them.
Wouldn't you agree on a simple/parallel execution engine like me and Linus
are proposing (and threadlets, of course)?
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 16:17 ` Davide Libenzi
2007-02-28 16:42 ` Linus Torvalds
@ 2007-02-28 20:21 ` Ingo Molnar
2007-02-28 21:09 ` Davide Libenzi
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-28 20:21 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Davide Libenzi <davidel@xmailserver.org> wrote:
> My point is, the syslet infrastructure is expensive for the kernel in
> terms of compat, [...]
it is not. Today i've implemented 64-bit syslets on x86_64 and
32-bit-on-64-bit compat syslets. Both the 64-bit and the 32-bit syslet
(and threadlet) binaries work just fine on a 64-bit kernel, and they
share 99% of the infrastructure. There's only a single #ifdef
CONFIG_COMPAT in kernel/async.c:
#ifdef CONFIG_COMPAT
asmlinkage struct syslet_uatom __user *
compat_sys_async_exec(struct syslet_uatom __user *uatom,
struct async_head_user __user *ahu)
{
return __sys_async_exec(uatom, ahu, &compat_sys_call_table,
compat_NR_syscalls);
}
#endif
Even mixed-mode syslets should work (although i havent specifically
tested them), where the head switches between 64-bit and 32-bit mode and
submits syslets from both 64-bit and from 32-bit mode, and at the same
time there might be both 64-bit and 32-bit syslets 'in flight'.
But i'm happy to change the syslet API in any sane way, and did so based
on feedback from Jens who is actually using them.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 19:03 ` Chris Friesen
@ 2007-02-28 19:42 ` Davide Libenzi
2007-03-01 8:38 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 19:42 UTC (permalink / raw)
To: Chris Friesen
Cc: Linus Torvalds, Ingo Molnar, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, Evgeniy Polyakov,
David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Wed, 28 Feb 2007, Chris Friesen wrote:
> Davide Libenzi wrote:
>
> > struct async_syscall {
> > unsigned long nr_sysc;
> > unsigned long params[8];
> > long *result;
> > };
> >
> > And what would async_wait() return back? Pointers to "struct async_syscall"
> > or pointers to "result"?
>
> Either one has downsides. Pointer to struct async_syscall requires that the
> caller keep the struct around. Pointer to result requires that the caller
> always reserve a location for the result.
>
> Does the kernel care about the (possibly rare) case of callers that don't want
> to pay attention to result? If so, what about adding some kind of
> caller-specified handle to struct async_syscall, and having async_wait()
> return the handle? In the case where the caller does care about the result,
> the handle could just be the address of result.
Something like this (with async_wait() returning asynid's)?
struct async_syscall {
long *result;
unsigned long asynid;
unsigned long nr_sysc;
unsigned long params[8];
};
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 18:50 ` Davide Libenzi
@ 2007-02-28 19:03 ` Chris Friesen
2007-02-28 19:42 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Chris Friesen @ 2007-02-28 19:03 UTC (permalink / raw)
To: Davide Libenzi
Cc: Linus Torvalds, Ingo Molnar, Ulrich Drepper,
Linux Kernel Mailing List, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, Evgeniy Polyakov,
David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
Davide Libenzi wrote:
> struct async_syscall {
> unsigned long nr_sysc;
> unsigned long params[8];
> long *result;
> };
>
> And what would async_wait() return back? Pointers to "struct async_syscall"
> or pointers to "result"?
Either one has downsides. Pointer to struct async_syscall requires that
the caller keep the struct around. Pointer to result requires that the
caller always reserve a location for the result.
Does the kernel care about the (possibly rare) case of callers that
don't want to pay attention to result? If so, what about adding some
kind of caller-specified handle to struct async_syscall, and having
async_wait() return the handle? In the case where the caller does care
about the result, the handle could just be the address of result.
Chris
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 18:42 ` Linus Torvalds
@ 2007-02-28 18:50 ` Davide Libenzi
2007-02-28 19:03 ` Chris Friesen
0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 18:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Wed, 28 Feb 2007, Linus Torvalds wrote:
> On Wed, 28 Feb 2007, Davide Libenzi wrote:
> >
> > Here we very much agree. The way I'd like it:
> >
> > struct async_syscall {
> > unsigned long nr_sysc;
> > unsigned long params[8];
> > long result;
> > };
>
> No, the "result" needs to go somewhere else. The caller may be totally
> uninterested in keeping the system call number or parameters around until
> the operation completes, but if you put them in the same structure with
> the result, you obviously cannot sanely get rid of them.
>
> I also don't much like read-write interfaces (which the above would be:
> the kernel would read most of the structure, and then write one member of
> the structure).
>
> It's entirely possible, for example, that the operation we submit is some
> legacy "aio_read()", which has soem other structure layout than the new
> one (but one field will be the result code).
Ok, makes sense. Something like this then?
struct async_syscall {
unsigned long nr_sysc;
unsigned long params[8];
long *result;
};
And what would async_wait() return back? Pointers to "struct async_syscall"
or pointers to "result"?
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 18:22 ` Davide Libenzi
@ 2007-02-28 18:42 ` Linus Torvalds
2007-02-28 18:50 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Linus Torvalds @ 2007-02-28 18:42 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Wed, 28 Feb 2007, Davide Libenzi wrote:
>
> Here we very much agree. The way I'd like it:
>
> struct async_syscall {
> unsigned long nr_sysc;
> unsigned long params[8];
> long result;
> };
No, the "result" needs to go somewhere else. The caller may be totally
uninterested in keeping the system call number or parameters around until
the operation completes, but if you put them in the same structure with
the result, you obviously cannot sanely get rid of them.
I also don't much like read-write interfaces (which the above would be:
the kernel would read most of the structure, and then write one member of
the structure).
It's entirely possible, for example, that the operation we submit is some
legacy "aio_read()", which has soem other structure layout than the new
one (but one field will be the result code).
Linus
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 16:42 ` Linus Torvalds
2007-02-28 17:26 ` Ingo Molnar
@ 2007-02-28 18:22 ` Davide Libenzi
2007-02-28 18:42 ` Linus Torvalds
2007-02-28 23:12 ` Ingo Molnar
2 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 18:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Wed, 28 Feb 2007, Linus Torvalds wrote:
> On Wed, 28 Feb 2007, Davide Libenzi wrote:
> >
> > At this point, given how threadlets can be easily/effectively dispatched
> > from userspace, I'd argue against the presence of either single/parallel or syslet
> > submission altogether. Threadlets allow you to code chains *way* more
> > naturally than syslets, and since they basically are like function calls
> > in the fast path, they can be used even for single/parallel submissions.
>
> Well, I agree, except for one thing:
> - user space execution is *inherently* more expensive.
>
> Why? Stack. Stack. Stack.
>
> If you support threadlets with user space code, it means that you need a
> separate user-space stack for each threadlet. That's a potentially *big*
> cost to bear, both from a setup standpoint and from simply a memory
> allocation standpoint.
Right, point taken.
> In short - the only thing I *don't* think is a great idea are those linked
> lists of atoms. I still think it's a pretty horrible interface, and I
> still don't think it really buys us very much. The only way it would buy
> us a lot is to change the linked lists dynamically (ie add new events at
> the end while old events are still executing), but quite frankly, that
> just makes the whole interface *even*worse* and just makes me have
> debugging nightmares (I'm also not even convinced it really would help
> us: we might avoid some costs of adding new events, but it would only
> avoid them for serial execution, and if the whole point of this is to
> execute things in parallel, that's a stupid thing to do).
>
> So I would repeat my call for getting rid of the atoms, and instead just
> do a "single submission" at a time. Do the linking by running a threadlet
> that has user space code (and the stack overhead), which is MUCH more
> flexible. And do nonlinked single system calls without *either* atoms *or*
> a user-space stack footprint.
Here we very much agree. The way I'd like it:
struct async_syscall {
unsigned long nr_sysc;
unsigned long params[8];
long result;
};
int async_exec(struct async_syscall *a, int n);
or:
int async_exec(struct async_syscall **a, int n);
At this point I'm ok even with the userspace ring buffer, returning
back pointers to "struct async_syscall".
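From the application side that could look roughly like this (a sketch of
the interface proposed above, not existing syscalls; the async_wait()
signature and the helper are assumed purely for illustration):

#include <sys/syscall.h>	/* __NR_read */

int async_exec(struct async_syscall *a, int n);	/* as proposed above */
int async_wait(struct async_syscall **done, int n);	/* assumed shape */
void handle_completion(struct async_syscall *a);	/* made-up helper */

static long res;
static char buf[4096];

void submit_read(int fd)
{
	static struct async_syscall req;	/* must stay valid until completion */

	req.nr_sysc = __NR_read;
	req.params[0] = fd;
	req.params[1] = (unsigned long)buf;
	req.params[2] = sizeof(buf);
	req.result = &res;

	async_exec(&req, 1);			/* submit one request */
}

void collect(void)
{
	struct async_syscall *done[16];
	int i, n = async_wait(done, 16);	/* pointers to completed requests */

	for (i = 0; i < n; i++)
		handle_completion(done[i]);
}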
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 16:42 ` Linus Torvalds
@ 2007-02-28 17:26 ` Ingo Molnar
2007-02-28 18:22 ` Davide Libenzi
2007-02-28 23:12 ` Ingo Molnar
2 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-28 17:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: Davide Libenzi, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> [...] The only way it would buy us a lot is to change the linked lists
> dynamically (ie add new events at the end while old events are still
> executing), [...]
that's quite close to what Jens' FIO plugin for syslets
(engines/syslet-rw.c) does currently: it builds lists of syslets as IO
gets submitted, batches them up for some time and then sends them off.
It is a natural next step to do this for in-flight syslets as well.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 8:02 ` Evgeniy Polyakov
@ 2007-02-28 17:01 ` Michael K. Edwards
0 siblings, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-28 17:01 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On 2/28/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 130 lines skipped...
Yeah, I edited it down a lot before sending it. :-)
> I have only one question - wasn't it too lazy to write all that? :)
I'm pretty lazy all right. But occasionally an interesting problem
(and revamping AIO is very interesting) makes me think, and what
little thinking I do is always accompanied by writing. Once I've
thought something through to the point that I think I understand the
problem, I've even been known to attempt a solution. Not always,
though; more often, I find a new interesting problem, or else I am
forcibly reminded that I should be spending my little store of insight
on revenue-producing activity.
In this instance, there didn't seem to be any harm in sending my
thoughts to LKML as I wrote them, on the off chance that Ingo or
Davide would get some value out of them in this design cycle (which
any code I eventually get around to producing will miss). So far,
I've gotten some rather dismissive pushback from Ingo and Alan (who
seem to have no interest outside x86 and less understanding than I
would have thought of what real userspace code looks like), a "why
preach to people who know more than you do" from Davide, a brief aside
on the dominance of x86 from Oleg, and one off-list "keep up the good
work". Not a very rich harvest from (IMHO) pretty good seeds.
In short, so far the "Linux kernel community" is upholding its
reputation for insularity, arrogance, coding without prior design,
lack of interest in userspace problems, and inability to learn from
the mistakes of others. (None of these characterizations depends on
there being any real insight in anything I have written.) Linus
himself has a very different reputation -- plenty of arrogance all
right, but genuine brilliance and hard work, and sincere (if cranky)
efforts to explain the "theory of operations" underlying central
design choices. So far he hasn't commented directly on anything I
have had to say; it will be interesting to see whether he tells me to
stop annoying the pros and to go away until I have some code to
contribute.
Happy hacking,
- Michael
P. S. I do think "threadlets" are brilliant, though, and reading
Ingo's patches gave me a much better idea of what would be involved in
prototyping Asynchronously Executed I/O Unit opcodes.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 16:17 ` Davide Libenzi
@ 2007-02-28 16:42 ` Linus Torvalds
2007-02-28 17:26 ` Ingo Molnar
` (2 more replies)
2007-02-28 20:21 ` Ingo Molnar
1 sibling, 3 replies; 277+ messages in thread
From: Linus Torvalds @ 2007-02-28 16:42 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Wed, 28 Feb 2007, Davide Libenzi wrote:
>
> At this point, given how threadlets can be easily/effectively dispatched
> from userspace, I'd argue against the presence of either single/parallel or syslet
> submission altogether. Threadlets allow you to code chains *way* more
> naturally than syslets, and since they basically are like function calls
> in the fast path, they can be used even for single/parallel submissions.
Well, I agree, except for one thing:
- user space execution is *inherently* more expensive.
Why? Stack. Stack. Stack.
If you support threadlets with user space code, it means that you need a
separate user-space stack for each threadlet. That's a potentially *big*
cost to bear, both from a setup standpoint and from simply a memory
allocation standpoint.
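(For scale, with purely illustrative numbers: at even a modest 8 KB of user
stack per threadlet, 10,000 threadlets blocked at the same time pin roughly
80 MB of stack, before counting any setup cost.)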
Quite frankly, I think threadlets are a great idea, but I think the lack
of user-level footprint is *also* a great idea, and you should support
both.
In short - the only thing I *don't* think is a great idea are those linked
lists of atoms. I still think it's a pretty horrible interface, and I
still don't think it really buys us very much. The only way it would buy
us a lot is to change the linked lists dynamically (ie add new events at
the end while old events are still executing), but quite frankly, that
just makes the whole interface *even*worse* and just makes me have
debugging nightmares (I'm also not even convinced it really would help
us: we might avoid some costs of adding new events, but it would only
avoid them for serial execution, and if the whole point of this is to
execute things in parallel, that's a stupid thing to do).
So I would repeat my call for getting rid of the atoms, and instead just
do a "single submission" at a time. Do the linking by running a threadlet
that has user space code (and the stack overhead), which is MUCH more
flexible. And do nonlinked single system calls without *either* atoms *or*
a user-space stack footprint.
Please?
What am I missing?
Linus
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 3:03 ` Michael K. Edwards
2007-02-28 8:02 ` Evgeniy Polyakov
@ 2007-02-28 16:38 ` Phillip Susi
1 sibling, 0 replies; 277+ messages in thread
From: Phillip Susi @ 2007-02-28 16:38 UTC (permalink / raw)
To: Michael K. Edwards
Cc: Theodore Tso, Evgeniy Polyakov, Ingo Molnar, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan.van.de.Ven
Michael K. Edwards wrote:
> State machines are much harder to write without going through a real
> on-paper design phase first. But multi-threaded code is much harder
> for a team of average working coders to write correctly, judging from
> the numerous train wrecks that I've been called in to salvage over the
> last ten years or so.
I have to agree; state machines are harder to design and read, but
multithreaded programs are harder to write and debug _correctly_.
Another way of putting it is that the threadlet approach is easier to
do, but harder to do _right_.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 9:45 ` Ingo Molnar
@ 2007-02-28 16:17 ` Davide Libenzi
2007-02-28 16:42 ` Linus Torvalds
2007-02-28 20:21 ` Ingo Molnar
0 siblings, 2 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-28 16:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Wed, 28 Feb 2007, Ingo Molnar wrote:
>
> * Davide Libenzi <davidel@xmailserver.org> wrote:
>
> > Why can't aio_* be implemented with *simple* (or parallel/unrelated)
> > syscall submit w/out the burden of a complex, limiting and heavy API
>
> there are so many variants of what people think 'asynchronous IO' should
> look like - i'd not like to limit them. I agree that once a particular
> syslet script becomes really popular, it might (and should) in fact be
> pushed into a separate system call.
>
> But i also agree that a one-shot-syscall sys_async() syscall could be
> done too - for those uses where only a single system call is needed and
> where the fetching of a single uatom would be small but nevertheless
> unnecessary overhead. A one-shot async syscall needs to get /8/
> parameters (the syscall nr is the seventh parameter and the return code
> of the nested syscall is the eighth). So at least two parameters will
> have to be passed in indirectly and validated, and 32/64-bit compat
> conversions added, etc. anyway!
At this point, given how threadlets can be easily/effectively dispatched
from userspace, I'd question the need for either single/parallel or syslet
submission at all. Threadlets allow you to code chains *way* more
naturally than syslets, and since they basically are like function calls
in the fast path, they can be used even for single/parallel submissions.
No compat code required (ok, besides the trivial async_wait).
My point is, the syslet infrastructure is expensive for the kernel in
terms of compat and the extra code added to handle the cond/jumps/etc., and
is also non-trivial to use from userspace. Are those big performance
advantages there to justify its existence? I doubt that the price of a
sysenter is a lot bigger than an atom decoding, but I'm looking forward to
being proven wrong by real-life performance numbers ;)
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 12:11 ` Evgeniy Polyakov
2007-02-27 12:13 ` Ingo Molnar
@ 2007-02-28 16:14 ` Pavel Machek
2007-03-01 8:18 ` Evgeniy Polyakov
1 sibling, 1 reply; 277+ messages in thread
From: Pavel Machek @ 2007-02-28 16:14 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
Hi!
> > I think what you are not hearing, and what everyone else is saying
> > (INCLUDING Linus), is that for most programmers, state machines are
> > much, much harder to program, understand, and debug compared to
> > multi-threaded code. You may disagree (were you a MacOS 9 programmer
> > in another life?), and it may not even be true for you if you happen
> > to be one of those folks more at home with Scheme continuations, for
> > example. But it is true that for most kernel programmers, threaded
> > programming is much easier to understand, and we need to engineer the
> > kernel for what will be maintainable for the majority of the kernel
> > development community.
>
> I understand that - and I totally agree.
> But when more complex, more bug-prone code results in higher performance
> - that must be used. We have linked lists and binary trees - the latter
No-o. The kernel is not designed like that.
Often, more complex and slightly faster code exists, and we simply use the
slower variant, because it is fast enough.
A 10% gain in speed is NOT worth a major complexity increase.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 19:38 ` Davide Libenzi
@ 2007-02-28 9:45 ` Ingo Molnar
2007-02-28 16:17 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-28 9:45 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Davide Libenzi <davidel@xmailserver.org> wrote:
> > my current thinking is that special-purpose (non-programmable,
> > static) APIs like aio_*() and lio_*(), where every last cycle of
> > performance matters, should be implemented using syslets - even if
> > it is quite tricky to write syslets (which they no doubt are - just
> > compare the size of syslet-test.c to threadlet-test.c). So i'd move
> > syslets into the same category as raw syscalls: pieces of the raw
> > infrastructure between the kernel and glibc, not an exposed API to
> > apps. [and even if we keep them in that category they still need
> > quite a bit of API work, to clean up the 32/64-bit issues, etc.]
>
> Why can't aio_* be implemented with *simple* (or parallel/unrelated)
> syscall submit w/out the burden of a complex, limiting and heavy API
there are so many variants of what people think 'asynchronous IO' should
look like - i'd not like to limit them. I agree that once a particular
syslet script becomes really popular, it might (and should) in fact be
pushed into a separate system call.
But i also agree that a one-shot-syscall sys_async() syscall could be
done too - for those uses where only a single system call is needed and
where the fetching of a single uatom would be small but nevertheless
unnecessary overhead. A one-shot async syscall needs to get /8/
parameters (the syscall nr is the seventh parameter and the return code
of the nested syscall is the eighth). So at least two parameters will
have to be passed in indirectly and validated, and 32/64-bit compat
conversions added, etc. anyway!
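For illustration only (this struct and prototype are hypothetical, not the
proposed interface), the parameter-count problem looks roughly like this - a
syscall can pass at most about six values in registers, so the last two have
to go through user memory:

/* hypothetical sketch: a one-shot async submission needs 8 values, but
 * only ~6 fit in syscall registers, so the remainder lives in a small
 * user-memory struct that the kernel must copy in, validate and
 * compat-convert for 32/64-bit callers */
struct async_extra {
        unsigned long   nr;     /* 7th value: number of the syscall to run */
        long            *retp;  /* 8th value: where to store its return code */
};

long sys_async(unsigned long a1, unsigned long a2, unsigned long a3,
               unsigned long a4, unsigned long a5,
               struct async_extra *extra);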
The copy_uatom() assembly code i did is really fast so i doubt there
would be much measurable performance difference between the two
solutions. Plus, putting the uatom into user memory allows the caching
of uatoms - further diluting the advantage of passing in the values per
register. The whole difference should be on the order of 10 cycles, so
this really isn't a high prio item in my view.
> Now that chains of syscalls can be way more easily handled with
> clets^wthreadlets, why would we need the whole syslets crud inside?
no, threadlets don't really solve the basic issue of people wanting to
'combine' syscalls, avoid the syscall entry overhead (even if that is
small), and the desire to rely on kthread->kthread context switching
which is even faster than uthread->uthread context-switching, etc.
Furthermore, syslets don't really cause any new problem. They are almost
totally orthogonal, isolated, and cause no wide infrastructure needs.
as long as syslets remain a syscall-level API, for the measured use of
the likes of glibc and libaio (and not exposed in a programmable manner
to user-space), i see no big problem with them at all. They can also be
used without them having any classic pthread user-state (without linking
to libpthread). Think of it like the raw use of clone(): possible and
useful in some cases, but not something that a typical application would
do. This is a 'raw syscall plugins' thing, to be used by those
user-space entities that use raw syscalls: infrastructure libraries. Raw
syscalls themselves are tied to the platform, are not easily used in
some cases, thus almost no application uses them directly, but uses the
generic functions glibc exposes.
in the long run, sys_syslet_exec(), were it not to establish itself as a
widely used interface, could be implemented purely from user-space too
(say from the VDSO, at much worse performance, but the kernel would stay
backwards compatible with the syscall), so there's almost no risk here.
You don't like it => don't use it. Meanwhile, i'll happily take any
suggestion to make the syslet API more digestible.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-28 3:03 ` Michael K. Edwards
@ 2007-02-28 8:02 ` Evgeniy Polyakov
2007-02-28 17:01 ` Michael K. Edwards
2007-02-28 16:38 ` Phillip Susi
1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-28 8:02 UTC (permalink / raw)
To: Michael K. Edwards
Cc: Theodore Tso, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Tue, Feb 27, 2007 at 07:03:21PM -0800, Michael K. Edwards (medwards.linux@gmail.com) wrote:
>
> State machines are much harder to write without going through a real
> on-paper design phase first. But multi-threaded code is much harder
> for a team of average working coders to write correctly, judging from
> the numerous train wrecks that I've been called in to salvage over the
> last ten years or so.
130 lines skipped...
I have only one question - wasn't it too lazy to write all that? :)
> Cheers,
> - Michael
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 11:52 ` Theodore Tso
2007-02-27 12:11 ` Evgeniy Polyakov
2007-02-27 12:34 ` Ingo Molnar
@ 2007-02-28 3:03 ` Michael K. Edwards
2007-02-28 8:02 ` Evgeniy Polyakov
2007-02-28 16:38 ` Phillip Susi
2 siblings, 2 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-28 3:03 UTC (permalink / raw)
To: Theodore Tso, Evgeniy Polyakov, Ingo Molnar, Linus Torvalds,
Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On 2/27/07, Theodore Tso <tytso@mit.edu> wrote:
> I think what you are not hearing, and what everyone else is saying
> (INCLUDING Linus), is that for most programmers, state machines are
> much, much harder to program, understand, and debug compared to
> multi-threaded code. You may disagree (were you a MacOS 9 programmer
> in another life?), and it may not even be true for you if you happen
> to be one of those folks more at home with Scheme continuations, for
> example. But it is true that for most kernel programmers, threaded
> programming is much easier to understand, and we need to engineer the
> kernel for what will be maintainable for the majority of the kernel
> development community.
State machines are much harder to write without going through a real
on-paper design phase first. But multi-threaded code is much harder
for a team of average working coders to write correctly, judging from
the numerous train wrecks that I've been called in to salvage over the
last ten years or so.
The typical 50-250KLoC multi-threaded C/C++/Java application, even if
it's been shipping to customers for several years, is littered with
locking constructs yet routinely corrupts shared data structures.
Change the number of threads in a pool, adjust the thread priorities,
or move a couple of lines of code around, and you're very likely to
get an unexplained deadlock. God help you if you try to use a
debugger on it -- hundreds of latent race conditions will crop up that
didn't happen to trigger before because the thread didn't get
preempted there.
The only programming languages that I have seen widely used in US
industry (so Lisps and self-consciously functional languages are out)
in which mere mortals write passable multi-threaded applications are
Visual Basic and Python. That's partly because programmers in these
languages are not in the habit of throwing pointers around; but if
that were all there was to it, Java programmers would be a lot more
successful than they are at actually writing threaded programs rather
than nibbling cautiously around the edges with EJB. It also helps a
lot that strings are immutable; but again, Java shares this property.
No, the big difference is that VB and Python dicts and arrays are
thread-safed by the language runtime, and Java collections are not.
So while there may be all sorts of pointless and dangerous
mis-locking, it's "protecting" somewhat higher-level data structures.
What does this have to do with the kernel? Well, if you're going to
create Yet Another Micro^WLightweight-Threading Construct for AIO, it
would be mighty nice not to be slinging bare pointers around until the
IO is actually complete and the kernel isn't going to be touching the
data buffer any more. It would also be mighty nice to have a
thread-safe "request pool" data structure on which actions like bulk
cancellation and iteration over a subset can operate. (The iterator
returned by, say, a three-sided query on an RCU priority queue may
contain _stale_ information, but never _inconsistent_ information.)
I recognize that this is more object-oriented snake oil than kernel
programmers usually tolerate, but it would really help AIO-threaded
programs suck less. It is also very much in the Unix tradition --
what are file descriptors and fd_sets if not object-oriented design?
And if following the socket model was good enough for epoll and
netlink, why not for threadlet pools?
In the best of all possible worlds, AIO would look just like the good
old socket-bind-listen-accept model, except that I/O is transacted on
the "listen" socket as long as it can be serviced from cache, and
accept() only gets a new connection when a delayed I/O arrives. The
object hiding inside the fd returned by socket() would be the
"threadlet pool", and the object hiding inside each fd returned by
accept() would be a threadlet. Only this time you do it right and
make errno(fd) be a vsyscall that returns a threadlet-local error
state, and you assign reasonable semantics to operations on an fd that
has already encountered an exception. Much like IEEE 754, actually.
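Very loosely, and purely as an illustration of that analogy (none of these
aio_pool_*() calls exist; they are hypothetical names chosen only to mirror
the socket model described above), the shape would be something like:

/* hypothetical shape-only sketch of the socket-style AIO analogy */
extern int aio_pool_create(void);                /* like socket(): the threadlet pool  */
extern int aio_pool_listen(int pool, int depth); /* like listen(): max outstanding I/O */
extern int aio_pool_accept(int pool);            /* like accept(): returns a threadlet
                                                    fd only when a delayed I/O arrives */
extern int aio_errno(int threadlet);             /* per-threadlet error state          */

static void serve(void)
{
        int pool = aio_pool_create();

        aio_pool_listen(pool, 128);
        /* I/O issued against 'pool' completes inline while it can be
         * served from cache; anything that had to block shows up here */
        for (;;) {
                int t = aio_pool_accept(pool);
                if (t < 0)
                        break;
                /* inspect aio_errno(t), harvest the result, recycle t */
        }
}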
Anyway, like I said, good threaded code is quite rare. On the other
hand, I have seen plenty of reasonably successful event-loop
programming in C and C++, mostly in MacOS and Windows and PalmOS GUIs
where the OS deals with event queues and event handler registration.
It's not the world's most CPU-efficient strategy because of all those
function pointers and virtual methods, but those costs are dwarfed by
the GUI bounding boxes and repaints and things anyway. More to the
point, writing an event-loop framework for other people to use
involves extensive APIs that are stable in the tiniest details and
extensively documented. Not, perhaps, Plan A for the Linux kernel
community. :-)
Happily, a largely event-driven userspace framework can easily be
stacked on top of a threaded kernel -- as long as they're the right
kind of threads. The right kind of threads do not proliferate malloc
arenas by allowing preemption in mid-malloc. (They may need to
malloc(), and that may be a preemption point relative to _real_
threads, but you shouldn't switch or cancel threadlets there.) The
right kind of threads do not leak locking primitives when cancelled,
because they don't have to take a lock in order to update the right
kind of data structure. The right kind of threads can use floating
point safely as long as they don't expect FPU state to be preserved
across a syscall.
The right kind of threads, in short, work like coroutines or
MacOS/PalmOS "event handlers", with the added convenience of being
able to write them as if they were normal sequential code, with normal
access to a persistent stack frame and to process globals. And if you
do them right, they're cheap to migrate, easy and harmless to throttle
and cancel in bulk, and easy to punt out to an "I/O coprocessor" in
the future. The key is to move data into and out of the "I/O
registers" at well-defined points and not to break the encapsulation
in between. Delay the extraction of results from the "I/O registers"
as long as possible, and the hypothetical AIO coprocessor can go chew
on them in parallel with the main "integer" code flow, which only
stalls when it can't go any further without the I/O result.
If you've got some time to kill, you can even analyze an existing,
well-documented flavor of I/O strings (I like James Antill's Vstr
library myself) and define a set of "AIO opcodes" that manipulate AIO
fds and AIO strings as primitive types, just like the FPU manipulates
floating-point numbers as primitive types. Pick a CPU architecture
with a good range of unused trap opcodes (PPC, for instance) and
actually move the I/O strings into kernel space, mapping the AIO
operations and the I/O string API onto the free trap opcodes (really
no different from syscalls, except it's easier to inspect the assembly
that the compiler produces and see what's going on). For extra
credit, implement most of the AIO opcodes in Verilog, bolt them onto
the PPC core inside a Virtex4 FX FPGA, refine the semantics to permit
efficient pipelining, and collect your Turing award. :-)
Cheers,
- Michael
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 14:15 Al Boldi
@ 2007-02-27 19:22 ` Theodore Tso
2007-03-01 10:21 ` Pavel Machek
1 sibling, 0 replies; 277+ messages in thread
From: Theodore Tso @ 2007-02-27 19:22 UTC (permalink / raw)
To: Al Boldi; +Cc: linux-kernel
On Tue, Feb 27, 2007 at 05:15:31PM +0300, Al Boldi wrote:
> > You may disagree (were you a MacOS 9 programmer
> > in another life?), and it may not even be true for you if you happen
> > to be one of those folks more at home with Scheme continuations, for
> > example.
>
> Personal attacks are really rather unhelpful/unscientific.
Just to be clear: this wasn't a personal attack. I know a lot of
people who I greatly respect who were MacOS and Scheme programmers (I
went to school at MIT, after all, the birthplace of Scheme). The
reality though is that most people don't program that way, and their
brains aren't wired in that fashion. There is a reason why procedural
languages are far more common than purely functional models, and why,
aside from (pre-version 10) MacOS, most OSes don't use an event-driven
system call interface.
> So, instead of using intimidating language to force one's opinion thru,
> especially when it comes from those in control, why not have a democratic
> vote?
So far, I'd have to say the people arguing for an event driven model
are in the distinct minority...
And as far as voting is concerned, I prefer informed voters who can
explain, in their own words, why they are in favor of a particular
alternative.
Regards,
- Ted
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 16:21 ` Evgeniy Polyakov
2007-02-27 16:58 ` Eric Dumazet
@ 2007-02-27 19:20 ` Davide Libenzi
1 sibling, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-27 19:20 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Tue, 27 Feb 2007, Evgeniy Polyakov wrote:
> I probably selected the wrong words to describe it; here is in detail how
> kevent differs from epoll.
>
> The polling case needs to perform an additional check before an event can be
> copied to userspace, and that check must be done for each event being copied.
> Kevent does not need that (it needs it for poll emulation) - if an event is
> ready, then it is ready.
That could be changed too. The "void *key" doesn't need to be NULL. Wakeups
to f_op->poll() waiters can use that to send ready events directly,
avoiding an extra f_op->poll() to fetch them.
The infrastructure is already there; it just needs a big patch to do it everywhere ;)
> Kevent works slightly differently - it does not perform an additional check
> for readiness (although it can, and it does in poll notifications): if an
> event is marked as ready, the thread parked in the waiting syscall is awakened
> and the event is copied to userspace.
> Also, the waiting syscall is awakened through one queue - the event is added
> and wake_up() is called - while in epoll() there are two queues.
The really ancient version of epoll (called /dev/epoll at that time) was
doing a very similar thing. It was adding custom plugs all over the places
where we wanted to get events from, and was collecting them w/out
resorting to an extra f_op->poll(). Event masks went straight through an
event buffer.
The reason why the current design of epoll was chosen was because:
o It did not require custom plugs all over the places
o It worked with the current kernel abstractions as-is (through f_op->poll)
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 20:35 ` Ingo Molnar
2007-02-26 22:06 ` Bill Huey
2007-02-27 10:09 ` Evgeniy Polyakov
@ 2007-02-27 17:13 ` Pavel Machek
2 siblings, 0 replies; 277+ messages in thread
From: Pavel Machek @ 2007-02-27 17:13 UTC (permalink / raw)
To: Ingo Molnar
Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
Hi!
> > If kernelspace rescheduling is that fast, then please explain me why
> > userspace one always beats kernel/userspace?
>
> because 'user space scheduling' makes no sense? I explained my thinking
> about that in a past mail:
...
> 2) there has been an IO event. The thing is, for IO events we enter the
> kernel no matter what - and we'll do so for the next 10 years at
..actually, at some point 3D acceleration was done by accessing hw
directly from userspace. OTOH I think we are moving away from that
model, so it is probably irrelevant here.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 16:58 ` Eric Dumazet
@ 2007-02-27 17:06 ` Evgeniy Polyakov
0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 17:06 UTC (permalink / raw)
To: Eric Dumazet
Cc: Davide Libenzi, Ingo Molnar, Ulrich Drepper,
Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Tue, Feb 27, 2007 at 05:58:14PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> I believe one advantage of epoll is that it uses the standard mechanism
> (mandated for poll()/select()), while kevent adds some glue and kevent_storage
> in some structures (struct inode, struct file, ...), thus adding some extra
> code and extra storage in hot paths. Yes, there might be a gain IF most users
> of these paths want kevent. But other users pay the price (larger kernel code
> and data), which you cannot easily bench.
>
> Using epoll or not has nearly zero cost over the standard kernel (only struct
> file has some extra storage)
Well, that's the price - anything that wants to support events needs
storage for them - kevent_storage is a list_head plus a spinlock and a
pointer to itself (with all current users that pointer could be removed and
access transferred to container_of()) - it is exactly like the epoll storage -
both were created with the smallest possible overhead in mind.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 16:21 ` Evgeniy Polyakov
@ 2007-02-27 16:58 ` Eric Dumazet
2007-02-27 17:06 ` Evgeniy Polyakov
2007-02-27 19:20 ` Davide Libenzi
1 sibling, 1 reply; 277+ messages in thread
From: Eric Dumazet @ 2007-02-27 16:58 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Davide Libenzi, Ingo Molnar, Ulrich Drepper,
Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Tuesday 27 February 2007 17:21, Evgeniy Polyakov wrote:
> I probably selected the wrong words to describe it; here is in detail how
> kevent differs from epoll.
>
> The polling case needs to perform an additional check before an event can be
> copied to userspace, and that check must be done for each event being copied.
> Kevent does not need that (it needs it for poll emulation) - if an event is
> ready, then it is ready.
>
> sys_poll() creates a wait queue where different events (callbacks for
> them) are stored; when the driver calls wake_up(), the appropriate event is
> added to the ready list and wake_up() is called for that wait queue, which in
> turn calls ->poll for each event and transfers it to userspace if it is ready.
>
> Kevent works slightly differently - it does not perform an additional check
> for readiness (although it can, and it does in poll notifications): if an
> event is marked as ready, the thread parked in the waiting syscall is awakened
> and the event is copied to userspace.
> Also, the waiting syscall is awakened through one queue - the event is added
> and wake_up() is called - while in epoll() there are two queues.
Thank you Evgeniy for this comparison. poll()/select()/epoll() are tricky
indeed.
I believe one advantage of epoll is that it uses the standard mechanism
(mandated for poll()/select()), while kevent adds some glue and kevent_storage
in some structures (struct inode, struct file, ...), thus adding some extra
code and extra storage in hot paths. Yes, there might be a gain IF most users
of these paths want kevent. But other users pay the price (larger kernel code
and data), which you cannot easily bench.
Using epoll or not has nearly zero cost over the standard kernel (only struct
file has some extra storage)
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 16:01 ` Davide Libenzi
@ 2007-02-27 16:21 ` Evgeniy Polyakov
2007-02-27 16:58 ` Eric Dumazet
2007-02-27 19:20 ` Davide Libenzi
0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 16:21 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Tue, Feb 27, 2007 at 08:01:05AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> It does not matter if inside the loop you invert a 20Kx20K matrix or you
> copy a byte, they both are O(num_ready).
I probably selected the wrong words to describe it; here is in detail how
kevent differs from epoll.
The polling case needs to perform an additional check before an event can be
copied to userspace, and that check must be done for each event being copied.
Kevent does not need that (it needs it for poll emulation) - if an event is
ready, then it is ready.
sys_poll() creates a wait queue where different events (callbacks for
them) are stored; when the driver calls wake_up(), the appropriate event is
added to the ready list and wake_up() is called for that wait queue, which in
turn calls ->poll for each event and transfers it to userspace if it is ready.
Kevent works slightly differently - it does not perform an additional check
for readiness (although it can, and it does in poll notifications): if an
event is marked as ready, the thread parked in the waiting syscall is awakened
and the event is copied to userspace.
Also, the waiting syscall is awakened through one queue - the event is added
and wake_up() is called - while in epoll() there are two queues.
> - Davide
>
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 10:13 ` Evgeniy Polyakov
@ 2007-02-27 16:01 ` Davide Libenzi
2007-02-27 16:21 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-27 16:01 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Tue, 27 Feb 2007, Evgeniy Polyakov wrote:
> On Mon, Feb 26, 2007 at 06:18:51PM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
> >
> > > 2. its notifications do not go through the second loop, i.e. it is O(1),
> > > not O(ready_num), and notifications happens directly from internals of
> > > the appropriate subsystem, which does not require special wakeup
> > > (although it can be done too).
> >
> > Sorry if I do not read kevent code correctly, but in kevent_user_wait()
> > there is a:
> >
> > while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
> > ...
> > }
> >
> > loop, which makes it O(ready_num). From a mathematical standpoint, they're
> > both O(ready_num), but epoll is doing three passes over the ready set.
> > I always thought that if the number of ready events is so big that the extra
> > passes over the ready set become relevant, the "work" done by userspace for
> > each fetched event would probably make the extra cost irrelevant.
> > But that can be fixed by a patch that will follow on lkml ...
>
> No, kevent_dequeue_ready() copies data to userspace, that is it.
> So it looks roughly like the following:
In all the books where I studied, the algorithms below would be classified
as O(num_ready) ones:
[sys_kevent_wait]
+ for (i=0; i<num; ++i) {
+ k = kevent_dequeue_ready_ring(u);
+ if (!k)
+ break;
+ kevent_complete_ready(k);
+
+ if (k->event.ret_flags & KEVENT_RET_COPY_FAILED)
+ break;
+ kevent_stat_ring(u);
+ copied++;
+ }
[kevent_user_wait]
+ while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
+ if (copy_to_user(buf + num*sizeof(struct ukevent),
+ &k->event, sizeof(struct ukevent))) {
+ if (num == 0)
+ num = -EFAULT;
+ break;
+ }
+ kevent_complete_ready(k);
+ ++num;
+ kevent_stat_wait(u);
+ }
It does not matter if inside the loop you invert a 20Kx20K matrix or you
copy a byte, they both are O(num_ready).
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
@ 2007-02-27 14:15 Al Boldi
0 siblings, 0 replies; 277+ messages in thread
From: Al Boldi @ 2007-02-27 14:15 UTC (permalink / raw)
To: linux-kernel
Evgeniy Polyakov wrote:
> Ingo Molnar (mingo@elte.hu) wrote:
> > based servers. The measurements so far have shown that the absolute
> > worst-case threading server performance is at around 60% of that of
> > non-context-switching servers - and even that level is reached
> > gradually, leaving time for action for the server owner. While with
> > fully event based servers there are mostly only two modes of
> > performance: 100% performance and near-0% performance: total breakdown.
>
> Let's live in peace! :)
> I always agreed that they should be used together - event-based rings
> of IO requests, and if a request happens to block (which should be avoided
> as much as possible), then it continues on behalf of a sleeping thread.
Agreed 100%. Notify always when you can, thread only when you must.
Thanks!
--
Al
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
@ 2007-02-27 14:15 Al Boldi
2007-02-27 19:22 ` Theodore Tso
2007-03-01 10:21 ` Pavel Machek
0 siblings, 2 replies; 277+ messages in thread
From: Al Boldi @ 2007-02-27 14:15 UTC (permalink / raw)
To: linux-kernel
Theodore Tso wrote:
> On Tue, Feb 27, 2007 at 01:28:32PM +0300, Evgeniy Polyakov wrote:
> > Obviously there are bugs, it is simply how things work.
> > And debugging state machine code has exactly the same complexity as
> > debugging multi-threading code - if not less...
>
> Evgeniy,
>
> I think what you are not hearing, and what everyone else is saying
> (INCLUDING Linus),
Excluding possibly many others.
> is that for most programmers, state machines are
> much, much harder to program, understand, and debug compared to
> multi-threaded code.
That's why you introduce an infrastructure that hides all the nitty-gritty
plumbing, and makes it easy to use.
> You may disagree (were you a MacOS 9 programmer
> in another life?), and it may not even be true for you if you happen
> to be one of those folks more at home with Scheme continuations, for
> example.
Personal attacks are really rather unhelpful/unscientific.
> But it is true that for most kernel programmers, threaded
> programming is much easier to understand, and we need to engineer the
> kernel for what will be maintainable for the majority of the kernel
> development community.
What's probably true is that, for a kernel to stay competitive, you need two
distinct traits:
1. Stability
2. Performance
And you can't get that by arguing that the kernel development community
doesn't have the brains to code for performance, which I dearly doubt.
So, instead of using intimidating language to force one's opinion through,
especially when it comes from those in control, why not have a democratic
vote?
Thanks!
--
Al
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 12:34 ` Ingo Molnar
2007-02-27 13:14 ` Evgeniy Polyakov
@ 2007-02-27 13:32 ` Avi Kivity
1 sibling, 0 replies; 277+ messages in thread
From: Avi Kivity @ 2007-02-27 13:32 UTC (permalink / raw)
To: Ingo Molnar
Cc: Theodore Tso, Evgeniy Polyakov, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
Ingo Molnar wrote:
> * Theodore Tso <tytso@mit.edu> wrote:
>
>
>> I think what you are not hearing, and what everyone else is saying
>> (INCLUDING Linus), is that for most programmers, state machines are
>> much, much harder to program, understand, and debug compared to
>> multi-threaded code. [...]
>>
>
> btw., another crucial thing that i think Evgeniy is missing is that
> threadlets /enable/ event loops to be used in practice! Right now the
> epoll/kevent programming model requires a total 100% avoidance of all
> context-switching in the 'main' event handler context while handling a
> request. If just 1% of all requests happen to block it might cause a
> /complete/ breakdown of an event loop's performance - it can easily
> cause a 10x drop in performance or worse!
>
> So context-switching has to be avoided in 100% of the code that runs
> while handling requests, file descriptors have to be set to nonblocking
> (causing extra system calls), and all the syscalls that might return
> incomplete with either -EINVAL or with a short read/write have to be
> converted into a state machine. (or in the alternative, user-space
> threading has to be used, which opens up another hornet's nest)
>
> /That/ is the main inhibiting factor of the measured use of event loops
> within Linux! It has zero integration capabilities with 'usual' coding
> techniques - driving the costs of its application up in the sky, and
> pushing event based servers into niches.
>
>
Having written such a niche event based server, I can 100% confirm what
Ingo is saying here. We had a single process drive I/O to the kernel
through an event model (based on kernel aio extended with IO_CMD_POLL),
and user level threads managed by a custom scheduler that managed I/O,
timeouts, and thread scheduling.
We once considered dropping from a user-level thread model to a state
machine model, but the effort was astronomical and we wouldn't see the
rewards until it was all done, so naturally we didn't do it.
> With threadlets the picture changes dramatically: all we have to
> concentrate on to get the performance of "100% event based servers" is
> to handle 'most' rescheduling events in the event loop. A 10-20% context
> switching ratio does not hurt at all. (it causes ~1% of throughput
> loss.)
>
> Furthermore, even if a particular configuration or module of the server
> (say Apache) happens to trigger a high rate of scheduling, the
> performance breakdown model of threadlets is /vastly/ superior to event
> based servers. The measurements so far have shown that the absolute
> worst-case threading server performance is at around 60% of that of
> non-context-switching servers - and even that level is reached
> gradually, leaving time for action for the server owner. While with
> fully event based servers there are mostly only two modes of
> performance: 100% performance and near-0% performance: total breakdown.
>
Yes. Threadlets as the default aio solution (easy to use, acceptable
performance even in worst cases), with specialized solutions where
applicable (epoll for networking, aio for O_DIRECT disk) look like a
good mix of performance and sanity.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 12:34 ` Ingo Molnar
@ 2007-02-27 13:14 ` Evgeniy Polyakov
2007-02-27 13:32 ` Avi Kivity
1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 13:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Theodore Tso, Linus Torvalds, Ulrich Drepper, linux-kernel,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Tue, Feb 27, 2007 at 01:34:21PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> based servers. The measurements so far have shown that the absolute
> worst-case threading server performance is at around 60% of that of
> non-context-switching servers - and even that level is reached
> gradually, leaving time for action for the server owner. While with
> fully event based servers there are mostly only two modes of
> performance: 100% performance and near-0% performance: total breakdown.
Let's live in peace! :)
I always agreed that they should be used together - event-based rings
of IO requests, and if a request happens to block (which should be avoided
as much as possible), then it continues on behalf of a sleeping thread.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 12:13 ` Ingo Molnar
@ 2007-02-27 12:40 ` Evgeniy Polyakov
0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 12:40 UTC (permalink / raw)
To: Ingo Molnar
Cc: Theodore Tso, Linus Torvalds, Ulrich Drepper, linux-kernel,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Tue, Feb 27, 2007 at 01:13:28PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > > [...] But it is true that for most kernel programmers, threaded
> > > programming is much easier to understand, and we need to engineer
> > > the kernel for what will be maintainable for the majority of the
> > > kernel development community.
> >
> > I understand that - and I totally agree.
>
> why did you then write, just one mail ago, the exact opposite:
>
> > And debugging state machine code has exactly the same complexity as
> > debugging multi-threading code - if not less...
Because the thread machinery is much more complex than the event one - just
compare the amount of code in the scheduler with the whole of kevent -
kernel/sched.c is about the same size as the whole of kevent :)
> the kernel /IS/ multi-threaded code.
>
> which statement of yours is also patently, absurdly untrue.
> Multithreaded code is harder to debug than process based code, but it is
> still a breeze compared to complex state-machines...
It seems that we are talking about different levels.
The model I propose to use in userspace - very simple events, mostly about
completion of a request - is simple to use and simple to debug.
It can be slightly harder to debug than the simplest threading model
(one thread - one logical entity, which would never work with others)
though.
From a userspace point of view it is about the same complexity to check
why an event is not marked as ready, or why some thread never got
scheduled...
And that does not take into account the possible synchronization methods
needed to run multithreaded code without problems.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 11:52 ` Theodore Tso
2007-02-27 12:11 ` Evgeniy Polyakov
@ 2007-02-27 12:34 ` Ingo Molnar
2007-02-27 13:14 ` Evgeniy Polyakov
2007-02-27 13:32 ` Avi Kivity
2007-02-28 3:03 ` Michael K. Edwards
2 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-27 12:34 UTC (permalink / raw)
To: Theodore Tso, Evgeniy Polyakov, Linus Torvalds, Ulrich Drepper,
linux-kernel, Arjan van de Ven, Christoph Hellwig, Andrew Morton,
Alan Cox, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Theodore Tso <tytso@mit.edu> wrote:
> I think what you are not hearing, and what everyone else is saying
> (INCLUDING Linus), is that for most programmers, state machines are
> much, much harder to program, understand, and debug compared to
> multi-threaded code. [...]
btw., another crucial thing that i think Evgeniy is missing is that
threadlets /enable/ event loops to be used in practice! Right now the
epoll/kevent programming model requires a total 100% avoidance of all
context-switching in the 'main' event handler context while handling a
request. If just 1% of all requests happen to block it might cause a
/complete/ breakdown of an event loop's performance - it can easily
cause a 10x drop in performance or worse!
So context-switching has to be avoided in 100% of the code that runs
while handling requests, file descriptors have to be set to nonblocking
(causing extra system calls), and all the syscalls that might return
incomplete with either -EINVAL or with a short read/write have to be
converted into a state machine. (or in the alternative, user-space
threading has to be used, which opens up another hornet's nest)
/That/ is the main inhibiting factor of the measured use of event loops
within Linux! It has zero integration capabilities with 'usual' coding
techniques - driving the costs of its application up in the sky, and
pushing event based servers into niches.
With threadlets the picture changes dramatically: all we have to
concentrate on to get the performance of "100% event based servers" is
to handle 'most' rescheduling events in the event loop. A 10-20% context
switching ratio does not hurt at all. (it causes ~1% of throughput
loss.)
Furthermore, even if a particular configuration or module of the server
(say Apache) happens to trigger a high rate of scheduling, the
performance breakdown model of threadlets is /vastly/ superior to event
based servers. The measurements so far have shown that the absolute
worst-case threading server performance is at around 60% of that of
non-context-switching servers - and even that level is reached
gradually, leaving time for action for the server owner. While with
fully event based servers there are mostly only two modes of
performance: 100% performance and near-0% performance: total breakdown.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 12:11 ` Evgeniy Polyakov
@ 2007-02-27 12:13 ` Ingo Molnar
2007-02-27 12:40 ` Evgeniy Polyakov
2007-02-28 16:14 ` Pavel Machek
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-27 12:13 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Theodore Tso, Linus Torvalds, Ulrich Drepper, linux-kernel,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > [...] But it is true that for most kernel programmers, threaded
> > programming is much easier to understand, and we need to engineer
> > the kernel for what will be maintainable for the majority of the
> > kernel development community.
>
> I understand that - and I totally agree.
why did you then write, just one mail ago, the exact opposite:
> And debugging state machine code has exactly the same complexity as
> debugging multi-threading code - if not less...
the kernel /IS/ multi-threaded code.
which statement of yours is also patently, absurdly untrue.
Multithreaded code is harder to debug than process based code, but it is
still a breeze compared to complex state-machines...
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 11:52 ` Theodore Tso
@ 2007-02-27 12:11 ` Evgeniy Polyakov
2007-02-27 12:13 ` Ingo Molnar
2007-02-28 16:14 ` Pavel Machek
2007-02-27 12:34 ` Ingo Molnar
2007-02-28 3:03 ` Michael K. Edwards
2 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 12:11 UTC (permalink / raw)
To: Theodore Tso
Cc: Ingo Molnar, Linus Torvalds, Ulrich Drepper, linux-kernel,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Tue, Feb 27, 2007 at 06:52:22AM -0500, Theodore Tso (tytso@mit.edu) wrote:
> On Tue, Feb 27, 2007 at 01:28:32PM +0300, Evgeniy Polyakov wrote:
> > Obviously there are bugs, it is simply how things work.
> > And debugging state machine code has exactly the same complexity as
> > debugging multi-threading code - if not less...
>
> Evgeniy,
Hi Ted.
> I think what you are not hearing, and what everyone else is saying
> (INCLUDING Linus), is that for most programmers, state machines are
> much, much harder to program, understand, and debug compared to
> multi-threaded code. You may disagree (were you a MacOS 9 programmer
> in another life?), and it may not even be true for you if you happen
> to be one of those folks more at home with Scheme continuations, for
> example. But it is true that for most kernel programmers, threaded
> programming is much easier to understand, and we need to engineer the
> kernel for what will be maintainable for the majority of the kernel
> development community.
I understand that - and I totally agree.
But when more complex, more bug-prone code results in higher performance
- it must be used. We have linked lists and binary trees - the latter
are quite complex structures, but they allow higher performance
in search operations, so we use them.
The same applies to state machines - yes, in some cases they are hard to
program, but when things are already implemented and wrapped into a
nice (non-POSIX) aio_read(), there is absolutely no usage complexity.
Even if it is up to the programmer to program a state machine based on
generated events, those higher-layer state machines are not complex.
Let's take the simple case of an (aio_)read() from a file descriptor: if the
page is in the cache, no readpage() method will be called, so we do not need
to create any kind of event - we just copy the data. If there is no page, or
the page is not uptodate, we allocate a bio and do not wait until the buffers
are read - we return to userspace and start another read. When the bio is
completed and its end_io callback is called, we mark the pages as uptodate,
copy the data to userspace, and mark the event bound to the above
(aio_)read() as completed.
(That is how kevent AIO works, btw.)
The userspace programmer just calls
cookie = aio_read();
aio_wait(cookie);
or something like that.
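A slightly fuller sketch of that pattern (aio_read_submit()/aio_wait() are
illustrative stand-ins for a completion-event interface, not the POSIX aio_*
calls and not an existing kevent API):

#include <stddef.h>
#include <sys/types.h>

/* illustrative stand-ins: submit returns a cookie immediately, wait
 * blocks until the completion event for that cookie has been raised */
extern long aio_read_submit(int fd, void *buf, size_t len, off_t off);
extern long aio_wait(long cookie);

static long read_two(int fd1, int fd2, char *b1, char *b2, size_t len)
{
        /* both reads are in flight at once; if the pages are already in
         * the page cache, the completion events are ready immediately */
        long c1 = aio_read_submit(fd1, b1, len, 0);
        long c2 = aio_read_submit(fd2, b2, len, 0);

        long n1 = aio_wait(c1);
        long n2 = aio_wait(c2);

        return (n1 < 0 || n2 < 0) ? -1 : n1 + n2;
}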
It is simple and straightforward, especially if the data read must then
be used somewhere else - in that case the processing thread will need to
coordinate with the main one, which is simple in an event model, since there
is one place where events of _all_ types are gathered.
> Regards,
>
> - Ted
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 10:28 ` Evgeniy Polyakov
@ 2007-02-27 11:52 ` Theodore Tso
2007-02-27 12:11 ` Evgeniy Polyakov
` (2 more replies)
0 siblings, 3 replies; 277+ messages in thread
From: Theodore Tso @ 2007-02-27 11:52 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Linus Torvalds, Ulrich Drepper, linux-kernel,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Tue, Feb 27, 2007 at 01:28:32PM +0300, Evgeniy Polyakov wrote:
> Obviously there are bugs, it is simply how things work.
> And debugging state machine code has exactly the same complexity as
> debugging multi-threading code - if not less...
Evgeniy,
I think what you are not hearing, and what everyone else is saying
(INCLUDING Linus), is that for most programmers, state machines are
much, much harder to program, understand, and debug compared to
multi-threaded code. You may disagree (were you a MacOS 9 programmer
in another life?), and it may not even be true for you if you happen
to be one of those folks more at home with Scheme continuations, for
example. But it is true that for most kernel programmers, threaded
programming is much easier to understand, and we need to engineer the
kernel for what will be maintainable for the majority of the kernel
development community.
Regards,
- Ted
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 10:41 ` Evgeniy Polyakov
@ 2007-02-27 10:49 ` Ingo Molnar
0 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-27 10:49 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > does that work for you?
>
> Yes, -fomit-frame-pointer did the trick.
>
> On average, the threadlet server runs as fast as epoll.
yeah.
> Just because there is _no_ rescheduling in that case.
in my test it was 'little', not 'no'. But yes, that's exactly my point:
we can remove the nonblock hackeries from event loops and just
concentrate on making it schedule in less than 10-20% of the cases. Even
a relatively high 10-20% rescheduling rate is hardly measurable with
threadlets, while it gives a 10%-20% regression (and possibly bad
latencies) for the pure epoll/kevent server.
and such a mixed technique is simply not possible with ordinary
user-space threads, because there it's an all-or-nothing affair: either
you go fully to threads (at which point we are again back to a fully
threaded design, now also saddled with event loop overhead), or you try
to do user-space threads, which Just Make Little Sense (tm).
so threadlets remove the biggest headache from event loops: they don't
have to be '100% nonblocking' anymore. No O_NONBLOCK overhead, no
complex state machines - just handle the most common event type via an
outer event loop and keep the other 99% of server complexity in plain
procedural programming. 1% of state-machine code is perfectly
acceptable.
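As a rough sketch of that mixed structure (threadlet_run() and handle_conn()
below are illustrative stand-ins, not the API from the patches): the outer
loop stays a plain epoll loop, and each ready fd is handled by ordinary
procedural code that is allowed to block.

#include <sys/epoll.h>

/* illustrative stand-in: run fn(arg) as a threadlet; if it does not
 * block it behaves like a plain function call, and if it blocks only
 * that request continues asynchronously on a kernel-provided thread */
extern long threadlet_run(long (*fn)(void *), void *arg);

/* plain, blocking request handler - no O_NONBLOCK, no state machine */
extern long handle_conn(void *conn);

static void event_loop(int epfd)
{
        struct epoll_event ev[64];

        for (;;) {
                int n = epoll_wait(epfd, ev, 64, -1);

                for (int i = 0; i < n; i++)
                        threadlet_run(handle_conn, ev[i].data.ptr);
        }
}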
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 6:24 ` Ingo Molnar
@ 2007-02-27 10:41 ` Evgeniy Polyakov
2007-02-27 10:49 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 10:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Tue, Feb 27, 2007 at 07:24:19AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > On Mon, Feb 26, 2007 at 01:51:23PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > >
> > > * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > >
> > > > Even having main dispatcher as epoll/kevent loop, the _whole_
> > > > threadlet model is absolutely micro-thread in nature and not state
> > > > machine/event.
> > >
> > > Evgeniy, i'm not sure how many different ways to tell this to you, but
> > > you are not listening, you are not learning and you are still not
> > > getting it at all.
> > >
> > > The scheduler /IS/ a generic work/event queue. And it's pretty damn
> > > fast. No amount of badmouthing will change that basic fact. Not exactly
> > > as fast as a special-purpose queueing system (for all the reasons i
> > > outlined to you, and which you ignored), but it gets pretty damn close
> > > even for the web workload /you/ identified, and offers a user-space
> > > programming model that is about 1000 times more useful than
> > > state-machines.
> >
> > Meanwhile, on the practical side:
> > via epia kevent/epoll/threadlet:
> >
> > client: ab -c500 -n5000 $url
> >
> > kevent: 849.72
> > epoll: 538.16
> > threadlet:
> > gcc ./evserver_epoll_threadlet.c -o ./evserver_epoll_threadlet
> > In file included from ./evserver_epoll_threadlet.c:30:
> > ./threadlet.h: In function ‘threadlet_exec’:
> > ./threadlet.h:46: error: can't find a register in class ‘GENERAL_REGS’
> > while reloading ‘asm’
> >
> > That particular asm optimization fails to compile.
>
> it's not really an asm optimization but an API glue. I'm using:
>
> gcc -O2 -g -Wall -o evserver_epoll_threadlet evserver_epoll_threadlet.c -fomit-frame-pointer
>
> does that work for you?
Yes, -fomit-frame-pointer did the trick.
On average, the threadlet server runs as fast as epoll.
Just because there is _no_ rescheduling in that case.
I added a printk into __async_schedule() and started
ab -c7000 -n20000 http://192.168.4.80/
against the slow VIA EPIA, and got only one of them.
The client is the latest
../async-test-v4/evserver_epoll_threadlet
Btw, I have to admit that I have a totally broken kevent tree there - it
does not work on that machine under higher loads, so I'm investigating that
problem now.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 19:54 ` Ingo Molnar
@ 2007-02-27 10:28 ` Evgeniy Polyakov
2007-02-27 11:52 ` Theodore Tso
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 10:28 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 08:54:16PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > > Reading from the disk is _exactly_ the same - the same waiting for
> > > buffer_heads/pages, and (since it is bigger) it can be easily
> > > transferred to event driven model. Ugh, wait, it not only _can_ be
> > > transferred, it is already done in kevent AIO, and it shows faster
> > > speeds (though I only tested sending them over the net).
> >
> > It would be absolutely horrible to program for. Try anything more
> > complex than read/write (which is the simplest case, but even that is
> > nasty).
>
> note that even for something as 'simple and straightforward' as TCP
> sockets, the 25-50 lines of evserver code i worked on today had 3
> separate bugs, is known to be fundamentally incorrect and one of the
> bugs (the lost event problem) showed up as a subtle epoll performance
> problem and it took me more than an hour to track down. And that matches
> my Tux experience as well: event based models are horribly hard to debug
> BECAUSE there is /no procedural state associated with requests/. Hence
> there is no real /proof of progress/. Not much to use for debugging -
> except huge logs of execution, which, if one is unlucky (which i often
> was with Tux) would just make the problem go away.
I'm glad you found a bug in my epoll processing code (which does not exist
in the kevent part though) - it took me more than a year after kevent's
introduction to get someone to look into it.
Obviously there are bugs, it is simply how things work.
And debugging state machine code has exactly the same complexity as
debugging multi-threading code - if not less...
> Furthermore, with a 'retry' model, who guarantees that the retry wont be
> an infinite retry where none of the retries ever progresses the state of
> the system enough to return the data we are interested in? The moment we
> have to /retry/, depending on the 'depth' of how deep the retry kicked
> in, we've got to reach that 'depth' of code again and execute it.
>
> plus, 'there is not much state' is not even completely true to begin
> with, even in the most simple, TCP socket case! There /is/ quite a bit
> of state constructed on the kernel stack: user parameters have been
> evaluated/converted, the socket has been looked up, its state has been
> validated, etc. With a 'retry' model - but even with a pure 'event
> queueing' model we redo all those things /both/ at request submission
> and at event generation time, again and again - while with a synchronous
> syscall you do it just once and upon event completion a piece of that
> data is already on the kernel stack. I'd much rather spend time and
> effort on simplifying the scheduler and reducing the cache footprint of
> the kernel thread context switch path, etc., to make it more useful even
> in more extreme, highly parallel '100% context-switching' case, because i
> have first-hand experience about how fragile and inflexible event based
> servers are. I do think that event interfaces for raw, external physical
> events make sense in some circumstances, but for any more complex
> 'derived' event type it's less and less clear whether we want a direct
> interface to it. For something like the VFS it's outright horrible to
> even think about.
Ingo, you paint the picture too black.
No one will create a purely event based model for socket or VFS
processing - an event is the completion of a request - data in the
buffer, data was sent and so on - and it is also possible to add events
into the middle of the processing, especially if the operation can be
logically separated - like sendfile.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-27 2:18 ` Davide Libenzi
@ 2007-02-27 10:13 ` Evgeniy Polyakov
2007-02-27 16:01 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 10:13 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 06:18:51PM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
>
> > 2. its notifications do not go through the second loop, i.e. it is O(1),
> > not O(ready_num), and notifications happens directly from internals of
> > the appropriate subsystem, which does not require special wakeup
> > (although it can be done too).
>
> Sorry if I do not read kevent code correctly, but in kevent_user_wait()
> there is a:
>
> while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
> ...
> }
>
> loop, that makes it O(ready_num). From a mathematical standpoint, they're
> both O(ready_num), but epoll is doing three passes over the ready set.
> I always thought that if the number of ready events is so big that the extra
> passes over the ready set become relevant, the "work" done by
> userspace for each fetched event would probably make the extra cost irrelevant.
> But that can be fixed by a patch that will follow on lkml ...
No, kevent_dequeue_ready() copies data to userspace, that is it.
So it looks roughly like the following:
storage is ready: -> kevent_requeue() - ends up adding the event to the
end of the queue (a list add under a spinlock)
kevent_wait() -> copy first, second, ...
The kevent poll (as well as the epoll) model requires an _additional_
check in userspace context before the event is copied, so we end up
checking the full ready queue again - that is what I pointed to as
O(ready_num), where O() implies the price of copying to userspace,
list_add and so on.
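As a rough user-space illustration of the flow described above
(hypothetical names and structures, not the actual kevent code; a mutex
stands in for the in-kernel spinlock): the producer side is one O(1)
tail append per ready event, and the consumer touches each ready event
exactly once to copy it out, so the per-wakeup cost is just the
O(ready_num) copies.

#include <pthread.h>
#include <string.h>
#include <sys/queue.h>

struct event {
        TAILQ_ENTRY(event) link;
        char payload[32];               /* what would be copied to userspace */
};
TAILQ_HEAD(evlist, event);

static struct evlist ready = TAILQ_HEAD_INITIALIZER(ready);
static pthread_mutex_t ready_lock = PTHREAD_MUTEX_INITIALIZER;

/* "storage is ready": O(1) append to the tail of the ready queue. */
void queue_ready(struct event *ev)
{
        pthread_mutex_lock(&ready_lock);
        TAILQ_INSERT_TAIL(&ready, ev, link);
        pthread_mutex_unlock(&ready_lock);
}

/* "kevent_wait() -> copy first, second, ...": one pass over the ready
 * events, each copied out exactly once. */
int dequeue_ready(char (*out)[32], int max)
{
        int n = 0;
        struct event *ev;

        pthread_mutex_lock(&ready_lock);
        while (n < max && (ev = TAILQ_FIRST(&ready)) != NULL) {
                TAILQ_REMOVE(&ready, ev, link);
                memcpy(out[n++], ev->payload, sizeof(ev->payload));
        }
        pthread_mutex_unlock(&ready_lock);
        return n;
}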
> - Davide
>
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 20:35 ` Ingo Molnar
2007-02-26 22:06 ` Bill Huey
@ 2007-02-27 10:09 ` Evgeniy Polyakov
2007-02-27 17:13 ` Pavel Machek
2 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 10:09 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 09:35:43PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > If kernelspace rescheduling is that fast, then please explain me why
> > userspace one always beats kernel/userspace?
>
> because 'user space scheduling' makes no sense? I explained my thinking
> about that in a past mail:
>
> -------------------------->
> One often repeated (because pretty much only) performance advantage of
> 'light threads' is context-switch performance between user-space
> threads. But reality is, nobody /cares/ about being able to
> context-switch between "light user-space threads"! Why? Because there
> are only two reasons why such a high-performance context-switch would
> occur:
>
> 1) there's contention between those two tasks. Wonderful: now two
> artificial threads are running on the /same/ CPU and they are even
> contending each other. Why not run a single context on a single CPU
> instead and only get contended if /another/ CPU runs a conflicting
> context?? While this makes for nice "pthread locking benchmarks",
> it is not really useful for anything real.
Obviously there must be several real threads, each of which can
reschedule userspace threads, because that is fast and scalable.
So: one CPU - one real thread.
> 2) there has been an IO event. The thing is, for IO events we enter the
> kernel no matter what - and we'll do so for the next 10 years at
> minimum. We want to abstract away the hardware, we want to do
> reliable resource accounting, we want to share hardware resources,
> we want to rate-limit, etc., etc. While in /theory/ you could handle
> IO purely from user-space, in practice you dont want to do that. And
> if we accept the premise that we'll enter the kernel anyway, there's
> zero performance difference between scheduling right there in the
> kernel, or returning back to user-space to schedule there. (in fact
> i submit that the former is faster). Or if we accept the theoretical
> possibility of 'perfect IO hardware' that implements /all/ the
> features that the kernel wants (in a secure and generic way, and
> mind you, such IO hardware does not exist yet), then /at most/ the
> performance advantage of user-space doing the scheduling is the
> overhead of a null syscall entry. Which is a whopping 100 nsecs on
> modern CPUs! That's roughly the latency of a /single/ DRAM access!
> ....
And here we start our discussion again from the beginning :)
We entered the kernel, of course, but then the kernel decides to move
the thread away and put another one in its place under the hardware sun
- so the kernel starts to move registers, change some states and so on.
While in the case of userspace threads we have already returned back to
userspace (at the cost of the overhead described above) and start doing
the new task - move registers, change some states and so on.
And somehow it happens that doing it in userspace (for example with
setjmp/longjmp) is faster than in the kernel. I do not know why - I
never investigated that case, but that is how it is.
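For reference, a minimal sketch of the kind of user-space switch being
discussed, using the portable ucontext.h API (an assumption here;
Evgeniy mentions setjmp/longjmp, and libpcl uses swapcontext) - no
kernel scheduler involvement per switch:

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;

static void coroutine(void)
{
        int i;

        for (i = 0; i < 3; i++) {
                printf("coroutine step %d\n", i);
                swapcontext(&co_ctx, &main_ctx);  /* yield back to main */
        }
}

int main(void)
{
        static char stack[64 * 1024];
        int i;

        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = stack;
        co_ctx.uc_stack.ss_size = sizeof(stack);
        co_ctx.uc_link = &main_ctx;     /* where to go if coroutine returns */
        makecontext(&co_ctx, coroutine, 0);

        for (i = 0; i < 3; i++) {
                swapcontext(&main_ctx, &co_ctx);  /* resume the coroutine */
                printf("back in main\n");
        }
        return 0;
}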
> <-----------
>
> (see http://lwn.net/Articles/219958/)
>
> btw., the words that follow that section are quite interesting in
> retrospect:
>
> | Furthermore, 'light thread' concepts can no way approach the
> | performance of #2 state-machines: if you /know/ what the structure of
> | your context is, and you can program it in a specialized state-machine
> | way, there's just so many shortcuts possible that it's not even funny.
>
> [ oops! ;-) ]
>
> i severely under-estimated the kind of performance one can reach even
> with pure procedural concepts. Btw., when i wrote this mail was when i
> started thinking about "is it really true that the only way to get good
> performance are 100% event-based servers and nonblocking designs?", and
> started coding syslets and then threadlets.
:-)
Threads happen to be really fast, and easy to program, but the maximum
performance still cannot be achieved with them, just because event
handling is faster - read one cacheline from the ring or queue, or
reschedule threads?
Anyway, what are we talking about, Ingo?
I understand your point, and I also understand that you are not going to
change it (yes, that's it, and I have to admit that I'm guilty too - I
doubt I will ever think that thousands of threads doing IO is a win), so
we can close the discussion at this point.
My main point in participating in it was the kevent introduction -
although I created kevent AIO as a state machine, I never wanted to
promote kevent as AIO - kevent is an event processing mechanism, one of
its usage models is AIO, others are signals, file events, timers,
whatever...
You have drawn a line - kevent is not needed - that is your point.
I wanted to hear definitive words half a year ago, but the community
kept silent. Eventually things get done.
Thanks.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 20:04 ` Linus Torvalds
@ 2007-02-27 8:09 ` Evgeniy Polyakov
0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-27 8:09 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 12:04:51PM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
>
>
> On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
> >
> > Will you argue that people do things like
> > num = epoll_wait()
> > for (i=0; i<num; ++i) {
> > process(event[i])?
> > }
>
> I have several times told you that I argue for a *combination* of
> event-based interfaces and thread-like code. And that the choice depends
> on which is more natural. Sometimes you might have just one or the other.
> Sometimes you have both. And sometimes you have neither (although,
> strictly speaking, normal single-threaded code is certainly "thread-like"
> - it's a serial execution, the same way threadlets are serial executions -
> it's just not running in parallel with anything else).
>
> > Will you spawn thread per IO?
>
> Depending on what the IO is, yes.
>
> Is that _really_ so hard to understand? There is no "yes" or "no". There's
> a "depends on what the problem is, and what the solution looks like".
>
> Sometimes the best way to do parallelism is through explicit threads.
> Sometimes it is through whatever "threadlets" or other that gets out of
> this whole development discussion. Sometimes it's an event loop.
>
> So get over it. The world is not a black and white, either-or kind of
> place. It's full of grayscales, and colors, and mixing things
> appropriately. And the choices are often done on whims and on whatever
> feels most comfortable to the person doing the choice. Not on some strict
> "you must always use things in an event-driven main loop" or "you must
> always use threads for parallelism".
>
> The world is simply _richer_ than that kind of either-or thing.
It seems you like to repeat that white is white and is not black :)
Did you see what I wrote in the previous e-mail to you?
No, you did not.
I wrote that both models should co-exist, and the boundary between the
two is not clear, but for the described workloads IMO the event driven
model provides far higher performance.
That is it, Linus - no one wants to say that you are saying something
stupid - just read what others write to you.
Thanks.
> Linus
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 16:46 ` Evgeniy Polyakov
@ 2007-02-27 6:24 ` Ingo Molnar
2007-02-27 10:41 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-27 6:24 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> On Mon, Feb 26, 2007 at 01:51:23PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> >
> > * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> >
> > > Even having main dispatcher as epoll/kevent loop, the _whole_
> > > threadlet model is absolutely micro-thread in nature and not state
> > > machine/event.
> >
> > Evgeniy, i'm not sure how many different ways to tell this to you, but
> > you are not listening, you are not learning and you are still not
> > getting it at all.
> >
> > The scheduler /IS/ a generic work/event queue. And it's pretty damn
> > fast. No amount of badmouthing will change that basic fact. Not exactly
> > as fast as a special-purpose queueing system (for all the reasons i
> > outlined to you, and which you ignored), but it gets pretty damn close
> > even for the web workload /you/ identified, and offers a user-space
> > programming model that is about 1000 times more useful than
> > state-machines.
>
> Meanwhile on the practical side:
> via epia kevent/epoll/threadlet:
>
> client: ab -c500 -n5000 $url
>
> kevent: 849.72
> epoll: 538.16
> threadlet:
> gcc ./evserver_epoll_threadlet.c -o ./evserver_epoll_threadlet
> In file included from ./evserver_epoll_threadlet.c:30:
> ./threadlet.h: In function ‘threadlet_exec’:
> ./threadlet.h:46: error: can't find a register in class ‘GENERAL_REGS’
> while reloading ‘asm’
>
> That particular asm optimization fails to compile.
it's not really an asm optimization but an API glue. I'm using:
gcc -O2 -g -Wall -o evserver_epoll_threadlet evserver_epoll_threadlet.c -fomit-frame-pointer
does that work for you?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 16:55 ` Evgeniy Polyakov
2007-02-26 20:35 ` Ingo Molnar
@ 2007-02-27 2:18 ` Davide Libenzi
2007-02-27 10:13 ` Evgeniy Polyakov
1 sibling, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-27 2:18 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
> 2. its notifications do not go through the second loop, i.e. it is O(1),
> not O(ready_num), and notifications happens directly from internals of
> the appropriate subsystem, which does not require special wakeup
> (although it can be done too).
Sorry if I do not read kevent code correctly, but in kevent_user_wait()
there is a:
while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
...
}
loop, that makes it O(ready_num). From a mathematical standpoint, they're
both O(ready_num), but epoll is doing three passes over the ready set.
I always thought that if the number of ready events is so big that the extra
passes over the ready set become relevant, the "work" done by
userspace for each fetched event would probably make the extra cost irrelevant.
But that can be fixed by a patch that will follow on lkml ...
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 20:35 ` Ingo Molnar
@ 2007-02-26 22:06 ` Bill Huey
2007-02-27 10:09 ` Evgeniy Polyakov
2007-02-27 17:13 ` Pavel Machek
2 siblings, 0 replies; 277+ messages in thread
From: Bill Huey @ 2007-02-26 22:06 UTC (permalink / raw)
To: Ingo Molnar
Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner, Bill Huey (hui)
On Mon, Feb 26, 2007 at 09:35:43PM +0100, Ingo Molnar wrote:
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > If kernelspace rescheduling is that fast, then please explain me why
> > userspace one always beats kernel/userspace?
>
> because 'user space scheduling' makes no sense? I explained my thinking
> about that in a past mail:
>
> -------------------------->
> One often repeated (because pretty much only) performance advantage of
> 'light threads' is context-switch performance between user-space
> threads. But reality is, nobody /cares/ about being able to
> context-switch between "light user-space threads"! Why? Because there
> are only two reasons why such a high-performance context-switch would
> occur:
...
> 2) there has been an IO event. The thing is, for IO events we enter the
> kernel no matter what - and we'll do so for the next 10 years at
> minimum. We want to abstract away the hardware, we want to do
> reliable resource accounting, we want to share hardware resources,
> we want to rate-limit, etc., etc. While in /theory/ you could handle
> IO purely from user-space, in practice you dont want to do that. And
> if we accept the premise that we'll enter the kernel anyway, there's
> zero performance difference between scheduling right there in the
> kernel, or returning back to user-space to schedule there. (in fact
> i submit that the former is faster). Or if we accept the theoretical
> possibility of 'perfect IO hardware' that implements /all/ the
> features that the kernel wants (in a secure and generic way, and
> mind you, such IO hardware does not exist yet), then /at most/ the
> performance advantage of user-space doing the scheduling is the
> overhead of a null syscall entry. Which is a whopping 100 nsecs on
> modern CPUs! That's roughly the latency of a /single/ DRAM access!
Ingo and Evgeniy,
I was trying to avoid getting into this discussion, but whatever. M:N
threading systems also require just about all of the threading semantics
that are inside the kernel to be available in userspace. Implementations
of the userspace scheduler side of things must be able to turn off
preemption to do per-CPU local storage, report blocking/preempting
(via upcall or a mailbox) and do other scheduler-ish things in a
reliable way, with the result that the complexity of a system like that
ends up not being worth it and is often monstrously large to implement
and debug. That's why Solaris 10 removed their scheduler activations
framework and went with 1:1 like in Linux, since the scheduler
activations model is so difficult to control. The slowness of the futex
stuff might be compounded by some VM mapping issues that Bill Irwin and
Peter Zijlstra have pointed out in the past in that regard, if I
understand correctly.
Bryan Cantrill of Solaris 10/dtrace fame can comment on that if you ask
him sometime.
For an exercise, think about all of the things you need to either
migrate a task or to do a cross-CPU wake of one. It goes to hell in
complexity really quickly. Erlang and other language-based concurrency
systems get their regularity by indirectly oversimplifying what
threading is compared to what kernel folks are used to. Try doing a
cross-CPU wake quickly in a system like that - good luck. Now think
about how to do an IPI in userspace? Good luck.
That's all :)
bill
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 16:55 ` Evgeniy Polyakov
@ 2007-02-26 20:35 ` Ingo Molnar
2007-02-26 22:06 ` Bill Huey
` (2 more replies)
2007-02-27 2:18 ` Davide Libenzi
1 sibling, 3 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 20:35 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> If kernelspace rescheduling is that fast, then please explain me why
> userspace one always beats kernel/userspace?
because 'user space scheduling' makes no sense? I explained my thinking
about that in a past mail:
-------------------------->
One often repeated (because pretty much only) performance advantage of
'light threads' is context-switch performance between user-space
threads. But reality is, nobody /cares/ about being able to
context-switch between "light user-space threads"! Why? Because there
are only two reasons why such a high-performance context-switch would
occur:
1) there's contention between those two tasks. Wonderful: now two
artificial threads are running on the /same/ CPU and they are even
contending each other. Why not run a single context on a single CPU
instead and only get contended if /another/ CPU runs a conflicting
context?? While this makes for nice "pthread locking benchmarks",
it is not really useful for anything real.
2) there has been an IO event. The thing is, for IO events we enter the
kernel no matter what - and we'll do so for the next 10 years at
minimum. We want to abstract away the hardware, we want to do
reliable resource accounting, we want to share hardware resources,
we want to rate-limit, etc., etc. While in /theory/ you could handle
IO purely from user-space, in practice you dont want to do that. And
if we accept the premise that we'll enter the kernel anyway, there's
zero performance difference between scheduling right there in the
kernel, or returning back to user-space to schedule there. (in fact
i submit that the former is faster). Or if we accept the theoretical
possibility of 'perfect IO hardware' that implements /all/ the
features that the kernel wants (in a secure and generic way, and
mind you, such IO hardware does not exist yet), then /at most/ the
performance advantage of user-space doing the scheduling is the
overhead of a null syscall entry. Which is a whopping 100 nsecs on
modern CPUs! That's roughly the latency of a /single/ DRAM access!
....
<-----------
(see http://lwn.net/Articles/219958/)
btw., the words that follow that section are quite interesting in
retrospect:
| Furthermore, 'light thread' concepts can no way approach the
| performance of #2 state-machines: if you /know/ what the structure of
| your context is, and you can program it in a specialized state-machine
| way, there's just so many shortcuts possible that it's not even funny.
[ oops! ;-) ]
i severely under-estimated the kind of performance one can reach even
with pure procedural concepts. Btw., when i wrote this mail was when i
started thinking about "is it really true that the only way to get good
performance are 100% event-based servers and nonblocking designs?", and
started coding syslets and then threadlets.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 19:30 ` Evgeniy Polyakov
@ 2007-02-26 20:04 ` Linus Torvalds
2007-02-27 8:09 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Linus Torvalds @ 2007-02-26 20:04 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
>
> Will you argue that people do things like
> num = epoll_wait()
> for (i=0; i<num; ++i) {
> process(event[i])?
> }
I have several times told you that I argue for a *combination* of
event-based interfaces and thread-like code. And that the choice depends
on which is more natural. Sometimes you might have just one or the other.
Sometimes you have both. And sometimes you have neither (although,
strictly speaking, normal single-threaded code is certainly "thread-like"
- it's a serial execution, the same way threadlets are serial executions -
it's just not running in parallel with anything else).
> Will you spawn thread per IO?
Depending on what the IO is, yes.
Is that _really_ so hard to understand? There is no "yes" or "no". There's
a "depends on what the problem is, and what the solution looks like".
Sometimes the best way to do parallelism is through explicit threads.
Sometimes it is through whatever "threadlets" or other that gets out of
this whole development discussion. Sometimes it's an event loop.
So get over it. The world is not a black and white, either-or kind of
place. It's full of grayscales, and colors, and mixing things
appropriately. And the choices are often done on whims and on whatever
feels most comfortable to the person doing the choice. Not on some strict
"you must always use things in an event-driven main loop" or "you must
always use threads for parallelism".
The world is simply _richer_ than that kind of either-or thing.
Linus
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 10:31 ` Ingo Molnar
2007-02-26 10:43 ` Evgeniy Polyakov
@ 2007-02-26 20:02 ` Davide Libenzi
1 sibling, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-26 20:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: Evgeniy Polyakov, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Mon, 26 Feb 2007, Ingo Molnar wrote:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> > please also try evserver_epoll_threadlet.c that i've attached below -
> > it uses epoll as the main event mechanism but does threadlets for
> > request handling.
>
> find updated code below - your evserver_epoll.c spuriously missed event
> edges - so i changed it back to level-triggered. While that is not as
> fast as edge-triggered, it does not result in spurious hangs and
> workflow 'hickups' during the test.
>
> Could this be the reason why in your testing kevents outperformed epoll?
This is how I handle a read (write/accept/connect, same thing) inside
coronet (coroutine+epoll async library - http://www.xmailserver.org/coronet-lib.html ).
static int conet_read_ll(struct sk_conn *conn, char *buf, int nbyte) {
        int n;

        /* Retry the non-blocking read until it succeeds or hits a real error. */
        while ((n = read(conn->sfd, buf, nbyte)) < 0) {
                if (errno == EINTR)
                        continue;
                if (errno != EAGAIN && errno != EWOULDBLOCK)
                        return -1;
                /* Only touch the interest set when we actually hit EAGAIN;
                 * conet_mod_conn() does the epoll_ctl(EPOLL_CTL_MOD). */
                if (!(conn->events & EPOLLIN)) {
                        conn->events = EPOLLIN;
                        if (conet_mod_conn(conn, conn->events) < 0)
                                return -1;
                }
                /* Switch to another coroutine until the fd becomes ready. */
                if (conet_yield(conn) < 0)
                        return -1;
        }
        return n;
}
I use EPOLLET and, you don't change the interest set until you actually
get an EAGAIN. *Many* read/write mode changes in the usage will simply
happen w/out an epoll_ctl() needed. The conet_mod_conn() function does the
actual epoll_ctl() and add EPOLLET to the specified event set. The
conet_yield() function end up calling the libpcl's co_resume(), that is
basically a switch-to-next-coroutine-until-fd-becomes-ready (maps
directly to a swapcontext).
That cuts 50+% of the epoll_ctl(EPOLL_CTL_MOD).
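For reference, the epoll_ctl() call that conet_mod_conn() is assumed to
wrap looks roughly like this (a sketch with a minimal stand-in struct;
the real coronet internals may differ):

#include <sys/epoll.h>

struct sk_conn { int sfd; unsigned int events; };  /* minimal stand-in */

/* Update the interest set of an already-registered fd, keeping it
 * edge-triggered (EPOLLET), as described above. */
static int conet_mod_conn_sketch(int epfd, struct sk_conn *conn,
                                 unsigned int events)
{
        struct epoll_event ev = { .events = events | EPOLLET, .data.ptr = conn };

        conn->events = events;
        return epoll_ctl(epfd, EPOLL_CTL_MOD, conn->sfd, &ev);
}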
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 17:57 ` Linus Torvalds
2007-02-26 18:32 ` Evgeniy Polyakov
@ 2007-02-26 19:54 ` Ingo Molnar
2007-02-27 10:28 ` Evgeniy Polyakov
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 19:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > Reading from the disk is _exactly_ the same - the same waiting for
> > buffer_heads/pages, and (since it is bigger) it can be easily
> > transferred to event driven model. Ugh, wait, it not only _can_ be
> > transferred, it is already done in kevent AIO, and it shows faster
> > speeds (though I only tested sending them over the net).
>
> It would be absolutely horrible to program for. Try anything more
> complex than read/write (which is the simplest case, but even that is
> nasty).
note that even for something as 'simple and straightforward' as TCP
sockets, the 25-50 lines of evserver code i worked on today had 3
separate bugs, is known to be fundamentally incorrect and one of the
bugs (the lost event problem) showed up as a subtle epoll performance
problem and it took me more than an hour to track down. And that matches
my Tux experience as well: event based models are horribly hard to debug
BECAUSE there is /no procedural state associated with requests/. Hence
there is no real /proof of progress/. Not much to use for debugging -
except huge logs of execution, which, if one is unlucky (which i often
was with Tux) would just make the problem go away.
Furthermore, with a 'retry' model, who guarantees that the retry wont be
an infinite retry where none of the retries ever progresses the state of
the system enough to return the data we are interested in? The moment we
have to /retry/, depending on the 'depth' of how deep the retry kicked
in, we've got to reach that 'depth' of code again and execute it.
plus, 'there is not much state' is not even completely true to begin
with, even in the most simple, TCP socket case! There /is/ quite a bit
of state constructed on the kernel stack: user parameters have been
evaluated/converted, the socket has been looked up, its state has been
validated, etc. With a 'retry' model - but even with a pure 'event
queueing' model we redo all those things /both/ at request submission
and at event generation time, again and again - while with a synchronous
syscall you do it just once and upon event completion a piece of that
data is already on the kernel stack. I'd much rather spend time and
effort on simplifying the scheduler and reducing the cache footprint of
the kernel thread context switch path, etc., to make it more useful even
in more extreme, highly parallel '100% context-switching' case, because i
have first-hand experience about how fragile and inflexible event based
servers are. I do think that event interfaces for raw, external physical
events make sense in some circumstances, but for any more complex
'derived' event type it's less and less clear whether we want a direct
interface to it. For something like the VFS it's outright horrible to
even think about.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 19:42 ` Evgeniy Polyakov
2007-02-25 20:38 ` Ingo Molnar
2007-02-26 12:39 ` Ingo Molnar
@ 2007-02-26 19:47 ` Davide Libenzi
2 siblings, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-26 19:47 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Sun, 25 Feb 2007, Evgeniy Polyakov wrote:
> Why userspace rescheduling is in order of tens times faster than
> kernel/user?
About 50 times on my Opteron 254, actually. That's libpcl's (swapcontext
based) cobench against lat_ctx.
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 19:22 ` Linus Torvalds
@ 2007-02-26 19:30 ` Evgeniy Polyakov
2007-02-26 20:04 ` Linus Torvalds
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 19:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 11:22:46AM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> See? Stop blathering about how everything is an event. THAT'S NOT
> RELEVANT. I've told you a hundred times - they may be "logically
> equivalent", but that doesn't change ANYTHING. Event-based programming
> simply isn't suitable for 99% of all stuff, and for the 1% where it *is*
> suitable, it actually tends to be a very specific subset of the code that
> you actually use events for (ie accept and read/write on pure streams).
Will you argue that people do things like
num = epoll_wait()
for (i=0; i<num; ++i) {
process(event[i])?
}
Will you spawn thread per IO?
Stop writing the same thing again and again - I perfectly understand
that not everything can be easily covered by events, but covering
everything with threads is an even more stupid idea.
High-performance IO requires as little overhead as possible; dispatching
events from a per-CPU ring buffer or queue is the smallest overhead
there is, not spawning a thread per read.
> Linus
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 18:32 ` Evgeniy Polyakov
@ 2007-02-26 19:22 ` Linus Torvalds
2007-02-26 19:30 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Linus Torvalds @ 2007-02-26 19:22 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
>
> I want to say, that read() consists of tons of events, but programmer
> needs only one - data is ready in requested buffer. Programmer might
> not even know what is the object behind provided file descriptor.
> One only wants data in the buffer.
You're not following the discussion.
First off, I already *told* you that "read()" is the absolute simplest
case, and yes, we could make it an event if you also passed in the "which
range of the file do we care about" information (we could consider it
"f_pos", which the kernel already knows about, but that doesn't handle
pread()/pwrite(), so it's not very good for many cases).
But that's not THE ISSUE.
The issue is that it's a horrible interface from a user's standpoint.
It's a lot better to program certain things as a thread. Why do you argue
against that, when that is just obviously true.
There's a reason that people write code that is functional, rather than
write code as a state machine.
We simply don't write code like
        for (;;) {
                switch (state) {
                case Open:
                        fd = open();
                        if (fd < 0)
                                break;
                        state = Stat;
                case Stat:
                        if (fstat(fd, &stat) < 0)
                                break;
                        state = Read;
                case Read:
                        count = read(fd, buf + pos, size - pos);
                        if (count < 0)
                                break;
                        pos += count;
                        if (!count || pos == size)
                                state = Close;
                        continue;
                case Close:
                        if (close(fd) < 0)
                                break;
                        state = Done;
                        return 0;
                }
                /* a switch 'break' above means error or EAGAIN: leave the loop */
                break;
        }
        /* Returning 1 means wait in the event loop .. */
        if (errno == EAGAIN || errno == EINTR)
                return 1;
        /* Else we had a real error */
        state = Error;
        return -1;
and instead we write it as
        fd = open(..)
        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return -1;
        }
        ..
and if you cannot see the *reason* why people don't use event-based
programming for everything, I don't see the point of continuing this
discussion.
See? Stop blathering about how everything is an event. THAT'S NOT
RELEVANT. I've told you a hundred times - they may be "logically
equivalent", but that doesn't change ANYTHING. Event-based programming
simply isn't suitable for 99% of all stuff, and for the 1% where it *is*
suitable, it actually tends to be a very specific subset of the code that
you actually use events for (ie accept and read/write on pure streams).
Linus
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 18:56 ` Chris Friesen
@ 2007-02-26 19:20 ` Evgeniy Polyakov
0 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 19:20 UTC (permalink / raw)
To: Chris Friesen
Cc: Arjan van de Ven, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
linux-kernel, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 12:56:33PM -0600, Chris Friesen (cfriesen@nortel.com) wrote:
> Evgeniy Polyakov wrote:
>
> >I never ever tried to say _everything_ must be driven by events.
> >IO must be driven, it is a must IMO.
>
> Do you disagree with Linus' post about the difficulty of treating
> open(), fstat(), page faults, etc. as events? Or do you not consider
> them to be IO?
From a practical point of view - yes, some of those operations are
complex enough not to be attractive as an async usage model.
But I'm absolutely for the scenario where several operations are
performed asynchronously, like open+stat+fadvise+sendfile.
By IO I meant something which has an end result, and that result must be
enough to start async processing - data in the buffer, for example.
An async open I would combine with the actual data processing - that can
be a single event.
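The synchronous shape of that open+stat+fadvise+sendfile chain is
roughly the following (a sketch using the standard Linux calls; the
argument above is that one would like to submit the whole sequence
asynchronously and get back a single completion event):

#include <fcntl.h>
#include <sys/stat.h>
#include <sys/sendfile.h>
#include <unistd.h>

static ssize_t send_file_to_socket(int sock, const char *path)
{
        struct stat st;
        off_t off = 0;
        ssize_t ret;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0 ||
            posix_fadvise(fd, 0, st.st_size, POSIX_FADV_SEQUENTIAL) != 0) {
                close(fd);
                return -1;
        }
        ret = sendfile(sock, fd, &off, st.st_size);  /* in-kernel transmit */
        close(fd);
        return ret;
}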
> Chris
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 18:38 ` Evgeniy Polyakov
@ 2007-02-26 18:56 ` Chris Friesen
2007-02-26 19:20 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Chris Friesen @ 2007-02-26 18:56 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Arjan van de Ven, Ingo Molnar, Linus Torvalds, Ulrich Drepper,
linux-kernel, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
Evgeniy Polyakov wrote:
> I never ever tried to say _everything_ must be driven by events.
> IO must be driven, it is a must IMO.
Do you disagree with Linus' post about the difficulty of treating
open(), fstat(), page faults, etc. as events? Or do you not consider
them to be IO?
Chris
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 18:19 ` Arjan van de Ven
@ 2007-02-26 18:38 ` Evgeniy Polyakov
2007-02-26 18:56 ` Chris Friesen
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 18:38 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Ingo Molnar, Linus Torvalds, Ulrich Drepper, linux-kernel,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 10:19:03AM -0800, Arjan van de Ven (arjan@infradead.org) wrote:
> On Mon, 2007-02-26 at 20:37 +0300, Evgeniy Polyakov wrote:
>
> > I tend to agree.
> > Yes, some loads require event driven model, other can be done using
> > threads.
>
> event driven model is really complex though. For event driven to work
> well you basically can't tolerate blocking calls at all ...
> open() blocks on files. read() may block on metadata reads (say indirect
> blockgroups) not just on datareads etc etc. It gets really hairy really
> quick that way.
I never ever tried to say _everything_ must be driven by events.
IO must be driven, it is a must IMO.
Some (and a lot of) things can easily be programmed with threads.
Threads are essentially for those tasks which cannot be easily
programmed with events.
But keep in mind the fact that thread execution _is_ an event itself:
its completion is an event.
> --
> if you want to mail me at work (you don't), use arjan (at) linux.intel.com
> Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 17:57 ` Linus Torvalds
@ 2007-02-26 18:32 ` Evgeniy Polyakov
2007-02-26 19:22 ` Linus Torvalds
2007-02-26 19:54 ` Ingo Molnar
1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 18:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 09:57:00AM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> Similarly, even for a simple "read()" on a filesystem, there is no way to
> just say "block until data is available" like there is for a socket,
> because on a filesystem, the data may be available, BUT AT THE WRONG
> OFFSET. So while a socket or a pipe are both simple "streaming interfaces"
> as far as a "read()" is concerned, a file access is *not* a simple
> streaming interface.
>
> Notice? For a read()/recvmsg() call on a socket or a pipe, there is no
> "position" involved. The "event" is clear: it's always the head of the
> streaming interface that is relevant, and the event is "is there room" or
> "is there data". It's an event-based thing.
>
> But for a read() on a file, it's no longer a streaming interface, and
> there is no longer a simple "is there data" event. You'd have to make the
> event be a much more complex "is there data at position X through Y" kind
> of thing.
>
> And "read()" on a filesystem is the _simple_ case. Sure, we could add
> support for those kinds of ranges, and have an event interface for that.
> But the "open a filename" is much more complicated, and doesn't even have
> a file descriptor available to it (since we're trying to _create_ one), so
> you'd have to do something even more complex to have the event "that
> filename can now be opened without blocking".
>
> See? Even if you could make those kinds of events, it would be absolutely
> HORRIBLE to code for. And it would suck horribly performance-wise for most
> code too.
I think I see the main trouble in our discussion.
I never ever tried to say that every bit in read() should be
non-blocking - it would be even more stupid than you expect it to be.
But you and Ingo do not want to see what I'm trying to say, because it
is cozier to read one's own right ideas than others'.
I want to say that read() consists of tons of events, but the programmer
needs only one - data is ready in the requested buffer. The programmer
might not even know what object is behind the provided file descriptor.
One only wants data in the buffer.
That is an event.
It is also possible to have other events inside the complex read()
machinery - for example waiting for a page obtained via ->readpages().
> THAT is what I'm saying. There's a *difference* between event-based and
> thread-based programming. It makes no sense to try to turn one into the
> other. But it often makes sense to *combine* the two approaches.
Threads are simple just because there is an interface.
Change the interface, and no one will ever want to do it that way.
Provide a nice aio_read() (forget about posix) and everyone will use it.
I always said that combining these models is a must, I fully agree,
but it seems that we still do not agree on where the boundary should be
drawn.
> > Userspace wants to open a file, so it needs some file-related (inode,
> > direntry and others) structures in the mem, they should be read from
> > disk. Eventually it will be reading some blocks from the disk
> > (for example ext3_lookup->ext3_find_entry->ext3_getblk/ll_rw_block) and
> > we will wait for them (wait_on_bit()) - we will wait for event.
> >
> > But I agree, it was a brainfscking example, but nevertheless, it can be
> > easily done using event driven model.
> >
> > Reading from the disk is _exactly_ the same - the same waiting for
> > buffer_heads/pages, and (since it is bigger) it can be easily
> > transferred to event driven model.
> > Ugh, wait, it not only _can_ be transferred, it is already done in
> > kevent AIO, and it shows faster speeds (though I only tested sending
> > them over the net).
>
> It would be absolutely horrible to program for. Try anything more complex
> than read/write (which is the simplest case, but even that is nasty).
>
> Try imagining yourself in the shoes of a database server (or just about
> anything else). Imagine what kind of code you want to write. You probably
> do *not* want to have everything be one big event loop, and having to make
> different "states" for "I'm trying to open the file", "I opened the file,
> am now doing 'fstat()' to figure out how big it is", "I'm now reading the
> file and have read X bytes of the total Y bytes I want to read", "I took a
> page fault in the middle" etc etc.
>
> I pretty much can *guarantee* you that you'll never see anybody do that.
> Page faults in user space are particularly hard to handle in a state
> machine, since they basically require saving the whole thread state, as
> they can happen on any random access. So yeah, you could do them as a
> state machine, but in reality it would just become a "user-level thread
> library" in the end, just to handle those.
>
> In contrast, if you start using thread-like programming to begin with, you
> have none of those issues. Sure, some thread may block because you got a
> page fault, or because an inode needed to be brought into memory, but from
> a user-level programming interface standpoint, the thread library just
> takes care of the "state machine" on its own, so it's much simpler, and in
> the end more efficient.
>
> And *THAT* is what I'm trying to say. Some simple obvious events are
> better handled and seen as "events" in user space. But other things are so
> intertwined, and have basically random state associated with them, that
> they are better seen as threads.
In threading you still need to do exactly the same, since stat() cannot
be done before open(). So you will maintain some state (actually an
order of operations which will not change).
Have the same in a function.
Then start execution - it can be done perfectly well using threads.
Just because it is inconvenient to program using state machines and
there is no appropriate interface.
But there are other operations.
Consider a database or whatever you like, which wants to read thousands
of blocks from disk; each access potentially blocks, and blocking on a
mutex is nothing compared to blocking while waiting for storage to be
ready to provide data (network, disk, whatever). If storage is not
optimized (or has a small cache, or something else) we end up with
blocked contexts, which sleep - thousands of contexts.
And this number will not decrease.
So we end up with enormous overhead just because we do not have a
simple enough aio_read() and aio_wait().
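The shape of that aio_read()/aio_wait() pair already exists in POSIX AIO
(which Evgeniy explicitly wants to improve on, so this is only an
illustration of the idea, with aio_suspend() standing in for the
"aio_wait()" above):

#include <aio.h>
#include <string.h>
#include <sys/types.h>

/* Submit one read without blocking the caller, then wait for completion.
 * In the scenario above thousands of these would be in flight at once. */
static ssize_t read_block_async(int fd, void *buf, size_t len, off_t off)
{
        struct aiocb cb;
        const struct aiocb *list[1] = { &cb };

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = len;
        cb.aio_offset = off;

        if (aio_read(&cb) < 0)          /* submission only, does not block */
                return -1;

        /* ... submit more requests, do other work ... */

        if (aio_suspend(list, 1, NULL) < 0)     /* the "aio_wait()" step */
                return -1;
        if (aio_error(&cb) != 0)
                return -1;
        return aio_return(&cb);
}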
> Yes, from a "turing machine" kind of viewpoint, the two are 100% logically
> equivalent. But "logical equivalence" does NOT translate into "can
> practically speaking be implemented".
You have finally said it!
"can practically speaking be implemented".
I see your and Ingo's point - kevent is 'merde' (pardon my French, and I
am even somehow glad Ingo said (almost) that - I now have plenty of time
for other interesting projects), we want threads.
Ok, you have threads, but you need to wait on (some of) them.
So you will invent the wheel again - some subsystem which will wait for
threads (likely waiting on futexes), then for other events (probably).
Eventually it will be the same thing from a different point of view :)
Anyway, I _do_ hope we understood each other in that both events and
threads can co-exist, although the boundary was not drawn.
My point is that many things can be done using events just because they
are faster and smaller - and IO (any IO, without limit) is one of the
areas where events are unbeatable.
Threading in turn can serve as a higher-layer abstraction model - a
thread can wrap the whole transaction with all its related data flows,
but not a thread per IO.
> Linus
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 17:37 ` Evgeniy Polyakov
@ 2007-02-26 18:19 ` Arjan van de Ven
2007-02-26 18:38 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Arjan van de Ven @ 2007-02-26 18:19 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Linus Torvalds, Ulrich Drepper, linux-kernel,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Mon, 2007-02-26 at 20:37 +0300, Evgeniy Polyakov wrote:
> I tend to agree.
> Yes, some loads require event driven model, other can be done using
> threads.
event driven model is really complex though. For event driven to work
well you basically can't tolerate blocking calls at all ...
open() blocks on files. read() may block on metadata reads (say indirect
blockgroups) not just on datareads etc etc. It gets really hairy really
quick that way.
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 17:28 ` Evgeniy Polyakov
@ 2007-02-26 17:57 ` Linus Torvalds
2007-02-26 18:32 ` Evgeniy Polyakov
2007-02-26 19:54 ` Ingo Molnar
0 siblings, 2 replies; 277+ messages in thread
From: Linus Torvalds @ 2007-02-26 17:57 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
>
> Linus, you made your point clearly - generic AIO should not be used for
> the cases, when it is supposed to block 90% of the time - only when it
> almost never blocks, like in case of buffered IO.
I don't think it's quite that simple.
EVEN *IF* it were to block 100% of the time, it depends on other things
than just "blockingness".
For example, let's look at something like
        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return -1;
        }
        .. do something ..
and please realize that EVEN IF YOU KNOW WITH 100% CERTAINTY that the
above open (or fstat()) is going to block, because you know that your
working set is bigger than the available memory for caching, YOU SIMPLY
CANNOT SANELY WRITE THAT AS AN EVENT-BASED STATE MACHINE!
It's really that simple. Some things block "in the middle". The reason
UNIX made non-blocking reads available for networking, but not for
filesystem accesses is not because one blocks and the other doesn't. No,
it's really much more fundamental than that!
When you do a "recvmsg()", there is a clear event-based model: you can
return -EAGAIN if the data simply isn't there, because a network
connection is a simple stream of data, and there is a clear event on "ok,
data arrived" without any state what-so-ever.
The same is simply not true for "open a file descriptor". There is no sane
way to turn the "filename lookup blocked" into an event model with a
select- or kevent-based interface.
Similarly, even for a simple "read()" on a filesystem, there is no way to
just say "block until data is available" like there is for a socket,
because on a filesystem, the data may be available, BUT AT THE WRONG
OFFSET. So while a socket or a pipe are both simple "streaming interfaces"
as far as a "read()" is concerned, a file access is *not* a simple
streaming interface.
Notice? For a read()/recvmsg() call on a socket or a pipe, there is no
"position" involved. The "event" is clear: it's always the head of the
streaming interface that is relevant, and the event is "is there room" or
"is there data". It's an event-based thing.
But for a read() on a file, it's no longer a streaming interface, and
there is no longer a simple "is there data" event. You'd have to make the
event be a much more complex "is there data at position X through Y" kind
of thing.
And "read()" on a filesystem is the _simple_ case. Sure, we could add
support for those kinds of ranges, and have an event interface for that.
But the "open a filename" is much more complicated, and doesn't even have
a file descriptor available to it (since we're trying to _create_ one), so
you'd have to do something even more complex to have the event "that
filename can now be opened without blocking".
See? Even if you could make those kinds of events, it would be absolutely
HORRIBLE to code for. And it would suck horribly performance-wise for most
code too.
THAT is what I'm saying. There's a *difference* between event-based and
thread-based programming. It makes no sense to try to turn one into the
other. But it often makes sense to *combine* the two approaches.
> Userspace wants to open a file, so it needs some file-related (inode,
> direntry and others) structures in the mem, they should be read from
> disk. Eventually it will be reading some blocks from the disk
> (for example ext3_lookup->ext3_find_entry->ext3_getblk/ll_rw_block) and
> we will wait for them (wait_on_bit()) - we will wait for event.
>
> But I agree, it was a brainfscking example, but nevertheless, it can be
> easily done using event driven model.
>
> Reading from the disk is _exactly_ the same - the same waiting for
> buffer_heads/pages, and (since it is bigger) it can be easily
> transferred to event driven model.
> Ugh, wait, it not only _can_ be transferred, it is already done in
> kevent AIO, and it shows faster speeds (though I only tested sending
> them over the net).
It would be absolutely horrible to program for. Try anything more complex
than read/write (which is the simplest case, but even that is nasty).
Try imagining yourself in the shoes of a database server (or just about
anything else). Imagine what kind of code you want to write. You probably
do *not* want to have everything be one big event loop, and having to make
different "states" for "I'm trying to open the file", "I opened the file,
am now doing 'fstat()' to figure out how big it is", "I'm now reading the
file and have read X bytes of the total Y bytes I want to read", "I took a
page fault in the middle" etc etc.
I pretty much can *guarantee* you that you'll never see anybody do that.
Page faults in user space are particularly hard to handle in a state
machine, since they basically require saving the whole thread state, as
they can happen on any random access. So yeah, you could do them as a
state machine, but in reality it would just become a "user-level thread
library" in the end, just to handle those.
In contrast, if you start using thread-like programming to begin with, you
have none of those issues. Sure, some thread may block because you got a
page fault, or because an inode needed to be brought into memory, but from
a user-level programming interface standpoint, the thread library just
takes care of the "state machine" on its own, so it's much simpler, and in
the end more efficient.
And *THAT* is what I'm trying to say. Some simple obvious events are
better handled and seen as "events" in user space. But other things are so
intertwined, and have basically random state associated with them, that
they are better seen as threads.
Yes, from a "turing machine" kind of viewpoint, the two are 100% logically
equivalent. But "logical equivalence" does NOT translate into "can
practically speaking be implemented".
Linus
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 13:11 ` Ingo Molnar
@ 2007-02-26 17:37 ` Evgeniy Polyakov
2007-02-26 18:19 ` Arjan van de Ven
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 17:37 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 02:11:33PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > > My tests show that with 4k connections per second (8k concurrency)
> > > more than 20k connections of 80k total block in tcp_sendmsg() over
> > > gigabit lan between quite fast machines.
> >
> > Why do people *keep* taking this up as an issue?
> >
> > Use select/poll/epoll/kevent/whatever for event mechanisms. STOP
> > CLAIMING that you'd use threadlets/syslets/aio for that. It's been
> > pointed out over and over and over again, and yet you continue to make
> > the same mistake, Evgeniy.
> >
> > So please read that sentence ten times, and then don't continue to
> > make that same mistake. PLEASE.
> >
> > Event mechanisms are *superior* for events. But they *suck* for things
> > that aren't events, but are actual code execution with random places
> > that can block. THE TWO THINGS ARE TOTALLY AND UTTERLY INDEPENDENT!
>
> Note that even for something tasks are supposed to suck at, and even if
> used in extremely stupid ways, they perform reasonably well in practice
> ;-)
>
> And i fully agree: specialization based on knowledge about frequency of
> blocking will always be useful - if not /forced/ on the workflow
> architecture and if not overdone. On the other hand, fully event-driven
> servers based on 'nonblocking' calls, which Evgeniy is advocating and
> which the kevent model is forcing upon userspace, is pure madness.
>
> We very much can and should use things like epoll for events that we
> expect to happen asynchronously 100% of the time - it just makes no
> sense for those events to take up 4-5K of RAM apiece, when they could
> also be only using up the 32 bytes that say a pending timer takes. I've
> posted the code for that, how to do an 'outer' epoll loop around an
> internal threadlep iterator. But those will always be very narrow event
> sources, and likely wont (and shouldnt) cover 'request-internal'
> processing.
>
> but otherwise, there is no real difference between a task that is
> scheduled and a request that is queued, 'other' than the size of the
> request (the task takes 4-5K of RAM), and the register context (64-128
> bytes on most CPUs, the loading of which is optimized to death).
>
> Which difference can still be significant for certain workloads, so we
> certainly dont want to prohibit specialized event interfaces and force
> generic threads on everything. But for anything that isnt a raw and
> natural external event source (time, network, disk, user-generated)
> there shouldnt be much of an event queueing abstraction i believe (other
> than what we get 'for free' within epoll, from having poll()-able files)
> - and even for those event sources threadlets offer a pretty good run
> for the money.
I tend to agree.
Yes, some loads require an event-driven model; others can be done using
threads. The only reason kevent was created was to allow any type of
event to be processed in exactly the same way in the same processing
loop; it was optimized so that the event registration structure is
smaller than a cache line.
What I cannot agree with is the idea that IO is thread-based stuff.
> one can always find the point and workload where say 40,000 threads
> start trashing the L2 cache, but where 40,000 queued special requests
> are still fully in cache, and produce spectacular numbers.
>
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 22:44 ` Linus Torvalds
2007-02-26 13:11 ` Ingo Molnar
@ 2007-02-26 17:28 ` Evgeniy Polyakov
2007-02-26 17:57 ` Linus Torvalds
1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 17:28 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Sun, Feb 25, 2007 at 02:44:11PM -0800, Linus Torvalds (torvalds@linux-foundation.org) wrote:
>
>
> On Thu, 22 Feb 2007, Evgeniy Polyakov wrote:
> >
> > My tests show that with 4k connections per second (8k concurrency) more
> > than 20k connections of 80k total block in tcp_sendmsg() over gigabit
> > lan between quite fast machines.
>
> Why do people *keep* taking this up as an issue?
Let's warm our brains up a little with this pseudo-technical word-throwing :)
> Use select/poll/epoll/kevent/whatever for event mechanisms. STOP CLAIMING
> that you'd use threadlets/syslets/aio for that. It's been pointed out over
> and over and over again, and yet you continue to make the same mistake,
> Evgeniy.
>
> So please read that sentence ten times, and then don't continue to make
> that same mistake. PLEASE.
>
> Event mechanisms are *superior* for events. But they *suck* for things
> that aren't events, but are actual code execution with random places that
> can block. THE TWO THINGS ARE TOTALLY AND UTTERLY INDEPENDENT!
>
> Examples of events:
> - packet arrives
> - timer happens
>
> Examples of things that are *not* "events":
> - filesystem lookup.
> - page faults
>
> So the basic point is: for events, you use an event-based thing. For code
> execution, you use a thread-based thing. It's really that simple.
Linus, you made your point clearly - generic AIO should not be used for
cases where it is expected to block 90% of the time, only when it
almost never blocks, as with buffered IO.
Otherwise, when the nature of the load means we almost always block, we
would see that each block is eventually removed - by something which
calls wake_up(); that something is the event we were supposed to wait
for, but instead we were rescheduled and waited there, so we just added
an extra layer of indirection - we were scheduled away, did some work,
then were woken up, instead of doing some work and getting the event.
I am not even disputing that the micro-threading model is far simpler to
program, but from the above example it is clear that it adds extra
overhead, which in turn can be high or noticeable.
> And yes, the two different things can usually be translated (at a very
> high cost in complexity *and* performance) into each other, so people who
> look at it as purely a theoretical exercise may think that "events" and
> "code execution" are equivalent. That's a very very silly and stupid way
> of looking at things in real life, though.
>
> Yes, you can turn things that are better seen as threaded execution into
> an event-based thing by turning it into a state machine. And usually that
> is a TOTAL DISASTER, and the end result is fragile and impossible to
> maintain.
>
> And yes, you can often (more easily) turn an event-based mechanism into a
> thread-based one, and usually the end result is a TOTAL DISASTER because
> it doesn't scale very well, and while it may actually result in somewhat
> simpler code, the overhead of managing ten thousand outstanding threads is
> just too high, when you compare to managing just a list of ten thousand
> outstanding events.
>
> And yes, people have done both of those mistakes. Java, for example,
> largely did the latter mistake ("we don't need anything like 'select',
> because we'll just use threads for everything" - what a totally moronic
> thing to do!)
I can only say that I fully agree. Absolutely. No jokes.
> So Evgeniy, threadlets/syslets/aio is *not* a replacement for event
> queues. It's a TOTALLY DIFFERENT MECHANISM, and one that is hugely
> superior to event queues for certain kinds of things. Anybody who thinks
> they want to do pathname and inode lookup as a series of events is likely
> a moron. It's really that simple.
Hmmm... let me describe that process a bit:
Userspace wants to open a file, so it needs some file-related structures
(inode, dentry and others) in memory, and they have to be read from
disk. Eventually it will be reading some blocks from the disk
(for example ext3_lookup->ext3_find_entry->ext3_getblk/ll_rw_block) and
we will wait for them (wait_on_bit()) - we will wait for an event.
But I agree, it was a brainfscking example; nevertheless, it can be
easily done using an event-driven model.
Reading from the disk is _exactly_ the same - the same waiting for
buffer_heads/pages, and (since it is bigger) it can be easily
transferred to an event-driven model.
Ugh, wait, it not only _can_ be transferred, it has already been done in
kevent AIO, and it shows faster speeds (though I only tested sending
the data over the net).
> In a complex server (say, a database), you'd use both. You'd probably use
> events for doing the things you *already* use events for (whether it be
> select/poll/epoll or whatever): probably things like the client network
> connection handling.
>
> But you'd *in addition* use threadlets to be able to do the actual
> database IO in a threaded manner, so that you can scale the things that
> are not easily handled as events (usually because they have internal
> kernel state that the user cannot even see, and *must*not* see because of
> security issues).
An event is data readiness - no more, no less.
It has nothing to do with internal kernel structures - you just wait
until data is ready in the requested buffer (disk, net, whatever).
The internal mechanism that moves data to the destination point can use
an event-driven model too, but that is another question.
Eventually threads wait for the same events - but there is additional
overhead in managing those objects called threads.
Ingo says that it is fast to manage them, but it cannot be faster than a
properly built event-driven abstraction because, as you noted yourself,
the mutual transformations are complex.
Threadlets are simpler to program, but they add no gain compared to a
properly built single-threaded model (or one-thread-per-CPU model) with
the right event-processing mechanism.
Waiting for any IO is waiting for an event; other tasks can be made into
events too, but I agree, it is simpler to use different models just
because they already exist.
> So please. Stop this "kevents are better". The only thing you show by
> trying to go down that avenue is that you don't understand the
> *difference* between an event model and a thread model. They are both
> perfectly fine models and they ARE NOT THE SAME! They aren't even mutually
> incompatible - quite the reverse.
>
> The thing people want to remove with threadlets is the internal overhead
> of maintaining special-purpose code like aio_read() inside the kernel,
> that doesn't even do all that people want it to do, and that really does
> need a fair amount of internal complexity that we could hopefully do with
> a more generic (and hopefully *simpler*) model.
Let me rephrase that:
the thing people want to remove with linked lists is the internal
overhead of maintaining special-purpose code like RB trees.
It sounds similarly absurd.
If additional code provides faster processing, it should be used, rather
than feared as 'the internal overhead of maintaining special-purpose
code'.
> Linus
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 14:15 ` Ingo Molnar
@ 2007-02-26 16:55 ` Evgeniy Polyakov
2007-02-26 20:35 ` Ingo Molnar
2007-02-27 2:18 ` Davide Libenzi
0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 16:55 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 03:15:18PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > > your whole reasoning seems to be faith-based:
> > >
> > > [...] Anyway, kevents are very small, threads are very big, [...]
> > >
> > > How about following the scientific method instead?
> >
> > That are only rethorical words as you have understood I bet, I meant
> > that the whole process of getting readiness notification from kevent
> > is way tooo much faster than resheduling of the new process/thread to
> > handle that IO.
> >
> > The whole process of switching from one process to another can be as
> > fast as bloody hell, but all other details just kill the thing.
>
> for our primary abstractions there /IS NO OTHER DETAIL/ but wakeup and
> context-switching! The "event notification" of a sys_read() /IS/ the
> wakeup and context-switching that we do - or the epoll/kevent enqueueing
> as an alternative.
>
> yes, the two are still different in a number of ways, and yes, it's
> still stupid to do a pool of thousands of threads and thus we can always
> optimize queuing, RAM and cache footprint via specialization, but your
> whole foundation seems to be constructed around the false notion that
> queueing and scheduling a task by the scheduler is somehow magically
> expensive and different from queueing and scheduling other type of
> requests. Please reconsider that foundation and open up a bit more to a
> slightly different world view: scheduling is really just another, more
> generic (and thus certainly more expensive) type of 'request queueing',
> and user-space, most of the time, is much better off if it handles its
> 'requests' and 'events' via tasks. (Especially if many of those 'events'
> turn out to be non-events at all, so to speak.)
If kernel-space rescheduling is that fast, then please explain to me why
userspace rescheduling always beats kernel/userspace rescheduling?
And you showed that threadlets without a polling accept still do not
scale well - if it is the same fast queueing of events, then why doesn't
it work?
Actually it does not matter exactly where that bottleneck is (the
kernel/user transition, register copying or something else); it can be
eliminated in a different model - kevent is that model - it does not
require a lot of things to change to get a notification and start
working, so it scales better.
It is very similar to epoll, but there are at least two significant
differences:
1. it can work with _any_ type of event with minimal overhead (which
cannot even remotely be compared with binding to a 'file', which is
required to be pollable).
2. its notifications do not go through a second loop, i.e. it is O(1),
not O(ready_num), and notifications happen directly from the internals
of the appropriate subsystem, which does not require a special wakeup
(although that can be done too).
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 12:51 ` Ingo Molnar
@ 2007-02-26 16:46 ` Evgeniy Polyakov
2007-02-27 6:24 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 16:46 UTC (permalink / raw)
To: Ingo Molnar
Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 01:51:23PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > Even having main dispatcher as epoll/kevent loop, the _whole_
> > threadlet model is absolutely micro-thread in nature and not state
> > machine/event.
>
> Evgeniy, i'm not sure how many different ways to tell this to you, but
> you are not listening, you are not learning and you are still not
> getting it at all.
>
> The scheduler /IS/ a generic work/event queue. And it's pretty damn
> fast. No amount of badmouthing will change that basic fact. Not exactly
> as fast as a special-purpose queueing system (for all the reasons i
> outlined to you, and which you ignored), but it gets pretty damn close
> even for the web workload /you/ identified, and offers a user-space
> programming model that is about 1000 times more useful than
> state-machines.
Meanwhile, on the practical side:
VIA EPIA, kevent/epoll/threadlet:
client: ab -c500 -n5000 $url
kevent: 849.72
epoll: 538.16
threadlet:
gcc ./evserver_epoll_threadlet.c -o ./evserver_epoll_threadlet
In file included from ./evserver_epoll_threadlet.c:30:
./threadlet.h: In function ‘threadlet_exec’:
./threadlet.h:46: error: can't find a register in class ‘GENERAL_REGS’
while reloading ‘asm’
That particular asm optimization fails to compile.
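As a hedged aside on that error (illustrative code only, not the actual
threadlet.h asm): gcc emits it when an asm statement needs more
general-purpose registers for its operands than register allocation can
still provide, which is easy to hit on 32-bit x86 at -O0 or with -fPIC,
where %ebp and/or %ebx are reserved:
/* hypothetical example - eight "r" operands exceed the free registers */
long too_many_regs(long a, long b, long c, long d, long e, long f, long g)
{
	long ret;
	asm volatile ("nop"
		      : "=r" (ret)
		      : "r" (a), "r" (b), "r" (c), "r" (d),
			"r" (e), "r" (f), "r" (g));
	return ret;
}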
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 14:05 ` Evgeniy Polyakov
@ 2007-02-26 14:15 ` Ingo Molnar
2007-02-26 16:55 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 14:15 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > your whole reasoning seems to be faith-based:
> >
> > [...] Anyway, kevents are very small, threads are very big, [...]
> >
> > How about following the scientific method instead?
>
> That are only rethorical words as you have understood I bet, I meant
> that the whole process of getting readiness notification from kevent
> is way tooo much faster than resheduling of the new process/thread to
> handle that IO.
>
> The whole process of switching from one process to another can be as
> fast as bloody hell, but all other details just kill the thing.
for our primary abstractions there /IS NO OTHER DETAIL/ but wakeup and
context-switching! The "event notification" of a sys_read() /IS/ the
wakeup and context-switching that we do - or the epoll/kevent enqueueing
as an alternative.
yes, the two are still different in a number of ways, and yes, it's
still stupid to do a pool of thousands of threads and thus we can always
optimize queuing, RAM and cache footprint via specialization, but your
whole foundation seems to be constructed around the false notion that
queueing and scheduling a task by the scheduler is somehow magically
expensive and different from queueing and scheduling other type of
requests. Please reconsider that foundation and open up a bit more to a
slightly different world view: scheduling is really just another, more
generic (and thus certainly more expensive) type of 'request queueing',
and user-space, most of the time, is much better off if it handles its
'requests' and 'events' via tasks. (Especially if many of those 'events'
turn out to be non-events at all, so to speak.)
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 12:39 ` Ingo Molnar
@ 2007-02-26 14:05 ` Evgeniy Polyakov
2007-02-26 14:15 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 14:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 01:39:23PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > > > Kevent is a _very_ small entity and there is _no_ cost of
> > > > requeueing (well, there is list_add guarded by lock) - after it is
> > > > done, process can start real work. With rescheduling there are
> > > > _too_ many things to be done before we can start new work. [...]
> > >
> > > actually, no. For example a wakeup too is fundamentally a list_add
> > > guarded by a lock. Take a look at try_to_wake_up(). The rest you see
> > > there is just extra frills that relate to things like
> > > 'load-balancing the requests over multiple CPUs [which i'm sure
> > > kevent users would request in the future too]'.
> >
> > wake_up() as a call is pretty simple and fast, but its result - it is
> > slow. [...]
>
> You are still very much wrong, and now you refuse to even /read/ what i
> wrote. Your only reply to my detailed analysis is: "it is slow, because
> it is slow and heavy". I told you how fast it is, i told you what
> happens on a context switch and why, i told you that you can measure if
> you want.
Ingo, you likely will not believe it, but your mails are among the few
that I always read several times to get every bit of them :)
I clearly understand your point of view; it is absolutely clear to me.
But... I cannot agree with it.
Both for theoretical reasons and for practical ones based on my
measurements. It is not pure speculation, as one might expect, but a
real-life comparison of kernel/user scheduling with a pure userspace one
(as in my own M:N threading lib, or a concurrent programming language
like Erlang).
For me (and probably _only_ for me), a library showing 10-times-faster
rescheduling is reason enough to start developing my own, so I pointed
to it in the discussion.
> > [...] I did not run reschedulingtests with kernel thread, but posix
> > threads (they do look like a kernel thread) have significant overhead
> > there.
>
> You are wrong. Let me show you some more numbers. This is a
> hackbench_pth.c run:
>
> $ ./hackbench_pth 500
> Time: 14.371
>
> this uses 20,000 real threads and during this test the runqueue length
> is extreme - up to over a ten thousand threads. (hackbench_pth.c was
> posted to lkml recently.
>
> The same run with hackbench.c (20,000 forked processes):
>
> $ ./hackbench 500
> Time: 14.632
>
> so the TLB overhead from using processes is 1.8%.
>
> > [...] In early developemnt days of M:N threading library I tested
> > rescheduling performance of the POSIX threads - I created pool of
> > threads and 'sent' a message using futex wait/wake - such performance
> > of the userspace threading library (I tested erlang) was 10 times
> > slower.
>
> how much would it take for you to actually re-measure it and interpet
> the results you are seeing? You've apparently built a whole mental house
> of cards on the flawed proposition that tasks are 'super-heavy' and that
> context-switching them is 'slow'. You are unwilling to explain /how/
> they are slow, and all the numbers i post are contrary to that
> proposition of yours.
>
> your whole reasoning seems to be faith-based:
>
> [...] Anyway, kevents are very small, threads are very big, [...]
>
> How about following the scientific method instead?
Those are only rhetorical words, as I bet you understood; I meant that
the whole process of getting a readiness notification from kevent is way
too much faster than rescheduling a new process/thread to handle that
IO.
The whole process of switching from one process to another can be as
fast as bloody hell, but all the other details just kill the thing.
I do not know which exact line ends up being the problem, but having
thousands of threads, the rescheduling itself results in slower
performance than userspace rescheduling. Then I extrapolate that to our
IO test case.
> > [...] and both are the way they are exactly on purpose - threads serve
> > for processing of any generic code, kevents are used for event waiting
> > - IO is such an event, it does not require a lot of infrastructure to
> > handle, it only nees some simple bits, so it can be optimized to be
> > extremely fast, with huge infrastructure behind each IO (like in case
> > when it is a separated thread) it can not be done effectively.
>
> you are wrong, and i have pointed it out to you in my previous replies
> why you are wrong. Your only coherent specific thought on this topic was
> your incorrect assumption is that the scheduler somehow saves registers
> and that this makes it heavy. I pointed it out to you in the mail you
> reply to that /every/ system call that saves user registers. You've not
> even replied to that point of mine, you are ignoring it completely and
> you are still repeating your same old, incorrect argument. If it is
> heavy, /why/ do you think it is heavy? Where is that magic pixie dust
> piece of scheduler code that miraculously turns the runqueue into a
> molass slow, heavy piece of thing?
I do not argue that I am absolutely right.
I am just pointing out that I tested some cases and those tests end up
with completely broken behaviour for the micro-thread design (even
leaving aside the case of thousands of new thread creations/reuses per
second, which itself does not look perfect).
I am not even trying to say that threadlets suck (although I do believe
they do in some cases, at least for now :) ), I am just pointing out
that the rescheduling overhead turned out to be too big in the
benchmarks I ran (which you never replied to either, but that does not
matter after all :).
It could come down to a (handwaving) broken syscall wrapper
implementation, or to anything else. Absolutely.
I never tried to say that the scheduler's code is broken - I just showed
my own tests, which resulted in a situation where many working threads
ended up with worse timings than some other approach.
Register/TLB/whatever is just speculation about the _possible_ root of
the problem. I did not investigate the problem enough - I just decided
to implement a different library. Shame on me for that, since I never
showed what exactly the root of the problem is, but for _me_ it is
enough, so I am trying to share it with you and other developers.
> Or put in another way: your test-code does ~6 syscalls per every event.
> So if what you said would be true (which it isnt), a kevent based
> request would have be just as slow as thread based request ...
I can neither confirm nor object to this sentence.
> > > i think you are really, really mistaken if you believe that the fact
> > > that whole tasks/threads or processes can be 'monster structures',
> > > somehow has any relevance to scheduling/task-queueing performance
> > > and scalability. It does not matter how large a task's address space
> > > is - scheduling only relates to the minimal context that is in the
> > > CPU. And most of that context we save upon /every system call
> > > entry/, and restore it upon every system call return. If it's so
> > > expensive to manipulate, why can the Linux kernel do a full system
> > > call in ~150 cycles? That's cheaper than the access latency to a
> > > single DRAM page.
> >
> > I meant not its size, but the whole infrastructure, which surrounds
> > task. [...]
>
> /what/ infrastructure do you mean? sched.c? Most of that never runs in
> the scheduler hotpath.
>
> > [...] If it is that lightweight, why don't we have posix thread per
> > IO? [...]
>
> because it would be pretty stupid to do that?
>
> But more importantly: because many people still believe 'the scheduler
> is slow and context-switching is evil'? The FIO AIO syslet code from
> Jens is an intelligent mix of queueing combined with async execution. I
> expect that model to prevail.
Suparna showed its problems - although on an older version.
Let's see what other tests show.
> > [...] One question is that mmap/allocation of the stack is too slow
> > (and it is very slow indeed, that is why glibc and M:N threading lib
> > caches allocated stacks), another one is kernel/userspace boundary
> > crossing, next one are tlb flushes, then copies.
>
> now you come up again with creation overhead but nobody is talking about
> context creation overhead. (btw., you are also wrong if you think that
> mmap() is all that slow - try measuring it one day) We were talking
> about context /switching/.
Ugh, Ingo, do not speak in absolutes...
I did measure it. And it is slow.
http://tservice.net.ru/~s0mbre/blog/2007/01/15#2007_01_15
> > Why userspace rescheduling is in order of tens times faster than
> > kernel/user?
>
> what on earth does this have to do with the topic of whether context
> switches are fast enough? Or if you like random info, just let me throw
> in a random piece of information as well:
>
> user-space function calls are more than /two/ orders of magnitude faster
> than system calls. Still you are using 6 ... SIX system calls in the
> sample kevent request handling hotpath.
I can only laugh at that, Ingo :)
If you are ever in Moscow, I will buy you a beer - just drop me a mail.
What are we talking about, Ingo: kevent and IO in thread contexts, or
userspace vs. kernel-space scheduling?
Kevent could be broken as hell, or it could be a stupid application that
does not work at all - that is one of the possible theories.
Practice, however, shows that it is not true.
Anyway, if we are talking about kevents and micro-threads, that is one
point; if we are talking about the possible overhead of rescheduling,
that is another topic.
> > > for the same reason has it no relevance that the full kevent-based
> > > webserver is a 'monster structure' - still a single request's basic
> > > queueing operation is cheap. The same is true to tasks/threads.
> >
> > To move that tasks there must be done too may steps, and although each
> > one can be quite fast, the whole process of rescheduling in the case
> > of thousands running threads creates too big overhead per task to drop
> > performance.
>
> again, please come up with specifics! I certainly came up with enough
> specifics.
I thought I had shown it several times already.
Anyway: http://tservice.net.ru/~s0mbre/blog/2006/11/09#2006_11_09
That is an initial step, which shows that rescheduling of threads (I am
NOT talking about problems in sched.c, Ingo, that would be somewhat
stupid, although it could be right) has some overhead compared to
userspace rescheduling. If so, it can be eliminated or reduced.
Second (a COMPLETELY DIFFERENT STARTING POINT):
if rescheduling has some overhead, is it possible to reduce it using a
different model for IO? So I created kevent (as you likely do not know,
the original idea was a bit different - network AIO - but the results
are quite good).
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 22:44 ` Linus Torvalds
@ 2007-02-26 13:11 ` Ingo Molnar
2007-02-26 17:37 ` Evgeniy Polyakov
2007-02-26 17:28 ` Evgeniy Polyakov
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 13:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > My tests show that with 4k connections per second (8k concurrency)
> > more than 20k connections of 80k total block in tcp_sendmsg() over
> > gigabit lan between quite fast machines.
>
> Why do people *keep* taking this up as an issue?
>
> Use select/poll/epoll/kevent/whatever for event mechanisms. STOP
> CLAIMING that you'd use threadlets/syslets/aio for that. It's been
> pointed out over and over and over again, and yet you continue to make
> the same mistake, Evgeniy.
>
> So please read that sentence ten times, and then don't continue to
> make that same mistake. PLEASE.
>
> Event mechanisms are *superior* for events. But they *suck* for things
> that aren't events, but are actual code execution with random places
> that can block. THE TWO THINGS ARE TOTALLY AND UTTERLY INDEPENDENT!
Note that even for something tasks are supposed to suck at, and even if
used in extremely stupid ways, they perform reasonably well in practice
;-)
And i fully agree: specialization based on knowledge about frequency of
blocking will always be useful - if not /forced/ on the workflow
architecture and if not overdone. On the other hand, fully event-driven
servers based on 'nonblocking' calls, which Evgeniy is advocating and
which the kevent model is forcing upon userspace, is pure madness.
We very much can and should use things like epoll for events that we
expect to happen asynchronously 100% of the time - it just makes no
sense for those events to take up 4-5K of RAM apiece, when they could
also be only using up the 32 bytes that say a pending timer takes. I've
posted the code for that, how to do an 'outer' epoll loop around an
internal threadlet iterator. But those will always be very narrow event
sources, and likely wont (and shouldnt) cover 'request-internal'
processing.
but otherwise, there is no real difference between a task that is
scheduled and a request that is queued, 'other' than the size of the
request (the task takes 4-5K of RAM), and the register context (64-128
bytes on most CPUs, the loading of which is optimized to death).
Which difference can still be significant for certain workloads, so we
certainly dont want to prohibit specialized event interfaces and force
generic threads on everything. But for anything that isnt a raw and
natural external event source (time, network, disk, user-generated)
there shouldnt be much of an event queueing abstraction i believe (other
than what we get 'for free' within epoll, from having poll()-able files)
- and even for those event sources threadlets offer a pretty good run
for the money.
one can always find the point and workload where say 40,000 threads
start trashing the L2 cache, but where 40,000 queued special requests
are still fully in cache, and produce spectacular numbers.
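(In rough numbers, taking the figures above at face value: 40,000 x 32
bytes is about 1.3 MB, which a large L2 cache can hold outright, while
40,000 x 4-5K is on the order of 160-200 MB of task state, which it
cannot.)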
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 10:47 ` Evgeniy Polyakov
@ 2007-02-26 12:51 ` Ingo Molnar
2007-02-26 16:46 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 12:51 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> Even having main dispatcher as epoll/kevent loop, the _whole_
> threadlet model is absolutely micro-thread in nature and not state
> machine/event.
Evgeniy, i'm not sure how many different ways to tell this to you, but
you are not listening, you are not learning and you are still not
getting it at all.
The scheduler /IS/ a generic work/event queue. And it's pretty damn
fast. No amount of badmouthing will change that basic fact. Not exactly
as fast as a special-purpose queueing system (for all the reasons i
outlined to you, and which you ignored), but it gets pretty damn close
even for the web workload /you/ identified, and offers a user-space
programming model that is about 1000 times more useful than
state-machines.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 19:42 ` Evgeniy Polyakov
2007-02-25 20:38 ` Ingo Molnar
@ 2007-02-26 12:39 ` Ingo Molnar
2007-02-26 14:05 ` Evgeniy Polyakov
2007-02-26 19:47 ` Davide Libenzi
2 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 12:39 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > > Kevent is a _very_ small entity and there is _no_ cost of
> > > requeueing (well, there is list_add guarded by lock) - after it is
> > > done, process can start real work. With rescheduling there are
> > > _too_ many things to be done before we can start new work. [...]
> >
> > actually, no. For example a wakeup too is fundamentally a list_add
> > guarded by a lock. Take a look at try_to_wake_up(). The rest you see
> > there is just extra frills that relate to things like
> > 'load-balancing the requests over multiple CPUs [which i'm sure
> > kevent users would request in the future too]'.
>
> wake_up() as a call is pretty simple and fast, but its result - it is
> slow. [...]
You are still very much wrong, and now you refuse to even /read/ what i
wrote. Your only reply to my detailed analysis is: "it is slow, because
it is slow and heavy". I told you how fast it is, i told you what
happens on a context switch and why, i told you that you can measure if
you want.
> [...] I did not run reschedulingtests with kernel thread, but posix
> threads (they do look like a kernel thread) have significant overhead
> there.
You are wrong. Let me show you some more numbers. This is a
hackbench_pth.c run:
$ ./hackbench_pth 500
Time: 14.371
this uses 20,000 real threads and during this test the runqueue length
is extreme - up to over ten thousand threads. (hackbench_pth.c was
posted to lkml recently.)
The same run with hackbench.c (20,000 forked processes):
$ ./hackbench 500
Time: 14.632
so the TLB overhead from using processes is 1.8%.
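For anyone who wants to put a number on a bare context switch, a trivial
pipe ping-pong works too (a sketch, unrelated to hackbench: two
processes bounce one byte back and forth, so each round trip costs
roughly two context switches plus four small syscalls):
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>
int main(void)
{
	int p1[2], p2[2], i, n = 100000;
	char c = 0;
	struct timeval t0, t1;
	if (pipe(p1) || pipe(p2))
		return 1;
	if (fork() == 0) {
		/* child: echo every byte straight back */
		for (i = 0; i < n; i++) {
			if (read(p1[0], &c, 1) != 1)
				break;
			write(p2[1], &c, 1);
		}
		_exit(0);
	}
	gettimeofday(&t0, NULL);
	for (i = 0; i < n; i++) {
		write(p1[1], &c, 1);
		read(p2[0], &c, 1);
	}
	gettimeofday(&t1, NULL);
	wait(NULL);
	printf("%.0f ns per round trip (~2 context switches)\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_usec - t0.tv_usec) * 1e3) / n);
	return 0;
}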
> [...] In early developemnt days of M:N threading library I tested
> rescheduling performance of the POSIX threads - I created pool of
> threads and 'sent' a message using futex wait/wake - such performance
> of the userspace threading library (I tested erlang) was 10 times
> slower.
how much would it take for you to actually re-measure it and interpret
the results you are seeing? You've apparently built a whole mental house
of cards on the flawed proposition that tasks are 'super-heavy' and that
context-switching them is 'slow'. You are unwilling to explain /how/
they are slow, and all the numbers i post are contrary to that
proposition of yours.
your whole reasoning seems to be faith-based:
[...] Anyway, kevents are very small, threads are very big, [...]
How about following the scientific method instead?
> [...] and both are the way they are exactly on purpose - threads serve
> for processing of any generic code, kevents are used for event waiting
> - IO is such an event, it does not require a lot of infrastructure to
> handle, it only nees some simple bits, so it can be optimized to be
> extremely fast, with huge infrastructure behind each IO (like in case
> when it is a separated thread) it can not be done effectively.
you are wrong, and i have pointed it out to you in my previous replies
why you are wrong. Your only coherent specific thought on this topic was
your incorrect assumption that the scheduler somehow saves registers
and that this makes it heavy. I pointed out to you in the mail you
reply to that /every/ system call saves user registers. You've not
even replied to that point of mine, you are ignoring it completely and
you are still repeating your same old, incorrect argument. If it is
heavy, /why/ do you think it is heavy? Where is that magic pixie dust
piece of scheduler code that miraculously turns the runqueue into a
molasses-slow, heavy piece of thing?
Or put in another way: your test-code does ~6 syscalls per every event.
So if what you said were true (which it isnt), a kevent-based
request would have to be just as slow as a thread-based request ...
> > i think you are really, really mistaken if you believe that the fact
> > that whole tasks/threads or processes can be 'monster structures',
> > somehow has any relevance to scheduling/task-queueing performance
> > and scalability. It does not matter how large a task's address space
> > is - scheduling only relates to the minimal context that is in the
> > CPU. And most of that context we save upon /every system call
> > entry/, and restore it upon every system call return. If it's so
> > expensive to manipulate, why can the Linux kernel do a full system
> > call in ~150 cycles? That's cheaper than the access latency to a
> > single DRAM page.
>
> I meant not its size, but the whole infrastructure, which surrounds
> task. [...]
/what/ infrastructure do you mean? sched.c? Most of that never runs in
the scheduler hotpath.
> [...] If it is that lightweight, why don't we have posix thread per
> IO? [...]
because it would be pretty stupid to do that?
But more importantly: because many people still believe 'the scheduler
is slow and context-switching is evil'? The FIO AIO syslet code from
Jens is an intelligent mix of queueing combined with async execution. I
expect that model to prevail.
> [...] One question is that mmap/allocation of the stack is too slow
> (and it is very slow indeed, that is why glibc and M:N threading lib
> caches allocated stacks), another one is kernel/userspace boundary
> crossing, next one are tlb flushes, then copies.
now you come up again with creation overhead but nobody is talking about
context creation overhead. (btw., you are also wrong if you think that
mmap() is all that slow - try measuring it one day) We were talking
about context /switching/.
> Why userspace rescheduling is in order of tens times faster than
> kernel/user?
what on earth does this have to do with the topic of whether context
switches are fast enough? Or if you like random info, just let me throw
in a random piece of information as well:
user-space function calls are more than /two/ orders of magnitude faster
than system calls. Still you are using 6 ... SIX system calls in the
sample kevent request handling hotpath.
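And to ground the 'two orders of magnitude' figure, raw syscall cost is
just as easy to measure (a sketch; SYS_getpid is invoked via syscall()
so the C library cannot cache the result, and the loop overhead itself
is negligible by comparison):
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/time.h>
int main(void)
{
	struct timeval t0, t1;
	long i, n = 1000000;
	gettimeofday(&t0, NULL);
	for (i = 0; i < n; i++)
		syscall(SYS_getpid);	/* one trivial syscall per iteration */
	gettimeofday(&t1, NULL);
	printf("%.1f ns per syscall\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_usec - t0.tv_usec) * 1e3) / n);
	return 0;
}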
> > for the same reason has it no relevance that the full kevent-based
> > webserver is a 'monster structure' - still a single request's basic
> > queueing operation is cheap. The same is true to tasks/threads.
>
> To move that tasks there must be done too may steps, and although each
> one can be quite fast, the whole process of rescheduling in the case
> of thousands running threads creates too big overhead per task to drop
> performance.
again, please come up with specifics! I certainly came up with enough
specifics.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 10:35 ` Ingo Molnar
@ 2007-02-26 10:47 ` Evgeniy Polyakov
2007-02-26 12:51 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 10:47 UTC (permalink / raw)
To: Ingo Molnar
Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 11:35:17AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > Btw, 'evserver' in the name means 'event server', so you might think
> > about changing the name :)
>
> why should i change the name? The 'outer' loop, which feeds requests to
> threadlets, is an epoll based event loop. The inner loop, where all the
> application complexity resides, is a threadlet. This is the "more
> intelligent queueing" model i talked about in my reply to David 4 days
> ago:
>
> http://lkml.org/lkml/2007/2/22/180
> http://lkml.org/lkml/2007/2/22/191
:)
Ingo, of course it was a joke.
Even with the main dispatcher as an epoll/kevent loop, the _whole_
threadlet model is absolutely micro-thread in nature, not state
machine/event based.
So it does not have events at all, especially with the speculation about
removing completion notifications - a fire-and-forget model.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 10:31 ` Ingo Molnar
@ 2007-02-26 10:43 ` Evgeniy Polyakov
2007-02-26 20:02 ` Davide Libenzi
1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 10:43 UTC (permalink / raw)
To: Ingo Molnar
Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 11:31:17AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> > please also try evserver_epoll_threadlet.c that i've attached below -
> > it uses epoll as the main event mechanism but does threadlets for
> > request handling.
>
> find updated code below - your evserver_epoll.c spuriously missed event
> edges - so i changed it back to level-triggered. While that is not as
> fast as edge-triggered, it does not result in spurious hangs and
> workflow 'hickups' during the test.
Hmm, exactly the same evserver_epoll.c you downloaded works OK for me,
although yes, it is buggy in the sense that it does not close the socket
when the data has been transferred.
> Could this be the reason why in your testing kevents outperformed epoll?
I will try to check. In theory, without _ET it should perform much
worse, but in practice its performance is essentially the same (the same
applies to kevent without the KEVENT_REQ_ET flag - since the same socket
is almost never used more than once, having or not having that flag set
is pure zero overhead).
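For reference, the only difference on the epoll side is whether EPOLLET
is set when the descriptor is added - a sketch using the same event mask
as the servers in this thread:
#include <sys/epoll.h>
/*
 * Register 'fd' on 'epfd' either level-triggered (edge == 0) or
 * edge-triggered (edge != 0). With EPOLLET, epoll_wait() reports the
 * fd only when new data arrives, so the handler must drain the socket
 * until EAGAIN before waiting again; level-triggered keeps reporting
 * the fd while any data is still pending.
 */
static int register_client(int epfd, int fd, int edge)
{
	struct epoll_event ev;
	ev.data.fd = fd;
	ev.events = EPOLLIN | EPOLLERR | EPOLLPRI | (edge ? EPOLLET : 0);
	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}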
> Also, i have removed the set-nonblocking calls because they are not
> needed under threadlets.
>
> [ to build this code, copy it into the async-test/ directory and build
> it there - or copy the *.h files from async-test/ directory into your
> build directory. ]
Ok, right now I'm compiling kevent/threadlet tree on my test machines.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 10:33 ` Evgeniy Polyakov
@ 2007-02-26 10:35 ` Ingo Molnar
2007-02-26 10:47 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 10:35 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> Btw, 'evserver' in the name means 'event server', so you might think
> about changing the name :)
why should i change the name? The 'outer' loop, which feeds requests to
threadlets, is an epoll based event loop. The inner loop, where all the
application complexity resides, is a threadlet. This is the "more
intelligent queueing" model i talked about in my reply to David 4 days
ago:
http://lkml.org/lkml/2007/2/22/180
http://lkml.org/lkml/2007/2/22/191
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 9:55 ` Ingo Molnar
2007-02-26 10:31 ` Ingo Molnar
@ 2007-02-26 10:33 ` Evgeniy Polyakov
2007-02-26 10:35 ` Ingo Molnar
1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 10:33 UTC (permalink / raw)
To: Ingo Molnar
Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 10:55:47AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > I will use Ingo's evserver_threadlet server along with evserver_epoll
> > (with fixed closing) and evserver_kevent.c.
>
> please also try evserver_epoll_threadlet.c that i've attached below - it
> uses epoll as the main event mechanism but does threadlets for request
> handling.
>
> This is a one step more intelligent threadlet queueing model than
> 'thousands of threads' - although obviously epoll alone should do well
> too with this trivial workload.
No problem.
If I complete the setup today before I go climbing (I need to do some
paid work too), I will post the results here and in my blog (without
political correctness).
Btw, 'evserver' in the name means 'event server', so you might think
about changing the name :)
Stay tuned.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 9:55 ` Ingo Molnar
@ 2007-02-26 10:31 ` Ingo Molnar
2007-02-26 10:43 ` Evgeniy Polyakov
2007-02-26 20:02 ` Davide Libenzi
2007-02-26 10:33 ` Evgeniy Polyakov
1 sibling, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 10:31 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Ingo Molnar <mingo@elte.hu> wrote:
> please also try evserver_epoll_threadlet.c that i've attached below -
> it uses epoll as the main event mechanism but does threadlets for
> request handling.
find updated code below - your evserver_epoll.c spuriously missed event
edges - so i changed it back to level-triggered. While that is not as
fast as edge-triggered, it does not result in spurious hangs and
workflow 'hiccups' during the test.
Could this be the reason why in your testing kevents outperformed epoll?
Also, i have removed the set-nonblocking calls because they are not
needed under threadlets.
[ to build this code, copy it into the async-test/ directory and build
it there - or copy the *.h files from async-test/ directory into your
build directory. ]
Ingo
-------{ evserver_epoll_threadlet.c }-------------->
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/poll.h>
#include <sys/sendfile.h>
#include <sys/epoll.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <time.h>
#include <ctype.h>
#include <netdb.h>
#define DEBUG 0
#include "syslet.h"
#include "sys.h"
#include "threadlet.h"
struct request {
struct request *next_free;
/*
* The threadlet stack is part of the request structure
* and is thus reused as threadlets complete:
*/
unsigned long threadlet_stack;
/*
* These are all the request-specific parameters:
*/
long sock;
};
/*
* Freelist to recycle requests:
*/
static struct request *freelist;
/*
* Allocate a request and set up its syslet atoms:
*/
static struct request *alloc_req(void)
{
struct request *req;
/*
* Occasionally we have to refill the new-thread stack
* entry:
*/
if (!async_head.new_thread_stack) {
async_head.new_thread_stack = thread_stack_alloc();
pr("allocated new thread stack: %08lx\n",
async_head.new_thread_stack);
}
if (freelist) {
req = freelist;
pr("reusing req %p, threadlet stack %08lx\n",
req, req->threadlet_stack);
freelist = freelist->next_free;
req->next_free = NULL;
return req;
}
req = calloc(1, sizeof(struct request));
pr("allocated req %p\n", req);
req->threadlet_stack = thread_stack_alloc();
pr("allocated thread stack %08lx\n", req->threadlet_stack);
return req;
}
/*
* Check whether there are any completions queued for user-space
* to finish up:
*/
static unsigned long complete(void)
{
unsigned long completed = 0;
struct request *req;
for (;;) {
req = (void *)completion_ring[async_head.user_ring_idx];
if (!req)
return completed;
completed++;
pr("completed req %p (threadlet stack %08lx)\n",
req, req->threadlet_stack);
req->next_free = freelist;
freelist = req;
/*
* Clear the completion pointer. To make sure the
* kernel never stomps upon still unhandled completions
* in the ring the kernel only writes to a NULL entry,
* so user-space has to clear it explicitly:
*/
completion_ring[async_head.user_ring_idx] = NULL;
async_head.user_ring_idx++;
if (async_head.user_ring_idx == MAX_PENDING)
async_head.user_ring_idx = 0;
}
}
static unsigned int pending_requests;
/*
* Handle a request that has just been submitted (either it has
* already been executed, or we have to account it as pending):
*/
static void handle_submitted_request(struct request *req, long done)
{
unsigned int nr;
if (done) {
/*
* This is the cached case - free the request:
*/
pr("cache completed req %p (threadlet stack %08lx)\n",
req, req->threadlet_stack);
req->next_free = freelist;
freelist = req;
return;
}
/*
* 'cachemiss' case - the syslet is not finished
* yet. We will be notified about its completion
* via the completion ring:
*/
assert(pending_requests < MAX_PENDING-1);
pending_requests++;
pr("req %p is pending. %d reqs pending.\n", req, pending_requests);
/*
* Attempt to complete requests - this is a fast
* check if there's no completions:
*/
nr = complete();
pending_requests -= nr;
/*
* If the ring is full then wait a bit:
*/
while (pending_requests == MAX_PENDING-1) {
pr("sys_async_wait()");
/*
* Wait for 4 events - to batch things a bit:
*/
sys_async_wait(4, async_head.user_ring_idx, &async_head);
nr = complete();
pending_requests -= nr;
pr("after wait: completed %d requests - still pending: %d\n",
nr, pending_requests);
}
}
#include <linux/types.h>
//#define ulog(f, a...) fprintf(stderr, f, ##a)
#define ulog(f, a...)
#define ulog_err(f, a...) printf(f ": %s [%d].\n", ##a, strerror(errno), errno)
static int kevent_ctl_fd, main_server_s;
static void usage(char *p)
{
ulog("Usage: %s -a addr -p port -f kevent_path -t timeout -w wait_num\n", p);
}
static int evtest_server_init(char *addr, unsigned short port)
{
struct hostent *h;
int s, on;
struct sockaddr_in sa;
if (!addr) {
ulog("%s: Bind address cannot be NULL.\n", __func__);
return -1;
}
h = gethostbyname(addr);
if (!h) {
ulog_err("%s: Failed to get address of %s.\n", __func__, addr);
return -1;
}
s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
if (s == -1) {
ulog_err("%s: Failed to create server socket", __func__);
return -1;
}
// fcntl(s, F_SETFL, O_NONBLOCK);
memcpy(&(sa.sin_addr.s_addr), h->h_addr_list[0], 4);
sa.sin_port = htons(port);
sa.sin_family = AF_INET;
on = 1;
setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, 4);
if (bind(s, (struct sockaddr *)&sa, sizeof(struct sockaddr_in)) == -1) {
ulog_err("%s: Failed to bind to %s", __func__, addr);
close(s);
return -1;
}
if (listen(s, 30000) == -1) {
ulog_err("%s: Failed to listen on %s", __func__, addr);
close(s);
return -1;
}
return s;
}
#define EPOLL_EVENT_MASK (EPOLLIN | EPOLLERR | EPOLLPRI)
static int evtest_kevent_remove(int fd)
{
int err;
struct epoll_event event;
event.events = EPOLL_EVENT_MASK;
event.data.fd = fd;
err = epoll_ctl(kevent_ctl_fd, EPOLL_CTL_DEL, fd, &event);
if (err < 0) {
ulog_err("Failed to perform control REMOVE operation");
return err;
}
return err;
}
static int evtest_kevent_init(int fd)
{
int err;
struct timeval tm;
struct epoll_event event;
event.events = EPOLL_EVENT_MASK;
event.data.fd = fd;
err = epoll_ctl(kevent_ctl_fd, EPOLL_CTL_ADD, fd, &event);
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: fd=%3d, err=%1d.\n", tm.tv_sec, tm.tv_usec, fd, err);
if (err < 0) {
ulog_err("Failed to perform control ADD operation: fd=%d, events=%08x", fd, event.events);
return err;
}
return err;
}
#define MAX_FILES 1000000
/*
* Debug check:
*/
static struct request *fd_to_req[MAX_FILES];
static long handle_request(void *__req)
{
struct request *req = __req;
int s = req->sock, err, fd;
off_t offset;
int count;
char path[] = "/tmp/index.html";
char buf[4096];
struct timeval tm;
if (!fd_to_req[s])
ulog_err("Bad: no request to fd?");
count = 40960;
offset = 0;
err = recv(s, buf, sizeof(buf), 0);
if (err < 0) {
ulog_err("Failed to read data from s=%d", s);
goto err_out_remove;
}
if (err == 0) {
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: Client exited: fd=%d.\n", tm.tv_sec, tm.tv_usec, s);
goto err_out_remove;
}
fd = open(path, O_RDONLY);
if (fd == -1) {
ulog_err("Failed to open '%s'", path);
err = -1;
goto err_out_remove;
}
#if 0
do {
err = read(fd, buf, sizeof(buf));
if (err <= 0)
break;
err = send(s, buf, err, 0);
if (err <= 0)
break;
} while (1);
#endif
err = sendfile(s, fd, &offset, count);
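/*
 * Clearing TCP_CORK below flushes any partially filled frame that
 * sendfile() may have left queued, so the tail of the reply goes out
 * immediately:
 */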
{
int on = 0;
setsockopt(s, SOL_TCP, TCP_CORK, &on, sizeof(on));
}
close(fd);
if (err < 0) {
ulog_err("Failed send %d bytes: fd=%d.\n", count, s);
goto err_out_remove;
}
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: %d bytes has been sent to client fd=%d.\n", tm.tv_sec, tm.tv_usec, err, s);
close(s);
fd_to_req[s] = NULL;
return complete_threadlet_fn(req, &async_head);
err_out_remove:
evtest_kevent_remove(s);
close(s);
fd_to_req[s] = NULL;
return complete_threadlet_fn(req, &async_head);
}
static int evtest_callback_client(int sock)
{
struct request *req;
long done;
if (fd_to_req[sock]) {
ulog_err("Bad: request overlap?");
return 0;
}
req = alloc_req();
if (!req) {
ulog_err("Bad: no req\n");
evtest_kevent_remove(sock);
return -ENOMEM;
}
req->sock = sock;
fd_to_req[sock] = req;
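/*
 * Run the handler as a threadlet: a nonzero 'done' means it ran to
 * completion without blocking (the cached case); otherwise completion
 * arrives asynchronously via the completion ring:
 */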
done = threadlet_exec(handle_request, req,
req->threadlet_stack, &async_head);
handle_submitted_request(req, done);
return 0;
}
static int evtest_callback_main(int s)
{
int cs, err;
struct sockaddr_in csa;
socklen_t addrlen = sizeof(struct sockaddr_in);
struct timeval tm;
memset(&csa, 0, sizeof(csa));
if ((cs = accept(s, (struct sockaddr *)&csa, &addrlen)) == -1) {
ulog_err("Failed to accept client");
return -1;
}
// fcntl(cs, F_SETFL, O_NONBLOCK);
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: Accepted connect from %s:%d.\n",
tm.tv_sec, tm.tv_usec,
inet_ntoa(csa.sin_addr), ntohs(csa.sin_port));
err = evtest_kevent_init(cs);
if (err < 0) {
close(cs);
return -1;
}
return 0;
}
static int evtest_kevent_wait(unsigned int timeout, unsigned int wait_num)
{
int num, err;
struct timeval tm;
struct epoll_event event[256];
int i;
err = epoll_wait(kevent_ctl_fd, event, 256, -1);
if (err < 0) {
ulog_err("Failed to perform control operation");
return err;
}
gettimeofday(&tm, NULL);
num = err;
ulog("%08lu.%06lu: Wait: num=%d.\n", tm.tv_sec, tm.tv_usec, num);
for (i=0; i<num; ++i) {
if (event[i].data.fd == main_server_s)
err = evtest_callback_main(event[i].data.fd);
else
err = evtest_callback_client(event[i].data.fd);
}
return err;
}
int main(int argc, char *argv[])
{
int ch, err;
char *addr;
unsigned short port;
unsigned int timeout, wait_num;
addr = "0.0.0.0";
port = 8080;
timeout = 1000;
wait_num = 1;
async_head_init();
while ((ch = getopt(argc, argv, "f:n:t:a:p:h")) > 0) {
switch (ch) {
case 't':
timeout = atoi(optarg);
break;
case 'n':
wait_num = atoi(optarg);
break;
case 'a':
addr = optarg;
break;
case 'p':
port = atoi(optarg);
break;
case 'f':
break;
default:
usage(argv[0]);
return -1;
}
}
kevent_ctl_fd = epoll_create(10);
if (kevent_ctl_fd == -1) {
ulog_err("Failed to epoll descriptor");
return -1;
}
main_server_s = evtest_server_init(addr, port);
if (main_server_s < 0)
return main_server_s;
err = evtest_kevent_init(main_server_s);
if (err < 0)
goto err_out_exit;
while (1) {
err = evtest_kevent_wait(timeout, wait_num);
}
err_out_exit:
close(kevent_ctl_fd);
async_head_exit();
return 0;
}
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 9:25 ` Evgeniy Polyakov
@ 2007-02-26 9:55 ` Ingo Molnar
2007-02-26 10:31 ` Ingo Molnar
2007-02-26 10:33 ` Evgeniy Polyakov
0 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 9:55 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> I will use Ingo's evserver_threadlet server along with evserver_epoll
> (with fixed closing) and evserver_kevent.c.
please also try evserver_epoll_threadlet.c that i've attached below - it
uses epoll as the main event mechanism but does threadlets for request
handling.
This is a threadlet queueing model one step more intelligent than
'thousands of threads' - although obviously epoll alone should do well
too with this trivial workload.
Ingo
---------------------------->
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/poll.h>
#include <sys/sendfile.h>
#include <sys/epoll.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <time.h>
#include <ctype.h>
#include <netdb.h>
#define DEBUG 0
#include "syslet.h"
#include "sys.h"
#include "threadlet.h"
struct request {
struct request *next_free;
/*
* The threadlet stack is part of the request structure
* and is thus reused as threadlets complete:
*/
unsigned long threadlet_stack;
/*
* These are all the request-specific parameters:
*/
long sock;
};
/*
* Freelist to recycle requests:
*/
static struct request *freelist;
/*
* Allocate a request and set up its syslet atoms:
*/
static struct request *alloc_req(void)
{
struct request *req;
/*
* Occasionally we have to refill the new-thread stack
* entry:
*/
if (!async_head.new_thread_stack) {
async_head.new_thread_stack = thread_stack_alloc();
pr("allocated new thread stack: %08lx\n",
async_head.new_thread_stack);
}
if (freelist) {
req = freelist;
pr("reusing req %p, threadlet stack %08lx\n",
req, req->threadlet_stack);
freelist = freelist->next_free;
req->next_free = NULL;
return req;
}
req = calloc(1, sizeof(struct request));
pr("allocated req %p\n", req);
req->threadlet_stack = thread_stack_alloc();
pr("allocated thread stack %08lx\n", req->threadlet_stack);
return req;
}
/*
* Check whether there are any completions queued for user-space
* to finish up:
*/
static unsigned long complete(void)
{
unsigned long completed = 0;
struct request *req;
for (;;) {
req = (void *)completion_ring[async_head.user_ring_idx];
if (!req)
return completed;
completed++;
pr("completed req %p (threadlet stack %08lx)\n",
req, req->threadlet_stack);
req->next_free = freelist;
freelist = req;
/*
* Clear the completion pointer. To make sure the
* kernel never stomps upon still unhandled completions
* in the ring the kernel only writes to a NULL entry,
* so user-space has to clear it explicitly:
*/
completion_ring[async_head.user_ring_idx] = NULL;
async_head.user_ring_idx++;
if (async_head.user_ring_idx == MAX_PENDING)
async_head.user_ring_idx = 0;
}
}
static unsigned int pending_requests;
/*
* Handle a request that has just been submitted (either it has
* already been executed, or we have to account it as pending):
*/
static void handle_submitted_request(struct request *req, long done)
{
unsigned int nr;
if (done) {
/*
* This is the cached case - free the request:
*/
pr("cache completed req %p (threadlet stack %08lx)\n",
req, req->threadlet_stack);
req->next_free = freelist;
freelist = req;
return;
}
/*
* 'cachemiss' case - the syslet is not finished
* yet. We will be notified about its completion
* via the completion ring:
*/
assert(pending_requests < MAX_PENDING-1);
pending_requests++;
pr("req %p is pending. %d reqs pending.\n", req, pending_requests);
/*
* Attempt to complete requests - this is a fast
* check if there's no completions:
*/
nr = complete();
pending_requests -= nr;
/*
* If the ring is full then wait a bit:
*/
while (pending_requests == MAX_PENDING-1) {
pr("sys_async_wait()");
/*
* Wait for 4 events - to batch things a bit:
*/
sys_async_wait(4, async_head.user_ring_idx, &async_head);
nr = complete();
pending_requests -= nr;
pr("after wait: completed %d requests - still pending: %d\n",
nr, pending_requests);
}
}
#include <linux/types.h>
//#define ulog(f, a...) fprintf(stderr, f, ##a)
#define ulog(f, a...)
#define ulog_err(f, a...) printf(f ": %s [%d].\n", ##a, strerror(errno), errno)
static int kevent_ctl_fd, main_server_s;
static void usage(char *p)
{
ulog("Usage: %s -a addr -p port -f kevent_path -t timeout -w wait_num\n", p);
}
static int evtest_server_init(char *addr, unsigned short port)
{
struct hostent *h;
int s, on;
struct sockaddr_in sa;
if (!addr) {
ulog("%s: Bind address cannot be NULL.\n", __func__);
return -1;
}
h = gethostbyname(addr);
if (!h) {
ulog_err("%s: Failed to get address of %s.\n", __func__, addr);
return -1;
}
s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
if (s == -1) {
ulog_err("%s: Failed to create server socket", __func__);
return -1;
}
fcntl(s, F_SETFL, O_NONBLOCK);
memcpy(&(sa.sin_addr.s_addr), h->h_addr_list[0], 4);
sa.sin_port = htons(port);
sa.sin_family = AF_INET;
on = 1;
setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, 4);
if (bind(s, (struct sockaddr *)&sa, sizeof(struct sockaddr_in)) == -1) {
ulog_err("%s: Failed to bind to %s", __func__, addr);
close(s);
return -1;
}
if (listen(s, 30000) == -1) {
ulog_err("%s: Failed to listen on %s", __func__, addr);
close(s);
return -1;
}
return s;
}
static int evtest_kevent_remove(int fd)
{
int err;
struct epoll_event event;
event.events = EPOLLIN | EPOLLET;
event.data.fd = fd;
err = epoll_ctl(kevent_ctl_fd, EPOLL_CTL_DEL, fd, &event);
if (err < 0) {
ulog_err("Failed to perform control REMOVE operation");
return err;
}
return err;
}
static int evtest_kevent_init(int fd)
{
int err;
struct timeval tm;
struct epoll_event event;
event.events = EPOLLIN | EPOLLET;
event.data.fd = fd;
err = epoll_ctl(kevent_ctl_fd, EPOLL_CTL_ADD, fd, &event);
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: fd=%3d, err=%1d.\n", tm.tv_sec, tm.tv_usec, fd, err);
if (err < 0) {
ulog_err("Failed to perform control ADD operation: fd=%d, events=%08x", fd, event.events);
return err;
}
return err;
}
static long handle_request(void *__req)
{
struct request *req = __req;
int s = req->sock, err, fd;
off_t offset;
int count;
char path[] = "/tmp/index.html";
char buf[4096];
struct timeval tm;
count = 40960;
offset = 0;
err = recv(s, buf, sizeof(buf), 0);
if (err < 0) {
ulog_err("Failed to read data from s=%d", s);
goto err_out_remove;
}
if (err == 0) {
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: Client exited: fd=%d.\n", tm.tv_sec, tm.tv_usec, s);
goto err_out_remove;
}
fd = open(path, O_RDONLY);
if (fd == -1) {
ulog_err("Failed to open '%s'", path);
err = -1;
goto err_out_remove;
}
#if 0
do {
err = read(fd, buf, sizeof(buf));
if (err <= 0)
break;
err = send(s, buf, err, 0);
if (err <= 0)
break;
} while (1);
#endif
err = sendfile(s, fd, &offset, count);
{
int on = 0;
setsockopt(s, SOL_TCP, TCP_CORK, &on, sizeof(on));
}
close(fd);
if (err < 0) {
ulog_err("Failed send %d bytes: fd=%d.\n", count, s);
goto err_out_remove;
}
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: %d bytes has been sent to client fd=%d.\n", tm.tv_sec, tm.tv_usec, err, s);
close(s);
return complete_threadlet_fn(req, &async_head);
err_out_remove:
evtest_kevent_remove(s);
close(s);
return complete_threadlet_fn(req, &async_head);
}
static int evtest_callback_client(int sock)
{
struct request *req;
long done;
req = alloc_req();
if (!req) {
printf("no req\n");
evtest_kevent_remove(sock);
return -ENOMEM;
}
req->sock = sock;
done = threadlet_exec(handle_request, req,
req->threadlet_stack, &async_head);
handle_submitted_request(req, done);
return 0;
}
static int evtest_callback_main(int s)
{
int cs, err;
struct sockaddr_in csa;
socklen_t addrlen = sizeof(struct sockaddr_in);
struct timeval tm;
memset(&csa, 0, sizeof(csa));
if ((cs = accept(s, (struct sockaddr *)&csa, &addrlen)) == -1) {
ulog_err("Failed to accept client");
return -1;
}
fcntl(cs, F_SETFL, O_NONBLOCK);
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: Accepted connect from %s:%d.\n",
tm.tv_sec, tm.tv_usec,
inet_ntoa(csa.sin_addr), ntohs(csa.sin_port));
err = evtest_kevent_init(cs);
if (err < 0) {
close(cs);
return -1;
}
return 0;
}
static int evtest_kevent_wait(unsigned int timeout, unsigned int wait_num)
{
int num, err;
struct timeval tm;
struct epoll_event event[256];
int i;
err = epoll_wait(kevent_ctl_fd, event, 256, -1);
if (err < 0) {
ulog_err("Failed to perform control operation");
return err;
}
gettimeofday(&tm, NULL);
num = err;
ulog("%08lu.%06lu: Wait: num=%d.\n", tm.tv_sec, tm.tv_usec, num);
for (i=0; i<num; ++i) {
if (event[i].data.fd == main_server_s)
err = evtest_callback_main(event[i].data.fd);
else
err = evtest_callback_client(event[i].data.fd);
}
return err;
}
int main(int argc, char *argv[])
{
int ch, err;
char *addr;
unsigned short port;
unsigned int timeout, wait_num;
addr = "0.0.0.0";
port = 8080;
timeout = 1000;
wait_num = 1;
async_head_init();
while ((ch = getopt(argc, argv, "f:n:t:a:p:h")) > 0) {
switch (ch) {
case 't':
timeout = atoi(optarg);
break;
case 'n':
wait_num = atoi(optarg);
break;
case 'a':
addr = optarg;
break;
case 'p':
port = atoi(optarg);
break;
case 'f':
break;
default:
usage(argv[0]);
return -1;
}
}
kevent_ctl_fd = epoll_create(10);
if (kevent_ctl_fd == -1) {
ulog_err("Failed to epoll descriptor");
return -1;
}
main_server_s = evtest_server_init(addr, port);
if (main_server_s < 0)
return main_server_s;
err = evtest_kevent_init(main_server_s);
if (err < 0)
goto err_out_exit;
while (1) {
err = evtest_kevent_wait(timeout, wait_num);
}
err_out_exit:
close(kevent_ctl_fd);
async_head_exit();
return 0;
}
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-26 8:16 ` Ingo Molnar
@ 2007-02-26 9:25 ` Evgeniy Polyakov
2007-02-26 9:55 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-26 9:25 UTC (permalink / raw)
To: Ingo Molnar
Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Mon, Feb 26, 2007 at 09:16:56AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > Also, the evtest_kevent_remove call is superfluous with epoll.
>
> it's only used in the error path AFAICS.
>
> but you are right about evserver_epoll/kevent.c incorrectly assuming
> that things won't block in evtest_callback_client(), which, after
> receiving the "there's stuff on the input socket" event does:
>
> recvmsg(sock),
> fd = open();
> sendfile(sock, fd)
> close(fd);
>
> while evserver_threadlet.c, even this naive implementation, does not
> assume that we won't block in that function.
>
> > In any case, comparing epoll/kevent with 100K active sessions, against
> > threadlets, is not exactly a fair/appropriate test for it.
>
> fully agreed.
Hi.
I will highlight several items in this mail:
1. evserver_epoll.c is broken in that regard: it does not close the
socket - I tried to make it work with keepalive, but failed.
So closing the socket is a must, as in evserver_kevent.c.
2. keepalive is not supported - it is a hack server, after all.
3. this test does not assume whether the above snippet blocks or not - it
is the _typical_ case of a web server with one working thread (per cpu) -
every op can block, so there is no problem - the threadlet will reschedule,
the event-based servers will block (bad for them).
4. the benchmark does not cover all possible cases - the initial goal of those
servers was to show how fast/slow _event_ generation/processing is in the
epoll/kevent case, not to create a real-life web server.
lighttpd for example can not be used as a good benchmark either, since its
architecture does not support some kevent extensions (which do not exist in
epoll), and, looking at the number of comments in the kevent threads, I'm not
motivated to change it at all.
So, drawing a line: evserver_* is a simple event-driven server; it does
have disadvantages, but the same approach only favours the threadlet model.
Having millions or thousands of connections works against threadlets,
but we are comparing extreme cases - it is one of the possible tests.
So...
I'm cooking up a git tree with kevents and threadlets, which I will test
on a VIA Epia (1 GHz, 256 MB of RAM, 100 Mbit LAN) and an Intel(R) Core(TM)2 CPU
6600 @ 2.40GHz (2 GB of RAM, 1 Gbit LAN) later today with
kevent/epoll/threadlet, if the wine does not end suddenly.
I will use Ingo's evserver_threadlet server along with evserver_epoll
(with fixed closing) and evserver_kevent.c.
Eventually I can move all of them into one file.
The client is 'ab' on my desktop Core Duo 3.7 GHz. Machines are connected
over a gigabit D-Link DGS-1216T switch (which freezes on slightly broken
DHCP and TFTP packets, btw).
Stay tuned.
P.S. Linus, if you do not mind, I will postpone the scholastic masturbation
about event vs process context in IO. One sentence only - page fault and
filename lookup both wait - they wait until a new page is ready or an inode is
read from the disk; eventually they wake up, and the thing which wakes them
up _is_ an event.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
[not found] ` <Pine.LNX.4.64.0702251232350.6011@alien.or.mcafeemobile.com>
@ 2007-02-26 8:16 ` Ingo Molnar
2007-02-26 9:25 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-26 8:16 UTC (permalink / raw)
To: Davide Libenzi
Cc: Evgeniy Polyakov, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Ulrich Drepper, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
* Davide Libenzi <davidel@xmailserver.org> wrote:
> Also, the evtest_kevent_remove call is superfluous with epoll.
it's only used in the error path AFAICS.
but you are right about evserver_epoll/kevent.c incorrectly assuming
that things won't block in evtest_callback_client(), which, after
receiving the "there's stuff on the input socket" event does:
recvmsg(sock),
fd = open();
sendfile(sock, fd)
close(fd);
while evserver_threadlet.c, even this naive implementation, does not
assume that we won't block in that function.
> In any case, comparing epoll/kevent with 100K active sessions, against
> threadlets, is not exactly a fair/appropriate test for it.
fully agreed.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 19:04 ` Ingo Molnar
2007-02-25 19:42 ` Evgeniy Polyakov
@ 2007-02-25 23:14 ` Michael K. Edwards
1 sibling, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-25 23:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On 2/25/07, Ingo Molnar <mingo@elte.hu> wrote:
> Fundamentally a kernel thread is just its
> EIP/ESP [on x86, similar on other architectures] - which can be
> saved/restored in near zero time.
That's because the kernel address space is identical in every
process's MMU context, so the MMU doesn't have to be touched _at_all_.
Also, the kernel very rarely touches FPU state, and even when it
does, the FXSAVE/FXRSTOR pair is highly optimized for the "save state
just long enough to move some memory around with XMM instructions"
case. (I know you know this; this is for the benefit of less
experienced readers.) If your threadlet model shares the FPU state
and TLS arena among all threadlets running on the same CPU, and
threadlets are scheduled in bursts belonging to the same process (and
preferably the same threadlet entrypoint), then you will get similarly
efficient userspace threadlet-to-threadlet transitions. If not, not.
> scheduling only relates to the minimal context that is in the CPU. And
> most of that context we save upon /every system call entry/, and restore
> it upon every system call return. If it's so expensive to manipulate,
> why can the Linux kernel do a full system call in ~150 cycles? That's
> cheaper than the access latency to a single DRAM page.
That would be the magic of shadow register files. When the software
does things that hardware expects it to do, everybody wins. When the
software tries to get clever based on micro-benchmarks, everybody
loses.
Cheers,
- Michael
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 11:31 ` Evgeniy Polyakov
2007-02-22 11:52 ` Arjan van de Ven
2007-02-22 12:59 ` Ingo Molnar
@ 2007-02-25 22:44 ` Linus Torvalds
2007-02-26 13:11 ` Ingo Molnar
2007-02-26 17:28 ` Evgeniy Polyakov
2 siblings, 2 replies; 277+ messages in thread
From: Linus Torvalds @ 2007-02-25 22:44 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thu, 22 Feb 2007, Evgeniy Polyakov wrote:
>
> My tests show that with 4k connections per second (8k concurrency) more
> than 20k connections of 80k total block in tcp_sendmsg() over gigabit
> lan between quite fast machines.
Why do people *keep* taking this up as an issue?
Use select/poll/epoll/kevent/whatever for event mechanisms. STOP CLAIMING
that you'd use threadlets/syslets/aio for that. It's been pointed out over
and over and over again, and yet you continue to make the same mistake,
Evgeniy.
So please read that sentence ten times, and then don't continue to make
that same mistake. PLEASE.
Event mechanisms are *superior* for events. But they *suck* for things
that aren't events, but are actual code execution with random places that
can block. THE TWO THINGS ARE TOTALLY AND UTTERLY INDEPENDENT!
Examples of events:
- packet arrives
- timer happens
Examples of things that are *not* "events":
- filesystem lookup.
- page faults
So the basic point is: for events, you use an event-based thing. For code
execution, you use a thread-based thing. It's really that simple.
And yes, the two different things can usually be translated (at a very
high cost in complexity *and* performance) into each other, so people who
look at it as purely a theoretical exercise may think that "events" and
"code execution" are equivalent. That's a very very silly and stupid way
of looking at things in real life, though.
Yes, you can turn things that are better seen as threaded execution into
an event-based thing by turning it into a state machine. And usually that
is a TOTAL DISASTER, and the end result is fragile and impossible to
maintain.
And yes, you can often (more easily) turn an event-based mechanism into a
thread-based one, and usually the end result is a TOTAL DISASTER because
it doesn't scale very well, and while it may actually result in somewhat
simpler code, the overhead of managing ten thousand outstanding threads is
just too high, when you compare to managing just a list of ten thousand
outstanding events.
And yes, people have done both of those mistakes. Java, for example,
largely did the latter mistake ("we don't need anything like 'select',
because we'll just use threads for everything" - what a totally moronic
thing to do!)
So Evgeniy, threadlets/syslets/aio is *not* a replacement for event
queues. It's a TOTALLY DIFFERENT MECHANISM, and one that is hugely
superior to event queues for certain kinds of things. Anybody who thinks
they want to do pathname and inode lookup as a series of events is likely
a moron. It's really that simple.
In a complex server (say, a database), you'd use both. You'd probably use
events for doing the things you *already* use events for (whether it be
select/poll/epoll or whatever): probably things like the client network
connection handling.
But you'd *in*addition* use threadlets to be able to do the actual
database IO in a threaded manner, so that you can scale the things that
are not easily handled as events (usually because they have internal
kernel state that the user cannot even see, and *must*not* see because of
security issues).
So please. Stop this "kevents are better". The only thing you show by
trying to go down that avenue is that you don't understand the
*difference* between an event model and a thread model. They are both
perfectly fine models and they ARE NOT THE SAME! They aren't even mutually
incompatible - quite the reverse.
The thing people want to remove with threadlets is the internal overhead
of maintaining special-purpose code like aio_read() inside the kernel,
that doesn't even do all that people want it to do, and that really does
need a fair amount of internal complexity that we could hopefully do with
a more generic (and hopefully *simpler*) model.
Linus
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 19:42 ` Evgeniy Polyakov
@ 2007-02-25 20:38 ` Ingo Molnar
2007-02-26 12:39 ` Ingo Molnar
2007-02-26 19:47 ` Davide Libenzi
2 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 20:38 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> Interesting discussion - it will be very fun if kevent loses
> badly :)
with your keepalive test no way can it lose against 80,000 sync
threadlets - it's pretty much the worst-case thing for threadlets while
it's the best-case for kevents. Try a non-keepalive test perhaps?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 18:34 ` Ingo Molnar
@ 2007-02-25 20:01 ` Frederik Deweerdt
0 siblings, 0 replies; 277+ messages in thread
From: Frederik Deweerdt @ 2007-02-25 20:01 UTC (permalink / raw)
To: Ingo Molnar
Cc: Evgeniy Polyakov, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Sun, Feb 25, 2007 at 07:34:38PM +0100, Ingo Molnar wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > > thx - i guess i should just run them without any options and they
> > > bind themselves to port 80? What 'ab' options are you using
> > > typically to measure them?
> >
> > Yes, but they require /tmp/index.html to have http header and actual
> > data page. They do not parse http request :)
>
> ok. When i connect to the epoll server via "telnet myserver 80", and
> enter a 'request', i get back the content - but the socket connection is
> not closed. Every time i type enter i get new content back. Why is
> that so - the code seems to contain a close(fd).
>
I'd say a close(s); is missing just before return 0; in
evtest_callback_client() ?
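Roughly like this (an untested sketch - assuming evserver_epoll.c's client
callback looks like the threadlet variant quoted above, minus the request
plumbing; the helper names are taken from that code, the rest is illustrative):

static int evtest_callback_client(int s)
{
	char buf[4096];
	off_t offset = 0;
	int err, fd;

	err = recv(s, buf, sizeof(buf), 0);
	if (err <= 0)
		goto out;
	fd = open("/tmp/index.html", O_RDONLY);
	if (fd == -1)
		goto out;
	err = sendfile(s, fd, &offset, 40960);
	close(fd);
out:
	evtest_kevent_remove(s);
	close(s);	/* <- the missing close of the client socket */
	return 0;
}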
Regards,
Frederik
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 19:04 ` Ingo Molnar
@ 2007-02-25 19:42 ` Evgeniy Polyakov
2007-02-25 20:38 ` Ingo Molnar
` (2 more replies)
2007-02-25 23:14 ` Michael K. Edwards
1 sibling, 3 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-25 19:42 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Sun, Feb 25, 2007 at 08:04:15PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > Kevent is a _very_ small entity and there is _no_ cost of requeueing
> > (well, there is list_add guarded by lock) - after it is done, process
> > can start real work. With rescheduling there are _too_ many things to
> > be done before we can start new work. [...]
>
> actually, no. For example a wakeup too is fundamentally a list_add
> guarded by a lock. Take a look at try_to_wake_up(). The rest you see
> there is just extra frills that relate to things like 'load-balancing
> the requests over multiple CPUs [which i'm sure kevent users would
> request in the future too]'.
wake_up() as a call is pretty simple and fast, but its result is
slow. I did not run rescheduling tests with kernel threads, but POSIX
threads (they do look like kernel threads) have significant overhead
there. In the early development days of the M:N threading library I tested the
rescheduling performance of POSIX threads - I created a pool of
threads and 'sent' a message using futex wait/wake - such performance,
compared to the userspace threading library (I tested erlang), was 10 times slower.
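(For illustration only - not the original test, just a minimal, hypothetical
two-thread sketch of the kind of futex wait/wake ping-pong measurement meant
here; all names and numbers below are made up:)

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <linux/futex.h>

#define ROUNDS 100000

static int ping, pong;		/* futex words used as binary semaphores */

static void fut_wait(int *f)
{
	/* consume the token, or sleep in the kernel until it shows up */
	while (!__sync_bool_compare_and_swap(f, 1, 0))
		syscall(SYS_futex, f, FUTEX_WAIT, 0, NULL, NULL, 0);
}

static void fut_post(int *f)
{
	__sync_lock_test_and_set(f, 1);
	syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, 0);
}

static void *partner(void *arg)
{
	int i;

	for (i = 0; i < ROUNDS; i++) {
		fut_wait(&ping);
		fut_post(&pong);
	}
	return NULL;
}

int main(void)
{
	struct timeval t0, t1;
	pthread_t t;
	double usec;
	int i;

	pthread_create(&t, NULL, partner, NULL);
	gettimeofday(&t0, NULL);
	for (i = 0; i < ROUNDS; i++) {
		fut_post(&ping);	/* wake the partner ... */
		fut_wait(&pong);	/* ... and sleep until it answers */
	}
	gettimeofday(&t1, NULL);
	pthread_join(t, NULL);

	usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
	printf("%.2f usec per wake/wait round trip\n", usec / ROUNDS);
	return 0;
}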
> > [...] We have to change registers, change address space, various tlb
> > bits and so on - we have to do it, since task describes very heavy
> > entity - the whole process. [...]
>
> but ... 'threadlets' are called thread-lets because they are not full
> processes, they are threads. There's no TLB state in that case. There's
> indeed register state associated with them, and currently there can
> certainly be quite a bit of overhead in a context switch - but not in
> register saving. We do user-space register saving not in the scheduler
> but upon /every system call/. Fundamentally a kernel thread is just its
> EIP/ESP [on x86, similar on other architectures] - which can be
> saved/restored in near zero time. All the rest is something we added for
> good /work queueing/ reasons - and those same extras should either be
> eliminated if they turn out to be not so good reasons after all, or they
> will be wanted for kevents too eventually, once it matures as a work
> queueing solution.
If something decreases performance noticeably, it is a bad thing, but it is
a matter of taste. Anyway, kevents are very small, threads are very big,
and both are the way they are exactly on purpose - threads serve for
processing any generic code, kevents are used for event waiting - IO
is such an event; it does not require a lot of infrastructure to handle,
it only needs some simple bits, so it can be optimized to be extremely
fast. With huge infrastructure behind each IO (as in the case when it is a
separate thread) that can not be done effectively.
> > I think it is _too_ heavy to have such a monster structure like
> > task(thread/process) and related overhead just to do an IO.
>
> i think you are really, really mistaken if you believe that the fact
> that whole tasks/threads or processes can be 'monster structures',
> somehow has any relevance to scheduling/task-queueing performance and
> scalability. It does not matter how large a task's address space is -
> scheduling only relates to the minimal context that is in the CPU. And
> most of that context we save upon /every system call entry/, and restore
> it upon every system call return. If it's so expensive to manipulate,
> why can the Linux kernel do a full system call in ~150 cycles? That's
> cheaper than the access latency to a single DRAM page.
I meant not its size, but the whole infrastructure which surrounds a
task. If it is that lightweight, why don't we have a POSIX thread per IO?
One issue is that mmap/allocation of the stack is too slow (and it is
very slow indeed - that is why glibc and the M:N threading lib cache
allocated stacks), another is the kernel/userspace boundary crossing,
the next is TLB flushes, then copies.
Why is userspace rescheduling on the order of ten times faster than
kernel/user rescheduling?
> for the same reason has it no relevance that the full kevent-based
> webserver is a 'monster structure' - still a single request's basic
> queueing operation is cheap. The same is true to tasks/threads.
To move those tasks too many steps must be done, and although each
one can be quite fast, the whole process of rescheduling in the case of
thousands of running threads creates a per-task overhead big enough to
drop performance.
> Really, you dont even have to know or assume anything about the
> scheduler, just lets do some elementary math here:
>
> the reqs/sec your sendfile+kevent based webserver can do is 7900 per
> sec. Lets assume you will write further great kevent code which will
> optimize it further and it goes up to 10,100 reqs per sec (100 usecs per
> request), ok? Then also try how many reschedules/sec can your Athlon64
> 3500 box do. My guess is: about a million per second (1 usec per
> reschedule), perhaps a bit more.
Let's calculate: disk bandwidth is about a gigabyte per second (cached
case), so to transfer a 10k file we need about 10 usec - 10% of that will be
spent in rescheduling (if there is only one, if any).
The network is an order of magnitude slower (1gbit for example), but there are
many more blockings, so to transfer 10k we will have, let's say, 5 blocks, i.e.
5 reschedulings - another 5%, so we wasted 15% of our time in
rescheduling.
An event in turn is a 30-byte copy (plus of course its own overhead, but it is
still faster - it is faster just because the ukevent size is smaller than
pt_regs :).
Interesting discussion - it will be very fun if kevent loses badly
:)
> Now lets assume that a threadlet based server would have to
> context-switch for /every single/ request served. That's totally
> over-estimating it, even with lots of slow clients, but lets assume it,
> to judge the worst-case impact.
>
> So if you had to schedule once per every request served, you'd have to
> add 1 usec to your 100 usecs cost, making it 101 usecs. That would bring
> your 10,100 requests per sec to 10,000 requests/sec, under a threadlet
> model of operation. Put differently: it will cost you only 1% in
> performance to schedule once for every request. Or lets assume the task
> is totally cache-cold and you'd have to add 4 usecs for its scheduling -
> that'd still only be 4%. So where is the fat?
I need to move home or I will sleep on the street; otherwise I would already
have run a test and started to eat a hat (present me a red one like Alan Cox
had), or watched you do it :)
Give me several hours.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 18:37 ` Evgeniy Polyakov
2007-02-25 18:34 ` Ingo Molnar
@ 2007-02-25 19:21 ` Ingo Molnar
[not found] ` <20070225194645.GB1353@2ka.mipt.ru>
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 19:21 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > thx - i guess i should just run them without any options and they
> > bind themselves to port 80? What 'ab' options are you using
> > typically to measure them?
>
> Yes, but they require /tmp/index.html to have http header and actual
> data page. They do not parse http request :)
>
> For athlon 3500 I used
> ab -c8000 -n80000 $url
how do the header portions of your /tmp/index.html data page look like?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
@ 2007-02-25 19:11 Al Boldi
0 siblings, 0 replies; 277+ messages in thread
From: Al Boldi @ 2007-02-25 19:11 UTC (permalink / raw)
To: linux-kernel
Ingo Molnar wrote:
> now look at kevents as the queueing model. It does not queue 'tasks', it
> lets user-space queue requests in essence, in various states. But it's
> still the same conceptual thing: a memory buffer with some state
> associated to it. Yes, it has no legacies, it has no priorities and
> other queueing concepts attached to it ... yet. If kevents got
> mainstream, it would get the same kind of pressure to grow 'more
> advanced' event queueing and event scheduling capabilities.
> Prioritization would be needed, etc.
But it would probably be tuned specifically to its use case, which would mean
inherently better performance.
> So my fundamental claim is: a kernel thread /is/ our main request
> structure. We've got tons of really good system calls that queue these
> 'requests' around the place and offer functionality around this concept.
> Plus there's a 1.2+ billion lines of Linux userspace code that works
> well with this abstraction - while there's nary a few thousand lines of
> event-based user-space code.
Think of the kernel scheduler as a default fallback scheduler, for procs that
are randomly queued. Anytime you can identify a group of procs/threads that
behave in a similar way, it's almost always best to do specific/private
scheduling, for performance reasons.
> I also say that you'll likely get kevents outperform threadlets. Maybe
> even significantly so under the right conditions. But i very much
> believe we want to get similar kind of performance out of thread/task
> scheduling, and not introduce a parallel framework to do request
> scheduling the hard way ... just because our task concept and scheduling
> implementation got too fat. For the same reason i didnt really like
> fibrils: they are nice, and Zach's core idea i think nicely survived in
> the syslet/threadlet model too, but they are more limited than true
> threads. So doing that parallel infrastructure, which really just
> implements the same, and is only faster because it skips features, would
> just be hiding the problem with our primary abstraction. Ok?
Ok. But what you are proposing here is a dynamically pluggable scheduler that
is extensible on top of that.
Sounds Great!
Thanks!
--
Al
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
@ 2007-02-25 19:10 Al Boldi
0 siblings, 0 replies; 277+ messages in thread
From: Al Boldi @ 2007-02-25 19:10 UTC (permalink / raw)
To: linux-kernel
Ingo Molnar wrote:
> if you create a threadlet based test-webserver, could you please do a
> comparable kevents implementation as well? I.e. same HTTP parser (or
> non-parser, as usually the case is with prototypes ;). Best would be
> something that one could trigger between threadlet and kevent mode,
> using the same binary :-)
Now, why would you want this?
Is there some performance issue with separately loaded binaries?
Thanks!
--
Al
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 18:09 ` Evgeniy Polyakov
@ 2007-02-25 19:04 ` Ingo Molnar
2007-02-25 19:42 ` Evgeniy Polyakov
2007-02-25 23:14 ` Michael K. Edwards
0 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 19:04 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> Kevent is a _very_ small entity and there is _no_ cost of requeueing
> (well, there is list_add guarded by lock) - after it is done, process
> can start real work. With rescheduling there are _too_ many things to
> be done before we can start new work. [...]
actually, no. For example a wakeup too is fundamentally a list_add
guarded by a lock. Take a look at try_to_wake_up(). The rest you see
there is just extra frills that relate to things like 'load-balancing
the requests over multiple CPUs [which i'm sure kevent users would
request in the future too]'.
> [...] We have to change registers, change address space, various tlb
> bits and so on - we have to do it, since task describes very heavy
> entity - the whole process. [...]
but ... 'threadlets' are called thread-lets because they are not full
processes, they are threads. There's no TLB state in that case. There's
indeed register state associated with them, and currently there can
certainly be quite a bit of overhead in a context switch - but not in
register saving. We do user-space register saving not in the scheduler
but upon /every system call/. Fundamentally a kernel thread is just its
EIP/ESP [on x86, similar on other architectures] - which can be
saved/restored in near zero time. All the rest is something we added for
good /work queueing/ reasons - and those same extras should either be
eliminated if they turn out to be not so good reasons after all, or they
will be wanted for kevents too eventually, once it matures as a work
queueing solution.
> I think it is _too_ heavy to have such a monster structure like
> task(thread/process) and related overhead just to do an IO.
i think you are really, really mistaken if you believe that the fact
that whole tasks/threads or processes can be 'monster structures',
somehow has any relevance to scheduling/task-queueing performance and
scalability. It does not matter how large a task's address space is -
scheduling only relates to the minimal context that is in the CPU. And
most of that context we save upon /every system call entry/, and restore
it upon every system call return. If it's so expensive to manipulate,
why can the Linux kernel do a full system call in ~150 cycles? That's
cheaper than the access latency to a single DRAM page.
for the same reason it has no relevance that the full kevent-based
webserver is a 'monster structure' - a single request's basic
queueing operation is still cheap. The same is true of tasks/threads.
Really, you dont even have to know or assume anything about the
scheduler, just lets do some elementary math here:
the reqs/sec your sendfile+kevent based webserver can do is 7900 per
sec. Let's assume you will write further great kevent code which will
optimize it further and it goes up to 10,100 reqs per sec (100 usecs per
request), ok? Then also try how many reschedules/sec can your Athlon64
3500 box do. My guess is: about a million per second (1 usec per
reschedule), perhaps a bit more.
Now let's assume that a threadlet-based server would have to
context-switch for /every single/ request served. That's totally
over-estimating it, even with lots of slow clients, but let's assume it,
to judge the worst-case impact.
So if you had to schedule once per every request served, you'd have to
add 1 usec to your 100 usecs cost, making it 101 usecs. That would bring
your 10,100 requests per sec to 10,000 requests/sec, under a threadlet
model of operation. Put differently: it will cost you only 1% in
performance to schedule once for every request. Or let's assume the task
is totally cache-cold and you'd have to add 4 usecs for its scheduling -
that'd still only be 4%. So where is the fat?
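(Spelling the arithmetic out - a throwaway calculation with the estimated
numbers above, nothing measured:)

#include <stdio.h>

int main(void)
{
	double req_usec = 100.0;		/* ~100 usec of work per request */
	double resched[] = { 1.0, 4.0 };	/* warm / cache-cold reschedule  */
	int i;

	for (i = 0; i < 2; i++) {
		double base  = 1e6 / req_usec;			/* ~10,000 req/s      */
		double worst = 1e6 / (req_usec + resched[i]);	/* one switch per req */
		printf("%g usec reschedule: %.0f -> %.0f req/s (%.1f%% overhead)\n",
		       resched[i], base, worst, 100.0 * (1.0 - worst / base));
	}
	return 0;
}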
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 18:22 ` Ingo Molnar
@ 2007-02-25 18:37 ` Evgeniy Polyakov
2007-02-25 18:34 ` Ingo Molnar
2007-02-25 19:21 ` Ingo Molnar
0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-25 18:37 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Sun, Feb 25, 2007 at 07:22:30PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > > Do you have any link where i could check the type of HTTP parsing
> > > and send transport you are (or will be) using? What type of http
> > > client are you using to measure, with precisely what options?
> >
> > For example this ones (essentially the same, except that epoll and
> > kevent are used):
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
>
> thx - i guess i should just run them without any options and they bind
> themselves to port 80? What 'ab' options are you using typically to
> measure them?
Yes, but they require /tmp/index.html to have http header and actual
data page. They do not parse http request :)
For athlon 3500 I used
ab -c8000 -n80000 $url
for via epia likely two/three times less.
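(For reference, one hypothetical way to cook up such a file - a canned HTTP
reply header followed by filler up to the 40960 bytes the servers sendfile();
this is not the file actually used in these tests:)

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char hdr[] =
		"HTTP/1.0 200 OK\r\n"
		"Content-Type: text/html\r\n"
		"Connection: close\r\n"
		"\r\n";
	FILE *f = fopen("/tmp/index.html", "w");
	int i;

	if (!f)
		return 1;
	fwrite(hdr, 1, strlen(hdr), f);
	for (i = strlen(hdr); i < 40960; i++)
		fputc('x', f);	/* filler 'data page' */
	fclose(f);
	return 0;
}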
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 18:37 ` Evgeniy Polyakov
@ 2007-02-25 18:34 ` Ingo Molnar
2007-02-25 20:01 ` Frederik Deweerdt
2007-02-25 19:21 ` Ingo Molnar
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 18:34 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > thx - i guess i should just run them without any options and they
> > bind themselves to port 80? What 'ab' options are you using
> > typically to measure them?
>
> Yes, but they require /tmp/index.html to have http header and actual
> data page. They do not parse http request :)
ok. When i connect to the epoll server via "telnet myserver 80", and
enter a 'request', i get back the content - but the socket connection is
not closed. Every time i type enter i get new content back. Why is
that so - the code seems to contain a close(fd).
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 18:21 ` Evgeniy Polyakov
2007-02-25 18:22 ` Ingo Molnar
@ 2007-02-25 18:25 ` Evgeniy Polyakov
2007-02-25 18:24 ` Ingo Molnar
1 sibling, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-25 18:25 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Sun, Feb 25, 2007 at 09:21:35PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > Do you have any link where i could check the type of HTTP parsing and
> > send transport you are (or will be) using? What type of http client are
> > you using to measure, with precisely what options?
>
> For example this ones (essentially the same, except that epoll and
> kevent are used):
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
Client is 'ab' with high number of connections and concurrency.
For example for athlon64 3500 I used concurrency of 8000 connections
and 80k total.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 18:25 ` Evgeniy Polyakov
@ 2007-02-25 18:24 ` Ingo Molnar
0 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 18:24 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > > Do you have any link where i could check the type of HTTP parsing
> > > and send transport you are (or will be) using? What type of http
> > > client are you using to measure, with precisely what options?
> >
> > For example this ones (essentially the same, except that epoll and
> > kevent are used):
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
>
> Client is 'ab' with high number of connections and concurrency. For
> example for athlon64 3500 I used concurrency of 8000 connections and
> 80k total.
ok. So it's:
ab -c 8000 -n 80000 http://yourserver/tmp/index.html
right? How large is index.html typically - the default 40960 bytes, and
with a constructed HTTP reply header already included in the file?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 18:21 ` Evgeniy Polyakov
@ 2007-02-25 18:22 ` Ingo Molnar
2007-02-25 18:37 ` Evgeniy Polyakov
2007-02-25 18:25 ` Evgeniy Polyakov
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 18:22 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > Do you have any link where i could check the type of HTTP parsing
> > and send transport you are (or will be) using? What type of http
> > client are you using to measure, with precisely what options?
>
> For example this ones (essentially the same, except that epoll and
> kevent are used):
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
thx - i guess i should just run them without any options and they bind
themselves to port 80? What 'ab' options are you using typically to
measure them?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 17:54 ` Ingo Molnar
@ 2007-02-25 18:21 ` Evgeniy Polyakov
2007-02-25 18:22 ` Ingo Molnar
2007-02-25 18:25 ` Evgeniy Polyakov
0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-25 18:21 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Sun, Feb 25, 2007 at 06:54:37PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > > hm, what tree are you using as a base? The syslet patches are
> > > against v2.6.20 at the moment. (the x86 PDA changes will probably
> > > interfere with it on v2.6.21-rc1-ish kernels) Note that otherwise
> > > the syslet/threadlet patches are for x86 only at the moment (as i
> > > mentioned in the announcement), and the generic code itself contains
> > > some occasional x86-ishms as well. (None of the concepts are
> > > x86-specific though - multi-stack architectures should work just as
> > > well as RISC-ish CPUs.)
> >
> > It is rc1 - and crashes.
>
> yeah. I'm not surprised. The PDA is not set up in create_async_thread()
> for example.
Ok, I will roll back to vanilla 2.6.20 tomorrow.
> > > if you create a threadlet based test-webserver, could you please do
> > > a comparable kevents implementation as well? I.e. same HTTP parser
> > > (or non-parser, as usually the case is with prototypes ;). Best
> > > would be something that one could trigger between threadlet and
> > > kevent mode, using the same binary :-)
> >
> > Ok, I will create such a monster tomorrow :)
> >
> > I will use the same base for threadlet as for kevent/epoll - there is
> > no parser, just sendfile() of the static file which contains http
> > header and actual page.
> >
> > threadlet1 {
> > accept()
> > create threadlet2 {
> > send data
> > }
> > }
> >
> > Is above scheme correct for threadlet scenario?
>
> yep, this is a good first cut. Doing this after the listen() is useful:
>
> int one = 1;
>
> ret = setsockopt(listen_sock_fd, SOL_SOCKET, SO_REUSEADDR,
> (char *)&one, sizeof(int));
>
> and i'd suggest to do this after every accept()-ed socket:
>
> int flag = 1;
>
> setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY,
> (char *) &flag, sizeof(int));
>
> Do you have any link where i could check the type of HTTP parsing and
> send transport you are (or will be) using? What type of http client are
> you using to measure, with precisely what options?
For example this ones (essentially the same, except that epoll and
kevent are used):
http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
> > But note, that on my athlon64 3500 test machine kevent is about 7900
> > requests per second compared to 4000+ epoll, so expect a challenge.
>
> single-core CPU i suspect?
Yep.
> > lighttpd is about the same 4000 requests per second though, since it
> > can not be easily optimized for kevents.
>
> mean question: do you promise to post the results even if they are not
> unfavorable to threadlets? ;-)
If they are too good, I will start searching for bugs and tune my code
first, but eventually of course yes.
In my blog I will post them in 'real-time' even if kevent will
unbelievably suck.
> if i want to test kevents on a v2.6.20 kernel base, do you have an URL
> for me to try?
I have a git tree at (based on rc1 as requested by Andrew Morton):
http://tservice.net.ru/~s0mbre/archive/kevent/kevent.git/
Or patches at kevent homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent
Direct link to the latest patchset:
http://tservice.net.ru/~s0mbre/archive/kevent/kevent-37/
(order is insignificant as far as I recall, except 'compile-fix',
which must be the latest).
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 17:45 ` Ingo Molnar
@ 2007-02-25 18:09 ` Evgeniy Polyakov
2007-02-25 19:04 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-25 18:09 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Sun, Feb 25, 2007 at 06:45:05PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > My main concern was only about the situation when we end up with a
> > truly blocking context (like the network), and this results in
> > thousands of threads doing the work - even with most of them
> > sleeping, there is a problem with memory overhead and context
> > switching. It is a usable situation, but when all of them are
> > ready immediately, context switching will kill a machine even with the
> > O(1) scheduler, which made the situation damn better than before, but is
> > not a cure for the problem.
>
> yes. This is why in the original fibril discussion i concentrated so
> much on scheduling performance.
>
> to me the picture is this: conceptually the scheduler runqueue is a
> queue of work. You get items queued upon certain events, and they can
> unqueue themselves. (there is also register context but that is already
> optimized to death by hardware) So whatever scheduling overhead we have,
> it's a pure software thing. It's because we have priorities attached.
> It's because we have some legacies. Etc., etc. - it's all stuff /we/
> wanted to add, but nothing truly fundamental on top of the basic 'work
> queueing' model.
>
> now look at kevents as the queueing model. It does not queue 'tasks', it
> lets user-space queue requests in essence, in various states. But it's
> still the same conceptual thing: a memory buffer with some state
> associated to it. Yes, it has no legacies, it has no priorities and
> other queueing concepts attached to it ... yet. If kevents got
> mainstream, it would get the same kind of pressure to grow 'more
> advanced' event queueing and event scheduling capabilities.
> Prioritization would be needed, etc.
>
> So my fundamental claim is: a kernel thread /is/ our main request
> structure. We've got tons of really good system calls that queue these
> 'requests' around the place and offer functionality around this concept.
> Plus there's a 1.2+ billion lines of Linux userspace code that works
> well with this abstraction - while there's nary a few thousand lines of
> event-based user-space code.
>
> I also say that you'll likely get kevents outperform threadlets. Maybe
> even significantly so under the right conditions. But i very much
> believe we want to get similar kind of performance out of thread/task
> scheduling, and not introduce a parallel framework to do request
> scheduling the hard way ... just because our task concept and scheduling
> implementation got too fat. For the same reason i didnt really like
> fibrils: they are nice, and Zach's core idea i think nicely survived in
> the syslet/threadlet model too, but they are more limited than true
> threads. So doing that parallel infrastructure, which really just
> implements the same, and is only faster because it skips features, would
> just be hiding the problem with our primary abstraction. Ok?
Kevent is a _very_ small entity and there is _no_ cost of requeueing
(well, there is a list_add guarded by a lock) - after it is done, the process
can start real work. With rescheduling there are _too_ many things to be
done before we can start new work. We have to change registers, change
address space, various tlb bits and so on - we have to do it, since a task
describes a very heavy entity - the whole process.
IO in turn is a very small subset of what a process is (can do), so there
is no need to change the whole picture; it is enough to have one
process which does the work.
Threads are a bit smaller than a process, but still too heavy to
have one per IO - so we have pools - this decreases the rescheduling
overhead, but limits parallelism.
I think it is _too_ heavy to have such a monster structure like a
task (thread/process) and the related overhead just to do an IO.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 17:44 ` Evgeniy Polyakov
@ 2007-02-25 17:54 ` Ingo Molnar
2007-02-25 18:21 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 17:54 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > hm, what tree are you using as a base? The syslet patches are
> > against v2.6.20 at the moment. (the x86 PDA changes will probably
> > interfere with it on v2.6.21-rc1-ish kernels) Note that otherwise
> > the syslet/threadlet patches are for x86 only at the moment (as i
> > mentioned in the announcement), and the generic code itself contains
> > some occasional x86-ishms as well. (None of the concepts are
> > x86-specific though - multi-stack architectures should work just as
> > well as RISC-ish CPUs.)
>
> It is rc1 - and crashes.
yeah. I'm not surprised. The PDA is not set up in create_async_thread()
for example.
> > if you create a threadlet based test-webserver, could you please do
> > a comparable kevents implementation as well? I.e. same HTTP parser
> > (or non-parser, as usually the case is with prototypes ;). Best
> > would be something that one could trigger between threadlet and
> > kevent mode, using the same binary :-)
>
> Ok, I will create such a monster tomorrow :)
>
> I will use the same base for threadlet as for kevent/epoll - there is
> no parser, just sendfile() of a static file which contains the http
> header and the actual page.
>
> threadlet1 {
> accept()
> create threadlet2 {
> send data
> }
> }
>
> Is the above scheme correct for the threadlet scenario?
yep, this is a good first cut. Doing this after the listen() is useful:
	int one = 1;
	ret = setsockopt(listen_sock_fd, SOL_SOCKET, SO_REUSEADDR,
			 (char *)&one, sizeof(int));
and i'd suggest to do this after every accept()-ed socket:
	int flag = 1;
	setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY,
		   (char *) &flag, sizeof(int));
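For completeness, this is roughly how those two setsockopt() calls might be
wired into the usual socket setup. This is only a self-contained sketch:
make_listen_sock(), accept_client() and the error handling are illustrative,
not anything from the syslet patches.

	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <unistd.h>

	static int make_listen_sock(int port)
	{
		struct sockaddr_in addr;
		int one = 1;
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0)
			return -1;
		/* allow quick server restarts on the same port */
		setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, (char *)&one, sizeof(int));

		memset(&addr, 0, sizeof(addr));
		addr.sin_family = AF_INET;
		addr.sin_addr.s_addr = htonl(INADDR_ANY);
		addr.sin_port = htons(port);
		if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
		    listen(fd, 128) < 0) {
			close(fd);
			return -1;
		}
		return fd;
	}

	static int accept_client(int listen_fd)
	{
		int flag = 1;
		int fd = accept(listen_fd, NULL, NULL);

		if (fd < 0)
			return -1;
		/* push small HTTP responses out immediately instead of coalescing */
		setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)&flag, sizeof(int));
		return fd;
	}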
Do you have any link where i could check the type of HTTP parsing and
send transport you are (or will be) using? What type of http client are
you using to measure, with precisely what options?
> But note that on my athlon64 3500 test machine kevent does about 7900
> requests per second compared to 4000+ for epoll, so expect a challenge.
single-core CPU i suspect?
> lighttpd is about the same 4000 requests per second though, since it
> cannot be easily optimized for kevents.
mean question: do you promise to post the results even if they turn out
favorable to threadlets? ;-)
if i want to test kevents on a v2.6.20 kernel base, do you have a URL
for me to try?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-23 12:22 ` Evgeniy Polyakov
2007-02-23 12:41 ` Evgeniy Polyakov
@ 2007-02-25 17:45 ` Ingo Molnar
2007-02-25 18:09 ` Evgeniy Polyakov
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 17:45 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> My main concern was only about the situation, when we ends up with
> truly bloking context (like network), and this results in having
> thousands of threads doing the work - even having most of them
> sleeping, there is a problem with memory overhead and context
> switching, although it is usable situation, but when all of them are
> ready immediately - context switching will kill a machine even with
> O(1) scheduler which made situation damn better than before, but it is
> not a cure for the problem.
yes. This is why in the original fibril discussion i concentrated so
much on scheduling performance.
to me the picture is this: conceptually the scheduler runqueue is a
queue of work. You get items queued upon certain events, and they can
unqueue themselves. (there is also register context but that is already
optimized to death by hardware) So whatever scheduling overhead we have,
it's a pure software thing. It's because we have priorities attached.
It's because we have some legacies. Etc., etc. - it's all stuff /we/
wanted to add, but nothing truly fundamental on top of the basic 'work
queueing' model.
now look at kevents as the queueing model. It does not queue 'tasks', it
lets user-space queue requests in essence, in various states. But it's
still the same conceptual thing: a memory buffer with some state
associated to it. Yes, it has no legacies, it has no priorities and
other queueing concepts attached to it ... yet. If kevents got
mainstream, it would get the same kind of pressure to grow 'more
advanced' event queueing and event scheduling capabilities.
Prioritization would be needed, etc.
So my fundamental claim is: a kernel thread /is/ our main request
structure. We've got tons of really good system calls that queue these
'requests' around the place and offer functionality around this concept.
Plus there are 1.2+ billion lines of Linux userspace code that work
well with this abstraction - while there are barely a few thousand lines
of event-based user-space code.
I also say that you'll likely see kevents outperform threadlets. Maybe
even significantly so under the right conditions. But i very much
believe we want to get a similar kind of performance out of thread/task
scheduling, and not introduce a parallel framework to do request
scheduling the hard way ... just because our task concept and scheduling
implementation got too fat. For the same reason i didn't really like
fibrils: they are nice, and Zach's core idea i think nicely survived in
the syslet/threadlet model too, but they are more limited than true
threads. So building that parallel infrastructure, which really just
implements the same thing and is only faster because it skips features,
would just be hiding the problem with our primary abstraction. Ok?
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-25 17:23 ` Ingo Molnar
@ 2007-02-25 17:44 ` Evgeniy Polyakov
2007-02-25 17:54 ` Ingo Molnar
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-25 17:44 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Sun, Feb 25, 2007 at 06:23:38PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > On Wed, Feb 21, 2007 at 10:13:55PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > > this is the v3 release of the syslet/threadlet subsystem:
> > >
> > > http://redhat.com/~mingo/syslet-patches/
> >
> > There is no %xgs.
> >
> > --- ./arch/i386/kernel/process.c~ 2007-02-24 22:56:14.000000000 +0300
> > +++ ./arch/i386/kernel/process.c 2007-02-24 22:53:19.000000000 +0300
> > @@ -426,7 +426,6 @@
> >
> > regs.xds = __USER_DS;
> > regs.xes = __USER_DS;
> > - regs.xgs = __KERNEL_PDA;
>
> hm, what tree are you using as a base? The syslet patches are against
> v2.6.20 at the moment. (the x86 PDA changes will probably interfere with
> it on v2.6.21-rc1-ish kernels) Note that otherwise the syslet/threadlet
> patches are for x86 only at the moment (as i mentioned in the
> announcement), and the generic code itself contains some occasional
> x86-ishms as well. (None of the concepts are x86-specific though -
> multi-stack architectures should work just as well as RISC-ish CPUs.)
It is rc1 - and crashes.
I test on i386 via epia (the only machine which runs x86 right now).
If there are no new patches, I will create a 2.6.20 test tree
tomorrow.
> if you create a threadlet based test-webserver, could you please do a
> comparable kevents implementation as well? I.e. same HTTP parser (or
> non-parser, as usually the case is with prototypes ;). Best would be
> something that one could trigger between threadlet and kevent mode,
> using the same binary :-)
Ok, I will create such a monster tomorrow :)
I will use the same base for threadlet as for kevent/epoll - there is no
parser, just sendfile() of a static file which contains the http header
and the actual page.
threadlet1 {
accept()
create threadlet2 {
send data
}
}
Is the above scheme correct for the threadlet scenario?
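In rough C, that scheme would look something like the sketch below. The
threadlet_exec() prototype, the globals and the lack of stack handling are
purely illustrative placeholders - the real interface lives in the
experimental syslet/threadlet patches - and only threadlet_complete()
matches the earlier example:

	#include <stdlib.h>
	#include <sys/sendfile.h>
	#include <sys/socket.h>
	#include <unistd.h>

	/* illustrative prototypes - placeholders, not the real patch interface */
	long threadlet_exec(long (*fn)(void *), void *data);
	long threadlet_complete(void);

	static int static_file_fd;	/* http header + page, opened once at startup */
	static off_t static_file_size;

	struct client {
		int fd;
	};

	/* threadlet2: send the canned response over one accepted connection */
	static long send_threadlet_fn(void *data)
	{
		struct client *c = data;
		off_t off = 0;

		sendfile(c->fd, static_file_fd, &off, static_file_size);
		close(c->fd);
		free(c);
		return threadlet_complete();
	}

	/* threadlet1: accept connections, spawn one send threadlet per client */
	static long accept_threadlet_fn(void *data)
	{
		int listen_fd = *(int *)data;

		for (;;) {
			struct client *c = malloc(sizeof(*c));

			if (!c)
				break;
			c->fd = accept(listen_fd, NULL, NULL);
			if (c->fd < 0) {
				free(c);
				break;
			}
			threadlet_exec(send_threadlet_fn, c);
		}
		return threadlet_complete();
	}

The pool management, per-threadlet stacks and error paths are all glossed
over here - the point is only the two-level accept/send split.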
But note that on my athlon64 3500 test machine kevent does about 7900
requests per second compared to 4000+ for epoll, so expect a challenge.
lighttpd is about the same 4000 requests per second though, since it
cannot be easily optimized for kevents.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-24 18:34 ` Evgeniy Polyakov
@ 2007-02-25 17:23 ` Ingo Molnar
2007-02-25 17:44 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-25 17:23 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> On Wed, Feb 21, 2007 at 10:13:55PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > this is the v3 release of the syslet/threadlet subsystem:
> >
> > http://redhat.com/~mingo/syslet-patches/
>
> There is no %xgs.
>
> --- ./arch/i386/kernel/process.c~ 2007-02-24 22:56:14.000000000 +0300
> +++ ./arch/i386/kernel/process.c 2007-02-24 22:53:19.000000000 +0300
> @@ -426,7 +426,6 @@
>
> regs.xds = __USER_DS;
> regs.xes = __USER_DS;
> - regs.xgs = __KERNEL_PDA;
hm, what tree are you using as a base? The syslet patches are against
v2.6.20 at the moment. (the x86 PDA changes will probably interfere with
it on v2.6.21-rc1-ish kernels) Note that otherwise the syslet/threadlet
patches are for x86 only at the moment (as i mentioned in the
announcement), and the generic code itself contains some occasional
x86-ishms as well. (None of the concepts are x86-specific though -
multi-stack architectures should work just as well as RISC-ish CPUs.)
if you create a threadlet based test-webserver, could you please do a
comparable kevents implementation as well? I.e. same HTTP parser (or
non-parser, as usually the case is with prototypes ;). Best would be
something that one could trigger between threadlet and kevent mode,
using the same binary :-)
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-24 21:04 ` Davide Libenzi
@ 2007-02-24 23:01 ` Michael K. Edwards
0 siblings, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-24 23:01 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper,
Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On 2/24/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> Ok, roger that. But why are you playing "Google & Preach" games with Ingo,
> who has been eating bread and CPUs for the last 15 years?
Sure I used Google -- for clickable references so that lurkers can
tell I'm not making these things up as I go along. Ingo and Alan have
obviously forgotten more about x86en than I will ever know, and I'm
carrying coals to Newcastle when I comment on pros and cons of XMM
memcpy.
But although the latest edition of the threadlet patches actually has
quite good internal documentation and makes most of its intent clear
even to a reader (me) who is unfamiliar with the code being patched,
it lacks "theory of operations". How is an arch maintainer supposed
to adapt this interface to a completely different CPU, with different
stuff in pt_regs and different cost profiles for blown pipelines and
reloaded coprocessor state? What are the hidden costs of this
particular style of M:N microthreading, and will they explode when
this model escapes out of the microbenchmarks and people who don't
know CPUs inside and out start using it? What standard
thread-pool-management use cases are being glossed over at kernel
level and left to Ulrich (or implementors of JVMs and other bytecode
machines) to sort out?
At some level, I'm just along for the ride; nobody with any sense is
going to pay me to design this sort of thing, and the level of effort
involved in coding an alternate AIO implementation is not something I
can afford to expend on non-revenue-producing activities even if I did
have the skill. Maybe half of my quibbles are sheer stupidity and
four out of five of the rest are things that Ingo has already taken
into account in v4 of his patch set. But that would leave one quibble in
ten that has some substance, which might save some nasty rework down
the line. Even if everything I ask about has a simple explanation,
and the time Alan and Ingo waste spelling it out for me would
result in nothing but an accelerated "theory of operation" document,
would that be a bad thing?
Now I know very little about x86_64 other than that 64-bit code not
only has double-size integer registers to work with, it has twice as
many of them. So for all I know the transition to pure-64-bit 2-4
core x 2-4 thread/core systems, which is going to be 90% or more of
the revenue-generating Linux market over the next few years, makes all
of my concerns moot for Ingo's purposes. After all, as long as Linux
stays good enough to keep Oracle from losing confidence and switching
to Darwin or something, the 100 or so people who earn invites to the
kernel summit have cush jobs for life.
The rest of us would perhaps like for major proposed kernel overhauls
to be accompanied by some kind of analysis of their impact on arches
that live elsewhere in CPU parameter space. That analysis might
suggest small design refinements that make Linux AIO scale well on the
class of processors I'm interested in, too. And I personally would like
to see Ingo get that Turing award for designing AIO semantics that are
as big an advance over the past as IEEE 754 was over its predecessors.
He'd have to earn it, though.
Cheers,
- Michael
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-24 19:52 ` Michael K. Edwards
@ 2007-02-24 21:04 ` Davide Libenzi
2007-02-24 23:01 ` Michael K. Edwards
0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-24 21:04 UTC (permalink / raw)
To: Michael K. Edwards
Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper,
Linux Kernel Mailing List, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Jens Axboe,
Thomas Gleixner
On Sat, 24 Feb 2007, Michael K. Edwards wrote:
> The preceding may contain errors in detail -- I am neither a CPU
> architect nor an x86 compiler writer nor even a serious kernel hacker.
Ok, roger that. But why are you playing "Google & Preach" games with Ingo,
who has been eating bread and CPUs for the last 15 years?
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-23 12:17 ` Ingo Molnar
@ 2007-02-24 19:52 ` Michael K. Edwards
2007-02-24 21:04 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-24 19:52 UTC (permalink / raw)
To: Ingo Molnar
Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On 2/23/07, Ingo Molnar <mingo@elte.hu> wrote:
> > This is a fundamental misconception. [...]
>
> > The scheduler, on the other hand, has to blow and reload all of the
> > hidden state associated with force-loading the PC and wherever your
> > architecture keeps its TLS (maybe not the whole TLB, but not nothing,
> > either). [...]
>
> please read up a bit more about how the Linux scheduler works. Maybe
> even read the code if in doubt? In any case, please direct kernel newbie
> questions to http://kernelnewbies.org/, not linux-kernel@vger.kernel.org.
This is not the first kernel I've swum around in, and I've been
mucking with the Linux kernel since early 2.2 and coding assembly for
heavily pipelined processors on and off since 1990. So I may be a
newbie to your lingo, and I may even be a loud-mouthed idiot, but I'm
not a wet-behind-the-ears undergrad, OK?
Now, I've addressed the non-free-ness of a TLS swap elsewhere; what
about function pointers in state machines (with or without flipping
"supervisor mode" bits)? Just because loading the PC from a data
register is one opcode in the instruction stream does not mean that it
is not quite expensive in terms of blown pipeline state and I-cache
stalls. Really fast state machines exploit PC-relative branches that
really smart CPUs can speculatively execute past (after a few
traversals) because there are a small number of branch targets
actually hit. The instruction prefetch / scheduler unit actually
keeps a table of PC-relative jump instructions found in I-cache, with
a little histogram of destinations eventually branched to, and
speculatively executes down the top branch or two. (Intel Pentiums
have a fairly primitive but effective variant of this; see
http://www.x86.org/articles/branch/branchprediction.htm.)
More general mechanisms are called "branch target buffers" and US
Patent 6609194 is a good hook into the literature. A sufficiently
smart CPU designer may have figured out how to do something similar
with computed jumps (add pc, pc, foo), but odds are high that it cuts
out when you throw function pointers around. Syscall dispatch is a
special and heavily optimized case, though -- so it's quite
conceivable that a well designed userland switch/case state machine
that makes syscalls will outperform an in-kernel state machine data
structure traversal. If this doesn't happen to be true on today's
desktop, it may be on tomorrow's desktop or today's NUMA monstrosity
or embedded mega-multi-MIPS.
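To make the contrast concrete, here is a toy sketch (nothing to do with the
patches themselves) of the two dispatch styles: a switch/case state machine,
which compiles to PC-relative branches or a jump table the predictor can
learn, versus a function-pointer-per-state machine, where every transition
is a computed, indirect call:

	/* (a) switch/case dispatch: a small, fixed set of branch targets */
	enum st { ST_READ, ST_PARSE, ST_REPLY };

	static enum st step_switch(enum st s)
	{
		switch (s) {
		case ST_READ:	return ST_PARSE;
		case ST_PARSE:	return ST_REPLY;
		default:	return ST_READ;
		}
	}

	/* (b) function-pointer dispatch: each state hands back the next
	 * handler, so the CPU front end must follow an indirect call. */
	struct next;
	typedef struct next (*state_fn)(void);
	struct next { state_fn fn; };

	static struct next st_parse(void);
	static struct next st_reply(void);

	static struct next st_read(void)  { return (struct next){ st_parse }; }
	static struct next st_parse(void) { return (struct next){ st_reply }; }
	static struct next st_reply(void) { return (struct next){ st_read }; }

	static void run_fn_machine(int steps)
	{
		struct next s = { st_read };

		while (steps--)
			s = s.fn();
	}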
There can also be other reasons why tabulated PC-relative jumps and
immediate PC loads are faster than PC loads from data registers.
Take, for instance, the Transmeta Crusoe, which (AIUI) used a trick
similar to the FX!32 x86 emulation on Alpha/NT. If you're going to
"translate" CISC to RISC on the fly, you're going to recognize
switch/case idioms (including tabulated PC-relative branches), and fix
up the translated branch table to contain offsets to the
RISC-translated branch targets. So the state transitions are just as
cheap as if they had been compiled to RISC in the first place. Do it
with function pointers, and the execution machine is going to have
to stall while it looks up the text location to see if it has it
translated in I-cache somewhere. Guess what: the PIV works the same
way (http://www.karbosguide.com/books/pcarchitecture/chapter12.htm).
Are you starting to get the picture that syslets -- clever as they
might have been on a VAX -- defeat many of the mechanisms that CPU and
compiler architects have negotiated over decades for accelerating real
code? Especially now that we have hyper-threaded CPUs (parallel
instruction decode/issue units sharing almost all of their cache
hierarchy), you can almost treat the kernel as if it were microcode
for a syscall coprocessor. If you try to migrate application code
across the syscall boundary, you may perform well on micro-benchmarks
but you're storing up trouble for the future.
If you don't think this kind of fallout is real, talk to whoever had
the bright idea of hijacking FPU registers to implement memcpy in
1996. The PIII designers rolled over and added XMM so
micro-optimizers would get their dirty mitts off the FPU, which it
appears that Doug Ledford and Jim Blandy duly acted on in 1999. Yes,
you still need to use FXSAVE/FXRSTOR when you want to mess with the
XMM stuff, but the CPU is smart enough to keep a shadow copy of all
the microstate that the flag states represent. So if all you do
between FXSAVE and FXRSTOR is shlep bytes around with MOVAPS, the
FXRSTOR costs you little or nothing. What hurts is an FXRSTOR from a
location that isn't the last location you FXSAVEd to, or an FXRSTOR
after actual FP arithmetic instructions have altered status flags.
The preceding may contain errors in detail -- I am neither a CPU
architect nor an x86 compiler writer nor even a serious kernel hacker.
But hopefully it's at least food for thought. If not, you know where
the "ignore this prolix nitwit" key is to be found on your keyboard.
Cheers,
- Michael
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-21 21:13 Ingo Molnar
2007-02-21 22:46 ` Michael K. Edwards
2007-02-22 10:01 ` Suparna Bhattacharya
@ 2007-02-24 18:34 ` Evgeniy Polyakov
2007-02-25 17:23 ` Ingo Molnar
2 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-24 18:34 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Wed, Feb 21, 2007 at 10:13:55PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> this is the v3 release of the syslet/threadlet subsystem:
>
> http://redhat.com/~mingo/syslet-patches/
There is no %xgs.
--- ./arch/i386/kernel/process.c~ 2007-02-24 22:56:14.000000000 +0300
+++ ./arch/i386/kernel/process.c 2007-02-24 22:53:19.000000000 +0300
@@ -426,7 +426,6 @@
regs.xds = __USER_DS;
regs.xes = __USER_DS;
- regs.xgs = __KERNEL_PDA;
regs.orig_eax = -1;
regs.eip = (unsigned long) async_thread_helper;
regs.xcs = __KERNEL_CS | get_kernel_rpl();
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-24 0:51 ` Michael K. Edwards
2007-02-24 2:17 ` Michael K. Edwards
@ 2007-02-24 3:25 ` Michael K. Edwards
1 sibling, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-24 3:25 UTC (permalink / raw)
To: Alan
Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On 2/23/07, Michael K. Edwards <medwards.linux@gmail.com> wrote:
> which costs you a D-cache stall.) Now put an sprintf with a %d in it
> between a couple of the syscalls, and _your_ arch is hurting. ...
er, that would be a %f. :-)
Cheers,
- Michael
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-24 0:51 ` Michael K. Edwards
@ 2007-02-24 2:17 ` Michael K. Edwards
2007-02-24 3:25 ` Michael K. Edwards
1 sibling, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-24 2:17 UTC (permalink / raw)
To: Alan
Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
I wrote:
> (On a pre-EABI ARM, there is even a substantial
> cache-related penalty for encoding the syscall number in the syscall
> opcode, because you have to peek back at the text segment to see it,
> which costs you a D-cache stall.)
Before you say it, I'm aware that this is not directly relevant to TLS
switch costs, except insofar as the "arch-dependent syscalls"
introduced for certain parts of ARM TLS handling carry the same
overhead as any other syscall. My point is that the system impact of
seemingly benign operations is not always predictable even to the arch
experts, and therefore one should be "parsimonious" (to use Kahan's
word) in defining what semantics programmers may rely on in
performance-critical situations.
If you arrange things so that threadlets are scheduled as much as
possible in bursts that share the same processor context (process
context, location in program text, TLS arena, FPU state -- basically
everything other than stack and integer registers), you are giving
yourself and future designers the maximum opportunity for exploiting
hardware optimizations. This would be a good thing if you want
threadlets to be performance-competitive with state machine designs.
If you still allow application programmers to _use_ shared processor
state, in the knowledge that it will be clobbered on threadlet switch,
then threadlets can use most of the coding style with which
programmers of event-driven frameworks are familiar. This would be a
good thing if you want threadlets to get wider use than the innards of
three or four databases and web servers.
Cheers,
- Michael
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-23 23:49 ` Michael K. Edwards
@ 2007-02-24 1:08 ` Alan
2007-02-24 0:51 ` Michael K. Edwards
0 siblings, 1 reply; 277+ messages in thread
From: Alan @ 2007-02-24 1:08 UTC (permalink / raw)
To: Michael K. Edwards
Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
> long my_threadlet_fn(void *data)
> {
> char *name = data;
> int fd;
>
> fd = open(name, O_RDONLY);
> if (fd < 0)
> goto out;
>
> fstat(fd, &stat);
> read(fd, buf, count)
> ...
>
> out:
> return threadlet_complete();
> }
>
> You're telling me that runs entirely in kernel space when open()
> blocks, and doesn't touch errno if fstat() fails? Now who hasn't read
> the code?
That example touches back into user space, but doesn't involve MMU changes
or cache flushes, or tlb flushes, or floating point.
errno is thread-specific if you use it, but errno is, as I said before,
entirely a C library detail that you don't have to suffer if you don't
want to. Avoiding that saves a segment register load - which isn't too
costly but isn't free.
Alan
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-24 1:08 ` Alan
@ 2007-02-24 0:51 ` Michael K. Edwards
2007-02-24 2:17 ` Michael K. Edwards
2007-02-24 3:25 ` Michael K. Edwards
0 siblings, 2 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-24 0:51 UTC (permalink / raw)
To: Alan
Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
Thanks for taking me at least minimally seriously, Alan. Pretty
generous of you, all things considered.
On 2/23/07, Alan <alan@lxorguk.ukuu.org.uk> wrote:
> That example touches back into user space, but doesn't involve MMU changes
> or cache flushes, or tlb flushes, or floating point.
True -- on an architecture where a change of TLS does not
substantially affect the TLB and cache, which (AIUI) it does on most
or all ARMs. (On a pre-EABI ARM, there is even a substantial
cache-related penalty for encoding the syscall number in the syscall
opcode, because you have to peek back at the text segment to see it,
which costs you a D-cache stall.) Now put an sprintf with a %d in it
between a couple of the syscalls, and _your_ arch is hurting. Deny
the userspace programmer the use of the FPU in threadlets, and they
become a lot less widely applicable -- and a lot flakier in a
non-wizard's hands, given that people often cheat around the small
number of x86 integer registers by using FP registers when copying
memory in bulk.
> errno is thread-specific if you use it, but errno is, as I said before,
> entirely a C library detail that you don't have to suffer if you don't
> want to. Avoiding that saves a segment register load - which isn't too
> costly but isn't free.
On your arch, it's a segment register -- and another
who-knows-how-many pages to migrate along with the stack and pt_regs.
On ARM, it's a coprocessor register that is incorrectly emulated by
most JTAG emulators (so bye-bye JTAG-assisted debugging and
profiling), or possibly a register stolen from the general purpose
register set. On some MIPSes I have known you probably can't
implement TLS safely without a cache flush.
If you tell people up front not to touch TLS in threadlets -- which
means not to use routines from <stdlib.h> and <stdio.h> -- then
implementors may have enough flexibility to make them perform well on
a wide range of architectures. Alternately, if there are some things
that threadlet users will genuinely need TLS for, you can tell them
that all of the threadlets belonging to process X on CPU Y share a TLS
context, and therefore things like errno can't be trusted across a
syscall -- but then you had better make fairly sure that threadlets
aren't preempted by other threadlets in between syscalls. Similar
arguments apply to FPU state.
IEEE 754. Harp, harp. :-)
Cheers,
- Michael
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-23 12:37 ` Alan
@ 2007-02-23 23:49 ` Michael K. Edwards
2007-02-24 1:08 ` Alan
0 siblings, 1 reply; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-23 23:49 UTC (permalink / raw)
To: Alan
Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On 2/23/07, Alan <alan@lxorguk.ukuu.org.uk> wrote:
> > Do you not understand that real user code touches FPU state at
> > unpredictable (to the kernel) junctures? Maybe not in a database or a
>
> We don't care. We don't have to care. The kernel threadlets don't execute
> in user space and don't do FP.
Blocked threadlets go back out to userspace, as new threads, after the
first blocking syscall completes. That's how Ingo described them in
plain English, that's how his threadlet example would have to work,
and that appears to be what his patches actually do.
> > web server, but in the GUIs and web-based monitoring applications that
> > are 99% of the potential customers for kernel AIO? I have no idea
> > what a %cr3 is, but if you don't fence off thread-local stuff from the
>
> How about you go read the intel architecture manuals then you might know
> more.
Y'know, there's more to life than x86. I'm no MMU expert, but I know
enough about ARM TLS and ptrace to have fixed ltrace -- not that that
took any special wizardry, just a need for it to work and some basic
forensic skill. If you want me to go away completely or not follow up
henceforth on anything you write, say so, and I'll decide what to do
in response. Otherwise, you might consider evaluating whether there's
a way to interpret my comments so that they reflect a perspective that
does not overlap 100% with yours rather than total idiocy.
> Last time I checked glibc was in userspace and the interface for kernel
> AIO is a matter for the kernel so errno is irrelevant, plus any
> threadlets doing system calls will only be living in kernel space anyway.
Ingo's original example code:
long my_threadlet_fn(void *data)
{
char *name = data;
int fd;
fd = open(name, O_RDONLY);
if (fd < 0)
goto out;
fstat(fd, &stat);
read(fd, buf, count)
...
out:
return threadlet_complete();
}
You're telling me that runs entirely in kernel space when open()
blocks, and doesn't touch errno if fstat() fails? Now who hasn't read
the code?
Cheers,
- Michael
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-23 18:01 ` Evgeniy Polyakov
@ 2007-02-23 20:43 ` Davide Libenzi
0 siblings, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-23 20:43 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Fri, 23 Feb 2007, Evgeniy Polyakov wrote:
> I was not clear - I meant why do we need to do that when we can run the
> same code in userspace? And better if we can have non-blocking dataflows
> and number of threads equal to number of processors...
I've a userspace library that does exactly that (GUASI - GPL code avail if
you want, but no man page written yet). It uses a pool of threads and
queues requests. I've a bench that crawls through a directory and reads
files. The sync-versus-async performance sucks. You can't do the cachehit
optimization in userspace. With network stuff it could prolly do better
(since network is more heavily skewed towards async), but still.
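The core of that pool-of-threads-plus-request-queue idea is tiny; roughly
like the sketch below (a from-memory sketch of the general pattern, not
GUASI's actual API or code):

	#include <pthread.h>
	#include <stdlib.h>

	struct req {
		void (*fn)(void *);	/* the blocking work, e.g. a wrapped syscall */
		void *arg;
		struct req *next;
	};

	static struct req *head, *tail;
	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

	/* submit: called by the main (event loop) thread, never blocks on IO */
	void submit(void (*fn)(void *), void *arg)
	{
		struct req *r = malloc(sizeof(*r));

		r->fn = fn;
		r->arg = arg;
		r->next = NULL;
		pthread_mutex_lock(&lock);
		if (tail)
			tail->next = r;
		else
			head = r;
		tail = r;
		pthread_cond_signal(&cond);
		pthread_mutex_unlock(&lock);
	}

	/* worker: one of N pool threads, N roughly the number of CPUs */
	void *worker(void *unused)
	{
		for (;;) {
			struct req *r;

			pthread_mutex_lock(&lock);
			while (!head)
				pthread_cond_wait(&cond, &lock);
			r = head;
			head = r->next;
			if (!head)
				tail = NULL;
			pthread_mutex_unlock(&lock);

			/* may block in the kernel; only this worker stalls */
			r->fn(r->arg);
			/* completion would normally be signalled back, e.g. via a pipe */
			free(r);
		}
		return NULL;
	}

Note that even a request the page cache could satisfy instantly still pays
the queue + wakeup + context-switch round trip here - which is exactly the
cachehit optimization you cannot do from userspace.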
> I started a week of writing without russian-english dictionary, so
> expect some troubles in communications with me :)
>
> I said that about kernel design - when we have thousand(s) of threads
> which do the work - if the number of context switches is small (i.e. when
> operations mostly do not block), then it is ok (although 'ps' output
> with that many threads can scare a grandma).
> It is also ok to say - 'hey, Linux has such an easy AIO model that
> everyone should switch to it and not care about the problems
> associated with multi-threaded programming with high concurrency',
> but, in my opinion, neither of those cases can cover all (or even most
> of) the usage cases.
>
> To eat my hat (or force others to do the same) I'm preparing a tree for
> threadlet test - I plan to write a trivial web server
> (accept/recv/send/sendfile in one threadlet function) and give it a try
> soon.
Funny, I lazily started doing the same thing last weekend (then I had to
stop, since the real job kicked in ;). I wanted to compare a fully MT trivial
HTTP server:
http://www.xmailserver.org/thrhttp.c
with one that is event-driven (epoll) and coroutine based. This one will
only be compared for memory-content delivery, since it has no async vfs
capabilities. They both support the special "/mem-XXXX" url, that allows
an HTTP loader to request a given content size.
I also have an epoll+coroutine HTTP loader (that works around httperf
limitations).
Then, I wanted to compare the above with one that is epoll+GUASI+coroutine
based (basically a userspace-only thingy).
I've the code for all the above.
Finally, with one that is epoll+syslet+coroutine based (no code for this
yet - but it should be an easy port from the GUASI one).
Keep in mind though, that a threadlet solution doing accept/recv/send/sendfile
is becoming blazingly similar to a full MT solution.
I can only imagine the thunder and flames that Larry would throw at us
for using all those threads :D
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-23 17:43 ` Davide Libenzi
@ 2007-02-23 18:01 ` Evgeniy Polyakov
2007-02-23 20:43 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-23 18:01 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Fri, Feb 23, 2007 at 09:43:14AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Fri, 23 Feb 2007, Evgeniy Polyakov wrote:
>
> > On Thu, Feb 22, 2007 at 11:46:48AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > >
> > > A dynamic pool will smooth thread creation/freeing up by a lot.
> > > And, in my box a *pthread* create/free takes ~10us, at 1000/s is 10ms, 1%.
> > > Bad, but not so awful ;)
> > > Look, I'm *definitely* not trying to advocate the use of async syscalls for
> > > network here, just pointing out that when we're talking about threads,
> > > Linux does a pretty good job.
> >
> > If we are going to create 1000 threads each second, then it is better to
> > preallocate them and queue work to that pool - like syslets did with
> > syscalls - rather than ultimately creating a new thread just because
> > creation is not that slow.
>
> We do create a pool indeed, as I said in the opening of my answer. The
> numbers I posted were just to show that thread creation/destroy is pretty
> fast, but that does not justify it as a design choice.
I was not clear - I meant why do we need to do that when we can run the
same code in userspace? And better if we can have non-blocking dataflows
and number of threads equal to number of processors...
> > All such micro-thread designs are especially good in the case when
> > 1. switching is _rare_ (very)
> > 2. programmer does not want to create complex model to achieve maximum
> > performance
> >
> > Disk (cached) IO definitely hits first entry and second one is there for
> > advertisements and fast deployment, but overall usage of the
> > asynchronous IO model is not limited to the above scenario, so
> > micro-threads definitely hit their own niche, but they cannot cover all usage
> > cases.
>
> You know, I read this a few times, but I still don't get what your point
> is here ;) Are you talking about micro-thread design in the kernel as for
> kthreads usage for AIO, or about userspace?
I started a week of writing without russian-english dictionary, so
expect some troubles in communications with me :)
I said that about kernel design - when we have thousand(s) of threads
which do the work - if the number of context switches is small (i.e. when
operations mostly do not block), then it is ok (although 'ps' output
with that many threads can scare a grandma).
It is also ok to say - 'hey, Linux has such an easy AIO model that
everyone should switch to it and not care about the problems
associated with multi-threaded programming with high concurrency',
but, in my opinion, neither of those cases can cover all (or even most
of) the usage cases.
To eat my hat (or force others to do the same) I'm preparing a tree for
threadlet test - I plan to write a trivial web server
(accept/recv/send/sendfile in one threadlet function) and give it a try
soon.
> - Davide
>
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-23 12:15 ` Evgeniy Polyakov
@ 2007-02-23 17:43 ` Davide Libenzi
2007-02-23 18:01 ` Evgeniy Polyakov
0 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-23 17:43 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Fri, 23 Feb 2007, Evgeniy Polyakov wrote:
> On Thu, Feb 22, 2007 at 11:46:48AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> >
> > A dynamic pool will smooth thread creation/freeing up by a lot.
> > And, in my box a *pthread* create/free takes ~10us, at 1000/s is 10ms, 1%.
> > Bad, but not so awful ;)
> > Look, I'm *definitely* not trying to advocate the use of async syscalls for
> > network here, just pointing out that when we're talking about threads,
> > Linux does a pretty good job.
>
> If we are going to create 1000 threads each second, then it is better to
> preallocate them and queue work to that pool - like syslets did with
> syscalls - rather than ultimately creating a new thread just because
> creation is not that slow.
We do create a pool indeed, as I said in the opening of my answer. The
numbers I posted were just to show that thread creation/destroy is pretty
fast, but that does not justify it as a design choice.
> All such micro-thread designs are especially good in the case when
> 1. switching is _rare_ (very)
> 2. programmer does not want to create complex model to achieve maximum
> performance
>
> Disk (cached) IO definitely hits first entry and second one is there for
> advertisements and fast deployment, but overall usage of the
> asynchronous IO model is not limited to the above scenario, so
> micro-threads definitely hit their own niche, but they cannot cover all usage
> cases.
You know, I read this a few times, but I still don't get what your point
is here ;) Are you talking about micro-thread design in the kernel as for
kthreads usage for AIO, or about userspace?
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 14:36 ` Ingo Molnar
@ 2007-02-23 14:23 ` Suparna Bhattacharya
0 siblings, 0 replies; 277+ messages in thread
From: Suparna Bhattacharya @ 2007-02-23 14:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Davide Libenzi, Jens Axboe,
Thomas Gleixner
On Thu, Feb 22, 2007 at 03:36:58PM +0100, Ingo Molnar wrote:
>
> * Suparna Bhattacharya <suparna@in.ibm.com> wrote:
>
> > > maybe it will, maybe it wont. Lets try? There is no true difference
> > > between having a 'request structure' that represents the current
> > > state of the HTTP connection plus a statemachine that moves that
> > > request between various queues, and a 'kernel stack' that goes in
> > > and out of runnable state and carries its processing state in its
> > > stack - other than the amount of RAM they take. (the kernel stack is
> > > 4K at a minimum - so with a million outstanding requests they would
> > > use up 4 GB of RAM. With 20k outstanding requests it's 80 MB of RAM
> > > - that's acceptable.)
> >
> > At what point are the cachemiss threads destroyed ? In other words how
> > well does this adapt to load variations ? For example, would this 80MB
> > of RAM continue to be locked down even during periods of lighter loads
> > thereafter ?
>
> you can destroy them at will from user-space too - just start a slow
> timer that zaps them if load goes down. I can add a
> sys_async_thread_exit(nr_threads) API to be able to drive this without
> knowing the TIDs of those threads, and/or i can add a kernel-internal
> mechanism to zap inactive threads. It would be rather easy and
> low-overhead - the v2 code already had a max_nr_threads tunable, i can
> reintroduce it. So the size of the pool of contexts does not have to be
> permanent at all.
If you can find a way to do this without an additional tunables burden on
the administrator, that would certainly help! IIRC, performance problems
linked to having too many or too few AIO kernel threads have been a commonly
reported issue elsewhere - it would be nice to be able to avoid repeating
the crux of that (mistake) in Linux. To me, any need to manually tune the
number has always seemed to defeat the very benefit of adaptability to varying
loads that AIO intrinsically provides.
Regards
Suparna
>
> Ingo
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-23 12:22 ` Evgeniy Polyakov
@ 2007-02-23 12:41 ` Evgeniy Polyakov
2007-02-25 17:45 ` Ingo Molnar
1 sibling, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-23 12:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Fri, Feb 23, 2007 at 03:22:25PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> I meant that we end up with having one thread per IO - they were
> preallocated, but that does not matter. And what about your idea of
> switching userspace threads to cachemiss threads?
>
> My main concern was only about the situation, when we ends up with truly
> bloking context (like network), and this results in having thousands of
> threads doing the work - even having most of them sleeping, there is a
> problem with memory overhead and context switching, although it is usable
> situation, but when all of them are ready immediately - context switching
simultaneously
> will kill a machine even with O(1) scheduler which made situation damn
> better than before, but it is not a cure for the problem.
Week of no-dictionary writings starts beating me.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-23 2:47 ` Michael K. Edwards
2007-02-23 8:31 ` Michael K. Edwards
2007-02-23 10:22 ` Ingo Molnar
@ 2007-02-23 12:37 ` Alan
2007-02-23 23:49 ` Michael K. Edwards
2 siblings, 1 reply; 277+ messages in thread
From: Alan @ 2007-02-23 12:37 UTC (permalink / raw)
To: Michael K. Edwards
Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
> Do you not understand that real user code touches FPU state at
> unpredictable (to the kernel) junctures? Maybe not in a database or a
We don't care. We don't have to care. The kernel threadlets don't execute
in user space and don't do FP.
> web server, but in the GUIs and web-based monitoring applications that
> are 99% of the potential customers for kernel AIO? I have no idea
> what a %cr3 is, but if you don't fence off thread-local stuff from the
How about you go read the intel architecture manuals then you might know
more.
> > We don't have an errno in the kernel because it's a stupid idea. Errno is
> > a user space hack for compatibility with 1970's bad design. So it's not
> > relevant either.
>
> Dude, it's thread-local, and the glibc wrapper around most synchronous
Last time I checked glibc was in userspace and the interface for kernel
AIO is a matter for the kernel so errno is irrelevant, plus any
threadlets doing system calls will only be living in kernel space anyway.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-23 11:51 ` Ingo Molnar
@ 2007-02-23 12:22 ` Evgeniy Polyakov
2007-02-23 12:41 ` Evgeniy Polyakov
2007-02-25 17:45 ` Ingo Molnar
0 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-23 12:22 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Fri, Feb 23, 2007 at 12:51:52PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> > [...] Those 20k blocked requests were created in about 20 seconds, so
> > roughly saying we have 1k of thread creation/freeing per second - do
> > we want this?
>
> i'm not sure why you mention thread creation and freeing. The
> syslet/threadlet code reuses already created async threads, and that is
> visible all around in both the kernel-space and in the user-space
> syslet/threadlet code.
>
> While Linux creates+destroys threads pretty damn fast (in about 10-15
> usecs - which is roughly the cost of getting a single 1-byte packet
> through a TCP socket from one process to another, on localhost), still
> we dont want to create and destroy a thread per request.
I meant that we end up with having one thread per IO - they were
preallocated, but that does not matter. And what about your idea of
switching userspace threads to cachemiss threads?
My main concern was only about the situation, when we ends up with truly
bloking context (like network), and this results in having thousands of
threads doing the work - even having most of them sleeping, there is a
problem with memory overhead and context switching, although it is usable
situation, but when all of them are ready immediately - context switching
will kill a machine even with O(1) scheduler which made situation damn
better than before, but it is not a cure for the problem.
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 21:24 ` Michael K. Edwards
2007-02-23 0:30 ` Alan
@ 2007-02-23 12:17 ` Ingo Molnar
2007-02-24 19:52 ` Michael K. Edwards
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-23 12:17 UTC (permalink / raw)
To: Michael K. Edwards
Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Michael K. Edwards <medwards.linux@gmail.com> wrote:
> On 2/22/07, Ingo Molnar <mingo@elte.hu> wrote:
> > maybe it will, maybe it wont. Lets try? There is no true difference
> > between having a 'request structure' that represents the current
> > state of the HTTP connection plus a statemachine that moves that
> > request between various queues, and a 'kernel stack' that goes in
> > and out of runnable state and carries its processing state in its
> > stack - other than the amount of RAM they take. (the kernel stack is
> > 4K at a minimum - so with a million outstanding requests they would
> > use up 4 GB of RAM. With 20k outstanding requests it's 80 MB of RAM
> > - that's acceptable.)
>
> This is a fundamental misconception. [...]
> The scheduler, on the other hand, has to blow and reload all of the
> hidden state associated with force-loading the PC and wherever your
> architecture keeps its TLS (maybe not the whole TLB, but not nothing,
> either). [...]
please read up a bit more about how the Linux scheduler works. Maybe
even read the code if in doubt? In any case, please direct kernel newbie
questions to http://kernelnewbies.org/, not linux-kernel@vger.kernel.org.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 19:46 ` Davide Libenzi
@ 2007-02-23 12:15 ` Evgeniy Polyakov
2007-02-23 17:43 ` Davide Libenzi
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-23 12:15 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Thu, Feb 22, 2007 at 11:46:48AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > I tried already :) - I just made allocations atomic in tcp_sendmsg() and
> > ended up with 1/4 of the sends blocking (I counted both allocation
> > failure and socket queue overflow). Those 20k blocked requests were
> > created in about 20 seconds, so roughly saying we have 1k of thread
> > creation/freeing per second - do we want this?
>
> A dynamic pool will smooth thread creation/freeing up by a lot.
> And, in my box a *pthread* create/free takes ~10us, at 1000/s is 10ms, 1%.
> Bad, but not so awful ;)
> Look, I'm *definitely* not trying to advocate the use of async syscalls for
> network here, just pointing out that when we're talking about threads,
> Linux does a pretty good job.
If we are going to create 1000 threads each second, then it is better to
preallocate them and queue work to that pool - like syslets did with
syscalls - rather than ultimately creating a new thread just because
creation is not that slow.
All such micro-thread designs are especially good in the case when
1. switching is _rare_ (very)
2. programmer does not want to create complex model to achieve maximum
performance
Disk (cached) IO definitely hits first entry and second one is there for
advertisements and fast deployment, but overall usage of the
asynchronous IO model is not limited to the above scenario, so
micro-threads definitely hit their own niche, but they cannot cover all usage
cases.
>
> - Davide
>
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 13:32 ` Evgeniy Polyakov
2007-02-22 19:46 ` Davide Libenzi
@ 2007-02-23 11:51 ` Ingo Molnar
2007-02-23 12:22 ` Evgeniy Polyakov
1 sibling, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-23 11:51 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> [...] Those 20k blocked requests were created in about 20 seconds, so
> roughly saying we have 1k of thread creation/freeing per second - do
> we want this?
i'm not sure why you mention thread creation and freeing. The
syslet/threadlet code reuses already created async threads, and that is
visible all around in both the kernel-space and in the user-space
syslet/threadlet code.
While Linux creates+destroys threads pretty damn fast (in about 10-15
usecs - which is roughly the cost of getting a single 1-byte packet
through a TCP socket from one process to another, on localhost), still
we dont want to create and destroy a thread per request.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 17:17 ` David Miller
@ 2007-02-23 11:12 ` Ingo Molnar
0 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-23 11:12 UTC (permalink / raw)
To: David Miller
Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
zach.brown, suparna, davidel, jens.axboe, tglx
* David Miller <davem@davemloft.net> wrote:
> > furthermore, in a real webserver there's a whole lot of other stuff
> > happening too: VFS blocking, mutex/lock blocking, memory pressure
> > blocking, filesystem blocking, etc., etc. Threadlets/syslets cover
> > them /all/ and never hold up the primary context: as long as there's
> > more requests to process, they will be processed. Plus other
> > important networked workloads, like fileservers are typically on
> > fast LANs and those requests are very much a fire-and-forget matter
> > most of the time.
>
> I expect clients of a fileserver to cause the server to block in
> places such as tcp_sendmsg() as much if not more so than a webserver
> :-)
yeah, true. But ... i'd still like to mildly disagree with the
characterisation that, because blocking is the norm in networking,
this goes against the concept of syslets/threadlets.
Networking has a 'work cache' too, in an abstract sense, which works in
favor of syslets: the socket buffers. If there is a reasonably sized
output queue for sockets (not extremely small like 4k per socket but
something at least 16k), then user-space can chunk up its workload along
pretty reasonable lines without assuming too much, and turn one such
'chunk' into one atomic step done in the 'syslet/threadlet request'. In
the rare cases where blocking happens in an unexpected way, the
syslet/threadlet 'goes async' but that only holds up that particular
request, and only for that chunk, not the main loop of processing, and
doesn't force the request into an async thread forever.
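As a rough illustration of that chunking (the 16k figure and the struct are
illustrative, and the threadlet_complete() prototype is just borrowed from
the earlier example, not the real patch interface):

	#include <sys/socket.h>
	#include <sys/types.h>

	long threadlet_complete(void);	/* illustrative prototype */

	#define CHUNK	(16 * 1024)	/* roughly one socket output buffer's worth */

	struct chunk_state {
		int fd;
		const char *buf;	/* the full response */
		size_t sent, remaining;
	};

	/*
	 * One threadlet request == one chunk.  If the socket buffer has room,
	 * send() returns immediately and the threadlet completes synchronously;
	 * only when the buffer is full does this particular request 'go async',
	 * and only for this chunk - the main submission loop keeps running.
	 */
	static long send_chunk_threadlet(void *data)
	{
		struct chunk_state *c = data;
		size_t len = c->remaining < CHUNK ? c->remaining : CHUNK;
		ssize_t n = send(c->fd, c->buf + c->sent, len, 0);

		if (n > 0) {
			c->sent += n;
			c->remaining -= n;
		}
		return threadlet_complete();
	}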
the kevent model is very much about: /never ever/ block the main
thread. If you ever block, performance goes down the drain.
the syslet performance model is: if you block less than say 10-20% of
the time, you are basically as fast as the most extreme kevents based
model. Syslets/threadlets also avoid, in a natural way, the fundamental
workflow problem that nonblocking designs have: 'how do we guarantee
that the system progresses forward' - a problem which makes nonblocking
code quite fragile.
another property is that the performance curve is a lot less sensitive to
blocking in the syslet model - and real user-space servers will always
have unexpected points of blockage - unless all of the userspace code is
perfectly converted into state machines - which is not reasonable. So
with syslets we are not forced to program everything as state-machines
in user-space; such techniques are only needed to reduce the amount of
context-switching and to reduce the RAM footprint - they won't change
fundamental scalability.
plus there's the hidden advantage of having a 'constructed state' on the
kernel stack: a thread that blocks in the middle of tcp_sendmsg() has
quite some state: the socket has been looked up in the hash(es), all
input parameters have been validated, the timer has been set, skbs have
been allocated ahead, etc. Those things do add up, especially if you come
back after a long wait and all those things are scattered around the memory
map cache-cold - not nicely composed into a single block of memory on
the stack (generating only a single cachemiss in essence).
All in all, i'm cautiously optimistic that even a totally naive,
blocks-itself-for-every-request syslet application would perform pretty
close to a Tux/kevents type of nonblock+event-queueing based application
- with a vastly higher generic utility benefit. So i'm not dismissing
this possibility and i'm not writing off syslet performance just because
they do context-switches :-)
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-23 2:47 ` Michael K. Edwards
2007-02-23 8:31 ` Michael K. Edwards
@ 2007-02-23 10:22 ` Ingo Molnar
2007-02-23 12:37 ` Alan
2 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-23 10:22 UTC (permalink / raw)
To: Michael K. Edwards
Cc: Alan, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
* Michael K. Edwards <medwards.linux@gmail.com> wrote:
> On 2/22/07, Alan <alan@lxorguk.ukuu.org.uk> wrote:
> > We don't use the FPU in the kernel except in very weird cases where
> > it makes an enormous performance difference. The threadlets also
> > have the same page tables so they have the same %cr3 so its very
> > cheap to switch, basically a predicted jump and some register loads
>
> Do you not understand that real user code touches FPU state at
> unpredictable (to the kernel) junctures? Maybe not in a database or a
> web server, but in the GUIs and web-based monitoring applications that
> are 99% of the potential customers for kernel AIO?
> I have no idea what a %cr3 is, [...]
then please stop wasting Alan's time ...
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-23 2:47 ` Michael K. Edwards
@ 2007-02-23 8:31 ` Michael K. Edwards
2007-02-23 10:22 ` Ingo Molnar
2007-02-23 12:37 ` Alan
2 siblings, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-23 8:31 UTC (permalink / raw)
To: Alan
Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
OK, having skimmed through Ingo's code once now, I can already see I
have some crow to eat. But I still have some marginally less stupid
questions.
Cachemiss threads are created with CLONE_VM | CLONE_FS | CLONE_FILES |
CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM. Does that mean they
share thread-local storage with the userspace thread, have
thread-local storage of their own, or have no thread-local storage
until NPTL asks for it?
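(For reference, a userspace illustration of what those clone flags imply -
this is not the kernel-internal path the patch uses, just a sketch; the
key point is that CLONE_SETTLS is absent, so the child keeps the caller's
thread-pointer register and therefore sees the very same TLS area:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

#define STACK_SZ (64 * 1024)

static int worker(void *arg)
{
        /* would resume the blocked work here */
        return 0;
}

static int spawn_cachemiss_like_thread(void *arg)
{
        char *stack = malloc(STACK_SZ);
        int flags = CLONE_VM | CLONE_FS | CLONE_FILES |
                    CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM;

        if (!stack)
                return -1;
        /* no CLONE_SETTLS: the new thread keeps the caller's TLS pointer */
        return clone(worker, stack + STACK_SZ, flags, arg);
}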
When the kernel zeroes the userspace stack pointer in
cachemiss_thread(), presumably the allocation of a new userspace stack
page is postponed until that thread needs to resume userspace
execution (after completion of the first I/O that missed cache). When
do you copy the contents of the threadlet function's stack frame into
this new stack page?
Is there anything in a struct pt_regs that is expensive to restore
(perhaps because it flushes a pipeline or cache that wasn't already
flushed on syscall entry)? Is there any reason why the FPU context
has to differ among threadlets that have blocked while executing the
same userspace function with different stacks? If the TLS pointer
isn't in either of these, where is it, and why doesn't
move_user_context() swap it?
If you set out to cancel one of these threadlets, how are you going to
ensure that it isn't holding any locks? Is there any reasonable way
to implement a userland finally { } block so that you can release
malloc'd memory and clean up application data structures?
If you want to migrate a threadlet to another CPU on syscall entry
and/or exit, what has to travel other than the userspace stack and the
struct pt_regs? (I am assuming a quiesced FPU and thread(s) at the
destination with compatible FPU flags.) Does it make sense for the
userspace stack page to have space reserved for a struct pt_regs
before the threadlet stack frame, so that the entire userspace
threadlet state migrates as one page?
I now see that an effort is already made to schedule threadlets in
bursts, grouped by PID, when several have unblocked since the last
timeslice. What is the transition cost from one threadlet to another?
Can that transition cost be made lower by reducing the amount of
state that belongs to the individual threadlet vs. the pool of
cachemiss threads associated with that threadlet entrypoint?
Generally, is there a "contract" that could be made between the
threadlet application programmer and the implementation which would
allow, perhaps in future hardware, the kind of invisible pipelined
coprocessing for AIO that has been so successful for FP?
I apologize for having adopted a hostile tone in a couple of previous
messages in this thread; remind me in the future not to alternate
between thinking about code and about the FSF. :-) I do really like
a lot of things about the threadlet model, and would rather not see it
given up on for network I/O and NUMA systems. So I'm going to
reiterate again -- more politely this time -- the need for a
data-structure-centric threadlet pool abstraction that supports
request throttling, reprioritization, bulk cancellation, and migration
of individual threadlets to the node nearest the relevant I/O port.
I'm still not sold on syslets as anything userspace-visible, but I
could imagine them enabling a sort of functional syntax for chaining
I/O operations, with most failures handled as inline "Not-a-Pointer"
values or as "AEIOU" (asynchronously executed I/O unit?) exceptions
instead of syscall-test-branch-syscall-test-branch. Actually working
out the semantics and getting them adopted as an IEEE standard could
even win someone a Turing award. :-)
Cheers,
- Michael
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-23 0:30 ` Alan
@ 2007-02-23 2:47 ` Michael K. Edwards
2007-02-23 8:31 ` Michael K. Edwards
` (2 more replies)
0 siblings, 3 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-23 2:47 UTC (permalink / raw)
To: Alan
Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On 2/22/07, Alan <alan@lxorguk.ukuu.org.uk> wrote:
> > to do anything but chase pointers through cache. Done right, it
> > hardly even branches (although the branch misprediction penalty is a
> > lot less of a worry on current x86_64 than it was in the
> > mega-superscalar-out-of-order-speculative-execution days). It's damn
>
> Actually it costs a lot more on at least one vendors processor because
> you stall very long pipelines.
You're right; I overreached there. I haven't measured branch
misprediction penalties in dog's years (I focus more on system latency
issues these days), so I'm just going on rumor. If your CPU vendor is
still playing the tune-for-SpecINT-at-the-expense-of-real-code game
(*cough* Itanic *cough*), get another CPU vendor -- while you still
can.
> > threadlets promise that they will not touch anything thread-local, and
> > that when the FPU is handed to them in a specific, known state, they
> > leave it in that same state. (Some of the flags can be
>
> We don't use the FPU in the kernel except in very weird cases where it
> makes an enormous performance difference. The threadlets also have the
> same page tables so they have the same %cr3 so its very cheap to switch,
> basically a predicted jump and some register loads
Do you not understand that real user code touches FPU state at
unpredictable (to the kernel) junctures? Maybe not in a database or a
web server, but in the GUIs and web-based monitoring applications that
are 99% of the potential customers for kernel AIO? I have no idea
what a %cr3 is, but if you don't fence off thread-local stuff from the
threadlets you are just begging for end-user Heisenbugs and a place in
the dustheap of history next to Symbolics LISP.
> > Do me a favor. Do some floating point math and a memcpy() in between
> > syscalls in the threadlet. Actually fiddle with errno and the FPU
>
> We don't have an errno in the kernel because its a stupid idea. Errno is
> a user space hack for compatibility with 1970's bad design. So its not
> relevant either.
Dude, it's thread-local, and the glibc wrapper around most synchronous
syscalls touches it. If you don't instantiate a new TLS context (or
whatever the right lingo for that is) for every threadlet, you are
TOAST -- if you let the user call stuff out of <stdlib.h> (let alone
<stdio.h>) from within the threadlet.
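(To see the point concretely - under NPTL errno lives in thread-local
storage, so which TLS block a threadlet runs with decides whose errno a
failing glibc syscall wrapper scribbles on; a tiny sketch:)

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
        read(-1, NULL, 0);                       /* fails with EBADF */
        printf("worker errno: %d\n", errno);     /* this thread's copy */
        return NULL;
}

int main(void)
{
        pthread_t t;

        errno = 0;
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        printf("main errno:   %d\n", errno);     /* a separate per-thread copy */
        return 0;
}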
Cheers,
- Michael
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 21:24 ` Michael K. Edwards
@ 2007-02-23 0:30 ` Alan
2007-02-23 2:47 ` Michael K. Edwards
2007-02-23 12:17 ` Ingo Molnar
1 sibling, 1 reply; 277+ messages in thread
From: Alan @ 2007-02-23 0:30 UTC (permalink / raw)
To: Michael K. Edwards
Cc: Ingo Molnar, Evgeniy Polyakov, Ulrich Drepper, linux-kernel,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
> to do anything but chase pointers through cache. Done right, it
> hardly even branches (although the branch misprediction penalty is a
> lot less of a worry on current x86_64 than it was in the
> mega-superscalar-out-of-order-speculative-execution days). It's damn
Actually it costs a lot more on at least one vendor's processor because
you stall very long pipelines.
> threadlets promise that they will not touch anything thread-local, and
> that when the FPU is handed to them in a specific, known state, they
> leave it in that same state. (Some of the flags can be
We don't use the FPU in the kernel except in very weird cases where it
makes an enormous performance difference. The threadlets also have the
same page tables so they have the same %cr3 so it's very cheap to switch,
basically a predicted jump and some register loads
> Do me a favor. Do some floating point math and a memcpy() in between
> syscalls in the threadlet. Actually fiddle with errno and the FPU
We don't have an errno in the kernel because it's a stupid idea. Errno is
a user-space hack for compatibility with 1970s bad design. So it's not
relevant either.
Alan
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 21:32 ` Benjamin LaHaise
@ 2007-02-22 21:44 ` Zach Brown
0 siblings, 0 replies; 277+ messages in thread
From: Zach Brown @ 2007-02-22 21:44 UTC (permalink / raw)
To: Benjamin LaHaise
Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
> direct-io.c is evil. Ridiculously.
You will have a hard time finding someone to defend it, I predict :).
There is good news on that front, too. Chris (Mason) is making
progress on getting rid of the worst of the Magical Locking that
makes buffered races with O_DIRECT ops so awful.
I'm not holding my breath for a page cache so fine grained that it
could pin and reference 512B granular user regions and build bios
from them, though that sure would be nice :).
>> As an experiment, I'm working on backing the sys_io_*() calls with
>> syslets. It's looking very promising so far.
>
> Great, I'd love to see the comparisons.
I'm out for data. If it sucks, well, we'll know just how much. I'm
pretty hopeful that it won't :).
> One other implementation to consider is actually using kernel threads
> compared to how syslets perform. Direct IO for one always blocks, so
> there shouldn't be much of a performance difference compared to
> syslets,
> with the bonus that no arch specific code is needed.
Yeah, I'm starting with raw kernel threads so we can get some numbers
before moving to syslets.
One of the benefits syslets bring is the ability to start processing
the next pending request the moment current request processing
blocks. In the concurrent O_DIRECT write case that avoids releasing
a ton of kernel threads which all just run to serialize on i_mutex
(potentially bouncing it around cache domains) as the O_DIRECT ops
are built and sent.
-z
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 14:31 ` Ingo Molnar
2007-02-22 14:47 ` David Miller
2007-02-22 14:59 ` Ingo Molnar
@ 2007-02-22 21:42 ` Michael K. Edwards
2 siblings, 0 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-22 21:42 UTC (permalink / raw)
To: Ingo Molnar
Cc: David Miller, johnpol, arjan, drepper, linux-kernel, torvalds,
hch, akpm, alan, zach.brown, suparna, davidel, jens.axboe, tglx
On 2/22/07, Ingo Molnar <mingo@elte.hu> wrote:
> Secondly, even assuming lots of pending requests/async-threads and a
> naive queueing model, an open request will eat up resources on the
> server no matter what.
Another fundamental misconception. Kernel AIO is not for servers.
One programmer in a hundred is working on a server codebase, and one
in a thousand dares to touch server plumbing. Kernel AIO is for
clients, especially when mated to GUIs with an event delivery
mechanism. Ask yourself why the one and only thing that Windows NT
has ever gotten right about networking is I/O completion ports.
Cheers,
- Michael
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 21:23 ` Zach Brown
@ 2007-02-22 21:32 ` Benjamin LaHaise
2007-02-22 21:44 ` Zach Brown
0 siblings, 1 reply; 277+ messages in thread
From: Benjamin LaHaise @ 2007-02-22 21:32 UTC (permalink / raw)
To: Zach Brown
Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Evgeniy Polyakov, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On Thu, Feb 22, 2007 at 01:23:57PM -0800, Zach Brown wrote:
> As one of the poor suckers who has been fixing bugs in fs/aio.c and
> fs/direct-io.c, I really want everyone to read Ingo's paragraph a few
> times. Have it printed on a t-shirt.
direct-io.c is evil. Ridiculously.
> Amen.
>
> As an experiment, I'm working on backing the sys_io_*() calls with
> syslets. It's looking very promising so far.
Great, I'd love to see the comparisons.
> >So all in one, i used to think that AIO state-machines have a long-term
> >place within the kernel, but with syslets i think i've proven myself
> >embarrassingly wrong =B-)
>
> Welcome to the party :).
Well, there are always the 2.4 patches, which are properly state driven
and reasonably simple. Retry was born out of a need to come up with a
mechanism that had less impact on the core kernel code, and yes, it seems
to be a failure and in dire need of replacement.
One other implementation to consider is actually using kernel threads
compared to how syslets perform. Direct IO for one always blocks, so
there shouldn't be much of a performance difference compared to syslets,
with the bonus that no arch specific code is needed.
-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <zyntrop@kvack.org>.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 14:47 ` David Miller
` (2 preceding siblings ...)
2007-02-22 20:13 ` Davide Libenzi
@ 2007-02-22 21:30 ` Zach Brown
3 siblings, 0 replies; 277+ messages in thread
From: Zach Brown @ 2007-02-22 21:30 UTC (permalink / raw)
To: David Miller
Cc: mingo, johnpol, arjan, drepper, linux-kernel, torvalds, hch,
akpm, alan, suparna, davidel, jens.axboe, tglx
> The more I think about it, a reasonable solution might actually be to
> use threadlets for disk I/O and pure event based processing for
> networking. It is two different handling paths and non-unified,
> but that might be the price for good performance :-)
I generally agree, with some comments.
If we come to the decision that there are some message rates that are
better suited to delivery into a user-read ring (10gige rx to kevent,
say) then it doesn't seem like it would be much of a stretch to add a
facility where syslet completion could be funneled into that channel
as well.
I also wonder if there isn't some opportunity to cut down the number
of syscalls / op in networking land. Is it madness to think of a
call like recvmsgv() which could provide a vector of msghdrs? It
might not make sense, but it might cut down on the per-op overhead
for loads that know they're going to be heavy enough to get a decent
amount of batching without fatally harming latency. Maybe those
loads are rare..
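Purely as a strawman, something shaped like this - a hypothetical
prototype, not an existing syscall:

#include <sys/types.h>
#include <sys/socket.h>

struct msgvec {
        struct msghdr hdr;          /* one message's iovecs/control data */
        ssize_t       result;       /* bytes received or -errno, filled in */
};

/* hypothetical: receive up to vlen messages from fd in a single syscall,
   returning how many msgvec entries were actually filled in */
int recvmsgv(int fd, struct msgvec *vec, unsigned int vlen, int flags);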
- z
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 12:59 ` Ingo Molnar
2007-02-22 13:32 ` Evgeniy Polyakov
2007-02-22 14:17 ` Suparna Bhattacharya
@ 2007-02-22 21:24 ` Michael K. Edwards
2007-02-23 0:30 ` Alan
2007-02-23 12:17 ` Ingo Molnar
2 siblings, 2 replies; 277+ messages in thread
From: Michael K. Edwards @ 2007-02-22 21:24 UTC (permalink / raw)
To: Ingo Molnar
Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Suparna Bhattacharya,
Davide Libenzi, Jens Axboe, Thomas Gleixner
On 2/22/07, Ingo Molnar <mingo@elte.hu> wrote:
> > It is not a TUX anymore - you had 1024 threads, and all of them will
> > be consumed by tcp_sendmsg() for slow clients - rescheduling will kill
> > a machine.
>
> maybe it will, maybe it wont. Lets try? There is no true difference
> between having a 'request structure' that represents the current state
> of the HTTP connection plus a statemachine that moves that request
> between various queues, and a 'kernel stack' that goes in and out of
> runnable state and carries its processing state in its stack - other
> than the amount of RAM they take. (the kernel stack is 4K at a minimum -
> so with a million outstanding requests they would use up 4 GB of RAM.
> With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)
This is a fundamental misconception. The state machine doesn't have
to do anything but chase pointers through cache. Done right, it
hardly even branches (although the branch misprediction penalty is a
lot less of a worry on current x86_64 than it was in the
mega-superscalar-out-of-order-speculative-execution days). It's damn
near free -- but it's a pain in the butt to code, and it has to be
done either in-kernel or in per-CPU OS-atop-the-OS dispatch threads.
The scheduler, on the other hand, has to blow away and reload all of the
hidden state associated with force-loading the PC and wherever your
architecture keeps its TLS (maybe not the whole TLB, but not nothing,
either). The only way around this that I can think of is to make
threadlets promise that they will not touch anything thread-local, and
that when the FPU is handed to them in a specific, known state, they
leave it in that same state. (Some of the flags can be
unspecified-but-don't-touch-me.) Then you can schedule threadlets in
bursts with negligible transition cost from one to the next.
There is, however, a substantial setup cost for a burst, because you
have to put the FPU in that known state and lock out TLS access (this
is user code, after all). If the wrong process is in foreground, you
also need to switch process context at the start of a burst; no
fandangos on other processes' core, please, and to be remotely useful
the threadlets need access to process-global data structures and
synchronization primitives anyway. That's why you need for threadlets
to have a separate SCHED_THREADLET priority and at least a weak
ordering by PID. At which point you are outside the feature set of
the O(1) scheduler as I understand it, and you might as well schedule
them from the next tasklet following the softirq dispatcher.
> > My tests show that with 4k connections per second (8k concurrency)
> > more than 20k connections of 80k total block in tcp_sendmsg() over
> > gigabit lan between quite fast machines.
>
> yeah. Note that you can have a million sleeping threads if you want, the
> scheduler wont care. What matters more is the amount of true concurrency
> that is present at any given time. But yes, i agree that overscheduling
> can be a problem.
What matters is that a burst of I/O responses be scheduled efficiently
without taking down the rest of the box. That, and the ability to
cancel no-longer-interesting I/O requests in bulk, without leaking
memory and synchronization primitives all over the place. If you
don't have that, this scheme is UNUSABLE for network I/O.
> btw., what is the measurement utility you are using with kevents ('ab'
> perhaps, with a high -c concurrency count?), and which webserver are you
> using? (light-httpd?)
Do me a favor. Do some floating point math and a memcpy() in between
syscalls in the threadlet. Actually fiddle with errno and the FPU
rounding flags. Watch it slow to a crawl and/or break floating point
arithmetic horribly. Understand why no one with half a brain uses
Java, or any other language which cuts FP corners for the sake of
cheap threads, for calculations that have to be correct. (Note that
Kahan received the Turing award for contributions to IEEE 754. If his
polemic is too thick, read
http://www-128.ibm.com/developerworks/java/library/j-jtp0114/.)
Cheers,
- Michael
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 7:40 ` Ingo Molnar
2007-02-22 11:31 ` Evgeniy Polyakov
2007-02-22 19:38 ` Davide Libenzi
@ 2007-02-22 21:23 ` Zach Brown
2007-02-22 21:32 ` Benjamin LaHaise
2 siblings, 1 reply; 277+ messages in thread
From: Zach Brown @ 2007-02-22 21:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Evgeniy Polyakov,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
> Plus there's the fundamental killer that KAIO is a /lot/ harder to
> implement (and to maintain) on the kernel side: it has to be implemented
> for every IO discipline, and even for the IO disciplines it supports at
> the moment, it is not truly asynchronous for things like metadata
> blocking or VFS blocking. To handle things like metadata blocking it has
> to resort to non-statemachine techniques like retries - which are bad
> for performance.
Yes, yes, yes.
As one of the poor suckers who has been fixing bugs in fs/aio.c and
fs/direct-io.c, I really want everyone to read Ingo's paragraph a few
times. Have it printed on a t-shirt.
Look at the number of man-years that have gone into fs/aio.c and
fs/direct-io.c. After all that effort it *barely* supports non-blocking
O_DIRECT IO.
The maintenance overhead of those two files, above all else, is what
pushed me to finally try that nutty fibril attempt.
> Syslets/threadlets on the other hand, once the core is implemented,
> have near zero ongoing maintenance cost (compared to KAIO pushed into
> every IO subsystem) and cover all IO disciplines and API variants
> immediately, and they are as perfectly asynchronous as it gets.
Amen.
As an experiment, I'm working on backing the sys_io_*() calls with
syslets. It's looking very promising so far.
> So all in one, i used to think that AIO state-machines have a long-term
> place within the kernel, but with syslets i think i've proven myself
> embarrassingly wrong =B-)
Welcome to the party :).
- z
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 14:47 ` David Miller
2007-02-22 15:02 ` Evgeniy Polyakov
2007-02-22 15:15 ` Ingo Molnar
@ 2007-02-22 20:13 ` Davide Libenzi
2007-02-22 21:30 ` Zach Brown
3 siblings, 0 replies; 277+ messages in thread
From: Davide Libenzi @ 2007-02-22 20:13 UTC (permalink / raw)
To: David Miller
Cc: Ingo Molnar, johnpol, Arjan Van de Ven, Ulrich Drepper,
Linux Kernel Mailing List, Linus Torvalds, hch, akpm, Alan Cox,
zach.brown, suparna, jens.axboe, tglx
On Thu, 22 Feb 2007, David Miller wrote:
> The more I think about it, a reasonable solution might actually be to
> use threadlets for disk I/O and pure event based processing for
> networking. It is two different handling paths and non-unified,
> but that might be the price for good performance :-)
Well, it takes 20 lines of userspace C code to bring *unification* to the
universe ;)
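Something along these lines, presumably - a sketch assuming a
worker-thread pool for the blocking I/O, a plain pipe to feed its
completions back into the same epoll set as the sockets, and two
placeholder handle_*() hooks:

#include <sys/epoll.h>
#include <unistd.h>

void handle_socket_ready(int fd);           /* nonblocking network path */
void handle_blocking_completion(void *req); /* result from the thread pool */

void unified_loop(int epfd, int completion_fd)
{
        struct epoll_event ev, events[64];
        int i, n;

        ev.events = EPOLLIN;
        ev.data.fd = completion_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, completion_fd, &ev);

        for (;;) {
                n = epoll_wait(epfd, events, 64, -1);
                for (i = 0; i < n; i++) {
                        if (events[i].data.fd == completion_fd) {
                                void *req;
                                read(completion_fd, &req, sizeof(req));
                                handle_blocking_completion(req);
                        } else {
                                handle_socket_ready(events[i].data.fd);
                        }
                }
        }
}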
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 13:32 ` Evgeniy Polyakov
@ 2007-02-22 19:46 ` Davide Libenzi
2007-02-23 12:15 ` Evgeniy Polyakov
2007-02-23 11:51 ` Ingo Molnar
1 sibling, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-22 19:46 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Ulrich Drepper, Linux Kernel Mailing List,
Linus Torvalds, Arjan van de Ven, Christoph Hellwig,
Andrew Morton, Alan Cox, Zach Brown, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Thu, 22 Feb 2007, Evgeniy Polyakov wrote:
> > maybe it will, maybe it wont. Lets try? There is no true difference
> > between having a 'request structure' that represents the current state
> > of the HTTP connection plus a statemachine that moves that request
> > between various queues, and a 'kernel stack' that goes in and out of
> > runnable state and carries its processing state in its stack - other
> > than the amount of RAM they take. (the kernel stack is 4K at a minimum -
> > so with a million outstanding requests they would use up 4 GB of RAM.
> > With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)
>
> I tried already :) - I just made a allocations atomic in tcp_sendmsg() and
> ended up with 1/4 of the sends blocking (I counted both allocation
> failure and socket queue overflow). Those 20k blocked requests were
> created in about 20 seconds, so roughly saying we have 1k of thread
> creation/freeing per second - do we want this?
A dynamic pool will smooth thread creation/freeing up by a lot.
And, on my box a *pthread* create/free takes ~10us; at 1000/s that's 10ms, or 1%.
Bad, but not so awful ;)
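(Easy enough to re-check with a rough create/join microbenchmark like the
one below - numbers will obviously vary with machine and kernel:)

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

static void *noop(void *arg) { return NULL; }

int main(void)
{
        enum { N = 10000 };
        struct timeval t0, t1;
        pthread_t t;
        double us;
        int i;

        gettimeofday(&t0, NULL);
        for (i = 0; i < N; i++) {
                pthread_create(&t, NULL, noop, NULL);
                pthread_join(t, NULL);
        }
        gettimeofday(&t1, NULL);

        us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("%.2f us per create+join\n", us / N);
        return 0;
}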
Look, I'm *definitely* not trying to advocate the use of async syscalls for
network here, just pointing out that when we're talking about threads,
Linux does a pretty good job.
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 7:40 ` Ingo Molnar
2007-02-22 11:31 ` Evgeniy Polyakov
@ 2007-02-22 19:38 ` Davide Libenzi
2007-02-28 9:45 ` Ingo Molnar
2007-02-22 21:23 ` Zach Brown
2 siblings, 1 reply; 277+ messages in thread
From: Davide Libenzi @ 2007-02-22 19:38 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, Linux Kernel Mailing List, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, Evgeniy Polyakov, David S. Miller,
Suparna Bhattacharya, Jens Axboe, Thomas Gleixner
On Thu, 22 Feb 2007, Ingo Molnar wrote:
>
> * Ulrich Drepper <drepper@redhat.com> wrote:
>
> > Ingo Molnar wrote:
> > > in terms of AIO, the best queueing model is i think what the kernel uses
> > > internally: freely ordered, with barrier support.
> >
> > Speaking of AIO, how do you imagine lio_listio is implemented? If
> > there is no asynchronous syscall it would mean creating a threadlet
> > for each request but this means either waiting or creating
> > several/many threads.
>
> my current thinking is that special-purpose (non-programmable, static)
> APIs like aio_*() and lio_*(), where every last cycle of performance
> matters, should be implemented using syslets - even if it is quite
> tricky to write syslets (which they no doubt are - just compare the size
> of syslet-test.c to threadlet-test.c). So i'd move syslets into the same
> category as raw syscalls: pieces of the raw infrastructure between the
> kernel and glibc, not an exposed API to apps. [and even if we keep them
> in that category they still need quite a bit of API work, to clean up
> the 32/64-bit issues, etc.]
Now that chains of syscalls can be way more easily handled with clets^wthreadlets,
why would we need the whole syslets crud inside?
Why can't aio_* be implemented with *simple* (or parallel/unrelated)
syscall submit w/out the burden of a complex, limiting and heavy API (I
won't list all the points against syslets, because I already did it
enough times)? The compat layer alone is so bad it's not even funny.
Look at the code. Just removing the syslets crud would prolly cut 40% of
it. And we did not even touch the compat code yet.
- Davide
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 15:15 ` Ingo Molnar
2007-02-22 15:29 ` Ingo Molnar
@ 2007-02-22 17:17 ` David Miller
2007-02-23 11:12 ` Ingo Molnar
1 sibling, 1 reply; 277+ messages in thread
From: David Miller @ 2007-02-22 17:17 UTC (permalink / raw)
To: mingo
Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
zach.brown, suparna, davidel, jens.axboe, tglx
From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 22 Feb 2007 16:15:09 +0100
> furthermore, in a real webserver there's a whole lot of other stuff
> happening too: VFS blocking, mutex/lock blocking, memory pressure
> blocking, filesystem blocking, etc., etc. Threadlets/syslets cover them
> /all/ and never hold up the primary context: as long as there's more
> requests to process, they will be processed. Plus other important
> networked workloads, like fileservers are typically on fast LANs and
> those requests are very much a fire-and-forget matter most of the time.
I expect clients of a fileserver to cause the server to block in
places such as tcp_sendmsg() as much if not more so than a webserver
:-)
But yes, it should all be tested, for sure.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 15:15 ` Ingo Molnar
@ 2007-02-22 15:29 ` Ingo Molnar
2007-02-22 17:17 ` David Miller
1 sibling, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 15:29 UTC (permalink / raw)
To: David Miller
Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
zach.brown, suparna, davidel, jens.axboe, tglx
* Ingo Molnar <mingo@elte.hu> wrote:
> > The pushback to the primary thread you speak of is just extra work
> > in my mind, for networking. Better to just begin operations and sit
> > in the primary thread(s) waiting for events, and when they arrive
> > push the operations further along using non-blocking writes, reads,
> > and accept() calls. There is no blocking context really needed for
> > these kinds of things, so a mechanism that tries to provide one is a
> > waste.
>
> one question is, what is cheaper, to block out of a read and a write and
^-------to back out
> to set up the event notification and then to return to the user
> context, or to just stay right in there with all the context already
> constructed and on the stack, and schedule away and then come back and
> queue back to the primary thread once the condition the thread is
> waiting for is done? The latter isnt all that unattractive in my mind,
> because it always does forward progress, with minimal 'backout' costs.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 14:47 ` David Miller
2007-02-22 15:02 ` Evgeniy Polyakov
@ 2007-02-22 15:15 ` Ingo Molnar
2007-02-22 15:29 ` Ingo Molnar
2007-02-22 17:17 ` David Miller
2007-02-22 20:13 ` Davide Libenzi
2007-02-22 21:30 ` Zach Brown
3 siblings, 2 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 15:15 UTC (permalink / raw)
To: David Miller
Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
zach.brown, suparna, davidel, jens.axboe, tglx
* David Miller <davem@davemloft.net> wrote:
> The pushback to the primary thread you speak of is just extra work in
> my mind, for networking. Better to just begin operations and sit in
> the primary thread(s) waiting for events, and when they arrive push
> the operations further along using non-blocking writes, reads, and
> accept() calls. There is no blocking context really needed for these
> kinds of things, so a mechanism that tries to provide one is a waste.
one question is, what is cheaper, to block out of a read and a write and
to set up the event notification and then to return to the user context,
or to just stay right in there with all the context already constructed
and on the stack, and schedule away and then come back and queue back to
the primary thread once the condition the thread is waiting for is done?
The latter isn't all that unattractive in my mind, because it always makes
forward progress, with minimal 'backout' costs.
furthermore, in a real webserver there's a whole lot of other stuff
happening too: VFS blocking, mutex/lock blocking, memory pressure
blocking, filesystem blocking, etc., etc. Threadlets/syslets cover them
/all/ and never hold up the primary context: as long as there's more
requests to process, they will be processed. Plus other important
networked workloads, like fileservers are typically on fast LANs and
those requests are very much a fire-and-forget matter most of the time.
in any case, this definitely needs to be measured.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 14:47 ` David Miller
@ 2007-02-22 15:02 ` Evgeniy Polyakov
2007-02-22 15:15 ` Ingo Molnar
` (2 subsequent siblings)
3 siblings, 0 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-22 15:02 UTC (permalink / raw)
To: David Miller
Cc: mingo, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
zach.brown, suparna, davidel, jens.axboe, tglx
On Thu, Feb 22, 2007 at 06:47:04AM -0800, David Miller (davem@davemloft.net) wrote:
> As a side note although Evgeniy likes M:N threading model ideas, they
> are a mine field wrt. signal semantics. Solaris guys took several
> years to get it right, just grep through the Solaris kernel patch
> readme files over the years to get an idea of how bad it can be. I
> would therefore never advocate such an approach.
I have fully synchronous kevent signal delivery for that purpose :)
Having all events synchronous allows trivial handling of them -
including signals.
> The more I think about it, a reasonable solution might actually be to
> use threadlets for disk I/O and pure event based processing for
> networking. It is two different handling paths and non-unified,
> but that might be the price for good performance :-)
Hmm, yes, for such a scenario we need some kind of event delivery
mechanism which would allow waiting on different kinds of events.
In the above sentence I see some painfully familiar letters -
letter k
letter e
letter v
letter e
letter n
letter t
Or more modern trend - async_wait(epoll).
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 14:31 ` Ingo Molnar
2007-02-22 14:47 ` David Miller
@ 2007-02-22 14:59 ` Ingo Molnar
2007-02-22 21:42 ` Michael K. Edwards
2 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 14:59 UTC (permalink / raw)
To: David Miller
Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
zach.brown, suparna, davidel, jens.axboe, tglx
* Ingo Molnar <mingo@elte.hu> wrote:
> Firstly, i dont think you are fully applying the syslet/threadlet
> model. There is no reason why an 'idle' client would have to use up a
> full thread! It all depends on how you use syslets/threadlets, and how
> (frequently) you queue back requests from cachemiss threads back to
> the primary thread. It is only the simplest queueing model where there
> is one thread per request that is currently blocked.
> Syslets/threadlets do /not/ force request processing to be performed
> in the async context forever - the async thread could very much queue
> it back to the primary context. (That's in essence what Tux did.) So
> the same state-machine techniques can be applied on both the syslet
> and the threadlet model, but in much more natural (and thus lower
> overhead) points: /between/ system calls and not in the middle of
> them. There are a number of measures that can be used to keep the
> number of parallel threads down.
i think the best model here is to use kevents or epoll to discover
accept()-able or recv()-able keepalive sockets, and to do the main
request loop via syslets/threadlets, with a 'queue back to the main
context if we went async and if the request is done' feedback mechanism
that keeps the size of the pool down.
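roughly this kind of structure - the async_exec() call below is
hypothetical, it just stands in for a threadlet submission that completes
synchronously when nothing blocks and goes async otherwise:

#include <sys/epoll.h>

long async_exec(long (*fn)(void *), void *arg);     /* hypothetical */

static long handle_request(void *arg)
{
        int fd = (int)(long)arg;

        /* parse the request from fd, do the (possibly blocking) VFS/disk
           work, write the reply - one request, start to finish */
        return 0;
}

void main_loop(int epfd, int listen_fd)
{
        struct epoll_event events[128];
        int i, n;

        for (;;) {
                n = epoll_wait(epfd, events, 128, -1);
                for (i = 0; i < n; i++) {
                        int fd = events[i].data.fd;

                        if (fd == listen_fd) {
                                /* accept() new sockets, add them to epfd */
                        } else {
                                /* runs synchronously unless it blocks */
                                async_exec(handle_request,
                                           (void *)(long)fd);
                        }
                }
                /* pick up requests queued back from async contexts here,
                   to keep the pool size down */
        }
}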
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 13:41 ` David Miller
2007-02-22 14:31 ` Ingo Molnar
@ 2007-02-22 14:53 ` Avi Kivity
1 sibling, 0 replies; 277+ messages in thread
From: Avi Kivity @ 2007-02-22 14:53 UTC (permalink / raw)
To: David Miller
Cc: johnpol, arjan, mingo, drepper, linux-kernel, torvalds, hch,
akpm, alan, zach.brown, suparna, davidel, jens.axboe, tglx
David Miller wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Thu, 22 Feb 2007 15:39:30 +0300
>
>
>> It does not matter - even with threads cost of having thousands of
>> threads is _too_ expensive. So, IMO, it is wrong to have to create
>> 20k threads for the simple web server which only sends one index page to
>> 80k connections with 4k connections per seconds rate.
>>
>> Just have that example in mind - more than 20k blocks in 80k connections
>> over gigabit lan, and it is likely optimistic result, when designing new
>> type of AIO.
>>
>
> I totally agree with Evgeniy on these points.
>
> Using things like syslets and threadlets for networking I/O
> is not a very good idea. Blocking is more the norm than the
> exception for networking I/O.
>
And for O_DIRECT, and for large storage systems which overwhelm caches.
The optimize-for-the-nonblocking-case approach does not fit all
workloads. And of course we have to be able to mix mostly-nonblocking
threadlets and mostly-blocking O_DIRECT and networking.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 14:31 ` Ingo Molnar
@ 2007-02-22 14:47 ` David Miller
2007-02-22 15:02 ` Evgeniy Polyakov
` (3 more replies)
2007-02-22 14:59 ` Ingo Molnar
2007-02-22 21:42 ` Michael K. Edwards
2 siblings, 4 replies; 277+ messages in thread
From: David Miller @ 2007-02-22 14:47 UTC (permalink / raw)
To: mingo
Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
zach.brown, suparna, davidel, jens.axboe, tglx
From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 22 Feb 2007 15:31:45 +0100
> Firstly, i dont think you are fully applying the syslet/threadlet model.
> There is no reason why an 'idle' client would have to use up a full
> thread! It all depends on how you use syslets/threadlets, and how
> (frequently) you queue back requests from cachemiss threads back to the
> primary thread. It is only the simplest queueing model where there is
> one thread per request that is currently blocked. Syslets/threadlets do
> /not/ force request processing to be performed in the async context
> forever - the async thread could very much queue it back to the primary
> context. (That's in essence what Tux did.) So the same state-machine
> techniques can be applied on both the syslet and the threadlet model,
> but in much more natural (and thus lower overhead) points: /between/
> system calls and not in the middle of them. There are a number of
> measures that can be used to keep the number of parallel threads down.
Ok.
> Secondly, even assuming lots of pending requests/async-threads and a
> naive queueing model, an open request will eat up resources on the
> server no matter what. So if your point is that "+4K of kernel stack
> pinned down per open, blocked request makes syslets and threadlets not a
> very good idea", then i'd like to disagree with that: while it wont be
> zero-cost (4K does cost you 400MB of RAM per 100,000 outstanding
> threads), it's often comparable to the other RAM costs that are already
> attached to an open connection.
The 400MB is extra, and it's in no way commensurate with the cost
of the TCP socket itself even including the application specific
state being used for that connection.
Even if it would be _equal_, we would be doubling the memory
requirements for such a scenario.
This is why I dislike the threadlet model, when used in that way.
The pushback to the primary thread you speak of is just extra work in
my mind, for networking. Better to just begin operations and sit in
the primary thread(s) waiting for events, and when they arrive push
the operations further along using non-blocking writes, reads, and
accept() calls. There is no blocking context really needed for these
kinds of things, so a mechanism that tries to provide one is a waste.
As a side note, although Evgeniy likes M:N threading model ideas, they
are a minefield wrt. signal semantics. Solaris guys took several
years to get it right, just grep through the Solaris kernel patch
readme files over the years to get an idea of how bad it can be. I
would therefore never advocate such an approach.
The more I think about it, a reasonable solution might actually be to
use threadlets for disk I/O and pure event based processing for
networking. It is two different handling paths and non-unified,
but that might be the price for good performance :-)
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 14:17 ` Suparna Bhattacharya
@ 2007-02-22 14:36 ` Ingo Molnar
2007-02-23 14:23 ` Suparna Bhattacharya
0 siblings, 1 reply; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 14:36 UTC (permalink / raw)
To: Suparna Bhattacharya
Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Davide Libenzi, Jens Axboe,
Thomas Gleixner
* Suparna Bhattacharya <suparna@in.ibm.com> wrote:
> > maybe it will, maybe it wont. Lets try? There is no true difference
> > between having a 'request structure' that represents the current
> > state of the HTTP connection plus a statemachine that moves that
> > request between various queues, and a 'kernel stack' that goes in
> > and out of runnable state and carries its processing state in its
> > stack - other than the amount of RAM they take. (the kernel stack is
> > 4K at a minimum - so with a million outstanding requests they would
> > use up 4 GB of RAM. With 20k outstanding requests it's 80 MB of RAM
> > - that's acceptable.)
>
> At what point are the cachemiss threads destroyed ? In other words how
> well does this adapt to load variations ? For example, would this 80MB
> of RAM continue to be locked down even during periods of lighter loads
> thereafter ?
you can destroy them at will from user-space too - just start a slow
timer that zaps them if load goes down. I can add a
sys_async_thread_exit(nr_threads) API to be able to drive this without
knowing the TIDs of those threads, and/or i can add a kernel-internal
mechanism to zap inactive threads. It would be rather easy and
low-overhead - the v2 code already had a max_nr_threads tunable, i can
reintroduce it. So the size of the pool of contexts does not have to be
permanent at all.
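the user-space side could be as simple as the sketch below - note that
sys_async_thread_exit() is only proposed above, it does not exist, and
the thresholds are placeholders:

/* hypothetical, as proposed above - not an existing syscall */
long sys_async_thread_exit(unsigned long nr_threads);

/* run from a slow periodic timer in the server: once load drops, trim
   the idle async contexts so their stacks are not pinned forever */
void trim_async_pool(unsigned long busy, unsigned long total)
{
        if (total > 64 && busy * 2 < total)
                sys_async_thread_exit((total - busy) / 2);
}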
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 13:41 ` David Miller
@ 2007-02-22 14:31 ` Ingo Molnar
2007-02-22 14:47 ` David Miller
` (2 more replies)
2007-02-22 14:53 ` Avi Kivity
1 sibling, 3 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 14:31 UTC (permalink / raw)
To: David Miller
Cc: johnpol, arjan, drepper, linux-kernel, torvalds, hch, akpm, alan,
zach.brown, suparna, davidel, jens.axboe, tglx
* David Miller <davem@davemloft.net> wrote:
> If used for networking one could easily make this new interface create
> an arbitrary number of threads by just opening up that many
> connections to such a server and just sitting there not reading
> anything from any of the client sockets. And this happens
> non-maliciously for slow clients, whether that is due to application
> blockage or the characteristics of the network path.
there are two issues on which i'd like to disagree.
Firstly, i dont think you are fully applying the syslet/threadlet model.
There is no reason why an 'idle' client would have to use up a full
thread! It all depends on how you use syslets/threadlets, and how
(frequently) you queue back requests from cachemiss threads back to the
primary thread. It is only the simplest queueing model where there is
one thread per request that is currently blocked. Syslets/threadlets do
/not/ force request processing to be performed in the async context
forever - the async thread could very much queue it back to the primary
context. (That's in essence what Tux did.) So the same state-machine
techniques can be applied on both the syslet and the threadlet model,
but in much more natural (and thus lower overhead) points: /between/
system calls and not in the middle of them. There are a number of
measures that can be used to keep the number of parallel threads down.
Secondly, even assuming lots of pending requests/async-threads and a
naive queueing model, an open request will eat up resources on the
server no matter what. So if your point is that "+4K of kernel stack
pinned down per open, blocked request makes syslets and threadlets not a
very good idea", then i'd like to disagree with that: while it wont be
zero-cost (4K does cost you 400MB of RAM per 100,000 outstanding
threads), it's often comparable to the other RAM costs that are already
attached to an open connection.
(let me know if i misunderstood your point.)
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 12:59 ` Ingo Molnar
2007-02-22 13:32 ` Evgeniy Polyakov
@ 2007-02-22 14:17 ` Suparna Bhattacharya
2007-02-22 14:36 ` Ingo Molnar
2007-02-22 21:24 ` Michael K. Edwards
2 siblings, 1 reply; 277+ messages in thread
From: Suparna Bhattacharya @ 2007-02-22 14:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Evgeniy Polyakov, Ulrich Drepper, linux-kernel, Linus Torvalds,
Arjan van de Ven, Christoph Hellwig, Andrew Morton, Alan Cox,
Zach Brown, David S. Miller, Davide Libenzi, Jens Axboe,
Thomas Gleixner
On Thu, Feb 22, 2007 at 01:59:31PM +0100, Ingo Molnar wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > It is not a TUX anymore - you had 1024 threads, and all of them will
> > be consumed by tcp_sendmsg() for slow clients - rescheduling will kill
> > a machine.
>
> maybe it will, maybe it wont. Lets try? There is no true difference
> between having a 'request structure' that represents the current state
> of the HTTP connection plus a statemachine that moves that request
> between various queues, and a 'kernel stack' that goes in and out of
> runnable state and carries its processing state in its stack - other
> than the amount of RAM they take. (the kernel stack is 4K at a minimum -
> so with a million outstanding requests they would use up 4 GB of RAM.
> With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)
At what point are the cachemiss threads destroyed ? In other words how well
does this adapt to load variations ? For example, would this 80MB of RAM
continue to be locked down even during periods of lighter loads thereafter ?
Regards
Suparna
>
> > My tests show that with 4k connections per second (8k concurrency)
> > more than 20k connections of 80k total block in tcp_sendmsg() over
> > gigabit lan between quite fast machines.
>
> yeah. Note that you can have a million sleeping threads if you want, the
> scheduler wont care. What matters more is the amount of true concurrency
> that is present at any given time. But yes, i agree that overscheduling
> can be a problem.
>
> btw., what is the measurement utility you are using with kevents ('ab'
> perhaps, with a high -c concurrency count?), and which webserver are you
> using? (light-httpd?)
>
> Ingo
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 12:39 ` Evgeniy Polyakov
@ 2007-02-22 13:41 ` David Miller
2007-02-22 14:31 ` Ingo Molnar
2007-02-22 14:53 ` Avi Kivity
0 siblings, 2 replies; 277+ messages in thread
From: David Miller @ 2007-02-22 13:41 UTC (permalink / raw)
To: johnpol
Cc: arjan, mingo, drepper, linux-kernel, torvalds, hch, akpm, alan,
zach.brown, suparna, davidel, jens.axboe, tglx
From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Thu, 22 Feb 2007 15:39:30 +0300
> It does not matter - even with threads cost of having thousands of
> threads is _too_ expensive. So, IMO, it is wrong to have to create
> 20k threads for the simple web server which only sends one index page to
> 80k connections with 4k connections per seconds rate.
>
> Just have that example in mind - more than 20k blocks in 80k connections
> over gigabit lan, and it is likely optimistic result, when designing new
> type of AIO.
I totally agree with Evgeniy on these points.
Using things like syslets and threadlets for networking I/O
is not a very good idea. Blocking is more the norm than the
exception for networking I/O.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 12:59 ` Ingo Molnar
@ 2007-02-22 13:32 ` Evgeniy Polyakov
2007-02-22 19:46 ` Davide Libenzi
2007-02-23 11:51 ` Ingo Molnar
2007-02-22 14:17 ` Suparna Bhattacharya
2007-02-22 21:24 ` Michael K. Edwards
2 siblings, 2 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-22 13:32 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thu, Feb 22, 2007 at 01:59:31PM +0100, Ingo Molnar (mingo@elte.hu) wrote:
>
> * Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
>
> > It is not a TUX anymore - you had 1024 threads, and all of them will
> > be consumed by tcp_sendmsg() for slow clients - rescheduling will kill
> > a machine.
>
> maybe it will, maybe it wont. Lets try? There is no true difference
> between having a 'request structure' that represents the current state
> of the HTTP connection plus a statemachine that moves that request
> between various queues, and a 'kernel stack' that goes in and out of
> runnable state and carries its processing state in its stack - other
> than the amount of RAM they take. (the kernel stack is 4K at a minimum -
> so with a million outstanding requests they would use up 4 GB of RAM.
> With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)
I tried already :) - I just made allocations atomic in tcp_sendmsg() and
ended up with 1/4 of the sends blocking (I counted both allocation
failures and socket queue overflows). Those 20k blocked requests were
created in about 20 seconds, so roughly speaking we have 1k of thread
creation/freeing per second - do we want this?
> > My tests show that with 4k connections per second (8k concurrency)
> > more than 20k connections of 80k total block in tcp_sendmsg() over
> > gigabit lan between quite fast machines.
>
> yeah. Note that you can have a million sleeping threads if you want, the
> scheduler wont care. What matters more is the amount of true concurrency
> that is present at any given time. But yes, i agree that overscheduling
> can be a problem.
Before I started the M:N threading library implementation I checked the
threading performance of the current POSIX library - I created a simple
pool of threads and 'sent' a message between them using futex wait/wake
(sema_post/wait) one by one. The results are quite disappointing - given
that the number of sleeping threads was in the hundreds, kernel
rescheduling is about 10 times slower than the setjmp-based switching
(I think) used in Erlang.
The above example is not 100% correct, I understand, but the situation
with thread-like AIO is much worse - it is possible that several threads
will be ready simultaneously, so rescheduling between them will kill
performance.
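For reference, the shape of such a test - a two-thread futex token
ping-pong sketch (the original used a larger pool, but the per-hop wake
plus reschedule through the kernel is the same thing being measured;
wrap the loop in gettimeofday() to get per-hop numbers):

#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ITERATIONS 100000

static int turn;        /* 0: main's turn, 1: worker's turn */

static void wait_for(int *addr, int want)
{
        while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != want)
                syscall(SYS_futex, addr, FUTEX_WAIT, 1 - want, NULL, NULL, 0);
}

static void hand_over(int *addr, int to)
{
        __atomic_store_n(addr, to, __ATOMIC_RELEASE);
        syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}

static void *worker(void *arg)
{
        int i;

        for (i = 0; i < ITERATIONS; i++) {
                wait_for(&turn, 1);    /* sleep until main hands us the token */
                hand_over(&turn, 0);   /* one kernel wake + reschedule per hop */
        }
        return NULL;
}

int main(void)
{
        pthread_t t;
        int i;

        pthread_create(&t, NULL, worker, NULL);
        for (i = 0; i < ITERATIONS; i++) {
                hand_over(&turn, 1);
                wait_for(&turn, 0);
        }
        pthread_join(t, NULL);
        return 0;
}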
> btw., what is the measurement utility you are using with kevents ('ab'
> perhaps, with a high -c concurrency count?), and which webserver are you
> using? (light-httpd?)
Yes, it is ab and lighttpd; before that it was httperf (unfair under high
load due to poll/select usage) and my own web server (evserver_kevent.c).
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 11:31 ` Evgeniy Polyakov
2007-02-22 11:52 ` Arjan van de Ven
@ 2007-02-22 12:59 ` Ingo Molnar
2007-02-22 13:32 ` Evgeniy Polyakov
` (2 more replies)
2007-02-25 22:44 ` Linus Torvalds
2 siblings, 3 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 12:59 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> It is not a TUX anymore - you had 1024 threads, and all of them will
> be consumed by tcp_sendmsg() for slow clients - rescheduling will kill
> a machine.
maybe it will, maybe it wont. Lets try? There is no true difference
between having a 'request structure' that represents the current state
of the HTTP connection plus a statemachine that moves that request
between various queues, and a 'kernel stack' that goes in and out of
runnable state and carries its processing state in its stack - other
than the amount of RAM they take. (the kernel stack is 4K at a minimum -
so with a million outstanding requests they would use up 4 GB of RAM.
With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)
> My tests show that with 4k connections per second (8k concurrency)
> more than 20k connections of 80k total block in tcp_sendmsg() over
> gigabit lan between quite fast machines.
yeah. Note that you can have a million sleeping threads if you want, the
scheduler wont care. What matters more is the amount of true concurrency
that is present at any given time. But yes, i agree that overscheduling
can be a problem.
btw., what is the measurement utility you are using with kevents ('ab'
perhaps, with a high -c concurrency count?), and which webserver are you
using? (light-httpd?)
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 11:52 ` Arjan van de Ven
@ 2007-02-22 12:39 ` Evgeniy Polyakov
2007-02-22 13:41 ` David Miller
0 siblings, 1 reply; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-22 12:39 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Linus Torvalds,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Thu, Feb 22, 2007 at 12:52:39PM +0100, Arjan van de Ven (arjan@infradead.org) wrote:
>
> > It is not a TUX anymore - you had 1024 threads, and all of them will be
> > consumed by tcp_sendmsg() for slow clients - rescheduling will kill a
> > machine.
>
> I think it's time to make a split in what "context switch" or
> "reschedule" means...
>
> there are two types of context switch:
>
> 1) To a different process. This means teardown of the TLB, going to a
> new MMU state, saving FPU state etc etc etc. This is obviously quite
> expensive
>
> 2) To a thread of the same process. No TLB flush no new MMU state,
> effectively all it does is getting a new task struct on the kernel side,
> and a new ESP/EIP pair on the userspace side. If there is FPU code
> involved that gets saved as well.
>
> Number 1 is very expensive and that is what is really worrying normally;
> number 2 is a LOT lighter weight, and while Linux is a bit heavy there,
> it can be made lighter... there's no fundamental reason for it to be
> really expensive.
It does not matter - even with threads, the cost of having thousands of
threads is _too_ expensive. So, IMO, it is wrong to have to create
20k threads for a simple web server which only sends one index page to
80k connections at a rate of 4k connections per second.
Just keep that example in mind - more than 20k out of 80k connections
block over gigabit LAN, and that is likely an optimistic result - when
designing a new type of AIO.
> --
> if you want to mail me at work (you don't), use arjan (at) linux.intel.com
> Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 11:31 ` Evgeniy Polyakov
@ 2007-02-22 11:52 ` Arjan van de Ven
2007-02-22 12:39 ` Evgeniy Polyakov
2007-02-22 12:59 ` Ingo Molnar
2007-02-25 22:44 ` Linus Torvalds
2 siblings, 1 reply; 277+ messages in thread
From: Arjan van de Ven @ 2007-02-22 11:52 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ingo Molnar, Ulrich Drepper, linux-kernel, Linus Torvalds,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
> It is not a TUX anymore - you had 1024 threads, and all of them will be
> consumed by tcp_sendmsg() for slow clients - rescheduling will kill a
> machine.
I think it's time to make a split in what "context switch" or
"reschedule" means...
there are two types of context switch:
1) To a different process. This means teardown of the TLB, going to a
new MMU state, saving FPU state etc etc etc. This is obviously quite
expensive
2) To a thread of the same process. No TLB flush, no new MMU state;
effectively all it does is get a new task struct on the kernel side,
and a new ESP/EIP pair on the userspace side. If there is FPU code
involved that gets saved as well.
Number 1 is very expensive and that is what is really worrying normally;
number 2 is a LOT lighter weight, and while Linux is a bit heavy there,
it can be made lighter... there's no fundamental reason for it to be
really expensive.
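As a rough illustration of how cheap case 2 can be in practice, here is a
minimal sketch (not from this thread; the pipe-based ping-pong, the
iteration count and all names are illustrative assumptions): two threads
of the same process force context switches between each other by bouncing
a byte over a pair of pipes, so the average round-trip time gives an upper
bound on the cost of two same-mm switches.

/* pingpong.c - two threads of one process forcing context switches via
 * pipes.  Both threads share a single mm, so every switch is the cheap
 * "case 2" above: no TLB flush, just a new task struct and registers.
 * Build with: gcc -O2 -pthread pingpong.c
 * All names and the iteration count are illustrative only. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define ITERS 100000

static int ab[2], ba[2];               /* pipe a->b and pipe b->a */

static void *peer(void *arg)
{
    char c;
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        if (read(ab[0], &c, 1) != 1)   /* block until pinged ... */
            exit(1);
        if (write(ba[1], &c, 1) != 1)  /* ... then pong back */
            exit(1);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec t0, t1;
    char c = 'x';

    if (pipe(ab) < 0 || pipe(ba) < 0)
        return 1;
    pthread_create(&t, NULL, peer, NULL);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        write(ab[1], &c, 1);
        read(ba[0], &c, 1);            /* each round trip ~= 2 switches */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.0f ns per round trip (two same-mm context switches)\n",
           ns / ITERS);

    pthread_join(t, NULL);
    return 0;
}

Running the same ping-pong between two separate processes (fork() instead
of pthread_create()) would expose the additional cost of case 1.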
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 7:40 ` Ingo Molnar
@ 2007-02-22 11:31 ` Evgeniy Polyakov
2007-02-22 11:52 ` Arjan van de Ven
` (2 more replies)
2007-02-22 19:38 ` Davide Libenzi
2007-02-22 21:23 ` Zach Brown
2 siblings, 3 replies; 277+ messages in thread
From: Evgeniy Polyakov @ 2007-02-22 11:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ulrich Drepper, linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Zach Brown,
David S. Miller, Suparna Bhattacharya, Davide Libenzi,
Jens Axboe, Thomas Gleixner
Hi Ingo, developers.
On Thu, Feb 22, 2007 at 08:40:44AM +0100, Ingo Molnar (mingo@elte.hu) wrote:
> Syslets/threadlets on the other hand, once the core is implemented, have
> near zero ongoing maintenance cost (compared to KAIO pushed into every
> IO subsystem) and cover all IO disciplines and API variants immediately,
> and they are as perfectly asynchronous as it gets.
>
> So all in all, i used to think that AIO state-machines have a long-term
> place within the kernel, but with syslets i think i've proven myself
> embarrassingly wrong =B-)
Hmm...
Try running a network web server under huge load built on top of
syslets/threadlets.
It is not a TUX anymore - you had 1024 threads, and all of them will be
consumed by tcp_sendmsg() for slow clients - rescheduling will kill a
machine.
My tests show that with 4k connections per second (8k concurrency) more
than 20k connections of 80k total block in tcp_sendmsg() over gigabit
lan between quite fast machines.
Or should threadlet/syslet AIO not be used for networking either?
> Ingo
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-22 10:01 ` Suparna Bhattacharya
@ 2007-02-22 11:20 ` Ingo Molnar
0 siblings, 0 replies; 277+ messages in thread
From: Ingo Molnar @ 2007-02-22 11:20 UTC (permalink / raw)
To: Suparna Bhattacharya
Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, Evgeniy Polyakov, David S. Miller, Davide Libenzi,
Jens Axboe, Thomas Gleixner
* Suparna Bhattacharya <suparna@in.ibm.com> wrote:
> > Threadlets share much of the scheduling infrastructure with syslets.
> >
> > Syslets (small, kernel-side, scripted "syscall plugins") are still
> > supported - they are (much...) harder to program than threadlets but
> > they allow the highest performance. Core infrastructure libraries
> > like glibc/libaio are expected to use syslets. Jens Axboe's FIO tool
> > already includes support for v2 syslets, and the following patch
> > updates FIO to
>
> Ah, glad to see that - I was wondering if it was worthwhile to try
> adding syslet support to aio-stress to be able to perform some
> comparisons. [...]
i think it would definitely be worth it.
> [...] Hopefully FIO should be able to generate a similar workload, but
> I haven't tried it yet so am not sure. Are you planning to upload some
> results (so I can compare them with patterns I am familiar with)?
i haven't had time yet to do careful benchmarks. Right now my impression from
quick testing is that libaio performance can be exceeded via syslets. So
it would be very interesting if you could try this too, independently of
me.
Ingo
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
2007-02-21 21:13 Ingo Molnar
2007-02-21 22:46 ` Michael K. Edwards
@ 2007-02-22 10:01 ` Suparna Bhattacharya
2007-02-22 11:20 ` Ingo Molnar
2007-02-24 18:34 ` Evgeniy Polyakov
2 siblings, 1 reply; 277+ messages in thread
From: Suparna Bhattacharya @ 2007-02-22 10:01 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Arjan van de Ven,
Christoph Hellwig, Andrew Morton, Alan Cox, Ulrich Drepper,
Zach Brown, Evgeniy Polyakov, David S. Miller, Davide Libenzi,
Jens Axboe, Thomas Gleixner
On Wed, Feb 21, 2007 at 10:13:55PM +0100, Ingo Molnar wrote:
> this is the v3 release of the syslet/threadlet subsystem:
>
> http://redhat.com/~mingo/syslet-patches/
>
> This release came a few days later than i originally wanted, because
> i've implemented many fundamental changes to the code. The biggest
> highlights of v3 are:
>
> - "Threadlets": the introduction of the 'threadlet' execution concept.
>
> - syslets: multiple rings support with no kernel-side footprint, the
> elimination of mlock() pinning, no async_register/unregister() calls
> needed anymore and more.
>
> "Threadlets" are basically the user-space equivalent of syslets: small
> functions of execution that the kernel attempts to execute without
> scheduling. If the threadlet blocks, the kernel creates a real thread
> from it, and execution continues in that thread. The 'head' context (the
> context that never blocks) returns to the original function that called
> the threadlet. Threadlets are very easy to use:
>
> long my_threadlet_fn(void *data)
> {
> char *name = data;
> int fd;
>
> fd = open(name, O_RDONLY);
> if (fd < 0)
> goto out;
>
> fstat(fd, &stat);
> read(fd, buf, count);
> ...
>
> out:
> return threadlet_complete();
> }
>
>
> main()
> {
> done = threadlet_exec(threadlet_fn, new_stack, &user_head);
> if (!done)
> reqs_queued++;
> }
>
> There is no limitation whatsoever on what a threadlet function can
> look like: it can use arbitrary system calls and all execution will be
> procedural. There is no 'registration' needed when running threadlets
> either: the kernel will take care of all the details, user-space just
> runs a threadlet without any preparation and that's it.
>
> Completion of async threadlets can be done from user-space via any of
> the existing APIs: in threadlet-test.c (see the async-test-v3.tar.gz
> user-space examples at the URL above) i've for example used a futex
> between the head and the async threads to do threadlet notification. But
> select(), poll() or signals can be used too - whichever is most
> convenient to the application writer.
>
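[Aside: for readers unfamiliar with the futex mechanism referred to above,
here is a generic, self-contained sketch of one-shot futex completion
notification. This is not the async-test-v3 code - the one-word flag
protocol (0 = pending, 1 = done) and every name in it are invented purely
for illustration.]

/* futex_done.c - generic futex-based completion notification between a
 * 'head' context and an async worker thread.  Illustrative only; not
 * the actual threadlet example code.
 * Build with: gcc -O2 -pthread futex_done.c */
#include <linux/futex.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static uint32_t done_flag;                 /* shared completion word */

static void signal_completion(void)        /* async side */
{
    __atomic_store_n(&done_flag, 1, __ATOMIC_RELEASE);
    syscall(SYS_futex, &done_flag, FUTEX_WAKE, 1, NULL, NULL, 0);
}

static void wait_for_completion(void)      /* head side */
{
    /* FUTEX_WAIT returns immediately if the word is already non-zero */
    while (__atomic_load_n(&done_flag, __ATOMIC_ACQUIRE) == 0)
        syscall(SYS_futex, &done_flag, FUTEX_WAIT, 0, NULL, NULL, 0);
}

static void *worker(void *arg)
{
    (void)arg;
    /* ... blocking work (open/read/etc.) would happen here ... */
    signal_completion();
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    wait_for_completion();
    puts("async work completed");
    pthread_join(t, NULL);
    return 0;
}

The head only blocks in FUTEX_WAIT while the completion word is still
zero, so a completion that fires before the wait is never missed.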
> Threadlets can also be thought of as 'optional threads': they