LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Stephen Hemminger <shemminger@linux-foundation.org>
To: Willy Tarreau <w@1wt.eu>
Cc: David Miller <davem@davemloft.net>,
rkuhn@e18.physik.tu-muenchen.de, andi@firstfloor.org,
dada1@cosmosbay.com, jengelh@linux01.gwdg.de,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org
Subject: Re: [PATCH] tcp_cubic: use 32 bit math
Date: Wed, 21 Mar 2007 11:54:19 -0700 [thread overview]
Message-ID: <20070321115419.483a655a@freekitty> (raw)
In-Reply-To: <20070313205020.GA5418@1wt.eu>
On Tue, 13 Mar 2007 21:50:20 +0100
Willy Tarreau <w@1wt.eu> wrote:
> Hi Stephen,
>
> On Mon, Mar 12, 2007 at 02:11:56PM -0700, Stephen Hemminger wrote:
> > > Oh BTW, I have a newer version with a first approximation of the
> > > cbrt() before the div64_64, which allows us to reduce from 3 div64
> > > to only 2 div64. This results in a version which is twice as fast
> > > as the initial one (ncubic), but with slightly less accuracy (0.286%
> > > compared to 0.247). But I see that other functions such as hcbrt()
> > > had a 1.5% avg error, so I think this is not dramatic.
> >
> > Ignore my hcbrt() it was a less accurate version of andi's stuff.
>
> OK.
>
> > > Also, I managed to remove all other divides, to be kind with CPUs
> > > having a slow divide instruction or no divide at all. Since we compute
> > > on limited range (22 bits), we can multiply then shift right. It shows
> > > me even slightly better time on pentium-m and athlon, with a slightly
> > > higher avg error (0.297% compared to 0.286%), and slightly smaller
> > > code.
> >
> > What does the code look like?
>
> Well, I have cleaned it a little bit, there were more comments and ifdefs
> than code ! I've appended it to the end of this mail.
>
> I have changed it a bit, because I noticed that integer divide precision
> was so coarse that there were other possibilities to play with the bits.
>
> I have experimented with combinations of several methods :
> - replace integer divides with multiplies/shifts where possible.
>
> - compensation for divide imprecisions by adding/removing small
> values bofore/after them. Often, the integer result of 1/(x*(x-1))
> is closer to (float)1/(float)x^2 than 1/(x*x). This is because
> the divide always truncates the result.
>
> - use direct result lookup for small values. Small inputs give small
> outputs which have very few moving bits. Many different values fit
> in a 32bit integer, so we use a shift offset to lookup the value.
> I used this in an fls function I wrote a while ago, that I should
> also post because it is up to twice as fast as the kernel's.
> Sometimes it seems faster to lookup in from memory, sometimes it
> is faster from an immediate value. Maybe more visible differences
> would show up on RISC CPUs where loading 32 bits immediate needs
> two instructions. I don't know yet, I've not tested on my sparc
> yet.
>
> - use small lookup tables (64 bytes) with 6 bits inputs and at least
> as many on output. We only lookup the 6 MSB and return the 2-3 MSB
> of the result.
>
> - iterative search and manual refinment of the lookup tables for best
> accuracy. The avg error rate can easily be halved this way.
>
> I have duplicated tried several functions with 0, 1, 2 and 3 divides.
> Several of them offer better accuracy over what we currently have, in
> less cycles. Others offer faster results (up to 5 times) with slightly
> less accuracy.
>
> There is one function which is not to be used, but is just here for
> comparison (ncubic_0div). It does no divide but has awful avg error.
>
> But one which is interesting is the ncubic_tab0. It does not use any
> divide at all, even not any div64. It shows a 0.6% avg error, which I'm
> not sure is enough or not. It is 6.7 times faster than initial ncubic()
> with less accuracy, and 4 times smaller. I suspect that it can differ
> more on architectures which have no divide instruction.
>
> Is 0.6% avg error rate is too much, ncubic_tab1() uses one single div64
> and is twice slower (still nearly 3 times faster than ncubic). It show
> 0.195% avg error, which is better than initial ncubic. I think that it
> is a good tradeoff.
>
> If best accuracy is an absolute requirement, then I have a variation of
> ncubic (ncubic_3div) which does 0.17% in 2/3 of the time (compared to
> 0.247%), and which is slightly smaller.
>
> I have also added a "size" column, indicating approximative function
> size, provided that the compiler does not reorder the code. On gcc 3.4,
> it's OK, but 4.1 returns garbage. That does not matter, it's just a
> rough estimate anyway.
>
> Here are the results classed by speed :
>
> /* Sample output on a Pentium-M 600 MHz :
>
> Function clocks mean(us) max(us) std(us) Avg err size
> ncubic_tab0 79 0.66 7.20 1.04 0.613% 160
> ncubic_0div 84 0.70 7.64 1.57 4.521% 192
> ncubic_1div 178 1.48 16.27 1.81 0.443% 336
> ncubic_tab1 179 1.49 16.34 1.85 0.195% 320
> ncubic_ndiv3 263 2.18 24.04 3.59 0.250% 512
> ncubic_2div 270 2.24 24.70 2.77 0.187% 512
> ncubic32_1 359 2.98 32.81 3.59 0.238% 544
> ncubic_3div 361 2.99 33.08 3.79 0.170% 656
> ncubic32 364 3.02 33.29 3.51 0.247% 544
> ncubic 529 4.39 48.39 4.92 0.247% 720
> hcbrt 539 4.47 49.25 5.98 1.580% 96
> ocubic 732 4.93 61.83 7.22 0.274% 320
> acbrt 842 6.98 76.73 8.55 0.275% 192
> bictcp 1032 6.95 86.30 9.04 0.172% 768
>
> And now by avg error :
>
> ncubic_3div 361 2.99 33.08 3.79 0.170% 656
> bictcp 1032 6.95 86.30 9.04 0.172% 768
> ncubic_2div 270 2.24 24.70 2.77 0.187% 512
> ncubic_tab1 179 1.49 16.34 1.85 0.195% 320
> ncubic32_1 359 2.98 32.81 3.59 0.238% 544
> ncubic 529 4.39 48.39 4.92 0.247% 720
> ncubic32 364 3.02 33.29 3.51 0.247% 544
> ncubic_ndiv3 263 2.18 24.04 3.59 0.250% 512
> ocubic 732 4.93 61.83 7.22 0.274% 320
> acbrt 842 6.98 76.73 8.55 0.275% 192
> ncubic_1div 178 1.48 16.27 1.81 0.443% 336
> ncubic_tab0 79 0.66 7.20 1.04 0.613% 160
> hcbrt 539 4.47 49.25 5.98 1.580% 96
> ncubic_0div 84 0.70 7.64 1.57 4.521% 192
>
>
> And here comes the code. I have tried to document it a bit, as much
> as can be done on experimentation code. It is often easier to use
> a pencil and paper to understand how the bits move.
>
> Regards,
> Willy
>
The following version of div64_64 is faster because do_div already
optimized for the 32 bit case..
I get the following results on ULV Core Solo (ie slow current processor)
and the following on 64bit Core Duo. ncubic_tab1 seems like
the best (no additional error and about as fast)
ULV Core Solo
Function clocks mean(us) max(us) std(us) Avg err size
ncubic_tab0 192 11.24 45.10 15.28 0.450% -2262
ncubic_0div 201 11.77 47.23 27.40 3.357% -2404
ncubic_1div 324 19.02 76.32 25.82 0.189% -2567
ncubic_tab1 326 19.13 76.73 23.71 0.043% -2059
ncubic_2div 456 26.72 108.92 493.16 0.028% -2790
ncubic_ndiv3 463 27.15 133.37 1889.39 0.104% -3344
ncubic32 549 32.18 130.59 508.97 0.041% -3794
ncubic32_1 574 33.66 138.32 548.48 0.029% -3604
ncubic_3div 581 34.04 140.24 608.55 0.018% -3050
ncubic 733 42.92 173.35 523.19 0.041% 299
ocubic 1046 61.25 283.68 3305.65 0.027% -2232
acbrt 1149 67.32 284.91 1941.55 0.029% 168
bictcp 1663 97.41 394.29 604.86 0.017% 628
Core 2 Duo
Function clocks mean(us) max(us) std(us) Avg err size
ncubic_0div 74 0.03 1.60 0.07 3.357% -2101
ncubic_tab0 74 0.03 1.60 0.04 0.450% -2029
ncubic_1div 142 0.07 3.11 1.05 0.189% -2195
ncubic_tab1 144 0.07 3.18 1.02 0.043% -1638
ncubic_2div 216 0.10 4.74 1.07 0.028% -2326
ncubic_ndiv3 219 0.10 4.76 1.04 0.104% -2709
ncubic32 269 0.13 5.87 1.13 0.041% -1500
ncubic32_1 272 0.13 5.92 1.10 0.029% -2881
ncubic 273 0.13 5.96 1.13 0.041% -1763
ncubic_3div 290 0.14 6.32 1.01 0.018% -2499
acbrt 430 0.20 9.42 1.18 0.029% 77
ocubic 444 0.21 9.82 1.82 0.027% -1924
bictcp 549 0.26 12.06 1.68 0.017% 236
--
Stephen Hemminger <shemminger@linux-foundation.org>
next prev parent reply other threads:[~2007-03-21 18:59 UTC|newest]
Thread overview: 62+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-02-24 1:05 [RFC] div64_64 support Stephen Hemminger
2007-02-24 16:19 ` Sami Farin
2007-02-26 19:28 ` Stephen Hemminger
2007-02-26 19:39 ` David Miller
2007-02-26 20:09 ` Jan Engelhardt
2007-02-26 21:28 ` Stephen Hemminger
2007-02-27 1:20 ` H. Peter Anvin
2007-02-27 3:45 ` Segher Boessenkool
2007-02-26 22:31 ` Stephen Hemminger
2007-02-26 23:02 ` Jan Engelhardt
2007-02-26 23:44 ` Stephen Hemminger
2007-02-27 0:05 ` Jan Engelhardt
2007-02-27 0:07 ` Stephen Hemminger
2007-02-27 0:14 ` Jan Engelhardt
2007-02-27 6:21 ` Dan Williams
2007-03-03 2:31 ` Andi Kleen
2007-03-05 23:57 ` Stephen Hemminger
2007-03-06 0:25 ` David Miller
2007-03-06 13:36 ` Andi Kleen
2007-03-06 14:04 ` [RFC] div64_64 support II Andi Kleen
2007-03-06 17:43 ` Dagfinn Ilmari Mannsåker
2007-03-06 18:25 ` David Miller
2007-03-06 18:48 ` H. Peter Anvin
2007-03-06 13:34 ` [RFC] div64_64 support Andi Kleen
2007-03-06 14:19 ` Eric Dumazet
2007-03-06 14:45 ` Andi Kleen
2007-03-06 15:10 ` Roland Kuhn
2007-03-06 18:29 ` Stephen Hemminger
2007-03-06 19:48 ` Andi Kleen
2007-03-06 20:04 ` Stephen Hemminger
2007-03-06 21:53 ` Sami Farin
2007-03-06 22:24 ` Sami Farin
2007-03-07 0:00 ` Stephen Hemminger
2007-03-07 0:05 ` David Miller
2007-03-07 0:05 ` Sami Farin
2007-03-07 16:11 ` Chuck Ebbert
2007-03-07 18:32 ` Sami Farin
2007-03-08 18:23 ` asm volatile [Was: [RFC] div64_64 support] Sami Farin
2007-03-08 22:01 ` asm volatile David Miller
2007-03-06 21:58 ` [RFC] div64_64 support David Miller
2007-03-06 22:47 ` [PATCH] tcp_cubic: faster cube root Stephen Hemminger
2007-03-06 22:58 ` cube root benchmark code Stephen Hemminger
2007-03-07 6:08 ` Update to " Willy Tarreau
2007-03-08 1:07 ` [PATCH] tcp_cubic: use 32 bit math Stephen Hemminger
2007-03-08 2:55 ` David Miller
2007-03-08 3:10 ` Stephen Hemminger
2007-03-08 3:51 ` David Miller
2007-03-10 11:48 ` Willy Tarreau
2007-03-12 21:11 ` Stephen Hemminger
2007-03-13 20:50 ` Willy Tarreau
2007-03-21 18:54 ` Stephen Hemminger [this message]
2007-03-21 19:15 ` Willy Tarreau
2007-03-21 19:58 ` Stephen Hemminger
2007-03-21 20:15 ` [PATCH 1/2] div64_64 optimization Stephen Hemminger
2007-03-21 20:17 ` [PATCH 2/2] tcp: cubic optimization Stephen Hemminger
2007-03-22 19:11 ` David Miller
2007-03-22 19:11 ` [PATCH 1/2] div64_64 optimization David Miller
2007-03-08 4:16 ` [PATCH] tcp_cubic: use 32 bit math Willy Tarreau
2007-03-07 4:20 ` [PATCH] tcp_cubic: faster cube root David Miller
2007-03-07 12:12 ` Andi Kleen
2007-03-07 19:33 ` David Miller
2007-03-06 18:50 ` [RFC] div64_64 support H. Peter Anvin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20070321115419.483a655a@freekitty \
--to=shemminger@linux-foundation.org \
--cc=andi@firstfloor.org \
--cc=dada1@cosmosbay.com \
--cc=davem@davemloft.net \
--cc=jengelh@linux01.gwdg.de \
--cc=linux-kernel@vger.kernel.org \
--cc=netdev@vger.kernel.org \
--cc=rkuhn@e18.physik.tu-muenchen.de \
--cc=w@1wt.eu \
--subject='Re: [PATCH] tcp_cubic: use 32 bit math' \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).