LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC] lib/vsprintf.c: Even faster decimal conversion
@ 2015-02-20 23:51 Rasmus Villemoes
  2015-03-05 15:22 ` Rasmus Villemoes
  2015-03-18  0:50 ` [RFC] lib/vsprintf.c: Even faster decimal conversion Denys Vlasenko
  0 siblings, 2 replies; 19+ messages in thread
From: Rasmus Villemoes @ 2015-02-20 23:51 UTC (permalink / raw)
  To: Andrew Morton, Rasmus Villemoes, Peter Zijlstra (Intel), Tejun Heo
  Cc: linux-kernel

The most expensive part of decimal conversion is the divisions by 10
(albeit done using reciprocal multiplication with appropriately chosen
constants). I decided to see if one could eliminate around half of
these multiplications by emitting two digits at a time, at the cost of
a 200 byte lookup table, and it does indeed seem like there is
something to be gained, especially on 64 bits.

$ ./test64
Distribution              Function         Cycles/conv  Conv/1 sec
uniform([10, 2^64-1])     linux_put_dec          127.72         23047567
uniform([10, 2^64-1])     rv_put_dec              60.73         45932786
                          +/-                   -52.45%          +99.30%
3 + neg_binom(0.05)       linux_put_dec           62.13         45941598
3 + neg_binom(0.05)       rv_put_dec              41.72         64207025
                          +/-                   -32.86%          +39.76%
3 + neg_binom(0.10)       linux_put_dec           45.38         61649231
3 + neg_binom(0.10)       rv_put_dec              32.41         81348584
                          +/-                   -28.57%          +31.95%
3 + neg_binom(0.15)       linux_put_dec           37.05         74703384
3 + neg_binom(0.15)       rv_put_dec              27.17         95626720
                          +/-                   -26.67%          +28.01%
3 + neg_binom(0.20)       linux_put_dec           31.55         86833667
3 + neg_binom(0.20)       rv_put_dec              23.67        109279546
                          +/-                   -24.98%          +25.85%
3 + neg_binom(0.50)       linux_put_dec           16.85        159560933
3 + neg_binom(0.50)       rv_put_dec              12.59        204607570
                          +/-                   -25.31%          +28.23%

In the above, I've tried to account for the fact that the numbers
being converted are not uniformly distributed in the entire u64
range; smaller numbers are much more common. So I've tried to model
that using the negative binomial distribution: Each of the 2048 test
numbers are generated as follows. First I pick a number between 3 and
63 (inclusive) according to the neg_binom(p) (for parameter p = 0.05,
4 is 95% as likely as 3, 5 is 95% as likely as 4, etc.); this is the
index of the most significant bit in the number I then generate. I
purposely exclude very small numbers (< 10), since the kernel's
number() function takes a shortcut in that case, which is also why the
MSB must be at least 3.

The bloat-o-meter costs are around 150 bytes (the generated code is a
little smaller, so it's not the full 200 bytes) on both 32 and 64
bit. I'm aware that extra cache misses won't show up in a
microbenchmark as used above, but on the other hand decimal
conversions often happen in bulk (top reading lots of /proc files, or
simply a 'ls /proc').

I have of course tested that the new code generates the same output as
the old, for both the first and last 1e10 numbers in [0,2^64-1] and
4e9 'random' numbers in-between.

This touches some of the same lines as the tiny series
<http://thread.gmane.org/gmane.linux.kernel/1892024> and is based on
top of that, though there's no functional dependency.

It would be nice to get some numbers for architectures other than x86,
and if noone hates this I'd like to let it soak in -next for a while.

Test and verification code on github <https://github.com/Villemoes/dec>.

I had to jump through some hoops to get the 32 bit code to compile and
run on my 64 bit machine, so I'm not sure how relevant these numbers
are, but just for comparison I got

$ ./test32
Distribution              Function         Cycles/conv  Conv/1 sec
uniform([10, 2^64-1])     linux_put_dec          132.32         22775225
uniform([10, 2^64-1])     rv_put_dec              91.18         33022454
                          +/-                   -31.09%          +44.99%
3 + neg_binom(0.05)       linux_put_dec           75.61         39920611
3 + neg_binom(0.05)       rv_put_dec              66.23         44478150
                          +/-                   -12.41%          +11.42%
3 + neg_binom(0.10)       linux_put_dec           54.70         54709426
3 + neg_binom(0.10)       rv_put_dec              48.51         60533279
                          +/-                   -11.31%          +10.65%
3 + neg_binom(0.15)       linux_put_dec           45.35         65838749
3 + neg_binom(0.15)       rv_put_dec              40.41         71007544
                          +/-                   -10.90%           +7.85%
3 + neg_binom(0.20)       linux_put_dec           38.93         75976835
3 + neg_binom(0.20)       rv_put_dec              34.52         81615186
                          +/-                   -11.34%           +7.42%
3 + neg_binom(0.50)       linux_put_dec           22.43        119256030
3 + neg_binom(0.50)       rv_put_dec              18.79        144213096
                          +/-                   -16.22%          +20.93%

Not-yet-Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
 lib/vsprintf.c | 248 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 129 insertions(+), 119 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 2753f9261115..5bd8811d6662 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -32,6 +32,7 @@
 
 #include <asm/page.h>		/* for PAGE_SIZE */
 #include <asm/sections.h>	/* for dereference_function_descriptor() */
+#include <asm/byteorder.h>	/* cpu_to_le16 */
 
 #include <linux/string_helpers.h>
 #include "kstrtox.h"
@@ -121,142 +122,147 @@ int skip_atoi(const char **s)
 	return i;
 }
 
-/* Decimal conversion is by far the most typical, and is used
- * for /proc and /sys data. This directly impacts e.g. top performance
- * with many processes running. We optimize it for speed
- * using ideas described at <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
- * (with permission from the author, Douglas W. Jones).
+/*
+ * Decimal conversion is by far the most typical, and is used for
+ * /proc and /sys data. This directly impacts e.g. top performance
+ * with many processes running. We optimize it for speed by emitting
+ * two characters at a time, using a 200 byte lookup table. This
+ * roughly halves the number of multiplications compared to computing
+ * the digits one at a time. Implementation strongly inspired by the
+ * previous version, which in turn used ideas described at
+ * <http://www.cs.uiowa.edu/~jones/bcd/divide.html> (with permission
+ * from the author, Douglas W. Jones).
+ *
+ * It turns out there is precisely one 26 bit fixed-point
+ * approximation a of 64/100 for which x/100 == (x * (u64)a) >> 32
+ * holds for all x in [0, 10^8-1], namely a = 0x28f5c29. The actual
+ * range happens to be somewhat larger (x <= 1073741898), but that's
+ * irrelevant for our purpose.
+ *
+ * For dividing a number in the range [10^4, 10^6-1] by 100, we still
+ * need a 32x32->64 bit multiply, so we simply use the same constant.
+ *
+ * For dividing a number in the range [100, 10^4-1] by 100, there are
+ * several options. The simplest is (x * 0x147b) >> 19, which is valid
+ * for all x <= 43698.
  */
 
-#if BITS_PER_LONG != 32 || BITS_PER_LONG_LONG != 64
-/* Formats correctly any integer in [0, 999999999] */
+static const u16 decpair[100] = {
+#define _(x) cpu_to_le16(((x % 10) | ((x / 10) << 8)) + 0x3030)
+	_( 0), _( 1), _( 2), _( 3), _( 4), _( 5), _( 6), _( 7), _( 8), _( 9),
+	_(10), _(11), _(12), _(13), _(14), _(15), _(16), _(17), _(18), _(19),
+	_(20), _(21), _(22), _(23), _(24), _(25), _(26), _(27), _(28), _(29),
+	_(30), _(31), _(32), _(33), _(34), _(35), _(36), _(37), _(38), _(39),
+	_(40), _(41), _(42), _(43), _(44), _(45), _(46), _(47), _(48), _(49),
+	_(50), _(51), _(52), _(53), _(54), _(55), _(56), _(57), _(58), _(59),
+	_(60), _(61), _(62), _(63), _(64), _(65), _(66), _(67), _(68), _(69),
+	_(70), _(71), _(72), _(73), _(74), _(75), _(76), _(77), _(78), _(79),
+	_(80), _(81), _(82), _(83), _(84), _(85), _(86), _(87), _(88), _(89),
+	_(90), _(91), _(92), _(93), _(94), _(95), _(96), _(97), _(98), _(99),
+#undef _
+};
+
+/*
+ * This will print a single '0' even if r == 0, since we would
+ * immediately jump to out_r where two 0s would be written and one of
+ * them then discarded. This is needed by ip4_string below. All other
+ * callers pass a non-zero value of r.
+*/
 static noinline_for_stack
-char *put_dec_full9(char *buf, unsigned q)
+char *put_dec_trunc8(char *buf, unsigned r)
 {
-	unsigned r;
-
-	/*
-	 * Possible ways to approx. divide by 10
-	 * (x * 0x1999999a) >> 32 x < 1073741829 (multiply must be 64-bit)
-	 * (x * 0xcccd) >> 19     x <      81920 (x < 262149 when 64-bit mul)
-	 * (x * 0x6667) >> 18     x <      43699
-	 * (x * 0x3334) >> 17     x <      16389
-	 * (x * 0x199a) >> 16     x <      16389
-	 * (x * 0x0ccd) >> 15     x <      16389
-	 * (x * 0x0667) >> 14     x <       2739
-	 * (x * 0x0334) >> 13     x <       1029
-	 * (x * 0x019a) >> 12     x <       1029
-	 * (x * 0x00cd) >> 11     x <       1029 shorter code than * 0x67 (on i386)
-	 * (x * 0x0067) >> 10     x <        179
-	 * (x * 0x0034) >>  9     x <         69 same
-	 * (x * 0x001a) >>  8     x <         69 same
-	 * (x * 0x000d) >>  7     x <         69 same, shortest code (on i386)
-	 * (x * 0x0007) >>  6     x <         19
-	 * See <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
-	 */
-	r      = (q * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (q - 10 * r) + '0'; /* 1 */
-	q      = (r * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (r - 10 * q) + '0'; /* 2 */
-	r      = (q * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (q - 10 * r) + '0'; /* 3 */
-	q      = (r * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (r - 10 * q) + '0'; /* 4 */
-	r      = (q * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (q - 10 * r) + '0'; /* 5 */
-	/* Now value is under 10000, can avoid 64-bit multiply */
-	q      = (r * 0x199a) >> 16;
-	*buf++ = (r - 10 * q)  + '0'; /* 6 */
-	r      = (q * 0xcd) >> 11;
-	*buf++ = (q - 10 * r)  + '0'; /* 7 */
-	q      = (r * 0xcd) >> 11;
-	*buf++ = (r - 10 * q) + '0'; /* 8 */
-	*buf++ = q + '0'; /* 9 */
+	unsigned q;
+
+	/* 1 <= r < 10^8 */
+	if (r < 100)
+		goto out_r;
+
+	/* 100 <= r < 10^8 */
+	q = (r * (u64)0x28f5c29) >> 32;
+	*((u16 *)buf) = decpair[r - 100*q];
+	buf += 2;
+
+	/* 1 <= q < 10^6 */
+	if (q < 100)
+		goto out_q;
+
+	/*  100 <= q < 10^6 */
+	r = (q * (u64)0x28f5c29) >> 32;
+	*((u16 *)buf) = decpair[q - 100*r];
+	buf += 2;
+
+	/* 1 <= r < 10^4 */
+	if (r < 100)
+		goto out_r;
+
+	/* 100 <= r < 10^4 */
+	q = (r * 0x147b) >> 19;
+	*((u16 *)buf) = decpair[r - 100*q];
+	buf += 2;
+out_q:
+	/* 1 <= q < 100 */
+	r = q;
+out_r:
+	/* 1 <= r < 100 */
+	*((u16 *)buf) = decpair[r];
+	buf += 2;
+	if (buf[-1] == '0')
+		buf--;
 	return buf;
 }
-#endif
 
-/* Similar to above but do not pad with zeros.
- * Code can be easily arranged to print 9 digits too, but our callers
- * always call put_dec_full9() instead when the number has 9 decimal digits.
- */
+#if BITS_PER_LONG == 64
 static noinline_for_stack
-char *put_dec_trunc8(char *buf, unsigned r)
+char *put_dec_full8(char *buf, unsigned r)
 {
 	unsigned q;
 
-	/* Copy of previous function's body with added early returns */
-	while (r >= 10000) {
-		q = r + '0';
-		r  = (r * (uint64_t)0x1999999a) >> 32;
-		*buf++ = q - 10*r;
-	}
-
-	q      = (r * 0x199a) >> 16;	/* r <= 9999 */
-	*buf++ = (r - 10 * q)  + '0';
-	if (q == 0)
-		return buf;
-	r      = (q * 0xcd) >> 11;	/* q <= 999 */
-	*buf++ = (q - 10 * r)  + '0';
-	if (r == 0)
-		return buf;
-	q      = (r * 0xcd) >> 11;	/* r <= 99 */
-	*buf++ = (r - 10 * q) + '0';
-	if (q == 0)
-		return buf;
-	*buf++ = q + '0';		 /* q <= 9 */
-	return buf;
-}
+	/* 0 <= r < 10^8 */
+	q = (r * (u64)0x28f5c29) >> 32;
+	*((u16 *)buf) = decpair[r - 100*q];
+	buf += 2;
 
-/* There are two algorithms to print larger numbers.
- * One is generic: divide by 1000000000 and repeatedly print
- * groups of (up to) 9 digits. It's conceptually simple,
- * but requires a (unsigned long long) / 1000000000 division.
- *
- * Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
- * manipulates them cleverly and generates groups of 4 decimal digits.
- * It so happens that it does NOT require long long division.
- *
- * If long is > 32 bits, division of 64-bit values is relatively easy,
- * and we will use the first algorithm.
- * If long long is > 64 bits (strange architecture with VERY large long long),
- * second algorithm can't be used, and we again use the first one.
- *
- * Else (if long is 32 bits and long long is 64 bits) we use second one.
- */
+	/* 0 <= q < 10^6 */
+	r = (q * (u64)0x28f5c29) >> 32;
+	*((u16 *)buf) = decpair[q - 100*r];
+	buf += 2;
 
-#if BITS_PER_LONG != 32 || BITS_PER_LONG_LONG != 64
+	/* 0 <= r < 10^4 */
+	q = (r * 0x147b) >> 19;
+	*((u16 *)buf) = decpair[r - 100*q];
+	buf += 2;
 
-/* First algorithm: generic */
+	/* 0 <= q < 100 */
+	*((u16 *)buf) = decpair[q];
+	buf += 2;
+	return buf;
+}
 
-static
+static noinline_for_stack
 char *put_dec(char *buf, unsigned long long n)
 {
-	if (n >= 100*1000*1000) {
-		while (n >= 1000*1000*1000)
-			buf = put_dec_full9(buf, do_div(n, 1000*1000*1000));
-		if (n >= 100*1000*1000)
-			return put_dec_full9(buf, n);
-	}
+	if (n >= 100*1000*1000)
+		buf = put_dec_full8(buf, do_div(n, 100*1000*1000));
+	/* 1 <= n <= 1.6e11 */
+	if (n >= 100*1000*1000)
+		buf = put_dec_full8(buf, do_div(n, 100*1000*1000));
+	/* 1 <= n < 1e8 */
 	return put_dec_trunc8(buf, n);
 }
 
-#else
+#elif BITS_PER_LONG == 32 && BITS_PER_LONG_LONG == 64
 
-/* Second algorithm: valid only for 64-bit long longs */
-
-/* See comment in put_dec_full9 for choice of constants */
-static noinline_for_stack
-void put_dec_full4(char *buf, unsigned q)
+static void
+put_dec_full4(char *buf, unsigned r)
 {
-	unsigned r;
-	r      = (q * 0xccd) >> 15;
-	buf[0] = (q - 10 * r) + '0';
-	q      = (r * 0xcd) >> 11;
-	buf[1] = (r - 10 * q)  + '0';
-	r      = (q * 0xcd) >> 11;
-	buf[2] = (q - 10 * r)  + '0';
-	buf[3] = r + '0';
+	unsigned q;
+
+	/* 0 <= r < 10^4 */
+	q = (r * 0x147b) >> 19;
+	*((u16 *)buf) = decpair[r - 100*q];
+	buf += 2;
+	/* 0 <= q < 100 */
+	*((u16 *)buf) = decpair[q];
 }
 
 /*
@@ -264,9 +270,9 @@ void put_dec_full4(char *buf, unsigned q)
  * The approximation x/10000 == (x * 0x346DC5D7) >> 43
  * holds for all x < 1,128,869,999.  The largest value this
  * helper will ever be asked to convert is 1,125,520,955.
- * (d1 in the put_dec code, assuming n is all-ones).
+ * (second call in the put_dec code, assuming n is all-ones).
  */
-static
+static noinline_for_stack
 unsigned put_dec_helper4(char *buf, unsigned x)
 {
         uint32_t q = (x * (uint64_t)0x346DC5D7) >> 43;
@@ -293,6 +299,8 @@ char *put_dec(char *buf, unsigned long long n)
 	d2  = (h      ) & 0xffff;
 	d3  = (h >> 16); /* implicit "& 0xffff" */
 
+	/* n = 2^48 d3 + 2^32 d2 + 2^16 d1 + d0
+	     = 281_4749_7671_0656 d3 + 42_9496_7296 d2 + 6_5536 d1 + d0 */
 	q   = 656 * d3 + 7296 * d2 + 5536 * d1 + ((uint32_t)n & 0xffff);
 	q = put_dec_helper4(buf, q);
 
@@ -322,7 +330,8 @@ char *put_dec(char *buf, unsigned long long n)
  */
 int num_to_str(char *buf, int size, unsigned long long num)
 {
-	char tmp[sizeof(num) * 3];
+	/* put_dec requires 2-byte alignment of the buffer. */
+	char tmp[sizeof(num) * 3] __aligned(2);
 	int idx, len;
 
 	/* put_dec() may work incorrectly for num = 0 (generate "", not "0") */
@@ -383,7 +392,8 @@ static noinline_for_stack
 char *number(char *buf, char *end, unsigned long long num,
 	     struct printf_spec spec)
 {
-	char tmp[3 * sizeof(num)];
+	/* put_dec requires 2-byte alignment of the buffer. */
+	char tmp[3 * sizeof(num)] __aligned(2);
 	char sign;
 	char locase;
 	int need_pfx = ((spec.flags & SPECIAL) && spec.base != 10);
@@ -935,7 +945,7 @@ char *ip4_string(char *p, const u8 *addr, const char *fmt)
 		break;
 	}
 	for (i = 0; i < 4; i++) {
-		char temp[3];	/* hold each IP quad in reverse order */
+		char temp[4] __aligned(2);	/* hold each IP quad in reverse order */
 		int digits = put_dec_trunc8(temp, addr[index]) - temp;
 		if (leading_zeros) {
 			if (digits < 3)
-- 
2.1.3


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] lib/vsprintf.c: Even faster decimal conversion
  2015-02-20 23:51 [RFC] lib/vsprintf.c: Even faster decimal conversion Rasmus Villemoes
@ 2015-03-05 15:22 ` Rasmus Villemoes
  2015-03-05 16:03   ` Joe Perches
  2015-03-18  0:50 ` [RFC] lib/vsprintf.c: Even faster decimal conversion Denys Vlasenko
  1 sibling, 1 reply; 19+ messages in thread
From: Rasmus Villemoes @ 2015-03-05 15:22 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra (Intel), Tejun Heo, Andrew Morton

On Sat, Feb 21 2015, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:

> [...] decimal conversion [...] it does indeed seem like there is
> something to be gained, especially on 64 bits.
>
> $ ./test64
> Distribution              Function         Cycles/conv  Conv/1 sec
> uniform([10, 2^64-1])     linux_put_dec          127.72         23047567
> uniform([10, 2^64-1])     rv_put_dec              60.73         45932786
>                           +/-                   -52.45%          +99.30%
[...]
> 3 + neg_binom(0.50)       linux_put_dec           16.85        159560933
> 3 + neg_binom(0.50)       rv_put_dec              12.59        204607570
>                           +/-                   -25.31%          +28.23%

I'm assuming the underwhelming response means NAK.

Rasmus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] lib/vsprintf.c: Even faster decimal conversion
  2015-03-05 15:22 ` Rasmus Villemoes
@ 2015-03-05 16:03   ` Joe Perches
  2015-03-05 16:10     ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Joe Perches @ 2015-03-05 16:03 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: linux-kernel, Peter Zijlstra (Intel), Tejun Heo, Andrew Morton

On Thu, 2015-03-05 at 16:22 +0100, Rasmus Villemoes wrote:
> On Sat, Feb 21 2015, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:
> 
> > [...] decimal conversion [...] it does indeed seem like there is
> > something to be gained, especially on 64 bits.
> >
> > $ ./test64
> > Distribution              Function         Cycles/conv  Conv/1 sec
> > uniform([10, 2^64-1])     linux_put_dec          127.72         23047567
> > uniform([10, 2^64-1])     rv_put_dec              60.73         45932786
> >                           +/-                   -52.45%          +99.30%
> [...]
> > 3 + neg_binom(0.50)       linux_put_dec           16.85        159560933
> > 3 + neg_binom(0.50)       rv_put_dec              12.59        204607570
> >                           +/-                   -25.31%          +28.23%
> 
> I'm assuming the underwhelming response means NAK.

Dunno why you assume that, sometimes it just takes
awhile for people to look at non-critical, infrequent
optimization changes like this.

Seems sensible enough to me though.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] lib/vsprintf.c: Even faster decimal conversion
  2015-03-05 16:03   ` Joe Perches
@ 2015-03-05 16:10     ` Tejun Heo
  2015-03-05 22:24       ` Rasmus Villemoes
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2015-03-05 16:10 UTC (permalink / raw)
  To: Joe Perches
  Cc: Rasmus Villemoes, linux-kernel, Peter Zijlstra (Intel), Andrew Morton

On Thu, Mar 05, 2015 at 08:03:33AM -0800, Joe Perches wrote:
> On Thu, 2015-03-05 at 16:22 +0100, Rasmus Villemoes wrote:
> > On Sat, Feb 21 2015, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:
> > 
> > > [...] decimal conversion [...] it does indeed seem like there is
> > > something to be gained, especially on 64 bits.
> > >
> > > $ ./test64
> > > Distribution              Function         Cycles/conv  Conv/1 sec
> > > uniform([10, 2^64-1])     linux_put_dec          127.72         23047567
> > > uniform([10, 2^64-1])     rv_put_dec              60.73         45932786
> > >                           +/-                   -52.45%          +99.30%
> > [...]
> > > 3 + neg_binom(0.50)       linux_put_dec           16.85        159560933
> > > 3 + neg_binom(0.50)       rv_put_dec              12.59        204607570
> > >                           +/-                   -25.31%          +28.23%
> > 
> > I'm assuming the underwhelming response means NAK.
> 
> Dunno why you assume that, sometimes it just takes
> awhile for people to look at non-critical, infrequent
> optimization changes like this.
> 
> Seems sensible enough to me though.

I'd like to see how this actually affects larger operations - sth
along the line of top consumes D% less CPU cycles w/ N processes - if
for nothing else, just to get the sense of scale, but given that we're
already trying pretty hard to optimize the divisions there, I too
don't see anything wrong with the patch.  I haven't studied the code
but looks sensible enough on a glance, so, FWIW,

 Looks-sensible-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] lib/vsprintf.c: Even faster decimal conversion
  2015-03-05 16:10     ` Tejun Heo
@ 2015-03-05 22:24       ` Rasmus Villemoes
  2015-03-10 10:47         ` Rasmus Villemoes
  0 siblings, 1 reply; 19+ messages in thread
From: Rasmus Villemoes @ 2015-03-05 22:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Joe Perches, linux-kernel, Peter Zijlstra (Intel), Andrew Morton

On Thu, Mar 05 2015, Tejun Heo <tj@kernel.org> wrote:

> On Thu, Mar 05, 2015 at 08:03:33AM -0800, Joe Perches wrote:
>> On Thu, 2015-03-05 at 16:22 +0100, Rasmus Villemoes wrote:
>>
>> > I'm assuming the underwhelming response means NAK.
>> 
>> Dunno why you assume that, sometimes it just takes
>> awhile for people to look at non-critical, infrequent
>> optimization changes like this.

Well, when nobody has responded within a week with even a "not obviously
insane, I'll look at it after -rcN/next full moon/...", there's usually
little hope of ever getting some feedback.

>> Seems sensible enough to me though.
>
> I'd like to see how this actually affects larger operations - sth
> along the line of top consumes D% less CPU cycles w/ N processes - if
> for nothing else, just to get the sense of scale,

That makes sense. I'll see if I can get some reproducible numbers, but
I'm afraid the effect drowns in all the syscall overhead. Which would be
a valid argument against touching the code.

> I haven't studied the code but looks sensible enough on a glance, so,
> FWIW,
>
>  Looks-sensible-by: Tejun Heo <tj@kernel.org>

Thanks, this was the kind of feedback I was hoping for.

Rasmus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] lib/vsprintf.c: Even faster decimal conversion
  2015-03-05 22:24       ` Rasmus Villemoes
@ 2015-03-10 10:47         ` Rasmus Villemoes
  2015-03-10 12:42           ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: Rasmus Villemoes @ 2015-03-10 10:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Joe Perches, linux-kernel, Peter Zijlstra (Intel), Andrew Morton

On Thu, Mar 05 2015, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:

> On Thu, Mar 05 2015, Tejun Heo <tj@kernel.org> wrote:
>
>> I'd like to see how this actually affects larger operations - sth
>> along the line of top consumes D% less CPU cycles w/ N processes - if
>> for nothing else, just to get the sense of scale,
>
> That makes sense. I'll see if I can get some reproducible numbers, but
> I'm afraid the effect drowns in all the syscall overhead. Which would be
> a valid argument against touching the code.

I wasn't able to come up with a way to measure the absolute %cpu
reliably enough (neither from top's own output or using something like
watch -n1 ps -p $toppid -o %cpu) - it fluctuates too much to see any
difference. But using perf I was able to get somewhat stable numbers,
which suggest an improvement in the 0.5-1.0% range [1]. Measured with
10000 [2] sleeping processes in an idle virtual machine (and on mostly
idle host), patch on top of 3.19.0. Extracting the functions involved in
the decimal conversion I get

new1.txt:     2.35%  top      [kernel.kallsyms]   [k] num_to_str                 
new2.txt:     2.70%  top      [kernel.kallsyms]   [k] num_to_str                 
old1.txt:     2.25%  top      [kernel.kallsyms]   [k] num_to_str                 
old2.txt:     2.18%  top      [kernel.kallsyms]   [k] num_to_str                 

new1.txt:     0.63%  top      [kernel.kallsyms]   [k] put_dec                    
new2.txt:     0.71%  top      [kernel.kallsyms]   [k] put_dec                    
old1.txt:     0.67%  top      [kernel.kallsyms]   [k] put_dec                    
old2.txt:     0.59%  top      [kernel.kallsyms]   [k] put_dec                    

new1.txt:     0.53%  top      [kernel.kallsyms]   [k] put_dec_full8              
new2.txt:     0.55%  top      [kernel.kallsyms]   [k] put_dec_full8              
old1.txt:     1.09%  top      [kernel.kallsyms]   [k] put_dec_full9              
old2.txt:     1.15%  top      [kernel.kallsyms]   [k] put_dec_full9              

new1.txt:     1.12%  top      [kernel.kallsyms]   [k] put_dec_trunc8             
new2.txt:     1.22%  top      [kernel.kallsyms]   [k] put_dec_trunc8             
old1.txt:     1.64%  top      [kernel.kallsyms]   [k] put_dec_trunc8             
old2.txt:     1.65%  top      [kernel.kallsyms]   [k] put_dec_trunc8             

I can't explain why num_to_str apparently becomes slightly slower (the
patch essentially didn't touch it), but the put_dec_ helpers in any case
make up for that.

If someone has a suggestion for a better way of measuring this I'm all
ears.

Thanks,
Rasmus

[1] in terms of #cycles

[2] numbers for 2000 and 5000 processes are quite similar.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] lib/vsprintf.c: Even faster decimal conversion
  2015-03-10 10:47         ` Rasmus Villemoes
@ 2015-03-10 12:42           ` Tejun Heo
  2015-03-10 12:57             ` Rasmus Villemoes
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2015-03-10 12:42 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Joe Perches, linux-kernel, Peter Zijlstra (Intel), Andrew Morton

Hello,

On Tue, Mar 10, 2015 at 11:47:47AM +0100, Rasmus Villemoes wrote:
> I can't explain why num_to_str apparently becomes slightly slower (the
> patch essentially didn't touch it), but the put_dec_ helpers in any case
> make up for that.

Unrelated code changes affecting performance in seemingly random way
isn't too uncommon.  A lot of it arises from cacheline behaviors.  The
effect is sometimes surprisingly big.

Generally looks pretty good to me.  Andrew, can you please pick this
patch up?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] lib/vsprintf.c: Even faster decimal conversion
  2015-03-10 12:42           ` Tejun Heo
@ 2015-03-10 12:57             ` Rasmus Villemoes
  2015-03-10 23:01               ` [PATCH v1] " Rasmus Villemoes
  0 siblings, 1 reply; 19+ messages in thread
From: Rasmus Villemoes @ 2015-03-10 12:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Joe Perches, linux-kernel, Peter Zijlstra (Intel), Andrew Morton

On Tue, Mar 10 2015, Tejun Heo <tj@kernel.org> wrote:

> Hello,
>
> On Tue, Mar 10, 2015 at 11:47:47AM +0100, Rasmus Villemoes wrote:
>> I can't explain why num_to_str apparently becomes slightly slower (the
>> patch essentially didn't touch it), but the put_dec_ helpers in any case
>> make up for that.
>
> Unrelated code changes affecting performance in seemingly random way
> isn't too uncommon.  A lot of it arises from cacheline behaviors.  The
> effect is sometimes surprisingly big.
>
> Generally looks pretty good to me.

Thanks.

> Andrew, can you please pick this patch up?

No, please give me a day or two to post a new version (I purposely wrote
not-yet-signed-off-by because I want to clean up both the patch and the
commit message - I just wanted to get the code out there early in the
release cycle).

Rasmus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v1] lib/vsprintf.c: Even faster decimal conversion
  2015-03-10 12:57             ` Rasmus Villemoes
@ 2015-03-10 23:01               ` Rasmus Villemoes
  2015-03-11 21:52                 ` Andrew Morton
                                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Rasmus Villemoes @ 2015-03-10 23:01 UTC (permalink / raw)
  To: Andrew Morton, Rasmus Villemoes, Peter Zijlstra (Intel), Tejun Heo
  Cc: Joe Perches, linux-kernel

The most expensive part of decimal conversion is the divisions by 10
(albeit done using reciprocal multiplication with appropriately chosen
constants). I decided to see if one could eliminate around half of
these multiplications by emitting two digits at a time, at the cost of
a 200 byte lookup table, and it does indeed seem like there is
something to be gained, especially on 64 bits. Microbenchmarking shows
improvements ranging from -50% (for numbers uniformly distributed in
[0, 2^64-1]) to -25% (for numbers heavily biased toward the smaller
end, a more realistic distribution).

On a larger scale, perf shows that top, one of the big consumers of
/proc data, uses 0.5-1.0% fewer cpu cycles.

I had to jump through some hoops to get the 32 bit code to compile and
run on my 64 bit machine, so I'm not sure how relevant these numbers
are, but just for comparison the microbenchmark showed improvements
between -30% and -10%.

The bloat-o-meter costs are around 150 bytes (the generated code is a
little smaller, so it's not the full 200 bytes) on both 32 and 64
bit. I'm aware that extra cache misses won't show up in a
microbenchmark as used above, but on the other hand decimal
conversions often happen in bulk (for example in the case of top).

I have of course tested that the new code generates the same output as
the old, for both the first and last 1e10 numbers in [0,2^64-1] and
4e9 'random' numbers in-between.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---

Test and verification code on github
<https://github.com/Villemoes/dec>.  It would be nice if someone could
verify the code on architectures other than x86 - in particular, I
believe it _should_ work on big-endian, but I have no way of testing
that. Benchmark numbers would be nice too, of course.

v0->v1: More concise commit log. Explicitly require BITS_PER_LONG_LONG
== 64 also in the BITS_PER_LONG == 64 case (I think supporting even
larger long longs would be a matter of replacing the two ifs in
put_dec by a while, but this generates simpler code).


 lib/vsprintf.c | 248 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 129 insertions(+), 119 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 2753f9261115..5edbf5b8d93f 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -32,6 +32,7 @@
 
 #include <asm/page.h>		/* for PAGE_SIZE */
 #include <asm/sections.h>	/* for dereference_function_descriptor() */
+#include <asm/byteorder.h>	/* cpu_to_le16 */
 
 #include <linux/string_helpers.h>
 #include "kstrtox.h"
@@ -121,142 +122,147 @@ int skip_atoi(const char **s)
 	return i;
 }
 
-/* Decimal conversion is by far the most typical, and is used
- * for /proc and /sys data. This directly impacts e.g. top performance
- * with many processes running. We optimize it for speed
- * using ideas described at <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
- * (with permission from the author, Douglas W. Jones).
+/*
+ * Decimal conversion is by far the most typical, and is used for
+ * /proc and /sys data. This directly impacts e.g. top performance
+ * with many processes running. We optimize it for speed by emitting
+ * two characters at a time, using a 200 byte lookup table. This
+ * roughly halves the number of multiplications compared to computing
+ * the digits one at a time. Implementation strongly inspired by the
+ * previous version, which in turn used ideas described at
+ * <http://www.cs.uiowa.edu/~jones/bcd/divide.html> (with permission
+ * from the author, Douglas W. Jones).
+ *
+ * It turns out there is precisely one 26 bit fixed-point
+ * approximation a of 64/100 for which x/100 == (x * (u64)a) >> 32
+ * holds for all x in [0, 10^8-1], namely a = 0x28f5c29. The actual
+ * range happens to be somewhat larger (x <= 1073741898), but that's
+ * irrelevant for our purpose.
+ *
+ * For dividing a number in the range [10^4, 10^6-1] by 100, we still
+ * need a 32x32->64 bit multiply, so we simply use the same constant.
+ *
+ * For dividing a number in the range [100, 10^4-1] by 100, there are
+ * several options. The simplest is (x * 0x147b) >> 19, which is valid
+ * for all x <= 43698.
  */
 
-#if BITS_PER_LONG != 32 || BITS_PER_LONG_LONG != 64
-/* Formats correctly any integer in [0, 999999999] */
+static const u16 decpair[100] = {
+#define _(x) cpu_to_le16(((x % 10) | ((x / 10) << 8)) + 0x3030)
+	_( 0), _( 1), _( 2), _( 3), _( 4), _( 5), _( 6), _( 7), _( 8), _( 9),
+	_(10), _(11), _(12), _(13), _(14), _(15), _(16), _(17), _(18), _(19),
+	_(20), _(21), _(22), _(23), _(24), _(25), _(26), _(27), _(28), _(29),
+	_(30), _(31), _(32), _(33), _(34), _(35), _(36), _(37), _(38), _(39),
+	_(40), _(41), _(42), _(43), _(44), _(45), _(46), _(47), _(48), _(49),
+	_(50), _(51), _(52), _(53), _(54), _(55), _(56), _(57), _(58), _(59),
+	_(60), _(61), _(62), _(63), _(64), _(65), _(66), _(67), _(68), _(69),
+	_(70), _(71), _(72), _(73), _(74), _(75), _(76), _(77), _(78), _(79),
+	_(80), _(81), _(82), _(83), _(84), _(85), _(86), _(87), _(88), _(89),
+	_(90), _(91), _(92), _(93), _(94), _(95), _(96), _(97), _(98), _(99),
+#undef _
+};
+
+/*
+ * This will print a single '0' even if r == 0, since we would
+ * immediately jump to out_r where two 0s would be written and one of
+ * them then discarded. This is needed by ip4_string below. All other
+ * callers pass a non-zero value of r.
+*/
 static noinline_for_stack
-char *put_dec_full9(char *buf, unsigned q)
+char *put_dec_trunc8(char *buf, unsigned r)
 {
-	unsigned r;
-
-	/*
-	 * Possible ways to approx. divide by 10
-	 * (x * 0x1999999a) >> 32 x < 1073741829 (multiply must be 64-bit)
-	 * (x * 0xcccd) >> 19     x <      81920 (x < 262149 when 64-bit mul)
-	 * (x * 0x6667) >> 18     x <      43699
-	 * (x * 0x3334) >> 17     x <      16389
-	 * (x * 0x199a) >> 16     x <      16389
-	 * (x * 0x0ccd) >> 15     x <      16389
-	 * (x * 0x0667) >> 14     x <       2739
-	 * (x * 0x0334) >> 13     x <       1029
-	 * (x * 0x019a) >> 12     x <       1029
-	 * (x * 0x00cd) >> 11     x <       1029 shorter code than * 0x67 (on i386)
-	 * (x * 0x0067) >> 10     x <        179
-	 * (x * 0x0034) >>  9     x <         69 same
-	 * (x * 0x001a) >>  8     x <         69 same
-	 * (x * 0x000d) >>  7     x <         69 same, shortest code (on i386)
-	 * (x * 0x0007) >>  6     x <         19
-	 * See <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
-	 */
-	r      = (q * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (q - 10 * r) + '0'; /* 1 */
-	q      = (r * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (r - 10 * q) + '0'; /* 2 */
-	r      = (q * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (q - 10 * r) + '0'; /* 3 */
-	q      = (r * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (r - 10 * q) + '0'; /* 4 */
-	r      = (q * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (q - 10 * r) + '0'; /* 5 */
-	/* Now value is under 10000, can avoid 64-bit multiply */
-	q      = (r * 0x199a) >> 16;
-	*buf++ = (r - 10 * q)  + '0'; /* 6 */
-	r      = (q * 0xcd) >> 11;
-	*buf++ = (q - 10 * r)  + '0'; /* 7 */
-	q      = (r * 0xcd) >> 11;
-	*buf++ = (r - 10 * q) + '0'; /* 8 */
-	*buf++ = q + '0'; /* 9 */
+	unsigned q;
+
+	/* 1 <= r < 10^8 */
+	if (r < 100)
+		goto out_r;
+
+	/* 100 <= r < 10^8 */
+	q = (r * (u64)0x28f5c29) >> 32;
+	*((u16 *)buf) = decpair[r - 100*q];
+	buf += 2;
+
+	/* 1 <= q < 10^6 */
+	if (q < 100)
+		goto out_q;
+
+	/*  100 <= q < 10^6 */
+	r = (q * (u64)0x28f5c29) >> 32;
+	*((u16 *)buf) = decpair[q - 100*r];
+	buf += 2;
+
+	/* 1 <= r < 10^4 */
+	if (r < 100)
+		goto out_r;
+
+	/* 100 <= r < 10^4 */
+	q = (r * 0x147b) >> 19;
+	*((u16 *)buf) = decpair[r - 100*q];
+	buf += 2;
+out_q:
+	/* 1 <= q < 100 */
+	r = q;
+out_r:
+	/* 1 <= r < 100 */
+	*((u16 *)buf) = decpair[r];
+	buf += 2;
+	if (buf[-1] == '0')
+		buf--;
 	return buf;
 }
-#endif
 
-/* Similar to above but do not pad with zeros.
- * Code can be easily arranged to print 9 digits too, but our callers
- * always call put_dec_full9() instead when the number has 9 decimal digits.
- */
+#if BITS_PER_LONG == 64 && BITS_PER_LONG_LONG == 64
 static noinline_for_stack
-char *put_dec_trunc8(char *buf, unsigned r)
+char *put_dec_full8(char *buf, unsigned r)
 {
 	unsigned q;
 
-	/* Copy of previous function's body with added early returns */
-	while (r >= 10000) {
-		q = r + '0';
-		r  = (r * (uint64_t)0x1999999a) >> 32;
-		*buf++ = q - 10*r;
-	}
-
-	q      = (r * 0x199a) >> 16;	/* r <= 9999 */
-	*buf++ = (r - 10 * q)  + '0';
-	if (q == 0)
-		return buf;
-	r      = (q * 0xcd) >> 11;	/* q <= 999 */
-	*buf++ = (q - 10 * r)  + '0';
-	if (r == 0)
-		return buf;
-	q      = (r * 0xcd) >> 11;	/* r <= 99 */
-	*buf++ = (r - 10 * q) + '0';
-	if (q == 0)
-		return buf;
-	*buf++ = q + '0';		 /* q <= 9 */
-	return buf;
-}
+	/* 0 <= r < 10^8 */
+	q = (r * (u64)0x28f5c29) >> 32;
+	*((u16 *)buf) = decpair[r - 100*q];
+	buf += 2;
 
-/* There are two algorithms to print larger numbers.
- * One is generic: divide by 1000000000 and repeatedly print
- * groups of (up to) 9 digits. It's conceptually simple,
- * but requires a (unsigned long long) / 1000000000 division.
- *
- * Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
- * manipulates them cleverly and generates groups of 4 decimal digits.
- * It so happens that it does NOT require long long division.
- *
- * If long is > 32 bits, division of 64-bit values is relatively easy,
- * and we will use the first algorithm.
- * If long long is > 64 bits (strange architecture with VERY large long long),
- * second algorithm can't be used, and we again use the first one.
- *
- * Else (if long is 32 bits and long long is 64 bits) we use second one.
- */
+	/* 0 <= q < 10^6 */
+	r = (q * (u64)0x28f5c29) >> 32;
+	*((u16 *)buf) = decpair[q - 100*r];
+	buf += 2;
 
-#if BITS_PER_LONG != 32 || BITS_PER_LONG_LONG != 64
+	/* 0 <= r < 10^4 */
+	q = (r * 0x147b) >> 19;
+	*((u16 *)buf) = decpair[r - 100*q];
+	buf += 2;
 
-/* First algorithm: generic */
+	/* 0 <= q < 100 */
+	*((u16 *)buf) = decpair[q];
+	buf += 2;
+	return buf;
+}
 
-static
+static noinline_for_stack
 char *put_dec(char *buf, unsigned long long n)
 {
-	if (n >= 100*1000*1000) {
-		while (n >= 1000*1000*1000)
-			buf = put_dec_full9(buf, do_div(n, 1000*1000*1000));
-		if (n >= 100*1000*1000)
-			return put_dec_full9(buf, n);
-	}
+	if (n >= 100*1000*1000)
+		buf = put_dec_full8(buf, do_div(n, 100*1000*1000));
+	/* 1 <= n <= 1.6e11 */
+	if (n >= 100*1000*1000)
+		buf = put_dec_full8(buf, do_div(n, 100*1000*1000));
+	/* 1 <= n < 1e8 */
 	return put_dec_trunc8(buf, n);
 }
 
-#else
+#elif BITS_PER_LONG == 32 && BITS_PER_LONG_LONG == 64
 
-/* Second algorithm: valid only for 64-bit long longs */
-
-/* See comment in put_dec_full9 for choice of constants */
-static noinline_for_stack
-void put_dec_full4(char *buf, unsigned q)
+static void
+put_dec_full4(char *buf, unsigned r)
 {
-	unsigned r;
-	r      = (q * 0xccd) >> 15;
-	buf[0] = (q - 10 * r) + '0';
-	q      = (r * 0xcd) >> 11;
-	buf[1] = (r - 10 * q)  + '0';
-	r      = (q * 0xcd) >> 11;
-	buf[2] = (q - 10 * r)  + '0';
-	buf[3] = r + '0';
+	unsigned q;
+
+	/* 0 <= r < 10^4 */
+	q = (r * 0x147b) >> 19;
+	*((u16 *)buf) = decpair[r - 100*q];
+	buf += 2;
+	/* 0 <= q < 100 */
+	*((u16 *)buf) = decpair[q];
 }
 
 /*
@@ -264,9 +270,9 @@ void put_dec_full4(char *buf, unsigned q)
  * The approximation x/10000 == (x * 0x346DC5D7) >> 43
  * holds for all x < 1,128,869,999.  The largest value this
  * helper will ever be asked to convert is 1,125,520,955.
- * (d1 in the put_dec code, assuming n is all-ones).
+ * (second call in the put_dec code, assuming n is all-ones).
  */
-static
+static noinline_for_stack
 unsigned put_dec_helper4(char *buf, unsigned x)
 {
         uint32_t q = (x * (uint64_t)0x346DC5D7) >> 43;
@@ -293,6 +299,8 @@ char *put_dec(char *buf, unsigned long long n)
 	d2  = (h      ) & 0xffff;
 	d3  = (h >> 16); /* implicit "& 0xffff" */
 
+	/* n = 2^48 d3 + 2^32 d2 + 2^16 d1 + d0
+	     = 281_4749_7671_0656 d3 + 42_9496_7296 d2 + 6_5536 d1 + d0 */
 	q   = 656 * d3 + 7296 * d2 + 5536 * d1 + ((uint32_t)n & 0xffff);
 	q = put_dec_helper4(buf, q);
 
@@ -322,7 +330,8 @@ char *put_dec(char *buf, unsigned long long n)
  */
 int num_to_str(char *buf, int size, unsigned long long num)
 {
-	char tmp[sizeof(num) * 3];
+	/* put_dec requires 2-byte alignment of the buffer. */
+	char tmp[sizeof(num) * 3] __aligned(2);
 	int idx, len;
 
 	/* put_dec() may work incorrectly for num = 0 (generate "", not "0") */
@@ -383,7 +392,8 @@ static noinline_for_stack
 char *number(char *buf, char *end, unsigned long long num,
 	     struct printf_spec spec)
 {
-	char tmp[3 * sizeof(num)];
+	/* put_dec requires 2-byte alignment of the buffer. */
+	char tmp[3 * sizeof(num)] __aligned(2);
 	char sign;
 	char locase;
 	int need_pfx = ((spec.flags & SPECIAL) && spec.base != 10);
@@ -935,7 +945,7 @@ char *ip4_string(char *p, const u8 *addr, const char *fmt)
 		break;
 	}
 	for (i = 0; i < 4; i++) {
-		char temp[3];	/* hold each IP quad in reverse order */
+		char temp[4] __aligned(2);	/* hold each IP quad in reverse order */
 		int digits = put_dec_trunc8(temp, addr[index]) - temp;
 		if (leading_zeros) {
 			if (digits < 3)
-- 
2.1.3


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1] lib/vsprintf.c: Even faster decimal conversion
  2015-03-10 23:01               ` [PATCH v1] " Rasmus Villemoes
@ 2015-03-11 21:52                 ` Andrew Morton
  2015-03-19 21:41                   ` Rasmus Villemoes
  2015-03-12 18:49                 ` Jeff Epler
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2015-03-11 21:52 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Peter Zijlstra (Intel), Tejun Heo, Joe Perches, linux-kernel

On Wed, 11 Mar 2015 00:01:11 +0100 Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:

> The most expensive part of decimal conversion is the divisions by 10
> (albeit done using reciprocal multiplication with appropriately chosen
> constants). I decided to see if one could eliminate around half of
> these multiplications by emitting two digits at a time, at the cost of
> a 200 byte lookup table, and it does indeed seem like there is
> something to be gained, especially on 64 bits. Microbenchmarking shows
> improvements ranging from -50% (for numbers uniformly distributed in
> [0, 2^64-1]) to -25% (for numbers heavily biased toward the smaller
> end, a more realistic distribution).
> 
> On a larger scale, perf shows that top, one of the big consumers of
> /proc data, uses 0.5-1.0% fewer cpu cycles.
> 
> I had to jump through some hoops to get the 32 bit code to compile and
> run on my 64 bit machine, so I'm not sure how relevant these numbers
> are, but just for comparison the microbenchmark showed improvements
> between -30% and -10%.
> 
> The bloat-o-meter costs are around 150 bytes (the generated code is a
> little smaller, so it's not the full 200 bytes) on both 32 and 64
> bit. I'm aware that extra cache misses won't show up in a
> microbenchmark as used above, but on the other hand decimal
> conversions often happen in bulk (for example in the case of top).
> 
> I have of course tested that the new code generates the same output as
> the old, for both the first and last 1e10 numbers in [0,2^64-1] and
> 4e9 'random' numbers in-between.
> 
> ...
>
> Test and verification code on github
> <https://github.com/Villemoes/dec>.  It would be nice if someone could
> verify the code on architectures other than x86 - in particular, I
> believe it _should_ work on big-endian, but I have no way of testing
> that. Benchmark numbers would be nice too, of course.

That would be nice.

It would be best to make this as easy as possible for people.  Create a
tarball, attach that to a request sent to linuxppc-dev@lists.ozlabs.org
along with simple instructions along the lines of

	tar xfz tarball.tar.gz
	cd ./dec
	make
	./whatever

and magic might happen!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1] lib/vsprintf.c: Even faster decimal conversion
  2015-03-10 23:01               ` [PATCH v1] " Rasmus Villemoes
  2015-03-11 21:52                 ` Andrew Morton
@ 2015-03-12 18:49                 ` Jeff Epler
  2015-03-13  0:08                 ` Jeff Epler
  2015-03-13  8:58                 ` [PATCH] lib/vsprintf.c: silence sparse warnings about decpair[] initialization Rasmus Villemoes
  3 siblings, 0 replies; 19+ messages in thread
From: Jeff Epler @ 2015-03-12 18:49 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andrew Morton, Peter Zijlstra (Intel),
	Tejun Heo, Joe Perches, linux-kernel

On Wed, Mar 11, 2015 at 12:01:11AM +0100, Rasmus Villemoes wrote:
> Test and verification code on github
> <https://github.com/Villemoes/dec>.  It would be nice if someone could
> verify the code on architectures other than x86 - in particular, I
> believe it _should_ work on big-endian, but I have no way of testing
> that. Benchmark numbers would be nice too, of course.

Results from Odroid U3 (32-bit arm exynos 4412) with gcc 4.8.2 after some
trivial modifications (e.g., no -m32) (ubuntu 14.04ish, I think):

Distribution              Function         nsecs/conv   Conv/1 sec      
uniform([10, 2^64-1])     linux_put_dec          139.94          7050452
uniform([10, 2^64-1])     rv_put_dec             105.55          9278367
                          +/-                   -24.57%          +31.60%
3 + neg_binom(0.05)       linux_put_dec           69.35         14015175
3 + neg_binom(0.05)       rv_put_dec              53.00         18201548
                          +/-                   -23.58%          +29.87%
3 + neg_binom(0.10)       linux_put_dec           48.89         19798285
3 + neg_binom(0.10)       rv_put_dec              38.14         25056279
                          +/-                   -22.00%          +26.56%
3 + neg_binom(0.15)       linux_put_dec           40.69         23576136
3 + neg_binom(0.15)       rv_put_dec              31.60         30009858
                          +/-                   -22.33%          +27.29%
3 + neg_binom(0.20)       linux_put_dec           37.50         25633406
3 + neg_binom(0.20)       rv_put_dec              28.86         32689927
                          +/-                   -23.04%          +27.53%
3 + neg_binom(0.50)       linux_put_dec           28.93         33481454
3 + neg_binom(0.50)       rv_put_dec              20.11         45681278
                          +/-                   -30.47%          +36.44%

Raspberry pi 2 (32-bit ARM BCM2836) with debian jessie armhf (not
raspbian) using a copy of the gcc 4.8.2 binary:
Distribution              Function         nsecs/conv   Conv/1 sec      
uniform([10, 2^64-1])     linux_put_dec          301.66          3234768
uniform([10, 2^64-1])     rv_put_dec             224.33          4313939
                          +/-                   -25.63%          +33.36%
3 + neg_binom(0.05)       linux_put_dec          153.66          6221185
3 + neg_binom(0.05)       rv_put_dec             119.23          7938177
                          +/-                   -22.41%          +27.60%
3 + neg_binom(0.10)       linux_put_dec          117.47          8040480
3 + neg_binom(0.10)       rv_put_dec              91.78         10155002
                          +/-                   -21.87%          +26.30%
3 + neg_binom(0.15)       linux_put_dec          101.14          9254714
3 + neg_binom(0.15)       rv_put_dec              79.15         11632168
                          +/-                   -21.73%          +25.69%
3 + neg_binom(0.20)       linux_put_dec           91.49         10152753
3 + neg_binom(0.20)       rv_put_dec              71.55         12739461
                          +/-                   -21.80%          +25.48%
3 + neg_binom(0.50)       linux_put_dec           77.28         11790819
3 + neg_binom(0.50)       rv_put_dec              57.64         15372506
                          +/-                   -25.42%          +30.38%

and using 4.9.1:
Distribution              Function         nsecs/conv   Conv/1 sec      
uniform([10, 2^64-1])     linux_put_dec          324.76          3009495
uniform([10, 2^64-1])     rv_put_dec             250.85          3870099
                          +/-                   -22.76%          +28.60%
3 + neg_binom(0.05)       linux_put_dec          156.64          6090859
3 + neg_binom(0.05)       rv_put_dec             128.18          7364603
                          +/-                   -18.17%          +20.91%
3 + neg_binom(0.10)       linux_put_dec          120.54          7813490
3 + neg_binom(0.10)       rv_put_dec              99.73          9320708
                          +/-                   -17.26%          +19.29%
3 + neg_binom(0.15)       linux_put_dec          102.22          9116936
3 + neg_binom(0.15)       rv_put_dec              85.22         10778178
                          +/-                   -16.63%          +18.22%
3 + neg_binom(0.20)       linux_put_dec           94.56          9797719
3 + neg_binom(0.20)       rv_put_dec              77.60         11750535
                          +/-                   -17.94%          +19.93%
3 + neg_binom(0.50)       linux_put_dec           80.37         11379434
3 + neg_binom(0.50)       rv_put_dec              62.16         14339337
                          +/-                   -22.67%          +26.01%

Verify passed on the odroid, but I got bored and sent this mail before it finished on the rpi2.

Jeff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1] lib/vsprintf.c: Even faster decimal conversion
  2015-03-10 23:01               ` [PATCH v1] " Rasmus Villemoes
  2015-03-11 21:52                 ` Andrew Morton
  2015-03-12 18:49                 ` Jeff Epler
@ 2015-03-13  0:08                 ` Jeff Epler
  2015-03-13  0:30                   ` Jeff Epler
  2015-03-13  8:58                 ` [PATCH] lib/vsprintf.c: silence sparse warnings about decpair[] initialization Rasmus Villemoes
  3 siblings, 1 reply; 19+ messages in thread
From: Jeff Epler @ 2015-03-13  0:08 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andrew Morton, Peter Zijlstra (Intel),
	Tejun Heo, Joe Perches, linux-kernel

Since you asked about big-endian systems I also built your test program
for the armeb architecture -- which involved hacking up the test harness
fairly heavily to not require libc -- and ran the result in qemu.

Actually, it hasn't finished after 2 hours of (qemu) CPU time, but I can
tell from the output that the first phase has completed successfully but
the second hasn't finished (because it's printed 1 and 2 but not yet 3).
I am pretty sure this is because it's just rather slow in emulation, and
obviously performance figures are going to be useless.

Anyway, I built this on my armhf machine with
    gcc-4.8 -O -o testeb linux32.c rv32.c verify.c -nostdlib -mbig-endian
copied it to my fastest x86_64 desktop and ran it with
    qemu-armeb ./a.out || echo fail

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <pthread.h>
#include <string.h>

#include "common.h"

#define NTHR 1

#define LO_START 10ULL
#define LO_STOP  10000000000ULL
#define HI_START ULLONG_MAX
#define HI_STOP  ULLONG_MAX - LO_STOP

static unsigned long long lenfreq[NTHR][32];

#include <unistd.h>
#include <sys/syscall.h>
__attribute__((noinline))
static void do_exit(int i) {
    register int i0 asm("r0") = i;
    register int ss asm("r7") = SYS_exit;
    __asm__ __volatile__("swi 0x0" : : "r"(ss), "r"(i0));
}

__attribute__((noinline))
static void do_write(int fd, const char *buf, size_t n) {
    register int i0 asm("r0") = fd;
    register const char *i1 asm("r1") = buf;
    register size_t i2 asm("r2") = n;
    register int ss asm("r7") = SYS_write;
    __asm__ __volatile__("swi 0x0" : : "r"(ss), "r"(i0), "r"(i1), "r"(i2));
}

int memcmp(const void * va, const void * vb, size_t sz) {
    unsigned char *a, *b, aa, bb;
    while(sz--) {
        int diff = *a++ - *b++;
        if(diff) return diff;
    }
    return 0;
}

static int do_check(unsigned long long n, unsigned idx)
{
	char buf1[24];
	char buf2[24];
	int len1, len2;

	len1 = linux_put_dec(buf1, n) - buf1;
	len2 = rv_put_dec(buf2, n) - buf2;
	if (len1 != len2 || memcmp(buf1, buf2, len1)) {
		return -1;
	}
	lenfreq[idx][len1]++;
	return 0;
}

static void *check(void *arg)
{
	unsigned long idx = (unsigned long)arg;
	unsigned long long n;

        do_write(1, "1\n", 2);
	for (n = LO_START; n % NTHR != idx; ++n)
		;
	for (; n <= LO_STOP; n += NTHR) {
		if (do_check(n, idx))
			return (void*) -1;
	}

        do_write(1, "2\n", 2);
	for (n = HI_START; n % NTHR != idx; --n)
		;
	for (; n >= HI_STOP; n -= NTHR) {
		if (do_check(n, idx))
			return (void*) -1;
	}

	/*
	 * This will also visit a few one-digit numbers, but both the
	 * old and new code actually handle that just fine for
	 * non-zero n (it's just irrelevant because all callers of
	 * put_dec take a shortcut for n < 10).
	 */
        do_write(1, "3\n", 2);
	n = 2*idx + 1;
	do {
		if (do_check(n, idx))
			return (void*) -1;
		n *= 17179869185ull;
	} while (n != 2*idx + 1);

	return NULL;
}

int _start(void)
{
    do_write(1, ".\n", 2);
    do_exit(!!check(0));
}

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1] lib/vsprintf.c: Even faster decimal conversion
  2015-03-13  0:08                 ` Jeff Epler
@ 2015-03-13  0:30                   ` Jeff Epler
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff Epler @ 2015-03-13  0:30 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andrew Morton, Peter Zijlstra (Intel),
	Tejun Heo, Joe Perches, linux-kernel

On Thu, Mar 12, 2015 at 07:08:35PM -0500, Jeff Epler wrote:
> Since you asked about big-endian systems I also built your test program
> for the armeb architecture -- which involved hacking up the test harness
> fairly heavily to not require libc -- and ran the result in qemu.

It eventually finished successfully.

Jeff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] lib/vsprintf.c: silence sparse warnings about decpair[] initialization
  2015-03-10 23:01               ` [PATCH v1] " Rasmus Villemoes
                                   ` (2 preceding siblings ...)
  2015-03-13  0:08                 ` Jeff Epler
@ 2015-03-13  8:58                 ` Rasmus Villemoes
  2015-03-19 21:44                   ` [PATCH] lib/vsprintf.c: improve put_dec_trunc8 slightly Rasmus Villemoes
  3 siblings, 1 reply; 19+ messages in thread
From: Rasmus Villemoes @ 2015-03-13  8:58 UTC (permalink / raw)
  To: Andrew Morton, Rasmus Villemoes, Peter Zijlstra (Intel), Tejun Heo
  Cc: Joe Perches, linux-kernel

sparse is unhappy about the initialization of decpair[] and spews out
a ton of "incorrect type in initializer (different base types)". Shut
it up so useful warnings wouldn't drown in the noise.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
 lib/vsprintf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 5edbf5b8d93f..e0bea9e5bbbf 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -148,7 +148,7 @@ int skip_atoi(const char **s)
  */
 
 static const u16 decpair[100] = {
-#define _(x) cpu_to_le16(((x % 10) | ((x / 10) << 8)) + 0x3030)
+#define _(x) (__force u16) cpu_to_le16(((x % 10) | ((x / 10) << 8)) + 0x3030)
 	_( 0), _( 1), _( 2), _( 3), _( 4), _( 5), _( 6), _( 7), _( 8), _( 9),
 	_(10), _(11), _(12), _(13), _(14), _(15), _(16), _(17), _(18), _(19),
 	_(20), _(21), _(22), _(23), _(24), _(25), _(26), _(27), _(28), _(29),
-- 
2.1.3


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] lib/vsprintf.c: Even faster decimal conversion
  2015-02-20 23:51 [RFC] lib/vsprintf.c: Even faster decimal conversion Rasmus Villemoes
  2015-03-05 15:22 ` Rasmus Villemoes
@ 2015-03-18  0:50 ` Denys Vlasenko
  2015-03-18  0:52   ` Denys Vlasenko
  2015-03-18 17:10   ` Rasmus Villemoes
  1 sibling, 2 replies; 19+ messages in thread
From: Denys Vlasenko @ 2015-03-18  0:50 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andrew Morton, Peter Zijlstra (Intel),
	Tejun Heo, Linux Kernel Mailing List

On Sat, Feb 21, 2015 at 12:51 AM, Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
> The most expensive part of decimal conversion is the divisions by 10
> (albeit done using reciprocal multiplication with appropriately chosen
> constants). I decided to see if one could eliminate around half of
> these multiplications by emitting two digits at a time, at the cost of
> a 200 byte lookup table, and it does indeed seem like there is
> something to be gained, especially on 64 bits.
...
...
> +char *put_dec_trunc8(char *buf, unsigned r)
>  {
> +       unsigned q;
> +
> +       /* 1 <= r < 10^8 */
> +       if (r < 100)
> +               goto out_r;
> +
> +       /* 100 <= r < 10^8 */
> +       q = (r * (u64)0x28f5c29) >> 32;
> +       *((u16 *)buf) = decpair[r - 100*q];
> +       buf += 2;
> +
> +       /* 1 <= q < 10^6 */
> +       if (q < 100)
> +               goto out_q;
> +
> +       /*  100 <= q < 10^6 */
> +       r = (q * (u64)0x28f5c29) >> 32;
> +       *((u16 *)buf) = decpair[q - 100*r];
> +       buf += 2;
> +
> +       /* 1 <= r < 10^4 */
> +       if (r < 100)
> +               goto out_r;
> +
> +       /* 100 <= r < 10^4 */
> +       q = (r * 0x147b) >> 19;
> +       *((u16 *)buf) = decpair[r - 100*q];
> +       buf += 2;
> +out_q:
> +       /* 1 <= q < 100 */
> +       r = q;
> +out_r:
> +       /* 1 <= r < 100 */
> +       *((u16 *)buf) = decpair[r];
> +       buf += 2;
> +       if (buf[-1] == '0')
> +               buf--;
>         return buf;
>  }

Your code does four 16-bit stores.
The version below does two 32-bit ones instead,
and it is also marginally smaller.

char *put_dec_full8(char *buf, unsigned r)
{
        unsigned q;
        u32 v;

        /* 0 <= r < 10^8 */
        q = (r * (u64)0x28f5c29) >> 32;
        v = (u32)decpair[r - 100*q] << 16;

        /* 0 <= q < 10^6 */
        r = (q * (u64)0x28f5c29) >> 32;
        v = v | decpair[q - 100*r];
        ((u32*)buf)[0] = v;

        /* 0 <= r < 10^4 */
        q = (r * 0x147b) >> 19;
        v = (u32)decpair[r - 100*q] << 16;

        /* 0 <= q < 100 */
        v = v | decpair[q];
        ((u32*)buf)[1] = v;

        return buf + 8;
}

It may be faster not only because of having fewer stores,
but because on x86, this code (moving 16-bit halves):

        movw    decpair(%ebx,%ebx), %dx
        movw    %dx, 4(%eax)
        movw    decpair(%ecx,%ecx), %dx
        movw    %dx, 6(%eax)

suffers from register merge stall when 16-bit value
is read into lower part of %edx. 32-bit code
has no such stalls:

        movzwl  decpair(%ebx,%ebx), %edx
        sall    $16, %edx
        movzwl  decpair(%ecx,%ecx), %ecx
        orl     %ecx, %edx
        movl    %edx, 4(%eax)


Before you ask:
I didn't manage to extend this trick to a code
with one 64-bit store which is not larger.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] lib/vsprintf.c: Even faster decimal conversion
  2015-03-18  0:50 ` [RFC] lib/vsprintf.c: Even faster decimal conversion Denys Vlasenko
@ 2015-03-18  0:52   ` Denys Vlasenko
  2015-03-18 17:10   ` Rasmus Villemoes
  1 sibling, 0 replies; 19+ messages in thread
From: Denys Vlasenko @ 2015-03-18  0:52 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andrew Morton, Peter Zijlstra (Intel),
	Tejun Heo, Linux Kernel Mailing List

On Wed, Mar 18, 2015 at 1:50 AM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Sat, Feb 21, 2015 at 12:51 AM, Rasmus Villemoes
> <linux@rasmusvillemoes.dk> wrote:
>> The most expensive part of decimal conversion is the divisions by 10
>> (albeit done using reciprocal multiplication with appropriately chosen
>> constants). I decided to see if one could eliminate around half of
>> these multiplications by emitting two digits at a time, at the cost of
>> a 200 byte lookup table, and it does indeed seem like there is
>> something to be gained, especially on 64 bits.
> ...
> ...
>> +char *put_dec_trunc8(char *buf, unsigned r)
>>  {
>> +       unsigned q;
>> +
>> +       /* 1 <= r < 10^8 */
>> +       if (r < 100)
>> +               goto out_r;
>> +
>> +       /* 100 <= r < 10^8 */
>> +       q = (r * (u64)0x28f5c29) >> 32;
>> +       *((u16 *)buf) = decpair[r - 100*q];
>> +       buf += 2;
>> +
>> +       /* 1 <= q < 10^6 */
>> +       if (q < 100)
>> +               goto out_q;
>> +
>> +       /*  100 <= q < 10^6 */
>> +       r = (q * (u64)0x28f5c29) >> 32;
>> +       *((u16 *)buf) = decpair[q - 100*r];
>> +       buf += 2;
>> +
>> +       /* 1 <= r < 10^4 */
>> +       if (r < 100)
>> +               goto out_r;
>> +
>> +       /* 100 <= r < 10^4 */
>> +       q = (r * 0x147b) >> 19;
>> +       *((u16 *)buf) = decpair[r - 100*q];
>> +       buf += 2;
>> +out_q:
>> +       /* 1 <= q < 100 */
>> +       r = q;
>> +out_r:
>> +       /* 1 <= r < 100 */
>> +       *((u16 *)buf) = decpair[r];
>> +       buf += 2;
>> +       if (buf[-1] == '0')
>> +               buf--;
>>         return buf;
>>  }
>
> Your code does four 16-bit stores.

I quoted wrong part of your email.
Of course, I meant put_dec_full8(), not put_dec_trunc8().

> The version below does two 32-bit ones instead,
> and it is also marginally smaller.
>
> char *put_dec_full8(char *buf, unsigned r)
> {
>         unsigned q;
>         u32 v;
>
>         /* 0 <= r < 10^8 */
>         q = (r * (u64)0x28f5c29) >> 32;
>         v = (u32)decpair[r - 100*q] << 16;
>
>         /* 0 <= q < 10^6 */
>         r = (q * (u64)0x28f5c29) >> 32;
>         v = v | decpair[q - 100*r];
>         ((u32*)buf)[0] = v;
>
>         /* 0 <= r < 10^4 */
>         q = (r * 0x147b) >> 19;
>         v = (u32)decpair[r - 100*q] << 16;
>
>         /* 0 <= q < 100 */
>         v = v | decpair[q];
>         ((u32*)buf)[1] = v;
>
>         return buf + 8;
> }
>
> It may be faster not only because of having fewer stores,
> but because on x86, this code (moving 16-bit halves):
>
>         movw    decpair(%ebx,%ebx), %dx
>         movw    %dx, 4(%eax)
>         movw    decpair(%ecx,%ecx), %dx
>         movw    %dx, 6(%eax)
>
> suffers from register merge stall when 16-bit value
> is read into lower part of %edx. 32-bit code
> has no such stalls:
>
>         movzwl  decpair(%ebx,%ebx), %edx
>         sall    $16, %edx
>         movzwl  decpair(%ecx,%ecx), %ecx
>         orl     %ecx, %edx
>         movl    %edx, 4(%eax)
>
>
> Before you ask:
> I didn't manage to extend this trick to a code
> with one 64-bit store which is not larger.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] lib/vsprintf.c: Even faster decimal conversion
  2015-03-18  0:50 ` [RFC] lib/vsprintf.c: Even faster decimal conversion Denys Vlasenko
  2015-03-18  0:52   ` Denys Vlasenko
@ 2015-03-18 17:10   ` Rasmus Villemoes
  1 sibling, 0 replies; 19+ messages in thread
From: Rasmus Villemoes @ 2015-03-18 17:10 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andrew Morton, Peter Zijlstra (Intel),
	Tejun Heo, Linux Kernel Mailing List

On Wed, Mar 18 2015, Denys Vlasenko <vda.linux@googlemail.com> wrote:


> Your code does four 16-bit stores.
> The version below does two 32-bit ones instead,
> and it is also marginally smaller.
>
> char *put_dec_full8(char *buf, unsigned r)
> {
>         unsigned q;
>         u32 v;
>
>         /* 0 <= r < 10^8 */
>         q = (r * (u64)0x28f5c29) >> 32;
>         v = (u32)decpair[r - 100*q] << 16;
>
>         /* 0 <= q < 10^6 */
>         r = (q * (u64)0x28f5c29) >> 32;
>         v = v | decpair[q - 100*r];
>         ((u32*)buf)[0] = v;
>
>         /* 0 <= r < 10^4 */
>         q = (r * 0x147b) >> 19;
>         v = (u32)decpair[r - 100*q] << 16;
>
>         /* 0 <= q < 100 */
>         v = v | decpair[q];
>         ((u32*)buf)[1] = v;
>
>         return buf + 8;
> }
>
> It may be faster not only because of having fewer stores,
> but because on x86, this code (moving 16-bit halves):
>
>         movw    decpair(%ebx,%ebx), %dx
>         movw    %dx, 4(%eax)
>         movw    decpair(%ecx,%ecx), %dx
>         movw    %dx, 6(%eax)
>
> suffers from register merge stall when 16-bit value
> is read into lower part of %edx. 32-bit code
> has no such stalls:
>
>         movzwl  decpair(%ebx,%ebx), %edx
>         sall    $16, %edx
>         movzwl  decpair(%ecx,%ecx), %ecx
>         orl     %ecx, %edx
>         movl    %edx, 4(%eax)
>

[On little-endian, I'm pretty sure the <<16 should be applied to the
second and fourth decpair value.]

Thanks for the suggestion. However, I don't see any change in the size
of the generated code (gcc 4.7), and, at least on my Xeon machine,
converting both ULONG_MAX and uniformly random u64s becomes slightly
slower (54 vs 56 cycles and 61 vs 65 cycles).

Rasmus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1] lib/vsprintf.c: Even faster decimal conversion
  2015-03-11 21:52                 ` Andrew Morton
@ 2015-03-19 21:41                   ` Rasmus Villemoes
  0 siblings, 0 replies; 19+ messages in thread
From: Rasmus Villemoes @ 2015-03-19 21:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra (Intel),
	Tejun Heo, Joe Perches, linux-kernel, Nishanth Aravamudan

On Wed, Mar 11 2015, Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 11 Mar 2015 00:01:11 +0100 Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:
>> Test and verification code on github
>> <https://github.com/Villemoes/dec>.  It would be nice if someone could
>> verify the code on architectures other than x86 - in particular, I
>> believe it _should_ work on big-endian, but I have no way of testing
>> that. Benchmark numbers would be nice too, of course.
>
> That would be nice.
>
> It would be best to make this as easy as possible for people.

Verification passed on both ppc64le and ppc64; the microbenchmark showed
improvements in 13-43% range for ppc64le and 13-34% for ppc64.

Thanks to Nishanth Aravamudan.

Rasmus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] lib/vsprintf.c: improve put_dec_trunc8 slightly
  2015-03-13  8:58                 ` [PATCH] lib/vsprintf.c: silence sparse warnings about decpair[] initialization Rasmus Villemoes
@ 2015-03-19 21:44                   ` Rasmus Villemoes
  0 siblings, 0 replies; 19+ messages in thread
From: Rasmus Villemoes @ 2015-03-19 21:44 UTC (permalink / raw)
  To: Andrew Morton, Rasmus Villemoes, Peter Zijlstra (Intel), Tejun Heo
  Cc: linux-kernel

I hadn't had enough coffee when I wrote this. Currently, the final
increment of buf depends on the value loaded from the table, and
causes gcc to emit a cmov immediately before the return. It is smarter
to let it depend on r, since the increment can then be computed in
parallel with the final load/store pair. It also shaves 16 bytes of
.text.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
 lib/vsprintf.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index e0bea9e5bbbf..7078d90c187b 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -164,9 +164,9 @@ static const u16 decpair[100] = {
 
 /*
  * This will print a single '0' even if r == 0, since we would
- * immediately jump to out_r where two 0s would be written and one of
- * them then discarded. This is needed by ip4_string below. All other
- * callers pass a non-zero value of r.
+ * immediately jump to out_r where two 0s would be written but only
+ * one of them accounted for in buf. This is needed by ip4_string
+ * below. All other callers pass a non-zero value of r.
 */
 static noinline_for_stack
 char *put_dec_trunc8(char *buf, unsigned r)
@@ -205,9 +205,7 @@ out_q:
 out_r:
 	/* 1 <= r < 100 */
 	*((u16 *)buf) = decpair[r];
-	buf += 2;
-	if (buf[-1] == '0')
-		buf--;
+	buf += r < 10 ? 1 : 2;
 	return buf;
 }
 
-- 
2.1.3


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2015-03-19 21:45 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-20 23:51 [RFC] lib/vsprintf.c: Even faster decimal conversion Rasmus Villemoes
2015-03-05 15:22 ` Rasmus Villemoes
2015-03-05 16:03   ` Joe Perches
2015-03-05 16:10     ` Tejun Heo
2015-03-05 22:24       ` Rasmus Villemoes
2015-03-10 10:47         ` Rasmus Villemoes
2015-03-10 12:42           ` Tejun Heo
2015-03-10 12:57             ` Rasmus Villemoes
2015-03-10 23:01               ` [PATCH v1] " Rasmus Villemoes
2015-03-11 21:52                 ` Andrew Morton
2015-03-19 21:41                   ` Rasmus Villemoes
2015-03-12 18:49                 ` Jeff Epler
2015-03-13  0:08                 ` Jeff Epler
2015-03-13  0:30                   ` Jeff Epler
2015-03-13  8:58                 ` [PATCH] lib/vsprintf.c: silence sparse warnings about decpair[] initialization Rasmus Villemoes
2015-03-19 21:44                   ` [PATCH] lib/vsprintf.c: improve put_dec_trunc8 slightly Rasmus Villemoes
2015-03-18  0:50 ` [RFC] lib/vsprintf.c: Even faster decimal conversion Denys Vlasenko
2015-03-18  0:52   ` Denys Vlasenko
2015-03-18 17:10   ` Rasmus Villemoes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).