LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v5 0/3] riscv: optimized mem* functions
@ 2021-09-29 17:22 Matteo Croce
  2021-09-29 17:22 ` [PATCH v5 1/3] riscv: optimized memcpy Matteo Croce
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Matteo Croce @ 2021-09-29 17:22 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng, David Laight, Guo Ren, Christoph Hellwig

From: Matteo Croce <mcroce@microsoft.com>

Replace the assembly mem{cpy,move,set} with C equivalent.

Try to access RAM with the largest bit width possible, but without
doing unaligned accesses.

A further improvement could be to use multiple read and writes as the
assembly version was trying to do.

Tested on a BeagleV Starlight with a SiFive U74 core, where the
improvement is noticeable.

v4 -> v5:
- call __memcpy() instead of memcpy() in memmove()

v3 -> v4:
- incorporate changes from proposed generic version:
  https://lore.kernel.org/lkml/20210702123153.14093-1-mcroce@linux.microsoft.com/

v2 -> v3:
- alias mem* to __mem* and not viceversa
- use __alias instead of a tail call

v1 -> v2:
- reduce the threshold from 64 to 16 bytes
- fix KASAN build
- optimize memset

Matteo Croce (3):
  riscv: optimized memcpy
  riscv: optimized memmove
  riscv: optimized memset

 arch/riscv/include/asm/string.h |  18 ++--
 arch/riscv/kernel/Makefile      |   1 -
 arch/riscv/kernel/riscv_ksyms.c |  17 ----
 arch/riscv/lib/Makefile         |   4 +-
 arch/riscv/lib/memcpy.S         | 108 ----------------------
 arch/riscv/lib/memmove.S        |  64 -------------
 arch/riscv/lib/memset.S         | 113 -----------------------
 arch/riscv/lib/string.c         | 154 ++++++++++++++++++++++++++++++++
 8 files changed, 164 insertions(+), 315 deletions(-)
 delete mode 100644 arch/riscv/kernel/riscv_ksyms.c
 delete mode 100644 arch/riscv/lib/memcpy.S
 delete mode 100644 arch/riscv/lib/memmove.S
 delete mode 100644 arch/riscv/lib/memset.S
 create mode 100644 arch/riscv/lib/string.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v5 1/3] riscv: optimized memcpy
  2021-09-29 17:22 [PATCH v5 0/3] riscv: optimized mem* functions Matteo Croce
@ 2021-09-29 17:22 ` Matteo Croce
  2021-09-30  1:24   ` Guo Ren
  2021-10-01 11:23   ` Pavel Machek
  2021-09-29 17:22 ` [PATCH v5 2/3] riscv: optimized memmove Matteo Croce
  2021-09-29 17:22 ` [PATCH v5 3/3] riscv: optimized memset Matteo Croce
  2 siblings, 2 replies; 12+ messages in thread
From: Matteo Croce @ 2021-09-29 17:22 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng, David Laight, Guo Ren, Christoph Hellwig

From: Matteo Croce <mcroce@microsoft.com>

Write a C version of memcpy() which uses the biggest data size allowed,
without generating unaligned accesses.

The procedure is made of three steps:
First copy data one byte at time until the destination buffer is aligned
to a long boundary.
Then copy the data one long at time shifting the current and the next u8
to compose a long at every cycle.
Finally, copy the remainder one byte at time.

On a BeagleV, the TCP RX throughput increased by 45%:

before:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
[  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
[  5]   2.00-3.00   sec  73.8 MBytes   619 Mbits/sec   10    451 KBytes
[  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0    564 KBytes
[  5]   4.00-5.00   sec  73.8 MBytes   619 Mbits/sec    0    658 KBytes
[  5]   5.00-6.00   sec  73.8 MBytes   619 Mbits/sec   14    522 KBytes
[  5]   6.00-7.00   sec  73.8 MBytes   619 Mbits/sec    0    621 KBytes
[  5]   7.00-8.00   sec  72.5 MBytes   608 Mbits/sec    0    706 KBytes
[  5]   8.00-9.00   sec  73.8 MBytes   619 Mbits/sec   20    580 KBytes
[  5]   9.00-10.00  sec  73.8 MBytes   619 Mbits/sec    0    672 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   736 MBytes   618 Mbits/sec   71             sender
[  5]   0.00-10.01  sec   733 MBytes   615 Mbits/sec                  receiver

after:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
[  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690 KBytes
[  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec   36    396 KBytes
[  5]   3.00-4.00   sec   108 MBytes   902 Mbits/sec    0    567 KBytes
[  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0    699 KBytes
[  5]   5.00-6.00   sec   106 MBytes   891 Mbits/sec   32    414 KBytes
[  5]   6.00-7.00   sec   106 MBytes   891 Mbits/sec    0    583 KBytes
[  5]   7.00-8.00   sec   106 MBytes   891 Mbits/sec    0    708 KBytes
[  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec   28    433 KBytes
[  5]   9.00-10.00  sec   108 MBytes   902 Mbits/sec    0    591 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.04 GBytes   897 Mbits/sec  144             sender
[  5]   0.00-10.01  sec  1.04 GBytes   894 Mbits/sec                  receiver

And the decreased CPU time of the memcpy() is observable with perf top.
This is the `perf top -Ue task-clock` output when doing the test:

before:

Overhead  Shared O  Symbol
  42.22%  [kernel]  [k] memcpy
  35.00%  [kernel]  [k] __asm_copy_to_user
   3.50%  [kernel]  [k] sifive_l2_flush64_range
   2.30%  [kernel]  [k] stmmac_napi_poll_rx
   1.11%  [kernel]  [k] memset

after:

Overhead  Shared O  Symbol
  45.69%  [kernel]  [k] __asm_copy_to_user
  29.06%  [kernel]  [k] memcpy
   4.09%  [kernel]  [k] sifive_l2_flush64_range
   2.77%  [kernel]  [k] stmmac_napi_poll_rx
   1.24%  [kernel]  [k] memset

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 arch/riscv/include/asm/string.h |   8 ++-
 arch/riscv/kernel/riscv_ksyms.c |   2 -
 arch/riscv/lib/Makefile         |   2 +-
 arch/riscv/lib/memcpy.S         | 108 --------------------------------
 arch/riscv/lib/string.c         |  90 ++++++++++++++++++++++++++
 5 files changed, 97 insertions(+), 113 deletions(-)
 delete mode 100644 arch/riscv/lib/memcpy.S
 create mode 100644 arch/riscv/lib/string.c

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 909049366555..6b5d6fc3eab4 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -12,9 +12,13 @@
 #define __HAVE_ARCH_MEMSET
 extern asmlinkage void *memset(void *, int, size_t);
 extern asmlinkage void *__memset(void *, int, size_t);
+
+#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
 #define __HAVE_ARCH_MEMCPY
-extern asmlinkage void *memcpy(void *, const void *, size_t);
-extern asmlinkage void *__memcpy(void *, const void *, size_t);
+extern void *memcpy(void *dest, const void *src, size_t count);
+extern void *__memcpy(void *dest, const void *src, size_t count);
+#endif
+
 #define __HAVE_ARCH_MEMMOVE
 extern asmlinkage void *memmove(void *, const void *, size_t);
 extern asmlinkage void *__memmove(void *, const void *, size_t);
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index 5ab1c7e1a6ed..3f6d512a5b97 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,8 +10,6 @@
  * Assembly functions that may be used (directly or indirectly) by modules
  */
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memcpy);
 EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 25d5c9664e57..2ffe85d4baee 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,9 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
-lib-y			+= memcpy.o
 lib-y			+= memset.o
 lib-y			+= memmove.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
+lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
 
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
deleted file mode 100644
index 51ab716253fa..000000000000
--- a/arch/riscv/lib/memcpy.S
+++ /dev/null
@@ -1,108 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Regents of the University of California
- */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-/* void *memcpy(void *, const void *, size_t) */
-ENTRY(__memcpy)
-WEAK(memcpy)
-	move t6, a0  /* Preserve return value */
-
-	/* Defer to byte-oriented copy for small sizes */
-	sltiu a3, a2, 128
-	bnez a3, 4f
-	/* Use word-oriented copy only if low-order bits match */
-	andi a3, t6, SZREG-1
-	andi a4, a1, SZREG-1
-	bne a3, a4, 4f
-
-	beqz a3, 2f  /* Skip if already aligned */
-	/*
-	 * Round to nearest double word-aligned address
-	 * greater than or equal to start address
-	 */
-	andi a3, a1, ~(SZREG-1)
-	addi a3, a3, SZREG
-	/* Handle initial misalignment */
-	sub a4, a3, a1
-1:
-	lb a5, 0(a1)
-	addi a1, a1, 1
-	sb a5, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 1b
-	sub a2, a2, a4  /* Update count */
-
-2:
-	andi a4, a2, ~((16*SZREG)-1)
-	beqz a4, 4f
-	add a3, a1, a4
-3:
-	REG_L a4,       0(a1)
-	REG_L a5,   SZREG(a1)
-	REG_L a6, 2*SZREG(a1)
-	REG_L a7, 3*SZREG(a1)
-	REG_L t0, 4*SZREG(a1)
-	REG_L t1, 5*SZREG(a1)
-	REG_L t2, 6*SZREG(a1)
-	REG_L t3, 7*SZREG(a1)
-	REG_L t4, 8*SZREG(a1)
-	REG_L t5, 9*SZREG(a1)
-	REG_S a4,       0(t6)
-	REG_S a5,   SZREG(t6)
-	REG_S a6, 2*SZREG(t6)
-	REG_S a7, 3*SZREG(t6)
-	REG_S t0, 4*SZREG(t6)
-	REG_S t1, 5*SZREG(t6)
-	REG_S t2, 6*SZREG(t6)
-	REG_S t3, 7*SZREG(t6)
-	REG_S t4, 8*SZREG(t6)
-	REG_S t5, 9*SZREG(t6)
-	REG_L a4, 10*SZREG(a1)
-	REG_L a5, 11*SZREG(a1)
-	REG_L a6, 12*SZREG(a1)
-	REG_L a7, 13*SZREG(a1)
-	REG_L t0, 14*SZREG(a1)
-	REG_L t1, 15*SZREG(a1)
-	addi a1, a1, 16*SZREG
-	REG_S a4, 10*SZREG(t6)
-	REG_S a5, 11*SZREG(t6)
-	REG_S a6, 12*SZREG(t6)
-	REG_S a7, 13*SZREG(t6)
-	REG_S t0, 14*SZREG(t6)
-	REG_S t1, 15*SZREG(t6)
-	addi t6, t6, 16*SZREG
-	bltu a1, a3, 3b
-	andi a2, a2, (16*SZREG)-1  /* Update count */
-
-4:
-	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, a1, a2
-
-	/* Use word-oriented copy if co-aligned to word boundary */
-	or a5, a1, t6
-	or a5, a5, a3
-	andi a5, a5, 3
-	bnez a5, 5f
-7:
-	lw a4, 0(a1)
-	addi a1, a1, 4
-	sw a4, 0(t6)
-	addi t6, t6, 4
-	bltu a1, a3, 7b
-
-	ret
-
-5:
-	lb a4, 0(a1)
-	addi a1, a1, 1
-	sb a4, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 5b
-6:
-	ret
-END(__memcpy)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
new file mode 100644
index 000000000000..bfc912ee23f8
--- /dev/null
+++ b/arch/riscv/lib/string.c
@@ -0,0 +1,90 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * String functions optimized for hardware which doesn't
+ * handle unaligned memory accesses efficiently.
+ *
+ * Copyright (C) 2021 Matteo Croce
+ */
+
+#include <linux/types.h>
+#include <linux/module.h>
+
+/* Minimum size for a word copy to be convenient */
+#define BYTES_LONG	sizeof(long)
+#define WORD_MASK	(BYTES_LONG - 1)
+#define MIN_THRESHOLD	(BYTES_LONG * 2)
+
+/* convenience union to avoid cast between different pointer types */
+union types {
+	u8 *as_u8;
+	unsigned long *as_ulong;
+	uintptr_t as_uptr;
+};
+
+union const_types {
+	const u8 *as_u8;
+	unsigned long *as_ulong;
+	uintptr_t as_uptr;
+};
+
+void *__memcpy(void *dest, const void *src, size_t count)
+{
+	union const_types s = { .as_u8 = src };
+	union types d = { .as_u8 = dest };
+	int distance = 0;
+
+	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+		if (count < MIN_THRESHOLD)
+			goto copy_remainder;
+
+		/* Copy a byte at time until destination is aligned. */
+		for (; d.as_uptr & WORD_MASK; count--)
+			*d.as_u8++ = *s.as_u8++;
+
+		distance = s.as_uptr & WORD_MASK;
+	}
+
+	if (distance) {
+		unsigned long last, next;
+
+		/*
+		 * s is distance bytes ahead of d, and d just reached
+		 * the alignment boundary. Move s backward to word align it
+		 * and shift data to compensate for distance, in order to do
+		 * word-by-word copy.
+		 */
+		s.as_u8 -= distance;
+
+		next = s.as_ulong[0];
+		for (; count >= BYTES_LONG; count -= BYTES_LONG) {
+			last = next;
+			next = s.as_ulong[1];
+
+			d.as_ulong[0] = last >> (distance * 8) |
+					next << ((BYTES_LONG - distance) * 8);
+
+			d.as_ulong++;
+			s.as_ulong++;
+		}
+
+		/* Restore s with the original offset. */
+		s.as_u8 += distance;
+	} else {
+		/*
+		 * If the source and dest lower bits are the same, do a simple
+		 * 32/64 bit wide copy.
+		 */
+		for (; count >= BYTES_LONG; count -= BYTES_LONG)
+			*d.as_ulong++ = *s.as_ulong++;
+	}
+
+copy_remainder:
+	while (count--)
+		*d.as_u8++ = *s.as_u8++;
+
+	return dest;
+}
+EXPORT_SYMBOL(__memcpy);
+
+void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);
+EXPORT_SYMBOL(memcpy);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v5 2/3] riscv: optimized memmove
  2021-09-29 17:22 [PATCH v5 0/3] riscv: optimized mem* functions Matteo Croce
  2021-09-29 17:22 ` [PATCH v5 1/3] riscv: optimized memcpy Matteo Croce
@ 2021-09-29 17:22 ` Matteo Croce
  2021-09-29 17:22 ` [PATCH v5 3/3] riscv: optimized memset Matteo Croce
  2 siblings, 0 replies; 12+ messages in thread
From: Matteo Croce @ 2021-09-29 17:22 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng, David Laight, Guo Ren, Christoph Hellwig

From: Matteo Croce <mcroce@microsoft.com>

When the destination buffer is before the source one, or when the
buffers doesn't overlap, it's safe to use memcpy() instead, which is
optimized to use a bigger data size possible.

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 arch/riscv/include/asm/string.h |  6 ++--
 arch/riscv/kernel/riscv_ksyms.c |  2 --
 arch/riscv/lib/Makefile         |  1 -
 arch/riscv/lib/memmove.S        | 64 ---------------------------------
 arch/riscv/lib/string.c         | 23 ++++++++++++
 5 files changed, 26 insertions(+), 70 deletions(-)
 delete mode 100644 arch/riscv/lib/memmove.S

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 6b5d6fc3eab4..25d9b9078569 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -17,11 +17,11 @@ extern asmlinkage void *__memset(void *, int, size_t);
 #define __HAVE_ARCH_MEMCPY
 extern void *memcpy(void *dest, const void *src, size_t count);
 extern void *__memcpy(void *dest, const void *src, size_t count);
+#define __HAVE_ARCH_MEMMOVE
+extern void *memmove(void *dest, const void *src, size_t count);
+extern void *__memmove(void *dest, const void *src, size_t count);
 #endif
 
-#define __HAVE_ARCH_MEMMOVE
-extern asmlinkage void *memmove(void *, const void *, size_t);
-extern asmlinkage void *__memmove(void *, const void *, size_t);
 /* For those files which don't want to check by kasan. */
 #if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
 #define memcpy(dst, src, len) __memcpy(dst, src, len)
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index 3f6d512a5b97..361565c4db7e 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,6 +10,4 @@
  * Assembly functions that may be used (directly or indirectly) by modules
  */
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 2ffe85d4baee..484f5ff7b508 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,7 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
 lib-y			+= memset.o
-lib-y			+= memmove.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S
deleted file mode 100644
index 07d1d2152ba5..000000000000
--- a/arch/riscv/lib/memmove.S
+++ /dev/null
@@ -1,64 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-ENTRY(__memmove)
-WEAK(memmove)
-        move    t0, a0
-        move    t1, a1
-
-        beq     a0, a1, exit_memcpy
-        beqz    a2, exit_memcpy
-        srli    t2, a2, 0x2
-
-        slt     t3, a0, a1
-        beqz    t3, do_reverse
-
-        andi    a2, a2, 0x3
-        li      t4, 1
-        beqz    t2, byte_copy
-
-word_copy:
-        lw      t3, 0(a1)
-        addi    t2, t2, -1
-        addi    a1, a1, 4
-        sw      t3, 0(a0)
-        addi    a0, a0, 4
-        bnez    t2, word_copy
-        beqz    a2, exit_memcpy
-        j       byte_copy
-
-do_reverse:
-        add     a0, a0, a2
-        add     a1, a1, a2
-        andi    a2, a2, 0x3
-        li      t4, -1
-        beqz    t2, reverse_byte_copy
-
-reverse_word_copy:
-        addi    a1, a1, -4
-        addi    t2, t2, -1
-        lw      t3, 0(a1)
-        addi    a0, a0, -4
-        sw      t3, 0(a0)
-        bnez    t2, reverse_word_copy
-        beqz    a2, exit_memcpy
-
-reverse_byte_copy:
-        addi    a0, a0, -1
-        addi    a1, a1, -1
-
-byte_copy:
-        lb      t3, 0(a1)
-        addi    a2, a2, -1
-        sb      t3, 0(a0)
-        add     a1, a1, t4
-        add     a0, a0, t4
-        bnez    a2, byte_copy
-
-exit_memcpy:
-        move a0, t0
-        move a1, t1
-        ret
-END(__memmove)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
index bfc912ee23f8..4442ca5ef13c 100644
--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -88,3 +88,26 @@ EXPORT_SYMBOL(__memcpy);
 
 void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);
 EXPORT_SYMBOL(memcpy);
+
+/*
+ * Simply check if the buffer overlaps an call memcpy() in case,
+ * otherwise do a simple one byte at time backward copy.
+ */
+void *__memmove(void *dest, const void *src, size_t count)
+{
+	if (dest < src || src + count <= dest)
+		return __memcpy(dest, src, count);
+
+	if (dest > src) {
+		const char *s = src + count;
+		char *tmp = dest + count;
+
+		while (count--)
+			*--tmp = *--s;
+	}
+	return dest;
+}
+EXPORT_SYMBOL(__memmove);
+
+void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmove);
+EXPORT_SYMBOL(memmove);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v5 3/3] riscv: optimized memset
  2021-09-29 17:22 [PATCH v5 0/3] riscv: optimized mem* functions Matteo Croce
  2021-09-29 17:22 ` [PATCH v5 1/3] riscv: optimized memcpy Matteo Croce
  2021-09-29 17:22 ` [PATCH v5 2/3] riscv: optimized memmove Matteo Croce
@ 2021-09-29 17:22 ` Matteo Croce
  2021-11-25 10:56   ` Ley Foon Tan
  2 siblings, 1 reply; 12+ messages in thread
From: Matteo Croce @ 2021-09-29 17:22 UTC (permalink / raw)
  To: linux-riscv
  Cc: linux-kernel, linux-arch, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Atish Patra, Emil Renner Berthing, Akira Tsukamoto,
	Drew Fustini, Bin Meng, David Laight, Guo Ren, Christoph Hellwig

From: Matteo Croce <mcroce@microsoft.com>

The generic memset is defined as a byte at time write. This is always
safe, but it's slower than a 4 byte or even 8 byte write.

Write a generic memset which fills the data one byte at time until the
destination is aligned, then fills using the largest size allowed,
and finally fills the remaining data one byte at time.

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 arch/riscv/include/asm/string.h |  10 +--
 arch/riscv/kernel/Makefile      |   1 -
 arch/riscv/kernel/riscv_ksyms.c |  13 ----
 arch/riscv/lib/Makefile         |   1 -
 arch/riscv/lib/memset.S         | 113 --------------------------------
 arch/riscv/lib/string.c         |  41 ++++++++++++
 6 files changed, 44 insertions(+), 135 deletions(-)
 delete mode 100644 arch/riscv/kernel/riscv_ksyms.c
 delete mode 100644 arch/riscv/lib/memset.S

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 25d9b9078569..90500635035a 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -6,14 +6,10 @@
 #ifndef _ASM_RISCV_STRING_H
 #define _ASM_RISCV_STRING_H
 
-#include <linux/types.h>
-#include <linux/linkage.h>
-
-#define __HAVE_ARCH_MEMSET
-extern asmlinkage void *memset(void *, int, size_t);
-extern asmlinkage void *__memset(void *, int, size_t);
-
 #ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
+#define __HAVE_ARCH_MEMSET
+extern void *memset(void *s, int c, size_t count);
+extern void *__memset(void *s, int c, size_t count);
 #define __HAVE_ARCH_MEMCPY
 extern void *memcpy(void *dest, const void *src, size_t count);
 extern void *__memcpy(void *dest, const void *src, size_t count);
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 3397ddac1a30..fecf03822435 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -31,7 +31,6 @@ obj-y	+= syscall_table.o
 obj-y	+= sys_riscv.o
 obj-y	+= time.o
 obj-y	+= traps.o
-obj-y	+= riscv_ksyms.o
 obj-y	+= stacktrace.o
 obj-y	+= cacheinfo.o
 obj-y	+= patch.o
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
deleted file mode 100644
index 361565c4db7e..000000000000
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ /dev/null
@@ -1,13 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Copyright (C) 2017 Zihao Yu
- */
-
-#include <linux/export.h>
-#include <linux/uaccess.h>
-
-/*
- * Assembly functions that may be used (directly or indirectly) by modules
- */
-EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(__memset);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 484f5ff7b508..e33263cc622a 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,6 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
-lib-y			+= memset.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
deleted file mode 100644
index 34c5360c6705..000000000000
--- a/arch/riscv/lib/memset.S
+++ /dev/null
@@ -1,113 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Regents of the University of California
- */
-
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-/* void *memset(void *, int, size_t) */
-ENTRY(__memset)
-WEAK(memset)
-	move t0, a0  /* Preserve return value */
-
-	/* Defer to byte-oriented fill for small sizes */
-	sltiu a3, a2, 16
-	bnez a3, 4f
-
-	/*
-	 * Round to nearest XLEN-aligned address
-	 * greater than or equal to start address
-	 */
-	addi a3, t0, SZREG-1
-	andi a3, a3, ~(SZREG-1)
-	beq a3, t0, 2f  /* Skip if already aligned */
-	/* Handle initial misalignment */
-	sub a4, a3, t0
-1:
-	sb a1, 0(t0)
-	addi t0, t0, 1
-	bltu t0, a3, 1b
-	sub a2, a2, a4  /* Update count */
-
-2: /* Duff's device with 32 XLEN stores per iteration */
-	/* Broadcast value into all bytes */
-	andi a1, a1, 0xff
-	slli a3, a1, 8
-	or a1, a3, a1
-	slli a3, a1, 16
-	or a1, a3, a1
-#ifdef CONFIG_64BIT
-	slli a3, a1, 32
-	or a1, a3, a1
-#endif
-
-	/* Calculate end address */
-	andi a4, a2, ~(SZREG-1)
-	add a3, t0, a4
-
-	andi a4, a4, 31*SZREG  /* Calculate remainder */
-	beqz a4, 3f            /* Shortcut if no remainder */
-	neg a4, a4
-	addi a4, a4, 32*SZREG  /* Calculate initial offset */
-
-	/* Adjust start address with offset */
-	sub t0, t0, a4
-
-	/* Jump into loop body */
-	/* Assumes 32-bit instruction lengths */
-	la a5, 3f
-#ifdef CONFIG_64BIT
-	srli a4, a4, 1
-#endif
-	add a5, a5, a4
-	jr a5
-3:
-	REG_S a1,        0(t0)
-	REG_S a1,    SZREG(t0)
-	REG_S a1,  2*SZREG(t0)
-	REG_S a1,  3*SZREG(t0)
-	REG_S a1,  4*SZREG(t0)
-	REG_S a1,  5*SZREG(t0)
-	REG_S a1,  6*SZREG(t0)
-	REG_S a1,  7*SZREG(t0)
-	REG_S a1,  8*SZREG(t0)
-	REG_S a1,  9*SZREG(t0)
-	REG_S a1, 10*SZREG(t0)
-	REG_S a1, 11*SZREG(t0)
-	REG_S a1, 12*SZREG(t0)
-	REG_S a1, 13*SZREG(t0)
-	REG_S a1, 14*SZREG(t0)
-	REG_S a1, 15*SZREG(t0)
-	REG_S a1, 16*SZREG(t0)
-	REG_S a1, 17*SZREG(t0)
-	REG_S a1, 18*SZREG(t0)
-	REG_S a1, 19*SZREG(t0)
-	REG_S a1, 20*SZREG(t0)
-	REG_S a1, 21*SZREG(t0)
-	REG_S a1, 22*SZREG(t0)
-	REG_S a1, 23*SZREG(t0)
-	REG_S a1, 24*SZREG(t0)
-	REG_S a1, 25*SZREG(t0)
-	REG_S a1, 26*SZREG(t0)
-	REG_S a1, 27*SZREG(t0)
-	REG_S a1, 28*SZREG(t0)
-	REG_S a1, 29*SZREG(t0)
-	REG_S a1, 30*SZREG(t0)
-	REG_S a1, 31*SZREG(t0)
-	addi t0, t0, 32*SZREG
-	bltu t0, a3, 3b
-	andi a2, a2, SZREG-1  /* Update count */
-
-4:
-	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, t0, a2
-5:
-	sb a1, 0(t0)
-	addi t0, t0, 1
-	bltu t0, a3, 5b
-6:
-	ret
-END(__memset)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
index 4442ca5ef13c..871ffbfac0ab 100644
--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -111,3 +111,44 @@ EXPORT_SYMBOL(__memmove);
 
 void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmove);
 EXPORT_SYMBOL(memmove);
+
+void *__memset(void *s, int c, size_t count)
+{
+	union types dest = { .as_u8 = s };
+
+	if (count >= MIN_THRESHOLD) {
+		unsigned long cu = (unsigned long)c;
+
+		/* Compose an ulong with 'c' repeated 4/8 times */
+#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
+		cu *= 0x0101010101010101UL;
+#else
+		cu |= cu << 8;
+		cu |= cu << 16;
+		/* Suppress warning on 32 bit machines */
+		cu |= (cu << 16) << 16;
+#endif
+		if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+			/*
+			 * Fill the buffer one byte at time until
+			 * the destination is word aligned.
+			 */
+			for (; count && dest.as_uptr & WORD_MASK; count--)
+				*dest.as_u8++ = c;
+		}
+
+		/* Copy using the largest size allowed */
+		for (; count >= BYTES_LONG; count -= BYTES_LONG)
+			*dest.as_ulong++ = cu;
+	}
+
+	/* copy the remainder */
+	while (count--)
+		*dest.as_u8++ = c;
+
+	return s;
+}
+EXPORT_SYMBOL(__memset);
+
+void *memset(void *s, int c, size_t count) __weak __alias(__memset);
+EXPORT_SYMBOL(memset);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/3] riscv: optimized memcpy
  2021-09-29 17:22 ` [PATCH v5 1/3] riscv: optimized memcpy Matteo Croce
@ 2021-09-30  1:24   ` Guo Ren
  2021-09-30  1:36     ` Matteo Croce
  2021-10-01 11:23   ` Pavel Machek
  1 sibling, 1 reply; 12+ messages in thread
From: Guo Ren @ 2021-09-30  1:24 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	David Laight, Christoph Hellwig

On Thu, Sep 30, 2021 at 1:22 AM Matteo Croce <mcroce@linux.microsoft.com> wrote:
>
> From: Matteo Croce <mcroce@microsoft.com>
>
> Write a C version of memcpy() which uses the biggest data size allowed,
> without generating unaligned accesses.
>
> The procedure is made of three steps:
> First copy data one byte at time until the destination buffer is aligned
> to a long boundary.
> Then copy the data one long at time shifting the current and the next u8
> to compose a long at every cycle.
> Finally, copy the remainder one byte at time.
>
> On a BeagleV, the TCP RX throughput increased by 45%:
>
> before:
>
> $ iperf3 -c beaglev
> Connecting to host beaglev, port 5201
> [  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
> [  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
> [  5]   2.00-3.00   sec  73.8 MBytes   619 Mbits/sec   10    451 KBytes
> [  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0    564 KBytes
> [  5]   4.00-5.00   sec  73.8 MBytes   619 Mbits/sec    0    658 KBytes
> [  5]   5.00-6.00   sec  73.8 MBytes   619 Mbits/sec   14    522 KBytes
> [  5]   6.00-7.00   sec  73.8 MBytes   619 Mbits/sec    0    621 KBytes
> [  5]   7.00-8.00   sec  72.5 MBytes   608 Mbits/sec    0    706 KBytes
> [  5]   8.00-9.00   sec  73.8 MBytes   619 Mbits/sec   20    580 KBytes
> [  5]   9.00-10.00  sec  73.8 MBytes   619 Mbits/sec    0    672 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec   736 MBytes   618 Mbits/sec   71             sender
> [  5]   0.00-10.01  sec   733 MBytes   615 Mbits/sec                  receiver
>
> after:
>
> $ iperf3 -c beaglev
> Connecting to host beaglev, port 5201
> [  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
> [  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690 KBytes
> [  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec   36    396 KBytes
> [  5]   3.00-4.00   sec   108 MBytes   902 Mbits/sec    0    567 KBytes
> [  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0    699 KBytes
> [  5]   5.00-6.00   sec   106 MBytes   891 Mbits/sec   32    414 KBytes
> [  5]   6.00-7.00   sec   106 MBytes   891 Mbits/sec    0    583 KBytes
> [  5]   7.00-8.00   sec   106 MBytes   891 Mbits/sec    0    708 KBytes
> [  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec   28    433 KBytes
> [  5]   9.00-10.00  sec   108 MBytes   902 Mbits/sec    0    591 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.04 GBytes   897 Mbits/sec  144             sender
> [  5]   0.00-10.01  sec  1.04 GBytes   894 Mbits/sec                  receiver
>
> And the decreased CPU time of the memcpy() is observable with perf top.
> This is the `perf top -Ue task-clock` output when doing the test:
>
> before:
>
> Overhead  Shared O  Symbol
>   42.22%  [kernel]  [k] memcpy
>   35.00%  [kernel]  [k] __asm_copy_to_user
>    3.50%  [kernel]  [k] sifive_l2_flush64_range
>    2.30%  [kernel]  [k] stmmac_napi_poll_rx
>    1.11%  [kernel]  [k] memset
>
> after:
>
> Overhead  Shared O  Symbol
>   45.69%  [kernel]  [k] __asm_copy_to_user
>   29.06%  [kernel]  [k] memcpy
>    4.09%  [kernel]  [k] sifive_l2_flush64_range
>    2.77%  [kernel]  [k] stmmac_napi_poll_rx
>    1.24%  [kernel]  [k] memset
>
> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> Reported-by: kernel test robot <lkp@intel.com>
> ---
>  arch/riscv/include/asm/string.h |   8 ++-
>  arch/riscv/kernel/riscv_ksyms.c |   2 -
>  arch/riscv/lib/Makefile         |   2 +-
>  arch/riscv/lib/memcpy.S         | 108 --------------------------------
>  arch/riscv/lib/string.c         |  90 ++++++++++++++++++++++++++
>  5 files changed, 97 insertions(+), 113 deletions(-)
>  delete mode 100644 arch/riscv/lib/memcpy.S
>  create mode 100644 arch/riscv/lib/string.c
>
> diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
> index 909049366555..6b5d6fc3eab4 100644
> --- a/arch/riscv/include/asm/string.h
> +++ b/arch/riscv/include/asm/string.h
> @@ -12,9 +12,13 @@
>  #define __HAVE_ARCH_MEMSET
>  extern asmlinkage void *memset(void *, int, size_t);
>  extern asmlinkage void *__memset(void *, int, size_t);
> +
> +#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
What's the problem with the -O3 & -Os? If the user uses
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3 that will cause bad performance
for memcpy?
Seems asm version is more compatible?

>  #define __HAVE_ARCH_MEMCPY
> -extern asmlinkage void *memcpy(void *, const void *, size_t);
> -extern asmlinkage void *__memcpy(void *, const void *, size_t);
> +extern void *memcpy(void *dest, const void *src, size_t count);
> +extern void *__memcpy(void *dest, const void *src, size_t count);
> +#endif
> +
>  #define __HAVE_ARCH_MEMMOVE
>  extern asmlinkage void *memmove(void *, const void *, size_t);
>  extern asmlinkage void *__memmove(void *, const void *, size_t);
> diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
> index 5ab1c7e1a6ed..3f6d512a5b97 100644
> --- a/arch/riscv/kernel/riscv_ksyms.c
> +++ b/arch/riscv/kernel/riscv_ksyms.c
> @@ -10,8 +10,6 @@
>   * Assembly functions that may be used (directly or indirectly) by modules
>   */
>  EXPORT_SYMBOL(memset);
> -EXPORT_SYMBOL(memcpy);
>  EXPORT_SYMBOL(memmove);
>  EXPORT_SYMBOL(__memset);
> -EXPORT_SYMBOL(__memcpy);
>  EXPORT_SYMBOL(__memmove);
> diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
> index 25d5c9664e57..2ffe85d4baee 100644
> --- a/arch/riscv/lib/Makefile
> +++ b/arch/riscv/lib/Makefile
> @@ -1,9 +1,9 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  lib-y                  += delay.o
> -lib-y                  += memcpy.o
>  lib-y                  += memset.o
>  lib-y                  += memmove.o
>  lib-$(CONFIG_MMU)      += uaccess.o
>  lib-$(CONFIG_64BIT)    += tishift.o
> +lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
>
>  obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
> deleted file mode 100644
> index 51ab716253fa..000000000000
> --- a/arch/riscv/lib/memcpy.S
> +++ /dev/null
> @@ -1,108 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -/*
> - * Copyright (C) 2013 Regents of the University of California
> - */
> -
> -#include <linux/linkage.h>
> -#include <asm/asm.h>
> -
> -/* void *memcpy(void *, const void *, size_t) */
> -ENTRY(__memcpy)
> -WEAK(memcpy)
> -       move t6, a0  /* Preserve return value */
> -
> -       /* Defer to byte-oriented copy for small sizes */
> -       sltiu a3, a2, 128
> -       bnez a3, 4f
> -       /* Use word-oriented copy only if low-order bits match */
> -       andi a3, t6, SZREG-1
> -       andi a4, a1, SZREG-1
> -       bne a3, a4, 4f
> -
> -       beqz a3, 2f  /* Skip if already aligned */
> -       /*
> -        * Round to nearest double word-aligned address
> -        * greater than or equal to start address
> -        */
> -       andi a3, a1, ~(SZREG-1)
> -       addi a3, a3, SZREG
> -       /* Handle initial misalignment */
> -       sub a4, a3, a1
> -1:
> -       lb a5, 0(a1)
> -       addi a1, a1, 1
> -       sb a5, 0(t6)
> -       addi t6, t6, 1
> -       bltu a1, a3, 1b
> -       sub a2, a2, a4  /* Update count */
> -
> -2:
> -       andi a4, a2, ~((16*SZREG)-1)
> -       beqz a4, 4f
> -       add a3, a1, a4
> -3:
> -       REG_L a4,       0(a1)
> -       REG_L a5,   SZREG(a1)
> -       REG_L a6, 2*SZREG(a1)
> -       REG_L a7, 3*SZREG(a1)
> -       REG_L t0, 4*SZREG(a1)
> -       REG_L t1, 5*SZREG(a1)
> -       REG_L t2, 6*SZREG(a1)
> -       REG_L t3, 7*SZREG(a1)
> -       REG_L t4, 8*SZREG(a1)
> -       REG_L t5, 9*SZREG(a1)
> -       REG_S a4,       0(t6)
> -       REG_S a5,   SZREG(t6)
> -       REG_S a6, 2*SZREG(t6)
> -       REG_S a7, 3*SZREG(t6)
> -       REG_S t0, 4*SZREG(t6)
> -       REG_S t1, 5*SZREG(t6)
> -       REG_S t2, 6*SZREG(t6)
> -       REG_S t3, 7*SZREG(t6)
> -       REG_S t4, 8*SZREG(t6)
> -       REG_S t5, 9*SZREG(t6)
> -       REG_L a4, 10*SZREG(a1)
> -       REG_L a5, 11*SZREG(a1)
> -       REG_L a6, 12*SZREG(a1)
> -       REG_L a7, 13*SZREG(a1)
> -       REG_L t0, 14*SZREG(a1)
> -       REG_L t1, 15*SZREG(a1)
> -       addi a1, a1, 16*SZREG
> -       REG_S a4, 10*SZREG(t6)
> -       REG_S a5, 11*SZREG(t6)
> -       REG_S a6, 12*SZREG(t6)
> -       REG_S a7, 13*SZREG(t6)
> -       REG_S t0, 14*SZREG(t6)
> -       REG_S t1, 15*SZREG(t6)
> -       addi t6, t6, 16*SZREG
> -       bltu a1, a3, 3b
> -       andi a2, a2, (16*SZREG)-1  /* Update count */
> -
> -4:
> -       /* Handle trailing misalignment */
> -       beqz a2, 6f
> -       add a3, a1, a2
> -
> -       /* Use word-oriented copy if co-aligned to word boundary */
> -       or a5, a1, t6
> -       or a5, a5, a3
> -       andi a5, a5, 3
> -       bnez a5, 5f
> -7:
> -       lw a4, 0(a1)
> -       addi a1, a1, 4
> -       sw a4, 0(t6)
> -       addi t6, t6, 4
> -       bltu a1, a3, 7b
> -
> -       ret
> -
> -5:
> -       lb a4, 0(a1)
> -       addi a1, a1, 1
> -       sb a4, 0(t6)
> -       addi t6, t6, 1
> -       bltu a1, a3, 5b
> -6:
> -       ret
> -END(__memcpy)
> diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
> new file mode 100644
> index 000000000000..bfc912ee23f8
> --- /dev/null
> +++ b/arch/riscv/lib/string.c
> @@ -0,0 +1,90 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * String functions optimized for hardware which doesn't
> + * handle unaligned memory accesses efficiently.
> + *
> + * Copyright (C) 2021 Matteo Croce
> + */
> +
> +#include <linux/types.h>
> +#include <linux/module.h>
> +
> +/* Minimum size for a word copy to be convenient */
> +#define BYTES_LONG     sizeof(long)
> +#define WORD_MASK      (BYTES_LONG - 1)
> +#define MIN_THRESHOLD  (BYTES_LONG * 2)
> +
> +/* convenience union to avoid cast between different pointer types */
> +union types {
> +       u8 *as_u8;
> +       unsigned long *as_ulong;
> +       uintptr_t as_uptr;
> +};
> +
> +union const_types {
> +       const u8 *as_u8;
> +       unsigned long *as_ulong;
> +       uintptr_t as_uptr;
> +};
> +
> +void *__memcpy(void *dest, const void *src, size_t count)
How about using __attribute__((optimize("-O2"))) here to replace your
previous "#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE"?

> +{
> +       union const_types s = { .as_u8 = src };
> +       union types d = { .as_u8 = dest };
> +       int distance = 0;
> +
> +       if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
> +               if (count < MIN_THRESHOLD)
> +                       goto copy_remainder;
> +
> +               /* Copy a byte at time until destination is aligned. */
> +               for (; d.as_uptr & WORD_MASK; count--)
> +                       *d.as_u8++ = *s.as_u8++;
> +
> +               distance = s.as_uptr & WORD_MASK;
> +       }
> +
> +       if (distance) {
> +               unsigned long last, next;
> +
> +               /*
> +                * s is distance bytes ahead of d, and d just reached
> +                * the alignment boundary. Move s backward to word align it
> +                * and shift data to compensate for distance, in order to do
> +                * word-by-word copy.
> +                */
> +               s.as_u8 -= distance;
> +
> +               next = s.as_ulong[0];
> +               for (; count >= BYTES_LONG; count -= BYTES_LONG) {
> +                       last = next;
> +                       next = s.as_ulong[1];
> +
> +                       d.as_ulong[0] = last >> (distance * 8) |
> +                                       next << ((BYTES_LONG - distance) * 8);
> +
> +                       d.as_ulong++;
> +                       s.as_ulong++;
> +               }
> +
> +               /* Restore s with the original offset. */
> +               s.as_u8 += distance;
> +       } else {
> +               /*
> +                * If the source and dest lower bits are the same, do a simple
> +                * 32/64 bit wide copy.
> +                */
> +               for (; count >= BYTES_LONG; count -= BYTES_LONG)
> +                       *d.as_ulong++ = *s.as_ulong++;
> +       }
> +
> +copy_remainder:
> +       while (count--)
> +               *d.as_u8++ = *s.as_u8++;
> +
> +       return dest;
> +}
> +EXPORT_SYMBOL(__memcpy);
> +
> +void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);
> +EXPORT_SYMBOL(memcpy);
> --
> 2.31.1
>


-- 
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/3] riscv: optimized memcpy
  2021-09-30  1:24   ` Guo Ren
@ 2021-09-30  1:36     ` Matteo Croce
  2021-09-30 19:43       ` Vineet Gupta
  0 siblings, 1 reply; 12+ messages in thread
From: Matteo Croce @ 2021-09-30  1:36 UTC (permalink / raw)
  To: Guo Ren
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	David Laight, Christoph Hellwig

On Thu, Sep 30, 2021 at 3:25 AM Guo Ren <guoren@kernel.org> wrote:
>
> On Thu, Sep 30, 2021 at 1:22 AM Matteo Croce <mcroce@linux.microsoft.com> wrote:
> >
> > From: Matteo Croce <mcroce@microsoft.com>
> >
> > Write a C version of memcpy() which uses the biggest data size allowed,
> > without generating unaligned accesses.
> >
> > The procedure is made of three steps:
> > First copy data one byte at time until the destination buffer is aligned
> > to a long boundary.
> > Then copy the data one long at time shifting the current and the next u8
> > to compose a long at every cycle.
> > Finally, copy the remainder one byte at time.
> >
> > On a BeagleV, the TCP RX throughput increased by 45%:
> >
> > before:
> >
> > $ iperf3 -c beaglev
> > Connecting to host beaglev, port 5201
> > [  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
> > [  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
> > [  5]   2.00-3.00   sec  73.8 MBytes   619 Mbits/sec   10    451 KBytes
> > [  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0    564 KBytes
> > [  5]   4.00-5.00   sec  73.8 MBytes   619 Mbits/sec    0    658 KBytes
> > [  5]   5.00-6.00   sec  73.8 MBytes   619 Mbits/sec   14    522 KBytes
> > [  5]   6.00-7.00   sec  73.8 MBytes   619 Mbits/sec    0    621 KBytes
> > [  5]   7.00-8.00   sec  72.5 MBytes   608 Mbits/sec    0    706 KBytes
> > [  5]   8.00-9.00   sec  73.8 MBytes   619 Mbits/sec   20    580 KBytes
> > [  5]   9.00-10.00  sec  73.8 MBytes   619 Mbits/sec    0    672 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec   736 MBytes   618 Mbits/sec   71             sender
> > [  5]   0.00-10.01  sec   733 MBytes   615 Mbits/sec                  receiver
> >
> > after:
> >
> > $ iperf3 -c beaglev
> > Connecting to host beaglev, port 5201
> > [  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
> > [  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690 KBytes
> > [  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec   36    396 KBytes
> > [  5]   3.00-4.00   sec   108 MBytes   902 Mbits/sec    0    567 KBytes
> > [  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0    699 KBytes
> > [  5]   5.00-6.00   sec   106 MBytes   891 Mbits/sec   32    414 KBytes
> > [  5]   6.00-7.00   sec   106 MBytes   891 Mbits/sec    0    583 KBytes
> > [  5]   7.00-8.00   sec   106 MBytes   891 Mbits/sec    0    708 KBytes
> > [  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec   28    433 KBytes
> > [  5]   9.00-10.00  sec   108 MBytes   902 Mbits/sec    0    591 KBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  1.04 GBytes   897 Mbits/sec  144             sender
> > [  5]   0.00-10.01  sec  1.04 GBytes   894 Mbits/sec                  receiver
> >
> > And the decreased CPU time of the memcpy() is observable with perf top.
> > This is the `perf top -Ue task-clock` output when doing the test:
> >
> > before:
> >
> > Overhead  Shared O  Symbol
> >   42.22%  [kernel]  [k] memcpy
> >   35.00%  [kernel]  [k] __asm_copy_to_user
> >    3.50%  [kernel]  [k] sifive_l2_flush64_range
> >    2.30%  [kernel]  [k] stmmac_napi_poll_rx
> >    1.11%  [kernel]  [k] memset
> >
> > after:
> >
> > Overhead  Shared O  Symbol
> >   45.69%  [kernel]  [k] __asm_copy_to_user
> >   29.06%  [kernel]  [k] memcpy
> >    4.09%  [kernel]  [k] sifive_l2_flush64_range
> >    2.77%  [kernel]  [k] stmmac_napi_poll_rx
> >    1.24%  [kernel]  [k] memset
> >
> > Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> > Reported-by: kernel test robot <lkp@intel.com>
> > ---
> >  arch/riscv/include/asm/string.h |   8 ++-
> >  arch/riscv/kernel/riscv_ksyms.c |   2 -
> >  arch/riscv/lib/Makefile         |   2 +-
> >  arch/riscv/lib/memcpy.S         | 108 --------------------------------
> >  arch/riscv/lib/string.c         |  90 ++++++++++++++++++++++++++
> >  5 files changed, 97 insertions(+), 113 deletions(-)
> >  delete mode 100644 arch/riscv/lib/memcpy.S
> >  create mode 100644 arch/riscv/lib/string.c
> >
> > diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
> > index 909049366555..6b5d6fc3eab4 100644
> > --- a/arch/riscv/include/asm/string.h
> > +++ b/arch/riscv/include/asm/string.h
> > @@ -12,9 +12,13 @@
> >  #define __HAVE_ARCH_MEMSET
> >  extern asmlinkage void *memset(void *, int, size_t);
> >  extern asmlinkage void *__memset(void *, int, size_t);
> > +
> > +#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
> What's the problem with the -O3 & -Os? If the user uses
> CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3 that will cause bad performance
> for memcpy?
> Seems asm version is more compatible?
>
> >  #define __HAVE_ARCH_MEMCPY
> > -extern asmlinkage void *memcpy(void *, const void *, size_t);
> > -extern asmlinkage void *__memcpy(void *, const void *, size_t);
> > +extern void *memcpy(void *dest, const void *src, size_t count);
> > +extern void *__memcpy(void *dest, const void *src, size_t count);
> > +#endif
> > +
> >  #define __HAVE_ARCH_MEMMOVE
> >  extern asmlinkage void *memmove(void *, const void *, size_t);
> >  extern asmlinkage void *__memmove(void *, const void *, size_t);
> > diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
> > index 5ab1c7e1a6ed..3f6d512a5b97 100644
> > --- a/arch/riscv/kernel/riscv_ksyms.c
> > +++ b/arch/riscv/kernel/riscv_ksyms.c
> > @@ -10,8 +10,6 @@
> >   * Assembly functions that may be used (directly or indirectly) by modules
> >   */
> >  EXPORT_SYMBOL(memset);
> > -EXPORT_SYMBOL(memcpy);
> >  EXPORT_SYMBOL(memmove);
> >  EXPORT_SYMBOL(__memset);
> > -EXPORT_SYMBOL(__memcpy);
> >  EXPORT_SYMBOL(__memmove);
> > diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
> > index 25d5c9664e57..2ffe85d4baee 100644
> > --- a/arch/riscv/lib/Makefile
> > +++ b/arch/riscv/lib/Makefile
> > @@ -1,9 +1,9 @@
> >  # SPDX-License-Identifier: GPL-2.0-only
> >  lib-y                  += delay.o
> > -lib-y                  += memcpy.o
> >  lib-y                  += memset.o
> >  lib-y                  += memmove.o
> >  lib-$(CONFIG_MMU)      += uaccess.o
> >  lib-$(CONFIG_64BIT)    += tishift.o
> > +lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
> >
> >  obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> > diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
> > deleted file mode 100644
> > index 51ab716253fa..000000000000
> > --- a/arch/riscv/lib/memcpy.S
> > +++ /dev/null
> > @@ -1,108 +0,0 @@
> > -/* SPDX-License-Identifier: GPL-2.0-only */
> > -/*
> > - * Copyright (C) 2013 Regents of the University of California
> > - */
> > -
> > -#include <linux/linkage.h>
> > -#include <asm/asm.h>
> > -
> > -/* void *memcpy(void *, const void *, size_t) */
> > -ENTRY(__memcpy)
> > -WEAK(memcpy)
> > -       move t6, a0  /* Preserve return value */
> > -
> > -       /* Defer to byte-oriented copy for small sizes */
> > -       sltiu a3, a2, 128
> > -       bnez a3, 4f
> > -       /* Use word-oriented copy only if low-order bits match */
> > -       andi a3, t6, SZREG-1
> > -       andi a4, a1, SZREG-1
> > -       bne a3, a4, 4f
> > -
> > -       beqz a3, 2f  /* Skip if already aligned */
> > -       /*
> > -        * Round to nearest double word-aligned address
> > -        * greater than or equal to start address
> > -        */
> > -       andi a3, a1, ~(SZREG-1)
> > -       addi a3, a3, SZREG
> > -       /* Handle initial misalignment */
> > -       sub a4, a3, a1
> > -1:
> > -       lb a5, 0(a1)
> > -       addi a1, a1, 1
> > -       sb a5, 0(t6)
> > -       addi t6, t6, 1
> > -       bltu a1, a3, 1b
> > -       sub a2, a2, a4  /* Update count */
> > -
> > -2:
> > -       andi a4, a2, ~((16*SZREG)-1)
> > -       beqz a4, 4f
> > -       add a3, a1, a4
> > -3:
> > -       REG_L a4,       0(a1)
> > -       REG_L a5,   SZREG(a1)
> > -       REG_L a6, 2*SZREG(a1)
> > -       REG_L a7, 3*SZREG(a1)
> > -       REG_L t0, 4*SZREG(a1)
> > -       REG_L t1, 5*SZREG(a1)
> > -       REG_L t2, 6*SZREG(a1)
> > -       REG_L t3, 7*SZREG(a1)
> > -       REG_L t4, 8*SZREG(a1)
> > -       REG_L t5, 9*SZREG(a1)
> > -       REG_S a4,       0(t6)
> > -       REG_S a5,   SZREG(t6)
> > -       REG_S a6, 2*SZREG(t6)
> > -       REG_S a7, 3*SZREG(t6)
> > -       REG_S t0, 4*SZREG(t6)
> > -       REG_S t1, 5*SZREG(t6)
> > -       REG_S t2, 6*SZREG(t6)
> > -       REG_S t3, 7*SZREG(t6)
> > -       REG_S t4, 8*SZREG(t6)
> > -       REG_S t5, 9*SZREG(t6)
> > -       REG_L a4, 10*SZREG(a1)
> > -       REG_L a5, 11*SZREG(a1)
> > -       REG_L a6, 12*SZREG(a1)
> > -       REG_L a7, 13*SZREG(a1)
> > -       REG_L t0, 14*SZREG(a1)
> > -       REG_L t1, 15*SZREG(a1)
> > -       addi a1, a1, 16*SZREG
> > -       REG_S a4, 10*SZREG(t6)
> > -       REG_S a5, 11*SZREG(t6)
> > -       REG_S a6, 12*SZREG(t6)
> > -       REG_S a7, 13*SZREG(t6)
> > -       REG_S t0, 14*SZREG(t6)
> > -       REG_S t1, 15*SZREG(t6)
> > -       addi t6, t6, 16*SZREG
> > -       bltu a1, a3, 3b
> > -       andi a2, a2, (16*SZREG)-1  /* Update count */
> > -
> > -4:
> > -       /* Handle trailing misalignment */
> > -       beqz a2, 6f
> > -       add a3, a1, a2
> > -
> > -       /* Use word-oriented copy if co-aligned to word boundary */
> > -       or a5, a1, t6
> > -       or a5, a5, a3
> > -       andi a5, a5, 3
> > -       bnez a5, 5f
> > -7:
> > -       lw a4, 0(a1)
> > -       addi a1, a1, 4
> > -       sw a4, 0(t6)
> > -       addi t6, t6, 4
> > -       bltu a1, a3, 7b
> > -
> > -       ret
> > -
> > -5:
> > -       lb a4, 0(a1)
> > -       addi a1, a1, 1
> > -       sb a4, 0(t6)
> > -       addi t6, t6, 1
> > -       bltu a1, a3, 5b
> > -6:
> > -       ret
> > -END(__memcpy)
> > diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
> > new file mode 100644
> > index 000000000000..bfc912ee23f8
> > --- /dev/null
> > +++ b/arch/riscv/lib/string.c
> > @@ -0,0 +1,90 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * String functions optimized for hardware which doesn't
> > + * handle unaligned memory accesses efficiently.
> > + *
> > + * Copyright (C) 2021 Matteo Croce
> > + */
> > +
> > +#include <linux/types.h>
> > +#include <linux/module.h>
> > +
> > +/* Minimum size for a word copy to be convenient */
> > +#define BYTES_LONG     sizeof(long)
> > +#define WORD_MASK      (BYTES_LONG - 1)
> > +#define MIN_THRESHOLD  (BYTES_LONG * 2)
> > +
> > +/* convenience union to avoid cast between different pointer types */
> > +union types {
> > +       u8 *as_u8;
> > +       unsigned long *as_ulong;
> > +       uintptr_t as_uptr;
> > +};
> > +
> > +union const_types {
> > +       const u8 *as_u8;
> > +       unsigned long *as_ulong;
> > +       uintptr_t as_uptr;
> > +};
> > +
> > +void *__memcpy(void *dest, const void *src, size_t count)
> How about using __attribute__((optimize("-O2"))) here to replace your
> previous "#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE"?
>
> > +{
> > +       union const_types s = { .as_u8 = src };
> > +       union types d = { .as_u8 = dest };
> > +       int distance = 0;
> > +
> > +       if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
> > +               if (count < MIN_THRESHOLD)
> > +                       goto copy_remainder;
> > +
> > +               /* Copy a byte at time until destination is aligned. */
> > +               for (; d.as_uptr & WORD_MASK; count--)
> > +                       *d.as_u8++ = *s.as_u8++;
> > +
> > +               distance = s.as_uptr & WORD_MASK;
> > +       }
> > +
> > +       if (distance) {
> > +               unsigned long last, next;
> > +
> > +               /*
> > +                * s is distance bytes ahead of d, and d just reached
> > +                * the alignment boundary. Move s backward to word align it
> > +                * and shift data to compensate for distance, in order to do
> > +                * word-by-word copy.
> > +                */
> > +               s.as_u8 -= distance;
> > +
> > +               next = s.as_ulong[0];
> > +               for (; count >= BYTES_LONG; count -= BYTES_LONG) {
> > +                       last = next;
> > +                       next = s.as_ulong[1];
> > +
> > +                       d.as_ulong[0] = last >> (distance * 8) |
> > +                                       next << ((BYTES_LONG - distance) * 8);
> > +
> > +                       d.as_ulong++;
> > +                       s.as_ulong++;
> > +               }
> > +
> > +               /* Restore s with the original offset. */
> > +               s.as_u8 += distance;
> > +       } else {
> > +               /*
> > +                * If the source and dest lower bits are the same, do a simple
> > +                * 32/64 bit wide copy.
> > +                */
> > +               for (; count >= BYTES_LONG; count -= BYTES_LONG)
> > +                       *d.as_ulong++ = *s.as_ulong++;
> > +       }
> > +
> > +copy_remainder:
> > +       while (count--)
> > +               *d.as_u8++ = *s.as_u8++;
> > +
> > +       return dest;
> > +}
> > +EXPORT_SYMBOL(__memcpy);
> > +
> > +void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);
> > +EXPORT_SYMBOL(memcpy);
> > --
> > 2.31.1
> >
>
>
> --
> Best Regards
>  Guo Ren
>
> ML: https://lore.kernel.org/linux-csky/

If CONFIG_CC_OPTIMIZE_FOR_SIZE I fallback to the generic
implementations, which are very small.
It's just an optimization to save more space when -Os is used.

CC_OPTIMIZE_FOR_PERFORMANCE_O3 is ARC specific.

Regards,
-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/3] riscv: optimized memcpy
  2021-09-30  1:36     ` Matteo Croce
@ 2021-09-30 19:43       ` Vineet Gupta
  2021-10-01  8:06         ` David Laight
  0 siblings, 1 reply; 12+ messages in thread
From: Vineet Gupta @ 2021-09-30 19:43 UTC (permalink / raw)
  To: Matteo Croce, Guo Ren
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	David Laight, Christoph Hellwig, Philipp Tomsich

On 9/29/21 6:36 PM, Matteo Croce wrote:
> On Thu, Sep 30, 2021 at 3:25 AM Guo Ren <guoren@kernel.org> wrote:
>>
>> On Thu, Sep 30, 2021 at 1:22 AM Matteo Croce <mcroce@linux.microsoft.com> wrote:
>>>
>>> From: Matteo Croce <mcroce@microsoft.com>
>>>
>>> Write a C version of memcpy() which uses the biggest data size allowed,
>>> without generating unaligned accesses.
>>>
>>> The procedure is made of three steps:
>>> First copy data one byte at time until the destination buffer is aligned
>>> to a long boundary.
>>> Then copy the data one long at time shifting the current and the next u8
>>> to compose a long at every cycle.
>>> Finally, copy the remainder one byte at time.
>>>
>>> On a BeagleV, the TCP RX throughput increased by 45%:
>>>
>>> before:
>>>
>>> $ iperf3 -c beaglev
>>> Connecting to host beaglev, port 5201
>>> [  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
>>> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>>> [  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
>>> [  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
>>> [  5]   2.00-3.00   sec  73.8 MBytes   619 Mbits/sec   10    451 KBytes
>>> [  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0    564 KBytes
>>> [  5]   4.00-5.00   sec  73.8 MBytes   619 Mbits/sec    0    658 KBytes
>>> [  5]   5.00-6.00   sec  73.8 MBytes   619 Mbits/sec   14    522 KBytes
>>> [  5]   6.00-7.00   sec  73.8 MBytes   619 Mbits/sec    0    621 KBytes
>>> [  5]   7.00-8.00   sec  72.5 MBytes   608 Mbits/sec    0    706 KBytes
>>> [  5]   8.00-9.00   sec  73.8 MBytes   619 Mbits/sec   20    580 KBytes
>>> [  5]   9.00-10.00  sec  73.8 MBytes   619 Mbits/sec    0    672 KBytes
>>> - - - - - - - - - - - - - - - - - - - - - - - - -
>>> [ ID] Interval           Transfer     Bitrate         Retr
>>> [  5]   0.00-10.00  sec   736 MBytes   618 Mbits/sec   71             sender
>>> [  5]   0.00-10.01  sec   733 MBytes   615 Mbits/sec                  receiver
>>>
>>> after:
>>>
>>> $ iperf3 -c beaglev
>>> Connecting to host beaglev, port 5201
>>> [  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
>>> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>>> [  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
>>> [  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690 KBytes
>>> [  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec   36    396 KBytes
>>> [  5]   3.00-4.00   sec   108 MBytes   902 Mbits/sec    0    567 KBytes
>>> [  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0    699 KBytes
>>> [  5]   5.00-6.00   sec   106 MBytes   891 Mbits/sec   32    414 KBytes
>>> [  5]   6.00-7.00   sec   106 MBytes   891 Mbits/sec    0    583 KBytes
>>> [  5]   7.00-8.00   sec   106 MBytes   891 Mbits/sec    0    708 KBytes
>>> [  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec   28    433 KBytes
>>> [  5]   9.00-10.00  sec   108 MBytes   902 Mbits/sec    0    591 KBytes
>>> - - - - - - - - - - - - - - - - - - - - - - - - -
>>> [ ID] Interval           Transfer     Bitrate         Retr
>>> [  5]   0.00-10.00  sec  1.04 GBytes   897 Mbits/sec  144             sender
>>> [  5]   0.00-10.01  sec  1.04 GBytes   894 Mbits/sec                  receiver
>>>
>>> And the decreased CPU time of the memcpy() is observable with perf top.
>>> This is the `perf top -Ue task-clock` output when doing the test:
>>>
>>> before:
>>>
>>> Overhead  Shared O  Symbol
>>>    42.22%  [kernel]  [k] memcpy
>>>    35.00%  [kernel]  [k] __asm_copy_to_user
>>>     3.50%  [kernel]  [k] sifive_l2_flush64_range
>>>     2.30%  [kernel]  [k] stmmac_napi_poll_rx
>>>     1.11%  [kernel]  [k] memset
>>>
>>> after:
>>>
>>> Overhead  Shared O  Symbol
>>>    45.69%  [kernel]  [k] __asm_copy_to_user
>>>    29.06%  [kernel]  [k] memcpy
>>>     4.09%  [kernel]  [k] sifive_l2_flush64_range
>>>     2.77%  [kernel]  [k] stmmac_napi_poll_rx
>>>     1.24%  [kernel]  [k] memset
>>>
>>> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
>>> Reported-by: kernel test robot <lkp@intel.com>
>>> ---
>>>   arch/riscv/include/asm/string.h |   8 ++-
>>>   arch/riscv/kernel/riscv_ksyms.c |   2 -
>>>   arch/riscv/lib/Makefile         |   2 +-
>>>   arch/riscv/lib/memcpy.S         | 108 --------------------------------
>>>   arch/riscv/lib/string.c         |  90 ++++++++++++++++++++++++++
>>>   5 files changed, 97 insertions(+), 113 deletions(-)
>>>   delete mode 100644 arch/riscv/lib/memcpy.S
>>>   create mode 100644 arch/riscv/lib/string.c
>>>
>>> diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
>>> index 909049366555..6b5d6fc3eab4 100644
>>> --- a/arch/riscv/include/asm/string.h
>>> +++ b/arch/riscv/include/asm/string.h
>>> @@ -12,9 +12,13 @@
>>>   #define __HAVE_ARCH_MEMSET
>>>   extern asmlinkage void *memset(void *, int, size_t);
>>>   extern asmlinkage void *__memset(void *, int, size_t);
>>> +
>>> +#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
>> What's the problem with the -O3 & -Os? If the user uses
>> CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3 that will cause bad performance
>> for memcpy?
>> Seems asm version is more compatible?
>>
>>>   #define __HAVE_ARCH_MEMCPY
>>> -extern asmlinkage void *memcpy(void *, const void *, size_t);
>>> -extern asmlinkage void *__memcpy(void *, const void *, size_t);
>>> +extern void *memcpy(void *dest, const void *src, size_t count);
>>> +extern void *__memcpy(void *dest, const void *src, size_t count);
>>> +#endif
>>> +
>>>   #define __HAVE_ARCH_MEMMOVE
>>>   extern asmlinkage void *memmove(void *, const void *, size_t);
>>>   extern asmlinkage void *__memmove(void *, const void *, size_t);
>>> diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
>>> index 5ab1c7e1a6ed..3f6d512a5b97 100644
>>> --- a/arch/riscv/kernel/riscv_ksyms.c
>>> +++ b/arch/riscv/kernel/riscv_ksyms.c
>>> @@ -10,8 +10,6 @@
>>>    * Assembly functions that may be used (directly or indirectly) by modules
>>>    */
>>>   EXPORT_SYMBOL(memset);
>>> -EXPORT_SYMBOL(memcpy);
>>>   EXPORT_SYMBOL(memmove);
>>>   EXPORT_SYMBOL(__memset);
>>> -EXPORT_SYMBOL(__memcpy);
>>>   EXPORT_SYMBOL(__memmove);
>>> diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
>>> index 25d5c9664e57..2ffe85d4baee 100644
>>> --- a/arch/riscv/lib/Makefile
>>> +++ b/arch/riscv/lib/Makefile
>>> @@ -1,9 +1,9 @@
>>>   # SPDX-License-Identifier: GPL-2.0-only
>>>   lib-y                  += delay.o
>>> -lib-y                  += memcpy.o
>>>   lib-y                  += memset.o
>>>   lib-y                  += memmove.o
>>>   lib-$(CONFIG_MMU)      += uaccess.o
>>>   lib-$(CONFIG_64BIT)    += tishift.o
>>> +lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
>>>
>>>   obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
>>> diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
>>> deleted file mode 100644
>>> index 51ab716253fa..000000000000
>>> --- a/arch/riscv/lib/memcpy.S
>>> +++ /dev/null
>>> @@ -1,108 +0,0 @@
>>> -/* SPDX-License-Identifier: GPL-2.0-only */
>>> -/*
>>> - * Copyright (C) 2013 Regents of the University of California
>>> - */
>>> -
>>> -#include <linux/linkage.h>
>>> -#include <asm/asm.h>
>>> -
>>> -/* void *memcpy(void *, const void *, size_t) */
>>> -ENTRY(__memcpy)
>>> -WEAK(memcpy)
>>> -       move t6, a0  /* Preserve return value */
>>> -
>>> -       /* Defer to byte-oriented copy for small sizes */
>>> -       sltiu a3, a2, 128
>>> -       bnez a3, 4f
>>> -       /* Use word-oriented copy only if low-order bits match */
>>> -       andi a3, t6, SZREG-1
>>> -       andi a4, a1, SZREG-1
>>> -       bne a3, a4, 4f
>>> -
>>> -       beqz a3, 2f  /* Skip if already aligned */
>>> -       /*
>>> -        * Round to nearest double word-aligned address
>>> -        * greater than or equal to start address
>>> -        */
>>> -       andi a3, a1, ~(SZREG-1)
>>> -       addi a3, a3, SZREG
>>> -       /* Handle initial misalignment */
>>> -       sub a4, a3, a1
>>> -1:
>>> -       lb a5, 0(a1)
>>> -       addi a1, a1, 1
>>> -       sb a5, 0(t6)
>>> -       addi t6, t6, 1
>>> -       bltu a1, a3, 1b
>>> -       sub a2, a2, a4  /* Update count */
>>> -
>>> -2:
>>> -       andi a4, a2, ~((16*SZREG)-1)
>>> -       beqz a4, 4f
>>> -       add a3, a1, a4
>>> -3:
>>> -       REG_L a4,       0(a1)
>>> -       REG_L a5,   SZREG(a1)
>>> -       REG_L a6, 2*SZREG(a1)
>>> -       REG_L a7, 3*SZREG(a1)
>>> -       REG_L t0, 4*SZREG(a1)
>>> -       REG_L t1, 5*SZREG(a1)
>>> -       REG_L t2, 6*SZREG(a1)
>>> -       REG_L t3, 7*SZREG(a1)
>>> -       REG_L t4, 8*SZREG(a1)
>>> -       REG_L t5, 9*SZREG(a1)
>>> -       REG_S a4,       0(t6)
>>> -       REG_S a5,   SZREG(t6)
>>> -       REG_S a6, 2*SZREG(t6)
>>> -       REG_S a7, 3*SZREG(t6)
>>> -       REG_S t0, 4*SZREG(t6)
>>> -       REG_S t1, 5*SZREG(t6)
>>> -       REG_S t2, 6*SZREG(t6)
>>> -       REG_S t3, 7*SZREG(t6)
>>> -       REG_S t4, 8*SZREG(t6)
>>> -       REG_S t5, 9*SZREG(t6)
>>> -       REG_L a4, 10*SZREG(a1)
>>> -       REG_L a5, 11*SZREG(a1)
>>> -       REG_L a6, 12*SZREG(a1)
>>> -       REG_L a7, 13*SZREG(a1)
>>> -       REG_L t0, 14*SZREG(a1)
>>> -       REG_L t1, 15*SZREG(a1)
>>> -       addi a1, a1, 16*SZREG
>>> -       REG_S a4, 10*SZREG(t6)
>>> -       REG_S a5, 11*SZREG(t6)
>>> -       REG_S a6, 12*SZREG(t6)
>>> -       REG_S a7, 13*SZREG(t6)
>>> -       REG_S t0, 14*SZREG(t6)
>>> -       REG_S t1, 15*SZREG(t6)
>>> -       addi t6, t6, 16*SZREG
>>> -       bltu a1, a3, 3b
>>> -       andi a2, a2, (16*SZREG)-1  /* Update count */
>>> -
>>> -4:
>>> -       /* Handle trailing misalignment */
>>> -       beqz a2, 6f
>>> -       add a3, a1, a2
>>> -
>>> -       /* Use word-oriented copy if co-aligned to word boundary */
>>> -       or a5, a1, t6
>>> -       or a5, a5, a3
>>> -       andi a5, a5, 3
>>> -       bnez a5, 5f
>>> -7:
>>> -       lw a4, 0(a1)
>>> -       addi a1, a1, 4
>>> -       sw a4, 0(t6)
>>> -       addi t6, t6, 4
>>> -       bltu a1, a3, 7b
>>> -
>>> -       ret
>>> -
>>> -5:
>>> -       lb a4, 0(a1)
>>> -       addi a1, a1, 1
>>> -       sb a4, 0(t6)
>>> -       addi t6, t6, 1
>>> -       bltu a1, a3, 5b
>>> -6:
>>> -       ret
>>> -END(__memcpy)
>>> diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
>>> new file mode 100644
>>> index 000000000000..bfc912ee23f8
>>> --- /dev/null
>>> +++ b/arch/riscv/lib/string.c
>>> @@ -0,0 +1,90 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * String functions optimized for hardware which doesn't
>>> + * handle unaligned memory accesses efficiently.
>>> + *
>>> + * Copyright (C) 2021 Matteo Croce
>>> + */
>>> +
>>> +#include <linux/types.h>
>>> +#include <linux/module.h>
>>> +
>>> +/* Minimum size for a word copy to be convenient */
>>> +#define BYTES_LONG     sizeof(long)
>>> +#define WORD_MASK      (BYTES_LONG - 1)
>>> +#define MIN_THRESHOLD  (BYTES_LONG * 2)
>>> +
>>> +/* convenience union to avoid cast between different pointer types */
>>> +union types {
>>> +       u8 *as_u8;
>>> +       unsigned long *as_ulong;
>>> +       uintptr_t as_uptr;
>>> +};
>>> +
>>> +union const_types {
>>> +       const u8 *as_u8;
>>> +       unsigned long *as_ulong;
>>> +       uintptr_t as_uptr;
>>> +};
>>> +
>>> +void *__memcpy(void *dest, const void *src, size_t count)
>> How about using __attribute__((optimize("-O2"))) here to replace your
>> previous "#ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE"?
>>
>>> +{
>>> +       union const_types s = { .as_u8 = src };
>>> +       union types d = { .as_u8 = dest };
>>> +       int distance = 0;
>>> +
>>> +       if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
>>> +               if (count < MIN_THRESHOLD)
>>> +                       goto copy_remainder;
>>> +
>>> +               /* Copy a byte at time until destination is aligned. */
>>> +               for (; d.as_uptr & WORD_MASK; count--)
>>> +                       *d.as_u8++ = *s.as_u8++;
>>> +
>>> +               distance = s.as_uptr & WORD_MASK;
>>> +       }
>>> +
>>> +       if (distance) {
>>> +               unsigned long last, next;
>>> +
>>> +               /*
>>> +                * s is distance bytes ahead of d, and d just reached
>>> +                * the alignment boundary. Move s backward to word align it
>>> +                * and shift data to compensate for distance, in order to do
>>> +                * word-by-word copy.
>>> +                */
>>> +               s.as_u8 -= distance;
>>> +
>>> +               next = s.as_ulong[0];
>>> +               for (; count >= BYTES_LONG; count -= BYTES_LONG) {
>>> +                       last = next;
>>> +                       next = s.as_ulong[1];
>>> +
>>> +                       d.as_ulong[0] = last >> (distance * 8) |
>>> +                                       next << ((BYTES_LONG - distance) * 8);
>>> +
>>> +                       d.as_ulong++;
>>> +                       s.as_ulong++;
>>> +               }
>>> +
>>> +               /* Restore s with the original offset. */
>>> +               s.as_u8 += distance;
>>> +       } else {
>>> +               /*
>>> +                * If the source and dest lower bits are the same, do a simple
>>> +                * 32/64 bit wide copy.
>>> +                */
>>> +               for (; count >= BYTES_LONG; count -= BYTES_LONG)
>>> +                       *d.as_ulong++ = *s.as_ulong++;
>>> +       }
>>> +
>>> +copy_remainder:
>>> +       while (count--)
>>> +               *d.as_u8++ = *s.as_u8++;
>>> +
>>> +       return dest;
>>> +}
>>> +EXPORT_SYMBOL(__memcpy);
>>> +
>>> +void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);
>>> +EXPORT_SYMBOL(memcpy);
>>> --
>>> 2.31.1
>>>
>>
>>
>> --
>> Best Regards
>>   Guo Ren
>>
>> ML: https://lore.kernel.org/linux-csky/
> 
> If CONFIG_CC_OPTIMIZE_FOR_SIZE I fallback to the generic
> implementations, which are very small.
> It's just an optimization to save more space when -Os is used.
> 
> CC_OPTIMIZE_FOR_PERFORMANCE_O3 is ARC specific.

It is not. Its just that for ARC I've used -O3 as default. Historically 
for compilers at the time some architecture had issues with -O3 builds 
and/or compiler would inline more. But nothing in -O3 is specific to ARC.

BTW off topic (but relevant to this patchset), I strongly feel that 
routines like memset/memcpy are better coded in assembly for really 
water tight instruction scheduling and ease of further optimizing (e.g. 
use of CMO.zero etc as experimented by Philipp). What is blocking you 
from optimizing the asm version ? You are leaving the fate of these 
critical routines in the hand of compiler - this can lead to performance 
shenanigans on a big gcc upgrade.

-Vineet

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v5 1/3] riscv: optimized memcpy
  2021-09-30 19:43       ` Vineet Gupta
@ 2021-10-01  8:06         ` David Laight
  0 siblings, 0 replies; 12+ messages in thread
From: David Laight @ 2021-10-01  8:06 UTC (permalink / raw)
  To: 'Vineet Gupta', Matteo Croce, Guo Ren
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	Christoph Hellwig, Philipp Tomsich

...
> BTW off topic (but relevant to this patchset), I strongly feel that
> routines like memset/memcpy are better coded in assembly for really
> water tight instruction scheduling and ease of further optimizing (e.g.
> use of CMO.zero etc as experimented by Philipp). What is blocking you
> from optimizing the asm version ? You are leaving the fate of these
> critical routines in the hand of compiler - this can lead to performance
> shenanigans on a big gcc upgrade.

You also need to worry about the cost of short transfers.
A few cycles there could have a much bigger difference
that something that speeds up long transfers.
Short ones are likely to be fairly common.
I doubt the loop unrolling optimisation in gcc is actually
any good for loops that might be done a few times.

Fortunately the kernel doesn't get 'hit by' gcc unrolling
loops into the AVX instructions.
The setup costs for that (and I-cache footprint) are horrid.
Although I suspect it is that optimisation that 'broke'
code that used misaligned pointers on overlapping data.

It is a general problem with the 'one size fits all' memcpy().

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/3] riscv: optimized memcpy
  2021-09-29 17:22 ` [PATCH v5 1/3] riscv: optimized memcpy Matteo Croce
  2021-09-30  1:24   ` Guo Ren
@ 2021-10-01 11:23   ` Pavel Machek
  2021-10-02 17:20     ` Matteo Croce
  1 sibling, 1 reply; 12+ messages in thread
From: Pavel Machek @ 2021-10-01 11:23 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren,
	Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 1865 bytes --]

Hi!

> From: Matteo Croce <mcroce@microsoft.com>
> 
> Write a C version of memcpy() which uses the biggest data size allowed,
> without generating unaligned accesses.
> 
> The procedure is made of three steps:
> First copy data one byte at time until the destination buffer is aligned
> to a long boundary.
> Then copy the data one long at time shifting the current and the next u8
> to compose a long at every cycle.
> Finally, copy the remainder one byte at time.
> 
> On a BeagleV, the TCP RX throughput increased by 45%:
> 
> before:
> 
> $ iperf3 -c beaglev
> Connecting to host beaglev, port 5201
> [  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
> [  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
> 
> after:
> 
> $ iperf3 -c beaglev
> Connecting to host beaglev, port 5201
> [  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
> [  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690
> KBytes

That's really quite cool. Could you see if it is your "optimized
unaligned" copy doing the difference?>

+/* convenience union to avoid cast between different pointer types */
> +union types {
> +	u8 *as_u8;
> +	unsigned long *as_ulong;
> +	uintptr_t as_uptr;
> +};
> +
> +union const_types {
> +	const u8 *as_u8;
> +	unsigned long *as_ulong;
> +	uintptr_t as_uptr;
> +};

Missing consts here?

Plus... this is really "interesting" coding style. I'd just use casts
in kernel.

Regards,						Pavel

-- 
http://www.livejournal.com/~pavelmachek

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/3] riscv: optimized memcpy
  2021-10-01 11:23   ` Pavel Machek
@ 2021-10-02 17:20     ` Matteo Croce
  0 siblings, 0 replies; 12+ messages in thread
From: Matteo Croce @ 2021-10-02 17:20 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-riscv, Linux Kernel Mailing List, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Atish Patra,
	Emil Renner Berthing, Akira Tsukamoto, Drew Fustini, Bin Meng,
	David Laight, Guo Ren, Christoph Hellwig

On Fri, Oct 1, 2021 at 1:23 PM Pavel Machek <pavel@ucw.cz> wrote:
>
> Hi!
>
> > From: Matteo Croce <mcroce@microsoft.com>
> >
> > Write a C version of memcpy() which uses the biggest data size allowed,
> > without generating unaligned accesses.
> >
> > The procedure is made of three steps:
> > First copy data one byte at time until the destination buffer is aligned
> > to a long boundary.
> > Then copy the data one long at time shifting the current and the next u8
> > to compose a long at every cycle.
> > Finally, copy the remainder one byte at time.
> >
> > On a BeagleV, the TCP RX throughput increased by 45%:
> >
> > before:
> >
> > $ iperf3 -c beaglev
> > Connecting to host beaglev, port 5201
> > [  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
> > [  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
> >
> > after:
> >
> > $ iperf3 -c beaglev
> > Connecting to host beaglev, port 5201
> > [  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
> > [  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690
> > KBytes
>
> That's really quite cool. Could you see if it is your "optimized
> unaligned" copy doing the difference?>
>
> +/* convenience union to avoid cast between different pointer types */
> > +union types {
> > +     u8 *as_u8;
> > +     unsigned long *as_ulong;
> > +     uintptr_t as_uptr;
> > +};
> > +
> > +union const_types {
> > +     const u8 *as_u8;
> > +     unsigned long *as_ulong;
> > +     uintptr_t as_uptr;
> > +};
>
> Missing consts here?
>
> Plus... this is really "interesting" coding style. I'd just use casts
> in kernel.
>

Yes, the one for as_ulong is missing.

By using casts I had to use too many of them, making repeated
assignments in every function.
This is basically the same, with less code :)

Cheers,
-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 3/3] riscv: optimized memset
  2021-09-29 17:22 ` [PATCH v5 3/3] riscv: optimized memset Matteo Croce
@ 2021-11-25 10:56   ` Ley Foon Tan
  2021-12-06 20:37     ` Emil Renner Berthing
  0 siblings, 1 reply; 12+ messages in thread
From: Ley Foon Tan @ 2021-11-25 10:56 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-riscv, linux-kernel, linux-arch, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Emil Renner Berthing,
	Akira Tsukamoto, Drew Fustini, Bin Meng, David Laight, Guo Ren,
	Christoph Hellwig

On Thu, Sep 30, 2021 at 1:56 AM Matteo Croce <mcroce@linux.microsoft.com> wrote:
>
> From: Matteo Croce <mcroce@microsoft.com>
>
> The generic memset is defined as a byte at time write. This is always
> safe, but it's slower than a 4 byte or even 8 byte write.
>
> Write a generic memset which fills the data one byte at time until the
> destination is aligned, then fills using the largest size allowed,
> and finally fills the remaining data one byte at time.
>
> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> ---
>  arch/riscv/include/asm/string.h |  10 +--
>  arch/riscv/kernel/Makefile      |   1 -
>  arch/riscv/kernel/riscv_ksyms.c |  13 ----
>  arch/riscv/lib/Makefile         |   1 -
>  arch/riscv/lib/memset.S         | 113 --------------------------------
>  arch/riscv/lib/string.c         |  41 ++++++++++++
>  6 files changed, 44 insertions(+), 135 deletions(-)
>  delete mode 100644 arch/riscv/kernel/riscv_ksyms.c
>  delete mode 100644 arch/riscv/lib/memset.S

This patch causes the Linux kernel to hang if compile with LLVM/Clang.
Tested on Qemu.

Steps to compile with Clang:
make CC=clang  defconfig
make CC=clang -j

Boot log:

[    0.000000] Linux version 5.15.4-01003-g23eeaac40da8 (xxxxx@ubuntu)
(clang version 14.0.0 (https://github.com/llvm/llvm-project
6b715e9c4d9cc00f59906d48cd57f4c767229093), GNU ld (GNU Binutils)
2.36.1) #151 SMP Thu Nov 25 18:41:47 +08 2021
[    0.000000] OF: fdt: Ignoring memory range 0x80000000 - 0x80200000
[    0.000000] Machine model: riscv-virtio,qemu
[    0.000000] earlycon: sbi0 at I/O port 0x0 (options '')
[    0.000000] printk: bootconsole [sbi0] enabled
[    0.000000] efi: UEFI not found.

Regards
Ley Foon

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 3/3] riscv: optimized memset
  2021-11-25 10:56   ` Ley Foon Tan
@ 2021-12-06 20:37     ` Emil Renner Berthing
  0 siblings, 0 replies; 12+ messages in thread
From: Emil Renner Berthing @ 2021-12-06 20:37 UTC (permalink / raw)
  To: Ley Foon Tan, Matteo Croce
  Cc: Emil Renner Berthing, linux-riscv, linux-kernel, linux-arch,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Akira Tsukamoto,
	Drew Fustini, Bin Meng, David Laight, Guo Ren, Christoph Hellwig

Hi Ley Foon,

You're right it doesn't boot for me either when building with

  make ARCH=riscv LLVM=1

The following patch fixes it for me though. In hindsigt it's perhaps a
bit surprising it works without -ffreestanding in GCC.

/Emil

---
 arch/riscv/lib/Makefile | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index e33263cc622a..6dfa919d4cd6 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -4,4 +4,9 @@ lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE) += string.o
 
+# string.o implements standard library functions like memset/memcpy etc.
+# Use -ffreestanding to ensure that the compiler does not try to "optimize"
+# them into calls to themselves.
+CFLAGS_string.o := -ffreestanding
+
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
-- 
2.34.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-12-06 20:37 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-29 17:22 [PATCH v5 0/3] riscv: optimized mem* functions Matteo Croce
2021-09-29 17:22 ` [PATCH v5 1/3] riscv: optimized memcpy Matteo Croce
2021-09-30  1:24   ` Guo Ren
2021-09-30  1:36     ` Matteo Croce
2021-09-30 19:43       ` Vineet Gupta
2021-10-01  8:06         ` David Laight
2021-10-01 11:23   ` Pavel Machek
2021-10-02 17:20     ` Matteo Croce
2021-09-29 17:22 ` [PATCH v5 2/3] riscv: optimized memmove Matteo Croce
2021-09-29 17:22 ` [PATCH v5 3/3] riscv: optimized memset Matteo Croce
2021-11-25 10:56   ` Ley Foon Tan
2021-12-06 20:37     ` Emil Renner Berthing

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).