LKML Archive on lore.kernel.org
* [PATCH 0/1] __asm_copy_to-from_user: Reduce more byte_copy
@ 2021-07-30 13:50 Akira Tsukamoto
2021-07-30 13:52 ` [PATCH 1/1] riscv: __asm_copy_to-from_user: Improve using word copy if size < 9*SZREG Akira Tsukamoto
2021-08-12 11:01 ` [PATCH 0/1] __asm_copy_to-from_user: Reduce more byte_copy Akira Tsukamoto
0 siblings, 2 replies; 11+ messages in thread
From: Akira Tsukamoto @ 2021-07-30 13:50 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Guenter Roeck, Geert Uytterhoeven,
Qiu Wenbo, Albert Ou, Akira Tsukamoto, linux-riscv, linux-kernel
Add a non-unrolled word_copy path, which is used when the size is smaller
than 9*SZREG.

This patch is based on Palmer's earlier comment:

> My guess is that some workloads will want some smaller unrolling factors,

It reduces the number of slow byte_copy iterations when the size is small.

Tested on QEMU rv32, QEMU rv64, and the BeagleV beta board.

In the future I am planning to convert uaccess.S to inline assembly in a
.c file. That will make it easier to optimize for both in-order and
out-of-order cores with an #ifdef in C.
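For illustration only, here is a minimal C sketch of the size dispatch
described above. The function and helper names are made up for the example,
and the dst-alignment and misaligned-src (shift_copy) handling is left out:

#include <stddef.h>
#include <string.h>

#define SZREG sizeof(long)	/* 4 on RV32, 8 on RV64 */

/* Hypothetical stand-ins for the assembly paths in uaccess.S. */
static void byte_copy(unsigned char *d, const unsigned char *s, size_t n)
{
	while (n--)
		*d++ = *s++;
}

static void word_copy(unsigned char *d, const unsigned char *s)
{
	long v;

	memcpy(&v, s, SZREG);	/* REG_L */
	memcpy(d, &v, SZREG);	/* REG_S */
}

void copy_dispatch(unsigned char *d, const unsigned char *s, size_t n)
{
	size_t i;

	if (n < 2 * SZREG) {		/* 1*SZREG to align dst + 1*SZREG to copy */
		byte_copy(d, s, n);
		return;
	}
	while (n >= 8 * SZREG) {	/* unrolled path, 8*SZREG per iteration */
		for (i = 0; i < 8; i++)
			word_copy(d + i * SZREG, s + i * SZREG);
		d += 8 * SZREG;
		s += 8 * SZREG;
		n -= 8 * SZREG;
	}
	while (n >= SZREG) {		/* new non-unrolled word_copy */
		word_copy(d, s);
		d += SZREG;
		s += SZREG;
		n -= SZREG;
	}
	byte_copy(d, s, n);		/* byte_copy_tail */
}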
Akira Tsukamoto (1):
riscv: __asm_copy_to-from_user: Improve using word copy if size <
9*SZREG
arch/riscv/lib/uaccess.S | 46 ++++++++++++++++++++++++++++++++++++----
1 file changed, 42 insertions(+), 4 deletions(-)
--
2.17.1
* [PATCH 1/1] riscv: __asm_copy_to-from_user: Improve using word copy if size < 9*SZREG
2021-07-30 13:50 [PATCH 0/1] __asm_copy_to-from_user: Reduce more byte_copy Akira Tsukamoto
@ 2021-07-30 13:52 ` Akira Tsukamoto
2021-08-12 13:41 ` Guenter Roeck
` (2 more replies)
2021-08-12 11:01 ` [PATCH 0/1] __asm_copy_to-from_user: Reduce more byte_copy Akira Tsukamoto
1 sibling, 3 replies; 11+ messages in thread
From: Akira Tsukamoto @ 2021-07-30 13:52 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Guenter Roeck, Geert Uytterhoeven,
Qiu Wenbo, Albert Ou, linux-riscv, linux-kernel
Reduce the number of slow byte_copy iterations when the size is between
2*SZREG and 9*SZREG by using a non-unrolled word_copy.

Without it, any size smaller than 9*SZREG uses slow byte_copy instead
of a non-unrolled word_copy.
Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
---
arch/riscv/lib/uaccess.S | 46 ++++++++++++++++++++++++++++++++++++----
1 file changed, 42 insertions(+), 4 deletions(-)
diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
index 63bc691cff91..6a80d5517afc 100644
--- a/arch/riscv/lib/uaccess.S
+++ b/arch/riscv/lib/uaccess.S
@@ -34,8 +34,10 @@ ENTRY(__asm_copy_from_user)
/*
* Use byte copy only if too small.
* SZREG holds 4 for RV32 and 8 for RV64
+ * a3 - 2*SZREG is minimum size for word_copy
+ * 1*SZREG for aligning dst + 1*SZREG for word_copy
*/
- li a3, 9*SZREG /* size must be larger than size in word_copy */
+ li a3, 2*SZREG
bltu a2, a3, .Lbyte_copy_tail
/*
@@ -66,9 +68,40 @@ ENTRY(__asm_copy_from_user)
andi a3, a1, SZREG-1
bnez a3, .Lshift_copy
+.Lcheck_size_bulk:
+ /*
+ * Evaluate the size if possible to use unrolled.
+ * The word_copy_unlrolled requires larger than 8*SZREG
+ */
+ li a3, 8*SZREG
+ add a4, a0, a3
+ bltu a4, t0, .Lword_copy_unlrolled
+
.Lword_copy:
- /*
- * Both src and dst are aligned, unrolled word copy
+ /*
+ * Both src and dst are aligned
+ * None unrolled word copy with every 1*SZREG iteration
+ *
+ * a0 - start of aligned dst
+ * a1 - start of aligned src
+ * t0 - end of aligned dst
+ */
+ bgeu a0, t0, .Lbyte_copy_tail /* check if end of copy */
+ addi t0, t0, -(SZREG) /* not to over run */
+1:
+ REG_L a5, 0(a1)
+ addi a1, a1, SZREG
+ REG_S a5, 0(a0)
+ addi a0, a0, SZREG
+ bltu a0, t0, 1b
+
+ addi t0, t0, SZREG /* revert to original value */
+ j .Lbyte_copy_tail
+
+.Lword_copy_unlrolled:
+ /*
+ * Both src and dst are aligned
+ * Unrolled word copy with every 8*SZREG iteration
*
* a0 - start of aligned dst
* a1 - start of aligned src
@@ -97,7 +130,12 @@ ENTRY(__asm_copy_from_user)
bltu a0, t0, 2b
addi t0, t0, 8*SZREG /* revert to original value */
- j .Lbyte_copy_tail
+
+ /*
+ * Remaining might large enough for word_copy to reduce slow byte
+ * copy
+ */
+ j .Lcheck_size_bulk
.Lshift_copy:
--
2.17.1
* Re: [PATCH 0/1] __asm_copy_to-from_user: Reduce more byte_copy
2021-07-30 13:50 [PATCH 0/1] __asm_copy_to-from_user: Reduce more byte_copy Akira Tsukamoto
2021-07-30 13:52 ` [PATCH 1/1] riscv: __asm_copy_to-from_user: Improve using word copy if size < 9*SZREG Akira Tsukamoto
@ 2021-08-12 11:01 ` Akira Tsukamoto
[not found] ` <61187c37.1c69fb81.ed9bd.cc45SMTPIN_ADDED_BROKEN@mx.google.com>
1 sibling, 1 reply; 11+ messages in thread
From: Akira Tsukamoto @ 2021-08-12 11:01 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Guenter Roeck, Geert Uytterhoeven,
Qiu Wenbo, Albert Ou, linux-riscv, linux-kernel
Hi Guenter, Geert and Qiu,
Would you mind testing this patch?
Thanks,
Akira
On 7/30/2021 10:50 PM, Akira Tsukamoto wrote:
> Add a non-unrolled word_copy path, which is used when the size is smaller
> than 9*SZREG.
>
> This patch is based on Palmer's earlier comment:
>> My guess is that some workloads will want some smaller unrolling factors,
>
> It reduces the number of slow byte_copy iterations when the size is small.
>
> Tested on QEMU rv32, QEMU rv64, and the BeagleV beta board.
>
> In the future I am planning to convert uaccess.S to inline assembly in a
> .c file. That will make it easier to optimize for both in-order and
> out-of-order cores with an #ifdef in C.
>
> Akira Tsukamoto (1):
> riscv: __asm_copy_to-from_user: Improve using word copy if size <
> 9*SZREG
>
> arch/riscv/lib/uaccess.S | 46 ++++++++++++++++++++++++++++++++++++----
> 1 file changed, 42 insertions(+), 4 deletions(-)
>
* Re: [PATCH 1/1] riscv: __asm_copy_to-from_user: Improve using word copy if size < 9*SZREG
2021-07-30 13:52 ` [PATCH 1/1] riscv: __asm_copy_to-from_user: Improve using word copy if size < 9*SZREG Akira Tsukamoto
@ 2021-08-12 13:41 ` Guenter Roeck
2021-08-15 6:51 ` Andreas Schwab
2021-08-16 18:09 ` Palmer Dabbelt
2 siblings, 0 replies; 11+ messages in thread
From: Guenter Roeck @ 2021-08-12 13:41 UTC (permalink / raw)
To: Akira Tsukamoto
Cc: Paul Walmsley, Palmer Dabbelt, Geert Uytterhoeven, Qiu Wenbo,
Albert Ou, linux-riscv, linux-kernel
On Fri, Jul 30, 2021 at 10:52:44PM +0900, Akira Tsukamoto wrote:
> Reduce the number of slow byte_copy iterations when the size is between
> 2*SZREG and 9*SZREG by using a non-unrolled word_copy.
>
> Without it, any size smaller than 9*SZREG uses slow byte_copy instead
> of a non-unrolled word_copy.
>
> Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
> ---
> arch/riscv/lib/uaccess.S | 46 ++++++++++++++++++++++++++++++++++++----
> 1 file changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
> index 63bc691cff91..6a80d5517afc 100644
> --- a/arch/riscv/lib/uaccess.S
> +++ b/arch/riscv/lib/uaccess.S
> @@ -34,8 +34,10 @@ ENTRY(__asm_copy_from_user)
> /*
> * Use byte copy only if too small.
> * SZREG holds 4 for RV32 and 8 for RV64
> + * a3 - 2*SZREG is minimum size for word_copy
> + * 1*SZREG for aligning dst + 1*SZREG for word_copy
> */
> - li a3, 9*SZREG /* size must be larger than size in word_copy */
> + li a3, 2*SZREG
> bltu a2, a3, .Lbyte_copy_tail
>
> /*
> @@ -66,9 +68,40 @@ ENTRY(__asm_copy_from_user)
> andi a3, a1, SZREG-1
> bnez a3, .Lshift_copy
>
> +.Lcheck_size_bulk:
> + /*
> + * Evaluate the size if possible to use unrolled.
> + * The word_copy_unlrolled requires larger than 8*SZREG
> + */
> + li a3, 8*SZREG
> + add a4, a0, a3
> + bltu a4, t0, .Lword_copy_unlrolled
> +
> .Lword_copy:
> - /*
> - * Both src and dst are aligned, unrolled word copy
> + /*
> + * Both src and dst are aligned
> + * None unrolled word copy with every 1*SZREG iteration
> + *
> + * a0 - start of aligned dst
> + * a1 - start of aligned src
> + * t0 - end of aligned dst
> + */
> + bgeu a0, t0, .Lbyte_copy_tail /* check if end of copy */
> + addi t0, t0, -(SZREG) /* not to over run */
> +1:
> + REG_L a5, 0(a1)
> + addi a1, a1, SZREG
> + REG_S a5, 0(a0)
> + addi a0, a0, SZREG
> + bltu a0, t0, 1b
> +
> + addi t0, t0, SZREG /* revert to original value */
> + j .Lbyte_copy_tail
> +
> +.Lword_copy_unlrolled:
> + /*
> + * Both src and dst are aligned
> + * Unrolled word copy with every 8*SZREG iteration
> *
> * a0 - start of aligned dst
> * a1 - start of aligned src
> @@ -97,7 +130,12 @@ ENTRY(__asm_copy_from_user)
> bltu a0, t0, 2b
>
> addi t0, t0, 8*SZREG /* revert to original value */
> - j .Lbyte_copy_tail
> +
> + /*
> + * Remaining might large enough for word_copy to reduce slow byte
> + * copy
> + */
> + j .Lcheck_size_bulk
>
> .Lshift_copy:
>
* Re: [PATCH 1/1] riscv: __asm_copy_to-from_user: Improve using word copy if size < 9*SZREG
2021-07-30 13:52 ` [PATCH 1/1] riscv: __asm_copy_to-from_user: Improve using word copy if size < 9*SZREG Akira Tsukamoto
2021-08-12 13:41 ` Guenter Roeck
@ 2021-08-15 6:51 ` Andreas Schwab
2021-08-16 18:09 ` Palmer Dabbelt
2 siblings, 0 replies; 11+ messages in thread
From: Andreas Schwab @ 2021-08-15 6:51 UTC (permalink / raw)
To: Akira Tsukamoto
Cc: Paul Walmsley, Palmer Dabbelt, Guenter Roeck, Geert Uytterhoeven,
Qiu Wenbo, Albert Ou, linux-riscv, linux-kernel
On Jul 30 2021, Akira Tsukamoto wrote:
> .Lword_copy:
> - /*
> - * Both src and dst are aligned, unrolled word copy
> + /*
> + * Both src and dst are aligned
> + * None unrolled word copy with every 1*SZREG iteration
> + *
> + * a0 - start of aligned dst
> + * a1 - start of aligned src
> + * t0 - end of aligned dst
> + */
> + bgeu a0, t0, .Lbyte_copy_tail /* check if end of copy */
> + addi t0, t0, -(SZREG) /* not to over run */
> +1:
> + REG_L a5, 0(a1)
> + addi a1, a1, SZREG
> + REG_S a5, 0(a0)
> + addi a0, a0, SZREG
> + bltu a0, t0, 1b
This is missing fixups.
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* Re: [PATCH 0/1] __asm_copy_to-from_user: Reduce more byte_copy
[not found] ` <61187c37.1c69fb81.ed9bd.cc45SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2021-08-16 6:24 ` Akira Tsukamoto
[not found] ` <611a33ac.1c69fb81.12aae.89a5SMTPIN_ADDED_BROKEN@mx.google.com>
0 siblings, 1 reply; 11+ messages in thread
From: Akira Tsukamoto @ 2021-08-16 6:24 UTC (permalink / raw)
To: Qiu Wenbo, Paul Walmsley, Palmer Dabbelt, Guenter Roeck,
Geert Uytterhoeven, Albert Ou, linux-riscv, linux-kernel
Cc: akira.tsukamoto
Hi Qiu,
On 8/15/2021 11:30 AM, Qiu Wenbo wrote:
> Hi Akira,
>
>
> This patch breaks my userspace and I can't boot my system after applying this. Here is the stack trace:
>
>
> [ 10.349080] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> [ 10.357116] Oops [#15]
> [ 10.359433] CPU: 2 PID: 169 Comm: (networkd) Tainted: G D 5.14.0-rc5 #53
> [ 10.367422] Hardware name: SiFive HiFive Unmatched A00 (DT)
> [ 10.372981] epc : __asm_copy_from_user+0x48/0xf0
> [ 10.377584] ra : _copy_from_user+0x28/0x68
> [ 10.381754] epc : ffffffff8099a280 ra : ffffffff803614a8 sp : ffffffd00416bd90
> [ 10.388963] gp : ffffffff811ee540 tp : ffffffe0841b3680 t0 : ffffffd00416bde0
> [ 10.396172] t1 : ffffffd00416bdd8 t2 : 0000003ff09ca3a0 s0 : ffffffd00416bdc0
> [ 10.403381] s1 : 0000000000000000 a0 : ffffffd00416bdd8 a1 : 0000000000000000
> [ 10.410590] a2 : 0000000000000010 a3 : 0000000000000040 a4 : ffffffd00416be18
> [ 10.417800] a5 : 0000003ffffffff0 a6 : 000000000000000f a7 : ffffffe085d58540
> [ 10.425009] s2 : 0000000000000010 s3 : ffffffd00416bdd8 s4 : 0000000000000002
> [ 10.432218] s5 : 0000000000000000 s6 : 0000000000000000 s7 : ffffffe0841b3680
> [ 10.439427] s8 : 0000002aad788040 s9 : 0000000000000000 s10: 0000000000000001
> [ 10.446636] s11: 0000000000000000 t3 : 0000000000000000 t4 : 0000000000000001
> [ 10.453845] t5 : 0000000000000010 t6 : 0000000000040000
> [ 10.459144] status: 0000000200040120 badaddr: 0000000000000000 cause: 000000000000000d
> [ 10.467049] [<ffffffff8099a280>] __asm_copy_from_user+0x48/0xf0
> [ 10.472955] [<ffffffff8009a562>] do_seccomp+0x62/0x8be
> [ 10.478079] [<ffffffff8009af58>] prctl_set_seccomp+0x24/0x32
> [ 10.483725] [<ffffffff80020756>] sys_prctl+0xf6/0x450
> [ 10.488763] [<ffffffff800034f2>] ret_from_syscall+0x0/0x2
>
>
> The PC register points to this line:
>
> +1:
> + REG_L a5, 0(a1)
Thanks for testing! Do you mind teaching me how to reproduce the error?
Akira
* Re: [PATCH 1/1] riscv: __asm_copy_to-from_user: Improve using word copy if size < 9*SZREG
2021-07-30 13:52 ` [PATCH 1/1] riscv: __asm_copy_to-from_user: Improve using word copy if size < 9*SZREG Akira Tsukamoto
2021-08-12 13:41 ` Guenter Roeck
2021-08-15 6:51 ` Andreas Schwab
@ 2021-08-16 18:09 ` Palmer Dabbelt
2021-08-16 19:00 ` Andreas Schwab
2021-08-17 9:03 ` Akira Tsukamoto
2 siblings, 2 replies; 11+ messages in thread
From: Palmer Dabbelt @ 2021-08-16 18:09 UTC (permalink / raw)
To: akira.tsukamoto
Cc: Paul Walmsley, linux, geert, qiuwenbo, aou, linux-riscv, linux-kernel
On Fri, 30 Jul 2021 06:52:44 PDT (-0700), akira.tsukamoto@gmail.com wrote:
> Reduce the number of slow byte_copy iterations when the size is between
> 2*SZREG and 9*SZREG by using a non-unrolled word_copy.
>
> Without it, any size smaller than 9*SZREG uses slow byte_copy instead
> of a non-unrolled word_copy.
>
> Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
> ---
> arch/riscv/lib/uaccess.S | 46 ++++++++++++++++++++++++++++++++++++----
> 1 file changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
> index 63bc691cff91..6a80d5517afc 100644
> --- a/arch/riscv/lib/uaccess.S
> +++ b/arch/riscv/lib/uaccess.S
> @@ -34,8 +34,10 @@ ENTRY(__asm_copy_from_user)
> /*
> * Use byte copy only if too small.
> * SZREG holds 4 for RV32 and 8 for RV64
> + * a3 - 2*SZREG is minimum size for word_copy
> + * 1*SZREG for aligning dst + 1*SZREG for word_copy
> */
> - li a3, 9*SZREG /* size must be larger than size in word_copy */
> + li a3, 2*SZREG
> bltu a2, a3, .Lbyte_copy_tail
>
> /*
> @@ -66,9 +68,40 @@ ENTRY(__asm_copy_from_user)
> andi a3, a1, SZREG-1
> bnez a3, .Lshift_copy
>
> +.Lcheck_size_bulk:
> + /*
> + * Evaluate the size if possible to use unrolled.
> + * The word_copy_unlrolled requires larger than 8*SZREG
> + */
> + li a3, 8*SZREG
> + add a4, a0, a3
> + bltu a4, t0, .Lword_copy_unlrolled
> +
> .Lword_copy:
> - /*
> - * Both src and dst are aligned, unrolled word copy
> + /*
> + * Both src and dst are aligned
> + * None unrolled word copy with every 1*SZREG iteration
> + *
> + * a0 - start of aligned dst
> + * a1 - start of aligned src
> + * t0 - end of aligned dst
> + */
> + bgeu a0, t0, .Lbyte_copy_tail /* check if end of copy */
> + addi t0, t0, -(SZREG) /* not to over run */
> +1:
> + REG_L a5, 0(a1)
> + addi a1, a1, SZREG
> + REG_S a5, 0(a0)
> + addi a0, a0, SZREG
> + bltu a0, t0, 1b
> +
> + addi t0, t0, SZREG /* revert to original value */
> + j .Lbyte_copy_tail
> +
> +.Lword_copy_unlrolled:
> + /*
> + * Both src and dst are aligned
> + * Unrolled word copy with every 8*SZREG iteration
> *
> * a0 - start of aligned dst
> * a1 - start of aligned src
> @@ -97,7 +130,12 @@ ENTRY(__asm_copy_from_user)
> bltu a0, t0, 2b
>
> addi t0, t0, 8*SZREG /* revert to original value */
> - j .Lbyte_copy_tail
> +
> + /*
> + * Remaining might large enough for word_copy to reduce slow byte
> + * copy
> + */
> + j .Lcheck_size_bulk
>
> .Lshift_copy:
I'm still not convinced that going all the way to such a large unrolling
factor is a net win, but this at least provides a much smoother cost
curve.
That said, this is causing my 32-bit configs to hang. There were a few
conflicts so I may have messed something up, but nothing is jumping out
at me. I've put what I ended up with on a branch, if you have time to
look that'd be great but if not then I'll take another shot at this when
I get back around to it.
https://git.kernel.org/pub/scm/linux/kernel/git/palmer/linux.git/commit/?h=wip-word_user_copy
Here's the backtrace, though that's probably not all that useful:
[ 0.703694] Unable to handle kernel NULL pointer dereference at virtual address 000005a8
[ 0.704194] Oops [#1]
[ 0.704301] Modules linked in:
[ 0.704463] CPU: 2 PID: 1 Comm: init Not tainted 5.14.0-rc1-00016-g59461ddb9dbd #5
[ 0.704660] Hardware name: riscv-virtio,qemu (DT)
[ 0.704802] epc : walk_stackframe+0xac/0xc2
[ 0.704941] ra : dump_backtrace+0x1a/0x22
[ 0.705074] epc : c0004558 ra : c0004588 sp : c1c5fe10
[ 0.705216] gp : c18b41c8 tp : c1cd8000 t0 : 00000000
[ 0.705357] t1 : ffffffff t2 : 00000000 s0 : c1c5fe40
[ 0.705506] s1 : c11313dc a0 : 00000000 a1 : 00000000
[ 0.705647] a2 : c06fd2c2 a3 : c11313dc a4 : c084292d
[ 0.705787] a5 : 00000000 a6 : c1864cb8 a7 : 3fffffff
[ 0.705926] s2 : 00000000 s3 : c1123e88 s4 : 00000000
[ 0.706066] s5 : c11313dc s6 : c06fd2c2 s7 : 00000001
[ 0.706206] s8 : 00000000 s9 : 95af6e28 s10: 00000000
[ 0.706345] s11: 00000001 t3 : 00000000 t4 : 00000000
[ 0.706482] t5 : 00000001 t6 : 00000000
[ 0.706594] status: 00000100 badaddr: 000005a8 cause: 0000000d
[ 0.706809] [<c0004558>] walk_stackframe+0xac/0xc2
[ 0.707019] [<c0004588>] dump_backtrace+0x1a/0x22
[ 0.707149] [<c06fd312>] show_stack+0x2c/0x38
[ 0.707271] [<c06ffba4>] dump_stack_lvl+0x40/0x58
[ 0.707400] [<c06ffbce>] dump_stack+0x12/0x1a
[ 0.707521] [<c06fd4f6>] panic+0xfa/0x2a6
[ 0.707632] [<c000e2f4>] do_exit+0x7a8/0x7ac
[ 0.707749] [<c000eefa>] do_group_exit+0x2a/0x7e
[ 0.707872] [<c000ef60>] __wake_up_parent+0x0/0x20
[ 0.707999] [<c0003020>] ret_from_syscall+0x0/0x2
[ 0.708385] ---[ end trace 260976561a3770d1 ]---
* Re: [PATCH 1/1] riscv: __asm_copy_to-from_user: Improve using word copy if size < 9*SZREG
2021-08-16 18:09 ` Palmer Dabbelt
@ 2021-08-16 19:00 ` Andreas Schwab
2021-08-20 6:42 ` Akira Tsukamoto
2021-08-17 9:03 ` Akira Tsukamoto
1 sibling, 1 reply; 11+ messages in thread
From: Andreas Schwab @ 2021-08-16 19:00 UTC (permalink / raw)
To: Palmer Dabbelt
Cc: akira.tsukamoto, Paul Walmsley, linux, geert, qiuwenbo, aou,
linux-riscv, linux-kernel
On Aug 16 2021, Palmer Dabbelt wrote:
> On Fri, 30 Jul 2021 06:52:44 PDT (-0700), akira.tsukamoto@gmail.com wrote:
>> Reduce the number of slow byte_copy iterations when the size is between
>> 2*SZREG and 9*SZREG by using a non-unrolled word_copy.
>>
>> Without it, any size smaller than 9*SZREG uses slow byte_copy instead
>> of a non-unrolled word_copy.
>>
>> Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
>> ---
>> arch/riscv/lib/uaccess.S | 46 ++++++++++++++++++++++++++++++++++++----
>> 1 file changed, 42 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
>> index 63bc691cff91..6a80d5517afc 100644
>> --- a/arch/riscv/lib/uaccess.S
>> +++ b/arch/riscv/lib/uaccess.S
>> @@ -34,8 +34,10 @@ ENTRY(__asm_copy_from_user)
>> /*
>> * Use byte copy only if too small.
>> * SZREG holds 4 for RV32 and 8 for RV64
>> + * a3 - 2*SZREG is minimum size for word_copy
>> + * 1*SZREG for aligning dst + 1*SZREG for word_copy
>> */
>> - li a3, 9*SZREG /* size must be larger than size in word_copy */
>> + li a3, 2*SZREG
>> bltu a2, a3, .Lbyte_copy_tail
>>
>> /*
>> @@ -66,9 +68,40 @@ ENTRY(__asm_copy_from_user)
>> andi a3, a1, SZREG-1
>> bnez a3, .Lshift_copy
>>
>> +.Lcheck_size_bulk:
>> + /*
>> + * Evaluate the size if possible to use unrolled.
>> + * The word_copy_unlrolled requires larger than 8*SZREG
>> + */
>> + li a3, 8*SZREG
>> + add a4, a0, a3
>> + bltu a4, t0, .Lword_copy_unlrolled
>> +
>> .Lword_copy:
>> - /*
>> - * Both src and dst are aligned, unrolled word copy
>> + /*
>> + * Both src and dst are aligned
>> + * None unrolled word copy with every 1*SZREG iteration
>> + *
>> + * a0 - start of aligned dst
>> + * a1 - start of aligned src
>> + * t0 - end of aligned dst
>> + */
>> + bgeu a0, t0, .Lbyte_copy_tail /* check if end of copy */
>> + addi t0, t0, -(SZREG) /* not to over run */
>> +1:
>> + REG_L a5, 0(a1)
>> + addi a1, a1, SZREG
>> + REG_S a5, 0(a0)
>> + addi a0, a0, SZREG
>> + bltu a0, t0, 1b
>> +
>> + addi t0, t0, SZREG /* revert to original value */
>> + j .Lbyte_copy_tail
>> +
>> +.Lword_copy_unlrolled:
>> + /*
>> + * Both src and dst are aligned
>> + * Unrolled word copy with every 8*SZREG iteration
>> *
>> * a0 - start of aligned dst
>> * a1 - start of aligned src
>> @@ -97,7 +130,12 @@ ENTRY(__asm_copy_from_user)
>> bltu a0, t0, 2b
>>
>> addi t0, t0, 8*SZREG /* revert to original value */
>> - j .Lbyte_copy_tail
>> +
>> + /*
>> + * Remaining might large enough for word_copy to reduce slow byte
>> + * copy
>> + */
>> + j .Lcheck_size_bulk
>>
>> .Lshift_copy:
>
> I'm still not convinced that going all the way to such a large unrolling
> factor is a net win, but this at least provides a much smoother cost
> curve.
>
> That said, this is causing my 32-bit configs to hang.
It's missing fixups for the loads in the loop.
diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
index a835df6bd68f..12ed1f76bd1f 100644
--- a/arch/riscv/lib/uaccess.S
+++ b/arch/riscv/lib/uaccess.S
@@ -89,9 +89,9 @@ ENTRY(__asm_copy_from_user)
bgeu a0, t0, .Lbyte_copy_tail /* check if end of copy */
addi t0, t0, -(SZREG) /* not to over run */
1:
- REG_L a5, 0(a1)
+ fixup REG_L a5, 0(a1), 10f
addi a1, a1, SZREG
- REG_S a5, 0(a0)
+ fixup REG_S a5, 0(a0), 10f
addi a0, a0, SZREG
bltu a0, t0, 1b
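For reference, fixup is the helper macro already defined near the top of
uaccess.S: it emits the access together with an exception-table entry, so
that a fault on the user pointer branches to the supplied label (10f here,
the existing abort path of this function) instead of taking an unhandled
fault. Schematically it does something like the following (a sketch from
memory, not an exact copy of the kernel source):

	.macro fixup op reg addr lbl
100:
	\op \reg, \addr			/* the access that may fault */
	.section __ex_table, "a"	/* if the insn at 100b faults ... */
	.balign RISCV_SZPTR
	RISCV_PTR 100b, \lbl		/* ... resume at \lbl */
	.previous
	.endm

Plain REG_L/REG_S instructions get no such entry, which is why the new loop
oopses on a bad user pointer instead of returning the number of bytes left
to copy.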
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* Re: [PATCH 0/1] __asm_copy_to-from_user: Reduce more byte_copy
[not found] ` <611a33ac.1c69fb81.12aae.89a5SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2021-08-17 7:32 ` Akira Tsukamoto
0 siblings, 0 replies; 11+ messages in thread
From: Akira Tsukamoto @ 2021-08-17 7:32 UTC (permalink / raw)
To: Qiu Wenbo, Paul Walmsley, Palmer Dabbelt, Guenter Roeck,
Geert Uytterhoeven, Albert Ou, linux-riscv, linux-kernel
Cc: akira.tsukamoto
Hi Qiu,
On 8/16/2021 6:45 PM, Qiu Wenbo wrote:
> Hi Akira,
>
>
> I can reproduce it on my HiFive Unmatched with a custom Gentoo rootfs. As pointed out by Andreas, there might be a missing fixup. I'm going to debug this issue myself since I can reproduce it fairly reliably.
Ah! Now I understand the bug.
> + REG_L a5, 0(a1)
should be
+ fixup REG_L a5, 0(a1)
If you do not mind, could you make a patch that adds 'fixup' to all of the REG_S and REG_L instructions?
Then I will resubmit to Palmer together with your patch.
Thanks,
Akira
>
>
> Qiu
>
>
> On 8/16/21 14:24, Akira Tsukamoto wrote:
>> Hi Qiu,
>>
>> On 8/15/2021 11:30 AM, Qiu Wenbo wrote:
>>> Hi Akira,
>>>
>>>
>>> This patch breaks my userspace and I can't boot my system after applying this. Here is the stack trace:
>>>
>>>
>>> [ 10.349080] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
>>> [ 10.357116] Oops [#15]
>>> [ 10.359433] CPU: 2 PID: 169 Comm: (networkd) Tainted: G D 5.14.0-rc5 #53
>>> [ 10.367422] Hardware name: SiFive HiFive Unmatched A00 (DT)
>>> [ 10.372981] epc : __asm_copy_from_user+0x48/0xf0
>>> [ 10.377584] ra : _copy_from_user+0x28/0x68
>>> [ 10.381754] epc : ffffffff8099a280 ra : ffffffff803614a8 sp : ffffffd00416bd90
>>> [ 10.388963] gp : ffffffff811ee540 tp : ffffffe0841b3680 t0 : ffffffd00416bde0
>>> [ 10.396172] t1 : ffffffd00416bdd8 t2 : 0000003ff09ca3a0 s0 : ffffffd00416bdc0
>>> [ 10.403381] s1 : 0000000000000000 a0 : ffffffd00416bdd8 a1 : 0000000000000000
>>> [ 10.410590] a2 : 0000000000000010 a3 : 0000000000000040 a4 : ffffffd00416be18
>>> [ 10.417800] a5 : 0000003ffffffff0 a6 : 000000000000000f a7 : ffffffe085d58540
>>> [ 10.425009] s2 : 0000000000000010 s3 : ffffffd00416bdd8 s4 : 0000000000000002
>>> [ 10.432218] s5 : 0000000000000000 s6 : 0000000000000000 s7 : ffffffe0841b3680
>>> [ 10.439427] s8 : 0000002aad788040 s9 : 0000000000000000 s10: 0000000000000001
>>> [ 10.446636] s11: 0000000000000000 t3 : 0000000000000000 t4 : 0000000000000001
>>> [ 10.453845] t5 : 0000000000000010 t6 : 0000000000040000
>>> [ 10.459144] status: 0000000200040120 badaddr: 0000000000000000 cause: 000000000000000d
>>> [ 10.467049] [<ffffffff8099a280>] __asm_copy_from_user+0x48/0xf0
>>> [ 10.472955] [<ffffffff8009a562>] do_seccomp+0x62/0x8be
>>> [ 10.478079] [<ffffffff8009af58>] prctl_set_seccomp+0x24/0x32
>>> [ 10.483725] [<ffffffff80020756>] sys_prctl+0xf6/0x450
>>> [ 10.488763] [<ffffffff800034f2>] ret_from_syscall+0x0/0x2
>>>
>>>
>>> The PC register points to this line:
>>>
>>> +1:
>>> + REG_L a5, 0(a1)
>> Thanks for testing! Do you mind teaching me how to reproduce the error?
>>
>> Akira
>>
>
>
>
* Re: [PATCH 1/1] riscv: __asm_copy_to-from_user: Improve using word copy if size < 9*SZREG
2021-08-16 18:09 ` Palmer Dabbelt
2021-08-16 19:00 ` Andreas Schwab
@ 2021-08-17 9:03 ` Akira Tsukamoto
1 sibling, 0 replies; 11+ messages in thread
From: Akira Tsukamoto @ 2021-08-17 9:03 UTC (permalink / raw)
To: Palmer Dabbelt
Cc: akira.tsukamoto, Paul Walmsley, linux, geert, qiuwenbo, aou,
linux-riscv, linux-kernel
On 8/17/2021 3:09 AM, Palmer Dabbelt wrote:
> On Fri, 30 Jul 2021 06:52:44 PDT (-0700), akira.tsukamoto@gmail.com wrote:
>> Reduce the number of slow byte_copy iterations when the size is between
>> 2*SZREG and 9*SZREG by using a non-unrolled word_copy.
>>
>> Without it, any size smaller than 9*SZREG uses slow byte_copy instead
>> of a non-unrolled word_copy.
>>
>> Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
>> ---
>> arch/riscv/lib/uaccess.S | 46 ++++++++++++++++++++++++++++++++++++----
>> 1 file changed, 42 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
>> index 63bc691cff91..6a80d5517afc 100644
>> --- a/arch/riscv/lib/uaccess.S
>> +++ b/arch/riscv/lib/uaccess.S
>> @@ -34,8 +34,10 @@ ENTRY(__asm_copy_from_user)
>> /*
>> * Use byte copy only if too small.
>> * SZREG holds 4 for RV32 and 8 for RV64
>> + * a3 - 2*SZREG is minimum size for word_copy
>> + * 1*SZREG for aligning dst + 1*SZREG for word_copy
>> */
>> - li a3, 9*SZREG /* size must be larger than size in word_copy */
>> + li a3, 2*SZREG
>> bltu a2, a3, .Lbyte_copy_tail
>>
>> /*
>> @@ -66,9 +68,40 @@ ENTRY(__asm_copy_from_user)
>> andi a3, a1, SZREG-1
>> bnez a3, .Lshift_copy
>>
>> +.Lcheck_size_bulk:
>> + /*
>> + * Evaluate the size if possible to use unrolled.
>> + * The word_copy_unlrolled requires larger than 8*SZREG
>> + */
>> + li a3, 8*SZREG
>> + add a4, a0, a3
>> + bltu a4, t0, .Lword_copy_unlrolled
>> +
>> .Lword_copy:
>> - /*
>> - * Both src and dst are aligned, unrolled word copy
>> + /*
>> + * Both src and dst are aligned
>> + * None unrolled word copy with every 1*SZREG iteration
>> + *
>> + * a0 - start of aligned dst
>> + * a1 - start of aligned src
>> + * t0 - end of aligned dst
>> + */
>> + bgeu a0, t0, .Lbyte_copy_tail /* check if end of copy */
>> + addi t0, t0, -(SZREG) /* not to over run */
>> +1:
>> + REG_L a5, 0(a1)
>> + addi a1, a1, SZREG
>> + REG_S a5, 0(a0)
>> + addi a0, a0, SZREG
>> + bltu a0, t0, 1b
>> +
>> + addi t0, t0, SZREG /* revert to original value */
>> + j .Lbyte_copy_tail
>> +
>> +.Lword_copy_unlrolled:
>> + /*
>> + * Both src and dst are aligned
>> + * Unrolled word copy with every 8*SZREG iteration
>> *
>> * a0 - start of aligned dst
>> * a1 - start of aligned src
>> @@ -97,7 +130,12 @@ ENTRY(__asm_copy_from_user)
>> bltu a0, t0, 2b
>>
>> addi t0, t0, 8*SZREG /* revert to original value */
>> - j .Lbyte_copy_tail
>> +
>> + /*
>> + * Remaining might large enough for word_copy to reduce slow byte
>> + * copy
>> + */
>> + j .Lcheck_size_bulk
>>
>> .Lshift_copy:
>
> I'm still not convinced that going all the way to such a large unrolling factor is a net win, but this at least provides a much smoother cost curve.
I would like to meet and discuss the unrolling factor at some event.

The assembler version of memset in arch/riscv/lib/memset.S has thirty-two
consecutive unrolled loads and stores; initially I also thought that was
too much unrolling. However, I could not beat its speed with any of my
modifications that reduced the unrolling. I had never expected such a large
unrolling factor to help; my initial assumption was that two or three would
be enough for a roughly five-stage, in-order, single-issue pipeline. At the
same time, I have seen some in-order x86 cores benefit from large unrolling
in the past, so I decided to go with whichever was faster after measuring.

The speed of memset is critical for clearing an entire 4 KiB page.

The biggest downside is that large unrolling increases the binary size, and
most out-of-order cores can compensate without it by reordering instructions
internally. So once I am able to rewrite the function with inline assembly,
I would like to select the amount of unrolling with an #ifdef, choosing
between in-order and out-of-order cores. Currently all physical RISC-V cores
are in-order designs, but out-of-order cores will probably arrive at some
point and could benefit from the smaller binary size and the relaxed
memory-bandwidth requirements.
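To make the idea concrete, here is a rough sketch of what I mean once the
loop lives in C. Everything below is hypothetical: the Kconfig symbol
CONFIG_CPU_OUT_OF_ORDER does not exist, and the plain C assignments stand
in for the inline assembly with fixups:

#include <stddef.h>	/* size_t; in the kernel this would be <linux/types.h> */

#ifdef CONFIG_CPU_OUT_OF_ORDER		/* made-up symbol */
#define COPY_UNROLL	1	/* OoO core reorders by itself; keep the code small */
#else
#define COPY_UNROLL	8	/* in-order core benefits from manual unrolling */
#endif

static void word_copy_aligned(unsigned long *dst, const unsigned long *src,
			      size_t words)
{
	size_t i;

	while (words >= COPY_UNROLL) {
		for (i = 0; i < COPY_UNROLL; i++)
			dst[i] = src[i];	/* really inline asm + fixup */
		dst += COPY_UNROLL;
		src += COPY_UNROLL;
		words -= COPY_UNROLL;
	}
	while (words--)				/* non-unrolled remainder */
		*dst++ = *src++;
}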
>
> That said, this is causing my 32-bit configs to hang. There were a few conflicts so I may have messed something up, but nothing is jumping out at me. I've put what I ended up with on a branch, if you have time to look that'd be great but if not then I'll take another shot at this when I get back around to it.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/palmer/linux.git/commit/?h=wip-word_user_copy
>
> Here's the backtrace, though that's probably not all that useful:
>
> [ 0.703694] Unable to handle kernel NULL pointer dereference at virtual address 000005a8
> [ 0.704194] Oops [#1]
> [ 0.704301] Modules linked in:
> [ 0.704463] CPU: 2 PID: 1 Comm: init Not tainted 5.14.0-rc1-00016-g59461ddb9dbd #5
> [ 0.704660] Hardware name: riscv-virtio,qemu (DT)
> [ 0.704802] epc : walk_stackframe+0xac/0xc2
> [ 0.704941] ra : dump_backtrace+0x1a/0x22
> [ 0.705074] epc : c0004558 ra : c0004588 sp : c1c5fe10
> [ 0.705216] gp : c18b41c8 tp : c1cd8000 t0 : 00000000
> [ 0.705357] t1 : ffffffff t2 : 00000000 s0 : c1c5fe40
> [ 0.705506] s1 : c11313dc a0 : 00000000 a1 : 00000000
> [ 0.705647] a2 : c06fd2c2 a3 : c11313dc a4 : c084292d
> [ 0.705787] a5 : 00000000 a6 : c1864cb8 a7 : 3fffffff
> [ 0.705926] s2 : 00000000 s3 : c1123e88 s4 : 00000000
> [ 0.706066] s5 : c11313dc s6 : c06fd2c2 s7 : 00000001
> [ 0.706206] s8 : 00000000 s9 : 95af6e28 s10: 00000000
> [ 0.706345] s11: 00000001 t3 : 00000000 t4 : 00000000
> [ 0.706482] t5 : 00000001 t6 : 00000000
> [ 0.706594] status: 00000100 badaddr: 000005a8 cause: 0000000d
> [ 0.706809] [<c0004558>] walk_stackframe+0xac/0xc2
> [ 0.707019] [<c0004588>] dump_backtrace+0x1a/0x22
> [ 0.707149] [<c06fd312>] show_stack+0x2c/0x38
> [ 0.707271] [<c06ffba4>] dump_stack_lvl+0x40/0x58
> [ 0.707400] [<c06ffbce>] dump_stack+0x12/0x1a
> [ 0.707521] [<c06fd4f6>] panic+0xfa/0x2a6
> [ 0.707632] [<c000e2f4>] do_exit+0x7a8/0x7ac
> [ 0.707749] [<c000eefa>] do_group_exit+0x2a/0x7e
> [ 0.707872] [<c000ef60>] __wake_up_parent+0x0/0x20
> [ 0.707999] [<c0003020>] ret_from_syscall+0x0/0x2
> [ 0.708385] ---[ end trace 260976561a3770d1 ]---
I suspect the error above has the same cause as the one Qiu mentioned in the other thread.
Akira
* Re: [PATCH 1/1] riscv: __asm_copy_to-from_user: Improve using word copy if size < 9*SZREG
2021-08-16 19:00 ` Andreas Schwab
@ 2021-08-20 6:42 ` Akira Tsukamoto
0 siblings, 0 replies; 11+ messages in thread
From: Akira Tsukamoto @ 2021-08-20 6:42 UTC (permalink / raw)
To: Andreas Schwab, Palmer Dabbelt
Cc: akira.tsukamoto, Paul Walmsley, linux, geert, qiuwenbo, aou,
linux-riscv, linux-kernel
Hi Andreas,
On 8/17/2021 4:00 AM, Andreas Schwab wrote:
> On Aug 16 2021, Palmer Dabbelt wrote:
>
>> On Fri, 30 Jul 2021 06:52:44 PDT (-0700), akira.tsukamoto@gmail.com wrote:
>>> Reduce the number of slow byte_copy iterations when the size is between
>>> 2*SZREG and 9*SZREG by using a non-unrolled word_copy.
>>>
>>> Without it, any size smaller than 9*SZREG uses slow byte_copy instead
>>> of a non-unrolled word_copy.
>>>
>>> Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
>>> ---
>>> arch/riscv/lib/uaccess.S | 46 ++++++++++++++++++++++++++++++++++++----
>>> 1 file changed, 42 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
>>> index 63bc691cff91..6a80d5517afc 100644
>>> --- a/arch/riscv/lib/uaccess.S
>>> +++ b/arch/riscv/lib/uaccess.S
>>> @@ -34,8 +34,10 @@ ENTRY(__asm_copy_from_user)
>>> /*
>>> * Use byte copy only if too small.
>>> * SZREG holds 4 for RV32 and 8 for RV64
>>> + * a3 - 2*SZREG is minimum size for word_copy
>>> + * 1*SZREG for aligning dst + 1*SZREG for word_copy
>>> */
>>> - li a3, 9*SZREG /* size must be larger than size in word_copy */
>>> + li a3, 2*SZREG
>>> bltu a2, a3, .Lbyte_copy_tail
>>>
>>> /*
>>> @@ -66,9 +68,40 @@ ENTRY(__asm_copy_from_user)
>>> andi a3, a1, SZREG-1
>>> bnez a3, .Lshift_copy
>>>
>>> +.Lcheck_size_bulk:
>>> + /*
>>> + * Evaluate the size if possible to use unrolled.
>>> + * The word_copy_unlrolled requires larger than 8*SZREG
>>> + */
>>> + li a3, 8*SZREG
>>> + add a4, a0, a3
>>> + bltu a4, t0, .Lword_copy_unlrolled
>>> +
>>> .Lword_copy:
>>> - /*
>>> - * Both src and dst are aligned, unrolled word copy
>>> + /*
>>> + * Both src and dst are aligned
>>> + * None unrolled word copy with every 1*SZREG iteration
>>> + *
>>> + * a0 - start of aligned dst
>>> + * a1 - start of aligned src
>>> + * t0 - end of aligned dst
>>> + */
>>> + bgeu a0, t0, .Lbyte_copy_tail /* check if end of copy */
>>> + addi t0, t0, -(SZREG) /* not to over run */
>>> +1:
>>> + REG_L a5, 0(a1)
>>> + addi a1, a1, SZREG
>>> + REG_S a5, 0(a0)
>>> + addi a0, a0, SZREG
>>> + bltu a0, t0, 1b
>>> +
>>> + addi t0, t0, SZREG /* revert to original value */
>>> + j .Lbyte_copy_tail
>>> +
>>> +.Lword_copy_unlrolled:
>>> + /*
>>> + * Both src and dst are aligned
>>> + * Unrolled word copy with every 8*SZREG iteration
>>> *
>>> * a0 - start of aligned dst
>>> * a1 - start of aligned src
>>> @@ -97,7 +130,12 @@ ENTRY(__asm_copy_from_user)
>>> bltu a0, t0, 2b
>>>
>>> addi t0, t0, 8*SZREG /* revert to original value */
>>> - j .Lbyte_copy_tail
>>> +
>>> + /*
>>> + * Remaining might large enough for word_copy to reduce slow byte
>>> + * copy
>>> + */
>>> + j .Lcheck_size_bulk
>>>
>>> .Lshift_copy:
>>
>> I'm still not convinced that going all the way to such a large unrolling
>> factor is a net win, but this at least provides a much smoother cost
>> curve.
>>
>> That said, this is causing my 32-bit configs to hang.
>
> It's missing fixups for the loads in the loop.
>
> diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
> index a835df6bd68f..12ed1f76bd1f 100644
> --- a/arch/riscv/lib/uaccess.S
> +++ b/arch/riscv/lib/uaccess.S
> @@ -89,9 +89,9 @@ ENTRY(__asm_copy_from_user)
> bgeu a0, t0, .Lbyte_copy_tail /* check if end of copy */
> addi t0, t0, -(SZREG) /* not to over run */
> 1:
> - REG_L a5, 0(a1)
> + fixup REG_L a5, 0(a1), 10f
> addi a1, a1, SZREG
> - REG_S a5, 0(a0)
> + fixup REG_S a5, 0(a0), 10f
> addi a0, a0, SZREG
> bltu a0, t0, 1b
Thanks, our messages crossed.
I made the same changes after Qiu's comment, and I am in contact with him
so that I can also try it here and confirm whether any other changes are
required.
Please give me a little more time.
Akira