LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH] arm64: entry: Improve the performance of system calls
@ 2021-09-03 12:19 Zhen Lei
  2021-09-13 23:01 ` Punit Agrawal
  2021-09-14  9:55 ` Mark Rutland
  0 siblings, 2 replies; 7+ messages in thread
From: Zhen Lei @ 2021-09-03 12:19 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel
  Cc: Zhen Lei, Mark Rutland

Commit 582f95835a8f ("arm64: entry: convert el0_sync to C") converted lots
of functions from assembly to C, this greatly improves readability. But
el0_svc()/el0_svc_compat() is in response to system call requests from
user mode and may be in the hot path.

Although the SVC is in the first case of the switch statement in C, the
compiler optimizes the switch statement as a whole, and does not give SVC
a small boost.

Use "likely()" to help SVC directly invoke its handler after a simple
judgment to avoid entering the switch table lookup process.

After:
0000000000000ff0 <el0t_64_sync_handler>:
     ff0:       d503245f        bti     c
     ff4:       d503233f        paciasp
     ff8:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
     ffc:       910003fd        mov     x29, sp
    1000:       d5385201        mrs     x1, esr_el1
    1004:       531a7c22        lsr     w2, w1, #26
    1008:       f100545f        cmp     x2, #0x15
    100c:       540000a1        b.ne    1020 <el0t_64_sync_handler+0x30>
    1010:       97fffe14        bl      860 <el0_svc>
    1014:       a8c17bfd        ldp     x29, x30, [sp], #16
    1018:       d50323bf        autiasp
    101c:       d65f03c0        ret
    1020:       f100705f        cmp     x2, #0x1c

Execute "./lat_syscall null" on my board (BogoMIPS : 200.00), it can save
about 10ns.

Before:
Simple syscall: 0.2365 microseconds
Simple syscall: 0.2354 microseconds
Simple syscall: 0.2339 microseconds

After:
Simple syscall: 0.2255 microseconds
Simple syscall: 0.2254 microseconds
Simple syscall: 0.2256 microseconds

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 arch/arm64/kernel/entry-common.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 32f9796c4ffe77b..062eb5a895ec6f3 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -607,11 +607,14 @@ static void noinstr el0_fpac(struct pt_regs *regs, unsigned long esr)
 asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs)
 {
 	unsigned long esr = read_sysreg(esr_el1);
+	unsigned long ec = ESR_ELx_EC(esr);
 
-	switch (ESR_ELx_EC(esr)) {
-	case ESR_ELx_EC_SVC64:
+	if (likely(ec == ESR_ELx_EC_SVC64)) {
 		el0_svc(regs);
-		break;
+		return;
+	}
+
+	switch (ec) {
 	case ESR_ELx_EC_DABT_LOW:
 		el0_da(regs, esr);
 		break;
@@ -730,11 +733,14 @@ static void noinstr el0_svc_compat(struct pt_regs *regs)
 asmlinkage void noinstr el0t_32_sync_handler(struct pt_regs *regs)
 {
 	unsigned long esr = read_sysreg(esr_el1);
+	unsigned long ec = ESR_ELx_EC(esr);
 
-	switch (ESR_ELx_EC(esr)) {
-	case ESR_ELx_EC_SVC32:
+	if (likely(ec == ESR_ELx_EC_SVC32)) {
 		el0_svc_compat(regs);
-		break;
+		return;
+	}
+
+	switch (ec) {
 	case ESR_ELx_EC_DABT_LOW:
 		el0_da(regs, esr);
 		break;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] arm64: entry: Improve the performance of system calls
  2021-09-03 12:19 [PATCH] arm64: entry: Improve the performance of system calls Zhen Lei
@ 2021-09-13 23:01 ` Punit Agrawal
  2021-09-14  2:01   ` Leizhen (ThunderTown)
  2021-09-14  9:55 ` Mark Rutland
  1 sibling, 1 reply; 7+ messages in thread
From: Punit Agrawal @ 2021-09-13 23:01 UTC (permalink / raw)
  To: Zhen Lei
  Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel,
	Mark Rutland

Hi Zhen Lei,

Zhen Lei <thunder.leizhen@huawei.com> writes:

> Commit 582f95835a8f ("arm64: entry: convert el0_sync to C") converted lots
> of functions from assembly to C, this greatly improves readability. But
> el0_svc()/el0_svc_compat() is in response to system call requests from
> user mode and may be in the hot path.
>
> Although the SVC is in the first case of the switch statement in C, the
> compiler optimizes the switch statement as a whole, and does not give SVC
> a small boost.
>
> Use "likely()" to help SVC directly invoke its handler after a simple
> judgment to avoid entering the switch table lookup process.
>
> After:
> 0000000000000ff0 <el0t_64_sync_handler>:
>      ff0:       d503245f        bti     c
>      ff4:       d503233f        paciasp
>      ff8:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
>      ffc:       910003fd        mov     x29, sp
>     1000:       d5385201        mrs     x1, esr_el1
>     1004:       531a7c22        lsr     w2, w1, #26
>     1008:       f100545f        cmp     x2, #0x15
>     100c:       540000a1        b.ne    1020 <el0t_64_sync_handler+0x30>
>     1010:       97fffe14        bl      860 <el0_svc>
>     1014:       a8c17bfd        ldp     x29, x30, [sp], #16
>     1018:       d50323bf        autiasp
>     101c:       d65f03c0        ret
>     1020:       f100705f        cmp     x2, #0x1c
>
> Execute "./lat_syscall null" on my board (BogoMIPS : 200.00), it can save
> about 10ns.
>
> Before:
> Simple syscall: 0.2365 microseconds
> Simple syscall: 0.2354 microseconds
> Simple syscall: 0.2339 microseconds
>
> After:
> Simple syscall: 0.2255 microseconds
> Simple syscall: 0.2254 microseconds
> Simple syscall: 0.2256 microseconds

I was curious about the impact of the patch on other
micro-architectures. Following are the results from using the patch
applied to v5.14 on A72 and A53 on a RK399.

For the A72 -
Before:
Simple syscall: 0.4311 microseconds
Simple syscall: 0.4311 microseconds
Simple syscall: 0.4313 microseconds

After:
Simple syscall: 0.4249 microseconds
Simple syscall: 0.4248 microseconds
Simple syscall: 0.4248 microseconds

For the A53 -
Before:
Simple syscall: 0.4130 microseconds
Simple syscall: 0.4128 microseconds
Simple syscall: 0.4124 microseconds

After:
Simple syscall: 0.4031 microseconds
Simple syscall: 0.4078 microseconds
Simple syscall: 0.4030 microseconds

Although there is a small benefit, they are not as big as on your board
/ micro-architecture.

Were you able to see any impact on real workloads?

I imagine that other code paths in the sync handler would also benefit
from similar special casing - did you try any others. Page fault
handling came to mind.

Overall, I feel a little uneasy about the special casing introduced here
but at the same time see that it does benefit certain workloads.

One more comment below.

>
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  arch/arm64/kernel/entry-common.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> index 32f9796c4ffe77b..062eb5a895ec6f3 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -607,11 +607,14 @@ static void noinstr el0_fpac(struct pt_regs *regs, unsigned long esr)
>  asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs)
>  {
>  	unsigned long esr = read_sysreg(esr_el1);
> +	unsigned long ec = ESR_ELx_EC(esr);
>  
> -	switch (ESR_ELx_EC(esr)) {
> -	case ESR_ELx_EC_SVC64:
> +	if (likely(ec == ESR_ELx_EC_SVC64)) {
>  		el0_svc(regs);
> -		break;
> +		return;
> +	}
> +
> +	switch (ec) {
>  	case ESR_ELx_EC_DABT_LOW:
>  		el0_da(regs, esr);
>  		break;

Please include a big fat comment on why SVC (or any other patch) is
being separated out of the switch case - both here and below.

Thanks,
Punit

> @@ -730,11 +733,14 @@ static void noinstr el0_svc_compat(struct pt_regs *regs)
>  asmlinkage void noinstr el0t_32_sync_handler(struct pt_regs *regs)
>  {
>  	unsigned long esr = read_sysreg(esr_el1);
> +	unsigned long ec = ESR_ELx_EC(esr);
>  
> -	switch (ESR_ELx_EC(esr)) {
> -	case ESR_ELx_EC_SVC32:
> +	if (likely(ec == ESR_ELx_EC_SVC32)) {
>  		el0_svc_compat(regs);
> -		break;
> +		return;
> +	}
> +
> +	switch (ec) {
>  	case ESR_ELx_EC_DABT_LOW:
>  		el0_da(regs, esr);
>  		break;

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] arm64: entry: Improve the performance of system calls
  2021-09-13 23:01 ` Punit Agrawal
@ 2021-09-14  2:01   ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 7+ messages in thread
From: Leizhen (ThunderTown) @ 2021-09-14  2:01 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel,
	Mark Rutland



On 2021/9/14 7:01, Punit Agrawal wrote:
> Hi Zhen Lei,
> 
> Zhen Lei <thunder.leizhen@huawei.com> writes:
> 
>> Commit 582f95835a8f ("arm64: entry: convert el0_sync to C") converted lots
>> of functions from assembly to C, this greatly improves readability. But
>> el0_svc()/el0_svc_compat() is in response to system call requests from
>> user mode and may be in the hot path.
>>
>> Although the SVC is in the first case of the switch statement in C, the
>> compiler optimizes the switch statement as a whole, and does not give SVC
>> a small boost.
>>
>> Use "likely()" to help SVC directly invoke its handler after a simple
>> judgment to avoid entering the switch table lookup process.
>>
>> After:
>> 0000000000000ff0 <el0t_64_sync_handler>:
>>      ff0:       d503245f        bti     c
>>      ff4:       d503233f        paciasp
>>      ff8:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
>>      ffc:       910003fd        mov     x29, sp
>>     1000:       d5385201        mrs     x1, esr_el1
>>     1004:       531a7c22        lsr     w2, w1, #26
>>     1008:       f100545f        cmp     x2, #0x15
>>     100c:       540000a1        b.ne    1020 <el0t_64_sync_handler+0x30>
>>     1010:       97fffe14        bl      860 <el0_svc>
>>     1014:       a8c17bfd        ldp     x29, x30, [sp], #16
>>     1018:       d50323bf        autiasp
>>     101c:       d65f03c0        ret
>>     1020:       f100705f        cmp     x2, #0x1c
>>
>> Execute "./lat_syscall null" on my board (BogoMIPS : 200.00), it can save
>> about 10ns.
>>
>> Before:
>> Simple syscall: 0.2365 microseconds
>> Simple syscall: 0.2354 microseconds
>> Simple syscall: 0.2339 microseconds
>>
>> After:
>> Simple syscall: 0.2255 microseconds
>> Simple syscall: 0.2254 microseconds
>> Simple syscall: 0.2256 microseconds
> 
> I was curious about the impact of the patch on other
> micro-architectures. Following are the results from using the patch
> applied to v5.14 on A72 and A53 on a RK399.
> 
> For the A72 -
> Before:
> Simple syscall: 0.4311 microseconds
> Simple syscall: 0.4311 microseconds
> Simple syscall: 0.4313 microseconds
> 
> After:
> Simple syscall: 0.4249 microseconds
> Simple syscall: 0.4248 microseconds
> Simple syscall: 0.4248 microseconds
> 
> For the A53 -
> Before:
> Simple syscall: 0.4130 microseconds
> Simple syscall: 0.4128 microseconds
> Simple syscall: 0.4124 microseconds
> 
> After:
> Simple syscall: 0.4031 microseconds
> Simple syscall: 0.4078 microseconds
> Simple syscall: 0.4030 microseconds
> 
> Although there is a small benefit, they are not as big as on your board
> / micro-architecture.

Maybe the options are different. A slight performance improvement is also
a good thing with the same functionality and readability.

> 
> Were you able to see any impact on real workloads?

We have a plan to do it on the product side, and it won't come out so quickly.

> 
> I imagine that other code paths in the sync handler would also benefit
> from similar special casing - did you try any others. Page fault
> handling came to mind.

Yes, It would be clearer to apply likely() to ESR_ELx_EC_DABT_LOW as well.
After applying this patch, the compiler stops optimizing branch judgments
into an array. The page fault is also improved.

Other case items are mainly for debugging purposes. It doesn't matter whether
the performance is a bit faster or slower.

> 
> Overall, I feel a little uneasy about the special casing introduced here
> but at the same time see that it does benefit certain workloads.
> 
> One more comment below.
> 
>>
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>  arch/arm64/kernel/entry-common.c | 18 ++++++++++++------
>>  1 file changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
>> index 32f9796c4ffe77b..062eb5a895ec6f3 100644
>> --- a/arch/arm64/kernel/entry-common.c
>> +++ b/arch/arm64/kernel/entry-common.c
>> @@ -607,11 +607,14 @@ static void noinstr el0_fpac(struct pt_regs *regs, unsigned long esr)
>>  asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs)
>>  {
>>  	unsigned long esr = read_sysreg(esr_el1);
>> +	unsigned long ec = ESR_ELx_EC(esr);
>>  
>> -	switch (ESR_ELx_EC(esr)) {
>> -	case ESR_ELx_EC_SVC64:
>> +	if (likely(ec == ESR_ELx_EC_SVC64)) {
>>  		el0_svc(regs);
>> -		break;
>> +		return;
>> +	}
>> +
>> +	switch (ec) {
>>  	case ESR_ELx_EC_DABT_LOW:
>>  		el0_da(regs, esr);
>>  		break;
> 
> Please include a big fat comment on why SVC (or any other patch) is
> being separated out of the switch case - both here and below.

Okay, I'll add it.

> 
> Thanks,
> Punit
> 
>> @@ -730,11 +733,14 @@ static void noinstr el0_svc_compat(struct pt_regs *regs)
>>  asmlinkage void noinstr el0t_32_sync_handler(struct pt_regs *regs)
>>  {
>>  	unsigned long esr = read_sysreg(esr_el1);
>> +	unsigned long ec = ESR_ELx_EC(esr);
>>  
>> -	switch (ESR_ELx_EC(esr)) {
>> -	case ESR_ELx_EC_SVC32:
>> +	if (likely(ec == ESR_ELx_EC_SVC32)) {
>>  		el0_svc_compat(regs);
>> -		break;
>> +		return;
>> +	}
>> +
>> +	switch (ec) {
>>  	case ESR_ELx_EC_DABT_LOW:
>>  		el0_da(regs, esr);
>>  		break;
> .
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] arm64: entry: Improve the performance of system calls
  2021-09-03 12:19 [PATCH] arm64: entry: Improve the performance of system calls Zhen Lei
  2021-09-13 23:01 ` Punit Agrawal
@ 2021-09-14  9:55 ` Mark Rutland
  2021-09-14 11:23   ` Leizhen (ThunderTown)
  2021-09-14 15:17   ` Joey Gouly
  1 sibling, 2 replies; 7+ messages in thread
From: Mark Rutland @ 2021-09-14  9:55 UTC (permalink / raw)
  To: Zhen Lei; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel

Hi,

On Fri, Sep 03, 2021 at 08:19:50PM +0800, Zhen Lei wrote:
> Commit 582f95835a8f ("arm64: entry: convert el0_sync to C") converted lots
> of functions from assembly to C, this greatly improves readability. But
> el0_svc()/el0_svc_compat() is in response to system call requests from
> user mode and may be in the hot path.
> 
> Although the SVC is in the first case of the switch statement in C, the
> compiler optimizes the switch statement as a whole, and does not give SVC
> a small boost.
> 
> Use "likely()" to help SVC directly invoke its handler after a simple
> judgment to avoid entering the switch table lookup process.
> 
> After:
> 0000000000000ff0 <el0t_64_sync_handler>:
>      ff0:       d503245f        bti     c
>      ff4:       d503233f        paciasp
>      ff8:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
>      ffc:       910003fd        mov     x29, sp
>     1000:       d5385201        mrs     x1, esr_el1
>     1004:       531a7c22        lsr     w2, w1, #26
>     1008:       f100545f        cmp     x2, #0x15
>     100c:       540000a1        b.ne    1020 <el0t_64_sync_handler+0x30>
>     1010:       97fffe14        bl      860 <el0_svc>
>     1014:       a8c17bfd        ldp     x29, x30, [sp], #16
>     1018:       d50323bf        autiasp
>     101c:       d65f03c0        ret
>     1020:       f100705f        cmp     x2, #0x1c

It would be helpful if you could state which toolchain and config was
used to generate the above.

For comparison, what was the code generation like before? I assume
el0_svc wasn't the target of the first test and branch? Assuming so, how
many tests and branches were there before the call to el0_svc()?

At a high-level, I'm not too keen on special-casing things unless
necessary.

I wonder if we could get similar results without special-casing by using
a static const array of handlers indexed by the EC, since (with GCC
11.1.0 from the kernel.org crosstool page) that can result in code like:

0000000000001010 <el0t_64_sync_handler>:
    1010:       d503245f        bti     c
    1014:       d503233f        paciasp
    1018:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
    101c:       910003fd        mov     x29, sp
    1020:       d5385201        mrs     x1, esr_el1
    1024:       90000002        adrp    x2, 0 <el0t_64_sync_handlers>
    1028:       531a7c23        lsr     w3, w1, #26
    102c:       91000042        add     x2, x2, #:lo12:<el0t_64_sync_handlers>
    1030:       f8637842        ldr     x2, [x2, x3, lsl #3]
    1034:       d63f0040        blr     x2
    1038:       a8c17bfd        ldp     x29, x30, [sp], #16
    103c:       d50323bf        autiasp
    1040:       d65f03c0        ret

... which might do better by virtue of reducing a chain of potential
mispredicts down to a single potential mispredict, and dynamic branch
prediction hopefully does a good job of predicting the common case at
runtime. That said, the resulting tables will be pretty big...

> 
> Execute "./lat_syscall null" on my board (BogoMIPS : 200.00), it can save
> about 10ns.
> 
> Before:
> Simple syscall: 0.2365 microseconds
> Simple syscall: 0.2354 microseconds
> Simple syscall: 0.2339 microseconds
> 
> After:
> Simple syscall: 0.2255 microseconds
> Simple syscall: 0.2254 microseconds
> Simple syscall: 0.2256 microseconds

I appreciate this can be seen by a microbenchmark, but does this have an
impact on a real workload? I'd imagine that real syscall usage will
dominate this in practice, and this would fall into the noise.

Thanks,
Mark.

> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  arch/arm64/kernel/entry-common.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> index 32f9796c4ffe77b..062eb5a895ec6f3 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -607,11 +607,14 @@ static void noinstr el0_fpac(struct pt_regs *regs, unsigned long esr)
>  asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs)
>  {
>  	unsigned long esr = read_sysreg(esr_el1);
> +	unsigned long ec = ESR_ELx_EC(esr);
>  
> -	switch (ESR_ELx_EC(esr)) {
> -	case ESR_ELx_EC_SVC64:
> +	if (likely(ec == ESR_ELx_EC_SVC64)) {
>  		el0_svc(regs);
> -		break;
> +		return;
> +	}
> +
> +	switch (ec) {
>  	case ESR_ELx_EC_DABT_LOW:
>  		el0_da(regs, esr);
>  		break;
> @@ -730,11 +733,14 @@ static void noinstr el0_svc_compat(struct pt_regs *regs)
>  asmlinkage void noinstr el0t_32_sync_handler(struct pt_regs *regs)
>  {
>  	unsigned long esr = read_sysreg(esr_el1);
> +	unsigned long ec = ESR_ELx_EC(esr);
>  
> -	switch (ESR_ELx_EC(esr)) {
> -	case ESR_ELx_EC_SVC32:
> +	if (likely(ec == ESR_ELx_EC_SVC32)) {
>  		el0_svc_compat(regs);
> -		break;
> +		return;
> +	}
> +
> +	switch (ec) {
>  	case ESR_ELx_EC_DABT_LOW:
>  		el0_da(regs, esr);
>  		break;
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] arm64: entry: Improve the performance of system calls
  2021-09-14  9:55 ` Mark Rutland
@ 2021-09-14 11:23   ` Leizhen (ThunderTown)
  2021-09-14 11:48     ` Leizhen (ThunderTown)
  2021-09-14 15:17   ` Joey Gouly
  1 sibling, 1 reply; 7+ messages in thread
From: Leizhen (ThunderTown) @ 2021-09-14 11:23 UTC (permalink / raw)
  To: Mark Rutland; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel



On 2021/9/14 17:55, Mark Rutland wrote:
> Hi,
> 
> On Fri, Sep 03, 2021 at 08:19:50PM +0800, Zhen Lei wrote:
>> Commit 582f95835a8f ("arm64: entry: convert el0_sync to C") converted lots
>> of functions from assembly to C, this greatly improves readability. But
>> el0_svc()/el0_svc_compat() is in response to system call requests from
>> user mode and may be in the hot path.
>>
>> Although the SVC is in the first case of the switch statement in C, the
>> compiler optimizes the switch statement as a whole, and does not give SVC
>> a small boost.
>>
>> Use "likely()" to help SVC directly invoke its handler after a simple
>> judgment to avoid entering the switch table lookup process.
>>
>> After:
>> 0000000000000ff0 <el0t_64_sync_handler>:
>>      ff0:       d503245f        bti     c
>>      ff4:       d503233f        paciasp
>>      ff8:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
>>      ffc:       910003fd        mov     x29, sp
>>     1000:       d5385201        mrs     x1, esr_el1
>>     1004:       531a7c22        lsr     w2, w1, #26
>>     1008:       f100545f        cmp     x2, #0x15
>>     100c:       540000a1        b.ne    1020 <el0t_64_sync_handler+0x30>
>>     1010:       97fffe14        bl      860 <el0_svc>
>>     1014:       a8c17bfd        ldp     x29, x30, [sp], #16
>>     1018:       d50323bf        autiasp
>>     101c:       d65f03c0        ret
>>     1020:       f100705f        cmp     x2, #0x1c
> 
> It would be helpful if you could state which toolchain and config was
> used to generate the above.

gcc version 7.3.0 (GCC), make defconfig

> 
> For comparison, what was the code generation like before? I assume
> el0_svc wasn't the target of the first test and branch? Assuming so, how
> many tests and branches were there before the call to el0_svc()?

0000000000000a10 <el0_sync_handler>:
 a10:   a9bf7bfd        stp     x29, x30, [sp, #-16]!
 a14:   910003fd        mov     x29, sp
 a18:   d5385201        mrs     x1, esr_el1
 a1c:   531a7c22        lsr     w2, w1, #26
 a20:   f100f05f        cmp     x2, #0x3c
 a24:   54000068        b.hi    a30 <el0_sync_handler+0x20>  // b.pmore
 a28:   7100f05f        cmp     w2, #0x3c
 a2c:   540000a9        b.ls    a40 <el0_sync_handler+0x30>  // b.plast
 a30:   97ffffc8        bl      950 <el0_inv>
 a34:   a8c17bfd        ldp     x29, x30, [sp], #16
 a38:   d65f03c0        ret
 a3c:   d503201f        nop
 a40:   90000003        adrp    x3, 0 <enter_from_kernel_mode.isra.6>
 a44:   91000063        add     x3, x3, #0x0
 a48:   38624862        ldrb    w2, [x3, w2, uxtw]
 a4c:   10000063        adr     x3, a58 <el0_sync_handler+0x48>
 a50:   8b228862        add     x2, x3, w2, sxtb #2
 a54:   d61f0040        br      x2
 a58:   97ffff9e        bl      8d0 <el0_dbg>
 a5c:   17fffff6        b       a34 <el0_sync_handler+0x24>
 a60:   97ffff2c        bl      710 <el0_fpsimd_exc>
 a64:   17fffff4        b       a34 <el0_sync_handler+0x24>
 a68:   97ffff46        bl      780 <el0_sp>
 a6c:   17fffff2        b       a34 <el0_sync_handler+0x24>
 a70:   97fffece        bl      5a8 <el0_da>
 a74:   17fffff0        b       a34 <el0_sync_handler+0x24>
 a78:   97ffff50        bl      7b8 <el0_pc>
 a7c:   17ffffee        b       a34 <el0_sync_handler+0x24>
 a80:   97fffedc        bl      5f0 <el0_ia>
 a84:   17ffffec        b       a34 <el0_sync_handler+0x24>
 a88:   97ffffa4        bl      918 <el0_fpac>
 a8c:   17ffffea        b       a34 <el0_sync_handler+0x24>
 a90:   97ffff12        bl      6d8 <el0_sve_acc>
 a94:   17ffffe8        b       a34 <el0_sync_handler+0x24>
 a98:   97fffeba        bl      580 <el0_svc>
 a9c:   17ffffe6        b       a34 <el0_sync_handler+0x24>
 aa0:   97ffff80        bl      8a0 <el0_bti>
 aa4:   17ffffe4        b       a34 <el0_sync_handler+0x24>
 aa8:   97fffefe        bl      6a0 <el0_fpsimd_acc>
 aac:   17ffffe2        b       a34 <el0_sync_handler+0x24>
 ab0:   97ffff26        bl      748 <el0_sys>
 ab4:   17ffffe0        b       a34 <el0_sync_handler+0x24>
 ab8:   97ffff6e        bl      870 <el0_undef>
 abc:   17ffffde        b       a34 <el0_sync_handler+0x24>


> 
> At a high-level, I'm not too keen on special-casing things unless
> necessary.
> 
> I wonder if we could get similar results without special-casing by using
> a static const array of handlers indexed by the EC, since (with GCC
> 11.1.0 from the kernel.org crosstool page) that can result in code like:
> 
> 0000000000001010 <el0t_64_sync_handler>:
>     1010:       d503245f        bti     c
>     1014:       d503233f        paciasp
>     1018:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
>     101c:       910003fd        mov     x29, sp
>     1020:       d5385201        mrs     x1, esr_el1
>     1024:       90000002        adrp    x2, 0 <el0t_64_sync_handlers>
>     1028:       531a7c23        lsr     w3, w1, #26
>     102c:       91000042        add     x2, x2, #:lo12:<el0t_64_sync_handlers>
>     1030:       f8637842        ldr     x2, [x2, x3, lsl #3]
>     1034:       d63f0040        blr     x2
>     1038:       a8c17bfd        ldp     x29, x30, [sp], #16
>     103c:       d50323bf        autiasp
>     1040:       d65f03c0        ret
> 
> ... which might do better by virtue of reducing a chain of potential
> mispredicts down to a single potential mispredict, and dynamic branch
> prediction hopefully does a good job of predicting the common case at
> runtime. That said, the resulting tables will be pretty big...


 a48:   38624862        ldrb    w2, [x3, w2, uxtw]
 a4c:   10000063        adr     x3, a58 <el0_sync_handler+0x48>
 a50:   8b228862        add     x2, x3, w2, sxtb #2
 a54:   d61f0040        br      x2

The original implementation also generated a query table, but yours is
more concise. I will try to test it. Looks like a better solution.

> 
>>
>> Execute "./lat_syscall null" on my board (BogoMIPS : 200.00), it can save
>> about 10ns.
>>
>> Before:
>> Simple syscall: 0.2365 microseconds
>> Simple syscall: 0.2354 microseconds
>> Simple syscall: 0.2339 microseconds
>>
>> After:
>> Simple syscall: 0.2255 microseconds
>> Simple syscall: 0.2254 microseconds
>> Simple syscall: 0.2256 microseconds
> 
> I appreciate this can be seen by a microbenchmark, but does this have an
> impact on a real workload? I'd imagine that real syscall usage will
> dominate this in practice, and this would fall into the noise.

The product side has a test plan, but the progress will be slow.

> 
> Thanks,
> Mark.
> 
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>  arch/arm64/kernel/entry-common.c | 18 ++++++++++++------
>>  1 file changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
>> index 32f9796c4ffe77b..062eb5a895ec6f3 100644
>> --- a/arch/arm64/kernel/entry-common.c
>> +++ b/arch/arm64/kernel/entry-common.c
>> @@ -607,11 +607,14 @@ static void noinstr el0_fpac(struct pt_regs *regs, unsigned long esr)
>>  asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs)
>>  {
>>  	unsigned long esr = read_sysreg(esr_el1);
>> +	unsigned long ec = ESR_ELx_EC(esr);
>>  
>> -	switch (ESR_ELx_EC(esr)) {
>> -	case ESR_ELx_EC_SVC64:
>> +	if (likely(ec == ESR_ELx_EC_SVC64)) {
>>  		el0_svc(regs);
>> -		break;
>> +		return;
>> +	}
>> +
>> +	switch (ec) {
>>  	case ESR_ELx_EC_DABT_LOW:
>>  		el0_da(regs, esr);
>>  		break;
>> @@ -730,11 +733,14 @@ static void noinstr el0_svc_compat(struct pt_regs *regs)
>>  asmlinkage void noinstr el0t_32_sync_handler(struct pt_regs *regs)
>>  {
>>  	unsigned long esr = read_sysreg(esr_el1);
>> +	unsigned long ec = ESR_ELx_EC(esr);
>>  
>> -	switch (ESR_ELx_EC(esr)) {
>> -	case ESR_ELx_EC_SVC32:
>> +	if (likely(ec == ESR_ELx_EC_SVC32)) {
>>  		el0_svc_compat(regs);
>> -		break;
>> +		return;
>> +	}
>> +
>> +	switch (ec) {
>>  	case ESR_ELx_EC_DABT_LOW:
>>  		el0_da(regs, esr);
>>  		break;
>> -- 
>> 2.25.1
>>
> .
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] arm64: entry: Improve the performance of system calls
  2021-09-14 11:23   ` Leizhen (ThunderTown)
@ 2021-09-14 11:48     ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 7+ messages in thread
From: Leizhen (ThunderTown) @ 2021-09-14 11:48 UTC (permalink / raw)
  To: Mark Rutland; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel



On 2021/9/14 19:23, Leizhen (ThunderTown) wrote:
> 
> 
> On 2021/9/14 17:55, Mark Rutland wrote:
>> Hi,
>>
>> On Fri, Sep 03, 2021 at 08:19:50PM +0800, Zhen Lei wrote:
>>> Commit 582f95835a8f ("arm64: entry: convert el0_sync to C") converted lots
>>> of functions from assembly to C, this greatly improves readability. But
>>> el0_svc()/el0_svc_compat() is in response to system call requests from
>>> user mode and may be in the hot path.
>>>
>>> Although the SVC is in the first case of the switch statement in C, the
>>> compiler optimizes the switch statement as a whole, and does not give SVC
>>> a small boost.
>>>
>>> Use "likely()" to help SVC directly invoke its handler after a simple
>>> judgment to avoid entering the switch table lookup process.
>>>
>>> After:
>>> 0000000000000ff0 <el0t_64_sync_handler>:
>>>      ff0:       d503245f        bti     c
>>>      ff4:       d503233f        paciasp
>>>      ff8:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
>>>      ffc:       910003fd        mov     x29, sp
>>>     1000:       d5385201        mrs     x1, esr_el1
>>>     1004:       531a7c22        lsr     w2, w1, #26
>>>     1008:       f100545f        cmp     x2, #0x15
>>>     100c:       540000a1        b.ne    1020 <el0t_64_sync_handler+0x30>
>>>     1010:       97fffe14        bl      860 <el0_svc>
>>>     1014:       a8c17bfd        ldp     x29, x30, [sp], #16
>>>     1018:       d50323bf        autiasp
>>>     101c:       d65f03c0        ret
>>>     1020:       f100705f        cmp     x2, #0x1c
>>
>> It would be helpful if you could state which toolchain and config was
>> used to generate the above.
> 
> gcc version 7.3.0 (GCC), make defconfig
> 
>>
>> For comparison, what was the code generation like before? I assume
>> el0_svc wasn't the target of the first test and branch? Assuming so, how
>> many tests and branches were there before the call to el0_svc()?
> 

Sorry, the old assembly code was not compiled with the latest mainline.
But the key point is no different.

0000000000000fe0 <el0t_64_sync_handler>:
     fe0:       d503233f        paciasp
     fe4:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
     fe8:       910003fd        mov     x29, sp
     fec:       d5385201        mrs     x1, esr_el1
     ff0:       531a7c22        lsr     w2, w1, #26
     ff4:       f100f05f        cmp     x2, #0x3c
     ff8:       54000068        b.hi    1004 <el0t_64_sync_handler+0x24>  // b.pmore
     ffc:       7100f05f        cmp     w2, #0x3c
    1000:       540000c9        b.ls    1018 <el0t_64_sync_handler+0x38>  // b.plast
    1004:       97fffce9        bl      3a8 <el0_inv>
    1008:       a8c17bfd        ldp     x29, x30, [sp], #16
    100c:       d50323bf        autiasp
    1010:       d65f03c0        ret
    1014:       d503201f        nop
    1018:       90000003        adrp    x3, 0 <el0_da>
    101c:       91000063        add     x3, x3, #0x0
    1020:       38624862        ldrb    w2, [x3, w2, uxtw]
    1024:       10000063        adr     x3, 1030 <el0t_64_sync_handler+0x50>
    1028:       8b228862        add     x2, x3, w2, sxtb #2
    102c:       d61f0040        br      x2
    1030:       97fffc3a        bl      118 <el0_dbg>
    1034:       17fffff5        b       1008 <el0t_64_sync_handler+0x28>
    1038:       97fffc96        bl      290 <el0_fpsimd_exc>
    103c:       17fffff3        b       1008 <el0t_64_sync_handler+0x28>
    1040:       97fffc08        bl      60 <el0_sp>
    1044:       17fffff1        b       1008 <el0t_64_sync_handler+0x28>
    1048:       97fffbee        bl      0 <el0_da>
    104c:       17ffffef        b       1008 <el0t_64_sync_handler+0x28>
    1050:       97fffeea        bl      bf8 <el0_pc>
    1054:       17ffffed        b       1008 <el0t_64_sync_handler+0x28>
    1058:       97fffeb2        bl      b20 <el0_ia>
    105c:       17ffffeb        b       1008 <el0t_64_sync_handler+0x28>
    1060:       97fffc46        bl      178 <el0_fpac>
    1064:       17ffffe9        b       1008 <el0t_64_sync_handler+0x28>
    1068:       97fffc72        bl      230 <el0_sve_acc>
    106c:       17ffffe7        b       1008 <el0t_64_sync_handler+0x28>
    1070:       97ffff18        bl      cd0 <el0_svc>
    1074:       17ffffe5        b       1008 <el0t_64_sync_handler+0x28>
    1078:       97fffcb6        bl      350 <el0_bti>
    107c:       17ffffe3        b       1008 <el0t_64_sync_handler+0x28>
    1080:       97fffc54        bl      1d0 <el0_fpsimd_acc>
    1084:       17ffffe1        b       1008 <el0t_64_sync_handler+0x28>
    1088:       97fffc9a        bl      2f0 <el0_sys>
    108c:       17ffffdf        b       1008 <el0t_64_sync_handler+0x28>
    1090:       97fffc0c        bl      c0 <el0_undef>
    1094:       17ffffdd        b       1008 <el0t_64_sync_handler+0x28>

> 
> 
>>
>> At a high-level, I'm not too keen on special-casing things unless
>> necessary.
>>
>> I wonder if we could get similar results without special-casing by using
>> a static const array of handlers indexed by the EC, since (with GCC
>> 11.1.0 from the kernel.org crosstool page) that can result in code like:
>>
>> 0000000000001010 <el0t_64_sync_handler>:
>>     1010:       d503245f        bti     c
>>     1014:       d503233f        paciasp
>>     1018:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
>>     101c:       910003fd        mov     x29, sp
>>     1020:       d5385201        mrs     x1, esr_el1
>>     1024:       90000002        adrp    x2, 0 <el0t_64_sync_handlers>
>>     1028:       531a7c23        lsr     w3, w1, #26
>>     102c:       91000042        add     x2, x2, #:lo12:<el0t_64_sync_handlers>
>>     1030:       f8637842        ldr     x2, [x2, x3, lsl #3]
>>     1034:       d63f0040        blr     x2
>>     1038:       a8c17bfd        ldp     x29, x30, [sp], #16
>>     103c:       d50323bf        autiasp
>>     1040:       d65f03c0        ret
>>
>> ... which might do better by virtue of reducing a chain of potential
>> mispredicts down to a single potential mispredict, and dynamic branch
>> prediction hopefully does a good job of predicting the common case at
>> runtime. That said, the resulting tables will be pretty big...
> 
> 
>  a48:   38624862        ldrb    w2, [x3, w2, uxtw]
>  a4c:   10000063        adr     x3, a58 <el0_sync_handler+0x48>
>  a50:   8b228862        add     x2, x3, w2, sxtb #2
>  a54:   d61f0040        br      x2
> 
> The original implementation also generated a query table, but yours is
> more concise. I will try to test it. Looks like a better solution.
> 
>>
>>>
>>> Execute "./lat_syscall null" on my board (BogoMIPS : 200.00), it can save
>>> about 10ns.
>>>
>>> Before:
>>> Simple syscall: 0.2365 microseconds
>>> Simple syscall: 0.2354 microseconds
>>> Simple syscall: 0.2339 microseconds
>>>
>>> After:
>>> Simple syscall: 0.2255 microseconds
>>> Simple syscall: 0.2254 microseconds
>>> Simple syscall: 0.2256 microseconds
>>
>> I appreciate this can be seen by a microbenchmark, but does this have an
>> impact on a real workload? I'd imagine that real syscall usage will
>> dominate this in practice, and this would fall into the noise.
> 
> The product side has a test plan, but the progress will be slow.
> 
>>
>> Thanks,
>> Mark.
>>
>>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>>> ---
>>>  arch/arm64/kernel/entry-common.c | 18 ++++++++++++------
>>>  1 file changed, 12 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
>>> index 32f9796c4ffe77b..062eb5a895ec6f3 100644
>>> --- a/arch/arm64/kernel/entry-common.c
>>> +++ b/arch/arm64/kernel/entry-common.c
>>> @@ -607,11 +607,14 @@ static void noinstr el0_fpac(struct pt_regs *regs, unsigned long esr)
>>>  asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs)
>>>  {
>>>  	unsigned long esr = read_sysreg(esr_el1);
>>> +	unsigned long ec = ESR_ELx_EC(esr);
>>>  
>>> -	switch (ESR_ELx_EC(esr)) {
>>> -	case ESR_ELx_EC_SVC64:
>>> +	if (likely(ec == ESR_ELx_EC_SVC64)) {
>>>  		el0_svc(regs);
>>> -		break;
>>> +		return;
>>> +	}
>>> +
>>> +	switch (ec) {
>>>  	case ESR_ELx_EC_DABT_LOW:
>>>  		el0_da(regs, esr);
>>>  		break;
>>> @@ -730,11 +733,14 @@ static void noinstr el0_svc_compat(struct pt_regs *regs)
>>>  asmlinkage void noinstr el0t_32_sync_handler(struct pt_regs *regs)
>>>  {
>>>  	unsigned long esr = read_sysreg(esr_el1);
>>> +	unsigned long ec = ESR_ELx_EC(esr);
>>>  
>>> -	switch (ESR_ELx_EC(esr)) {
>>> -	case ESR_ELx_EC_SVC32:
>>> +	if (likely(ec == ESR_ELx_EC_SVC32)) {
>>>  		el0_svc_compat(regs);
>>> -		break;
>>> +		return;
>>> +	}
>>> +
>>> +	switch (ec) {
>>>  	case ESR_ELx_EC_DABT_LOW:
>>>  		el0_da(regs, esr);
>>>  		break;
>>> -- 
>>> 2.25.1
>>>
>> .
>>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] arm64: entry: Improve the performance of system calls
  2021-09-14  9:55 ` Mark Rutland
  2021-09-14 11:23   ` Leizhen (ThunderTown)
@ 2021-09-14 15:17   ` Joey Gouly
  1 sibling, 0 replies; 7+ messages in thread
From: Joey Gouly @ 2021-09-14 15:17 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Zhen Lei, Catalin Marinas, Will Deacon, linux-arm-kernel,
	linux-kernel, nd

Hi,

On Tue, Sep 14, 2021 at 10:55:16AM +0100, Mark Rutland wrote:
> Hi,
> 
> At a high-level, I'm not too keen on special-casing things unless
> necessary.
> 
> I wonder if we could get similar results without special-casing by using
> a static const array of handlers indexed by the EC, since (with GCC
> 11.1.0 from the kernel.org crosstool page) that can result in code like:
> 
> 0000000000001010 <el0t_64_sync_handler>:
>     1010:       d503245f        bti     c
>     1014:       d503233f        paciasp
>     1018:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
>     101c:       910003fd        mov     x29, sp
>     1020:       d5385201        mrs     x1, esr_el1
>     1024:       90000002        adrp    x2, 0 <el0t_64_sync_handlers>
>     1028:       531a7c23        lsr     w3, w1, #26
>     102c:       91000042        add     x2, x2, #:lo12:<el0t_64_sync_handlers>
>     1030:       f8637842        ldr     x2, [x2, x3, lsl #3]
>     1034:       d63f0040        blr     x2
>     1038:       a8c17bfd        ldp     x29, x30, [sp], #16
>     103c:       d50323bf        autiasp
>     1040:       d65f03c0        ret
> 
> ... which might do better by virtue of reducing a chain of potential
> mispredicts down to a single potential mispredict, and dynamic branch
> prediction hopefully does a good job of predicting the common case at
> runtime. That said, the resulting tables will be pretty big...

I tested Mark's branch which implements this (found at
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry/switch-table)

I also took lmbench from https://github.com/intel/lmbench.git and built
`lat_syscall` with:

    gcc lat_syscall.c lib_*.c -l m -o lat_syscall -static

These are the results I got from benchmarking on my MacBook Air M1, with
the following command:

    ./lat_syscall null &> /dev/null ; uname -a ; for i in 0 1 2 3 4 ; do ./lat_syscall null ; done

The kernel was based on arm64_defconfig that was then stripped of as much as possible. 
GCC 11.1.0 from kernel.org crosstool page.
Clang build fom git b041b613e6fff713fc9ad6dbc73024286fb2fc93.

gcc:
        master: 0.14300
  switch-table: 0.14350
        likely: 0.13962

clang:
        master: 0.14354
  switch-table: 0.14642
        likely: 0.14256


The generated code looks similar to what Leizhen has posted, so I didn't
post it again.

So it seems the table approach actually performs worse in my testing,
and Leizhen's approach is slightly better than master (d0ee23f9d78be5531c4b055ea424ed0b489dfe9b).

Thanks,
Joey

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-09-14 15:18 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-03 12:19 [PATCH] arm64: entry: Improve the performance of system calls Zhen Lei
2021-09-13 23:01 ` Punit Agrawal
2021-09-14  2:01   ` Leizhen (ThunderTown)
2021-09-14  9:55 ` Mark Rutland
2021-09-14 11:23   ` Leizhen (ThunderTown)
2021-09-14 11:48     ` Leizhen (ThunderTown)
2021-09-14 15:17   ` Joey Gouly

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).