LKML Archive on lore.kernel.org
* [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling
From: Denys Vlasenko @ 2015-03-09 18:39 UTC
To: Andy Lutomirski
Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
    Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
    Alexei Starovoitov, Will Drewry, Kees Cook, x86, linux-kernel

These changes make the SYSCALL64 code path save the flags and the user's
stack pointer in pt_regs->flags and pt_regs->sp, where they belong.

As a result, we can drop stub_iopl() and thread_struct::usersp.
Usage of PER_CPU(old_rsp) is reduced to the bare minimum.
The FIXUP/RESTORE_TOP_OF_STACK macros are on a diet too.

Changes since v1: on Ingo's request, split into 4 patches instead of 2.

CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org

Denys Vlasenko (4):
  x86: save r11 into pt_regs->flags on SYSCALL64 fastpath
  Remove stub_iopl
  x86: save user %rsp in pt_regs->sp on SYSCALL64 fastpath
  Remove unused stuff

 arch/x86/include/asm/calling.h   | 20 +++++++++-----
 arch/x86/include/asm/compat.h    |  2 +-
 arch/x86/include/asm/processor.h |  6 -----
 arch/x86/include/asm/ptrace.h    |  8 ++----
 arch/x86/kernel/entry_64.S       | 57 ++++++++++++++-----------------------
 arch/x86/kernel/perf_regs.c      |  2 +-
 arch/x86/kernel/process_64.c     |  8 +-----
 arch/x86/syscalls/syscall_64.tbl |  2 +-
 arch/x86/um/sys_call_table_64.c  |  2 +-
 9 files changed, 40 insertions(+), 67 deletions(-)

--
1.8.1.4
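For orientation: the "iret frame" referred to throughout this series is the
top five slots of struct pt_regs, the part of the frame that the iret
instruction itself pops. A minimal sketch of the x86-64 layout (byte offsets
shown only for illustration; the real values come from asm-offsets):

        /*
         * struct pt_regs, low to high addresses:
         *   0x00 r15, 0x08 r14, 0x10 r13, 0x18 r12, 0x20 rbp, 0x28 rbx,
         *   0x30 r11, 0x38 r10, 0x40 r9,  0x48 r8,  0x50 rax, 0x58 rcx,
         *   0x60 rdx, 0x68 rsi, 0x70 rdi, 0x78 orig_rax,
         *   0x80 rip, 0x88 cs,  0x90 eflags, 0x98 rsp, 0xa0 ss  <- iret frame
         */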
* [PATCH 1/4] x86: save r11 into pt_regs->flags on SYSCALL64 fastpath
From: Denys Vlasenko @ 2015-03-09 18:39 UTC
To: Andy Lutomirski
Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
    Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
    Alexei Starovoitov, Will Drewry, Kees Cook, x86, linux-kernel

Before this patch, r11 was saved in pt_regs->r11. This looks natural,
but requires messy shuffling to/from the iret frame whenever ptrace or
e.g. sys_iopl wants to modify flags - because that's how this register
is used by SYSCALL/SYSRET.

This patch saves r11 in pt_regs->flags, and uses that value for the
SYSRET64 insn. The shuffling is eliminated.

FIXUP/RESTORE_TOP_OF_STACK are simplified.

stub_iopl is no longer needed: pt_regs->flags needs no fixing up.

Testing shows that the syscall fast path is ~54.3 ns both before and
after the patch (on a 2.7 GHz Sandy Bridge CPU).

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/calling.h | 20 ++++++++++++++------
 arch/x86/kernel/entry_64.S     | 24 +++++++++++-------------
 2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index f1a962f..4b5f7bf 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -95,9 +95,11 @@ For 32-bit we have the following conventions - kernel is built with
 	CFI_ADJUST_CFA_OFFSET 15*8+\addskip
 	.endm

-	.macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8plus=1
-	.if \r8plus
+	.macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8910=1 r11=1
+	.if \r11
 	movq_cfi r11, 6*8+\offset
+	.endif
+	.if \r8910
 	movq_cfi r10, 7*8+\offset
 	movq_cfi r9, 8*8+\offset
 	movq_cfi r8, 9*8+\offset
@@ -113,16 +115,19 @@ For 32-bit we have the following conventions - kernel is built with
 	movq_cfi rdi, 14*8+\offset
 	.endm
 	.macro SAVE_C_REGS offset=0
-	SAVE_C_REGS_HELPER \offset, 1, 1, 1
+	SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_RAX_RCX offset=0
-	SAVE_C_REGS_HELPER \offset, 0, 0, 1
+	SAVE_C_REGS_HELPER \offset, 0, 0, 1, 1
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_R891011
-	SAVE_C_REGS_HELPER 0, 1, 1, 0
+	SAVE_C_REGS_HELPER 0, 1, 1, 0, 0
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_RCX_R891011
-	SAVE_C_REGS_HELPER 0, 1, 0, 0
+	SAVE_C_REGS_HELPER 0, 1, 0, 0, 0
+	.endm
+	.macro SAVE_C_REGS_EXCEPT_RAX_RCX_R11
+	SAVE_C_REGS_HELPER 0, 0, 0, 1, 0
 	.endm

 	.macro SAVE_EXTRA_REGS offset=0
@@ -179,6 +184,9 @@ For 32-bit we have the following conventions - kernel is built with
 	.macro RESTORE_C_REGS_EXCEPT_R11
 	RESTORE_C_REGS_HELPER 1,1,0,1,1
 	.endm
+	.macro RESTORE_C_REGS_EXCEPT_RCX_R11
+	RESTORE_C_REGS_HELPER 1,0,0,1,1
+	.endm
 	.macro RESTORE_RSI_RDI
 	RESTORE_C_REGS_HELPER 0,0,0,0,0
 	.endm
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 5117a2b..324200a 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -121,14 +121,12 @@ ENDPROC(native_usergs_sysret64)
 #endif

 /*
- * C code is not supposed to know about undefined top of stack. Every time
- * a C function with an pt_regs argument is called from the SYSCALL based
- * fast path FIXUP_TOP_OF_STACK is needed.
+ * C code is not supposed to know that the iret frame is not populated.
+ * Every time a C function with an pt_regs argument is called from
+ * the SYSCALL based fast path FIXUP_TOP_OF_STACK is needed.
  * RESTORE_TOP_OF_STACK syncs the syscall state after any possible ptregs
  * manipulation.
  */
-
-	/* %rsp:at FRAMEEND */
 	.macro FIXUP_TOP_OF_STACK tmp offset=0
 	movq PER_CPU_VAR(old_rsp),\tmp
 	movq \tmp,RSP+\offset(%rsp)
@@ -136,15 +134,13 @@ ENDPROC(native_usergs_sysret64)
 	movq $__USER_CS,CS+\offset(%rsp)
 	movq RIP+\offset(%rsp),\tmp  /* get rip */
 	movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
-	movq R11+\offset(%rsp),\tmp  /* get eflags */
-	movq \tmp,EFLAGS+\offset(%rsp)
+	movq EFLAGS+\offset(%rsp),\tmp /* ditto for rflags->r11 */
+	movq \tmp,R11+\offset(%rsp)
 	.endm

 	.macro RESTORE_TOP_OF_STACK tmp offset=0
 	movq RSP+\offset(%rsp),\tmp
 	movq \tmp,PER_CPU_VAR(old_rsp)
-	movq EFLAGS+\offset(%rsp),\tmp
-	movq \tmp,R11+\offset(%rsp)
 	.endm

 /*
@@ -257,9 +253,10 @@ GLOBAL(system_call_after_swapgs)
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	ALLOC_PT_GPREGS_ON_STACK 8	/* +8: space for orig_ax */
-	SAVE_C_REGS_EXCEPT_RAX_RCX
+	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
 	movq	$-ENOSYS,RAX(%rsp)
 	movq_cfi rax,ORIG_RAX
+	movq	%r11,EFLAGS(%rsp)
 	movq	%rcx,RIP(%rsp)
 	CFI_REL_OFFSET rip,RIP
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
@@ -277,7 +274,7 @@ system_call_fastpath:
 	movq %rax,RAX(%rsp)
 /*
  * Syscall return path ending with SYSRET (fast path)
- * Has incomplete stack frame and undefined top of stack.
+ * Has incompletely filled pt_regs, iret frame is also incomplete.
  */
 ret_from_sys_call:
 	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
@@ -291,9 +288,10 @@ ret_from_sys_call:
 	 * sysretq will re-enable interrupts:
 	 */
 	TRACE_IRQS_ON
-	RESTORE_C_REGS_EXCEPT_RCX
-	movq RIP(%rsp),%rcx
+	RESTORE_C_REGS_EXCEPT_RCX_R11
+	movq RIP(%rsp),%rcx
 	CFI_REGISTER rip,rcx
+	movq EFLAGS(%rsp),%r11
 	/*CFI_REGISTER rflags,r11*/
 	movq PER_CPU_VAR(old_rsp), %rsp
 /*
--
1.8.1.4
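The reason pt_regs->flags is a natural home for user R11 is the hardware
register convention of the SYSCALL/SYSRET pair. A minimal sketch of the
architectural behaviour, per the AMD and Intel manuals (MSR names are the
kernel's labels for the MSRs involved):

        syscall         /* hardware: rcx <- next rip, r11 <- rflags,
                         * rip <- MSR_LSTAR, rflags &= ~MSR_SYSCALL_MASK */
        ...
        sysretq         /* hardware: rip <- rcx, rflags <- r11
                         * (with the RF and VM bits forced to 0) */

So the user's RFLAGS already arrives in R11; storing R11 straight into
pt_regs->flags records it where ptrace expects to find it, with no copying.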
* Re: [PATCH 1/4] x86: save r11 into pt_regs->flags on SYSCALL64 fastpath
From: Andy Lutomirski @ 2015-03-09 20:02 UTC
To: Denys Vlasenko
Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
    H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
    Will Drewry, Kees Cook, X86 ML, linux-kernel

On Mon, Mar 9, 2015 at 11:39 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> Before this patch, r11 was saved in pt_regs->r11. This looks natural,
> but requires messy shuffling to/from the iret frame whenever ptrace or
> e.g. sys_iopl wants to modify flags - because that's how this register
> is used by SYSCALL/SYSRET.
>
> This patch saves r11 in pt_regs->flags, and uses that value for the
> SYSRET64 insn. The shuffling is eliminated.
>
> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>
> stub_iopl is no longer needed: pt_regs->flags needs no fixing up.
>
> Testing shows that the syscall fast path is ~54.3 ns both before and
> after the patch (on a 2.7 GHz Sandy Bridge CPU).

Acked-by: Andy Lutomirski <luto@amacapital.net>

--
Andy Lutomirski
AMA Capital Management, LLC
* [tip:x86/asm] x86/asm/entry/64: Save R11 into pt_regs->flags on SYSCALL64 fastpath
From: tip-bot for Denys Vlasenko @ 2015-03-16 12:04 UTC
To: linux-tip-commits

Commit-ID:  29722cd4ef666705b2eda1c3ba44435488e509eb
Gitweb:     http://git.kernel.org/tip/29722cd4ef666705b2eda1c3ba44435488e509eb
Author:     Denys Vlasenko <dvlasenk@redhat.com>
AuthorDate: Mon, 9 Mar 2015 19:39:21 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 10 Mar 2015 13:56:10 +0100

x86/asm/entry/64: Save R11 into pt_regs->flags on SYSCALL64 fastpath

Before this patch, R11 was saved in pt_regs->r11. This looks natural,
but requires messy shuffling to/from the iret frame whenever ptrace or
e.g. sys_iopl() wants to modify flags - because that's how this
register is used by SYSCALL/SYSRET.

This patch saves R11 in pt_regs->flags, and uses that value for the
SYSRET64 instruction. The shuffling is eliminated.

FIXUP/RESTORE_TOP_OF_STACK are simplified.

stub_iopl is no longer needed: pt_regs->flags needs no fixing up.

Testing shows that the syscall fast path is ~54.3 ns both before and
after the patch (on a 2.7 GHz Sandy Bridge CPU).

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1425926364-9526-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/calling.h | 20 ++++++++++++++------
 arch/x86/kernel/entry_64.S     | 24 +++++++++++-------------
 2 files changed, 25 insertions(+), 19 deletions(-)
* [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Denys Vlasenko @ 2015-03-09 18:39 UTC
To: Andy Lutomirski
Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
    Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
    Alexei Starovoitov, Will Drewry, Kees Cook, x86, linux-kernel

PER_CPU(old_rsp) usage is simplified - now it is used only as temporary
storage, and the userspace stack pointer is stored in pt_regs->sp
immediately on syscall entry, instead of being fetched from old_rsp
later, on syscall exit.

Instead of PER_CPU(old_rsp) and task->thread.usersp, C code uses
pt_regs->sp now.

FIXUP/RESTORE_TOP_OF_STACK are simplified.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/compat.h |  2 +-
 arch/x86/include/asm/ptrace.h |  8 ++------
 arch/x86/kernel/entry_64.S    | 18 +++++++-----------
 arch/x86/kernel/perf_regs.c   |  2 +-
 arch/x86/kernel/process_64.c  |  3 +--
 5 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/compat.h b/arch/x86/include/asm/compat.h
index 59c6c40..acdee09 100644
--- a/arch/x86/include/asm/compat.h
+++ b/arch/x86/include/asm/compat.h
@@ -301,7 +301,7 @@ static inline void __user *arch_compat_alloc_user_space(long len)
 		sp = task_pt_regs(current)->sp;
 	} else {
 		/* -128 for the x32 ABI redzone */
-		sp = this_cpu_read(old_rsp) - 128;
+		sp = task_pt_regs(current)->sp - 128;
 	}

 	return (void __user *)round_down(sp - len, 16);
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 4077d96..74bb2e0 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -145,12 +145,8 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
 #endif
 }

-#define current_user_stack_pointer()	this_cpu_read(old_rsp)
-/* ia32 vs. x32 difference */
-#define compat_user_stack_pointer()	\
-	(test_thread_flag(TIF_IA32)	\
-	 ? current_pt_regs()->sp	\
-	 : this_cpu_read(old_rsp))
+#define current_user_stack_pointer()	current_pt_regs()->sp
+#define compat_user_stack_pointer()	current_pt_regs()->sp
 #endif

 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 703ced0..d86788c 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
  * manipulation.
  */
 	.macro FIXUP_TOP_OF_STACK tmp offset=0
-	movq PER_CPU_VAR(old_rsp),\tmp
-	movq \tmp,RSP+\offset(%rsp)
 	movq $__USER_DS,SS+\offset(%rsp)
 	movq $__USER_CS,CS+\offset(%rsp)
 	movq RIP+\offset(%rsp),\tmp  /* get rip */
@@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
 	.endm

 	.macro RESTORE_TOP_OF_STACK tmp offset=0
-	movq RSP+\offset(%rsp),\tmp
-	movq \tmp,PER_CPU_VAR(old_rsp)
+	/* nothing to do */
 	.endm

 /*
@@ -222,9 +219,6 @@ ENDPROC(native_usergs_sysret64)
  * Interrupts are off on entry.
  * Only called from user space.
  *
- * XXX	if we had a free scratch register we could save the RSP into the stack frame
- *      and report it properly in ps. Unfortunately we haven't.
- *
  * When user can change the frames always force IRET. That is because
  * it deals with uncanonical addresses better. SYSRET has trouble
  * with them due to bugs in both AMD and Intel CPUs.
@@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	ALLOC_PT_GPREGS_ON_STACK 8	/* +8: space for orig_ax */
+	movq	%rcx,RIP(%rsp)
+	movq	PER_CPU_VAR(old_rsp),%rcx
+	movq	%r11,EFLAGS(%rsp)
+	movq	%rcx,RSP(%rsp)
+	movq_cfi rax,ORIG_RAX
 	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
 	movq	$-ENOSYS,RAX(%rsp)
-	movq_cfi rax,ORIG_RAX
-	movq	%r11,EFLAGS(%rsp)
-	movq	%rcx,RIP(%rsp)
 	CFI_REL_OFFSET rip,RIP
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
 	jnz tracesys
@@ -293,7 +289,7 @@ ret_from_sys_call:
 	CFI_REGISTER rip,rcx
 	movq EFLAGS(%rsp),%r11
 	/*CFI_REGISTER rflags,r11*/
-	movq PER_CPU_VAR(old_rsp), %rsp
+	movq RSP(%rsp),%rsp
 	/*
 	 * 64bit SYSRET restores rip from rcx,
 	 * rflags from r11 (but RF and VM bits are forced to 0),
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 781861c..02a8720 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -177,7 +177,7 @@ void perf_get_regs_user(struct perf_regs *regs_user,
	 * than just blindly copying user_regs.
	 */
	regs_user->abi = PERF_SAMPLE_REGS_ABI_64;
-	regs_user_copy->sp = this_cpu_read(old_rsp);
+	regs_user_copy->sp = user_regs->sp;
	regs_user_copy->cs = __USER_CS;
	regs_user_copy->ss = __USER_DS;
	regs_user_copy->cx = -1;	/* usually contains garbage */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 1e393d2..e8c124a 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -602,6 +602,5 @@ long sys_arch_prctl(int code, unsigned long addr)

 unsigned long KSTK_ESP(struct task_struct *task)
 {
-	return (test_tsk_thread_flag(task, TIF_IA32)) ?
-			(task_pt_regs(task)->sp) : ((task)->thread.usersp);
+	return task_pt_regs(task)->sp;
 }
--
1.8.1.4
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Andy Lutomirski @ 2015-03-09 20:11 UTC
To: Denys Vlasenko
Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
    H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
    Will Drewry, Kees Cook, X86 ML, linux-kernel

On Mon, Mar 9, 2015 at 11:39 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> PER_CPU(old_rsp) usage is simplified - now it is used only as temporary
> storage, and the userspace stack pointer is stored in pt_regs->sp
> immediately on syscall entry, instead of being fetched from old_rsp
> later, on syscall exit.
>
> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code uses
> pt_regs->sp now.
>
> FIXUP/RESTORE_TOP_OF_STACK are simplified.

Looks correct.

> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>  	 */
>  	ENABLE_INTERRUPTS(CLBR_NONE)
>  	ALLOC_PT_GPREGS_ON_STACK 8	/* +8: space for orig_ax */
> +	movq	%rcx,RIP(%rsp)
> +	movq	PER_CPU_VAR(old_rsp),%rcx
> +	movq	%r11,EFLAGS(%rsp)
> +	movq	%rcx,RSP(%rsp)
> +	movq_cfi rax,ORIG_RAX
>  	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>  	movq	$-ENOSYS,RAX(%rsp)
> -	movq_cfi rax,ORIG_RAX
> -	movq	%r11,EFLAGS(%rsp)
> -	movq	%rcx,RIP(%rsp)

Why the reordering?

--Andy
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Denys Vlasenko @ 2015-03-09 20:32 UTC
To: Andy Lutomirski
Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
    Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
    Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Mon, Mar 9, 2015 at 9:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Why the reordering?

No strong reason.

The iret stack is "above" the rest of pt_regs. This does not matter now,
but when/if we convert to PUSHes for register saving, the pushes which
build the iret frame will have to come before the "save C-clobbered
registers" part, exactly as in this patch.
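A hypothetical sketch of what such a push-built entry could look like
(illustrative only, not part of this series; it assumes old_rsp still holds
the user stack pointer at this point, as in the patch above):

        /* build the iret frame top-down with pushes: */
        pushq   $__USER_DS                      /* pt_regs->ss */
        pushq   PER_CPU_VAR(old_rsp)            /* pt_regs->sp */
        pushq   %r11                            /* pt_regs->flags */
        pushq   $__USER_CS                      /* pt_regs->cs */
        pushq   %rcx                            /* pt_regs->ip */
        pushq   %rax                            /* pt_regs->orig_ax */
        /* ...then push the C-clobbered registers below it... */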
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Andy Lutomirski @ 2015-03-09 20:43 UTC
To: Denys Vlasenko
Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
    Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
    Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Mon, Mar 9, 2015 at 1:32 PM, Denys Vlasenko <vda.linux@googlemail.com> wrote:
> The iret stack is "above" the rest of pt_regs. This does not matter now,
> but when/if we convert to PUSHes for register saving, the pushes which
> build the iret frame will have to come before the "save C-clobbered
> registers" part, exactly as in this patch.

Fair enough.

Acked-by: Andy Lutomirski <luto@amacapital.net>

--
Andy Lutomirski
AMA Capital Management, LLC
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Ingo Molnar @ 2015-03-10 12:51 UTC
To: Denys Vlasenko
Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Borislav Petkov,
    H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
    Will Drewry, Kees Cook, x86, linux-kernel

* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> PER_CPU(old_rsp) usage is simplified - now it is used only as temporary
> storage, and the userspace stack pointer is stored in pt_regs->sp
> immediately on syscall entry, instead of being fetched from old_rsp
> later, on syscall exit.
>
> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code uses
> pt_regs->sp now.
>
> FIXUP/RESTORE_TOP_OF_STACK are simplified.

Just trying to judge the performance impact:

> 	.macro FIXUP_TOP_OF_STACK tmp offset=0
> -	movq PER_CPU_VAR(old_rsp),\tmp
> -	movq \tmp,RSP+\offset(%rsp)
> 	movq $__USER_DS,SS+\offset(%rsp)
> 	movq $__USER_CS,CS+\offset(%rsp)
> 	movq RIP+\offset(%rsp),\tmp  /* get rip */
>
> 	.macro RESTORE_TOP_OF_STACK tmp offset=0
> -	movq RSP+\offset(%rsp),\tmp
> -	movq \tmp,PER_CPU_VAR(old_rsp)
> +	/* nothing to do */
> 	.endm
>
> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
> 	ENABLE_INTERRUPTS(CLBR_NONE)
> 	ALLOC_PT_GPREGS_ON_STACK 8	/* +8: space for orig_ax */
> +	movq	%rcx,RIP(%rsp)
> +	movq	PER_CPU_VAR(old_rsp),%rcx
> +	movq	%r11,EFLAGS(%rsp)
> +	movq	%rcx,RSP(%rsp)
> +	movq_cfi rax,ORIG_RAX
> 	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
> 	movq	$-ENOSYS,RAX(%rsp)
> -	movq_cfi rax,ORIG_RAX
> -	movq	%r11,EFLAGS(%rsp)
> -	movq	%rcx,RIP(%rsp)

So there are now +2 instructions (5 instead of 3) in the system_call
path, but there are -2 instructions in the SYSRETQ path, so combined
it's a wash performance-wise, right?

Thanks,

	Ingo
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Andy Lutomirski @ 2015-03-10 13:10 UTC
To: Ingo Molnar
Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Borislav Petkov,
    H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
    Will Drewry, Kees Cook, X86 ML, linux-kernel

On Tue, Mar 10, 2015 at 5:51 AM, Ingo Molnar <mingo@kernel.org> wrote:
> So there are now +2 instructions (5 instead of 3) in the system_call
> path, but there are -2 instructions in the SYSRETQ path, so combined
> it's a wash performance-wise, right?

I think that the SYSRETQ path is the same number of instructions --
RESTORE_TOP_OF_STACK is only used in somewhat unusual circumstances.

On SYSRETQ, I think we touch one fewer cache line, although that cache
line will be in L1 unless the syscall pushed it out.

The added instructions on entry are probably nearly free, though, as
one of them is a load from a forwarded address and the other is a
store to a hot cacheline.

--Andy
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Denys Vlasenko @ 2015-03-10 13:18 UTC
To: Ingo Molnar
Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Borislav Petkov,
    H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
    Will Drewry, Kees Cook, x86, linux-kernel

On 03/10/2015 01:51 PM, Ingo Molnar wrote:
> So there are now +2 instructions (5 instead of 3) in the system_call
> path, but there are -2 instructions in the SYSRETQ path,

Unfortunately, no. There is only this change in the SYSRETQ path,
which simply changes where we get RSP from:

@@ -293,7 +289,7 @@ ret_from_sys_call:
 	CFI_REGISTER rip,rcx
 	movq EFLAGS(%rsp),%r11
 	/*CFI_REGISTER rflags,r11*/
-	movq PER_CPU_VAR(old_rsp), %rsp
+	movq RSP(%rsp),%rsp
 	/*
 	 * 64bit SYSRET restores rip from rcx,
 	 * rflags from r11 (but RF and VM bits are forced to 0),

Most likely, there is no change in execution speed here. At best, it
is one cycle faster somewhere in the address generation unit, because
for PER_CPU_VAR() address evaluation the GS base is nonzero.

Since this patch does add two extra MOVs, I did benchmark these
patches. They add exactly one cycle to the system call code path on my
Sandy Bridge CPU.
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Andy Lutomirski @ 2015-03-10 13:20 UTC
To: Denys Vlasenko
Cc: Ingo Molnar, Linus Torvalds, Steven Rostedt, Borislav Petkov,
    H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
    Will Drewry, Kees Cook, X86 ML, linux-kernel

On Tue, Mar 10, 2015 at 6:18 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> Since this patch does add two extra MOVs, I did benchmark these
> patches. They add exactly one cycle to the system call code path on my
> Sandy Bridge CPU.

Personally, I'm willing to pay that cycle. It could be a bigger
savings on context switch, and the simplification it enables is pretty
good.

--Andy
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Ingo Molnar @ 2015-03-10 13:26 UTC
To: Andy Lutomirski
Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Borislav Petkov,
    H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
    Will Drewry, Kees Cook, X86 ML, linux-kernel

* Andy Lutomirski <luto@amacapital.net> wrote:

> Personally, I'm willing to pay that cycle. It could be a bigger
> savings on context switch, and the simplification it enables is pretty
> good.

But, but ... context switches are a relatively slow path, compared to
system calls. And I say this with the scheduler maintainer hat on as
well.

So this is not a good bargain IMHO, assuming it's not some _huge_
difference in maintainability - and having an extra percpu field isn't
really much of a problem.

I don't claim that we couldn't in some other situation decide that a
certain type of speedup isn't worth it - but what's the big problem
here? A bit of arithmetic shouldn't be a problem?

Thanks,

	Ingo
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Ingo Molnar @ 2015-03-10 13:21 UTC
To: Denys Vlasenko
Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Borislav Petkov,
    H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
    Will Drewry, Kees Cook, x86, linux-kernel

* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> > So there are now +2 instructions (5 instead of 3) in the
> > system_call path, but there are -2 instructions in the SYSRETQ
> > path,
>
> Unfortunately, no. [...]

So I assumed that it was an equivalent transformation, given that none
of the changelogs spelled out the increase in overhead ...

> [...] There is only this change in the SYSRETQ path, which simply
> changes where we get RSP from:
>
> -	movq PER_CPU_VAR(old_rsp), %rsp
> +	movq RSP(%rsp),%rsp
>
> Most likely, there is no change in execution speed here. At best, it
> is one cycle faster somewhere in the address generation unit, because
> for PER_CPU_VAR() address evaluation the GS base is nonzero.
>
> Since this patch does add two extra MOVs, I did benchmark these
> patches. They add exactly one cycle to the system call code path on my
> Sandy Bridge CPU.

Hm, but that's the wrong direction: we should try to make it faster
and to clean it up - but making it slower without really good reasons
isn't good.

Is 'usersp' really that much of a complication?

Thanks,

	Ingo
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Andy Lutomirski @ 2015-03-10 13:26 UTC
To: Ingo Molnar
Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Borislav Petkov,
    H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
    Will Drewry, Kees Cook, X86 ML, linux-kernel

On Tue, Mar 10, 2015 at 6:21 AM, Ingo Molnar <mingo@kernel.org> wrote:
> Hm, but that's the wrong direction: we should try to make it faster
> and to clean it up - but making it slower without really good reasons
> isn't good.
>
> Is 'usersp' really that much of a complication?

usersp is IMO tolerable. The nasty thing is the FIXUP_TOP_OF_STACK /
RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
killing that off completely. I've still never convinced myself that
there aren't ptrace-related info leaks in there.

Denys, did you ever benchmark what happens if we use push instead of
mov? I bet that we get that cycle back and more, not to mention much
less icache usage.

The reason I think that is that this patch changes us from writing
nothing to the RIP slot to writing something there. If we push our
stack frame instead of moving it into place, we must write to every
slot, and writing the right value is probably no more expensive than
writing, say, zero (the load / forwarding units are unlikely to be
very busy at that point). So this change could actually be a step
toward getting faster.

--Andy
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Denys Vlasenko @ 2015-03-10 14:00 UTC
To: Andy Lutomirski
Cc: Ingo Molnar, Denys Vlasenko, Linus Torvalds, Steven Rostedt,
    Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
    Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Denys, did you ever benchmark what happens if we use push instead of
> mov? I bet that we get that cycle back and more, not to mention much
> less icache usage.

Yes, I did. The push conversion seems to perform the same as the
current, MOV-based code.

The expected win there is that we lose two huge 12-byte insns which
store __USER_CS and __USER_DS in the iret frame.

MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
- it needs a REX prefix
- no sign-extending imm8 form exists for it
- ofs in our case can't fit into 8 bits
- (%rsp) requires a SIB byte

In my tests, each such instruction adds one cycle.

Compare this to PUSH imm8, which is 2 bytes only.
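To illustrate the size difference, a sketch of the two encodings (assuming
__USER_DS = 0x2b and an SS offset of 0xa0 into the frame; the byte sequences
are worked out from the encoding rules above, not taken from the series):

        movq  $__USER_DS,SS(%rsp)  /* 48 c7 84 24 a0 00 00 00 2b 00 00 00
                                    * REX.W, opcode, ModRM, SIB, disp32,
                                    * imm32 = 12 bytes */
        pushq $__USER_DS           /* 6a 2b = 2 bytes */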
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Andy Lutomirski @ 2015-03-10 14:02 UTC
To: Denys Vlasenko
Cc: Ingo Molnar, Denys Vlasenko, Linus Torvalds, Steven Rostedt,
    Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
    Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Tue, Mar 10, 2015 at 7:00 AM, Denys Vlasenko <vda.linux@googlemail.com> wrote:
> Yes, I did. The push conversion seems to perform the same as the
> current, MOV-based code.

Does that mean that using push on top of this patch gets us our cycle
back?

--Andy
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Denys Vlasenko @ 2015-03-10 14:09 UTC
To: Andy Lutomirski
Cc: Ingo Molnar, Denys Vlasenko, Linus Torvalds, Steven Rostedt,
    Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
    Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Tue, Mar 10, 2015 at 3:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Does that mean that using push on top of this patch gets us our cycle
> back?

Maybe. I can't be sure about it.

In general I see a jitter of 1-2, sometimes 3 cycles even when I make
changes which merely change code size (e.g. replacing equivalent
insns). This may be caused by jump targets getting aligned differently
wrt cacheline boundaries: if the second/third/fourth insn after the
current one is not fetched because it did not fit into the cacheline,
then some insn decoders get nothing to chew on.
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Ingo Molnar @ 2015-03-10 13:28 UTC
To: Denys Vlasenko
Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Borislav Petkov,
    H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
    Will Drewry, Kees Cook, x86, linux-kernel

> * Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
> > Unfortunately, no. [...]
>
> So I assumed that it was an equivalent transformation, given that
> none of the changelogs spelled out the increase in overhead ...

Also, for future reference: you need to spell out the performance
costs of fast-path patches very clearly - and that wasn't done here.
That's a big no-no, please don't do it again ...

Thanks,

	Ingo
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
From: Denys Vlasenko @ 2015-03-10 13:50 UTC
To: Ingo Molnar
Cc: Denys Vlasenko, Andy Lutomirski, Linus Torvalds, Steven Rostedt,
    Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
    Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML,
    Linux Kernel Mailing List

On Tue, Mar 10, 2015 at 2:21 PM, Ingo Molnar <mingo@kernel.org> wrote:
> Hm, but that's the wrong direction: we should try to make it faster
> and to clean it up

As you know, the goals of "faster" and "cleaner" are often mutually
exclusive. C'est la vie :(

entry.S seems to lean towards "faster", to the extent where it became
a tangled maze of obscure optimizations - such as the mysterious, and
not at all obvious, 8-byte "safety padding" at the top of the 32-bit
kernel stack. Before Andy stumbled into it, it was not at all obvious
that it is even there; I had to write a test patch to verify it. There
is a long-standing latent bug in the code where this padding is not
accounted for.

> - but making it slower without really good reasons isn't good.

The thinking here is that cleaning up entry.S is a good reason. We
won't do anything which would slow it down by, say, 5%, but one cycle
may be considered an acceptable loss.
* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-10 13:50           ` Denys Vlasenko
@ 2015-03-16  9:44             ` Ingo Molnar
  0 siblings, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2015-03-16 9:44 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Denys Vlasenko, Andy Lutomirski, Linus Torvalds, Steven Rostedt,
      Borislav Petkov, H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
      Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML,
      Linux Kernel Mailing List

* Denys Vlasenko <vda.linux@googlemail.com> wrote:

> > - but making it slower without really good reasons isn't good.
>
> The thinking here is that cleaning up entry.S is a good reason.
>
> We won't do anything which would slow it down by, say, 5%,
> but one cycle may be considered acceptable loss.

Ok, so I've applied this particular cleanup, in the hope that we can
recover those cycles on the now cleaner base. If that doesn't work out
then we can still re-introduce this complication and see its
maintainability in isolation, on a clean base.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 23+ messages in thread
* [tip:x86/asm] x86/asm/entry/64: Save user RSP in pt_regs->sp on SYSCALL64 fastpath
  2015-03-09 18:39 ` [PATCH 3/4] x86: save user rsp in pt_regs->sp " Denys Vlasenko
  2015-03-09 20:11   ` Andy Lutomirski
  2015-03-10 12:51   ` Ingo Molnar
@ 2015-03-16 12:05   ` tip-bot for Denys Vlasenko
  2 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Denys Vlasenko @ 2015-03-16 12:05 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, torvalds, rostedt, ast, tglx, wad, keescook, hpa,
      fweisbec, bp, mingo, dvlasenk, oleg, luto

Commit-ID:  263042e4630a85e856b4a8cd72f28dab33ef4741
Gitweb:     http://git.kernel.org/tip/263042e4630a85e856b4a8cd72f28dab33ef4741
Author:     Denys Vlasenko <dvlasenk@redhat.com>
AuthorDate: Mon, 9 Mar 2015 19:39:23 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 10 Mar 2015 13:56:10 +0100

x86/asm/entry/64: Save user RSP in pt_regs->sp on SYSCALL64 fastpath

Prepare for the removal of 'usersp', by simplifying PER_CPU(old_rsp)
usage:

  - use it only as temp storage

  - store the userspace stack pointer immediately in pt_regs->sp on
    syscall entry, instead of using it later, on syscall exit.

  - change C code to use pt_regs->sp only, instead of
    PER_CPU(old_rsp) and task->thread.usersp.

FIXUP/RESTORE_TOP_OF_STACK are simplified as well.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1425926364-9526-4-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/compat.h |  2 +-
 arch/x86/include/asm/ptrace.h |  8 ++------
 arch/x86/kernel/entry_64.S    | 18 +++++++-----------
 arch/x86/kernel/perf_regs.c   |  2 +-
 arch/x86/kernel/process_64.c  |  3 +--
 5 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/compat.h b/arch/x86/include/asm/compat.h
index 59c6c40..acdee09 100644
--- a/arch/x86/include/asm/compat.h
+++ b/arch/x86/include/asm/compat.h
@@ -301,7 +301,7 @@ static inline void __user *arch_compat_alloc_user_space(long len)
 		sp = task_pt_regs(current)->sp;
 	} else {
 		/* -128 for the x32 ABI redzone */
-		sp = this_cpu_read(old_rsp) - 128;
+		sp = task_pt_regs(current)->sp - 128;
 	}
 
 	return (void __user *)round_down(sp - len, 16);
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 4077d96..74bb2e0 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -145,12 +145,8 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
 #endif
 }
 
-#define current_user_stack_pointer()	this_cpu_read(old_rsp)
-/* ia32 vs. x32 difference */
-#define compat_user_stack_pointer()	\
-	(test_thread_flag(TIF_IA32)	\
-	 ? current_pt_regs()->sp	\
-	 : this_cpu_read(old_rsp))
+#define current_user_stack_pointer()	current_pt_regs()->sp
+#define compat_user_stack_pointer()	current_pt_regs()->sp
 #endif
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 703ced0..d86788c 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
  * manipulation.
  */
 	.macro FIXUP_TOP_OF_STACK tmp offset=0
-	movq PER_CPU_VAR(old_rsp),\tmp
-	movq \tmp,RSP+\offset(%rsp)
 	movq $__USER_DS,SS+\offset(%rsp)
 	movq $__USER_CS,CS+\offset(%rsp)
 	movq RIP+\offset(%rsp),\tmp  /* get rip */
@@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
 	.endm
 
 	.macro RESTORE_TOP_OF_STACK tmp offset=0
-	movq RSP+\offset(%rsp),\tmp
-	movq \tmp,PER_CPU_VAR(old_rsp)
+	/* nothing to do */
 	.endm
 
 /*
@@ -222,9 +219,6 @@ ENDPROC(native_usergs_sysret64)
  * Interrupts are off on entry.
  * Only called from user space.
  *
- * XXX	if we had a free scratch register we could save the RSP into the stack frame
- *	and report it properly in ps. Unfortunately we haven't.
- *
  * When user can change the frames always force IRET. That is because
  * it deals with uncanonical addresses better. SYSRET has trouble
  * with them due to bugs in both AMD and Intel CPUs.
@@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	ALLOC_PT_GPREGS_ON_STACK 8		/* +8: space for orig_ax */
+	movq	%rcx,RIP(%rsp)
+	movq	PER_CPU_VAR(old_rsp),%rcx
+	movq	%r11,EFLAGS(%rsp)
+	movq	%rcx,RSP(%rsp)
+	movq_cfi rax,ORIG_RAX
 	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
 	movq	$-ENOSYS,RAX(%rsp)
-	movq_cfi rax,ORIG_RAX
-	movq	%r11,EFLAGS(%rsp)
-	movq	%rcx,RIP(%rsp)
 	CFI_REL_OFFSET rip,RIP
 	testl	$_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
 	jnz tracesys
@@ -293,7 +289,7 @@ ret_from_sys_call:
 	CFI_REGISTER	rip,rcx
 	movq	EFLAGS(%rsp),%r11
 	/*CFI_REGISTER	rflags,r11*/
-	movq	PER_CPU_VAR(old_rsp), %rsp
+	movq	RSP(%rsp),%rsp
 	/*
 	 * 64bit SYSRET restores rip from rcx,
 	 * rflags from r11 (but RF and VM bits are forced to 0),
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 781861c..02a8720 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -177,7 +177,7 @@ void perf_get_regs_user(struct perf_regs *regs_user,
 	 * than just blindly copying user_regs.
 	 */
 	regs_user->abi = PERF_SAMPLE_REGS_ABI_64;
-	regs_user_copy->sp = this_cpu_read(old_rsp);
+	regs_user_copy->sp = user_regs->sp;
 	regs_user_copy->cs = __USER_CS;
 	regs_user_copy->ss = __USER_DS;
 	regs_user_copy->cx = -1; /* usually contains garbage */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 1e393d2..e8c124a 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -602,6 +602,5 @@ long sys_arch_prctl(int code, unsigned long addr)
 
 unsigned long KSTK_ESP(struct task_struct *task)
 {
-	return (test_tsk_thread_flag(task, TIF_IA32)) ?
-		(task_pt_regs(task)->sp) : ((task)->thread.usersp);
+	return task_pt_regs(task)->sp;
 }

^ permalink raw reply related	[flat|nested] 23+ messages in thread
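[Taken together, the entry_64.S hunks above give the fast path roughly
the following shape. This is a simplified sketch: the comments and the
elided middle are explanatory glosses added here, not part of the
patch. SYSCALL itself delivers the user RIP in %rcx and RFLAGS in %r11;
the user %rsp is assumed to have been stashed in PER_CPU(old_rsp) by
the stack-switch code that runs just before this sequence.]

	movq	%rcx,RIP(%rsp)			# user return address
	movq	PER_CPU_VAR(old_rsp),%rcx	# old_rsp: temp storage only
	movq	%r11,EFLAGS(%rsp)		# user rflags (from patch 1/4)
	movq	%rcx,RSP(%rsp)			# user stack ptr, saved at entry

	# ... syscall dispatch ...

ret_from_sys_call:
	movq	EFLAGS(%rsp),%r11		# rflags for SYSRET
	movq	RSP(%rsp),%rsp			# user stack from pt_regs->sp,
						# no old_rsp, no fixup macros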
* Re: [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling
  2015-03-09 18:39 [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling Denys Vlasenko
  2015-03-09 18:39 ` [PATCH 1/4] x86: save r11 into pt_regs->flags on SYSCALL64 fastpath Denys Vlasenko
  2015-03-09 18:39 ` [PATCH 3/4] x86: save user rsp in pt_regs->sp " Denys Vlasenko
@ 2015-03-10  6:00 ` Ingo Molnar
  2 siblings, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2015-03-10 6:00 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Borislav Petkov,
      H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
      Alexei Starovoitov, Will Drewry, Kees Cook, x86, linux-kernel

* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> These changes make SYSENTER64 code path save flags and user's
> stack pointer in pt_regs->flags and pt_regs->sp, where they belong.
>
> As a result, we can drop stub_iopl() and thread_struct::usersp.
>
> Usage of PER_CPU(old_rsp) is reduced to bare minimum.
>
> FIXUP/RESTORE_TOP_OF_STACK macros are on diet too.
>
> Changes since v1: on Ingo's request, split into 4 patches instead of 2.

Unfortunately I got only 1/4 and 3/4, the other two patches are
missing. They are not in my lkml and spam folders either.

Note that the missing mails, 2/4 and 4/4:

> Denys Vlasenko (4):
>   x86: save r11 into pt_regs->flags on SYSCALL64 fastpath
>   Remove stub_iopl
>   x86: save user %rsp in pt_regs->sp on SYSCALL64 fastpath
>   Remove unused stuff

have rather short titles, so maybe some spam trap ate them?
Prefix them with 'x86: '?

Weird.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 23+ messages in thread
end of thread, other threads:[~2015-03-16 12:06 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-09 18:39 [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling Denys Vlasenko
2015-03-09 18:39 ` [PATCH 1/4] x86: save r11 into pt_regs->flags on SYSCALL64 fastpath Denys Vlasenko
2015-03-09 20:02   ` Andy Lutomirski
2015-03-16 12:04   ` [tip:x86/asm] x86/asm/entry/64: Save R11 into pt_regs->flags " tip-bot for Denys Vlasenko
2015-03-09 18:39 ` [PATCH 3/4] x86: save user rsp in pt_regs->sp " Denys Vlasenko
2015-03-09 20:11   ` Andy Lutomirski
2015-03-09 20:32     ` Denys Vlasenko
2015-03-09 20:43       ` Andy Lutomirski
2015-03-10 12:51   ` Ingo Molnar
2015-03-10 13:10     ` Andy Lutomirski
2015-03-10 13:18       ` Denys Vlasenko
2015-03-10 13:20         ` Andy Lutomirski
2015-03-10 13:26           ` Ingo Molnar
2015-03-10 13:21       ` Ingo Molnar
2015-03-10 13:26         ` Andy Lutomirski
2015-03-10 14:00           ` Denys Vlasenko
2015-03-10 14:02             ` Andy Lutomirski
2015-03-10 14:09               ` Denys Vlasenko
2015-03-10 13:28         ` Ingo Molnar
2015-03-10 13:50         ` Denys Vlasenko
2015-03-16  9:44           ` Ingo Molnar
2015-03-16 12:05   ` [tip:x86/asm] x86/asm/entry/64: Save user RSP in pt_regs->sp " tip-bot for Denys Vlasenko
2015-03-10  6:00 ` [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling Ingo Molnar