LKML Archive on lore.kernel.org
* [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code
@ 2021-08-31 17:50 Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 01/24] x86/traps: Remove stack-protector from traps.c Lai Jiangshan
                   ` (25 more replies)
  0 siblings, 26 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Much of the ASM code in entry_64.S can be rewritten in C, provided it
is written to be non-instrumentable and is called in the right order
with respect to whether CR3/gsbase has been switched to the kernel
CR3/gsbase.

This patchset converts some of it to C code.

Patch 11 converts the non-paranoid entry (the entry for interrupts/
non-IST exceptions/IST exceptions from user mode) to C code.  Patches
1-10 prepare for it and patches 12-13 clean up after it.  Patch 1
might fix a defect.

Patch 22 converts the paranoid entry/exit to C code.  Patches 14-21
prepare for it and patch 23 cleans up after it.

Patch 24 converts to C code the small part of the syscall ASM code that
checks whether sysret can be used to return to userspace.

Some other paths could also be converted to C code, for example the
non-paranoid exit and the syscall entry/exit; the PTI handling for them
could be done in C code.  But that would require pt_regs to be
copied/pushed to the entry stack, which would make the C code less
efficient.

When converting ASM to C, most of the effort went into keeping the two
identical.  Almost no creativity was involved.  The code is kept as
close to the ASM as possible, and no functional change is intended
unless a misunderstanding of the ASM code on my part was involved.  The
functions called by the C entry code have been checked to ensure they
are noinstr or __always_inline.  Some of them have more than one
definition and require extra care from reviewers.  The comments in the
ASM are also copied to the right places in the C code.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Lai Jiangshan (24):
  x86/traps: Remove stack-protector from traps.c
  x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  x86/traps: Move declaration of native_irq_return_iret up
  x86/entry: Expose the address of .Lgs_change to traps.c
  x86/entry: Introduce __entry_text for entry code written in C
  x86/entry: Move PTI_USER_* to arch/x86/include/asm/processor-flags.h
  x86: Mark __native_read_cr3() & native_write_cr3() as __always_inline
  x86/traps: Add C version of SWITCH_TO_KERNEL_CR3 as
    switch_to_kernel_cr3()
  x86/traps: Add fence_swapgs_{user,kernel}_entry()
  x86/traps: Move pt_regs only in fixup_bad_iret()
  x86/entry: Replace the most of asm code of error_entry to C code
  x86/traps: Reconstruct pt_regs on task stack directly in
    fixup_bad_iret()
  x86/traps: Mark sync_regs() and fixup_bad_iret() as static
    __always_inline
  x86/entry: Make paranoid_exit() callable
  x86/entry: Call paranoid_exit() in asm_exc_nmi()
  x86/entry: Use skip_rdi instead of save_ret for PUSH_AND_CLEAR_REGS
  x86/entry: Introduce struct ist_regs
  x86/entry: Add the C version ist_switch_to_kernel_cr3()
  x86/entry: Add the C version ist_restore_cr3()
  x86/entry: Add the C version get_percpu_base()
  x86/entry: Add the C version ist_switch_to_kernel_gsbase()
  x86/entry: Implement and use do_paranoid_entry() and paranoid_exit()
  x86/entry: Remove the unused ASM macros
  x86/syscall/64: Move the checking for sysret to C code

 arch/x86/entry/Makefile                |   5 +-
 arch/x86/entry/calling.h               | 144 +--------
 arch/x86/entry/common.c                |  73 ++++-
 arch/x86/entry/entry_64.S              | 366 ++++-------------------
 arch/x86/{kernel => entry}/traps.c     | 397 +++++++++++++++++++++++--
 arch/x86/include/asm/processor-flags.h |  15 +
 arch/x86/include/asm/special_insns.h   |   4 +-
 arch/x86/include/asm/syscall.h         |   2 +-
 arch/x86/include/asm/traps.h           |  36 ++-
 arch/x86/kernel/Makefile               |   2 +-
 arch/x86/kernel/asm-offsets_64.c       |   2 +
 11 files changed, 554 insertions(+), 492 deletions(-)
 rename arch/x86/{kernel => entry}/traps.c (74%)

-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 01/24] x86/traps: Remove stack-protector from traps.c
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/ Lai Jiangshan
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin, Joerg Roedel, Peter Zijlstra (Intel),
	Daniel Bristot de Oliveira, Brijesh Singh, Andy Shevchenko,
	Arvind Sankar, Chester Lin, Juergen Gross

From: Lai Jiangshan <laijs@linux.alibaba.com>

When stack-protector is enabled, the compiler adds instrumentation code
at the beginning and the end of some functions.  Many functions in
traps.c must be non-instrumentable.  Moreover, the stack-protector code
at the beginning of an affected function accesses the canary, which
might be watched by hardware breakpoints, which would also violate the
non-instrumentable nature of those functions.

So it is better to disable the stack-protector for traps.c.

This also prepares for later patches that move some entry code into
traps.c, parts of which can NOT use the percpu register until gsbase is
properly switched.  And the stack-protector depends on the percpu
register to work.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kernel/Makefile | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 3e625c61f008..21aa164cece2 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -48,6 +48,9 @@ KCOV_INSTRUMENT		:= n
 
 CFLAGS_head$(BITS).o	+= -fno-stack-protector
 
+CFLAGS_REMOVE_traps.o		= -fstack-protector -fstack-protector-strong
+CFLAGS_traps.o			+= -fno-stack-protector
+
 CFLAGS_irq.o := -I $(srctree)/$(src)/../include/asm/trace
 
 obj-y			:= process_$(BITS).o signal.o
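As a rough illustration of why the canary matters here, below is a
hypothetical userspace sketch of what -fstack-protector effectively
emits; the names fake_percpu_canary and protected_function are
illustrative only.  In the kernel the canary lives in percpu storage
reached through gsbase, which is why this code must not run before
gsbase is switched:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative stand-in for the percpu stack canary.  The compiler
 * normally generates the load/compare below invisibly.
 */
static uint64_t fake_percpu_canary = 0xdeadbeefcafef00dULL;

static int protected_function(const char *src)
{
	uint64_t canary = fake_percpu_canary;	/* prologue: load canary */
	char buf[16];

	strncpy(buf, src, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	/* epilogue: verify the canary before returning */
	assert(canary == fake_percpu_canary);
	return (int)strlen(buf);
}
```

A hardware breakpoint watching fake_percpu_canary would fire inside the
prologue, before the function body runs, which is the hazard the commit
message describes.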
-- 
2.19.1.6.gb485710b



* [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 01/24] x86/traps: Remove stack-protector from traps.c Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-09-02  8:09   ` Joerg Roedel
  2021-08-31 17:50 ` [PATCH 03/24] x86/traps: Move declaration of native_irq_return_iret up Lai Jiangshan
                   ` (23 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Joerg Roedel,
	Peter Zijlstra (Intel),
	Daniel Bristot de Oliveira, Brijesh Singh, Andy Shevchenko,
	Arvind Sankar, Chester Lin, Juergen Gross

From: Lai Jiangshan <laijs@linux.alibaba.com>

traps.c handles work for entries.  It is very close to the code in
arch/x86/entry/, so move traps.c to arch/x86/entry/.

This also prepares for later patches that implement C versions of the
entry code in traps.c.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/Makefile            | 5 ++++-
 arch/x86/{kernel => entry}/traps.c | 0
 arch/x86/kernel/Makefile           | 5 +----
 3 files changed, 5 insertions(+), 5 deletions(-)
 rename arch/x86/{kernel => entry}/traps.c (100%)

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 7fec5dcf6438..4e26248cea33 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -11,8 +11,11 @@ CFLAGS_REMOVE_common.o		= $(CC_FLAGS_FTRACE)
 
 CFLAGS_common.o			+= -fno-stack-protector
 
+CFLAGS_REMOVE_traps.o		= -fstack-protector -fstack-protector-strong
+CFLAGS_traps.o			+= -fno-stack-protector
+
 obj-y				:= entry_$(BITS).o thunk_$(BITS).o syscall_$(BITS).o
-obj-y				+= common.o
+obj-y				+= common.o traps.o
 
 obj-y				+= vdso/
 obj-y				+= vsyscall/
diff --git a/arch/x86/kernel/traps.c b/arch/x86/entry/traps.c
similarity index 100%
rename from arch/x86/kernel/traps.c
rename to arch/x86/entry/traps.c
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 21aa164cece2..5ba80522f5da 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -48,14 +48,11 @@ KCOV_INSTRUMENT		:= n
 
 CFLAGS_head$(BITS).o	+= -fno-stack-protector
 
-CFLAGS_REMOVE_traps.o		= -fstack-protector -fstack-protector-strong
-CFLAGS_traps.o			+= -fno-stack-protector
-
 CFLAGS_irq.o := -I $(srctree)/$(src)/../include/asm/trace
 
 obj-y			:= process_$(BITS).o signal.o
 obj-$(CONFIG_COMPAT)	+= signal_compat.o
-obj-y			+= traps.o idt.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
+obj-y			+= idt.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
 obj-y			+= time.o ioport.o dumpstack.o nmi.o
 obj-$(CONFIG_MODIFY_LDT_SYSCALL)	+= ldt.o
 obj-y			+= setup.o x86_init.o i8259.o irqinit.o
-- 
2.19.1.6.gb485710b



* [PATCH 03/24] x86/traps: Move declaration of native_irq_return_iret up
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 01/24] x86/traps: Remove stack-protector from traps.c Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/ Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 04/24] x86/entry: Expose the address of .Lgs_change to traps.c Lai Jiangshan
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

The declaration of native_irq_return_iret is currently used only in
exc_double_fault().  But it will be used elsewhere later, so move the
declaration up in preparation.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/traps.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index a58800973aed..7869343473b7 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -65,6 +65,9 @@
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
 #include <asm/proto.h>
+
+extern unsigned char native_irq_return_iret[];
+
 #else
 #include <asm/processor-flags.h>
 #include <asm/setup.h>
@@ -356,8 +359,6 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 #endif
 
 #ifdef CONFIG_X86_ESPFIX64
-	extern unsigned char native_irq_return_iret[];
-
 	/*
 	 * If IRET takes a non-IST fault on the espfix64 stack, then we
 	 * end up promoting it to a doublefault.  In that case, take
-- 
2.19.1.6.gb485710b



* [PATCH 04/24] x86/entry: Expose the address of .Lgs_change to traps.c
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (2 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 03/24] x86/traps: Move declaration of native_irq_return_iret up Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-09-02  9:14   ` Peter Zijlstra
  2021-08-31 17:50 ` [PATCH 05/24] x86/entry: Introduce __entry_text for entry code written in C Lai Jiangshan
                   ` (21 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

The address of .Lgs_change will be used in traps.c in a later patch
when some entry code is implemented there, so expose it to traps.c in
preparation.

The label .Lgs_change is still needed in the ASM code for the extable.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/entry_64.S | 3 ++-
 arch/x86/entry/traps.c    | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e38a4cf795d9..9164e85b36b8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -729,6 +729,7 @@ _ASM_NOKPROBE(common_interrupt_return)
 SYM_FUNC_START(asm_load_gs_index)
 	FRAME_BEGIN
 	swapgs
+SYM_INNER_LABEL(asm_load_gs_index_gs_change, SYM_L_GLOBAL)
 .Lgs_change:
 	movl	%edi, %gs
 2:	ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
@@ -1011,7 +1012,7 @@ SYM_CODE_START_LOCAL(error_entry)
 	movl	%ecx, %eax			/* zero extend */
 	cmpq	%rax, RIP+8(%rsp)
 	je	.Lbstep_iret
-	cmpq	$.Lgs_change, RIP+8(%rsp)
+	cmpq	$asm_load_gs_index_gs_change, RIP+8(%rsp)
 	jne	.Lerror_entry_done_lfence
 
 	/*
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index 7869343473b7..f71db7934f90 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -67,6 +67,7 @@
 #include <asm/proto.h>
 
 extern unsigned char native_irq_return_iret[];
+extern unsigned char asm_load_gs_index_gs_change[];
 
 #else
 #include <asm/processor-flags.h>
-- 
2.19.1.6.gb485710b



* [PATCH 05/24] x86/entry: Introduce __entry_text for entry code written in C
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (3 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 04/24] x86/entry: Expose the address of .Lgs_change to traps.c Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 19:34   ` Peter Zijlstra
  2021-08-31 17:50 ` [PATCH 06/24] x86/entry: Move PTI_USER_* to arch/x86/include/asm/processor-flags.h Lai Jiangshan
                   ` (20 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Some entry code will be implemented in traps.c.  __entry_text is needed
to place it in the .entry.text section.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/traps.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index f71db7934f90..ef07ae5fb3a0 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -69,6 +69,11 @@
 extern unsigned char native_irq_return_iret[];
 extern unsigned char asm_load_gs_index_gs_change[];
 
+/* Entry code written in C. Must be only used in traps.c */
+#define __entry_text							\
+	noinline notrace __attribute((__section__(".entry.text")))	\
+	__no_kcsan __no_sanitize_address __no_profile __no_sanitize_coverage
+
 #else
 #include <asm/processor-flags.h>
 #include <asm/setup.h>
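A hypothetical userspace sketch of the section-placement part of
__entry_text (the __entry_text_demo and demo_entry names are
illustrative; the real macro additionally disables tracing, KCSAN,
KASAN, profiling and coverage instrumentation):

```c
#include <assert.h>

/*
 * Place the function in a dedicated ELF section, analogous to how
 * __entry_text puts C entry code into .entry.text.  The section name
 * ".entry.text.demo" is illustrative.
 */
#define __entry_text_demo \
	__attribute__((noinline, section(".entry.text.demo")))

static __entry_text_demo int demo_entry(int vector)
{
	/* lives in section ".entry.text.demo" thanks to the attribute */
	return vector + 1;
}
```

The linker script is what ultimately gathers such sections into the
contiguous entry-text region the kernel expects.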
-- 
2.19.1.6.gb485710b



* [PATCH 06/24] x86/entry: Move PTI_USER_* to arch/x86/include/asm/processor-flags.h
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (4 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 05/24] x86/entry: Introduce __entry_text for entry code written in C Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 07/24] x86: Mark __native_read_cr3() & native_write_cr3() as __always_inline Lai Jiangshan
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

These constants will also be used in traps.c, so move them to
arch/x86/include/asm/processor-flags.h, which already has the related
X86_CR3_PTI_PCID_USER_BIT defined in it.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/calling.h               | 10 ----------
 arch/x86/include/asm/processor-flags.h | 15 +++++++++++++++
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index a4c061fb7c6e..996b041e92d2 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -149,16 +149,6 @@ For 32-bit we have the following conventions - kernel is built with
 
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
 
-/*
- * PAGE_TABLE_ISOLATION PGDs are 8k.  Flip bit 12 to switch between the two
- * halves:
- */
-#define PTI_USER_PGTABLE_BIT		PAGE_SHIFT
-#define PTI_USER_PGTABLE_MASK		(1 << PTI_USER_PGTABLE_BIT)
-#define PTI_USER_PCID_BIT		X86_CR3_PTI_PCID_USER_BIT
-#define PTI_USER_PCID_MASK		(1 << PTI_USER_PCID_BIT)
-#define PTI_USER_PGTABLE_AND_PCID_MASK  (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK)
-
 .macro SET_NOFLUSH_BIT	reg:req
 	bts	$X86_CR3_PCID_NOFLUSH_BIT, \reg
 .endm
diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 02c2cbda4a74..4dd2fbbc861a 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -4,6 +4,7 @@
 
 #include <uapi/asm/processor-flags.h>
 #include <linux/mem_encrypt.h>
+#include <asm/page_types.h>
 
 #ifdef CONFIG_VM86
 #define X86_VM_MASK	X86_EFLAGS_VM
@@ -50,7 +51,21 @@
 #endif
 
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
+
 # define X86_CR3_PTI_PCID_USER_BIT	11
+
+#ifdef CONFIG_X86_64
+/*
+ * PAGE_TABLE_ISOLATION PGDs are 8k.  Flip bit 12 to switch between the two
+ * halves:
+ */
+#define PTI_USER_PGTABLE_BIT		PAGE_SHIFT
+#define PTI_USER_PGTABLE_MASK		(1 << PTI_USER_PGTABLE_BIT)
+#define PTI_USER_PCID_BIT		X86_CR3_PTI_PCID_USER_BIT
+#define PTI_USER_PCID_MASK		(1 << PTI_USER_PCID_BIT)
+#define PTI_USER_PGTABLE_AND_PCID_MASK  (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK)
+#endif
+
 #endif
 
 #endif /* _ASM_X86_PROCESSOR_FLAGS_H */
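The mask arithmetic being moved can be sketched in userspace as follows
(the *_DEMO names and demo_kernel_cr3() are illustrative, assuming
PAGE_SHIFT == 12 as on x86):

```c
#include <assert.h>
#include <stdint.h>

/*
 * With PTI the PGD is 8k: bit 12 (PAGE_SHIFT) selects the user half of
 * the page tables, and bit 11 (X86_CR3_PTI_PCID_USER_BIT) selects the
 * user PCID.
 */
#define PAGE_SHIFT_DEMO			12
#define X86_CR3_PTI_PCID_USER_BIT_DEMO	11

#define PTI_USER_PGTABLE_BIT		PAGE_SHIFT_DEMO
#define PTI_USER_PGTABLE_MASK		(1UL << PTI_USER_PGTABLE_BIT)
#define PTI_USER_PCID_BIT		X86_CR3_PTI_PCID_USER_BIT_DEMO
#define PTI_USER_PCID_MASK		(1UL << PTI_USER_PCID_BIT)
#define PTI_USER_PGTABLE_AND_PCID_MASK	\
	(PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK)

/* Map a user CR3 value back to the corresponding kernel CR3 value. */
static uint64_t demo_kernel_cr3(uint64_t user_cr3)
{
	return user_cr3 & ~(uint64_t)PTI_USER_PGTABLE_AND_PCID_MASK;
}
```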
-- 
2.19.1.6.gb485710b



* [PATCH 07/24] x86: Mark __native_read_cr3() & native_write_cr3() as __always_inline
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (5 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 06/24] x86/entry: Move PTI_USER_* to arch/x86/include/asm/processor-flags.h Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 08/24] x86/traps: Add C version of SWITCH_TO_KERNEL_CR3 as switch_to_kernel_cr3() Lai Jiangshan
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin, Dave Jiang, Ben Widawsky, Dan Williams,
	Tony Luck, Peter Zijlstra, Arvind Sankar

From: Lai Jiangshan <laijs@linux.alibaba.com>

__native_read_cr3() & native_write_cr3() need to be guaranteed noinstr.

This prepares for later patches that implement entry code in traps.c.
Some of that code needs to handle KPTI and has to read/write CR3.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/include/asm/special_insns.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index f3fbb84ff8a7..058995bb153c 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -42,14 +42,14 @@ static __always_inline void native_write_cr2(unsigned long val)
 	asm volatile("mov %0,%%cr2": : "r" (val) : "memory");
 }
 
-static inline unsigned long __native_read_cr3(void)
+static __always_inline unsigned long __native_read_cr3(void)
 {
 	unsigned long val;
 	asm volatile("mov %%cr3,%0\n\t" : "=r" (val) : __FORCE_ORDER);
 	return val;
 }
 
-static inline void native_write_cr3(unsigned long val)
+static __always_inline void native_write_cr3(unsigned long val)
 {
 	asm volatile("mov %0,%%cr3": : "r" (val) : "memory");
 }
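A hypothetical userspace sketch of why __always_inline matters to
noinstr callers (the demo_* names are illustrative; a real
`mov %cr3` needs ring 0, so a constant stands in for the register
read):

```c
#include <assert.h>

/*
 * gcc's always_inline attribute forces inlining even at -O0.  That is
 * what noinstr entry code relies on: an out-of-line copy of a helper
 * could be traced or instrumented, defeating noinstr.
 */
#define demo_always_inline inline __attribute__((always_inline))

static demo_always_inline unsigned long demo_read_reg(void)
{
	/* illustrative stand-in for reading CR3 */
	return 0x1234UL;
}

static unsigned long demo_noinstr_caller(void)
{
	/* the call below is guaranteed to be folded into this body */
	return demo_read_reg() + 1;
}
```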
-- 
2.19.1.6.gb485710b



* [PATCH 08/24] x86/traps: Add C version of SWITCH_TO_KERNEL_CR3 as switch_to_kernel_cr3()
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (6 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 07/24] x86: Mark __native_read_cr3() & native_write_cr3() as __always_inline Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 09/24] x86/traps: Add fence_swapgs_{user,kernel}_entry() Lai Jiangshan
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

The C version switch_to_kernel_cr3() implements SWITCH_TO_KERNEL_CR3().

No functional difference intended.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/traps.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index ef07ae5fb3a0..9b7d0f15402e 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -765,6 +765,30 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
 	BUG_ON(!user_mode(&new_stack->regs));
 	return new_stack;
 }
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+static __always_inline void pti_switch_to_kernel_cr3(unsigned long user_cr3)
+{
+	/*
+	 * Clear PCID and "PAGE_TABLE_ISOLATION bit", point CR3
+	 * at kernel pagetables:
+	 */
+	unsigned long cr3 = user_cr3 & ~PTI_USER_PGTABLE_AND_PCID_MASK;
+
+	if (static_cpu_has(X86_FEATURE_PCID))
+		cr3 |= X86_CR3_PCID_NOFLUSH;
+
+	native_write_cr3(cr3);
+}
+
+static __always_inline void switch_to_kernel_cr3(void)
+{
+	if (static_cpu_has(X86_FEATURE_PTI))
+		pti_switch_to_kernel_cr3(__native_read_cr3());
+}
+#else
+static __always_inline void switch_to_kernel_cr3(void) {}
+#endif
 #endif
 
 static bool is_sysenter_singlestep(struct pt_regs *regs)
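The CR3 value computed above can be sketched in userspace as follows
(the DEMO_* names and demo_compute_kernel_cr3() are illustrative;
X86_CR3_PCID_NOFLUSH is bit 63):

```c
#include <assert.h>
#include <stdint.h>

/* Bits 12/11 select the user page-table half and the user PCID. */
#define DEMO_PTI_USER_PGTABLE_AND_PCID_MASK	((1UL << 12) | (1UL << 11))
/* Bit 63 asks the CPU not to flush the TLB on the CR3 write. */
#define DEMO_X86_CR3_PCID_NOFLUSH		(1UL << 63)

static uint64_t demo_compute_kernel_cr3(uint64_t user_cr3, int has_pcid)
{
	/* Clear PCID and page-table-half bits: point at kernel tables. */
	uint64_t cr3 = user_cr3 & ~DEMO_PTI_USER_PGTABLE_AND_PCID_MASK;

	/* With PCID available, avoid an unnecessary TLB flush. */
	if (has_pcid)
		cr3 |= DEMO_X86_CR3_PCID_NOFLUSH;
	return cr3;
}
```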
-- 
2.19.1.6.gb485710b



* [PATCH 09/24] x86/traps: Add fence_swapgs_{user,kernel}_entry()
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (7 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 08/24] x86/traps: Add C version of SWITCH_TO_KERNEL_CR3 as switch_to_kernel_cr3() Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-09-02  9:25   ` Peter Zijlstra
  2021-08-31 17:50 ` [PATCH 10/24] x86/traps: Move pt_regs only in fixup_bad_iret() Lai Jiangshan
                   ` (16 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

fence_swapgs_{user,kernel}_entry() in traps.c are equivalent to the
ASM macros FENCE_SWAPGS_{USER,KERNEL}_ENTRY.

fence_swapgs_user_entry is used in the user entry swapgs code path,
to prevent a speculative swapgs when coming from kernel space.

fence_swapgs_kernel_entry is used in the kernel entry non-swapgs code
path, to prevent the swapgs from getting speculatively skipped when
coming from user space.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/traps.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index 9b7d0f15402e..efec3b4eaa5f 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -789,6 +789,26 @@ static __always_inline void switch_to_kernel_cr3(void)
 #else
 static __always_inline void switch_to_kernel_cr3(void) {}
 #endif
+
+/*
+ * Mitigate Spectre v1 for conditional swapgs code paths.
+ *
+ * fence_swapgs_user_entry is used in the user entry swapgs code path, to
+ * prevent a speculative swapgs when coming from kernel space.
+ *
+ * fence_swapgs_kernel_entry is used in the kernel entry non-swapgs code path,
+ * to prevent the swapgs from getting speculatively skipped when coming from
+ * user space.
+ */
+static __always_inline void fence_swapgs_user_entry(void)
+{
+	alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_USER);
+}
+
+static __always_inline void fence_swapgs_kernel_entry(void)
+{
+	alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL);
+}
 #endif
 
 static bool is_sysenter_singlestep(struct pt_regs *regs)
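As a conceptual sketch only: alternative() patches instruction bytes in
place at boot, but its effect can be modeled in userspace with a
function pointer chosen once from a feature flag (all demo_* names are
illustrative, not the kernel's mechanism):

```c
#include <assert.h>

static int lfence_executed;

/* The two "alternatives": a nop, or a serializing fence. */
static void demo_nop(void)   { }
static void demo_fence(void) { lfence_executed = 1; /* would be lfence */ }

/* Default before "patching", analogous to the unpatched nop. */
static void (*fence_swapgs_user_entry_demo)(void) = demo_nop;

/*
 * Stand-in for apply_alternatives(): pick the variant once, based on a
 * CPU feature bit (X86_FEATURE_FENCE_SWAPGS_USER in the real code).
 */
static void demo_apply_alternatives(int cpu_needs_fence)
{
	fence_swapgs_user_entry_demo = cpu_needs_fence ? demo_fence
						       : demo_nop;
}
```

After "patching", every call site runs the selected variant with no
runtime branch, which is the property the real in-place patching
provides.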
-- 
2.19.1.6.gb485710b



* [PATCH 10/24] x86/traps: Move pt_regs only in fixup_bad_iret()
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (8 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 09/24] x86/traps: Add fence_swapgs_{user,kernel}_entry() Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 11/24] x86/entry: Replace the most of asm code of error_entry to C code Lai Jiangshan
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Joerg Roedel, Youquan Song,
	Tony Luck

From: Lai Jiangshan <laijs@linux.alibaba.com>

Make fixup_bad_iret() work like sync_regs(), which doesn't move the
return address of error_entry().

This prepares for a later patch that implements the body of
error_entry() in C code.  fixup_bad_iret() can't handle the return
address when it is called from C code.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/entry_64.S    |  5 ++++-
 arch/x86/entry/traps.c       | 17 ++++++-----------
 arch/x86/include/asm/traps.h |  2 +-
 3 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9164e85b36b8..42d2918f5646 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1042,9 +1042,12 @@ SYM_CODE_START_LOCAL(error_entry)
 	 * Pretend that the exception came from user mode: set up pt_regs
 	 * as if we faulted immediately after IRET.
 	 */
-	mov	%rsp, %rdi
+	popq	%r12				/* save return addr in %r12 */
+	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
 	call	fixup_bad_iret
 	mov	%rax, %rsp
+	ENCODE_FRAME_POINTER
+	pushq	%r12
 	jmp	.Lerror_entry_from_usermode_after_swapgs
 SYM_CODE_END(error_entry)
 
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index efec3b4eaa5f..c718663e57ef 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -734,13 +734,8 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
 }
 #endif
 
-struct bad_iret_stack {
-	void *error_entry_ret;
-	struct pt_regs regs;
-};
-
 asmlinkage __visible noinstr
-struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
+struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs)
 {
 	/*
 	 * This is called from entry_64.S early in handling a fault
@@ -750,19 +745,19 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
 	 * just below the IRET frame) and we want to pretend that the
 	 * exception came from the IRET target.
 	 */
-	struct bad_iret_stack tmp, *new_stack =
-		(struct bad_iret_stack *)__this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
+	struct pt_regs tmp, *new_stack =
+		(struct pt_regs *)__this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
 
 	/* Copy the IRET target to the temporary storage. */
-	__memcpy(&tmp.regs.ip, (void *)s->regs.sp, 5*8);
+	__memcpy(&tmp.ip, (void *)bad_regs->sp, 5*8);
 
 	/* Copy the remainder of the stack from the current stack. */
-	__memcpy(&tmp, s, offsetof(struct bad_iret_stack, regs.ip));
+	__memcpy(&tmp, bad_regs, offsetof(struct pt_regs, ip));
 
 	/* Update the entry stack */
 	__memcpy(new_stack, &tmp, sizeof(tmp));
 
-	BUG_ON(!user_mode(&new_stack->regs));
+	BUG_ON(!user_mode(new_stack));
 	return new_stack;
 }
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 7f7200021bd1..e9e3801391d5 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -13,7 +13,7 @@
 #ifdef CONFIG_X86_64
 asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
 asmlinkage __visible notrace
-struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s);
+struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs);
 void __init trap_init(void);
 asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
 #endif
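The two-step copy in fixup_bad_iret() can be sketched in userspace with
a minimal pt_regs stand-in (struct demo_pt_regs and the separate
real_iret_frame argument are illustrative simplifications; the kernel
reads the IRET frame from bad_regs->sp):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Minimal pt_regs stand-in: two general registers, then the 5-word
 * hardware IRET frame (ip, cs, flags, sp, ss).
 */
struct demo_pt_regs {
	uint64_t di, si;			/* stands in for r15..rdi */
	uint64_t ip, cs, flags, sp, ss;		/* hardware IRET frame */
};

static struct demo_pt_regs entry_stack_regs;	/* simulated sp0 slot */

static struct demo_pt_regs *
demo_fixup_bad_iret(struct demo_pt_regs *bad_regs,
		    const uint64_t *real_iret_frame)
{
	struct demo_pt_regs tmp, *new_regs = &entry_stack_regs;

	/* Copy the real IRET target (5 words) to temporary storage. */
	memcpy(&tmp.ip, real_iret_frame, 5 * 8);

	/* Copy the remainder from the current (bad) regs. */
	memcpy(&tmp, bad_regs, offsetof(struct demo_pt_regs, ip));

	/* Publish the assembled frame onto the simulated entry stack. */
	memcpy(new_regs, &tmp, sizeof(tmp));
	return new_regs;
}
```

The temporary is needed because the source and destination regions can
overlap on the real stacks, just as in the kernel version.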
-- 
2.19.1.6.gb485710b



* [PATCH 11/24] x86/entry: Replace the most of asm code of error_entry to C code
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (9 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 10/24] x86/traps: Move pt_regs only in fixup_bad_iret() Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-09-02 10:16   ` Peter Zijlstra
  2021-08-31 17:50 ` [PATCH 12/24] x86/traps: Reconstruct pt_regs on task stack directly in fixup_bad_iret() Lai Jiangshan
                   ` (14 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Joerg Roedel, Youquan Song,
	Tony Luck

From: Lai Jiangshan <laijs@linux.alibaba.com>

All the needed facilities are now in place in traps.c, so the major
body of error_entry() can be implemented in C in traps.c.  The C
version is generally more readable and easier to update/improve.

No functional change intended.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/entry_64.S    | 71 +--------------------------------
 arch/x86/entry/traps.c       | 76 ++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/traps.h |  1 +
 3 files changed, 78 insertions(+), 70 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 42d2918f5646..bc9e2f5ad370 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -972,83 +972,14 @@ SYM_CODE_START_LOCAL(error_entry)
 	cld
 	PUSH_AND_CLEAR_REGS save_ret=1
 	ENCODE_FRAME_POINTER 8
-	testb	$3, CS+8(%rsp)
-	jz	.Lerror_kernelspace
 
-	/*
-	 * We entered from user mode or we're pretending to have entered
-	 * from user mode due to an IRET fault.
-	 */
-	SWAPGS
-	FENCE_SWAPGS_USER_ENTRY
-	/* We have user CR3.  Change to kernel CR3. */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
-
-.Lerror_entry_from_usermode_after_swapgs:
-	/* Put us onto the real thread stack. */
 	popq	%r12				/* save return addr in %r12 */
 	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
-	call	sync_regs
+	call	do_error_entry
 	movq	%rax, %rsp			/* switch stack */
 	ENCODE_FRAME_POINTER
 	pushq	%r12
 	ret
-
-.Lerror_entry_done_lfence:
-	FENCE_SWAPGS_KERNEL_ENTRY
-.Lerror_entry_done:
-	ret
-
-	/*
-	 * There are two places in the kernel that can potentially fault with
-	 * usergs. Handle them here.  B stepping K8s sometimes report a
-	 * truncated RIP for IRET exceptions returning to compat mode. Check
-	 * for these here too.
-	 */
-.Lerror_kernelspace:
-	leaq	native_irq_return_iret(%rip), %rcx
-	cmpq	%rcx, RIP+8(%rsp)
-	je	.Lerror_bad_iret
-	movl	%ecx, %eax			/* zero extend */
-	cmpq	%rax, RIP+8(%rsp)
-	je	.Lbstep_iret
-	cmpq	$asm_load_gs_index_gs_change, RIP+8(%rsp)
-	jne	.Lerror_entry_done_lfence
-
-	/*
-	 * hack: .Lgs_change can fail with user gsbase.  If this happens, fix up
-	 * gsbase and proceed.  We'll fix up the exception and land in
-	 * .Lgs_change's error handler with kernel gsbase.
-	 */
-	SWAPGS
-	FENCE_SWAPGS_USER_ENTRY
-	jmp .Lerror_entry_done
-
-.Lbstep_iret:
-	/* Fix truncated RIP */
-	movq	%rcx, RIP+8(%rsp)
-	/* fall through */
-
-.Lerror_bad_iret:
-	/*
-	 * We came from an IRET to user mode, so we have user
-	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
-	 */
-	SWAPGS
-	FENCE_SWAPGS_USER_ENTRY
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
-
-	/*
-	 * Pretend that the exception came from user mode: set up pt_regs
-	 * as if we faulted immediately after IRET.
-	 */
-	popq	%r12				/* save return addr in %12 */
-	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
-	call	fixup_bad_iret
-	mov	%rax, %rsp
-	ENCODE_FRAME_POINTER
-	pushq	%r12
-	jmp	.Lerror_entry_from_usermode_after_swapgs
 SYM_CODE_END(error_entry)
 
 SYM_CODE_START_LOCAL(error_return)
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index c718663e57ef..b8fdf6a9682f 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -804,6 +804,82 @@ static __always_inline void fence_swapgs_kernel_entry(void)
 {
 	alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL);
 }
+
+/*
+ * Put pt_regs onto the task stack and switch GS and CR3 if needed.
+ * The actual stack switch is done in entry_64.S.
+ *
+ * Be careful: this function might still be running with the user CR3
+ * and user GS base at its start.
+ */
+asmlinkage __visible __entry_text
+struct pt_regs *do_error_entry(struct pt_regs *eregs)
+{
+	unsigned long iret_ip = (unsigned long)native_irq_return_iret;
+
+	/*
+	 * On XENPV, we are already on the task stack, and faults at
+	 * native_irq_return_iret and asm_load_gs_index_gs_change() are
+	 * impossible since XENPV uses its own pvops for iret and
+	 * load_gs_index.  So we can return directly.
+	 */
+	if (static_cpu_has(X86_FEATURE_XENPV))
+		return eregs;
+
+	if (user_mode(eregs)) {
+		/*
+		 * We entered from user mode.
+		 * Switch to kernel gsbase and CR3.
+		 */
+		native_swapgs();
+		fence_swapgs_user_entry();
+		switch_to_kernel_cr3();
+
+		/* Put pt_regs onto the task stack. */
+		return sync_regs(eregs);
+	}
+
+	/*
+	 * There are two places in the kernel that can potentially fault with
+	 * usergs. Handle them here.  B stepping K8s sometimes report a
+	 * truncated RIP for IRET exceptions returning to compat mode. Check
+	 * for these here too.
+	 */
+	if ((eregs->ip == iret_ip) || (eregs->ip == (unsigned int)iret_ip)) {
+		eregs->ip = iret_ip; /* Fix truncated RIP */
+
+		/*
+		 * We came from an IRET to user mode, so we have user
+		 * gsbase and CR3.  Switch to kernel gsbase and CR3:
+		 */
+		native_swapgs();
+		fence_swapgs_user_entry();
+		switch_to_kernel_cr3();
+
+		/*
+		 * Pretend that the exception came from user mode: set up
+		 * pt_regs as if we faulted immediately after IRET and put
+		 * pt_regs onto the real task stack.
+		 */
+		return sync_regs(fixup_bad_iret(eregs));
+	}
+
+	/*
+	 * Hack: asm_load_gs_index_gs_change can fail with user gsbase.
+	 * If this happens, fix up gsbase and proceed.  We'll fix up the
+	 * exception and land in asm_load_gs_index_gs_change's error
+	 * handler with kernel gsbase.
+	 */
+	if (eregs->ip == (unsigned long)asm_load_gs_index_gs_change) {
+		native_swapgs();
+		fence_swapgs_user_entry();
+	} else {
+		fence_swapgs_kernel_entry();
+	}
+
+	/* Enter from kernel, don't move pt_regs */
+	return eregs;
+}
 #endif
 
 static bool is_sysenter_singlestep(struct pt_regs *regs)
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index e9e3801391d5..7b51a8081ae4 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -14,6 +14,7 @@
 asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
 asmlinkage __visible notrace
 struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs);
+asmlinkage __visible notrace struct pt_regs *do_error_entry(struct pt_regs *eregs);
 void __init trap_init(void);
 asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
 #endif
-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 12/24] x86/traps: Reconstruct pt_regs on task stack directly in fixup_bad_iret()
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (10 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 11/24] x86/entry: Replace the most of asm code of error_entry to C code Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 13/24] x86/traps: Mark sync_regs() and fixup_bad_iret() as static __always_inline Lai Jiangshan
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

In the current code for handling a bad iret, pt_regs is moved several times:
	Reconstruct pt_regs from bad_regs into a tmp
	Copy the tmp pt_regs onto the entry stack
	Copy the pt_regs on the entry stack to the task stack in sync_regs()

This patch directly reconstructs pt_regs from bad_regs onto the task stack.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/traps.c | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index b8fdf6a9682f..ab9866b650e7 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -740,25 +740,28 @@ struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs)
 	/*
 	 * This is called from entry_64.S early in handling a fault
 	 * caused by a bad iret to user mode.  To handle the fault
-	 * correctly, we want to move our stack frame to where it would
-	 * be had we entered directly on the entry stack (rather than
-	 * just below the IRET frame) and we want to pretend that the
-	 * exception came from the IRET target.
+	 * correctly, we want to pretend that the exception came from
+	 * the IRET target (userspace), reconstruct the original
+	 * pt_regs from the bad regs directly onto the task stack, and
+	 * switch to the task stack to handle it.  The actual stack
+	 * switch is done in entry_64.S.
 	 */
-	struct pt_regs tmp, *new_stack =
-		(struct pt_regs *)__this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
-
-	/* Copy the IRET target to the temporary storage. */
-	__memcpy(&tmp.ip, (void *)bad_regs->sp, 5*8);
+	struct pt_regs *regs = (struct pt_regs *)this_cpu_read(cpu_current_top_of_stack) - 1;
 
-	/* Copy the remainder of the stack from the current stack. */
-	__memcpy(&tmp, bad_regs, offsetof(struct pt_regs, ip));
+	/*
+	 * Copy the IRET target to the task pt_regs.  This fault is caused
+	 * by a bad iret at native_irq_return_iret, which XENPV does not
+	 * use, so bad_regs->sp and bad_regs are in the CPU entry area and
+	 * cannot overlap with the task pt_regs; __memcpy() is therefore
+	 * safe.
+	 */
+	__memcpy(&regs->ip, (void *)bad_regs->sp, 5*8);
 
-	/* Update the entry stack */
-	__memcpy(new_stack, &tmp, sizeof(tmp));
+	/* Copy the remainder of the pt_regs from the bad_regs. */
+	__memcpy(regs, bad_regs, offsetof(struct pt_regs, ip));
 
-	BUG_ON(!user_mode(new_stack));
-	return new_stack;
+	BUG_ON(!user_mode(regs));
+	return regs;
 }
 
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
@@ -861,7 +864,7 @@ struct pt_regs *do_error_entry(struct pt_regs *eregs)
 		 * pt_regs as if we faulted immediately after IRET and put
 		 * pt_regs onto the real task stack.
 		 */
-		return sync_regs(fixup_bad_iret(eregs));
+		return fixup_bad_iret(eregs);
 	}
 
 	/*
-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 13/24] x86/traps: Mark sync_regs() and fixup_bad_iret() as static __always_inline
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (11 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 12/24] x86/traps: Reconstruct pt_regs on task stack directly in fixup_bad_iret() Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 14/24] x86/entry: Make paranoid_exit() callable Lai Jiangshan
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Joerg Roedel, Youquan Song,
	Tony Luck

From: Lai Jiangshan <laijs@linux.alibaba.com>

They each have a single caller and are used only in traps.c.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/traps.c       | 5 +++--
 arch/x86/include/asm/traps.h | 3 ---
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index ab9866b650e7..40722a2f61ae 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -683,7 +683,8 @@ DEFINE_IDTENTRY_RAW(exc_int3)
  * to switch to the normal thread stack if the interrupted code was in
  * user mode. The actual stack switch is done in entry_64.S
  */
-asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
+static __always_inline
+struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
 	struct pt_regs *regs = (struct pt_regs *)this_cpu_read(cpu_current_top_of_stack) - 1;
 	if (regs != eregs)
@@ -734,7 +735,7 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
 }
 #endif
 
-asmlinkage __visible noinstr
+static __always_inline
 struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs)
 {
 	/*
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 7b51a8081ae4..5c41a279c1e0 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -11,9 +11,6 @@
 #include <asm/trap_pf.h>
 
 #ifdef CONFIG_X86_64
-asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
-asmlinkage __visible notrace
-struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs);
 asmlinkage __visible notrace struct pt_regs *do_error_entry(struct pt_regs *eregs);
 void __init trap_init(void);
 asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 14/24] x86/entry: Make paranoid_exit() callable
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (12 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 13/24] x86/traps: Mark sync_regs() and fixup_bad_iret() as static __always_inline Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 15/24] x86/entry: Call paranoid_exit() in asm_exc_nmi() Lai Jiangshan
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Move the last JMP out of paranoid_exit() and make it callable.

This allows paranoid_exit() to be rewritten in C later and also allows
asm_exc_nmi() to call it, avoiding duplicated code.

No functional change intended.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/entry_64.S | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index bc9e2f5ad370..24b121d2bf61 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -437,7 +437,8 @@ SYM_CODE_START(\asmsym)
 
 	call	\cfunc
 
-	jmp	paranoid_exit
+	call	paranoid_exit
+	jmp	restore_regs_and_return_to_kernel
 
 	/* Switch to the regular task stack and use the noist entry point */
 .Lfrom_usermode_switch_stack_\@:
@@ -514,7 +515,8 @@ SYM_CODE_START(\asmsym)
 	 * identical to the stack in the IRET frame or the VC fall-back stack,
 	 * so it is definitely mapped even with PTI enabled.
 	 */
-	jmp	paranoid_exit
+	call	paranoid_exit
+	jmp	restore_regs_and_return_to_kernel
 
 	/* Switch to the regular task stack */
 .Lfrom_usermode_switch_stack_\@:
@@ -544,7 +546,8 @@ SYM_CODE_START(\asmsym)
 	movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
 	call	\cfunc
 
-	jmp	paranoid_exit
+	call	paranoid_exit
+	jmp	restore_regs_and_return_to_kernel
 
 _ASM_NOKPROBE(\asmsym)
 SYM_CODE_END(\asmsym)
@@ -936,7 +939,7 @@ SYM_CODE_END(paranoid_entry)
  *     Y        User space GSBASE, must be restored unconditionally
  */
 SYM_CODE_START_LOCAL(paranoid_exit)
-	UNWIND_HINT_REGS
+	UNWIND_HINT_REGS offset=8
 	/*
 	 * The order of operations is important. RESTORE_CR3 requires
 	 * kernel GSBASE.
@@ -952,16 +955,17 @@ SYM_CODE_START_LOCAL(paranoid_exit)
 
 	/* With FSGSBASE enabled, unconditionally restore GSBASE */
 	wrgsbase	%rbx
-	jmp		restore_regs_and_return_to_kernel
+	ret
 
 .Lparanoid_exit_checkgs:
 	/* On non-FSGSBASE systems, conditionally do SWAPGS */
 	testl		%ebx, %ebx
-	jnz		restore_regs_and_return_to_kernel
+	jnz		.Lparanoid_exit_done
 
 	/* We are returning to a context with user GSBASE */
 	swapgs
-	jmp		restore_regs_and_return_to_kernel
+.Lparanoid_exit_done:
+	ret
 SYM_CODE_END(paranoid_exit)
 
 /*
-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 15/24] x86/entry: Call paranoid_exit() in asm_exc_nmi()
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (13 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 14/24] x86/entry: Make paranoid_exit() callable Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 16/24] x86/entry: Use skip_rdi instead of save_ret for PUSH_AND_CLEAR_REGS Lai Jiangshan
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

The code between "call exc_nmi" and nmi_restore is the same as
paranoid_exit(), so just use paranoid_exit() instead of the open-coded
duplicate.

No functional change intended.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/entry_64.S | 34 +++++-----------------------------
 1 file changed, 5 insertions(+), 29 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 24b121d2bf61..ac67a1109c9c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -921,8 +921,7 @@ SYM_CODE_END(paranoid_entry)
 
 /*
  * "Paranoid" exit path from exception stack.  This is invoked
- * only on return from non-NMI IST interrupts that came
- * from kernel space.
+ * only on return from IST interrupts that came from kernel space.
  *
  * We may be returning to very strange contexts (e.g. very early
  * in syscall entry), so checking for preemption here would
@@ -1288,11 +1287,7 @@ end_repeat_nmi:
 	pushq	$-1				/* ORIG_RAX: no syscall to restart */
 
 	/*
-	 * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
-	 * as we should not be calling schedule in NMI context.
-	 * Even with normal interrupts enabled. An NMI should not be
-	 * setting NEED_RESCHED or anything that normal interrupts and
-	 * exceptions might do.
+	 * Use paranoid_entry to handle SWAPGS and CR3.
 	 */
 	call	paranoid_entry
 	UNWIND_HINT_REGS
@@ -1301,31 +1296,12 @@ end_repeat_nmi:
 	movq	$-1, %rsi
 	call	exc_nmi
 
-	/* Always restore stashed CR3 value (see paranoid_entry) */
-	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
-
 	/*
-	 * The above invocation of paranoid_entry stored the GSBASE
-	 * related information in R/EBX depending on the availability
-	 * of FSGSBASE.
-	 *
-	 * If FSGSBASE is enabled, restore the saved GSBASE value
-	 * unconditionally, otherwise take the conditional SWAPGS path.
+	 * Use paranoid_exit to handle SWAPGS and CR3, but no need to use
+	 * restore_regs_and_return_to_kernel as we must handle nested NMI.
 	 */
-	ALTERNATIVE "jmp nmi_no_fsgsbase", "", X86_FEATURE_FSGSBASE
-
-	wrgsbase	%rbx
-	jmp	nmi_restore
-
-nmi_no_fsgsbase:
-	/* EBX == 0 -> invoke SWAPGS */
-	testl	%ebx, %ebx
-	jnz	nmi_restore
-
-nmi_swapgs:
-	swapgs
+	call	paranoid_exit
 
-nmi_restore:
 	POP_REGS
 
 	/*
-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 16/24] x86/entry: Use skip_rdi instead of save_ret for PUSH_AND_CLEAR_REGS
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (14 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 15/24] x86/entry: Call paranoid_exit() in asm_exc_nmi() Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 17/24] x86/entry: Introduce struct ist_regs Lai Jiangshan
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

This allows users of PUSH_AND_CLEAR_REGS to handle the return address
in their own way.  For example, error_entry() can save it in %r12
directly instead of moving it around, which saves two instructions.

In a later patch, it also allows moving the return address lower and
adding a little more space between pt_regs and the return address,
which together form a new type, ist_regs, for IST interrupts.

No functional change intended.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/calling.h  | 18 +++++-------------
 arch/x86/entry/entry_64.S | 12 ++++++++----
 2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 996b041e92d2..aeee1575332f 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -63,15 +63,11 @@ For 32-bit we have the following conventions - kernel is built with
  * for assembly code:
  */
 
-.macro PUSH_REGS rdx=%rdx rax=%rax save_ret=0
-	.if \save_ret
-	pushq	%rsi		/* pt_regs->si */
-	movq	8(%rsp), %rsi	/* temporarily store the return address in %rsi */
-	movq	%rdi, 8(%rsp)	/* pt_regs->di (overwriting original return address) */
-	.else
+.macro PUSH_REGS rdx=%rdx rax=%rax skip_rdi=0
+	.if \skip_rdi == 0
 	pushq   %rdi		/* pt_regs->di */
-	pushq   %rsi		/* pt_regs->si */
 	.endif
+	pushq   %rsi		/* pt_regs->si */
 	pushq	\rdx		/* pt_regs->dx */
 	pushq   %rcx		/* pt_regs->cx */
 	pushq   \rax		/* pt_regs->ax */
@@ -86,10 +82,6 @@ For 32-bit we have the following conventions - kernel is built with
 	pushq	%r14		/* pt_regs->r14 */
 	pushq	%r15		/* pt_regs->r15 */
 	UNWIND_HINT_REGS
-
-	.if \save_ret
-	pushq	%rsi		/* return address on top of stack */
-	.endif
 .endm
 
 .macro CLEAR_REGS
@@ -114,8 +106,8 @@ For 32-bit we have the following conventions - kernel is built with
 
 .endm
 
-.macro PUSH_AND_CLEAR_REGS rdx=%rdx rax=%rax save_ret=0
-	PUSH_REGS rdx=\rdx, rax=\rax, save_ret=\save_ret
+.macro PUSH_AND_CLEAR_REGS rdx=%rdx rax=%rax skip_rdi=0
+	PUSH_REGS rdx=\rdx, rax=\rax, skip_rdi=\skip_rdi
 	CLEAR_REGS
 .endm
 
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ac67a1109c9c..e968074046c3 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -849,7 +849,10 @@ SYM_CODE_END(xen_failsafe_callback)
 SYM_CODE_START_LOCAL(paranoid_entry)
 	UNWIND_HINT_FUNC
 	cld
-	PUSH_AND_CLEAR_REGS save_ret=1
+	PUSH_AND_CLEAR_REGS skip_rdi=1
+	movq	RDI(%rsp), %rsi	/* temporarily store the return address in %rsi */
+	movq	%rdi, RDI(%rsp) /* put %rdi onto pt_regs */
+	pushq	%rsi		/* put the return address onto the stack */
 	ENCODE_FRAME_POINTER 8
 
 	/*
@@ -973,10 +976,11 @@ SYM_CODE_END(paranoid_exit)
 SYM_CODE_START_LOCAL(error_entry)
 	UNWIND_HINT_FUNC
 	cld
-	PUSH_AND_CLEAR_REGS save_ret=1
-	ENCODE_FRAME_POINTER 8
+	PUSH_AND_CLEAR_REGS skip_rdi=1
+	movq	RDI(%rsp), %r12			/* save return addr in %r12 */
+	movq	%rdi, RDI(%rsp)			/* put %rdi onto pt_regs */
+	ENCODE_FRAME_POINTER
 
-	popq	%r12				/* save return addr in %12 */
 	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
 	call	do_error_entry
 	movq	%rax, %rsp			/* switch stack */
-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 17/24] x86/entry: Introduce struct ist_regs
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (15 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 16/24] x86/entry: Use skip_rdi instead of save_ret for PUSH_AND_CLEAR_REGS Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-09-10  0:18   ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 18/24] x86/entry: Add the C version ist_switch_to_kernel_cr3() Lai Jiangshan
                   ` (8 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Joerg Roedel, Youquan Song,
	Tony Luck, Juergen Gross, Peter Zijlstra (Intel)

From: Lai Jiangshan <laijs@linux.alibaba.com>

struct ist_regs is the topmost stack frame for IST interrupts; it
consists of struct pt_regs and other fields which will be added in
later patches.

Make vc_switch_off_ist() take ist_regs as its argument and switch
the whole ist_regs if needed.

Make the frame before calling paranoid_entry() and paranoid_exit() be
struct ist_regs.

This patch prepares for converting paranoid_entry() and paranoid_exit()
into C code, which will need the additional fields to store results in
paranoid_entry() and to consume them in paranoid_exit().

The C interrupt handlers don't take struct ist_regs since they don't
need its additional fields; they still use pt_regs.

No functional change intended and IST_pt_regs is still 0.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/entry_64.S        | 37 ++++++++++++++++++--------------
 arch/x86/entry/traps.c           | 16 +++++++-------
 arch/x86/include/asm/traps.h     | 10 ++++++++-
 arch/x86/kernel/asm-offsets_64.c |  2 ++
 4 files changed, 40 insertions(+), 25 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e968074046c3..1ae10ca351f4 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -431,13 +431,14 @@ SYM_CODE_START(\asmsym)
 	/* paranoid_entry returns GS information for paranoid_exit in EBX. */
 	call	paranoid_entry
 
-	UNWIND_HINT_REGS
+	UNWIND_HINT_REGS offset=IST_pt_regs
 
-	movq	%rsp, %rdi		/* pt_regs pointer */
+	leaq	IST_pt_regs(%rsp), %rdi	/* pt_regs pointer */
 
 	call	\cfunc
 
 	call	paranoid_exit
+	addq	$IST_pt_regs, %rsp	/* put %rsp back to pt_regs */
 	jmp	restore_regs_and_return_to_kernel
 
 	/* Switch to the regular task stack and use the noist entry point */
@@ -488,7 +489,7 @@ SYM_CODE_START(\asmsym)
 	 */
 	call	paranoid_entry
 
-	UNWIND_HINT_REGS
+	UNWIND_HINT_REGS offset=IST_pt_regs
 
 	/*
 	 * Switch off the IST stack to make it free for nested exceptions. The
@@ -496,17 +497,17 @@ SYM_CODE_START(\asmsym)
 	 * stack if it is safe to do so. If not it switches to the VC fall-back
 	 * stack.
 	 */
-	movq	%rsp, %rdi		/* pt_regs pointer */
+	movq	%rsp, %rdi		/* ist_regs pointer */
 	call	vc_switch_off_ist
 	movq	%rax, %rsp		/* Switch to new stack */
 
-	UNWIND_HINT_REGS
+	UNWIND_HINT_REGS offset=IST_pt_regs
 
-	/* Update pt_regs */
-	movq	ORIG_RAX(%rsp), %rsi	/* get error code into 2nd argument*/
-	movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
+	leaq	IST_pt_regs(%rsp), %rdi	/* pt_regs pointer */
 
-	movq	%rsp, %rdi		/* pt_regs pointer */
+	/* Update pt_regs */
+	movq	ORIG_RAX(%rdi), %rsi	/* get error code into 2nd argument*/
+	movq	$-1, ORIG_RAX(%rdi)	/* no syscall to restart */
 
 	call	kernel_\cfunc
 
@@ -516,6 +517,7 @@ SYM_CODE_START(\asmsym)
 	 * so it is definitely mapped even with PTI enabled.
 	 */
 	call	paranoid_exit
+	addq	$IST_pt_regs, %rsp	/* put %rsp back to pt_regs */
 	jmp	restore_regs_and_return_to_kernel
 
 	/* Switch to the regular task stack */
@@ -539,14 +541,15 @@ SYM_CODE_START(\asmsym)
 
 	/* paranoid_entry returns GS information for paranoid_exit in EBX. */
 	call	paranoid_entry
-	UNWIND_HINT_REGS
+	UNWIND_HINT_REGS offset=IST_pt_regs
 
-	movq	%rsp, %rdi		/* pt_regs pointer into first argument */
-	movq	ORIG_RAX(%rsp), %rsi	/* get error code into 2nd argument*/
-	movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
+	leaq	IST_pt_regs(%rsp), %rdi	/* pt_regs pointer into first argument */
+	movq	ORIG_RAX(%rdi), %rsi	/* get error code into 2nd argument*/
+	movq	$-1, ORIG_RAX(%rdi)	/* no syscall to restart */
 	call	\cfunc
 
 	call	paranoid_exit
+	addq	$IST_pt_regs, %rsp	/* put %rsp back to pt_regs */
 	jmp	restore_regs_and_return_to_kernel
 
 _ASM_NOKPROBE(\asmsym)
@@ -852,8 +855,9 @@ SYM_CODE_START_LOCAL(paranoid_entry)
 	PUSH_AND_CLEAR_REGS skip_rdi=1
 	movq	RDI(%rsp), %rsi	/* temporarily store the return address in %rsi */
 	movq	%rdi, RDI(%rsp) /* put %rdi onto pt_regs */
+	subq	$IST_pt_regs, %rsp /* reserve room for ist_regs */
 	pushq	%rsi		/* put the return address onto the stack */
-	ENCODE_FRAME_POINTER 8
+	ENCODE_FRAME_POINTER 8+IST_pt_regs
 
 	/*
 	 * Always stash CR3 in %r14.  This value will be restored,
@@ -1294,9 +1298,9 @@ end_repeat_nmi:
 	 * Use paranoid_entry to handle SWAPGS and CR3.
 	 */
 	call	paranoid_entry
-	UNWIND_HINT_REGS
+	UNWIND_HINT_REGS offset=IST_pt_regs
 
-	movq	%rsp, %rdi
+	leaq	IST_pt_regs(%rsp), %rdi		/* pt_regs pointer */
 	movq	$-1, %rsi
 	call	exc_nmi
 
@@ -1305,6 +1309,7 @@ end_repeat_nmi:
 	 * restore_regs_and_return_to_kernel as we must handle nested NMI.
 	 */
 	call	paranoid_exit
+	addq	$IST_pt_regs, %rsp		/* put %rsp back to pt_regs */
 
 	POP_REGS
 
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index 40722a2f61ae..da55565d43ef 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -693,17 +693,17 @@ struct pt_regs *sync_regs(struct pt_regs *eregs)
 }
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
-asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *regs)
+asmlinkage __visible noinstr struct ist_regs *vc_switch_off_ist(struct ist_regs *ist)
 {
 	unsigned long sp, *stack;
 	struct stack_info info;
-	struct pt_regs *regs_ret;
+	struct ist_regs *ist_ret;
 
 	/*
 	 * In the SYSCALL entry path the RSP value comes from user-space - don't
 	 * trust it and switch to the current kernel stack
 	 */
-	if (ip_within_syscall_gap(regs)) {
+	if (ip_within_syscall_gap(&ist->regs)) {
 		sp = this_cpu_read(cpu_current_top_of_stack);
 		goto sync;
 	}
@@ -713,7 +713,7 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
 	 * happened from a safe stack. Not safe are the entry or unknown stacks,
 	 * use the fall-back stack instead in this case.
 	 */
-	sp    = regs->sp;
+	sp    = ist->regs.sp;
 	stack = (unsigned long *)sp;
 
 	if (!get_stack_info_noinstr(stack, current, &info) || info.type == STACK_TYPE_ENTRY ||
@@ -726,12 +726,12 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
 	 * IST stack. The code below only copies pt_regs, the real switch happens
 	 * in assembly code.
 	 */
-	sp = ALIGN_DOWN(sp, 8) - sizeof(*regs_ret);
+	sp = ALIGN_DOWN(sp, 8) - sizeof(*ist_ret);
 
-	regs_ret = (struct pt_regs *)sp;
-	*regs_ret = *regs;
+	ist_ret = (struct ist_regs *)sp;
+	*ist_ret = *ist;
 
-	return regs_ret;
+	return ist_ret;
 }
 #endif
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 5c41a279c1e0..e24c63bbc30a 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -11,9 +11,17 @@
 #include <asm/trap_pf.h>
 
 #ifdef CONFIG_X86_64
+struct ist_regs {
+	/*
+	 * IST-specific fields must be defined before pt_regs, and
+	 * they are located below pt_regs on the stack.
+	 */
+	struct pt_regs regs;
+};
+
 asmlinkage __visible notrace struct pt_regs *do_error_entry(struct pt_regs *eregs);
 void __init trap_init(void);
-asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
+asmlinkage __visible noinstr struct ist_regs *vc_switch_off_ist(struct ist_regs *ist);
 #endif
 
 #ifdef CONFIG_X86_F00F_BUG
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index b14533af7676..3cffadf64ac2 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -4,6 +4,7 @@
 #endif
 
 #include <asm/ia32.h>
+#include <asm/traps.h>
 
 #if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 #include <asm/kvm_para.h>
@@ -55,6 +56,7 @@ int main(void)
 #undef ENTRY
 
 	BLANK();
+	DEFINE(IST_pt_regs, offsetof(struct ist_regs, regs));
 
 #ifdef CONFIG_STACKPROTECTOR
 	DEFINE(stack_canary_offset, offsetof(struct fixed_percpu_data, stack_canary));
-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 18/24] x86/entry: Add the C version ist_switch_to_kernel_cr3()
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (16 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 17/24] x86/entry: Introduce struct ist_regs Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 19/24] x86/entry: Add the C version ist_restore_cr3() Lai Jiangshan
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

It switches the CR3 to kernel CR3 and returns the original CR3, and
the caller should save the return value.

It is the C version of SAVE_AND_SWITCH_TO_KERNEL_CR3.

No functional difference intended.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/traps.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index da55565d43ef..9d6753d0b9fc 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -785,8 +785,23 @@ static __always_inline void switch_to_kernel_cr3(void)
 	if (static_cpu_has(X86_FEATURE_PTI))
 		pti_switch_to_kernel_cr3(__native_read_cr3());
 }
+
+static __always_inline unsigned long ist_switch_to_kernel_cr3(void)
+{
+	unsigned long cr3 = 0;
+
+	if (static_cpu_has(X86_FEATURE_PTI)) {
+		cr3 = __native_read_cr3();
+
+		if (cr3 & PTI_USER_PGTABLE_MASK)
+			pti_switch_to_kernel_cr3(cr3);
+	}
+
+	return cr3;
+}
 #else
 static __always_inline void switch_to_kernel_cr3(void) {}
+static __always_inline unsigned long ist_switch_to_kernel_cr3(void) { return 0; }
 #endif
 
 /*
-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 19/24] x86/entry: Add the C version ist_restore_cr3()
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (17 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 18/24] x86/entry: Add the C version ist_switch_to_kernel_cr3() Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 20/24] x86/entry: Add the C version get_percpu_base() Lai Jiangshan
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

It implements the C version of RESTORE_CR3().

No functional difference intended, except that the ASM code uses bit test
and clear operations while the C version uses a mask check and 'AND'
operations.  The resulting assembly of the two versions is very similar.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/traps.c | 46 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index 9d6753d0b9fc..e912bfbd2a61 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -65,6 +65,7 @@
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
 #include <asm/proto.h>
+#include <asm/tlbflush.h>
 
 extern unsigned char native_irq_return_iret[];
 extern unsigned char asm_load_gs_index_gs_change[];
@@ -799,9 +800,54 @@ static __always_inline unsigned long ist_switch_to_kernel_cr3(void)
 
 	return cr3;
 }
+
+static __always_inline void pti_switch_to_user_cr3(unsigned long user_cr3)
+{
+#define KERN_PCID_MASK (CR3_PCID_MASK & ~PTI_USER_PCID_MASK)
+
+	if (static_cpu_has(X86_FEATURE_PCID)) {
+		int pcid = user_cr3 & KERN_PCID_MASK;
+		unsigned short pcid_mask = 1ull << pcid;
+
+		/*
+		 * Check if there's a pending flush for the user ASID we're
+		 * about to set.
+		 */
+		if (!(this_cpu_read(cpu_tlbstate.user_pcid_flush_mask) & pcid_mask))
+			user_cr3 |= X86_CR3_PCID_NOFLUSH;
+		else
+			this_cpu_and(cpu_tlbstate.user_pcid_flush_mask, ~pcid_mask);
+	}
+	native_write_cr3(user_cr3);
+}
+
+static __always_inline void ist_restore_cr3(unsigned long cr3)
+{
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	if (unlikely(cr3 & PTI_USER_PGTABLE_MASK)) {
+		pti_switch_to_user_cr3(cr3);
+		return;
+	}
+
+	/*
+	 * KERNEL pages can always resume with NOFLUSH as we do
+	 * explicit flushes.
+	 */
+	if (static_cpu_has(X86_FEATURE_PCID))
+		cr3 |= X86_CR3_PCID_NOFLUSH;
+
+	/*
+	 * The CR3 write could be avoided when not changing its value,
+	 * but would require a CR3 read.
+	 */
+	native_write_cr3(cr3);
+}
 #else
 static __always_inline void switch_to_kernel_cr3(void) {}
 static __always_inline unsigned long ist_switch_to_kernel_cr3(void) { return 0; }
+static __always_inline void ist_restore_cr3(unsigned long cr3) {}
 #endif
 
 /*
-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 20/24] x86/entry: Add the C version get_percpu_base()
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (18 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 19/24] x86/entry: Add the C version ist_restore_cr3() Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 21/24] x86/entry: Add the C version ist_switch_to_kernel_gsbase() Lai Jiangshan
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

It implements the C version of asm macro GET_PERCPU_BASE().

No functional difference intended.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/traps.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index e912bfbd2a61..9f4bc52410d0 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -945,6 +945,42 @@ struct pt_regs *do_error_entry(struct pt_regs *eregs)
 	/* Enter from kernel, don't move pt_regs */
 	return eregs;
 }
+
+#ifdef CONFIG_SMP
+/*
+ * CPU/node NR is loaded from the limit (size) field of a special segment
+ * descriptor entry in GDT.
+ *
+ * Do not use RDPID, because KVM loads guest's TSC_AUX on vm-entry and
+ * may not restore the host's value until the CPU returns to userspace.
+ * Thus the kernel would consume a guest's TSC_AUX if an NMI arrives
+ * while running KVM's run loop.
+ */
+static __always_inline unsigned int gdt_get_cpu(void)
+{
+	unsigned int p;
+
+	asm ("lsl %[seg],%[p]": [p] "=a" (p): [seg] "r" (__CPUNODE_SEG));
+
+	return p & VDSO_CPUNODE_MASK;
+}
+
+/*
+ * Fetch the per-CPU GSBASE value for this processor.
+ *
+ * We normally use %gs for accessing per-CPU data, but we are setting up
+ * %gs here and obviously can not use %gs itself to access per-CPU data.
+ */
+static __always_inline unsigned long get_percpu_base(void)
+{
+	return __per_cpu_offset[gdt_get_cpu()];
+}
+#else
+static __always_inline unsigned long get_percpu_base(void)
+{
+	return pcpu_unit_offsets;
+}
+#endif
 #endif
 
 static bool is_sysenter_singlestep(struct pt_regs *regs)
-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 21/24] x86/entry: Add the C version ist_switch_to_kernel_gsbase()
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (19 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 20/24] x86/entry: Add the C version get_percpu_base() Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 22/24] x86/entry: Implement and use do_paranoid_entry() and paranoid_exit() Lai Jiangshan
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

It implements the second half of paranoid_entry(), which switches to the
kernel GSBASE.

No functional difference intended.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/traps.c | 48 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index 9f4bc52410d0..b5c92b4e0cb5 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -981,6 +981,54 @@ static __always_inline unsigned long get_percpu_base(void)
 	return pcpu_unit_offsets;
 }
 #endif
+
+/*
+ * Handling GSBASE depends on the availability of FSGSBASE.
+ *
+ * Without FSGSBASE the kernel enforces that negative GSBASE
+ * values indicate kernel GSBASE. With FSGSBASE no assumptions
+ * can be made about the GSBASE value when entering from user
+ * space.
+ */
+static __always_inline unsigned long ist_switch_to_kernel_gsbase(void)
+{
+	unsigned long gsbase;
+
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		/*
+		 * Read the current GSBASE for return.
+		 * Retrieve and set the current CPUs kernel GSBASE.
+		 *
+		 * The unconditional write to GS base below ensures that
+		 * no subsequent loads based on a mispredicted GS base can
+		 * happen, therefore no LFENCE is needed here.
+		 */
+		gsbase = rdgsbase();
+		wrgsbase(get_percpu_base());
+		return gsbase;
+	}
+
+	gsbase = __rdmsr(MSR_GS_BASE);
+
+	/*
+	 * The kernel-enforced convention is a negative GSBASE indicates
+	 * a kernel value. No SWAPGS needed on entry and exit.
+	 */
+	if ((long)gsbase < 0)
+		return 1;
+
+	native_swapgs();
+
+	/*
+	 * The above ist_switch_to_kernel_cr3() doesn't do an unconditional
+	 * CR3 write, even in the PTI case.  So do an lfence to prevent GS
+	 * speculation, regardless of whether PTI is enabled.
+	 */
+	fence_swapgs_kernel_entry();
+
+	/* SWAPGS required on exit */
+	return 0;
+}
 #endif
 
 static bool is_sysenter_singlestep(struct pt_regs *regs)
-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 22/24] x86/entry: Implement and use do_paranoid_entry() and paranoid_exit()
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (20 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 21/24] x86/entry: Add the C version ist_switch_to_kernel_gsbase() Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-09-02 10:33   ` Peter Zijlstra
  2021-08-31 17:50 ` [PATCH 23/24] x86/entry: Remove the unused ASM macros Lai Jiangshan
                   ` (3 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Joerg Roedel, Youquan Song,
	Tony Luck

From: Lai Jiangshan <laijs@linux.alibaba.com>

All the needed facilities are now in place in traps.c, so we can implement
the major body of paranoid_entry() in C as do_paranoid_entry() and the
whole of paranoid_exit() in C.

paranoid_entry() needs to save two values, which are added to struct
ist_regs, and paranoid_exit() uses them after the interrupt is handled.

No functional change intended.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/entry_64.S    | 128 +++--------------------------------
 arch/x86/entry/traps.c       |  62 +++++++++++++++++
 arch/x86/include/asm/traps.h |  22 ++++++
 3 files changed, 92 insertions(+), 120 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1ae10ca351f4..8b2e19e6c9e1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -437,6 +437,7 @@ SYM_CODE_START(\asmsym)
 
 	call	\cfunc
 
+	movq	%rsp, %rdi		/* ist_regs pointer */
 	call	paranoid_exit
 	addq	$IST_pt_regs, %rsp	/* put %rsp back to pt_regs */
 	jmp	restore_regs_and_return_to_kernel
@@ -516,6 +517,7 @@ SYM_CODE_START(\asmsym)
 	 * identical to the stack in the IRET frame or the VC fall-back stack,
 	 * so it is definitely mapped even with PTI enabled.
 	 */
+	movq	%rsp, %rdi		/* ist_regs pointer */
 	call	paranoid_exit
 	addq	$IST_pt_regs, %rsp	/* put %rsp back to pt_regs */
 	jmp	restore_regs_and_return_to_kernel
@@ -548,6 +550,7 @@ SYM_CODE_START(\asmsym)
 	movq	$-1, ORIG_RAX(%rdi)	/* no syscall to restart */
 	call	\cfunc
 
+	movq	%rsp, %rdi		/* ist_regs pointer */
 	call	paranoid_exit
 	addq	$IST_pt_regs, %rsp	/* put %rsp back to pt_regs */
 	jmp	restore_regs_and_return_to_kernel
@@ -840,14 +843,8 @@ SYM_CODE_END(xen_failsafe_callback)
 #endif /* CONFIG_XEN_PV */
 
 /*
- * Save all registers in pt_regs. Return GSBASE related information
- * in EBX depending on the availability of the FSGSBASE instructions:
- *
- * FSGSBASE	R/EBX
- *     N        0 -> SWAPGS on exit
- *              1 -> no SWAPGS on exit
- *
- *     Y        GSBASE value at entry, must be restored in paranoid_exit
+ * Save all registers and additional info in ist_regs.
+ * Switch CR3 and gsbase if needed.
  */
 SYM_CODE_START_LOCAL(paranoid_entry)
 	UNWIND_HINT_FUNC
@@ -856,124 +853,14 @@ SYM_CODE_START_LOCAL(paranoid_entry)
 	movq	RDI(%rsp), %rsi	/* temporarily store the return address in %rsi */
 	movq	%rdi, RDI(%rsp) /* put %rdi onto pt_regs */
 	subq	$IST_pt_regs, %rsp /* reserve room for ist_regs */
+	movq	%rsp, %rdi	/* ist_regs pointer */
 	pushq	%rsi		/* put the return address onto the stack */
 	ENCODE_FRAME_POINTER 8+IST_pt_regs
 
-	/*
-	 * Always stash CR3 in %r14.  This value will be restored,
-	 * verbatim, at exit.  Needed if paranoid_entry interrupted
-	 * another entry that already switched to the user CR3 value
-	 * but has not yet returned to userspace.
-	 *
-	 * This is also why CS (stashed in the "iret frame" by the
-	 * hardware at entry) can not be used: this may be a return
-	 * to kernel code, but with a user CR3 value.
-	 *
-	 * Switching CR3 does not depend on kernel GSBASE so it can
-	 * be done before switching to the kernel GSBASE. This is
-	 * required for FSGSBASE because the kernel GSBASE has to
-	 * be retrieved from a kernel internal table.
-	 */
-	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
-
-	/*
-	 * Handling GSBASE depends on the availability of FSGSBASE.
-	 *
-	 * Without FSGSBASE the kernel enforces that negative GSBASE
-	 * values indicate kernel GSBASE. With FSGSBASE no assumptions
-	 * can be made about the GSBASE value when entering from user
-	 * space.
-	 */
-	ALTERNATIVE "jmp .Lparanoid_entry_checkgs", "", X86_FEATURE_FSGSBASE
-
-	/*
-	 * Read the current GSBASE and store it in %rbx unconditionally,
-	 * retrieve and set the current CPUs kernel GSBASE. The stored value
-	 * has to be restored in paranoid_exit unconditionally.
-	 *
-	 * The unconditional write to GS base below ensures that no subsequent
-	 * loads based on a mispredicted GS base can happen, therefore no LFENCE
-	 * is needed here.
-	 */
-	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx
-	ret
-
-.Lparanoid_entry_checkgs:
-	/* EBX = 1 -> kernel GSBASE active, no restore required */
-	movl	$1, %ebx
-	/*
-	 * The kernel-enforced convention is a negative GSBASE indicates
-	 * a kernel value. No SWAPGS needed on entry and exit.
-	 */
-	movl	$MSR_GS_BASE, %ecx
-	rdmsr
-	testl	%edx, %edx
-	jns	.Lparanoid_entry_swapgs
-	ret
-
-.Lparanoid_entry_swapgs:
-	swapgs
-
-	/*
-	 * The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an
-	 * unconditional CR3 write, even in the PTI case.  So do an lfence
-	 * to prevent GS speculation, regardless of whether PTI is enabled.
-	 */
-	FENCE_SWAPGS_KERNEL_ENTRY
-
-	/* EBX = 0 -> SWAPGS required on exit */
-	xorl	%ebx, %ebx
+	call	do_paranoid_entry
 	ret
 SYM_CODE_END(paranoid_entry)
 
-/*
- * "Paranoid" exit path from exception stack.  This is invoked
- * only on return from IST interrupts that came from kernel space.
- *
- * We may be returning to very strange contexts (e.g. very early
- * in syscall entry), so checking for preemption here would
- * be complicated.  Fortunately, there's no good reason to try
- * to handle preemption here.
- *
- * R/EBX contains the GSBASE related information depending on the
- * availability of the FSGSBASE instructions:
- *
- * FSGSBASE	R/EBX
- *     N        0 -> SWAPGS on exit
- *              1 -> no SWAPGS on exit
- *
- *     Y        User space GSBASE, must be restored unconditionally
- */
-SYM_CODE_START_LOCAL(paranoid_exit)
-	UNWIND_HINT_REGS offset=8
-	/*
-	 * The order of operations is important. RESTORE_CR3 requires
-	 * kernel GSBASE.
-	 *
-	 * NB to anyone to try to optimize this code: this code does
-	 * not execute at all for exceptions from user mode. Those
-	 * exceptions go through error_exit instead.
-	 */
-	RESTORE_CR3	scratch_reg=%rax save_reg=%r14
-
-	/* Handle the three GSBASE cases */
-	ALTERNATIVE "jmp .Lparanoid_exit_checkgs", "", X86_FEATURE_FSGSBASE
-
-	/* With FSGSBASE enabled, unconditionally restore GSBASE */
-	wrgsbase	%rbx
-	ret
-
-.Lparanoid_exit_checkgs:
-	/* On non-FSGSBASE systems, conditionally do SWAPGS */
-	testl		%ebx, %ebx
-	jnz		.Lparanoid_exit_done
-
-	/* We are returning to a context with user GSBASE */
-	swapgs
-.Lparanoid_exit_done:
-	ret
-SYM_CODE_END(paranoid_exit)
-
 /*
  * Save all registers in pt_regs, and switch GS if needed.
  */
@@ -1308,6 +1195,7 @@ end_repeat_nmi:
 	 * Use paranoid_exit to handle SWAPGS and CR3, but no need to use
 	 * restore_regs_and_return_to_kernel as we must handle nested NMI.
 	 */
+	movq	%rsp, %rdi			/* ist_regs pointer */
 	call	paranoid_exit
 	addq	$IST_pt_regs, %rsp		/* put %rsp back to pt_regs */
 
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index b5c92b4e0cb5..52511db6baa6 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -1029,6 +1029,68 @@ static __always_inline unsigned long ist_switch_to_kernel_gsbase(void)
 	/* SWAPGS required on exit */
 	return 0;
 }
+
+asmlinkage __visible __entry_text
+void do_paranoid_entry(struct ist_regs *ist)
+{
+	/*
+	 * Always stash CR3 in ist->cr3.  This value will be restored,
+	 * verbatim, at exit.  Needed if paranoid_entry interrupted
+	 * another entry that already switched to the user CR3 value
+	 * but has not yet returned to userspace.
+	 *
+	 * This is also why CS (stashed in the "iret frame" by the
+	 * hardware at entry) can not be used: this may be a return
+	 * to kernel code, but with a user CR3 value.
+	 *
+	 * Switching CR3 does not depend on kernel GSBASE so it can
+	 * be done before switching to the kernel GSBASE. This is
+	 * required for FSGSBASE because the kernel GSBASE has to
+	 * be retrieved from a kernel internal table.
+	 */
+	ist->cr3 = ist_switch_to_kernel_cr3();
+
+	/* Handle GSBASE, store the return value in ist_regs for exit. */
+	ist->gsbase = ist_switch_to_kernel_gsbase();
+}
+
+/*
+ * "Paranoid" exit path from exception stack.  This is invoked
+ * only on return from IST interrupts that came from kernel space.
+ *
+ * We may be returning to very strange contexts (e.g. very early
+ * in syscall entry), so checking for preemption here would
+ * be complicated.  Fortunately, there's no good reason to try
+ * to handle preemption here.
+ */
+asmlinkage __visible __entry_text
+void paranoid_exit(struct ist_regs *ist)
+{
+	/*
+	 * Restore cr3 at first, it can use kernel GSBASE.
+	 */
+	ist_restore_cr3(ist->cr3);
+
+	/*
+	 * Handle the three GSBASE cases.
+	 *
+	 * ist->gsbase contains the GSBASE related information depending
+	 * on the availability of the FSGSBASE instructions:
+	 *
+	 * FSGSBASE	ist->gsbase
+	 *     N        0 -> SWAPGS on exit
+	 *              1 -> no SWAPGS on exit
+	 *
+	 *     Y        User space GSBASE, must be restored unconditionally
+	 */
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		wrgsbase(ist->gsbase);
+		return;
+	}
+
+	if (!ist->gsbase)
+		native_swapgs();
+}
 #endif
 
 static bool is_sysenter_singlestep(struct pt_regs *regs)
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index e24c63bbc30a..0bc7117a01cd 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -12,6 +12,26 @@
 
 #ifdef CONFIG_X86_64
 struct ist_regs {
+	/*
+	 * Always stash CR3 in cr3.  This value will be restored,
+	 * verbatim, at exit.  Needed if paranoid_entry interrupted
+	 * another entry that already switched to the user CR3 value
+	 * but has not yet returned to userspace.
+	 */
+	unsigned long cr3;
+
+	/*
+	 * gsbase contains the GSBASE related information depending on the
+	 * availability of the FSGSBASE instructions:
+	 *
+	 * FSGSBASE	gsbase
+	 *     N        0 -> SWAPGS on exit
+	 *              1 -> no SWAPGS on exit
+	 *
+	 *     Y        User space GSBASE, must be restored unconditionally
+	 */
+	unsigned long gsbase;
+
 	/*
 	 * ist specific fields must be defined before pt_regs
 	 * and they are located below pt_regs on the stacks.
@@ -20,6 +40,8 @@ struct ist_regs {
 };
 
 asmlinkage __visible notrace struct pt_regs *do_error_entry(struct pt_regs *eregs);
+asmlinkage __visible notrace void do_paranoid_entry(struct ist_regs *ist);
+asmlinkage __visible notrace void paranoid_exit(struct ist_regs *ist);
 void __init trap_init(void);
 asmlinkage __visible noinstr struct ist_regs *vc_switch_off_ist(struct ist_regs *ist);
 #endif
-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 23/24] x86/entry: Remove the unused ASM macros
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (21 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 22/24] x86/entry: Implement and use do_paranoid_entry() and paranoid_exit() Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-08-31 17:50 ` [PATCH 24/24] x86/syscall/64: Move the checking for sysret to C code Lai Jiangshan
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

They have been reimplemented in C in traps.c, so the ASM versions are no
longer needed.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/calling.h | 106 ---------------------------------------
 1 file changed, 106 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index aeee1575332f..3243d67ea7c8 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -202,60 +202,6 @@ For 32-bit we have the following conventions - kernel is built with
 	popq	%rax
 .endm
 
-.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
-	ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
-	movq	%cr3, \scratch_reg
-	movq	\scratch_reg, \save_reg
-	/*
-	 * Test the user pagetable bit. If set, then the user page tables
-	 * are active. If clear CR3 already has the kernel page table
-	 * active.
-	 */
-	bt	$PTI_USER_PGTABLE_BIT, \scratch_reg
-	jnc	.Ldone_\@
-
-	ADJUST_KERNEL_CR3 \scratch_reg
-	movq	\scratch_reg, %cr3
-
-.Ldone_\@:
-.endm
-
-.macro RESTORE_CR3 scratch_reg:req save_reg:req
-	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
-
-	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
-
-	/*
-	 * KERNEL pages can always resume with NOFLUSH as we do
-	 * explicit flushes.
-	 */
-	bt	$PTI_USER_PGTABLE_BIT, \save_reg
-	jnc	.Lnoflush_\@
-
-	/*
-	 * Check if there's a pending flush for the user ASID we're
-	 * about to set.
-	 */
-	movq	\save_reg, \scratch_reg
-	andq	$(0x7FF), \scratch_reg
-	bt	\scratch_reg, THIS_CPU_user_pcid_flush_mask
-	jnc	.Lnoflush_\@
-
-	btr	\scratch_reg, THIS_CPU_user_pcid_flush_mask
-	jmp	.Lwrcr3_\@
-
-.Lnoflush_\@:
-	SET_NOFLUSH_BIT \save_reg
-
-.Lwrcr3_\@:
-	/*
-	 * The CR3 write could be avoided when not changing its value,
-	 * but would require a CR3 read *and* a scratch register.
-	 */
-	movq	\save_reg, %cr3
-.Lend_\@:
-.endm
-
 #else /* CONFIG_PAGE_TABLE_ISOLATION=n: */
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
@@ -264,10 +210,6 @@ For 32-bit we have the following conventions - kernel is built with
 .endm
 .macro SWITCH_TO_USER_CR3_STACK scratch_reg:req
 .endm
-.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
-.endm
-.macro RESTORE_CR3 scratch_reg:req save_reg:req
-.endm
 
 #endif
 
@@ -276,17 +218,10 @@ For 32-bit we have the following conventions - kernel is built with
  *
  * FENCE_SWAPGS_USER_ENTRY is used in the user entry swapgs code path, to
  * prevent a speculative swapgs when coming from kernel space.
- *
- * FENCE_SWAPGS_KERNEL_ENTRY is used in the kernel entry non-swapgs code path,
- * to prevent the swapgs from getting speculatively skipped when coming from
- * user space.
  */
 .macro FENCE_SWAPGS_USER_ENTRY
 	ALTERNATIVE "", "lfence", X86_FEATURE_FENCE_SWAPGS_USER
 .endm
-.macro FENCE_SWAPGS_KERNEL_ENTRY
-	ALTERNATIVE "", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL
-.endm
 
 .macro STACKLEAK_ERASE_NOCLOBBER
 #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
@@ -296,12 +231,6 @@ For 32-bit we have the following conventions - kernel is built with
 #endif
 .endm
 
-.macro SAVE_AND_SET_GSBASE scratch_reg:req save_reg:req
-	rdgsbase \save_reg
-	GET_PERCPU_BASE \scratch_reg
-	wrgsbase \scratch_reg
-.endm
-
 #else /* CONFIG_X86_64 */
 # undef		UNWIND_HINT_IRET_REGS
 # define	UNWIND_HINT_IRET_REGS
@@ -312,38 +241,3 @@ For 32-bit we have the following conventions - kernel is built with
 	call stackleak_erase
 #endif
 .endm
-
-#ifdef CONFIG_SMP
-
-/*
- * CPU/node NR is loaded from the limit (size) field of a special segment
- * descriptor entry in GDT.
- */
-.macro LOAD_CPU_AND_NODE_SEG_LIMIT reg:req
-	movq	$__CPUNODE_SEG, \reg
-	lsl	\reg, \reg
-.endm
-
-/*
- * Fetch the per-CPU GSBASE value for this processor and put it in @reg.
- * We normally use %gs for accessing per-CPU data, but we are setting up
- * %gs here and obviously can not use %gs itself to access per-CPU data.
- *
- * Do not use RDPID, because KVM loads guest's TSC_AUX on vm-entry and
- * may not restore the host's value until the CPU returns to userspace.
- * Thus the kernel would consume a guest's TSC_AUX if an NMI arrives
- * while running KVM's run loop.
- */
-.macro GET_PERCPU_BASE reg:req
-	LOAD_CPU_AND_NODE_SEG_LIMIT \reg
-	andq	$VDSO_CPUNODE_MASK, \reg
-	movq	__per_cpu_offset(, \reg, 8), \reg
-.endm
-
-#else
-
-.macro GET_PERCPU_BASE reg:req
-	movq	pcpu_unit_offsets(%rip), \reg
-.endm
-
-#endif /* CONFIG_SMP */
-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 24/24] x86/syscall/64: Move the checking for sysret to C code
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (22 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 23/24] x86/entry: Remove the unused ASM macros Lai Jiangshan
@ 2021-08-31 17:50 ` Lai Jiangshan
  2021-09-10  7:20   ` Nikolay Borisov
  2021-08-31 20:44 ` [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into " Peter Zijlstra
  2021-09-02 10:50 ` [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in " Lai Jiangshan
  25 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-08-31 17:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Like do_fast_syscall_32(), which checks whether it can return to userspace
via fast instructions before it returns, do_syscall_64() now also checks
in C whether it can use SYSRET to return to userspace before it returns.
This allows a bunch of ASM code to be removed.

No functional change intended.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/calling.h       | 10 +----
 arch/x86/entry/common.c        | 73 ++++++++++++++++++++++++++++++-
 arch/x86/entry/entry_64.S      | 78 ++--------------------------------
 arch/x86/include/asm/syscall.h |  2 +-
 4 files changed, 78 insertions(+), 85 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 3243d67ea7c8..bfcc50e56496 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -111,27 +111,19 @@ For 32-bit we have the following conventions - kernel is built with
 	CLEAR_REGS
 .endm
 
-.macro POP_REGS pop_rdi=1 skip_r11rcx=0
+.macro POP_REGS pop_rdi=1
 	popq %r15
 	popq %r14
 	popq %r13
 	popq %r12
 	popq %rbp
 	popq %rbx
-	.if \skip_r11rcx
-	popq %rsi
-	.else
 	popq %r11
-	.endif
 	popq %r10
 	popq %r9
 	popq %r8
 	popq %rax
-	.if \skip_r11rcx
-	popq %rsi
-	.else
 	popq %rcx
-	.endif
 	popq %rdx
 	popq %rsi
 	.if \pop_rdi
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6c2826417b33..718045b7a53c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -70,7 +70,77 @@ static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
 	return false;
 }
 
-__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
+/*
+ * Change top bits to match the most significant bit (47th or 56th bit
+ * depending on paging mode) in the address to get canonical address.
+ *
+ * If width of "canonical tail" ever becomes variable, this will need
+ * to be updated to remain correct on both old and new CPUs.
+ */
+static __always_inline u64 canonical_address(u64 vaddr)
+{
+	if (IS_ENABLED(CONFIG_X86_5LEVEL) && static_cpu_has(X86_FEATURE_LA57))
+		return ((s64)vaddr << (64 - 57)) >> (64 - 57);
+	else
+		return ((s64)vaddr << (64 - 48)) >> (64 - 48);
+}
+
+/*
+ * Check if it can use SYSRET.
+ *
+ * Try to use SYSRET instead of IRET if we're returning to
+ * a completely clean 64-bit userspace context.
+ *
+ * Returns 0 to return using IRET or 1 to return using SYSRET.
+ */
+static __always_inline int can_sysret(struct pt_regs *regs)
+{
+	/* In the Xen PV case we must use iret anyway. */
+	if (static_cpu_has(X86_FEATURE_XENPV))
+		return 0;
+
+	/* SYSRET requires RCX == RIP && R11 == RFLAGS */
+	if (regs->ip != regs->cx || regs->flags != regs->r11)
+		return 0;
+
+	/* CS and SS must match SYSRET */
+	if (regs->cs != __USER_CS || regs->ss != __USER_DS)
+		return 0;
+
+	/*
+	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
+	 * in kernel space.  This essentially lets the user take over
+	 * the kernel, since userspace controls RSP.
+	 */
+	if (regs->cx != canonical_address(regs->cx))
+		return 0;
+
+	/*
+	 * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
+	 * restore RF properly. If the slowpath sets it for whatever reason, we
+	 * need to restore it correctly.
+	 *
+	 * SYSRET can restore TF, but unlike IRET, restoring TF results in a
+	 * trap from userspace immediately after SYSRET.  This would cause an
+	 * infinite loop whenever #DB happens with register state that satisfies
+	 * the opportunistic SYSRET conditions.  For example, single-stepping
+	 * this user code:
+	 *
+	 *           movq	$stuck_here, %rcx
+	 *           pushfq
+	 *           popq %r11
+	 *   stuck_here:
+	 *
+	 * would never get past 'stuck_here'.
+	 */
+	if (regs->r11 & (X86_EFLAGS_RF | X86_EFLAGS_TF))
+		return 0;
+
+	return 1;
+}
+
+/* Returns 0 to return using IRET or 1 to return using SYSRET. */
+__visible noinstr int do_syscall_64(struct pt_regs *regs, int nr)
 {
 	add_random_kstack_offset();
 	nr = syscall_enter_from_user_mode(regs, nr);
@@ -84,6 +154,7 @@ __visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
 
 	instrumentation_end();
 	syscall_exit_to_user_mode(regs);
+	return can_sysret(regs);
 }
 #endif
 
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 8b2e19e6c9e1..2d17fca1f503 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -112,85 +112,15 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	movslq	%eax, %rsi
 	call	do_syscall_64		/* returns with IRQs disabled */
 
-	/*
-	 * Try to use SYSRET instead of IRET if we're returning to
-	 * a completely clean 64-bit userspace context.  If we're not,
-	 * go to the slow exit path.
-	 * In the Xen PV case we must use iret anyway.
-	 */
-
-	ALTERNATIVE "", "jmp	swapgs_restore_regs_and_return_to_usermode", \
-		X86_FEATURE_XENPV
-
-	movq	RCX(%rsp), %rcx
-	movq	RIP(%rsp), %r11
-
-	cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
-	jne	swapgs_restore_regs_and_return_to_usermode
+	testl	%eax, %eax
+	jz swapgs_restore_regs_and_return_to_usermode
 
 	/*
-	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
-	 * in kernel space.  This essentially lets the user take over
-	 * the kernel, since userspace controls RSP.
-	 *
-	 * If width of "canonical tail" ever becomes variable, this will need
-	 * to be updated to remain correct on both old and new CPUs.
-	 *
-	 * Change top bits to match most significant bit (47th or 56th bit
-	 * depending on paging mode) in the address.
-	 */
-#ifdef CONFIG_X86_5LEVEL
-	ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
-		"shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
-#else
-	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
-	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
-#endif
-
-	/* If this changed %rcx, it was not canonical */
-	cmpq	%rcx, %r11
-	jne	swapgs_restore_regs_and_return_to_usermode
-
-	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
-	jne	swapgs_restore_regs_and_return_to_usermode
-
-	movq	R11(%rsp), %r11
-	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
-	jne	swapgs_restore_regs_and_return_to_usermode
-
-	/*
-	 * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
-	 * restore RF properly. If the slowpath sets it for whatever reason, we
-	 * need to restore it correctly.
-	 *
-	 * SYSRET can restore TF, but unlike IRET, restoring TF results in a
-	 * trap from userspace immediately after SYSRET.  This would cause an
-	 * infinite loop whenever #DB happens with register state that satisfies
-	 * the opportunistic SYSRET conditions.  For example, single-stepping
-	 * this user code:
-	 *
-	 *           movq	$stuck_here, %rcx
-	 *           pushfq
-	 *           popq %r11
-	 *   stuck_here:
-	 *
-	 * would never get past 'stuck_here'.
-	 */
-	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
-	jnz	swapgs_restore_regs_and_return_to_usermode
-
-	/* nothing to check for RSP */
-
-	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
-	jne	swapgs_restore_regs_and_return_to_usermode
-
-	/*
-	 * We win! This label is here just for ease of understanding
+	 * This label is here just for ease of understanding
 	 * perf profiles. Nothing jumps here.
 	 */
 syscall_return_via_sysret:
-	/* rcx and r11 are already restored (see code above) */
-	POP_REGS pop_rdi=0 skip_r11rcx=1
+	POP_REGS pop_rdi=0
 
 	/*
 	 * Now all regs are restored except RSP and RDI.
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index f7e2d82d24fb..477adea7bac0 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -159,7 +159,7 @@ static inline int syscall_get_arch(struct task_struct *task)
 		? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 }
 
-void do_syscall_64(struct pt_regs *regs, int nr);
+int do_syscall_64(struct pt_regs *regs, int nr);
 void do_int80_syscall_32(struct pt_regs *regs);
 long do_fast_syscall_32(struct pt_regs *regs);
 
-- 
2.19.1.6.gb485710b
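The asm checks deleted above land in a C helper, can_sysret(), whose body is not shown in this excerpt. A rough userspace model of the same logic is sketched below; the constants, the struct layout and the helper internals are illustrative stand-ins, not the kernel's pt_regs or __VIRTUAL_MASK_SHIFT machinery:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative userspace model of the SYSRET-eligibility checks that
 * this patch moves out of entry_64.S.  All names and constants here
 * are stand-ins for the kernel's __USER_CS/__USER_DS/X86_EFLAGS_*.
 */

#define USER_CS		0x33ULL		/* 64-bit user code segment */
#define USER_DS		0x2bULL		/* user data/stack segment */
#define EFLAGS_TF	0x100ULL	/* trap flag */
#define EFLAGS_RF	0x10000ULL	/* resume flag */
#define VADDR_BITS	48		/* 4-level paging; 57 with LA57 */

struct regs {
	uint64_t rcx, r11, rip, eflags, cs, ss;
};

/*
 * Model of the shl/sar trick: sign-extend from the top implemented
 * address bit.  A canonical address comes back unchanged.
 */
static uint64_t canonical(uint64_t addr)
{
	int shift = 64 - VADDR_BITS;

	return (uint64_t)((int64_t)(addr << shift) >> shift);
}

static bool can_sysret(const struct regs *regs)
{
	/* SYSRET reloads RIP from RCX and RFLAGS from R11. */
	if (regs->rcx != regs->rip || regs->r11 != regs->eflags)
		return false;

	/* A non-canonical RCX would make SYSRET #GP in kernel space. */
	if (canonical(regs->rcx) != regs->rcx)
		return false;

	/* CS and SS must match what SYSRET hard-codes. */
	if (regs->cs != USER_CS || regs->ss != USER_DS)
		return false;

	/*
	 * TF would trap immediately after SYSRET (the 'stuck_here' loop
	 * in the deleted comment); SYSRET cannot restore RF properly.
	 */
	if (regs->r11 & (EFLAGS_TF | EFLAGS_RF))
		return false;

	return true;
}
```

The real kernel code selects 48 vs. 57 address bits at patch time via ALTERNATIVE/X86_FEATURE_LA57; the sketch fixes VADDR_BITS for simplicity.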


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 05/24] x86/entry: Introduce __entry_text for entry code written in C
  2021-08-31 17:50 ` [PATCH 05/24] x86/entry: Introduce __entry_text for entry code written in C Lai Jiangshan
@ 2021-08-31 19:34   ` Peter Zijlstra
  2021-09-01  0:23     ` Lai Jiangshan
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2021-08-31 19:34 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin

On Wed, Sep 01, 2021 at 01:50:06AM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Some entry code will be implemented in traps.c.  We need __entry_text
> to set them in .entry.text section.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/entry/traps.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
> index f71db7934f90..ef07ae5fb3a0 100644
> --- a/arch/x86/entry/traps.c
> +++ b/arch/x86/entry/traps.c
> @@ -69,6 +69,11 @@
>  extern unsigned char native_irq_return_iret[];
>  extern unsigned char asm_load_gs_index_gs_change[];
>  
> +/* Entry code written in C. Must be only used in traps.c */
> +#define __entry_text							\
> +	noinline notrace __attribute((__section__(".entry.text")))	\
> +	__no_kcsan __no_sanitize_address __no_profile __no_sanitize_coverage

Urgh, that's bound to get out of sync with noinstr.. How about you make
noinstr something like:

#define __noinstr_section(section) \
	noinline notrace __attribute((__section__(section)))	\
	__no_kcsan __no_sanitize_address __no_profile __no_sanitize_coverage

#define noinstr __noinstr_section(".noinstr.text")

and then write the above like:

#define __entry_text __noinstr_section(".entry.text")




* Re: [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (23 preceding siblings ...)
  2021-08-31 17:50 ` [PATCH 24/24] x86/syscall/64: Move the checking for sysret to C code Lai Jiangshan
@ 2021-08-31 20:44 ` Peter Zijlstra
  2021-09-02  6:28   ` Lai Jiangshan
  2021-09-02 10:50 ` [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in " Lai Jiangshan
  25 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2021-08-31 20:44 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, H. Peter Anvin

On Wed, Sep 01, 2021 at 01:50:01AM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Much of the ASM code in entry_64.S can be rewritten in C if it can be
> written to be non-instrumentable and is called in the right order with
> regard to whether CR3/gsbase has been switched to the kernel CR3/gsbase.
> 
> The patchset converts some of it to C code.

Overall, very nice! There's a lot of details to verify, but I really
like where this ends up.

I'm not sure about moving traps.c, but whatever, that's simple stuff.
I'll try and put some more eyeballs on this later.



* Re: [PATCH 05/24] x86/entry: Introduce __entry_text for entry code written in C
  2021-08-31 19:34   ` Peter Zijlstra
@ 2021-09-01  0:23     ` Lai Jiangshan
  0 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-01  0:23 UTC (permalink / raw)
  To: Peter Zijlstra, Lai Jiangshan
  Cc: linux-kernel, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin



On 2021/9/1 03:34, Peter Zijlstra wrote:
> On Wed, Sep 01, 2021 at 01:50:06AM +0800, Lai Jiangshan wrote:
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>
>> Some entry code will be implemented in traps.c.  We need __entry_text
>> to set them in .entry.text section.
>>
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>>   arch/x86/entry/traps.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
>> index f71db7934f90..ef07ae5fb3a0 100644
>> --- a/arch/x86/entry/traps.c
>> +++ b/arch/x86/entry/traps.c
>> @@ -69,6 +69,11 @@
>>   extern unsigned char native_irq_return_iret[];
>>   extern unsigned char asm_load_gs_index_gs_change[];
>>   
>> +/* Entry code written in C. Must be only used in traps.c */
>> +#define __entry_text							\
>> +	noinline notrace __attribute((__section__(".entry.text")))	\
>> +	__no_kcsan __no_sanitize_address __no_profile __no_sanitize_coverage
> 
> Urgh, that's bound to get out of sync with noinstr.. How about you make
> noinstr something like:
> 
> #define __noinstr_section(section) \
> 	noinline notrace __attribute((__section__(section)))	\
> 	__no_kcsan __no_sanitize_address __no_profile __no_sanitize_coverage
> 
> #define noinstr __noinstr_section(".noinstr.text")
> 
> and then write the above like:
> 
> #define __entry_text __noinstr_section(".entry.text")

It is a good idea.  There was no "__no_profile __no_sanitize_coverage"
when I first made the patch several months ago.  I added them after I
rechecked the definition of "noinstr" yesterday, which means we really
need __noinstr_section() to keep them in sync.

> 


* Re: [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code
  2021-08-31 20:44 ` [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into " Peter Zijlstra
@ 2021-09-02  6:28   ` Lai Jiangshan
  2021-09-02  7:44     ` Peter Zijlstra
  0 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-02  6:28 UTC (permalink / raw)
  To: Peter Zijlstra, Lai Jiangshan
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, H. Peter Anvin



On 2021/9/1 04:44, Peter Zijlstra wrote:
> On Wed, Sep 01, 2021 at 01:50:01AM +0800, Lai Jiangshan wrote:
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>
>> Much of the ASM code in entry_64.S can be rewritten in C if it can be
>> written to be non-instrumentable and is called in the right order with
>> regard to whether CR3/gsbase has been switched to the kernel CR3/gsbase.
>>
>> The patchset converts some of it to C code.
> 
> Overall, very nice! There's a lot of details to verify, but I really
> like where this ends up.
> 
> I'm not sure about moving traps.c, but whatever, that's simple stuff.
> I'll try and put some more eyeballs on this later.
> 

Hello, Peter

How is it going?

Do I need to send a new version?
How can I cooperate or participate in the details?
Do you plan to move the other paths of ASM code into C code?

Thanks
Lai

PS:
And I think [PATCH 12/24] ("x86/traps: Reconstruct pt_regs on task stack
directly in fixup_bad_iret()") is not related to moving ASM to C and is
off-topic for the patchset.  It can be dropped if no one needs it.


* Re: [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code
  2021-09-02  6:28   ` Lai Jiangshan
@ 2021-09-02  7:44     ` Peter Zijlstra
  0 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2021-09-02  7:44 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-kernel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, H. Peter Anvin

On Thu, Sep 02, 2021 at 02:28:43PM +0800, Lai Jiangshan wrote:
> 
> 
> On 2021/9/1 04:44, Peter Zijlstra wrote:
> > On Wed, Sep 01, 2021 at 01:50:01AM +0800, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <laijs@linux.alibaba.com>
> > > 
> > > Much of the ASM code in entry_64.S can be rewritten in C if it can be
> > > written to be non-instrumentable and is called in the right order with
> > > regard to whether CR3/gsbase has been switched to the kernel CR3/gsbase.
> > > 
> > > The patchset converts some of it to C code.
> > 
> > Overall, very nice! There's a lot of details to verify, but I really
> > like where this ends up.
> > 
> > I'm not sure about moving traps.c, but whatever, that's simple stuff.
> > I'll try and put some more eyeballs on this later.
> > 
> 
> Hello, Peter
> 
> How is it going?

Too many patches in the inbox, sorry :/

> Do I need to send a new version?

Not at this time.

> How can I cooperate or participate in the details?

What I meant was that review of these patches will have to check all the
details which will take some time.

> Do you plan to move the other paths of ASM code into C code?

I do have some plans, but currently a total lack of time. Your patches
are a nice contribution.


* Re: [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  2021-08-31 17:50 ` [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/ Lai Jiangshan
@ 2021-09-02  8:09   ` Joerg Roedel
  2021-09-02  9:21     ` Lai Jiangshan
  0 siblings, 1 reply; 72+ messages in thread
From: Joerg Roedel @ 2021-09-02  8:09 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Peter Zijlstra (Intel),
	Daniel Bristot de Oliveira, Brijesh Singh, Andy Shevchenko,
	Arvind Sankar, Chester Lin, Juergen Gross

On Wed, Sep 01, 2021 at 01:50:03AM +0800, Lai Jiangshan wrote:
>  arch/x86/entry/Makefile            | 5 ++++-
>  arch/x86/{kernel => entry}/traps.c | 0
>  arch/x86/kernel/Makefile           | 5 +----
>  3 files changed, 5 insertions(+), 5 deletions(-)
>  rename arch/x86/{kernel => entry}/traps.c (100%)

From looking over the patch-set I didn't spot the reason for putting the
entry C code into traps.c. Can you explain that please? Otherwise I'd
prefer to create a new file, like arch/x86/entry/entry64.c.

Thanks,

	Joerg


* Re: [PATCH 04/24] x86/entry: Expose the address of .Lgs_change to traps.c
  2021-08-31 17:50 ` [PATCH 04/24] x86/entry: Expose the address of .Lgs_change to traps.c Lai Jiangshan
@ 2021-09-02  9:14   ` Peter Zijlstra
  2021-09-02  9:20     ` Lai Jiangshan
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2021-09-02  9:14 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin

On Wed, Sep 01, 2021 at 01:50:05AM +0800, Lai Jiangshan wrote:

> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index e38a4cf795d9..9164e85b36b8 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -729,6 +729,7 @@ _ASM_NOKPROBE(common_interrupt_return)
>  SYM_FUNC_START(asm_load_gs_index)
>  	FRAME_BEGIN
>  	swapgs
> +SYM_INNER_LABEL(asm_load_gs_index_gs_change, SYM_L_GLOBAL)
>  .Lgs_change:

After all the patches, there's only a single .Lgs_change reference left.
Can't we convert the _ASM_EXTABLE() entry to use the new label and
totally remove it?

>  	movl	%edi, %gs
>  2:	ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
> @@ -1011,7 +1012,7 @@ SYM_CODE_START_LOCAL(error_entry)
>  	movl	%ecx, %eax			/* zero extend */
>  	cmpq	%rax, RIP+8(%rsp)
>  	je	.Lbstep_iret
> -	cmpq	$.Lgs_change, RIP+8(%rsp)
> +	cmpq	$asm_load_gs_index_gs_change, RIP+8(%rsp)
>  	jne	.Lerror_entry_done_lfence


* Re: [PATCH 04/24] x86/entry: Expose the address of .Lgs_change to traps.c
  2021-09-02  9:14   ` Peter Zijlstra
@ 2021-09-02  9:20     ` Lai Jiangshan
  0 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-02  9:20 UTC (permalink / raw)
  To: Peter Zijlstra, Lai Jiangshan
  Cc: linux-kernel, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin



On 2021/9/2 17:14, Peter Zijlstra wrote:
> On Wed, Sep 01, 2021 at 01:50:05AM +0800, Lai Jiangshan wrote:
> 
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index e38a4cf795d9..9164e85b36b8 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -729,6 +729,7 @@ _ASM_NOKPROBE(common_interrupt_return)
>>   SYM_FUNC_START(asm_load_gs_index)
>>   	FRAME_BEGIN
>>   	swapgs
>> +SYM_INNER_LABEL(asm_load_gs_index_gs_change, SYM_L_GLOBAL)
>>   .Lgs_change:
> 
> After all the patches, there's only a single .Lgs_change reference left.
> Can't we convert the _ASM_EXTABLE() entry to use the new label and
> totally remove it?

The label .Lgs_change is still needed in ASM code for extable.

I tried it the way you suggested, and the result is:

warning: objtool: __ex_table+0x0: don't know how to handle non-section reloc symbol asm_load_gs_index_gs_change

But never mind, I have already converted load_gs_index into C code.
I will send it later.

> 
>>   	movl	%edi, %gs
>>   2:	ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
>> @@ -1011,7 +1012,7 @@ SYM_CODE_START_LOCAL(error_entry)
>>   	movl	%ecx, %eax			/* zero extend */
>>   	cmpq	%rax, RIP+8(%rsp)
>>   	je	.Lbstep_iret
>> -	cmpq	$.Lgs_change, RIP+8(%rsp)
>> +	cmpq	$asm_load_gs_index_gs_change, RIP+8(%rsp)
>>   	jne	.Lerror_entry_done_lfence


* Re: [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  2021-09-02  8:09   ` Joerg Roedel
@ 2021-09-02  9:21     ` Lai Jiangshan
  2021-09-02 10:50       ` Peter Zijlstra
  0 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-02  9:21 UTC (permalink / raw)
  To: Joerg Roedel, Lai Jiangshan
  Cc: linux-kernel, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Peter Zijlstra (Intel),
	Daniel Bristot de Oliveira, Brijesh Singh, Andy Shevchenko,
	Arvind Sankar, Chester Lin, Juergen Gross



On 2021/9/2 16:09, Joerg Roedel wrote:
> On Wed, Sep 01, 2021 at 01:50:03AM +0800, Lai Jiangshan wrote:
>>   arch/x86/entry/Makefile            | 5 ++++-
>>   arch/x86/{kernel => entry}/traps.c | 0
>>   arch/x86/kernel/Makefile           | 5 +----
>>   3 files changed, 5 insertions(+), 5 deletions(-)
>>   rename arch/x86/{kernel => entry}/traps.c (100%)
> 
>  From looking over the patch-set I didn't spot the reason for putting the
> entry C code into traps.c. Can you explain that please? Otherwise I'd
> prefer to create a new file, like arch/x86/entry/entry64.c.


I agree, and I think Peter is handling it.

Thanks
Lai


* Re: [PATCH 09/24] x86/traps: Add fence_swapgs_{user,kernel}_entry()
  2021-08-31 17:50 ` [PATCH 09/24] x86/traps: Add fence_swapgs_{user,kernel}_entry() Lai Jiangshan
@ 2021-09-02  9:25   ` Peter Zijlstra
  0 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2021-09-02  9:25 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin

On Wed, Sep 01, 2021 at 01:50:10AM +0800, Lai Jiangshan wrote:

> +/*
> + * Mitigate Spectre v1 for conditional swapgs code paths.
> + *
> + * fence_swapgs_user_entry is used in the user entry swapgs code path, to
> + * prevent a speculative swapgs when coming from kernel space.
> + *
> + * fence_swapgs_kernel_entry is used in the kernel entry non-swapgs code path,
> + * to prevent the swapgs from getting speculatively skipped when coming from
> + * user space.
> + */
> +static __always_inline void fence_swapgs_user_entry(void)
> +{
> +	alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_USER);
> +}
> +
> +static __always_inline void fence_swapgs_kernel_entry(void)
> +{
> +	alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL);
> +}

I think slightly larger primitives might make more sense; that is,
include the swapgs in these functions and drop the fence_ prefix.

Something a bit like...

--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -853,20 +853,22 @@ static __always_inline void ist_restore_
 /*
  * Mitigate Spectre v1 for conditional swapgs code paths.
  *
- * fence_swapgs_user_entry is used in the user entry swapgs code path, to
- * prevent a speculative swapgs when coming from kernel space.
+ * swapgs_user_entry is used in the user entry swapgs code path, to prevent a
+ * speculative swapgs when coming from kernel space.
  *
- * fence_swapgs_kernel_entry is used in the kernel entry non-swapgs code path,
- * to prevent the swapgs from getting speculatively skipped when coming from
- * user space.
+ * swapgs_kernel_entry is used in the kernel entry non-swapgs code path, to
+ * prevent the swapgs from getting speculatively skipped when coming from user
+ * space.
  */
-static __always_inline void fence_swapgs_user_entry(void)
+static __always_inline void swapgs_user_entry(void)
 {
+	native_swapgs();
 	alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_USER);
 }
 
-static __always_inline void fence_swapgs_kernel_entry(void)
+static __always_inline void swapgs_kernel_entry(void)
 {
+	native_swapgs();
 	alternative("", "lfence", X86_FEATURE_FENCE_SWAPGS_KERNEL);
 }
 
@@ -896,8 +898,7 @@ struct pt_regs *do_error_entry(struct pt
 		 * We entered from user mode.
 		 * Switch to kernel gsbase and CR3.
 		 */
-		native_swapgs();
-		fence_swapgs_user_entry();
+		swapgs_user_entry();
 		switch_to_kernel_cr3();
 
 		/* Put pt_regs onto the task stack. */
@@ -917,8 +918,7 @@ struct pt_regs *do_error_entry(struct pt
 		 * We came from an IRET to user mode, so we have user
 		 * gsbase and CR3.  Switch to kernel gsbase and CR3:
 		 */
-		native_swapgs();
-		fence_swapgs_user_entry();
+		swapgs_user_entry();
 		switch_to_kernel_cr3();
 
 		/*
@@ -936,8 +936,7 @@ struct pt_regs *do_error_entry(struct pt
 	 * handler with kernel gsbase.
 	 */
 	if (eregs->ip == (unsigned long)asm_load_gs_index_gs_change) {
-		native_swapgs();
-		fence_swapgs_user_entry();
+		swapgs_user_entry();
 	} else {
 		fence_swapgs_kernel_entry();
 	}
@@ -1017,14 +1016,12 @@ static __always_inline unsigned long ist
 	if ((long)gsbase < 0)
 		return 1;
 
-	native_swapgs();
-
 	/*
 	 * The above ist_switch_to_kernel_cr3() doesn't do an unconditional
 	 * CR3 write, even in the PTI case.  So do an lfence to prevent GS
 	 * speculation, regardless of whether PTI is enabled.
 	 */
-	fence_swapgs_kernel_entry();
+	swapgs_kernel_entry();
 
 	/* SWAPGS required on exit */
 	return 0;
@@ -1089,7 +1086,7 @@ void paranoid_exit(struct ist_regs *ist)
 	}
 
 	if (!ist->gsbase)
-		native_swapgs();
+		swapgs_user_entry();
 }
 #endif
 


* Re: [PATCH 11/24] x86/entry: Replace the most of asm code of error_entry to C code
  2021-08-31 17:50 ` [PATCH 11/24] x86/entry: Replace the most of asm code of error_entry to C code Lai Jiangshan
@ 2021-09-02 10:16   ` Peter Zijlstra
  2021-09-02 12:08     ` Lai Jiangshan
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2021-09-02 10:16 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, Joerg Roedel,
	Youquan Song, Tony Luck

On Wed, Sep 01, 2021 at 01:50:12AM +0800, Lai Jiangshan wrote:
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 42d2918f5646..bc9e2f5ad370 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -972,83 +972,14 @@ SYM_CODE_START_LOCAL(error_entry)
>  	cld
>  	PUSH_AND_CLEAR_REGS save_ret=1
>  	ENCODE_FRAME_POINTER 8
>  
>  	popq	%r12				/* save return addr in %12 */
>  	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
> +	call	do_error_entry
>  	movq	%rax, %rsp			/* switch stack */
>  	ENCODE_FRAME_POINTER
>  	pushq	%r12
>  	ret

There's only a single error_entry callsite, which is idtentry_body. One
of the things I wanted to do is change this lot so we change to the
task_stack in 'C', using an adaptation of call_on_irqstack() and
basically don't return from C until we're done with \cfunc.

That is, once we call C, stay there, and don't do this back and forth
between C and asm.

As is, the resulting asm in error_entry is somewhat confusing given that
we sometimes don't actually switch stacks.



* Re: [PATCH 22/24] x86/entry: Implement and use do_paranoid_entry() and paranoid_exit()
  2021-08-31 17:50 ` [PATCH 22/24] x86/entry: Implement and use do_paranoid_entry() and paranoid_exit() Lai Jiangshan
@ 2021-09-02 10:33   ` Peter Zijlstra
  2021-09-02 10:42     ` Lai Jiangshan
  2021-09-02 11:58     ` Lai Jiangshan
  0 siblings, 2 replies; 72+ messages in thread
From: Peter Zijlstra @ 2021-09-02 10:33 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, Joerg Roedel,
	Youquan Song, Tony Luck

On Wed, Sep 01, 2021 at 01:50:23AM +0800, Lai Jiangshan wrote:

> +	call	do_paranoid_entry
>  	ret

That's normally spelled like:

	jmp do_paranoid_entry

But the same comment as for error_entry but more; pretty much all that's
left in asm is things like:


	call paranoid_entry;

	# setup args
	call \cfunc

	call paranoid_exit

which seems like prime material to also pull into C to avoid the
back-and-forth thing. In fact, why can't you call paranoid_entry/exit
from \cfunc itself? The IDT macros should be able to help.



* Re: [PATCH 22/24] x86/entry: Implement and use do_paranoid_entry() and paranoid_exit()
  2021-09-02 10:33   ` Peter Zijlstra
@ 2021-09-02 10:42     ` Lai Jiangshan
  2021-09-02 12:02       ` Peter Zijlstra
  2021-09-02 11:58     ` Lai Jiangshan
  1 sibling, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-02 10:42 UTC (permalink / raw)
  To: Peter Zijlstra, Lai Jiangshan
  Cc: linux-kernel, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Joerg Roedel, Youquan Song,
	Tony Luck



On 2021/9/2 18:33, Peter Zijlstra wrote:
> On Wed, Sep 01, 2021 at 01:50:23AM +0800, Lai Jiangshan wrote:
> 
>> +	call	do_paranoid_entry
>>   	ret
> 
> That's normally spelled like:
> 
> 	jmp do_paranoid_entry
> 
> But the same comment as for error_entry but more; pretty much all that's
> left in asm is things like:
> 
> 
> 	call paranoid_entry;
> 
> 	# setup args
> 	call \cfunc
> 
> 	call paranoid_exit
> 
> which seems like prime material to also pull into C to avoid the
> back-and-forth thing. In fact, why can't you call paranoid_entry/exit
> from \cfunc itself? The IDT macros should be able to help.
> 

That sounds better.

I should have moved the code that pushes pt_regs out of paranoid_entry();
then I would have seen this too.
(And we wouldn't need do_paranoid_entry(); paranoid_entry() itself can be in C.)

The \cfunc would need to be marked as entry_text, right?

Thanks
Lai


* Re: [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  2021-09-02  9:21     ` Lai Jiangshan
@ 2021-09-02 10:50       ` Peter Zijlstra
  2021-09-02 11:54         ` Lai Jiangshan
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2021-09-02 10:50 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Joerg Roedel, Lai Jiangshan, linux-kernel, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Daniel Bristot de Oliveira, Brijesh Singh,
	Andy Shevchenko, Arvind Sankar, Chester Lin, Juergen Gross,
	andrew.cooper3

On Thu, Sep 02, 2021 at 05:21:51PM +0800, Lai Jiangshan wrote:
> 
> 
> On 2021/9/2 16:09, Joerg Roedel wrote:
> > On Wed, Sep 01, 2021 at 01:50:03AM +0800, Lai Jiangshan wrote:
> > >   arch/x86/entry/Makefile            | 5 ++++-
> > >   arch/x86/{kernel => entry}/traps.c | 0
> > >   arch/x86/kernel/Makefile           | 5 +----
> > >   3 files changed, 5 insertions(+), 5 deletions(-)
> > >   rename arch/x86/{kernel => entry}/traps.c (100%)
> > 
> >  From looking over the patch-set I didn't spot the reason for putting the
> > entry C code into traps.c. Can you explain that please? Otherwise I'd
> > prefer to create a new file, like arch/x86/entry/entry64.c.
> 
> 
> I agree, and I think Peter is handling it.

I don't think I said that. I said I like the patches but there's a lot
of scary code and details to review, which takes time.

I've now done a second reading of the patches and provided some more
feedback, but I'm very sure I didn't get to the scary details yet.

One thing that got pointed out (by Andrew Cooper) is that by moving the
whole SWAPGS trainwreck to C it becomes entirely 'trivial' to sneak in
an 'accidental' per-cpu reference before having done the SWAPGS dance.

I'm myself not (yet) convinced that's a good enough reason to leave it
in ASM, but it does certainly need consideration.


* [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in C code
  2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
                   ` (24 preceding siblings ...)
  2021-08-31 20:44 ` [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into " Peter Zijlstra
@ 2021-09-02 10:50 ` Lai Jiangshan
  2021-09-08  1:38   ` H. Peter Anvin
  2021-09-13 20:01   ` Andy Lutomirski
  25 siblings, 2 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-02 10:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Dave Hansen,
	Peter Zijlstra, Dave Jiang, Ben Widawsky, Dan Williams,
	Arvind Sankar

From: Lai Jiangshan <laijs@linux.alibaba.com>

There is no constraint or limitation that forces native_load_gs_index()
to be in ASM code.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/entry_64.S            | 36 ----------------------------
 arch/x86/entry/traps.c               | 25 +++++++++++++++++++
 arch/x86/include/asm/special_insns.h | 11 +--------
 arch/x86/mm/extable.c                | 15 ++++++++++++
 4 files changed, 41 insertions(+), 46 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 2d17fca1f503..16ee481e9b5f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -659,42 +659,6 @@ native_irq_return_ldt:
 SYM_CODE_END(common_interrupt_return)
 _ASM_NOKPROBE(common_interrupt_return)
 
-/*
- * Reload gs selector with exception handling
- * edi:  new selector
- *
- * Is in entry.text as it shouldn't be instrumented.
- */
-SYM_FUNC_START(asm_load_gs_index)
-	FRAME_BEGIN
-	swapgs
-SYM_INNER_LABEL(asm_load_gs_index_gs_change, SYM_L_GLOBAL)
-.Lgs_change:
-	movl	%edi, %gs
-2:	ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
-	swapgs
-	FRAME_END
-	ret
-SYM_FUNC_END(asm_load_gs_index)
-EXPORT_SYMBOL(asm_load_gs_index)
-
-	_ASM_EXTABLE(.Lgs_change, .Lbad_gs)
-	.section .fixup, "ax"
-	/* running with kernelgs */
-SYM_CODE_START_LOCAL_NOALIGN(.Lbad_gs)
-	swapgs					/* switch back to user gs */
-.macro ZAP_GS
-	/* This can't be a string because the preprocessor needs to see it. */
-	movl $__USER_DS, %eax
-	movl %eax, %gs
-.endm
-	ALTERNATIVE "", "ZAP_GS", X86_BUG_NULL_SEG
-	xorl	%eax, %eax
-	movl	%eax, %gs
-	jmp	2b
-SYM_CODE_END(.Lbad_gs)
-	.previous
-
 #ifdef CONFIG_XEN_PV
 /*
  * A note on the "critical region" in our callback handler.
diff --git a/arch/x86/entry/traps.c b/arch/x86/entry/traps.c
index 52511db6baa6..2a17d74569fd 100644
--- a/arch/x86/entry/traps.c
+++ b/arch/x86/entry/traps.c
@@ -679,6 +679,31 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 }
 
 #ifdef CONFIG_X86_64
+
+/*
+ * Reload gs selector with exception handling
+ * selector:  new selector
+ *
+ * Is noinstr as it shouldn't be instrumented.
+ */
+noinstr void native_load_gs_index(unsigned int selector)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	native_swapgs();
+	asm volatile(
+		".global asm_load_gs_index_gs_change \n"
+		"asm_load_gs_index_gs_change: \n"
+		"1: movl %0, %%gs \n"
+		"   swapgs \n"
+		"2: \n"
+		_ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_clear_gs)
+		:: "r" (selector) : "memory");
+	alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
+	local_irq_restore(flags);
+}
+
 /*
  * Help handler running on a per-cpu (IST or entry trampoline) stack
  * to switch to the normal thread stack if the interrupted code was in
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 058995bb153c..c5d74ebce527 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -120,16 +120,7 @@ static inline void native_wbinvd(void)
 	asm volatile("wbinvd": : :"memory");
 }
 
-extern asmlinkage void asm_load_gs_index(unsigned int selector);
-
-static inline void native_load_gs_index(unsigned int selector)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	asm_load_gs_index(selector);
-	local_irq_restore(flags);
-}
+extern asmlinkage void native_load_gs_index(unsigned int selector);
 
 static inline unsigned long __read_cr4(void)
 {
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index e1664e9f969c..695a580a652a 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -138,6 +138,21 @@ __visible bool ex_handler_clear_fs(const struct exception_table_entry *fixup,
 }
 EXPORT_SYMBOL(ex_handler_clear_fs);
 
+__visible bool ex_handler_clear_gs(const struct exception_table_entry *fixup,
+				   struct pt_regs *regs, int trapnr,
+				   unsigned long error_code,
+				   unsigned long fault_addr)
+{
+	/*
+	 * The following native_load_gs_index() should not fault, otherwise
+	 * it would be infinite recursion.
+	 */
+	if (static_cpu_has(X86_BUG_NULL_SEG))
+		native_load_gs_index(__USER_DS);
+	native_load_gs_index(0);
+	return ex_handler_default(fixup, regs, trapnr, error_code, fault_addr);
+}
+
 enum handler_type ex_get_fault_handler_type(unsigned long ip)
 {
 	const struct exception_table_entry *e;
-- 
2.19.1.6.gb485710b



* Re: [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  2021-09-02 10:50       ` Peter Zijlstra
@ 2021-09-02 11:54         ` Lai Jiangshan
  2021-09-02 12:05           ` Peter Zijlstra
  0 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-02 11:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joerg Roedel, Lai Jiangshan, linux-kernel, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Daniel Bristot de Oliveira, Brijesh Singh,
	Andy Shevchenko, Arvind Sankar, Chester Lin, Juergen Gross,
	andrew.cooper3



On 2021/9/2 18:50, Peter Zijlstra wrote:
> On Thu, Sep 02, 2021 at 05:21:51PM +0800, Lai Jiangshan wrote:
>>
>>
>> On 2021/9/2 16:09, Joerg Roedel wrote:
>>> On Wed, Sep 01, 2021 at 01:50:03AM +0800, Lai Jiangshan wrote:
>>>>    arch/x86/entry/Makefile            | 5 ++++-
>>>>    arch/x86/{kernel => entry}/traps.c | 0
>>>>    arch/x86/kernel/Makefile           | 5 +----
>>>>    3 files changed, 5 insertions(+), 5 deletions(-)
>>>>    rename arch/x86/{kernel => entry}/traps.c (100%)
>>>
>>>   From looking over the patch-set I didn't spot the reason for putting the
>>> entry C code into traps.c. Can you explain that please? Otherwise I'd
>>> prefer to create a new file, like arch/x86/entry/entry64.c.
>>
>>
>> I agree, and I think Peter is handling it.
> 
> I don't think I said that. I said I like the patches but there's a lot
> of scary code and details to review, which takes time.
> 
> I've now done a second reading of the patches and provided some more
> feedback, but I'm very sure I didn't get to the scary details yet.
> 
> One thing that got pointed out (by Andrew Cooper) is that by moving the
> whole SWAPGS trainwreck to C it becomes entirely 'trivial' to sneak in
> an 'accidental' per-cpu reference before having done the SWAPGS dance.
> 
> I'm myself not (yet) convinced that's a good enough reason to leave it
> in ASM, but it does certainly need consideration.
> 


It is a real concern, and it proves that my putting the C code in traps.c
was totally wrong.

To relieve the concern, I think the C code can be put into a single file, like
arch/x86/entry/entry64.c, and be documented as being as critical and dangerous
as entry_64.S, so that anyone takes no less care when modifying/reviewing it
than when modifying/reviewing entry_64.S.

And all the other users of native_swapgs can be moved to this file too, such as
__[rd|wr]gsbase_inactive().

A noinstr function can sometimes have 'accidental' instrumentation sneak in.
For example, stack-protector instruments many noinstr functions now if the
config is enabled.  It is normally OK, and gcc now provides per-function
control over it.

But the entry C code can not be instrumented in any way.  For example,
stack-protector would add a per-cpu reference before the SWAPGS dance has
been done.  Entry C code requires a stronger limitation than noinstr code.
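To make the hazard concrete, this is roughly the prologue gcc emits for a
stack-protected function on x86_64 (a hedged sketch of typical compiler
output, not code from this series; the canary offset and symbol names are
illustrative):

```asm
func:
	pushq	%rbp
	movq	%rsp, %rbp
	movq	%gs:40, %rax	# compiler-inserted per-cpu canary load;
	movq	%rax, -8(%rbp)	# the C source never wrote this access.
	...			# If this runs before SWAPGS, %gs still
				# holds the user GS base.
```

This is exactly the kind of implicit %gs-relative reference that a checker
(objtool or otherwise) would need to flag in pre-SWAPGS code.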

By the way, can objtool check for such per-cpu references?


* Re: [PATCH 22/24] x86/entry: Implement and use do_paranoid_entry() and paranoid_exit()
  2021-09-02 10:33   ` Peter Zijlstra
  2021-09-02 10:42     ` Lai Jiangshan
@ 2021-09-02 11:58     ` Lai Jiangshan
  2021-09-02 12:29       ` Joerg Roedel
  1 sibling, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-02 11:58 UTC (permalink / raw)
  To: Peter Zijlstra, Lai Jiangshan
  Cc: linux-kernel, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Joerg Roedel, Youquan Song,
	Tony Luck



On 2021/9/2 18:33, Peter Zijlstra wrote:
> On Wed, Sep 01, 2021 at 01:50:23AM +0800, Lai Jiangshan wrote:
> 
>> +	call	do_paranoid_entry
>>   	ret
> 
> That's normally spelled like:
> 
> 	jmp do_paranoid_entry
> 
> But the same comment as for error_entry but more; pretty much all that's
> left in asm is things like:
> 
> 
> 	call paranoid_entry;
> 
> 	# setup args
> 	call \cfunc
> 
> 	call paranoid_exit
> 
> which seems like prime material to also pull into C to avoid the
> back-and-forth thing. In fact, why can't you call paranoid_entry/exit
> from \cfunc itself? The IDT macros should be able to help.
> 

Oh, #VC will need to switch stacks.  I think we need ASM code to switch
the stack since the original stack needs to be "free" for the next #VC.


* Re: [PATCH 22/24] x86/entry: Implement and use do_paranoid_entry() and paranoid_exit()
  2021-09-02 10:42     ` Lai Jiangshan
@ 2021-09-02 12:02       ` Peter Zijlstra
  0 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2021-09-02 12:02 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, linux-kernel, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, Joerg Roedel,
	Youquan Song, Tony Luck

On Thu, Sep 02, 2021 at 06:42:16PM +0800, Lai Jiangshan wrote:
> On 2021/9/2 18:33, Peter Zijlstra wrote:
> > On Wed, Sep 01, 2021 at 01:50:23AM +0800, Lai Jiangshan wrote:
> > 
> > > +	call	do_paranoid_entry
> > >   	ret
> > 
> > That's normally spelled like:
> > 
> > 	jmp do_paranoid_entry
> > 
> > But the same comment as for error_entry but more; pretty much all that's
> > left in asm is things like:
> > 
> > 
> > 	call paranoid_entry;
> > 
> > 	# setup args
> > 	call \cfunc
> > 
> > 	call paranoid_exit
> > 
> > which seems like prime material to also pull into C to avoid the
> > back-and-forth thing. In fact, why can't you call paranoid_entry/exit
> > from \cfunc itself? The IDT macros should be able to help.
> > 
> 
> It sounds better.
> 
> I should have moved the code of pushing pt_regs out of paranoid_entry(),
> so that I could also have seen this.
> (and we don't need do_paranoid_entry(), paranoid_entry() itself can be in C)
> 
> The \cfunc would need to be marked as entry_text, right?

Yes, I think so. The distinction between .entry.text and .noinstr.text is
that the former is mapped into the userspace mapping, while the latter
is not. Seeing how the PTI swizzling still has to happen when calling
cfunc, that had better be .entry.text.

If we care about a strict minimum of .entry.text, the IDT macros can
generate a noinstr function to be called the moment we've done the PTI
munging, I suppose.


* Re: [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  2021-09-02 11:54         ` Lai Jiangshan
@ 2021-09-02 12:05           ` Peter Zijlstra
  2021-09-02 13:34             ` Peter Zijlstra
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2021-09-02 12:05 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Joerg Roedel, Lai Jiangshan, linux-kernel, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Daniel Bristot de Oliveira, Brijesh Singh,
	Andy Shevchenko, Arvind Sankar, Chester Lin, Juergen Gross,
	andrew.cooper3

On Thu, Sep 02, 2021 at 07:54:25PM +0800, Lai Jiangshan wrote:
> For example, stack-protector instruments many noinstr functions now
> if the config is enabled.  It is normally OK, and gcc now provides
> per-function control over it.

IIRC the latest compilers have an attribute for this too, which really
should be added to noinstr. Lemme go find.


* Re: [PATCH 11/24] x86/entry: Replace the most of asm code of error_entry to C code
  2021-09-02 10:16   ` Peter Zijlstra
@ 2021-09-02 12:08     ` Lai Jiangshan
  0 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-02 12:08 UTC (permalink / raw)
  To: Peter Zijlstra, Lai Jiangshan
  Cc: linux-kernel, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Joerg Roedel, Youquan Song,
	Tony Luck



On 2021/9/2 18:16, Peter Zijlstra wrote:
> On Wed, Sep 01, 2021 at 01:50:12AM +0800, Lai Jiangshan wrote:
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index 42d2918f5646..bc9e2f5ad370 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -972,83 +972,14 @@ SYM_CODE_START_LOCAL(error_entry)
>>   	cld
>>   	PUSH_AND_CLEAR_REGS save_ret=1
>>   	ENCODE_FRAME_POINTER 8
>>   
>>   	popq	%r12				/* save return addr in %12 */
>>   	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
>> +	call	do_error_entry
>>   	movq	%rax, %rsp			/* switch stack */
>>   	ENCODE_FRAME_POINTER
>>   	pushq	%r12
>>   	ret
> 
> There's only a single error_entry callsite, which is idtentry_body. One
> of the things I wanted to do is change this lot so we change to the
> task_stack in 'C', using an adaptation of call_on_irqstack() and
> basically don't return frrom C until we're done with \cfunc.
> 
> That is, once we call C, stay there, and don't do this back and forth
> between C and asm.

I haven't figured out how an adaptation of call_on_irqstack() can do it.
The original stack needs to be "free" for the next task.  And we can't switch
the stack before error_entry() since the CR3 is not switched yet.

I believe the ASM code here can be simplified and made clearer. But I don't
think going back and forth between C and ASM is a real issue if the ASM code
is short and simple enough.


> 
> As is, the resulting asm in error_entry is somewhat confusing given that
> we sometimes don't actually switch stacks.
> 


* Re: [PATCH 22/24] x86/entry: Implement and use do_paranoid_entry() and paranoid_exit()
  2021-09-02 11:58     ` Lai Jiangshan
@ 2021-09-02 12:29       ` Joerg Roedel
  0 siblings, 0 replies; 72+ messages in thread
From: Joerg Roedel @ 2021-09-02 12:29 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Peter Zijlstra, Lai Jiangshan, linux-kernel, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Youquan Song, Tony Luck

On Thu, Sep 02, 2021 at 07:58:51PM +0800, Lai Jiangshan wrote:
> Oh, #VC will need to switch stacks.  I think we need ASM code to switch
> the stack since the original stack needs to be "free" for the next #VC.

Right, #VC switches off its IST stack to make it free for subsequent
#VCs. The switch mostly happens in C code already, see
vc_switch_off_ist(); only the actual RSP update is still in ASM.

Regards,

	Joerg


* Re: [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  2021-09-02 12:05           ` Peter Zijlstra
@ 2021-09-02 13:34             ` Peter Zijlstra
  2021-09-02 17:05               ` Nick Desaulniers
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2021-09-02 13:34 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Joerg Roedel, Lai Jiangshan, linux-kernel, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Daniel Bristot de Oliveira, Brijesh Singh,
	Andy Shevchenko, Arvind Sankar, Chester Lin, Juergen Gross,
	andrew.cooper3, ndesaulniers

On Thu, Sep 02, 2021 at 02:05:41PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 02, 2021 at 07:54:25PM +0800, Lai Jiangshan wrote:
> > For example, stack-protector instruments many noinstr functions now
> > if the config is enabled.  It is normally OK, and gcc now provides
> > per-function control over it.
> 
> IIRC the latest compilers have an attribute for this too, which really
> should be added to noinstr. Lemme go find.

Something like so... Nick ?

diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
index 49b0ac8b6fd3..6a15c3d8df5c 100644
--- a/include/linux/compiler-clang.h
+++ b/include/linux/compiler-clang.h
@@ -62,6 +62,12 @@
 #define __no_sanitize_coverage
 #endif
 
+#if defined(CONFIG_STACKPROTECTOR)
+#define __no_stack_protector __attribute__((no_stack_protector))
+#else
+#define __no_stack_protector
+#endif
+
 /*
  * Not all versions of clang implement the type-generic versions
  * of the builtin overflow checkers. Fortunately, clang implements
diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index cb9217fc60af..7ac0a3f11ba9 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -128,6 +128,12 @@
 #define __no_sanitize_coverage
 #endif
 
+#if defined(CONFIG_STACKPROTECTOR) && __has_attribute__(__no_stack_protector__)
+#define __no_stack_protector __attribute__((no_stack_protector))
+#else
+#define __no_stack_protector
+#endif
+
 #if GCC_VERSION >= 50100
 #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1
 #endif
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index e4ea86fc584d..5ae1c08dba8d 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -210,7 +210,8 @@ struct ftrace_likely_data {
 /* Section for code which can't be instrumented at all */
 #define noinstr								\
 	noinline notrace __attribute((__section__(".noinstr.text")))	\
-	__no_kcsan __no_sanitize_address __no_profile __no_sanitize_coverage
+	__no_kcsan __no_sanitize_address __no_profile __no_sanitize_coverage \
+	__no_stack_protector
 
 #endif /* __KERNEL__ */
 


* Re: [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  2021-09-02 13:34             ` Peter Zijlstra
@ 2021-09-02 17:05               ` Nick Desaulniers
  2021-09-02 17:19                 ` Miguel Ojeda
  2021-09-03  7:36                 ` Martin Liška
  0 siblings, 2 replies; 72+ messages in thread
From: Nick Desaulniers @ 2021-09-02 17:05 UTC (permalink / raw)
  To: Peter Zijlstra, Miguel Ojeda, Martin Liška
  Cc: Lai Jiangshan, Joerg Roedel, Lai Jiangshan, linux-kernel,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin, Daniel Bristot de Oliveira, Brijesh Singh,
	Andy Shevchenko, Arvind Sankar, Chester Lin, Juergen Gross,
	andrew.cooper3, linux-toolchains

On Thu, Sep 2, 2021 at 6:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Sep 02, 2021 at 02:05:41PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 02, 2021 at 07:54:25PM +0800, Lai Jiangshan wrote:
> > > For example, stack-protector instruments many noinstr functions now
> > > if the config is enabled.  It is normally OK, and gcc now provides
> > > per-function control over it.
> >
> > IIRC the latest compilers have an attribute for this too, which really
> > should be added to noinstr. Lemme go find.
>
> Something like so... Nick ?
>
> diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
> index 49b0ac8b6fd3..6a15c3d8df5c 100644
> --- a/include/linux/compiler-clang.h
> +++ b/include/linux/compiler-clang.h
> @@ -62,6 +62,12 @@
>  #define __no_sanitize_coverage
>  #endif
>
> +#if defined(CONFIG_STACKPROTECTOR)
> +#define __no_stack_protector __attribute__((no_stack_protector))
> +#else
> +#define __no_stack_protector
> +#endif
> +
>  /*
>   * Not all versions of clang implement the type-generic versions
>   * of the builtin overflow checkers. Fortunately, clang implements
> diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> index cb9217fc60af..7ac0a3f11ba9 100644
> --- a/include/linux/compiler-gcc.h
> +++ b/include/linux/compiler-gcc.h
> @@ -128,6 +128,12 @@
>  #define __no_sanitize_coverage
>  #endif
>
> +#if defined(CONFIG_STACKPROTECTOR) && __has_attribute__(__no_stack_protector__)
> +#define __no_stack_protector __attribute__((no_stack_protector))
> +#else
> +#define __no_stack_protector
> +#endif
> +

The above 2 hunks should go in include/linux/compiler_attributes.h,
but yes.  I'd been meaning to send such a patch; it's nice to have
finer-grained function-level control over -fno-stack-protector, which
significantly hurts ergonomics for things like LTO.  IIRC GCC only
added the attribute recently, in the 10.X release, so it might be too
new to rely on quite yet.

>  #if GCC_VERSION >= 50100
>  #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1
>  #endif
> diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
> index e4ea86fc584d..5ae1c08dba8d 100644
> --- a/include/linux/compiler_types.h
> +++ b/include/linux/compiler_types.h
> @@ -210,7 +210,8 @@ struct ftrace_likely_data {
>  /* Section for code which can't be instrumented at all */
>  #define noinstr                                                                \
>         noinline notrace __attribute((__section__(".noinstr.text")))    \
> -       __no_kcsan __no_sanitize_address __no_profile __no_sanitize_coverage
> +       __no_kcsan __no_sanitize_address __no_profile __no_sanitize_coverage \
> +       __no_stack_protector
>
>  #endif /* __KERNEL__ */
>


-- 
Thanks,
~Nick Desaulniers


* Re: [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  2021-09-02 17:05               ` Nick Desaulniers
@ 2021-09-02 17:19                 ` Miguel Ojeda
  2021-09-02 17:23                   ` Nick Desaulniers
  2021-09-03  7:36                 ` Martin Liška
  1 sibling, 1 reply; 72+ messages in thread
From: Miguel Ojeda @ 2021-09-02 17:19 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Peter Zijlstra, Miguel Ojeda, Martin Liška, Lai Jiangshan,
	Joerg Roedel, Lai Jiangshan, linux-kernel, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Daniel Bristot de Oliveira, Brijesh Singh,
	Andy Shevchenko, Arvind Sankar, Chester Lin, Juergen Gross,
	Andrew Cooper, linux-toolchains

On Thu, Sep 2, 2021 at 7:05 PM Nick Desaulniers <ndesaulniers@google.com> wrote:
>
> The above 2 hunks should go in include/linux/compiler_attributes.h,
> but yes.  I'd been meaning to send such a patch; it's nice to have

Note that `compiler_attributes.h` does not keep attributes that depend
on config options.

On one hand, I feel we should put them there. On the other hand, I
want to avoid making a mess again since the purpose of the file is to
keep things clean for the majority of attributes.

Perhaps we should do a middle-ground, and only allow those that depend
on a single config option, i.e. no nesting `#if`s or complex
conditions.

Cheers,
Miguel


* Re: [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  2021-09-02 17:19                 ` Miguel Ojeda
@ 2021-09-02 17:23                   ` Nick Desaulniers
  0 siblings, 0 replies; 72+ messages in thread
From: Nick Desaulniers @ 2021-09-02 17:23 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Peter Zijlstra, Miguel Ojeda, Martin Liška, Lai Jiangshan,
	Joerg Roedel, Lai Jiangshan, linux-kernel, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Daniel Bristot de Oliveira, Brijesh Singh,
	Andy Shevchenko, Arvind Sankar, Chester Lin, Juergen Gross,
	Andrew Cooper, linux-toolchains

On Thu, Sep 2, 2021 at 10:19 AM Miguel Ojeda
<miguel.ojeda.sandonis@gmail.com> wrote:
>
> On Thu, Sep 2, 2021 at 7:05 PM Nick Desaulniers <ndesaulniers@google.com> wrote:
> >
> > The above 2 hunks should go in include/linux/compiler_attributes.h,
> > but yes.  I'd been meaning to send such a patch; it's nice to have
>
> Note that `compiler_attributes.h` does not keep attributes that depend
> on config options.

Sure, I'd drop the config check and define it conditionally on the
__has_attribute check alone. Does it hurt to mark functions as
__attribute__((no_stack_protector)) when we're not building with
-fstack-protector*? Nope!

> On one hand, I feel we should put them there. On the other hand, I
> want to avoid making a mess again since the purpose of the file is to
> keep things clean for the majority of attributes.
>
> Perhaps we should do a middle-ground, and only allow those that depend
> on a single config option, i.e. no nesting `#if`s or complex
> conditions.
-- 
Thanks,
~Nick Desaulniers


* Re: [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  2021-09-02 17:05               ` Nick Desaulniers
  2021-09-02 17:19                 ` Miguel Ojeda
@ 2021-09-03  7:36                 ` Martin Liška
  2021-09-07 21:12                   ` Nick Desaulniers
  1 sibling, 1 reply; 72+ messages in thread
From: Martin Liška @ 2021-09-03  7:36 UTC (permalink / raw)
  To: Nick Desaulniers, Peter Zijlstra, Miguel Ojeda
  Cc: Lai Jiangshan, Joerg Roedel, Lai Jiangshan, linux-kernel,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin, Daniel Bristot de Oliveira, Brijesh Singh,
	Andy Shevchenko, Arvind Sankar, Chester Lin, Juergen Gross,
	andrew.cooper3, linux-toolchains

On 9/2/21 19:05, Nick Desaulniers wrote:
> IIRC GCC only
> added the attribute recently in the 10.X release, so it might be too
> new to rely on quite yet.

The no_stack_protector attribute was actually added in the GCC 11.x release:
https://gcc.gnu.org/gcc-11/changes.html

Note the compiler is definitely used by Fedora, openSUSE Tumbleweed
and other cutting edge distributions.

Cheers,
Martin


* Re: [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  2021-09-03  7:36                 ` Martin Liška
@ 2021-09-07 21:12                   ` Nick Desaulniers
  2021-09-08  7:33                     ` Martin Liška
  0 siblings, 1 reply; 72+ messages in thread
From: Nick Desaulniers @ 2021-09-07 21:12 UTC (permalink / raw)
  To: Martin Liška
  Cc: Peter Zijlstra, Miguel Ojeda, Lai Jiangshan, Joerg Roedel,
	Lai Jiangshan, linux-kernel, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Daniel Bristot de Oliveira, Brijesh Singh, Andy Shevchenko,
	Arvind Sankar, Chester Lin, Juergen Gross, andrew.cooper3,
	linux-toolchains

On Fri, Sep 3, 2021 at 12:36 AM Martin Liška <mliska@suse.cz> wrote:
>
> On 9/2/21 19:05, Nick Desaulniers wrote:
> > IIRC GCC only
> > added the attribute recently in the 10.X release, so it might be too
> > new to rely on quite yet.
>
> The no_stack_protector attribute was actually added in the GCC 11.x release:
> https://gcc.gnu.org/gcc-11/changes.html

Ah right, though that lends more weight to the concern that this feature
is still too new to rely on quite yet.  Martin, do you know what happens
with regard to inlining when the callee and caller mismatch on this
function attribute in GCC?  This is very much a problem for the
kernel.

> Note the compiler is definitely used by Fedora, openSUSE Tumbleweed
> and other cutting edge distributions.

The kernel supports GCC 4.9+ currently. This feature can only be emulated
with the coarse-grained -fno-stack-protector (or gnu_inline with an
out-of-line assembler definition... :( ).
-- 
Thanks,
~Nick Desaulniers


* Re: [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in C code
  2021-09-02 10:50 ` [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in " Lai Jiangshan
@ 2021-09-08  1:38   ` H. Peter Anvin
  2021-09-08  4:42     ` H. Peter Anvin
  2021-09-13 20:01   ` Andy Lutomirski
  1 sibling, 1 reply; 72+ messages in thread
From: H. Peter Anvin @ 2021-09-08  1:38 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, Dave Hansen, Peter Zijlstra, Dave Jiang,
	Ben Widawsky, Dan Williams, Arvind Sankar

On 9/2/21 3:50 AM, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> There is no constraint/limitation forcing native_load_gs_index() to be in
> ASM code.

Hi,

First of all, let me say I really like your patchset, and I will try to
review it in detail ASAP (most of the initial read pass looks very sane
to me).

However, I would like to object in part to this specific patch. It adds a
fair bit of extra code to the exception path, and adds jumps between
files which makes the code much harder to read.

You end up doing one swapgs in assembly and one in C, which would seem
to be a very good indication that really isn't an improvement.

Note that this entire sequence is scheduled to be obsoleted by a single
atomic hardware instruction, LKGS, which will replace ALL of
native_load_gs_index(); it will no longer be necessary even to disable
interrupts as there is no non-atomic state. In that sense, doing this as
an out-of-line C function (with some inline assembly) is great, because
it makes it far easier to use LKGS as an alternative; the only (small)
disadvantage is that it ends up clobbering a lot of registers
unnecessarily (in assembly it can be implemented clobbering only two
registers; one if one uses pushf/popf to save the interrupt flag.)

noinstr void native_load_gs_index(unsigned int selector)
{
	unsigned long flags;

	local_irq_save(flags);
	native_swapgs();
redo:
	asm goto("1: movl %0,%%gs\n",
		   "2:\n"
		   _ASM_EXTABLE(%1)
		   : : "r" (selector) : "i" (&&bad_seg);
	alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
	native_swapgs();
	local_irq_restore(flags);
	return;

bad_seg:
	/* Exception entry will have restored kernel GS */
	native_swapgs();
	
	if (static_cpu_has(X86_BUG_NULL_SEG)) {
		asm volatile("movl %0,%%gs"
			: : "r" (__USER_DS));
	}
	selector = 0;
	goto redo;
}

	-hpa


* Re: [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in C code
  2021-09-08  1:38   ` H. Peter Anvin
@ 2021-09-08  4:42     ` H. Peter Anvin
  2021-09-08  5:00       ` H. Peter Anvin
  0 siblings, 1 reply; 72+ messages in thread
From: H. Peter Anvin @ 2021-09-08  4:42 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, Dave Hansen, Peter Zijlstra, Dave Jiang,
	Ben Widawsky, Dan Williams, Arvind Sankar



On 9/7/21 6:38 PM, H. Peter Anvin wrote:
> On 9/2/21 3:50 AM, Lai Jiangshan wrote:
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>
>> There is no constraint/limitation forcing native_load_gs_index() to be in
>> ASM code.
> 
> Hi,
> 
> First of all, let me say I really like your patchset, and I will try to
> review it in detail ASAP (most of the initial read pass looks very sane
> to me).
> 
> However, I would like to object in part to this specific patch. It adds a
> fair bit of extra code to the exception path, and adds jumps between
> files which makes the code much harder to read.
> 
> You end up doing one swapgs in assembly and one in C, which would seem
> to be a very good indication that really isn't an improvement.
> 
> Note that this entire sequence is scheduled to be obsoleted by a single
> atomic hardware instruction, LKGS, which will replace ALL of
> native_load_gs_index(); it will no longer be necessary even to disable
> interrupts as there is no non-atomic state. In that sense, doing this as
> an out-of-line C function (with some inline assembly) is great, because
> it makes it far easier to use LKGS as an alternative; the only (small)
> disadvantage is that it ends up clobbering a lot of registers
> unnecessarily (in assembly it can be implemented clobbering only two
> registers; one if one uses pushf/popf to save the interrupt flag.)
> 

OK, here is a version which actually compiles:

(It makes me wonder if it would be useful to have an _ASM_EXTABLE_GOTO() 
macro to highlight the use of an asm goto C label.)

noinstr void native_load_gs_index(unsigned int selector)
{
	unsigned long flags;

	local_irq_save(flags);
	native_swapgs();

do_mov_gs:
	asm_volatile_goto("1: mov %[seg],%%gs\n"
			  "2:\n"
			  _ASM_EXTABLE(1b,%l[bad_seg])
			  : : [seg] "r" (selector) : : bad_seg);

	alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
	native_swapgs();
	local_irq_restore(flags);
	return;

bad_seg:
	selector = 0;

	/* The exception dispatch will have restored kernel GS */
	native_swapgs();

	if (static_cpu_has_bug(X86_BUG_NULL_SEG))
		asm volatile("mov %[seg],%%gs"
			     : : [seg] "r" (__USER_DS));
	goto do_mov_gs;
}



* Re: [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in C code
  2021-09-08  4:42     ` H. Peter Anvin
@ 2021-09-08  5:00       ` H. Peter Anvin
  2021-09-08  7:12         ` Lai Jiangshan
  0 siblings, 1 reply; 72+ messages in thread
From: H. Peter Anvin @ 2021-09-08  5:00 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, Dave Hansen, Peter Zijlstra, Dave Jiang,
	Ben Widawsky, Dan Williams, Arvind Sankar

On 9/7/21 9:42 PM, H. Peter Anvin wrote:
> 
> 
> On 9/7/21 6:38 PM, H. Peter Anvin wrote:
>> On 9/2/21 3:50 AM, Lai Jiangshan wrote:
>>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>>
>>> There is no constraint/limitation forcing native_load_gs_index() to be in
>>> ASM code.
>>
>> Hi,
>>
>> First of all, let me say I really like your patchset, and I will try to
>> review it in detail ASAP (most of the initial read pass looks very sane
>> to me).
>>
>> However, I would like to object in part to this specific patch. It adds a
>> fair bit of extra code to the exception path, and adds jumps between
>> files which makes the code much harder to read.
>>
>> You end up doing one swapgs in assembly and one in C, which would seem
>> to be a very good indication that really isn't an improvement.
>>
>> Note that this entire sequence is scheduled to be obsoleted by a single
>> atomic hardware instruction, LKGS, which will replace ALL of
>> native_load_gs_index(); it will no longer be necessary even to disable
>> interrupts as there is no non-atomic state. In that sense, doing this as
>> an out-of-line C function (with some inline assembly) is great, because
>> it makes it far easier to use LKGS as an alternative; the only (small)
>> disadvantage is that it ends up clobbering a lot of registers
>> unnecessarily (in assembly it can be implemented clobbering only two
>> registers; one if one uses pushf/popf to save the interrupt flag.)
>>
> 
> OK, here is a version which actually compiles:
> 

... slightly shorter and minimally better compiled code ...

noinstr void native_load_gs_index(unsigned int selector)
{
	unsigned long flags;

	local_irq_save(flags);
	native_swapgs();
do_mov_gs:
	asm_volatile_goto("1: mov %[seg],%%gs\n"
			  "2:\n"
			  _ASM_EXTABLE(1b,%l[bad_seg])
			  : : [seg] "r" (selector) : : bad_seg);
	alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
	native_swapgs();
	local_irq_restore(flags);
	return;

bad_seg:
	/* The exception dispatch will have restored kernel GS */
	native_swapgs();
	alternative_input("", "mov %[seg],%%gs",
			  X86_BUG_NULL_SEG, [seg] "r" (__USER_DS));
	selector = 0;
	goto do_mov_gs;
}



* Re: [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in C code
  2021-09-08  5:00       ` H. Peter Anvin
@ 2021-09-08  7:12         ` Lai Jiangshan
  2021-09-09 23:16           ` H. Peter Anvin
  0 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-08  7:12 UTC (permalink / raw)
  To: H. Peter Anvin, Lai Jiangshan, linux-kernel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Dave Hansen, Peter Zijlstra, Dave Jiang, Ben Widawsky,
	Dan Williams, Arvind Sankar



On 2021/9/8 13:00, H. Peter Anvin wrote:
> On 9/7/21 9:42 PM, H. Peter Anvin wrote:
>>
>>
>> On 9/7/21 6:38 PM, H. Peter Anvin wrote:
>>> On 9/2/21 3:50 AM, Lai Jiangshan wrote:
>>>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>>>
>>>> There is no constraint/limitation forcing native_load_gs_index() to be in
>>>> ASM code.
>>>
>>> Hi,
>>>
>>> First of all, let me say I really like your patchset, and I will try to
>>> review it in detail ASAP (most of the initial read pass looks very sane
>>> to me).

Hello

Thank you for your review.

>>>
>>> However, I would like to object in part to this specific patch. It adds a
>>> fair bit of extra code to the exception path, and adds jumps between
>>> files which makes the code much harder to read.

I tried putting all the code into a single C function, but I didn't know
how to use a C label in _ASM_EXTABLE, so I split it.

Your code is much better.

Thanks
Lai


>>>
>>> You end up doing one swapgs in assembly and one in C, which would seem
>>> to be a very good indication that really isn't an improvement.
>>>
>>> Note that this entire sequence is scheduled to be obsoleted by a single
>>> atomic hardware instruction, LKGS, which will replace ALL of
>>> native_load_gs_index(); it will no longer be necessary even to disable
>>> interrupts as there is no non-atomic state. In that sense, doing this as
>>> an out-of-line C function (with some inline assembly) is great, because
>>> it makes it far easier to use LKGS as an alternative; the only (small)
>>> disadvantage is that it ends up clobbering a lot of registers
>>> unnecessarily (in assembly it can be implemented clobbering only two
>>> registers; one if one uses pushf/popf to save the interrupt flag.)
>>>
>>
>> OK, here is a version which actually compiles:
>>
> 
> ... slightly shorter and minimally better compiled code ...
> 
> noinstr void native_load_gs_index(unsigned int selector)
> {
>      unsigned long flags;
> 
>      local_irq_save(flags);
>      native_swapgs();
> do_mov_gs:
>      asm_volatile_goto("1: mov %[seg],%%gs\n"
>                "2:\n"
>                _ASM_EXTABLE(1b,%l[bad_seg])
>                : : [seg] "r" (selector) : : bad_seg);
>      alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
>      native_swapgs();
>      local_irq_restore(flags);
>      return;
> 
> bad_seg:
>      /* The exception dispatch will have restored kernel GS */
>      native_swapgs();
>      alternative_input("", "mov %[seg],%%gs",
>                X86_BUG_NULL_SEG, [seg] "r" (__USER_DS));
>      selector = 0;
>      goto do_mov_gs;
> }


* Re: [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/
  2021-09-07 21:12                   ` Nick Desaulniers
@ 2021-09-08  7:33                     ` Martin Liška
  0 siblings, 0 replies; 72+ messages in thread
From: Martin Liška @ 2021-09-08  7:33 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Peter Zijlstra, Miguel Ojeda, Lai Jiangshan, Joerg Roedel,
	Lai Jiangshan, linux-kernel, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Daniel Bristot de Oliveira, Brijesh Singh, Andy Shevchenko,
	Arvind Sankar, Chester Lin, Juergen Gross, andrew.cooper3,
	linux-toolchains

On 9/7/21 23:12, Nick Desaulniers wrote:
> On Fri, Sep 3, 2021 at 12:36 AM Martin Liška <mliska@suse.cz> wrote:
>>
>> On 9/2/21 19:05, Nick Desaulniers wrote:
>>> IIRC GCC only
>>> added the attribute recently in the 10.X release, so it might be too
>>> new to rely on quite yet.
>>
>> The no_stack_protector attribute was actually added in the GCC 11.x release:
>> https://gcc.gnu.org/gcc-11/changes.html
> 
> Ah right, that lays more weight though that this feature is still too
> new to rely on quite yet.

Sure.

> Martin, do you know if what happens with
> regards to inlining when the callee and caller mismatch on this
> function attribute in GCC?  This is very much a problem for the
> kernel.

That's a known issue that was already discussed both on a kernel mailing list
and in the GCC bugzilla:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94722#c9

Martin

> 
>> Note the compiler is definitely used by Fedora, openSUSE Tumbleweed
>> and other cutting edge distributions.
> 
> The kernel supports GCC 4.9+ currently. This feature can only be emulated
> with the coarse-grained -fno-stack-protector (or gnu_inline with an
> out-of-line assembler definition... :( ).
> 



* Re: [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in C code
  2021-09-08  7:12         ` Lai Jiangshan
@ 2021-09-09 23:16           ` H. Peter Anvin
  0 siblings, 0 replies; 72+ messages in thread
From: H. Peter Anvin @ 2021-09-09 23:16 UTC (permalink / raw)
  To: Lai Jiangshan, Lai Jiangshan, linux-kernel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Dave Hansen, Peter Zijlstra, Dave Jiang, Ben Widawsky,
	Dan Williams, Arvind Sankar

[-- Attachment #1: Type: text/plain, Size: 429 bytes --]

Come to think of it, I think it would be better if we were to incorporate
the X86_BUG_NULL_SEG null segment load into the main native_load_gs_index()
code while we are swapgs'd anyway. This simplifies the code further, and
should avoid calling native_load_gs_index() twice if this bug is present.

(This is the traps.c code only; removing the now unnecessary instances 
is left as an exercise for the reader...)

	-hpa

[-- Attachment #2: traps.diff --]
[-- Type: text/x-patch, Size: 1270 bytes --]

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..5de423a24777 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -671,6 +671,48 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 }
 
 #ifdef CONFIG_X86_64
+
+/*
+ * Reload the %gs selector for user space (setting GS_BASE) with
+ * exception handling: if the selector is invalid, set %gs to NULL.
+ *
+ * selector: new selector
+ *
+ * This function is noinstr as it must not be instrumented.
+ */
+noinstr void native_load_gs_index(unsigned int selector)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+again:
+	native_swapgs();
+
+	if (static_cpu_has_bug(X86_BUG_NULL_SEG)) {
+		if (likely(selector <= 3))
+			asm volatile("mov %[seg],%%gs"
+				     : : [seg] "rm" (__USER_DS));
+	}
+
+	asm_volatile_goto("1: mov %[seg],%%gs\n"
+			  _ASM_EXTABLE(1b, %l[bad_seg])
+			  : : [seg] "rm" (selector)
+			  : : bad_seg);
+
+	alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
+	native_swapgs();
+	local_irq_restore(flags);
+	return;
+
+bad_seg:
+	/*
+	 * Invalid selector, set %gs to null (0).
+	 */
+	selector = 0;
+	goto again;
+}
+
 /*
  * Help handler running on a per-cpu (IST or entry trampoline) stack
  * to switch to the normal thread stack if the interrupted code was in


* Re: [PATCH 17/24] x86/entry: Introduce struct ist_regs
  2021-08-31 17:50 ` [PATCH 17/24] x86/entry: Introduce struct ist_regs Lai Jiangshan
@ 2021-09-10  0:18   ` Lai Jiangshan
  2021-09-10  0:36     ` Lai Jiangshan
                       ` (2 more replies)
  0 siblings, 3 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-10  0:18 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin, Joerg Roedel, Youquan Song, Tony Luck,
	Juergen Gross, Peter Zijlstra (Intel)



On 2021/9/1 01:50, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> struct ist_regs is the topmost stack frame for an IST interrupt; it
> consists of struct pt_regs and other fields which will be added in a
> later patch.
> 
> Make vc_switch_off_ist() take ist_regs as its argument; it will switch
> the whole ist_regs if needed.
> 
> Make the frame before calling paranoid_entry() and paranoid_exit() be
> struct ist_regs.
> 
> This patch prepares for converting paranoid_entry() and paranoid_exit()
> into C code, which will need the additional fields to store the results in
> paranoid_entry() and to use them in paranoid_exit().

This patch was over-designed.

In ASM code, we can easily save results in the callee-saved registers.
For example, CR3 is saved in %r14 and the gsbase info is saved in %rbx.

In C code, we can't save results in registers, and I thought there was
no place to save the results because the CR3 and gsbase are not the kernel's.
So I extended pt_regs to ist_regs to save the results.

But that was incorrect.  The results can be saved in percpu data at the end
of paranoid_entry(), after the CR3/gsbase are settled, and read back at the
beginning of paranoid_exit(), before the CR3/gsbase are switched back to the
interrupted context's.

sigh.

> 
> The C interrupt handlers don't use struct ist_regs because they don't need
> the additional fields in the struct ist_regs, and they still use pt_regs.
> 



* Re: [PATCH 17/24] x86/entry: Introduce struct ist_regs
  2021-09-10  0:18   ` Lai Jiangshan
@ 2021-09-10  0:36     ` Lai Jiangshan
  2021-09-10  4:31     ` H. Peter Anvin
  2021-09-10  4:50     ` H. Peter Anvin
  2 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-10  0:36 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin, Joerg Roedel, Youquan Song, Tony Luck,
	Juergen Gross, Peter Zijlstra (Intel)



On 2021/9/10 08:18, Lai Jiangshan wrote:
> 
> 
> On 2021/9/1 01:50, Lai Jiangshan wrote:
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>
>> struct ist_regs is the topmost stack frame for an IST interrupt; it
>> consists of struct pt_regs and other fields which will be added in a
>> later patch.
>>
>> Make vc_switch_off_ist() take ist_regs as its argument and it will switch
>> the whole ist_regs if needed.
>>
>> Make the frame before calling paranoid_entry() and paranoid_exit() be
>> struct ist_regs.
>>
>> This patch prepares for converting paranoid_entry() and paranoid_exit()
>> into C code, which will need the additional fields to store the results in
>> paranoid_entry() and to use them in paranoid_exit().
> 
> This patch was over-designed.
> 
> In ASM code, we can easily save results in the callee-saved registers.
> For example, CR3 is saved in %r14 and the gsbase info is saved in %rbx.
> 
> And in C code, we can't save results in registers.  And I thought there was
> no place to save the results because the CR3 and gsbase are not kernel's.
> So I extended the pt_regs to ist_regs to save the results.
> 
> But it was incorrect.  The results can be saved in percpu data at the end of
> paranoid_entry() after the CR3/gsbase are settled down.  And the results
> can be read at the beginning of paranoid_exit() before the CR3/gsbase are
> switched to the interrupted context's.
> 
> sigh.

Sigh again. IST interrupts can be nested, so we can't save the results in
percpu data unless we make it stack-like, which is not a good idea.  The
changelog failed to express this.

> 
>>
>> The C interrupt handlers don't use struct ist_regs because they don't need
>> the additional fields in the struct ist_regs, and they still use pt_regs.
>>
> 


* Re: [PATCH 17/24] x86/entry: Introduce struct ist_regs
  2021-09-10  0:18   ` Lai Jiangshan
  2021-09-10  0:36     ` Lai Jiangshan
@ 2021-09-10  4:31     ` H. Peter Anvin
  2021-09-10  7:13       ` Lai Jiangshan
  2021-09-10  4:50     ` H. Peter Anvin
  2 siblings, 1 reply; 72+ messages in thread
From: H. Peter Anvin @ 2021-09-10  4:31 UTC (permalink / raw)
  To: Lai Jiangshan, Lai Jiangshan, linux-kernel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Joerg Roedel, Youquan Song, Tony Luck, Juergen Gross,
	Peter Zijlstra (Intel)

Note: the examples in this email all compiled with:

gcc -O2 -mpreferred-stack-boundary=3 -mcmodel=kernel

The disassembly has been slightly simplified.


On 9/9/21 5:18 PM, Lai Jiangshan wrote:
> 
> This patch was over-designed.
> 
> In ASM code, we can easily save results in the callee-saved registers.
> For example, CR3 is saved in %r14 and the gsbase info is saved in %rbx.
> 
> And in C code, we can't save results in registers.  And I thought there was
> no place to save the results because the CR3 and gsbase are not kernel's.
> So I extended the pt_regs to ist_regs to save the results.
> 
> But it was incorrect.  The results can be saved in percpu data at the 
> end of
> paranoid_entry() after the CR3/gsbase are settled down.  And the results
> can be read at the beginning of paranoid_exit() before the CR3/gsbase are
> switched to the interrupted context's.
> 

OK, count me confused. Of course you can save results in callee-saved 
registers in C; it is kind of what the compiler does:


extern void bar(void);

unsigned long foo(unsigned long v)
{
	bar();
	return v;
}

0000000000000000 <foo>:
    0:   41 54                   push   %r12
    2:   49 89 fc                mov    %rdi,%r12
    5:   e8 00 00 00 00          callq  bar
    a:   4c 89 e0                mov    %r12,%rax
    d:   41 5c                   pop    %r12
    f:   c3                      retq

Now, if you need to specify *which* registers, you have to declare them 
as global register variables - NOT local (which have completely 
different semantics). This also means that you (probably) want to 
isolate this code into its own compilation unit, because it will prevent 
any other code in the same .c file from using that register as well.

For example:

register unsigned long r14 asm("%r14");
unsigned long foo(unsigned long v)
{
	r14 = v;
	bar();
	v = r14;
	return v;
}

0000000000000000 <foo>:
    0:   49 89 fe                mov    %rdi,%r14
    3:   e8 00 00 00 00          callq  bar
    8:   4c 89 f0                mov    %r14,%rax
    b:   c3                      retq

WARNING: This also means that gcc will happily discard the old value in 
%r14, so if you need it preserved you have to do so explicitly; if you 
are called direct from assembly and are happy to lose the value then the 
above code is fine -- and it is even slightly more efficient!

For preserving the old r14, in this case:

register unsigned long r14 asm("%r14");
unsigned long foo(unsigned long v)
{
	unsigned long saved_r14 = r14;
	r14 = v;
	bar();
	v = r14;
	r14 = saved_r14;
	return v;
}

0000000000000000 <foo>:
    0:   53                      push   %rbx
    1:   4c 89 f3                mov    %r14,%rbx
    4:   49 89 fe                mov    %rdi,%r14
    7:   e8 00 00 00 00          callq  bar
    c:   4c 89 f0                mov    %r14,%rax
    f:   49 89 de                mov    %rbx,%r14
   12:   5b                      pop    %rbx
   13:   c3                      retq


HOWEVER, if you are relying on not using the stack, then using C code is 
probably very much not a good idea. It is very hard to guarantee that 
just because the C compiler is *currently* not using a stack, that it 
won't do so *in the future*.

Again, finally, local register variables DO NOT WORK, this does NOT do 
what you expect:

unsigned long foo(unsigned long v)
{
	register unsigned long r14 asm("%r14") = v;
	bar();
	return r14;
}

0000000000000000 <foo>:
    0:   41 54                   push   %r12
    2:   49 89 fc                mov    %rdi,%r12
    5:   e8 00 00 00 00          callq  bar
    a:   4c 89 e0                mov    %r12,%rax
    d:   41 5c                   pop    %r12
    f:   c3                      retq





* Re: [PATCH 17/24] x86/entry: Introduce struct ist_regs
  2021-09-10  0:18   ` Lai Jiangshan
  2021-09-10  0:36     ` Lai Jiangshan
  2021-09-10  4:31     ` H. Peter Anvin
@ 2021-09-10  4:50     ` H. Peter Anvin
  2021-09-10  4:51       ` H. Peter Anvin
  2 siblings, 1 reply; 72+ messages in thread
From: H. Peter Anvin @ 2021-09-10  4:50 UTC (permalink / raw)
  To: Lai Jiangshan, Lai Jiangshan, linux-kernel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Joerg Roedel, Youquan Song, Tony Luck, Juergen Gross,
	Peter Zijlstra (Intel)

It may affect your design, I don't know, but FRED exception delivery (as currently architected per draft 2.0; it is subject to change still) pushes several fields above the conventional stack frame (and thus above pt_regs.)

On September 9, 2021 5:18:47 PM PDT, Lai Jiangshan <laijs@linux.alibaba.com> wrote:
>
>
>On 2021/9/1 01:50, Lai Jiangshan wrote:
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>> 
>> struct ist_regs is the topmost stack frame for an IST interrupt; it
>> consists of struct pt_regs and other fields which will be added in a
>> later patch.
>> 
>> Make vc_switch_off_ist() take ist_regs as its argument and it will switch
>> the whole ist_regs if needed.
>> 
>> Make the frame before calling paranoid_entry() and paranoid_exit() be
>> struct ist_regs.
>> 
>> This patch prepares for converting paranoid_entry() and paranoid_exit()
>> into C code, which will need the additional fields to store the results in
>> paranoid_entry() and to use them in paranoid_exit().
>
>This patch was over-designed.
>
>In ASM code, we can easily save results in the callee-saved registers.
>For example, CR3 is saved in %r14 and the gsbase info is saved in %rbx.
>
>And in C code, we can't save results in registers.  And I thought there was
>no place to save the results because the CR3 and gsbase are not kernel's.
>So I extended the pt_regs to ist_regs to save the results.
>
>But it was incorrect.  The results can be saved in percpu data at the end of
>paranoid_entry() after the CR3/gsbase are settled down.  And the results
>can be read at the beginning of paranoid_exit() before the CR3/gsbase are
>switched to the interrupted context's.
>
>sigh.
>
>> 
>> The C interrupt handlers don't use struct ist_regs because they don't need
>> the additional fields in the struct ist_regs, and they still use pt_regs.
>> 
>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* Re: [PATCH 17/24] x86/entry: Introduce struct ist_regs
  2021-09-10  4:50     ` H. Peter Anvin
@ 2021-09-10  4:51       ` H. Peter Anvin
  0 siblings, 0 replies; 72+ messages in thread
From: H. Peter Anvin @ 2021-09-10  4:51 UTC (permalink / raw)
  To: Lai Jiangshan, Lai Jiangshan, linux-kernel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Joerg Roedel, Youquan Song, Tony Luck, Juergen Gross,
	Peter Zijlstra (Intel)

Note: with FRED paranoid entry is no longer needed since the hardware enforces gs_base mode consistency.

On September 9, 2021 9:50:40 PM PDT, "H. Peter Anvin" <hpa@zytor.com> wrote:
>It may affect your design, I don't know, but FRED exception delivery (as currently architected per draft 2.0; it is subject to change still) pushes several fields above the conventional stack frame (and thus above pt_regs.)
>
>On September 9, 2021 5:18:47 PM PDT, Lai Jiangshan <laijs@linux.alibaba.com> wrote:
>>
>>
>>On 2021/9/1 01:50, Lai Jiangshan wrote:
>>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>> 
>>> struct ist_regs is the topmost stack frame for an IST interrupt; it
>>> consists of struct pt_regs and other fields which will be added in a
>>> later patch.
>>> 
>>> Make vc_switch_off_ist() take ist_regs as its argument and it will switch
>>> the whole ist_regs if needed.
>>> 
>>> Make the frame before calling paranoid_entry() and paranoid_exit() be
>>> struct ist_regs.
>>> 
>>> This patch prepares for converting paranoid_entry() and paranoid_exit()
>>> into C code, which will need the additional fields to store the results in
>>> paranoid_entry() and to use them in paranoid_exit().
>>
>>This patch was over-designed.
>>
>>In ASM code, we can easily save results in the callee-saved registers.
>>For example, CR3 is saved in %r14 and the gsbase info is saved in %rbx.
>>
>>And in C code, we can't save results in registers.  And I thought there was
>>no place to save the results because the CR3 and gsbase are not kernel's.
>>So I extended the pt_regs to ist_regs to save the results.
>>
>>But it was incorrect.  The results can be saved in percpu data at the end of
>>paranoid_entry() after the CR3/gsbase are settled down.  And the results
>>can be read at the beginning of paranoid_exit() before the CR3/gsbase are
>>switched to the interrupted context's.
>>
>>sigh.
>>
>>> 
>>> The C interrupt handlers don't use struct ist_regs because they don't need
>>> the additional fields in the struct ist_regs, and they still use pt_regs.
>>> 
>>
>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* Re: [PATCH 17/24] x86/entry: Introduce struct ist_regs
  2021-09-10  4:31     ` H. Peter Anvin
@ 2021-09-10  7:13       ` Lai Jiangshan
  2021-09-10  7:14         ` H. Peter Anvin
  0 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-10  7:13 UTC (permalink / raw)
  To: H. Peter Anvin, Lai Jiangshan, linux-kernel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Joerg Roedel, Youquan Song, Tony Luck, Juergen Gross,
	Peter Zijlstra (Intel)



On 2021/9/10 12:31, H. Peter Anvin wrote:
> Note: the examples in this email all compiled with:
> 
> gcc -O2 -mpreferred-stack-boundary=3 -mcmodel=kernel
> 
> The disassembly has been slightly simplified.

Saving results in registers is inflexible, whether in ASM or hacked into C
like this.

Saving CR3 in ist_regs is no different from saving rax in pt_regs, and both
ist_regs and the embedded pt_regs can be moved when the stack needs to be
switched.

I prefer my original design.

> 
> 
> On 9/9/21 5:18 PM, Lai Jiangshan wrote:
>>
>> This patch was over-designed.
>>
>> In ASM code, we can easily save results in the callee-saved registers.
>> For example, CR3 is saved in %r14 and the gsbase info is saved in %rbx.
>>
>> And in C code, we can't save results in registers.  And I thought there was
>> no place to save the results because the CR3 and gsbase are not kernel's.
>> So I extended the pt_regs to ist_regs to save the results.
>>
>> But it was incorrect.  The results can be saved in percpu data at the end of
>> paranoid_entry() after the CR3/gsbase are settled down.  And the results
>> can be read at the beginning of paranoid_exit() before the CR3/gsbase are
>> switched to the interrupted context's.
>>
> 
> OK, count me confused. Of course you can save results in callee-saved registers in C; it is kind of what the compiler does:
> 
> 
> extern void bar(void);
> 
> unsigned long foo(unsigned long v)
> {
>      bar();
>      return v;
> }
> 
> 0000000000000000 <foo>:
>     0:   41 54                   push   %r12
>     2:   49 89 fc                mov    %rdi,%r12
>     5:   e8 00 00 00 00          callq  bar
>     a:   4c 89 e0                mov    %r12,%rax
>     d:   41 5c                   pop    %r12
>     f:   c3                      retq
> 
> Now, if you need to specify *which* registers, you have to declare them as global register variables - NOT local (which 
> have completely different semantics). This also means that you (probably) want to isolate this code into its own 
> compilation unit, because it will prevent any other code in the same .c file from using that register as well.
> 
> For example:
> 
> register unsigned long r14 asm("%r14");
> unsigned long foo(unsigned long v)
> {
>      r14 = v;
>      bar();
>      v = r14;
>      return v;
> }
> 
> 0000000000000000 <foo>:
>     0:   49 89 fe                mov    %rdi,%r14
>     3:   e8 00 00 00 00          callq  bar
>     8:   4c 89 f0                mov    %r14,%rax
>     b:   c3                      retq
> 
> WARNING: This also means that gcc will happily discard the old value in %r14, so if you need it preserved you have to do 
> so explicitly; if you are called direct from assembly and are happy to lose the value then the above code is fine -- and 
> it is even slightly more efficient!
> 
> For preserving the old r14, in this case:
> 
> register unsigned long r14 asm("%r14");
> unsigned long foo(unsigned long v)
> {
>      unsigned long saved_r14 = r14;
>      r14 = v;
>      bar();
>      v = r14;
>      r14 = saved_r14;
>      return v;
> }
> 
> 0000000000000000 <foo>:
>     0:   53                      push   %rbx
>     1:   4c 89 f3                mov    %r14,%rbx
>     4:   49 89 fe                mov    %rdi,%r14
>     7:   e8 00 00 00 00          callq  bar
>     c:   4c 89 f0                mov    %r14,%rax
>     f:   49 89 de                mov    %rbx,%r14
>    12:   5b                      pop    %rbx
>    13:   c3                      retq
> 
> 
> HOWEVER, if you are relying on not using the stack, then using C code is probably very much not a good idea. It is very 
> hard to guarantee that just because the C compiler is *currently* not using a stack, that it won't do so *in the future*.
> 
> Again, finally, local register variables DO NOT WORK, this does NOT do what you expect:
> 
> unsigned long foo(unsigned long v)
> {
>      register unsigned long r14 asm("%r14") = v;
>      bar();
>      return r14;
> }
> 
> 0000000000000000 <foo>:
>     0:   41 54                   push   %r12
>     2:   49 89 fc                mov    %rdi,%r12
>     5:   e8 00 00 00 00          callq  bar
>     a:   4c 89 e0                mov    %r12,%rax
>     d:   41 5c                   pop    %r12
>     f:   c3                      retq
> 
> 


* Re: [PATCH 17/24] x86/entry: Introduce struct ist_regs
  2021-09-10  7:13       ` Lai Jiangshan
@ 2021-09-10  7:14         ` H. Peter Anvin
  0 siblings, 0 replies; 72+ messages in thread
From: H. Peter Anvin @ 2021-09-10  7:14 UTC (permalink / raw)
  To: Lai Jiangshan, Lai Jiangshan, linux-kernel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, Joerg Roedel, Youquan Song, Tony Luck, Juergen Gross,
	Peter Zijlstra (Intel)

On 9/10/21 12:13 AM, Lai Jiangshan wrote:
> 
> 
> On 2021/9/10 12:31, H. Peter Anvin wrote:
>> Note: the examples in this email all compiled with:
>>
>> gcc -O2 -mpreferred-stack-boundary=3 -mcmodel=kernel
>>
>> The disassembly has been slightly simplified.
> 
> Saving results in registers is inflexible, whether in ASM or hacked into
> C like this.
> 
> Saving CR3 in ist_regs is no different from saving rax in pt_regs, and
> both ist_regs and the embedded pt_regs can be moved when the stack needs
> to be switched.
> 
> I prefer my original design.
> 

I agree. I was having some issues following your arguments.

	-hpa



* Re: [PATCH 24/24] x86/syscall/64: Move the checking for sysret to C code
  2021-08-31 17:50 ` [PATCH 24/24] x86/syscall/64: Move the checking for sysret to C code Lai Jiangshan
@ 2021-09-10  7:20   ` Nikolay Borisov
  2021-09-10  7:30     ` Lai Jiangshan
  0 siblings, 1 reply; 72+ messages in thread
From: Nikolay Borisov @ 2021-09-10  7:20 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel
  Cc: Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin



On 31.08.21 г. 20:50, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Like do_fast_syscall_32(), which checks whether it can return to userspace
> via fast instructions before the function returns, do_syscall_64() also
> checks, in C code, whether it can use SYSRET to return to userspace before
> it returns.  And a bunch of ASM code can be removed.
> 
> No functional change intended.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>

<snip>

> +/*
> + * Check if it can use SYSRET.
> + *
> + * Try to use SYSRET instead of IRET if we're returning to
> + * a completely clean 64-bit userspace context.
> + *
> + * Returns 0 to return using IRET or 1 to return using SYSRET.
> + */
> +static __always_inline int can_sysret(struct pt_regs *regs)

nit: Since this is a predicate function why not simply return bool ?

> +{
> +	/* In the Xen PV case we must use iret anyway. */
> +	if (static_cpu_has(X86_FEATURE_XENPV))
> +		return 0;
> +
> +	/* SYSRET requires RCX == RIP && R11 == RFLAGS */
> +	if (regs->ip != regs->cx || regs->flags != regs->r11)
> +		return 0;
> +
> +	/* CS and SS must match SYSRET */
> +	if (regs->cs != __USER_CS || regs->ss != __USER_DS)
> +		return 0;
> +
> +	/*
> +	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
> +	 * in kernel space.  This essentially lets the user take over
> +	 * the kernel, since userspace controls RSP.
> +	 */
> +	if (regs->cx != canonical_address(regs->cx))
> +		return 0;
> +
> +	/*
> +	 * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
> +	 * restore RF properly. If the slowpath sets it for whatever reason, we
> +	 * need to restore it correctly.
> +	 *
> +	 * SYSRET can restore TF, but unlike IRET, restoring TF results in a
> +	 * trap from userspace immediately after SYSRET.  This would cause an
> +	 * infinite loop whenever #DB happens with register state that satisfies
> +	 * the opportunistic SYSRET conditions.  For example, single-stepping
> +	 * this user code:
> +	 *
> +	 *           movq	$stuck_here, %rcx
> +	 *           pushfq
> +	 *           popq %r11
> +	 *   stuck_here:
> +	 *
> +	 * would never get past 'stuck_here'.
> +	 */
> +	if (regs->r11 & (X86_EFLAGS_RF | X86_EFLAGS_TF))
> +		return 0;
> +
> +	return 1;
> +}
> +
> +/* Returns 0 to return using IRET or 1 to return using SYSRET. */
> +__visible noinstr int do_syscall_64(struct pt_regs *regs, int nr)

nit: Ditto about bool

>  {
>  	add_random_kstack_offset();
>  	nr = syscall_enter_from_user_mode(regs, nr);
> @@ -84,6 +154,7 @@ __visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
>  
>  	instrumentation_end();
>  	syscall_exit_to_user_mode(regs);
> +	return can_sysret(regs);
>  }
>  #endif
>  

<snip>


* Re: [PATCH 24/24] x86/syscall/64: Move the checking for sysret to C code
  2021-09-10  7:20   ` Nikolay Borisov
@ 2021-09-10  7:30     ` Lai Jiangshan
  0 siblings, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-10  7:30 UTC (permalink / raw)
  To: Nikolay Borisov, Lai Jiangshan, linux-kernel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin



On 2021/9/10 15:20, Nikolay Borisov wrote:
> 
> 
> On 31.08.21 г. 20:50, Lai Jiangshan wrote:
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>
>> Like do_fast_syscall_32(), which checks whether it can return to userspace
>> via fast instructions before the function returns, do_syscall_64() also
>> checks, in C code, whether it can use SYSRET to return to userspace before
>> it returns.  And a bunch of ASM code can be removed.
>>
>> No functional change intended.
>>
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> <snip>
> 
>> +/*
>> + * Check if it can use SYSRET.
>> + *
>> + * Try to use SYSRET instead of IRET if we're returning to
>> + * a completely clean 64-bit userspace context.
>> + *
>> + * Returns 0 to return using IRET or 1 to return using SYSRET.
>> + */
>> +static __always_inline int can_sysret(struct pt_regs *regs)
> 
> nit: Since this is a predicate function why not simply return bool ?

I don't have any preference.

The choice came from my limited knowledge.

I followed the design of do_fast_syscall_32(), which returns a 4-byte word
to indicate whether it can fast-return to userspace, and I know how to test
a 4-byte word result in ASM.  If it were a bool, I wouldn't know how to
test the result in ASM.

> 
>> +{
>> +	/* In the Xen PV case we must use iret anyway. */
>> +	if (static_cpu_has(X86_FEATURE_XENPV))
>> +		return 0;
>> +
>> +	/* SYSRET requires RCX == RIP && R11 == RFLAGS */
>> +	if (regs->ip != regs->cx || regs->flags != regs->r11)
>> +		return 0;
>> +
>> +	/* CS and SS must match SYSRET */
>> +	if (regs->cs != __USER_CS || regs->ss != __USER_DS)
>> +		return 0;
>> +
>> +	/*
>> +	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
>> +	 * in kernel space.  This essentially lets the user take over
>> +	 * the kernel, since userspace controls RSP.
>> +	 */
>> +	if (regs->cx != canonical_address(regs->cx))
>> +		return 0;
>> +
>> +	/*
>> +	 * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
>> +	 * restore RF properly. If the slowpath sets it for whatever reason, we
>> +	 * need to restore it correctly.
>> +	 *
>> +	 * SYSRET can restore TF, but unlike IRET, restoring TF results in a
>> +	 * trap from userspace immediately after SYSRET.  This would cause an
>> +	 * infinite loop whenever #DB happens with register state that satisfies
>> +	 * the opportunistic SYSRET conditions.  For example, single-stepping
>> +	 * this user code:
>> +	 *
>> +	 *           movq	$stuck_here, %rcx
>> +	 *           pushfq
>> +	 *           popq %r11
>> +	 *   stuck_here:
>> +	 *
>> +	 * would never get past 'stuck_here'.
>> +	 */
>> +	if (regs->r11 & (X86_EFLAGS_RF | X86_EFLAGS_TF))
>> +		return 0;
>> +
>> +	return 1;
>> +}
>> +
>> +/* Returns 0 to return using IRET or 1 to return using SYSRET. */
>> +__visible noinstr int do_syscall_64(struct pt_regs *regs, int nr)
> 
> nit: Ditto about bool
> 
>>   {
>>   	add_random_kstack_offset();
>>   	nr = syscall_enter_from_user_mode(regs, nr);
>> @@ -84,6 +154,7 @@ __visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
>>   
>>   	instrumentation_end();
>>   	syscall_exit_to_user_mode(regs);
>> +	return can_sysret(regs);
>>   }
>>   #endif
>>   
> 
> <snip>
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in C code
  2021-09-02 10:50 ` [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in " Lai Jiangshan
  2021-09-08  1:38   ` H. Peter Anvin
@ 2021-09-13 20:01   ` Andy Lutomirski
  2021-09-14  2:04     ` Lai Jiangshan
  1 sibling, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2021-09-13 20:01 UTC (permalink / raw)
  To: Lai Jiangshan, Linux Kernel Mailing List
  Cc: Lai Jiangshan, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Dave Hansen,
	Peter Zijlstra (Intel),
	Dave Jiang, Ben Widawsky, Williams, Dan J, Arvind Sankar



On Thu, Sep 2, 2021, at 3:50 AM, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> There is no constraint/limitation forcing native_load_gs_index() to be in
> ASM code.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>

>  
>  #ifdef CONFIG_X86_64
> +
> +/*
> + * Reload gs selector with exception handling
> + * selector:  new selector
> + *
> + * Is noinstr as it shouldn't be instrumented.
> + */
> +noinstr void native_load_gs_index(unsigned int selector)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);

This patch would be a bit less alarming if you moved the swapgs into asm.  Also, this needs a comment explaining why skipping the swapgs back to kernel gs in the exception path is correct.

> +	native_swapgs();
> +	asm volatile(
> +		".global asm_load_gs_index_gs_change \n"
> +		"asm_load_gs_index_gs_change: \n"
> +		"1: movl %0, %%gs \n"
> +		"   swapgs \n"
> +		"2: \n"
> +		_ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_clear_gs)
> +		:: "r" (selector) : "memory");
> +	alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
> +	local_irq_restore(flags);
> +}
> +


* Re: [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in C code
  2021-09-13 20:01   ` Andy Lutomirski
@ 2021-09-14  2:04     ` Lai Jiangshan
  2021-09-14  8:14       ` Peter Zijlstra
  0 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-14  2:04 UTC (permalink / raw)
  To: Andy Lutomirski, Lai Jiangshan, Linux Kernel Mailing List
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Dave Hansen,
	Peter Zijlstra (Intel),
	Dave Jiang, Ben Widawsky, Williams, Dan J, Arvind Sankar



On 2021/9/14 04:01, Andy Lutomirski wrote:
> 
> 
> On Thu, Sep 2, 2021, at 3:50 AM, Lai Jiangshan wrote:
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>
>> There is no constraint/limitation forcing native_load_gs_index() to be in
>> ASM code.
>>
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> 
>>   
>>   #ifdef CONFIG_X86_64
>> +
>> +/*
>> + * Reload gs selector with exception handling
>> + * selector:  new selector
>> + *
>> + * Is noinstr as it shouldn't be instrumented.
>> + */
>> +noinstr void native_load_gs_index(unsigned int selector)
>> +{
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
> 
> This patch would be a bit less alarming if you moved the swapgs into asm.  

Emmm, this patch is not so clean or convincing in C.

I think Peter is still working on reworking the patchset, which may
include improving this patch.  I'm okay with this patch being dropped.

> Also, this needs a comment explaining why skipping the swapgs back to kernel gs in the exception path is correct.
> 

I think it is well known that the exception handler in ASM_EXTABLE runs in
kernel context, where the kernel gs is already active.

It does need a comment explaining why the label asm_load_gs_index_gs_change
is needed and how it helps error_entry() restore the kernel gs.

Since the C-version error_entry() has to check asm_load_gs_index_gs_change
anyway, I think another way to handle the fault of "mov %gs" is to just do
the %gs fixup in the C-version error_entry() (see patch 11).  It would be
more direct, simple, and self-documented.

Thank you for reviewing.

>> +	native_swapgs();
>> +	asm volatile(
>> +		".global asm_load_gs_index_gs_change \n"
>> +		"asm_load_gs_index_gs_change: \n"
>> +		"1: movl %0, %%gs \n"
>> +		"   swapgs \n"
>> +		"2: \n"
>> +		_ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_clear_gs)
>> +		:: "r" (selector) : "memory");
>> +	alternative("", "mfence", X86_BUG_SWAPGS_FENCE);
>> +	local_irq_restore(flags);
>> +}
>> +


* Re: [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in C code
  2021-09-14  2:04     ` Lai Jiangshan
@ 2021-09-14  8:14       ` Peter Zijlstra
  2021-09-14  8:17         ` Borislav Petkov
  2021-09-14  8:40         ` Lai Jiangshan
  0 siblings, 2 replies; 72+ messages in thread
From: Peter Zijlstra @ 2021-09-14  8:14 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Andy Lutomirski, Lai Jiangshan, Linux Kernel Mailing List,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Dave Hansen,
	Dave Jiang, Ben Widawsky, Williams, Dan J, Arvind Sankar

On Tue, Sep 14, 2021 at 10:04:47AM +0800, Lai Jiangshan wrote:
> I think Peter is still working on reworking the patchset, which may
> include improving this patch.  I'm okay with this patch being dropped.

No I am not.

I said I have interest, I also said you shouldn't repost after a single
bit of feedback -- there's nothing more annoying than trying to review a
large-ish and complicated series of patches when you get a new version
every other day.

But please do work on it, because I really don't have the bandwidth to
carry this atm.  By now there's been a fair amount of feedback, so a new
version might be appropriate.

I really do think you've got some very good things here. Please work on
it. I will try and review :-)


* Re: [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in C code
  2021-09-14  8:14       ` Peter Zijlstra
@ 2021-09-14  8:17         ` Borislav Petkov
  2021-09-14  8:40         ` Lai Jiangshan
  1 sibling, 0 replies; 72+ messages in thread
From: Borislav Petkov @ 2021-09-14  8:17 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Peter Zijlstra, Andy Lutomirski, Lai Jiangshan,
	Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, H. Peter Anvin, Dave Hansen,
	Dave Jiang, Ben Widawsky, Williams, Dan J, Arvind Sankar

On Tue, Sep 14, 2021 at 10:14:20AM +0200, Peter Zijlstra wrote:
> I really do think you've got some very good things here. Please work on
> it. I will try and review :-)

IOW, what he's saying is, you should handle your patchset the usual way.
And if you're unsure how, feel free to skim over:

Documentation/process/submitting-patches.rst

it is all explained there. :-)

HTH.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in C code
  2021-09-14  8:14       ` Peter Zijlstra
  2021-09-14  8:17         ` Borislav Petkov
@ 2021-09-14  8:40         ` Lai Jiangshan
  1 sibling, 0 replies; 72+ messages in thread
From: Lai Jiangshan @ 2021-09-14  8:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Lai Jiangshan, Linux Kernel Mailing List,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Dave Hansen,
	Dave Jiang, Ben Widawsky, Williams, Dan J, Arvind Sankar



On 2021/9/14 16:14, Peter Zijlstra wrote:
> On Tue, Sep 14, 2021 at 10:04:47AM +0800, Lai Jiangshan wrote:
>> I think Peter is still working on reworking the patchset, which may
>> include improving this patch.  I'm okay with this patch being dropped.
> 
> No I am not.


I'm sorry.

> 
> I said I have interest, I also said you shouldn't repost after a single
> bit of feedback -- there's nothing more annoying than trying to review a
> large-ish and complicated series of patches when you get a new version
> every other day.
>
> But please do work on it, because I really don't have the bandwidth to
> carry this atm.  By now there's been a fair amount of feedback, so a new
> version might be appropriate.

Thank you, I will revise the patchset and send it.

> 
> I really do think you've got some very good things here. Please work on
> it. I will try and review :-)
> 


end of thread

Thread overview: 72+ messages
2021-08-31 17:50 [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into C code Lai Jiangshan
2021-08-31 17:50 ` [PATCH 01/24] x86/traps: Remove stack-protector from traps.c Lai Jiangshan
2021-08-31 17:50 ` [PATCH 02/24] x86/traps: Move arch/x86/kernel/traps.c to arch/x86/entry/ Lai Jiangshan
2021-09-02  8:09   ` Joerg Roedel
2021-09-02  9:21     ` Lai Jiangshan
2021-09-02 10:50       ` Peter Zijlstra
2021-09-02 11:54         ` Lai Jiangshan
2021-09-02 12:05           ` Peter Zijlstra
2021-09-02 13:34             ` Peter Zijlstra
2021-09-02 17:05               ` Nick Desaulniers
2021-09-02 17:19                 ` Miguel Ojeda
2021-09-02 17:23                   ` Nick Desaulniers
2021-09-03  7:36                 ` Martin Liška
2021-09-07 21:12                   ` Nick Desaulniers
2021-09-08  7:33                     ` Martin Liška
2021-08-31 17:50 ` [PATCH 03/24] x86/traps: Move declaration of native_irq_return_iret up Lai Jiangshan
2021-08-31 17:50 ` [PATCH 04/24] x86/entry: Expose the address of .Lgs_change to traps.c Lai Jiangshan
2021-09-02  9:14   ` Peter Zijlstra
2021-09-02  9:20     ` Lai Jiangshan
2021-08-31 17:50 ` [PATCH 05/24] x86/entry: Introduce __entry_text for entry code written in C Lai Jiangshan
2021-08-31 19:34   ` Peter Zijlstra
2021-09-01  0:23     ` Lai Jiangshan
2021-08-31 17:50 ` [PATCH 06/24] x86/entry: Move PTI_USER_* to arch/x86/include/asm/processor-flags.h Lai Jiangshan
2021-08-31 17:50 ` [PATCH 07/24] x86: Mark __native_read_cr3() & native_write_cr3() as __always_inline Lai Jiangshan
2021-08-31 17:50 ` [PATCH 08/24] x86/traps: Add C verion of SWITCH_TO_KERNEL_CR3 as switch_to_kernel_cr3() Lai Jiangshan
2021-08-31 17:50 ` [PATCH 09/24] x86/traps: Add fence_swapgs_{user,kernel}_entry() Lai Jiangshan
2021-09-02  9:25   ` Peter Zijlstra
2021-08-31 17:50 ` [PATCH 10/24] x86/traps: Move pt_regs only in fixup_bad_iret() Lai Jiangshan
2021-08-31 17:50 ` [PATCH 11/24] x86/entry: Replace the most of asm code of error_entry to C code Lai Jiangshan
2021-09-02 10:16   ` Peter Zijlstra
2021-09-02 12:08     ` Lai Jiangshan
2021-08-31 17:50 ` [PATCH 12/24] x86/traps: Reconstruct pt_regs on task stack directly in fixup_bad_iret() Lai Jiangshan
2021-08-31 17:50 ` [PATCH 13/24] x86/traps: Mark sync_regs() and fixup_bad_iret() as static __always_inline Lai Jiangshan
2021-08-31 17:50 ` [PATCH 14/24] x86/entry: Make paranoid_exit() callable Lai Jiangshan
2021-08-31 17:50 ` [PATCH 15/24] x86/entry: Call paranoid_exit() in asm_exc_nmi() Lai Jiangshan
2021-08-31 17:50 ` [PATCH 16/24] x86/entry: Use skip_rdi instead of save_ret for PUSH_AND_CLEAR_REGS Lai Jiangshan
2021-08-31 17:50 ` [PATCH 17/24] x86/entry: Introduce struct ist_regs Lai Jiangshan
2021-09-10  0:18   ` Lai Jiangshan
2021-09-10  0:36     ` Lai Jiangshan
2021-09-10  4:31     ` H. Peter Anvin
2021-09-10  7:13       ` Lai Jiangshan
2021-09-10  7:14         ` H. Peter Anvin
2021-09-10  4:50     ` H. Peter Anvin
2021-09-10  4:51       ` H. Peter Anvin
2021-08-31 17:50 ` [PATCH 18/24] x86/entry: Add the C version ist_switch_to_kernel_cr3() Lai Jiangshan
2021-08-31 17:50 ` [PATCH 19/24] x86/entry: Add the C version ist_restore_cr3() Lai Jiangshan
2021-08-31 17:50 ` [PATCH 20/24] x86/entry: Add the C version get_percpu_base() Lai Jiangshan
2021-08-31 17:50 ` [PATCH 21/24] x86/entry: Add the C version ist_switch_to_kernel_gsbase() Lai Jiangshan
2021-08-31 17:50 ` [PATCH 22/24] x86/entry: Implement and use do_paranoid_entry() and paranoid_exit() Lai Jiangshan
2021-09-02 10:33   ` Peter Zijlstra
2021-09-02 10:42     ` Lai Jiangshan
2021-09-02 12:02       ` Peter Zijlstra
2021-09-02 11:58     ` Lai Jiangshan
2021-09-02 12:29       ` Joerg Roedel
2021-08-31 17:50 ` [PATCH 23/24] x86/entry: Remove the unused ASM macros Lai Jiangshan
2021-08-31 17:50 ` [PATCH 24/24] x86/syscall/64: Move the checking for sysret to C code Lai Jiangshan
2021-09-10  7:20   ` Nikolay Borisov
2021-09-10  7:30     ` Lai Jiangshan
2021-08-31 20:44 ` [PATCH 00/24] x86/entry/64: Convert a bunch of ASM entry code into " Peter Zijlstra
2021-09-02  6:28   ` Lai Jiangshan
2021-09-02  7:44     ` Peter Zijlstra
2021-09-02 10:50 ` [PATCH 25/24] x86/traps: Rewrite native_load_gs_index in " Lai Jiangshan
2021-09-08  1:38   ` H. Peter Anvin
2021-09-08  4:42     ` H. Peter Anvin
2021-09-08  5:00       ` H. Peter Anvin
2021-09-08  7:12         ` Lai Jiangshan
2021-09-09 23:16           ` H. Peter Anvin
2021-09-13 20:01   ` Andy Lutomirski
2021-09-14  2:04     ` Lai Jiangshan
2021-09-14  8:14       ` Peter Zijlstra
2021-09-14  8:17         ` Borislav Petkov
2021-09-14  8:40         ` Lai Jiangshan
