LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>,
	Oleg Nesterov <oleg@redhat.com>,
	Dave Hansen <dave.hansen@intel.com>, Borislav Petkov <bp@suse.de>,
	Pekka Riikonen <priikone@iki.fi>, Rik van Riel <riel@redhat.com>,
	Suresh Siddha <sbsiddha@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>,
	"Yu, Fenghua" <fenghua.yu@intel.com>,
	Quentin Casasnovas <quentin.casasnovas@oracle.com>,
	Denys Vlasenko <dvlasenk@redhat.com>
Subject: Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs
Date: Sun, 8 Mar 2015 09:55:08 +0100	[thread overview]
Message-ID: <20150308085508.GA13164@gmail.com> (raw)
In-Reply-To: <CA+55aFxU-4T7oMCJw3vZjL+-qFCzNwoqNLN-k2FxhPFcE0_GUA@mail.gmail.com>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Mar 7, 2015 at 2:36 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > We could save the same 10 cycles from page fault overhead as well,
> > AFAICS.
> 
> Are trap gates actually noticeably faster? Or is it just he
> "conditional_sti()" you're worried about?

( I'll talk about the CR2 complication later, please ignore that 
  problem for a moment. )

So I base my thinking on the following hierarchy of fast paths. In a 
typical x86 system there are 4 main types of 'context switches', in 
order of decreasing frequency:

   - syscall    'context switch': entry/exit from syscall
   - trap/fault 'context switch': entry/exit from trap/fault
   - irq        'context switch': entry/exit from irqs
   - task       'context switch': scheduler context-switch

Where each successive level is about an order of magnitude less 
frequently executed on a typical system than the previous one, so to 
optimize a level we are willing to push overhead to the next one(s).

So the primary payoff in executing much of the entry code with irqs 
enabled would be to allow 64-bit *syscalls* to be made without irqs 
disabled: the first, most important level of context entries.

Doing that would give us four (theoretical) performance advantages:

  - No implicit irq disabling overhead when the syscall instruction is
    executed: we could change MSR_SYSCALL_MASK from 0xc0000084 to
    0xc0000284, which removes the implicit CLI on syscall entry.

  - No explicit irq enabling overhead via ENABLE_INTERRUPTS() [STI] in
    system_call.

  - No explicit irq disabling overhead in the ret_from_sys_call fast 
    path, i.e. no DISABLE_INTERRUPTS() [CLI].

  - No implicit irq enabling overhead in ret_from_sys_call's 
    USERGS_SYSRET64: the SYSRETQ instruction would not have to 
    re-enable irqs as the user-space IF in R11 would match that of the 
    current IF.

whether that's an actual performance win in practice as well needs to 
be measured, but I'd be (very!) shocked if it wasn't in the 20+ cycles 
range: which is absolutely huge in terms of system_call optimizations.

Now do I think this could be done realistically? I think yes (by 
re-using the NMI code's paranoid entry codepaths for the irq code as 
well, essentially fixing up the effects of any partial entries in an 
irq entry slow path), but I could be wrong about that.

My main worry isn't even feasibility but maintainability and general 
robustness: all these asm cleanups we are doing now could enable such 
a trick to be pulled off robustly.

But I think it could be done technically, because the NMI code already 
has to be careful about 'preempting' partial entries, so we have the 
logic.

Complications:

 - We now enable double partial entries: partial syscall interrupted
   by an irq interrupted by an NMI context. I think it should all work
   out fine but it's a new scenario.

 - We'd enable interruptible return from system call, which caused
   trouble in the past. Solvable IMHO, by being careful in the irq 
   entry code.

 - We'd now have to be extra careful about the possibility of 
   exceptions triggered in the entry/exit code itself, triggered by 
   any sort of unusual register content or MMU fault.

Simplifications:

 - I'd ruthlessly disable IRQs for any sort of non fast path: for 
   example 32-bit compat entries, ptrace or any other slowpath - at 
   least initially until we map out the long term effects of this 
   optimization.

Does this scare me? Yes, I think it should scare any sane person, but 
I don't think it's all that bad: all the NMI paranoidentry work has 
already the trail blazed, and most of the races will be readily 
triggerable via regular irq loads, so it's not like we'll leave subtle 
bugs in there.

Being able to do the same with certain traps, because irq entry is 
careful about partial entry state, would just be a secondary bonus.

Regarding the CR2 value on page faults:

> Anyway, for page faulting, we traditionally actually wanted an 
> interrupt gate, because of how we wanted to avoid interrupts coming 
> in and possibly messing up %cr2 due to vmalloc faults, but more 
> importantly for preemption. [...]

Here too I think we could take a page from the NMI code: save cr2 in 
the page fault asm code, recognize from the irq code when we are 
interrupting that and dropping into a slowpath that saves cr2 right 
there. Potentially task-context-switching will be safe after that.

Saving cr2 in the early page fault code should be much less of an 
overhead than what the IRQ disabling/enabling costs, so this should be 
a win. The potential win could be similar to that of system calls:

  - Removal of an implicit 'CLI' in irq gates

  - Removal of the explicit 'STI' in conditional_sti in your proposed 
    code

  - Removal of an explicit 'CLI' (DISABLE_INTERRUPTS) in 
    error_exit/retint_swapgs.

  - Removal of an implicit 'STI' when SYSRET enables interrupts from R11

and the same savings would apply to FPU traps as well. I'd leave all 
other low frequency traps as interrupt gates: #GP, debug, int3, etc.

> [...] vmalloc faults are "harmless" because we'll notice that it's 
> already done, return, and then re-take the real fault. But a 
> preemption event before we read %cr2 can cause bad things to happen:
> 
>  - page fault pushes error code on stack, address in %cr2
> 
>  - we don't have interrupts disabled, and some interrupt comes in and
> causes preemption
> 
>  - some other process runs, take another page fault. %cr2 now is the
> wrong address
> 
>  - we go back to the original thread (perhaps on another cpu), which
> now reads %cr2 for the wrong address
> 
>  - we send the process a SIGSEGV because we think it's accessing
> memory that it has no place touching

I think none of this corruption happens if an interrupting context is 
aware of this and 'fixes up' the entry state accordingly. Am I missing 
anything subtle perhaps?

This would be arguably new (and tricky) code, as today the NMI code 
solves this problem by trying to never fault and thus never corrupt 
CR2. But ... unlike NMIs, this triggers so often via a mix of regular 
traps and regular irqs that I'm pretty sure it can be pulled off 
robustly.

> So the page fault code actually *needs* interrupts disabled until we 
> read %cr2. Stupid x86 trap semantics where the error code is on the 
> thread-safe stack, but %cr2 is not.
> 
> Maybe there is some trick I'm missing, but on the whole I think 
> "interrupt gate + conditional_sti()" does have things going for it. 
> Yes, it still leaves NMI as being special, but NMI really *is* 
> special.

So I think it's all doable, the payoffs are significant in terms of 
entry speedups.

But only on a clean code base, as the x86 entry assembly code was way 
beyond any sane maintainability threshold IMHO - it's fortunately 
improving rapidly these days due to the nice work from Andy and Denys!

What do you think?

Thanks,

	Ingo

  reply	other threads:[~2015-03-08  8:55 UTC|newest]

Thread overview: 126+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-04 18:30 Oops with tip/x86/fpu Dave Hansen
2015-03-04 19:06 ` Oleg Nesterov
2015-03-04 19:12   ` Dave Hansen
2015-03-04 20:06   ` Borislav Petkov
2015-03-05 15:14     ` Oleg Nesterov
     [not found]       ` <20150305182203.GA4203@redhat.com>
2015-03-05 18:34         ` Dave Hansen
2015-03-05 18:46           ` Oleg Nesterov
2015-03-05 18:41         ` Dave Hansen
2015-03-26 22:37         ` Yu, Fenghua
2015-03-26 22:43           ` Dave Hansen
2015-03-26 22:48             ` Yu, Fenghua
2015-03-27  7:30               ` Quentin Casasnovas
2015-03-27 19:06           ` Oleg Nesterov
2015-03-05  8:38   ` Quentin Casasnovas
2015-03-05 15:13     ` Oleg Nesterov
2015-03-05 18:42       ` Borislav Petkov
2015-03-05 22:16         ` Dave Hansen
2015-03-05 19:51 ` [PATCH 0/1] x86/fpu: math_state_restore() should not blindly disable irqs Oleg Nesterov
2015-03-05 19:51   ` [PATCH 1/1] " Oleg Nesterov
2015-03-05 20:11     ` Ingo Molnar
2015-03-05 21:25       ` Oleg Nesterov
2015-03-06  7:58         ` Ingo Molnar
2015-03-06 13:26           ` Oleg Nesterov
2015-03-06 13:39             ` Oleg Nesterov
2015-03-06 13:46             ` Ingo Molnar
2015-03-06 14:01               ` Oleg Nesterov
2015-03-06 14:17                 ` Oleg Nesterov
2015-03-06 15:00                 ` David Vrabel
2015-03-06 15:36                   ` Oleg Nesterov
2015-03-06 16:15                     ` David Vrabel
2015-03-06 16:31                       ` Oleg Nesterov
2015-03-06 17:33           ` Linus Torvalds
2015-03-06 18:15             ` Oleg Nesterov
2015-03-06 19:23             ` Andy Lutomirski
2015-03-06 22:00               ` Linus Torvalds
2015-03-06 22:28                 ` Andy Lutomirski
2015-03-07 10:36                   ` Ingo Molnar
2015-03-07 20:11                     ` Linus Torvalds
2015-03-08  8:55                       ` Ingo Molnar [this message]
2015-03-08 11:38                         ` Ingo Molnar
2015-03-08 13:59                         ` Andy Lutomirski
2015-03-08 14:38                           ` Andy Lutomirski
2015-03-07 10:32             ` Ingo Molnar
2015-03-07 15:38   ` [PATCH 0/1] x86/fpu: x86/fpu: avoid math_state_restore() without used_math() in __restore_xstate_sig() Oleg Nesterov
2015-03-07 15:38     ` [PATCH 1/1] " Oleg Nesterov
2015-03-09 14:07       ` Borislav Petkov
2015-03-09 14:34         ` Oleg Nesterov
2015-03-09 15:18           ` Borislav Petkov
2015-03-09 16:24             ` Oleg Nesterov
2015-03-09 16:53               ` Borislav Petkov
2015-03-09 17:05                 ` Oleg Nesterov
2015-03-09 17:23                   ` Borislav Petkov
2015-03-16 12:07       ` [tip:x86/urgent] x86/fpu: Avoid " tip-bot for Oleg Nesterov
2015-03-05 20:35 ` [PATCH 0/1] x86/fpu: math_state_restore() should not blindly disable irqs Oleg Nesterov
2015-03-09 17:10 ` [PATCH] x86/fpu: drop_fpu() should not assume that tsk == current Oleg Nesterov
2015-03-09 17:36   ` Rik van Riel
2015-03-09 17:48   ` Borislav Petkov
2015-03-09 18:06     ` Oleg Nesterov
2015-03-09 18:10       ` Borislav Petkov
2015-03-16 12:07   ` [tip:x86/urgent] x86/fpu: Drop_fpu() should not assume that tsk equals current tip-bot for Oleg Nesterov
2015-03-11 17:33 ` [PATCH 0/4] x86/fpu: avoid math_state_restore() on kthread exec Oleg Nesterov
2015-03-11 17:34   ` [PATCH 1/4] x86/fpu: document user_fpu_begin() Oleg Nesterov
2015-03-13  9:47     ` Borislav Petkov
2015-03-13 14:34       ` Oleg Nesterov
2015-03-23 12:20     ` [tip:x86/fpu] x86/fpu: Document user_fpu_begin() tip-bot for Oleg Nesterov
2015-03-11 17:34   ` [PATCH 2/4] x86/fpu: introduce restore_init_xstate() Oleg Nesterov
2015-03-13 10:34     ` Borislav Petkov
2015-03-13 14:39       ` Oleg Nesterov
2015-03-13 15:20         ` Borislav Petkov
2015-03-16 19:05           ` Rik van Riel
2015-03-23 12:20     ` [tip:x86/fpu] x86/fpu: Introduce restore_init_xstate() tip-bot for Oleg Nesterov
2015-03-11 17:34   ` [PATCH 3/4] x86/fpu: use restore_init_xstate() instead of math_state_restore() on kthread exec Oleg Nesterov
2015-03-13 10:48     ` Borislav Petkov
2015-03-13 14:45       ` Oleg Nesterov
2015-03-13 15:51         ` Borislav Petkov
2015-03-23 12:21     ` [tip:x86/fpu] x86/fpu: Use " tip-bot for Oleg Nesterov
2015-03-11 17:35   ` [PATCH 4/4] x86/fpu: don't abuse drop_init_fpu() in flush_thread() Oleg Nesterov
2015-03-13 10:52     ` Borislav Petkov
2015-03-13 14:55       ` Oleg Nesterov
2015-03-13 16:19         ` Borislav Petkov
2015-03-13 16:26           ` Oleg Nesterov
2015-03-13 19:27             ` Borislav Petkov
2015-03-14 14:48               ` Oleg Nesterov
2015-03-15 17:36                 ` Borislav Petkov
2015-03-15 18:16                   ` Oleg Nesterov
2015-03-15 18:50                     ` Borislav Petkov
2015-03-15 20:04                       ` Oleg Nesterov
2015-03-15 20:38                         ` Borislav Petkov
2015-03-16  9:35                           ` Borislav Petkov
2015-03-16 10:28                             ` Ingo Molnar
2015-03-16 14:39                             ` Oleg Nesterov
2015-03-16 15:26                               ` Borislav Petkov
2015-03-16 15:34                             ` Andy Lutomirski
2015-03-16 15:35                               ` Borislav Petkov
2015-03-13 17:30     ` [PATCH v2 " Oleg Nesterov
2015-03-14 10:55       ` Borislav Petkov
2015-03-14 10:57         ` [PATCH] x86/fpu: Fold __drop_fpu() into its sole user Borislav Petkov
2015-03-14 15:15           ` Oleg Nesterov
2015-03-16 10:27           ` Ingo Molnar
2015-03-23 12:21       ` [tip:x86/fpu] x86/fpu: Don't abuse drop_init_fpu() in flush_thread() tip-bot for Oleg Nesterov
2015-03-13 18:26 ` [PATCH 0/1] x86/cpu: don't allocate fpu->state for swapper/0 Oleg Nesterov
2015-03-13 18:27   ` [PATCH 1/1] " Oleg Nesterov
2015-03-16 10:18     ` Borislav Petkov
2015-03-23 12:22     ` [tip:x86/fpu] x86/fpu: Don't " tip-bot for Oleg Nesterov
2015-03-14 11:16   ` [PATCH 0/1] x86/cpu: don't " Borislav Petkov
2015-03-14 15:13     ` [PATCH 0/1] x86/cpu: kill eager_fpu_init_bp() Oleg Nesterov
2015-03-14 15:13       ` [PATCH 1/1] " Oleg Nesterov
2015-03-16 12:44         ` Borislav Petkov
2015-03-23 12:22         ` [tip:x86/fpu] x86/fpu: Kill eager_fpu_init_bp() tip-bot for Oleg Nesterov
2015-03-15 16:49 ` [PATCH RFC 0/2] x86/fpu: avoid "xstate_fault" in xsave_user/xrestore_user Oleg Nesterov
2015-03-15 16:50   ` [PATCH RFC 1/2] x86: introduce __user_insn() and __check_insn() Oleg Nesterov
2015-03-15 16:50   ` [PATCH RFC 2/2] x86/fpu: change xsave_user() and xrestore_user() to use __user_insn() Oleg Nesterov
2015-03-16 22:43     ` Quentin Casasnovas
2015-03-17  9:35       ` Borislav Petkov
2015-03-16 14:36   ` [PATCH RFC 0/2] x86/fpu: avoid "xstate_fault" in xsave_user/xrestore_user Borislav Petkov
2015-03-16 14:57     ` Oleg Nesterov
2015-03-16 17:58       ` Borislav Petkov
2015-03-16 22:37   ` Quentin Casasnovas
2015-03-17  9:47     ` Borislav Petkov
2015-03-17 10:00       ` Quentin Casasnovas
2015-03-17 11:20         ` Borislav Petkov
2015-03-17 11:36           ` Quentin Casasnovas
2015-03-17 12:07             ` Borislav Petkov
2015-03-18  9:06               ` Quentin Casasnovas
2015-03-18  9:53                 ` Borislav Petkov
2015-03-17 10:07       ` Quentin Casasnovas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150308085508.GA13164@gmail.com \
    --to=mingo@kernel.org \
    --cc=bp@suse.de \
    --cc=dave.hansen@intel.com \
    --cc=dvlasenk@redhat.com \
    --cc=fenghua.yu@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=oleg@redhat.com \
    --cc=priikone@iki.fi \
    --cc=quentin.casasnovas@oracle.com \
    --cc=riel@redhat.com \
    --cc=sbsiddha@gmail.com \
    --cc=torvalds@linux-foundation.org \
    --subject='Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).