LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
@ 2015-02-20 18:58 Andy Lutomirski
  2015-02-20 19:05 ` Borislav Petkov
                   ` (4 more replies)
  0 siblings, 5 replies; 45+ messages in thread
From: Andy Lutomirski @ 2015-02-20 18:58 UTC (permalink / raw)
  To: Oleg Nesterov, Rik van Riel, x86, linux-kernel, Borislav Petkov
  Cc: Andy Lutomirski

We have eager and lazy fpu modes, introduced in:

304bceda6a18 x86, fpu: use non-lazy fpu restore for processors supporting xsave

The result is rather messy.  There are two code paths in almost all of the
FPU code, and only one of them (the eager case) is tested frequently, since
most kernel developers have new enough hardware that we use eagerfpu.

It seems that, on any remotely recent hardware, eagerfpu is a win:
glibc uses SSE2, so laziness is probably overoptimistic, and, in any
case, manipulating TS is far slower that saving and restoring the full
state.

To try to shake out any latent issues on old hardware, this changes
the default to eager on all CPUs.  If no performance or functionality
problems show up, a subsequent patch could remove lazy mode entirely.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/xsave.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 0de1fae2bdf0..1928c8f34ce5 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -637,8 +637,8 @@ static void __init xstate_enable_boot_cpu(void)
 	prepare_fx_sw_frame();
 	setup_init_fpu_buf();
 
-	/* Auto enable eagerfpu for xsaveopt */
-	if (cpu_has_xsaveopt && eagerfpu != DISABLE)
+	/* Auto enable eagerfpu for everyone */
+	if (eagerfpu != DISABLE)
 		eagerfpu = ENABLE;
 
 	if (pcntxt_mask & XSTATE_EAGER) {
-- 
2.1.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-20 18:58 [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs Andy Lutomirski
@ 2015-02-20 19:05 ` Borislav Petkov
  2015-02-21  9:31 ` Ingo Molnar
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 45+ messages in thread
From: Borislav Petkov @ 2015-02-20 19:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Oleg Nesterov, Rik van Riel, x86, linux-kernel, Linus Torvalds

+ Linus.

I'm sure he'll save something to say about it :-)

On Fri, Feb 20, 2015 at 10:58:15AM -0800, Andy Lutomirski wrote:
> We have eager and lazy fpu modes, introduced in:
> 
> 304bceda6a18 x86, fpu: use non-lazy fpu restore for processors supporting xsave
> 
> The result is rather messy.  There are two code paths in almost all of the
> FPU code, and only one of them (the eager case) is tested frequently, since
> most kernel developers have new enough hardware that we use eagerfpu.
> 
> It seems that, on any remotely recent hardware, eagerfpu is a win:
> glibc uses SSE2, so laziness is probably overoptimistic, and, in any
> case, manipulating TS is far slower that saving and restoring the full
> state.
> 
> To try to shake out any latent issues on old hardware, this changes
> the default to eager on all CPUs.  If no performance or functionality
> problems show up, a subsequent patch could remove lazy mode entirely.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/kernel/xsave.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
> index 0de1fae2bdf0..1928c8f34ce5 100644
> --- a/arch/x86/kernel/xsave.c
> +++ b/arch/x86/kernel/xsave.c
> @@ -637,8 +637,8 @@ static void __init xstate_enable_boot_cpu(void)
>  	prepare_fx_sw_frame();
>  	setup_init_fpu_buf();
>  
> -	/* Auto enable eagerfpu for xsaveopt */
> -	if (cpu_has_xsaveopt && eagerfpu != DISABLE)
> +	/* Auto enable eagerfpu for everyone */
> +	if (eagerfpu != DISABLE)
>  		eagerfpu = ENABLE;
>  
>  	if (pcntxt_mask & XSTATE_EAGER) {
> -- 
> 2.1.0
> 
> 

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-20 18:58 [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs Andy Lutomirski
  2015-02-20 19:05 ` Borislav Petkov
@ 2015-02-21  9:31 ` Ingo Molnar
  2015-02-21 16:38   ` Borislav Petkov
  2015-02-23 14:59 ` Oleg Nesterov
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2015-02-21  9:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Oleg Nesterov, Rik van Riel, x86, linux-kernel, Borislav Petkov


* Andy Lutomirski <luto@amacapital.net> wrote:

> We have eager and lazy fpu modes, introduced in:
> 
> 304bceda6a18 x86, fpu: use non-lazy fpu restore for processors supporting xsave
> 
> The result is rather messy.  There are two code paths in 
> almost all of the FPU code, and only one of them (the 
> eager case) is tested frequently, since most kernel 
> developers have new enough hardware that we use eagerfpu.
> 
> It seems that, on any remotely recent hardware, eagerfpu 
> is a win: glibc uses SSE2, so laziness is probably 
> overoptimistic, and, in any case, manipulating TS is far 
> slower that saving and restoring the full state.
> 
> To try to shake out any latent issues on old hardware, 
> this changes the default to eager on all CPUs.  If no 
> performance or functionality problems show up, a 
> subsequent patch could remove lazy mode entirely.

So it would be nice to test this on at least one reasonably 
old (but not uncomfortably old - say 5 years old) system, 
to get a feel for what kind of performance impact it has 
there.

But yes, this would enable a nice simplification in the end 
so I'm all for it as long as it doesn't cause unacceptable 
problems - and the FPU code needs simplification badly, 
because the current latency of bug discovery is too high 
IMO.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-21  9:31 ` Ingo Molnar
@ 2015-02-21 16:38   ` Borislav Petkov
  2015-02-21 17:29     ` Borislav Petkov
  2015-02-21 18:34     ` Ingo Molnar
  0 siblings, 2 replies; 45+ messages in thread
From: Borislav Petkov @ 2015-02-21 16:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86, linux-kernel,
	Linus Torvalds

On Sat, Feb 21, 2015 at 10:31:50AM +0100, Ingo Molnar wrote:
> So it would be nice to test this on at least one reasonably old (but
> not uncomfortably old - say 5 years old) system, to get a feel for
> what kind of performance impact it has there.

Yeah, this is exactly what Andy and I were talking about yesterday on
IRC. So let's measure our favourite workload - the kernel build! :-) My
assumption is that libc uses SSE for memcpy and thus the FPU will be
used. (I'll trace FPU-specific PMCs later to confirm).

Machine is an AMD F10h which should be 5-10 years old depending on what
you're looking at (uarch, revision, ...).

Numbers look great to me in the sense that we have a very small
improvement and the rest stays the same. Which would mean, killing lazy
FPU does not bring slowdown, if no improvement, but will bring a huuge
improvement in code quality and the handling of the FPU state by getting
rid of the lazyness...

IPC is the same, branch misses are *down* a bit, cache misses go up a
bit probably because we're shuffling FPU state more often to mem, page
faults go down and runtime goes down by half a second:

plain 3.19:
==========

perf stat -a -e task-clock,cycles,instructions,branch-misses,cache-misses,faults,context-switches,migrations --repeat 10 --sync --pre ~/bin/pre-build-kernel.sh make -s -j12

 Performance counter stats for 'system wide' (10 runs):

    1408897.576594      task-clock (msec)         #    6.003 CPUs utilized            ( +-  0.15% ) [100.00%]
 3,137,565,760,188      cycles                    #    2.227 GHz                      ( +-  0.02% ) [100.00%]
 2,849,228,161,721      instructions              #    0.91  insns per cycle          ( +-  0.00% ) [100.00%]
    32,391,188,891      branch-misses             #   22.990 M/sec                    ( +-  0.02% ) [100.00%]
    27,879,813,595      cache-misses              #   19.788 M/sec                    ( +-  0.01% )
        27,195,402      faults                    #    0.019 M/sec                    ( +-  0.01% ) [100.00%]
         1,293,241      context-switches          #    0.918 K/sec                    ( +-  0.09% ) [100.00%]
            69,548      migrations                #    0.049 K/sec                    ( +-  0.22% )

     234.681331200 seconds time elapsed                                          ( +-  0.15% )


eagerfpu=ENABLE
===============

 Performance counter stats for 'system wide' (10 runs):

    1405208.771580      task-clock (msec)         #    6.003 CPUs utilized            ( +-  0.19% ) [100.00%]
 3,137,381,829,748      cycles                    #    2.233 GHz                      ( +-  0.03% ) [100.00%]
 2,849,059,336,718      instructions              #    0.91  insns per cycle          ( +-  0.00% ) [100.00%]
    32,380,999,636      branch-misses             #   23.044 M/sec                    ( +-  0.02% ) [100.00%]
    27,884,281,327      cache-misses              #   19.844 M/sec                    ( +-  0.01% )
        27,193,985      faults                    #    0.019 M/sec                    ( +-  0.01% ) [100.00%]
         1,293,300      context-switches          #    0.920 K/sec                    ( +-  0.08% ) [100.00%]
            69,791      migrations                #    0.050 K/sec                    ( +-  0.18% )

     234.066525648 seconds time elapsed                                          ( +-  0.19% )


-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-21 16:38   ` Borislav Petkov
@ 2015-02-21 17:29     ` Borislav Petkov
  2015-02-21 18:39       ` Ingo Molnar
  2015-02-22  0:34       ` Maciej W. Rozycki
  2015-02-21 18:34     ` Ingo Molnar
  1 sibling, 2 replies; 45+ messages in thread
From: Borislav Petkov @ 2015-02-21 17:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86, linux-kernel,
	Linus Torvalds

On Sat, Feb 21, 2015 at 05:38:40PM +0100, Borislav Petkov wrote:
> My assumption is that libc uses SSE for memcpy and thus the FPU will
> be used. (I'll trace FPU-specific PMCs later to confirm).

Ok, so I slapped a trace_printk() at the beginning of fpu_save_init()
and did a kernel build once with default 3.19 and once with Andy's
patch. So we end up with this count:

default: 712000
eager:   780000

This would explain the very small difference in the performance counters
data from the previous email.

Provided I've not made a mistake, this leads me to think that this
simple workload and pretty much everything else uses the FPU through
glibc which does the SSE memcpy and so on. Which basically kills the
whole idea behind lazy FPU as practically you don't really encounter
workloads nowadays which don't use the FPU thanks to glibc and the lazy
strategy doesn't really bring anything.

Which would then mean, we don't really need the lazy handling as
userspace is making it eager, so to speak, for us.

Thoughts?

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-21 16:38   ` Borislav Petkov
  2015-02-21 17:29     ` Borislav Petkov
@ 2015-02-21 18:34     ` Ingo Molnar
  1 sibling, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2015-02-21 18:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86, linux-kernel,
	Linus Torvalds


* Borislav Petkov <bp@alien8.de> wrote:

> plain 3.19:
>
>      234.681331200 seconds time elapsed                                          ( +-  0.15% )
>
> eagerfpu=ENABLE
>
>      234.066525648 seconds time elapsed                                          ( +-  0.19% )

hm, a win of more than 600 milliseconds - more than I'd 
have expected.

I really want to believe this, but I worry about the 
systematic boot-to-boot noise though, which cannot be 
eliminated via --repeat 10.

Would it be possible to add a simple runtime debug switch 
to switch between the two FPU modes dynamically via a 
sysctl?

It should not crash tasks I think if it's flipped around on 
a reasonably idle system, so should allow much better 
apples-to-apples comparisons with the same kind of page 
cache layout, etc.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-21 17:29     ` Borislav Petkov
@ 2015-02-21 18:39       ` Ingo Molnar
  2015-02-21 19:15         ` Borislav Petkov
  2015-02-22  0:34       ` Maciej W. Rozycki
  1 sibling, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2015-02-21 18:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86, linux-kernel,
	Linus Torvalds


* Borislav Petkov <bp@alien8.de> wrote:

> On Sat, Feb 21, 2015 at 05:38:40PM +0100, Borislav Petkov wrote:
> > My assumption is that libc uses SSE for memcpy and thus the FPU will
> > be used. (I'll trace FPU-specific PMCs later to confirm).
> 
> Ok, so I slapped a trace_printk() at the beginning of fpu_save_init()
> and did a kernel build once with default 3.19 and once with Andy's
> patch. So we end up with this count:
> 
> default: 712000
> eager:   780000
> 
> This would explain the very small difference in the 
> performance counters data from the previous email.

So the workload improved by ~600,000 usecs, and there's 
68,000 less calls, so it saved 8.8 usecs per call. Isn't 
that a bit too high?

I'd sleep a lot better if we had some runtime debug flag to 
be able to do run-to-run comparisons on the same booted up 
kernel, or so.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-21 18:39       ` Ingo Molnar
@ 2015-02-21 19:15         ` Borislav Petkov
  2015-02-21 19:23           ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Borislav Petkov @ 2015-02-21 19:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86, linux-kernel,
	Linus Torvalds

On Sat, Feb 21, 2015 at 07:39:52PM +0100, Ingo Molnar wrote:
> So the workload improved by ~600,000 usecs, and there's 
> 68,000 less calls, so it saved 8.8 usecs per call. Isn't 

I think you mean more calls. The eager measurement has more calls. Let
me do some primitive math:

def  =(234.681331200 / 712000)*10^6 = 329.60861123595505000000 microsecs/call
eager=(234.066525648 / 780000)*10^6 = 300.08528929230769000000 microsecs/call

diff is 29.52332194364736000000 microsecs speedup per call which could
explain the cost of CR0.TS serialization semantics in the lazy mode.

> that a bit too high?

Now, is 29 microseconds too high? I'm not sure this is even correct and
not some noise interfering.

> I'd sleep a lot better if we had some runtime debug flag to 
> be able to do run-to-run comparisons on the same booted up 
> kernel, or so.

Let me take a look whether we could so some knob... The nice thing is,
code uses use_eager_fpu() to check stuff so we should be able to clear
the cpufeature flag.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-21 19:15         ` Borislav Petkov
@ 2015-02-21 19:23           ` Ingo Molnar
  2015-02-21 21:36             ` Borislav Petkov
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2015-02-21 19:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86, linux-kernel,
	Linus Torvalds


* Borislav Petkov <bp@alien8.de> wrote:

> > I'd sleep a lot better if we had some runtime debug 
> > flag to be able to do run-to-run comparisons on the 
> > same booted up kernel, or so.
> 
> Let me take a look whether we could so some knob... The 
> nice thing is, code uses use_eager_fpu() to check stuff 
> so we should be able to clear the cpufeature flag.

Either that, or you could quickly hack those checks to use 
panic_timeout, add an 'extern int panic_timeout;' and after 
bootup do:

   echo 0 > /proc/sys/kernel/panic
   echo 1 > /proc/sys/kernel/panic

to switch between the modes?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-21 19:23           ` Ingo Molnar
@ 2015-02-21 21:36             ` Borislav Petkov
  2015-02-22  8:18               ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Borislav Petkov @ 2015-02-21 21:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86, linux-kernel,
	Linus Torvalds

On Sat, Feb 21, 2015 at 08:23:52PM +0100, Ingo Molnar wrote:
> to switch between the modes?

I went all out and did a debugfs file, see patch at the end, which
counts FPU saves. Then I ran this script:

---
#!/bin/bash

D="/sys/kernel/debug/fpu/eager"

echo "Lazy FPU: "
echo 0 > $D
echo -n "  FPU saves before: "; cat $D

perf stat -a -e task-clock,cycles,instructions,branch-misses,cache-misses,faults,context-switches,migrations --sync --pre ~/bin/pre-build-kernel.sh make -s -j12

echo -n "  FPU saves after: "; cat $D

echo ""
echo "Eager FPU: "
echo 1 > $D
echo -n "  FPU saves before: "; cat $D

perf stat -a -e task-clock,cycles,instructions,branch-misses,cache-misses,faults,context-switches,migrations --sync --pre ~/bin/pre-build-kernel.sh make -s -j12

echo -n "  FPU saves after: "; cat $D
---

which spit this:

Lazy FPU:
  FPU saves before: 3
Setup is 16252 bytes (padded to 16384 bytes).
System is 4222 kB
CRC c79a13ab
Kernel: arch/x86/boot/bzImage is ready  (#41)

 Performance counter stats for 'system wide':

    1315527.989020      task-clock (msec)         #    6.003 CPUs utilized           [100.00%]
 3,042,312,057,208      cycles                    #    2.313 GHz                     [100.00%]
 2,790,807,863,402      instructions              #    0.92  insns per cycle         [100.00%]
    31,658,299,111      branch-misses             #   24.065 M/sec                   [100.00%]
    27,504,255,277      cache-misses              #   20.907 M/sec
        26,802,015      faults                    #    0.020 M/sec                   [100.00%]
         1,248,899      context-switches          #    0.949 K/sec                   [100.00%]
            69,553      migrations                #    0.053 K/sec

     219.127929718 seconds time elapsed

  FPU saves after: 704186

Eager FPU:
  FPU saves before: 4
Setup is 16252 bytes (padded to 16384 bytes).
System is 4222 kB
CRC 6767bb2e
Kernel: arch/x86/boot/bzImage is ready  (#42)

 Performance counter stats for 'system wide':

    1321651.543922      task-clock (msec)         #    6.003 CPUs utilized           [100.00%]
 3,044,403,437,364      cycles                    #    2.303 GHz                     [100.00%]
 2,790,835,886,565      instructions              #    0.92  insns per cycle         [100.00%]
    31,638,090,259      branch-misses             #   23.938 M/sec                   [100.00%]
    27,491,643,095      cache-misses              #   20.801 M/sec
        26,869,732      faults                    #    0.020 M/sec                   [100.00%]
         1,252,034      context-switches          #    0.947 K/sec                   [100.00%]
            69,247      migrations                #    0.052 K/sec

     220.148034331 seconds time elapsed

  FPU saves after: 901638

---
so we have a second slowdown and 200K FPU saves more in eager mode.

Provided I've not done a mistake, looks like the increase in cycles gets
mirrored in 1 second time longer. I've not done the --repeat 10 thing
again, maybe I should do it too, just to be fair as this is a single
run.

---
 arch/x86/include/asm/fpu-internal.h |  4 ++++
 arch/x86/kernel/xsave.c             | 47 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index e97622f57722..7141f353e960 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -38,6 +38,8 @@ int ia32_setup_frame(int sig, struct ksignal *ksig,
 # define ia32_setup_rt_frame	__setup_rt_frame
 #endif
 
+
+extern unsigned long fpu_saved;
 extern unsigned int mxcsr_feature_mask;
 extern void fpu_init(void);
 extern void eager_fpu_init(void);
@@ -242,6 +244,8 @@ static inline void fpu_fxsave(struct fpu *fpu)
  */
 static inline int fpu_save_init(struct fpu *fpu)
 {
+	fpu_saved++;
+
 	if (use_xsave()) {
 		fpu_xsave(fpu);
 
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 0de1fae2bdf0..029de8b629d0 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -14,6 +14,8 @@
 #include <asm/sigframe.h>
 #include <asm/xcr.h>
 
+#include <linux/debugfs.h>
+
 /*
  * Supported feature mask by the CPU and the kernel.
  */
@@ -638,7 +640,7 @@ static void __init xstate_enable_boot_cpu(void)
 	setup_init_fpu_buf();
 
 	/* Auto enable eagerfpu for xsaveopt */
-	if (cpu_has_xsaveopt && eagerfpu != DISABLE)
+	if (eagerfpu != DISABLE)
 		eagerfpu = ENABLE;
 
 	if (pcntxt_mask & XSTATE_EAGER) {
@@ -739,3 +741,46 @@ void *get_xsave_addr(struct xsave_struct *xsave, int xstate)
 	return (void *)xsave + xstate_comp_offsets[feature];
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
+
+unsigned long fpu_saved;
+
+static int eager_get(void *data, u64 *val)
+{
+	*val = fpu_saved;
+
+	return 0;
+}
+
+static int eager_set(void *data, u64 val)
+{
+	if (val)
+		setup_force_cpu_cap(X86_FEATURE_EAGER_FPU);
+	else
+		setup_clear_cpu_cap(X86_FEATURE_EAGER_FPU);
+
+	fpu_saved = 0;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(eager_fops, eager_get, eager_set, "%llu\n");
+
+static int __init setup_eagerfpu_knob(void)
+{
+	static struct dentry *d_eager, *f_eager;
+
+	d_eager = debugfs_create_dir("fpu", NULL);
+	if (!d_eager) {
+		pr_err("Error creating fpu debugfs dir\n");
+		return -ENOMEM;
+	}
+
+	f_eager = debugfs_create_file("eager", 0644, d_eager, NULL, &eager_fops);
+	if (!f_eager) {
+		pr_err("Error creating fpu debugfs node\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+late_initcall(setup_eagerfpu_knob);
-- 
2.2.0.33.gc18b867

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-21 17:29     ` Borislav Petkov
  2015-02-21 18:39       ` Ingo Molnar
@ 2015-02-22  0:34       ` Maciej W. Rozycki
  2015-02-22  2:18         ` Andy Lutomirski
  1 sibling, 1 reply; 45+ messages in thread
From: Maciej W. Rozycki @ 2015-02-22  0:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86,
	linux-kernel, Linus Torvalds

On Sat, 21 Feb 2015, Borislav Petkov wrote:

> Provided I've not made a mistake, this leads me to think that this
> simple workload and pretty much everything else uses the FPU through
> glibc which does the SSE memcpy and so on. Which basically kills the
> whole idea behind lazy FPU as practically you don't really encounter
> workloads nowadays which don't use the FPU thanks to glibc and the lazy
> strategy doesn't really bring anything.
> 
> Which would then mean, we don't really need the lazy handling as
> userspace is making it eager, so to speak, for us.

 Please correct me if I'm wrong, but it looks to me like you're confusing 
lazy FPU context allocation and lazy FPU context switching.  These build 
on the same hardware principles, but they are different concepts.

 Your "userspace is making it eager" statement in the context of glibc 
using SSE for `memcpy' is certainly true for lazy FPU context allocation, 
however I wouldn't be so sure about lazy FPU context switching, and a 
kernel compilation (or in fact any compilation) does not appear to be a 
representative benchmark to me.  I am sure lots of software won't be 
calling `memcpy' all the time, there should be context switches between 
which the FPU is not referred to at all.

 Also, does `__builtin_memcpy' also expand to SSE?  I'd expect it rather 
than external `memcpy' to be used by GCC for copying fixed amounts of 
data, especially smaller ones such as when passing structures by value in 
function calls or for string operations like `strdup' or suchlike.  These 
I'd expect to be ubiquitous, whereas external `memcpy' I'd expect to be 
called from time to time only.

 Additionally I believe long-executing FPU instructions (i.e. 
transcendentals) can take advantage of continuing to execute in parallel 
where the context has already been switched rather than stalling an eager 
FPU context switch until the FPU instruction has completed.

 And last but not least, why does the handling of CR0.TS traps have to be 
complicated?  It does not look like rocket science to me, it should be a 
mere handful of instructions, the time required to move the two FP 
contexts out from and in to the FPU respectively should dominate 
processing time.  Where quoted the optimisation manual states 250 cycles 
for FXSAVE and FXRSTOR combined.

 And of course you can install the right handler (i.e. FSAVE vs FXSAVE) at 
bootstrap depending on processor features, you don't have to do all the 
run-time check on every trap.  You can even optimise the FSAVE handler 
away at the build time if you know it won't ever be used based on the 
minimal supported processor family selected.

 Do you happen to know or can determine how much time (in clock cycles) a 
CR0.TS trap itself takes, including any time required to preserve the 
execution state in the handler such as pushing/popping GPRs to/from the 
stack (as opposed to processing time spent on moving the FP contexts back 
and forth)?  Is there no room for improvement left there?  How many task 
scheduling slots say per million must be there poking at the FPU for eager 
FPU context switching to take advantage over lazy one?

  Maciej

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-22  0:34       ` Maciej W. Rozycki
@ 2015-02-22  2:18         ` Andy Lutomirski
  2015-02-22 11:06           ` Borislav Petkov
  2015-02-23 21:17           ` Maciej W. Rozycki
  0 siblings, 2 replies; 45+ messages in thread
From: Andy Lutomirski @ 2015-02-22  2:18 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Borislav Petkov, Ingo Molnar, Oleg Nesterov, Rik van Riel,
	X86 ML, linux-kernel, Linus Torvalds

On Sat, Feb 21, 2015 at 4:34 PM, Maciej W. Rozycki <macro@linux-mips.org> wrote:
> On Sat, 21 Feb 2015, Borislav Petkov wrote:
>
>> Provided I've not made a mistake, this leads me to think that this
>> simple workload and pretty much everything else uses the FPU through
>> glibc which does the SSE memcpy and so on. Which basically kills the
>> whole idea behind lazy FPU as practically you don't really encounter
>> workloads nowadays which don't use the FPU thanks to glibc and the lazy
>> strategy doesn't really bring anything.
>>
>> Which would then mean, we don't really need the lazy handling as
>> userspace is making it eager, so to speak, for us.
>
>  Please correct me if I'm wrong, but it looks to me like you're confusing
> lazy FPU context allocation and lazy FPU context switching.  These build
> on the same hardware principles, but they are different concepts.
>
>  Your "userspace is making it eager" statement in the context of glibc
> using SSE for `memcpy' is certainly true for lazy FPU context allocation,
> however I wouldn't be so sure about lazy FPU context switching, and a
> kernel compilation (or in fact any compilation) does not appear to be a
> representative benchmark to me.  I am sure lots of software won't be
> calling `memcpy' all the time, there should be context switches between
> which the FPU is not referred to at all.

That's true.  The question is whether there are enough of them, and
whether twiddling TS is fast enough, that it's worth it.

>
>  Additionally I believe long-executing FPU instructions (i.e.
> transcendentals) can take advantage of continuing to execute in parallel
> where the context has already been switched rather than stalling an eager
> FPU context switch until the FPU instruction has completed.

It seems highly unlikely to me that a slow FPU instruction can retire
*after* a subsequent fxsave, which would need to happen for this to
work.

>
>  And last but not least, why does the handling of CR0.TS traps have to be
> complicated?  It does not look like rocket science to me, it should be a
> mere handful of instructions, the time required to move the two FP
> contexts out from and in to the FPU respectively should dominate
> processing time.  Where quoted the optimisation manual states 250 cycles
> for FXSAVE and FXRSTOR combined.

The TS traps aren't complicated -- they're just really slow.  I think
that each of setting *and* clearing TS serializes and takes well over
a hundred cycles.  A #NM interrupt (the thing that happens if you try
to use the FPU with TS set) serializes and does all kinds of slow
things, so it takes many hundreds of cycles.  The handler needs to
clear TS (another hundred cycles or more), load the FPU state
(actually rather fast on modern CPUs), and then IRET back to userspace
(hundreds of cycles).  This adds up to a lot of cycles.  A round trip
through an exception handler seems to be thousands of cycles.

>
>  And of course you can install the right handler (i.e. FSAVE vs FXSAVE) at
> bootstrap depending on processor features, you don't have to do all the
> run-time check on every trap.  You can even optimise the FSAVE handler
> away at the build time if you know it won't ever be used based on the
> minimal supported processor family selected.

None of this matters.  A couple of branches in an #NM handler are
completely lost in the noise.

>
>  Do you happen to know or can determine how much time (in clock cycles) a
> CR0.TS trap itself takes, including any time required to preserve the
> execution state in the handler such as pushing/popping GPRs to/from the
> stack (as opposed to processing time spent on moving the FP contexts back
> and forth)?  Is there no room for improvement left there?  How many task
> scheduling slots say per million must be there poking at the FPU for eager
> FPU context switching to take advantage over lazy one?

Thousands of cycles.  Considerably worse in an SVM guest.  x86
exception handling sucks.

--Andy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-21 21:36             ` Borislav Petkov
@ 2015-02-22  8:18               ` Ingo Molnar
  2015-02-22  8:22                 ` Ingo Molnar
                                   ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Ingo Molnar @ 2015-02-22  8:18 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86, linux-kernel,
	Linus Torvalds, Arjan van de Ven


* Borislav Petkov <bp@alien8.de> wrote:

> which spit this:
> 
> Lazy FPU:
>      219.127929718 seconds time elapsed

> Eager FPU:
>      220.148034331 seconds time elapsed

> so we have a second slowdown and 200K FPU saves more in eager mode.

So am I interpreting the older and your latest numbers 
correctly in stating that the cost observation has flipped 
around 180 degrees: the first measurement showed eager FPU 
to be a win, but now that we can do more precise 
measurements, eager FPU has actually slowed down the kernel 
build by ~0.5%?

That's not good, and kernel builds are just a random load 
that isn't even that FPU or context switch heavy - there 
will certainly be other loads that would be hurt even more.

So just before we base wide reaching decisions based on any 
of these measurements, would you mind help us increase our 
confidence in the numbers some more:

  - It might make sense to do a 'perf stat --null --repeat'
    measurement as well [without any -e arguments], to make 
    sure the rich PMU stats you are gathering are not 
    interfering?

    With 'perf stat --null --repeat' perf acts essenially 
    as a /usr/bin/time replacement, but can measure down to
    microseconds and will calculate noise/sttdev properly.

  - Perhaps also double check the debug switch: is it
    really properly switching FPU handling mode?

  - Do you have enough RAM that there's essentially no IO
    in the system worth speaking of? Do you have enough RAM
    to copy a whole kernel tree to /tmp/linux/ and do the
    measurement there, on ramfs?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-22  8:18               ` Ingo Molnar
@ 2015-02-22  8:22                 ` Ingo Molnar
  2015-02-22 10:48                 ` Borislav Petkov
  2015-02-22 12:50                 ` Borislav Petkov
  2 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2015-02-22  8:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86, linux-kernel,
	Linus Torvalds, Arjan van de Ven


* Ingo Molnar <mingo@kernel.org> wrote:

>   - Do you have enough RAM that there's essentially no IO
>     in the system worth speaking of? Do you have enough RAM
>     to copy a whole kernel tree to /tmp/linux/ and do the
>     measurement there, on ramfs?

Doing that will also pin down the page cache: kernel build 
times are very sensitive to the page cache layout, and once 
a page cache layout is established on systems with lots of 
RAM it does not tend to be flushed out. But the next bootup 
will generate another random page cache layout - which 
makes inter-kernel kernel build times comparisons much 
noiser than the run-to-run numbers suggest.

So to get more precise measurements a 'pinned' page cache 
layout and the dynamic debug switch you implemented is very 
helpful and --repeat stddev will be pretty representative 
of the true noise of the measurement.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-22  8:18               ` Ingo Molnar
  2015-02-22  8:22                 ` Ingo Molnar
@ 2015-02-22 10:48                 ` Borislav Petkov
  2015-02-22 12:50                 ` Borislav Petkov
  2 siblings, 0 replies; 45+ messages in thread
From: Borislav Petkov @ 2015-02-22 10:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86, linux-kernel,
	Linus Torvalds, Arjan van de Ven

On Sun, Feb 22, 2015 at 09:18:40AM +0100, Ingo Molnar wrote:
> So am I interpreting the older and your latest numbers 
> correctly in stating that the cost observation has flipped 
> around 180 degrees: the first measurement showed eager FPU 
> to be a win, but now that we can do more precise 
> measurements, eager FPU has actually slowed down the kernel 
> build by ~0.5%?

Well, I wouldn't take the latest numbers too seriously - that was a
single run without --repeat.

> That's not good, and kernel builds are just a random load 
> that isn't even that FPU or context switch heavy - there 
> will certainly be other loads that would be hurt even more.

That is my fear.

> So just before we base wide reaching decisions based on any 
> of these measurements, would you mind help us increase our 
> confidence in the numbers some more:
> 
>   - It might make sense to do a 'perf stat --null --repeat'
>     measurement as well [without any -e arguments], to make 
>     sure the rich PMU stats you are gathering are not 
>     interfering?
> 
>     With 'perf stat --null --repeat' perf acts essenially 
>     as a /usr/bin/time replacement, but can measure down to
>     microseconds and will calculate noise/sttdev properly.

Cool, let me do that.

>   - Perhaps also double check the debug switch: is it
>     really properly switching FPU handling mode?

I've changed the use_eager_fpu() test to do:

 static __always_inline __pure bool use_eager_fpu(void)
 {
       return boot_cpu_has(X86_FEATURE_EAGER_FPU);
 }

and I'm clearing/setting eager FPU with
setup_force_cpu_cap/setup_clear_cpu_cap, see full diff below.

>   - Do you have enough RAM that there's essentially no IO
>     in the system worth speaking of? Do you have enough RAM
>     to copy a whole kernel tree to /tmp/linux/ and do the
>     measurement there, on ramfs?

/proc/meminfo says "MemTotal: 4011860 kB" which is probably not enough.
But I could find one somewhere :-)

---
 arch/x86/include/asm/fpu-internal.h |  6 +++-
 arch/x86/kernel/xsave.c             | 57 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index e97622f57722..c8a161d02056 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -38,6 +38,8 @@ int ia32_setup_frame(int sig, struct ksignal *ksig,
 # define ia32_setup_rt_frame	__setup_rt_frame
 #endif
 
+
+extern unsigned long fpu_saved;
 extern unsigned int mxcsr_feature_mask;
 extern void fpu_init(void);
 extern void eager_fpu_init(void);
@@ -87,7 +89,7 @@ static inline int is_x32_frame(void)
 
 static __always_inline __pure bool use_eager_fpu(void)
 {
-	return static_cpu_has_safe(X86_FEATURE_EAGER_FPU);
+	return boot_cpu_has(X86_FEATURE_EAGER_FPU);
 }
 
 static __always_inline __pure bool use_xsaveopt(void)
@@ -242,6 +244,8 @@ static inline void fpu_fxsave(struct fpu *fpu)
  */
 static inline int fpu_save_init(struct fpu *fpu)
 {
+	fpu_saved++;
+
 	if (use_xsave()) {
 		fpu_xsave(fpu);
 
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 0de1fae2bdf0..943af0adacff 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -14,6 +14,8 @@
 #include <asm/sigframe.h>
 #include <asm/xcr.h>
 
+#include <linux/debugfs.h>
+
 /*
  * Supported feature mask by the CPU and the kernel.
  */
@@ -638,7 +640,7 @@ static void __init xstate_enable_boot_cpu(void)
 	setup_init_fpu_buf();
 
 	/* Auto enable eagerfpu for xsaveopt */
-	if (cpu_has_xsaveopt && eagerfpu != DISABLE)
+	if (eagerfpu != DISABLE)
 		eagerfpu = ENABLE;
 
 	if (pcntxt_mask & XSTATE_EAGER) {
@@ -739,3 +741,56 @@ void *get_xsave_addr(struct xsave_struct *xsave, int xstate)
 	return (void *)xsave + xstate_comp_offsets[feature];
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
+
+unsigned long fpu_saved;
+
+static void my_clts(void *arg)
+{
+	asm volatile("clts");
+}
+
+static int eager_get(void *data, u64 *val)
+{
+	*val = fpu_saved;
+
+	return 0;
+}
+
+static int eager_set(void *data, u64 val)
+{
+	preempt_disable();
+	if (val) {
+		on_each_cpu(my_clts, NULL, 1);
+		setup_force_cpu_cap(X86_FEATURE_EAGER_FPU);
+	} else {
+		setup_clear_cpu_cap(X86_FEATURE_EAGER_FPU);
+		stts();
+	}
+	preempt_enable();
+
+	fpu_saved = 0;
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(eager_fops, eager_get, eager_set, "%llu\n");
+
+static int __init setup_eagerfpu_knob(void)
+{
+	static struct dentry *d_eager, *f_eager;
+
+	d_eager = debugfs_create_dir("fpu", NULL);
+	if (!d_eager) {
+		pr_err("Error creating fpu debugfs dir\n");
+		return -ENOMEM;
+	}
+
+	f_eager = debugfs_create_file("eager", 0644, d_eager, NULL, &eager_fops);
+	if (!f_eager) {
+		pr_err("Error creating fpu debugfs node\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+late_initcall(setup_eagerfpu_knob);
-- 
2.2.0.33.gc18b867


-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-22  2:18         ` Andy Lutomirski
@ 2015-02-22 11:06           ` Borislav Petkov
  2015-02-23  1:45             ` Rik van Riel
  2015-02-23 21:17           ` Maciej W. Rozycki
  1 sibling, 1 reply; 45+ messages in thread
From: Borislav Petkov @ 2015-02-22 11:06 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Andy Lutomirski, Ingo Molnar, Oleg Nesterov, Rik van Riel,
	X86 ML, linux-kernel, Linus Torvalds

On Sat, Feb 21, 2015 at 06:18:01PM -0800, Andy Lutomirski wrote:
> That's true.  The question is whether there are enough of them, and
> whether twiddling TS is fast enough, that it's worth it.

Yes, and let me make it clear what I'm trying to do here: I want to make
sure that eager FPU handling (both allocation and switching - and no,
I'm not confusing the concepts) *doesn't* *hurt* any relevant workload.
If it does, then we'll stop wasting time right there.

But(!), if the CR0.TS lazy dance turns out to be really slow and the
eager handling doesn't bring a noticeable slowdown, in comparison, we
should consider doing the eager thing by default. After running a lot
more benchmarks, of course.

Which brings me to the main reason why we're doing this: code
simplification. If we switch to eager, we'll kill a lot of non-trivial
code and the FPU handling in the kernel becomes dumb and nice again.

> >  And last but not least, why does the handling of CR0.TS traps have to be
> > complicated?  It does not look like rocket science to me, it should be a

If you think so, just go and look at the latest tip/x86/fpu commits and
the amount of WTF in the commit messages.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-22  8:18               ` Ingo Molnar
  2015-02-22  8:22                 ` Ingo Molnar
  2015-02-22 10:48                 ` Borislav Petkov
@ 2015-02-22 12:50                 ` Borislav Petkov
  2015-02-22 12:57                   ` Ingo Molnar
  2 siblings, 1 reply; 45+ messages in thread
From: Borislav Petkov @ 2015-02-22 12:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86, linux-kernel,
	Linus Torvalds, Arjan van de Ven

On Sun, Feb 22, 2015 at 09:18:40AM +0100, Ingo Molnar wrote:
>   - It might make sense to do a 'perf stat --null --repeat'
>     measurement as well [without any -e arguments], to make 
>     sure the rich PMU stats you are gathering are not 
>     interfering?

Well, the --repeat thing definitely is good to do and the previous
single run is simply not meaningful due to variance I'd guess:

perf stat --null --repeat 10 --sync --pre ~/bin/pre-build-kernel.sh make -s -j12

...

Lazy FPU:
  FPU saves before: 5
Setup is 16252 bytes (padded to 16384 bytes).
System is 4222 kB
CRC 5417fa4f
Kernel: arch/x86/boot/bzImage is ready  (#43)

...

Kernel: arch/x86/boot/bzImage is ready  (#52)

 Performance counter stats for 'make -s -j12' (10 runs):

     219.406449195 seconds time elapsed                                          ( +-  0.17% )

  FPU saves after: 6974975

Eager FPU:
  FPU saves before: 4
Setup is 16252 bytes (padded to 16384 bytes).
System is 4222 kB
CRC 823c1801
Kernel: arch/x86/boot/bzImage is ready  (#53)

...

Kernel: arch/x86/boot/bzImage is ready  (#62)

 Performance counter stats for 'make -s -j12' (10 runs):

     218.791122148 seconds time elapsed                                          ( +-  0.13% )

  FPU saves after: 8939084


The counter numbers are consistent with before: ~200000 FPU saves more
in eager mode per kernel build.

Timing improvement of 0.6 secs on average looks nice to me. Next I'll
do the same thing in tmpfs to see how the numbers look without the page
cache.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-22 12:50                 ` Borislav Petkov
@ 2015-02-22 12:57                   ` Ingo Molnar
  2015-02-22 13:21                     ` Borislav Petkov
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2015-02-22 12:57 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86, linux-kernel,
	Linus Torvalds, Arjan van de Ven


* Borislav Petkov <bp@alien8.de> wrote:

> Lazy FPU:
>      219.406449195 seconds time elapsed                                          ( +-  0.17% )

> Eager FPU:
>      218.791122148 seconds time elapsed                                          ( +-  0.13% )

> Timing improvement of 0.6 secs on average looks nice to 
> me. Next I'll do the same thing in tmpfs to see how the 
> numbers look without the page cache.

This is also very similar to the ~0.6 secs improvement your 
first set of numbers gave.

So now that it appears we have consistent numbers, it would 
be nice to check it on older hardware (and other workloads) 
as well, to see whether it's a consistent win.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-22 12:57                   ` Ingo Molnar
@ 2015-02-22 13:21                     ` Borislav Petkov
  0 siblings, 0 replies; 45+ messages in thread
From: Borislav Petkov @ 2015-02-22 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Oleg Nesterov, Rik van Riel, x86, linux-kernel,
	Linus Torvalds, Arjan van de Ven, kbuild-all

On Sun, Feb 22, 2015 at 01:57:36PM +0100, Ingo Molnar wrote:
> This is also very similar to the ~0.6 secs improvement your 
> first set of numbers gave.

Yeah, running without --repeat was simply misleading.

> So now that it appears we have consistent numbers, it would 
> be nice to check it on older hardware (and other workloads) 
> as well, to see whether it's a consistent win.

I'll try to dig out some old boxes we have here, maybe the build robot
could do some measurements too.

Fengguang et al, guys, is it possible for you to run this patch:

https://lkml.kernel.org/r/b0ba174ea882ed36cf7011e872baf427c23b7e09.1424458621.git.luto@amacapital.net

on your fleet - has to be baremetal - to check how it behaves,
performance-wise? Older machines would be preferred...

That would be lovely :-)

Thanks!

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-22 11:06           ` Borislav Petkov
@ 2015-02-23  1:45             ` Rik van Riel
  2015-02-23  5:22               ` Andy Lutomirski
  0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2015-02-23  1:45 UTC (permalink / raw)
  To: Borislav Petkov, Maciej W. Rozycki
  Cc: Andy Lutomirski, Ingo Molnar, Oleg Nesterov, X86 ML,
	linux-kernel, Linus Torvalds

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/22/2015 06:06 AM, Borislav Petkov wrote:
> On Sat, Feb 21, 2015 at 06:18:01PM -0800, Andy Lutomirski wrote:
>> That's true.  The question is whether there are enough of them,
>> and whether twiddling TS is fast enough, that it's worth it.
> 
> Yes, and let me make it clear what I'm trying to do here: I want to
> make sure that eager FPU handling (both allocation and switching -
> and no, I'm not confusing the concepts) *doesn't* *hurt* any
> relevant workload. If it does, then we'll stop wasting time right
> there.
> 
> But(!), if the CR0.TS lazy dance turns out to be really slow and
> the eager handling doesn't bring a noticeable slowdown, in
> comparison, we should consider doing the eager thing by default.
> After running a lot more benchmarks, of course.
> 
> Which brings me to the main reason why we're doing this: code 
> simplification. If we switch to eager, we'll kill a lot of
> non-trivial code and the FPU handling in the kernel becomes dumb
> and nice again.

Currently the FPU handling does something really dumb for
KVM VCPU threads.  Specifically, every time we enter a
KVM guest, we save the userspace FPU state of the VCPU
thread, and every time we leave the KVM guest, we load
the userspace FPU state of the VCPU thread.

This is done for a thread that hardly ever exits to
userspace, and generally just switches between kernel
and guest mode.

The reason for this acrobatics is that at every
context switch time, the userspace FPU state is
saved & loaded.

I am working on a patch series to avoid that needless
FPU save & restore, by moving the point at which the
user FPU state is loaded out to the point where we
return to userspace, in do_notify_resume.

One implication of this is that in kernel mode, we
can no longer just assume that the user space FPU
state is always loaded, and we need to check for that
(like the lazy FPU code does today).  I would really
like to keep that code around, for obvious reasons :)

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJU6oZBAAoJEM553pKExN6Dk1oH/iJhvE96xM8Yo38eplaI73nC
Bx8OAJk5ombiTroWTqU99Y/2iZmdt3k9KHYEQhYnvCu3RV/4N9GwVLobbh/EPN8Q
v/gXJHOPT1uT7arpIu+XqcbqYCUUMnFdAOfjuLupGRjX64YgzBltd4TUC4yPdW/2
TXnj7OLW3jGIJVOKjnx7zQaKqolAAxbprXkfe8MsGwy0ARS4kXIvcBG7e8s92uQb
oIIQrs5UxmhQo/8Sa+Q8jCF8bHrTJr/mkbPybT6o1et78kwT7FV+2d7oGQySn+v1
FMBRiQsUOdY6AZOtjkWxB+k73QDSwkdLivVWwXZMGICUcQz4756nINWQyPNL49U=
=DlRc
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23  1:45             ` Rik van Riel
@ 2015-02-23  5:22               ` Andy Lutomirski
  2015-02-23 12:51                 ` Rik van Riel
  0 siblings, 1 reply; 45+ messages in thread
From: Andy Lutomirski @ 2015-02-23  5:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Borislav Petkov, Maciej W. Rozycki, Ingo Molnar, Oleg Nesterov,
	X86 ML, linux-kernel, Linus Torvalds

On Sun, Feb 22, 2015 at 5:45 PM, Rik van Riel <riel@redhat.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 02/22/2015 06:06 AM, Borislav Petkov wrote:
>> On Sat, Feb 21, 2015 at 06:18:01PM -0800, Andy Lutomirski wrote:
>>> That's true.  The question is whether there are enough of them,
>>> and whether twiddling TS is fast enough, that it's worth it.
>>
>> Yes, and let me make it clear what I'm trying to do here: I want to
>> make sure that eager FPU handling (both allocation and switching -
>> and no, I'm not confusing the concepts) *doesn't* *hurt* any
>> relevant workload. If it does, then we'll stop wasting time right
>> there.
>>
>> But(!), if the CR0.TS lazy dance turns out to be really slow and
>> the eager handling doesn't bring a noticeable slowdown, in
>> comparison, we should consider doing the eager thing by default.
>> After running a lot more benchmarks, of course.
>>
>> Which brings me to the main reason why we're doing this: code
>> simplification. If we switch to eager, we'll kill a lot of
>> non-trivial code and the FPU handling in the kernel becomes dumb
>> and nice again.
>
> Currently the FPU handling does something really dumb for
> KVM VCPU threads.  Specifically, every time we enter a
> KVM guest, we save the userspace FPU state of the VCPU
> thread, and every time we leave the KVM guest, we load
> the userspace FPU state of the VCPU thread.
>
> This is done for a thread that hardly ever exits to
> userspace, and generally just switches between kernel
> and guest mode.
>
> The reason for this acrobatics is that at every
> context switch time, the userspace FPU state is
> saved & loaded.
>
> I am working on a patch series to avoid that needless
> FPU save & restore, by moving the point at which the
> user FPU state is loaded out to the point where we
> return to userspace, in do_notify_resume.
>
> One implication of this is that in kernel mode, we
> can no longer just assume that the user space FPU
> state is always loaded, and we need to check for that
> (like the lazy FPU code does today).  I would really
> like to keep that code around, for obvious reasons :)

I like that stuff, except for the fact that it still has code that
depends on whether we're in eager or lazy mode, even though eager is a
little less eager with your patches.  Ideally I'd like to see your
patches applied *and* lazy mode removed.

--Andy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23  5:22               ` Andy Lutomirski
@ 2015-02-23 12:51                 ` Rik van Riel
  2015-02-23 15:03                   ` Borislav Petkov
  0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2015-02-23 12:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Maciej W. Rozycki, Ingo Molnar, Oleg Nesterov,
	X86 ML, linux-kernel, Linus Torvalds

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/23/2015 12:22 AM, Andy Lutomirski wrote:
> On Sun, Feb 22, 2015 at 5:45 PM, Rik van Riel <riel@redhat.com>
> wrote:

>> One implication of this is that in kernel mode, we can no longer
>> just assume that the user space FPU state is always loaded, and
>> we need to check for that (like the lazy FPU code does today).  I
>> would really like to keep that code around, for obvious reasons
>> :)
> 
> I like that stuff, except for the fact that it still has code that 
> depends on whether we're in eager or lazy mode, even though eager
> is a little less eager with your patches.  Ideally I'd like to see
> your patches applied *and* lazy mode removed.

The latest version of my code (which I should forward-port
again to the latest tip bits) simplifies things a lot.

It moves all new task handling to switch_fpu_finished(),
which is called from do_notify_resume().

At that point we either load the FPU context, or we
set CR0.TS.

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJU6yI3AAoJEM553pKExN6DyQEH/13E0T6wsII5L+0W1D0WU4v5
F2SmU74zvCv+RhRl/0npvKRQC3IkZ/peM9h/1PU/r61Y39VhexUxiUZ86afzz4KJ
ogPNEs6HZ8UYSji6RPlVsuJJ4vCsZ9GA+27JwKGzLUG1NfniL0qlc93rl+f3ZfFm
5H4bjtcO361Am15yKkFPRl/0PKqRhMDa4tzJGkvvCaJ+k195Ik0a7WXjOlGdIeJA
5Q1D0qaAccXtwfNjqhxIfTBi1LV2a1/jCKF8ktxGJ2+Ywvau9blJnPfD8vUf9JL2
9/le1kyNU3h8B2lU0jIi4a2Y/CcU47heEUy5htmoo6wChQWAy/pR8JaE6++qWqI=
=ee/J
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-20 18:58 [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs Andy Lutomirski
  2015-02-20 19:05 ` Borislav Petkov
  2015-02-21  9:31 ` Ingo Molnar
@ 2015-02-23 14:59 ` Oleg Nesterov
  2015-02-23 15:11   ` Borislav Petkov
  2015-02-24 19:15 ` Denys Vlasenko
  2015-02-25 17:12 ` Some results (was: Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs) Borislav Petkov
  4 siblings, 1 reply; 45+ messages in thread
From: Oleg Nesterov @ 2015-02-23 14:59 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Rik van Riel, x86, linux-kernel, Borislav Petkov

On 02/20, Andy Lutomirski wrote:
>
> We have eager and lazy fpu modes, introduced in:
> 
> 304bceda6a18 x86, fpu: use non-lazy fpu restore for processors supporting xsave
> 
> The result is rather messy.  There are two code paths in almost all of the
> FPU code, and only one of them (the eager case) is tested frequently, since
> most kernel developers have new enough hardware that we use eagerfpu.
> 
> It seems that, on any remotely recent hardware, eagerfpu is a win:
> glibc uses SSE2, so laziness is probably overoptimistic, and, in any
> case, manipulating TS is far slower that saving and restoring the full
> state.
> 
> To try to shake out any latent issues on old hardware, this changes
> the default to eager on all CPUs.  If no performance or functionality
> problems show up, a subsequent patch could remove lazy mode entirely.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/kernel/xsave.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
> index 0de1fae2bdf0..1928c8f34ce5 100644
> --- a/arch/x86/kernel/xsave.c
> +++ b/arch/x86/kernel/xsave.c
> @@ -637,8 +637,8 @@ static void __init xstate_enable_boot_cpu(void)
>  	prepare_fx_sw_frame();
>  	setup_init_fpu_buf();
>  
> -	/* Auto enable eagerfpu for xsaveopt */
> -	if (cpu_has_xsaveopt && eagerfpu != DISABLE)
> +	/* Auto enable eagerfpu for everyone */
> +	if (eagerfpu != DISABLE)
>  		eagerfpu = ENABLE;

Well, but if we want this change then perhaps we should simply change
the default value? This way "AUTO" still can work.

Oleg.

--- x/arch/x86/kernel/xsave.c
+++ x/arch/x86/kernel/xsave.c
@@ -563,7 +563,7 @@ static void __init setup_init_fpu_buf(vo
 	xsave_state_booting(init_xstate_buf, -1);
 }
 
-static enum { AUTO, ENABLE, DISABLE } eagerfpu = AUTO;
+static enum { AUTO, ENABLE, DISABLE } eagerfpu = ENABLE;
 static int __init eager_fpu_setup(char *s)
 {
 	if (!strcmp(s, "on"))


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23 12:51                 ` Rik van Riel
@ 2015-02-23 15:03                   ` Borislav Petkov
  2015-02-23 15:51                     ` Rik van Riel
  0 siblings, 1 reply; 45+ messages in thread
From: Borislav Petkov @ 2015-02-23 15:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, Maciej W. Rozycki, Ingo Molnar, Oleg Nesterov,
	X86 ML, linux-kernel, Linus Torvalds

On Mon, Feb 23, 2015 at 07:51:04AM -0500, Rik van Riel wrote:
> At that point we either load the FPU context, or we
> set CR0.TS.

Right, but provided eager doesn't bring any slowdown, we can drop the TS
fiddling altogether and only load FPU context.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23 14:59 ` Oleg Nesterov
@ 2015-02-23 15:11   ` Borislav Petkov
  2015-02-23 15:53     ` Rik van Riel
  0 siblings, 1 reply; 45+ messages in thread
From: Borislav Petkov @ 2015-02-23 15:11 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andy Lutomirski, Rik van Riel, x86, linux-kernel

On Mon, Feb 23, 2015 at 03:59:29PM +0100, Oleg Nesterov wrote:
> Well, but if we want this change then perhaps we should simply change
> the default value? This way "AUTO" still can work.

Yeah, sure, let's do some measurements first, to see whether this is
even worth it.

Btw, Mel pointed me at some cleanups he did a while ago in that area
too: https://lkml.org/lkml/2014/8/6/201

If you guys could take a look, that would be awesome :)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23 15:03                   ` Borislav Petkov
@ 2015-02-23 15:51                     ` Rik van Riel
  2015-02-23 18:06                       ` Borislav Petkov
  0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2015-02-23 15:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Maciej W. Rozycki, Ingo Molnar, Oleg Nesterov,
	X86 ML, linux-kernel, Linus Torvalds

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/23/2015 10:03 AM, Borislav Petkov wrote:
> On Mon, Feb 23, 2015 at 07:51:04AM -0500, Rik van Riel wrote:
>> At that point we either load the FPU context, or we set CR0.TS.
> 
> Right, but provided eager doesn't bring any slowdown, we can drop
> the TS fiddling altogether and only load FPU context.

Agreed.

However, we would still need the rest of the kernel code to
check whether userspace FPU context is loaded in the registers
or not. We cannot simplify to the point of assuming that the
FPU state will always be loaded when in eager mode.

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJU60x+AAoJEM553pKExN6DIMkIAJpZHQuojQMTUyy11EB7dx2N
ZMm5PiQGW47f/BIbIwR3hwZbC385ZLdx1Nkrys8cXd3WNJInrMxf2PDaU9YtDFw1
NdybRbbsxAAaB3oJldLFUH5ouo9ePKV4RyXFgsgNH75uv6u4pesaWwUZNuLmxzrk
h8/wns7Lpddr4C0ahPimEM1pXdfiFyo/0jaAiOj8o/IklXiJW25LBYUzE4ogD/2T
097wV5C8ojvS4doVD+GthxFuJFslZWxyNaRBRNqpE2LyhmjEdz2WwNTSrDtLsBK0
gKHjgD0eox/HpNkBzwaF6MOo1of2/WFNSJyRKwlwaWnRNpZ3jDfyU0VPg3yhF30=
=aIKn
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23 15:11   ` Borislav Petkov
@ 2015-02-23 15:53     ` Rik van Riel
  2015-02-23 18:40       ` Oleg Nesterov
  0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2015-02-23 15:53 UTC (permalink / raw)
  To: Borislav Petkov, Oleg Nesterov; +Cc: Andy Lutomirski, x86, linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/23/2015 10:11 AM, Borislav Petkov wrote:
> On Mon, Feb 23, 2015 at 03:59:29PM +0100, Oleg Nesterov wrote:
>> Well, but if we want this change then perhaps we should simply
>> change the default value? This way "AUTO" still can work.
> 
> Yeah, sure, let's do some measurements first, to see whether this
> is even worth it.
> 
> Btw, Mel pointed me at some cleanups he did a while ago in that
> area too: https://lkml.org/lkml/2014/8/6/201
> 
> If you guys could take a look, that would be awesome :)

I don't think any of those changes would survive the
"defer FPU loading to do_notify_resume" patch series.

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJU60zqAAoJEM553pKExN6DbIMH/A72/Lid9hho65Wh4notTa7a
QMRsM+XgLvvohmYKWZQ+uS4by9jbLaH827PIQtcW3bSEgtWGjam+pUGC/gWXwZk7
51oYDhyAqvVr6vb3mpl13+N+xGN5TlLoZa7SC73DuSSD/6HGqDryrr7mcfX53Yc+
UOg7MpAx//5tLZ89amMirqqqhqEfQh9qlyjPNLVywpHeSDYfxlinsfJgWOUW0/4A
7wR51ToANH9cMcSXWQ72lUNbJpJLgkcW8AVKXicS/UtFnvgItcvfgy1TjOpuaZom
MRVv0E7NUM10U9mZ13NRDE7JSCk5Euy8XgM+AOk5PXJlb0Lg8IYu1nVTn7Q+6A0=
=JLIM
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23 15:51                     ` Rik van Riel
@ 2015-02-23 18:06                       ` Borislav Petkov
  0 siblings, 0 replies; 45+ messages in thread
From: Borislav Petkov @ 2015-02-23 18:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, Maciej W. Rozycki, Ingo Molnar, Oleg Nesterov,
	X86 ML, linux-kernel, Linus Torvalds

On Mon, Feb 23, 2015 at 10:51:26AM -0500, Rik van Riel wrote:
> However, we would still need the rest of the kernel code to ...

Yeah, let's wait out first and see what the benchmarks say. Mel started
a bunch of them on a couple of boxes here, we'll have results in the
coming days.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23 15:53     ` Rik van Riel
@ 2015-02-23 18:40       ` Oleg Nesterov
  0 siblings, 0 replies; 45+ messages in thread
From: Oleg Nesterov @ 2015-02-23 18:40 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Borislav Petkov, Andy Lutomirski, x86, linux-kernel

On 02/23, Rik van Riel wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 02/23/2015 10:11 AM, Borislav Petkov wrote:
> > On Mon, Feb 23, 2015 at 03:59:29PM +0100, Oleg Nesterov wrote:
> >> Well, but if we want this change then perhaps we should simply
> >> change the default value? This way "AUTO" still can work.
> >
> > Yeah, sure, let's do some measurements first, to see whether this
> > is even worth it.
> >
> > Btw, Mel pointed me at some cleanups he did a while ago in that
> > area too: https://lkml.org/lkml/2014/8/6/201
> >
> > If you guys could take a look, that would be awesome :)
>
> I don't think any of those changes would survive the
> "defer FPU loading to do_notify_resume" patch series.

Agreed, at least

	+static inline void switch_eagerfpu_prepare(struct task_struct *old, struct task_struct *new, int cpu,
	+						fpu_switch_t *fpu)
	+{
	+	fpu->preload = tsk_used_math(new);

doesn't look right if we change the rules.

OTOH, perhaps it makes sense to redo these cleanups later.

Oleg.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-22  2:18         ` Andy Lutomirski
  2015-02-22 11:06           ` Borislav Petkov
@ 2015-02-23 21:17           ` Maciej W. Rozycki
  2015-02-23 21:21             ` Rik van Riel
  1 sibling, 1 reply; 45+ messages in thread
From: Maciej W. Rozycki @ 2015-02-23 21:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Ingo Molnar, Oleg Nesterov, Rik van Riel,
	X86 ML, linux-kernel, Linus Torvalds

On Sat, 21 Feb 2015, Andy Lutomirski wrote:

> >  Additionally I believe long-executing FPU instructions (i.e.
> > transcendentals) can take advantage of continuing to execute in parallel
> > where the context has already been switched rather than stalling an eager
> > FPU context switch until the FPU instruction has completed.
> 
> It seems highly unlikely to me that a slow FPU instruction can retire
> *after* a subsequent fxsave, which would need to happen for this to
> work.

 I meant something else -- a slow FPU instruction can retire after a task 
has been switched where the FP context has been left intact, i.e. in the 
lazy FP context switching case, where only the MMU context and GPRs have 
been replaced.  Whereas in the eager FP context switching case you can get 
through to FXSAVE while a slow FPU instruction hasn't completed yet (e.g. 
started just as preemption was about to happen).

 Obviously that FXSAVE will have to stall until the FPU instruction has 
completed (IIRC the i486 aborted transcendental instructions on any 
exceptions/interrupts instead, leading to the risk of process starvation 
in heavily interrupt loaded systems, but I also believe it has been fixed 
as from the Pentium).  Though if, as you say, the lone taking of a 
trap/interrupt gate can take hundreds of cycles, perhaps indeed no FPU 
instruction will execute *that* long on modern silicon.

> >  And last but not least, why does the handling of CR0.TS traps have to be
> > complicated?  It does not look like rocket science to me, it should be a
> > mere handful of instructions, the time required to move the two FP
> > contexts out from and in to the FPU respectively should dominate
> > processing time.  Where quoted the optimisation manual states 250 cycles
> > for FXSAVE and FXRSTOR combined.
> 
> The TS traps aren't complicated -- they're just really slow.  I think
> that each of setting *and* clearing TS serializes and takes well over
> a hundred cycles.  A #NM interrupt (the thing that happens if you try
> to use the FPU with TS set) serializes and does all kinds of slow
> things, so it takes many hundreds of cycles.  The handler needs to
> clear TS (another hundred cycles or more), load the FPU state
> (actually rather fast on modern CPUs), and then IRET back to userspace
> (hundreds of cycles).  This adds up to a lot of cycles.  A round trip
> through an exception handler seems to be thousands of cycles.

 That sucks wet goat farts! :(

 I have to admit I got moved a bit away from the x86 world and didn't 
realise things have become so bad.  Some 10 years ago or so taking a trap 
or interrupt gate would need some 30 cycles (of course task gates are 
unusable for anything that does not absolutely require them such as a #DF; 
their usability for anything real ended with the 80286 or suchlike).  
Similarly an IRET to reverse the actions taken.  That was already rather 
bad, but understandable, after all the CPU had to read the gate 
descriptor, access the TSS, switch both CS and SS descriptors, etc.

 What I don't understand is why CLTS, a dedicated instruction that avoids 
the need to access whole CR0 (that again can understandably be costly, 
because of the grouping of all the important bits there), has to be so 
slow.  It flips a single bit down there and does not to serialise 
anything, as any instruction down the pipeline it could affect would 
trigger a #NM anyway!  And there's an IRET somewhere on the way too, 
before the instruction that originally triggered the fault will be 
reexecuted.

 And why the heck over all these years a mechanism similar to SYSENTER and 
its bunch of complementing MSRs hasn't been invented for the common 
exceptions, to avoid all this gate descriptor dance!

> >  And of course you can install the right handler (i.e. FSAVE vs FXSAVE) at
> > bootstrap depending on processor features, you don't have to do all the
> > run-time check on every trap.  You can even optimise the FSAVE handler
> > away at the build time if you know it won't ever be used based on the
> > minimal supported processor family selected.
> 
> None of this matters.  A couple of branches in an #NM handler are
> completely lost in the noise.

 Agreed, given what you state, completely understood.

> >  Do you happen to know or can determine how much time (in clock cycles) a
> > CR0.TS trap itself takes, including any time required to preserve the
> > execution state in the handler such as pushing/popping GPRs to/from the
> > stack (as opposed to processing time spent on moving the FP contexts back
> > and forth)?  Is there no room for improvement left there?  How many task
> > scheduling slots say per million must be there poking at the FPU for eager
> > FPU context switching to take advantage over lazy one?
> 
> Thousands of cycles.  Considerably worse in an SVM guest.  x86
> exception handling sucks.

 I must have been spoilt with the MIPS exception handling.  Taking an 
exception on a MIPS processor is virtually instantaneous, just like 
retiring another instruction.  Of course there's the cost equivalent to 
branch misprediction, you need to invalidate all the pipeline.  So 
depending on how many stages you have there, you can expect a latency of 
say 3-7 clocks.

 Granted, on a MIPS processor taking an exception does not change much -- 
it switches into the kernel mode (1 bit set in a control register, a 
special kernel-mode-override bit dedicated to exception handling), saves 
the old PC (another control register updated; called Exception PC or EPC) 
and loads the PC with the exception vector.  All the rest is left to the 
kernel.  Which is good!

 The same stands for ERET, the exception return instruction -- it merely 
loads the PC back from EPC and clears the kernel-mode-override bit in the 
other control register.  More recently it also serves the purpose of an 
instruction hazard barrier, which you'd call synchronisation, the 
strongest kind provided in the MIPS architecture (in older architecture 
revisions you had to take care of any hazards caused by preceding 
instructions that could affect user-mode execution, by inserting the right 
number of NOPs before ERET, possibly taking other instructions already 
executed since the origin of the hazard into account).  So rather than 3-7 
clocks that could be 20 or so, though usually much fewer.

 A while ago I cooperated with the hardware team in adding an extra 
instruction to the architecture under the assumption that it will be 
emulated on legacy hardware, by taking the RI or Reserved Instruction 
exception (the equivalent to x86's #UD) and doing the rest there.  
Another assumption was a fast path would be taken for this single 
instruction and all the handling done in assembly, without even reaching 
the usual C-language RI handlers that we've accumulated over the years.  

 Mind that exceptions actually have to be decoded and dispatched to 
individual handlers on MIPS processors first, it's not that individual 
exception classes have individual vectors like with x86 -- there's only 
one!  And you need to update EPC too or you'd be trapping back.  Finally 
the instruction itself had to be decoded, so instruction memory had to be 
read and compared against the pattern expected.

 To make a long story short I was able to squeeze all the handling into 
some 30 cycles, with a slight variation across different processor 
implementations.  How much different!

 Oh well, some further benchmarking is still needed, but given the 
circumstances I suppose the old good design will have to go after all, 
sigh...  Thanks for your input!

  Maciej

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23 21:17           ` Maciej W. Rozycki
@ 2015-02-23 21:21             ` Rik van Riel
  2015-02-23 22:14               ` Linus Torvalds
  2015-02-23 22:27               ` Maciej W. Rozycki
  0 siblings, 2 replies; 45+ messages in thread
From: Rik van Riel @ 2015-02-23 21:21 UTC (permalink / raw)
  To: Maciej W. Rozycki, Andy Lutomirski
  Cc: Borislav Petkov, Ingo Molnar, Oleg Nesterov, X86 ML,
	linux-kernel, Linus Torvalds

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/23/2015 04:17 PM, Maciej W. Rozycki wrote:
> On Sat, 21 Feb 2015, Andy Lutomirski wrote:
> 
>>> Additionally I believe long-executing FPU instructions (i.e. 
>>> transcendentals) can take advantage of continuing to execute in
>>> parallel where the context has already been switched rather
>>> than stalling an eager FPU context switch until the FPU
>>> instruction has completed.
>> 
>> It seems highly unlikely to me that a slow FPU instruction can
>> retire *after* a subsequent fxsave, which would need to happen
>> for this to work.
> 
> I meant something else -- a slow FPU instruction can retire after a
> task has been switched where the FP context has been left intact,
> i.e. in the lazy FP context switching case, where only the MMU
> context and GPRs have been replaced.

I don't think that's true, because changing the MMU context and GPRs
also includes changing the instruction pointer, and changing over the
execution to the new task.

After a context switch, the instructions from the old task are no
longer in the pipeline.

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJU65noAAoJEM553pKExN6DyVQH+gJDWuHyoFtgKFyELFi+tTGb
vB/9uR267N4Hxu0D3kedhX9HNJwBtcs1rXm++p+r4RXpz2+gl/HxihoL8v1qk4zy
1lAa5se8T5hUL7MCsD5nOIs2ca3HAGwOFtVXrjvJQ6dSCCVD/SuqJC49ENc5Nd3M
1FcW4jg+MlQ2TNJoi4QRBllmxUFffACzv8M3VWVigIibMGqrOXXSraHp0k1hFFOE
LeL122JKVp7YhVj5Dx7qx1DPUkT24RT9ip1f9r5DkxFmHwdCcY31WzT4asArqHb4
uwK8LVcpTax30VHghqy8KUBVKO5h7pHvdhTYSaoLWdUJfNFJ+8Hw5sDMfPdjN2k=
=6lA5
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23 21:21             ` Rik van Riel
@ 2015-02-23 22:14               ` Linus Torvalds
  2015-02-24  0:56                 ` Maciej W. Rozycki
  2015-02-23 22:27               ` Maciej W. Rozycki
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2015-02-23 22:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Maciej W. Rozycki, Andy Lutomirski, Borislav Petkov, Ingo Molnar,
	Oleg Nesterov, X86 ML, linux-kernel

On Mon, Feb 23, 2015 at 1:21 PM, Rik van Riel <riel@redhat.com> wrote:
>
> On 02/23/2015 04:17 PM, Maciej W. Rozycki wrote:
>>>
>>> It seems highly unlikely to me that a slow FPU instruction can
>>> retire *after* a subsequent fxsave, which would need to happen
>>> for this to work.
>>
>> I meant something else -- a slow FPU instruction can retire after a
>> task has been switched where the FP context has been left intact,
>> i.e. in the lazy FP context switching case, where only the MMU
>> context and GPRs have been replaced.
>
> I don't think that's true, because changing the MMU context and GPRs
> also includes changing the instruction pointer, and changing over the
> execution to the new task.

We have one traditional special case, which actually did something
like Maciej's nightmare scenario: the completely broken "FPU errors
over irq13" IBM PC/AT FPU linkage.

But since we don't actually support old i386 machines any more, we
don't really need to care. The only way you can get into that
situation is with an external i387. I don't think we need to worry
about it.

But with the old horrid irq13 error handling, you literally could get
into a situation that you got an error "exception" (irq) from the
previous state, *after* you had already switched to the new one. We
had some code to mitigate the problem, but as mentioned, I don't think
it's an issue any more.

                        Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23 21:21             ` Rik van Riel
  2015-02-23 22:14               ` Linus Torvalds
@ 2015-02-23 22:27               ` Maciej W. Rozycki
  2015-02-23 23:44                 ` Andy Lutomirski
  1 sibling, 1 reply; 45+ messages in thread
From: Maciej W. Rozycki @ 2015-02-23 22:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andy Lutomirski, Borislav Petkov, Ingo Molnar, Oleg Nesterov,
	X86 ML, linux-kernel, Linus Torvalds

On Mon, 23 Feb 2015, Rik van Riel wrote:

> > I meant something else -- a slow FPU instruction can retire after a
> > task has been switched where the FP context has been left intact,
> > i.e. in the lazy FP context switching case, where only the MMU
> > context and GPRs have been replaced.
> 
> I don't think that's true, because changing the MMU context and GPRs
> also includes changing the instruction pointer, and changing over the
> execution to the new task.

 That does not matter.  The instructions in question only operate on x87 
internal registers: the data stack registers, specifically ST(0) and 
possibly also ST(1), and consequently the Tag Word register, and the 
Status Word register.  No CPU resource such as the MMU or GPRs need to be 
referred for an x87 instruction to complete.  Any unmasked IEEE 754 FPU 
exception recorded on the way is only signalled at the next x87 
instruction.

> After a context switch, the instructions from the old task are no
> longer in the pipeline.

 I'd say it's implementation-specific.  As I mentioned the i486 aborted 
any transcendental x87 instruction in progress upon taking an exception or 
interrupt.  That was a model like you refer to, but as I also mentioned it 
had its shortcomings.

 Any newer implementation I'd expect to, and Pentium class processors 
certainly did, continue executing these instructions in parallel in the 
FPU pipeline regardless of what the CPU does until completed.  If WAIT or 
a waiting x87 instruction was encountered while a previous x87 instruction 
was still in progress, the CPU pipeline would stall until the earlier x87 
instruction has completed.  The FPU has no way to determine the CPU 
context has been switched and neither it recognises execution privilege 
levels.

 I can't speak of SIMD instructions, I don't know offhand.  OTOH AFAIK 
they don't suffer from latencies so long as some x87 instructions that may 
be in the range of 400 clock cycles.

  Maciej

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23 22:27               ` Maciej W. Rozycki
@ 2015-02-23 23:44                 ` Andy Lutomirski
  2015-02-24  2:14                   ` Maciej W. Rozycki
  0 siblings, 1 reply; 45+ messages in thread
From: Andy Lutomirski @ 2015-02-23 23:44 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rik van Riel, Borislav Petkov, Ingo Molnar, Oleg Nesterov,
	X86 ML, linux-kernel, Linus Torvalds

On Mon, Feb 23, 2015 at 2:27 PM, Maciej W. Rozycki <macro@linux-mips.org> wrote:
> On Mon, 23 Feb 2015, Rik van Riel wrote:
>
>> > I meant something else -- a slow FPU instruction can retire after a
>> > task has been switched where the FP context has been left intact,
>> > i.e. in the lazy FP context switching case, where only the MMU
>> > context and GPRs have been replaced.
>>
>> I don't think that's true, because changing the MMU context and GPRs
>> also includes changing the instruction pointer, and changing over the
>> execution to the new task.
>
>  That does not matter.  The instructions in question only operate on x87
> internal registers: the data stack registers, specifically ST(0) and
> possibly also ST(1), and consequently the Tag Word register, and the
> Status Word register.  No CPU resource such as the MMU or GPRs need to be
> referred for an x87 instruction to complete.  Any unmasked IEEE 754 FPU
> exception recorded on the way is only signalled at the next x87
> instruction.
>
>> After a context switch, the instructions from the old task are no
>> longer in the pipeline.
>
>  I'd say it's implementation-specific.  As I mentioned the i486 aborted
> any transcendental x87 instruction in progress upon taking an exception or
> interrupt.  That was a model like you refer to, but as I also mentioned it
> had its shortcomings.

IRET is serializing, according to the the docs (I think) and according
to the Intel engineers I asked (I'm absolutely certain about this
part).  So FPU ops are entirely done at the end of a normal context
switch.

We also always save the FPU context on every context switch away from
a task that used the FPU, even in lazy mode.  This is because we might
switch the task back in on a different CPU, and we don't want to use
an IPI to move the FPU context.

Given that we're only talking about old CPUs here, I sincerely doubt
that there's any relevant case in which an fxsave can usefully wait
for a long-running transcendental op to finish while we continue doing
useful work.  *Especially* since there will almost certainly be
several more mfences or atomic ops before the end of the context
switch, even if we're lucky enough to complete the context switching
using sysret.

--Andy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23 22:14               ` Linus Torvalds
@ 2015-02-24  0:56                 ` Maciej W. Rozycki
  2015-02-24  0:59                   ` Andy Lutomirski
  0 siblings, 1 reply; 45+ messages in thread
From: Maciej W. Rozycki @ 2015-02-24  0:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Andy Lutomirski, Borislav Petkov, Ingo Molnar,
	Oleg Nesterov, X86 ML, linux-kernel

On Mon, 23 Feb 2015, Linus Torvalds wrote:

> We have one traditional special case, which actually did something
> like Maciej's nightmare scenario: the completely broken "FPU errors
> over irq13" IBM PC/AT FPU linkage.
> 
> But since we don't actually support old i386 machines any more, we
> don't really need to care. The only way you can get into that
> situation is with an external i387. I don't think we need to worry
> about it.
> 
> But with the old horrid irq13 error handling, you literally could get
> into a situation that you got an error "exception" (irq) from the
> previous state, *after* you had already switched to the new one. We
> had some code to mitigate the problem, but as mentioned, I don't think
> it's an issue any more.

 Correct, the horrid hack is gone, it was so horrible (though I understand 
why IBM had to do it with their PC/AT) that back in mid 1990s, some 10 
years after the inception of the problem, Intel felt so compelled to make 
people get the handling of it right as to release a dedicated application 
note: "AP-578 Software and Hardware Considerations for FPU Exception 
Handlers for Intel Architecture Processors", Order Number 243291-002.

 Anyway, my point through this consideration has been about the 
performance benefit from continuing the execution of an x87 instruction in 
parallel, perhaps until after a context has been fully switched.  Which 
benefit is lost if a FSAVE/FXSAVE executed eagerly at the context switch 
stalls waiting for the instruction to complete instead.

  Maciej

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-24  0:56                 ` Maciej W. Rozycki
@ 2015-02-24  0:59                   ` Andy Lutomirski
  0 siblings, 0 replies; 45+ messages in thread
From: Andy Lutomirski @ 2015-02-24  0:59 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Linus Torvalds, Rik van Riel, Borislav Petkov, Ingo Molnar,
	Oleg Nesterov, X86 ML, linux-kernel

On Mon, Feb 23, 2015 at 4:56 PM, Maciej W. Rozycki <macro@linux-mips.org> wrote:
> On Mon, 23 Feb 2015, Linus Torvalds wrote:
>
>> We have one traditional special case, which actually did something
>> like Maciej's nightmare scenario: the completely broken "FPU errors
>> over irq13" IBM PC/AT FPU linkage.
>>
>> But since we don't actually support old i386 machines any more, we
>> don't really need to care. The only way you can get into that
>> situation is with an external i387. I don't think we need to worry
>> about it.
>>
>> But with the old horrid irq13 error handling, you literally could get
>> into a situation that you got an error "exception" (irq) from the
>> previous state, *after* you had already switched to the new one. We
>> had some code to mitigate the problem, but as mentioned, I don't think
>> it's an issue any more.
>
>  Correct, the horrid hack is gone, it was so horrible (though I understand
> why IBM had to do it with their PC/AT) that back in mid 1990s, some 10
> years after the inception of the problem, Intel felt so compelled to make
> people get the handling of it right as to release a dedicated application
> note: "AP-578 Software and Hardware Considerations for FPU Exception
> Handlers for Intel Architecture Processors", Order Number 243291-002.
>
>  Anyway, my point through this consideration has been about the
> performance benefit from continuing the execution of an x87 instruction in
> parallel, perhaps until after a context has been fully switched.  Which
> benefit is lost if a FSAVE/FXSAVE executed eagerly at the context switch
> stalls waiting for the instruction to complete instead.

And the save is indeed executed eagerly.  Conceivably we could get rid
of that on UP, but that seems to be a lot of complexity for extremely
little gain.

--Andy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-23 23:44                 ` Andy Lutomirski
@ 2015-02-24  2:14                   ` Maciej W. Rozycki
  2015-02-24  2:31                     ` Andy Lutomirski
  0 siblings, 1 reply; 45+ messages in thread
From: Maciej W. Rozycki @ 2015-02-24  2:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rik van Riel, Borislav Petkov, Ingo Molnar, Oleg Nesterov,
	X86 ML, linux-kernel, Linus Torvalds

On Mon, 23 Feb 2015, Andy Lutomirski wrote:

> >> After a context switch, the instructions from the old task are no
> >> longer in the pipeline.
> >
> >  I'd say it's implementation-specific.  As I mentioned the i486 aborted
> > any transcendental x87 instruction in progress upon taking an exception or
> > interrupt.  That was a model like you refer to, but as I also mentioned it
> > had its shortcomings.
> 
> IRET is serializing, according to the the docs (I think) and according
> to the Intel engineers I asked (I'm absolutely certain about this
> part).  So FPU ops are entirely done at the end of a normal context
> switch.

 No question about the serialising property of IRET, it has been like this 
since the original Pentium implementation.  Do you have an architecture 
specification reference to back up your claim though as far as the FPU is 
concerned?  I'm asking because I am genuinely curious.

 The x87 case is so special, there isn't anything there really that is 
externally observable or should be affected by IRET or any other 
synchronisation barriers apart from WAIT (or a waiting x87 instruction) 
that has been there for this purpose since forever.  And it would defeat 
some documented benefits of running the FP pipeline in the parallel.

 And certainly such synchronisation didn't happen in the old days.

> We also always save the FPU context on every context switch away from
> a task that used the FPU, even in lazy mode.  This is because we might
> switch the task back in on a different CPU, and we don't want to use
> an IPI to move the FPU context.

 That's an interesting case too, although not necessarily related.  If you 
say that we always save the FP context eagerly for the purpose of process 
migration, then sure, that invalidates any benefit we'd have from letting 
the x87 proceed.

 However I can see different ways to address this case avoiding the need 
of eager FP context saving or an IPI:

1. We could bind any currently suspended process with an unsaved FP 
   context to the CPU it last executed on.

2. We could mark such a process for migration next time and let it execute 
   on the CPU that holds its FP context once more, and then save the FP 
   context eagerly on the way out.

In some cases a lazily retained FP context would be preempted by another 
process before the process in question would resume anyway.  In this case 
any temporary binding to a CPU could be given up.

> Given that we're only talking about old CPUs here, I sincerely doubt
> that there's any relevant case in which an fxsave can usefully wait
> for a long-running transcendental op to finish while we continue doing
> useful work.  *Especially* since there will almost certainly be
> several more mfences or atomic ops before the end of the context
> switch, even if we're lucky enough to complete the context switching
> using sysret.

 I am not sure what you mean by FXSAVE usefully waiting for an op, please 
elaborate.  At the point you've reached FXSAVE and an earlier x87 
instruction hasn't completed, you've already lost.  The pipeline will be 
stalled until the x87 instruction has completed and it can be hundreds of 
cycles.  My point therefore has been about avoiding to execute FXSAVE for 
the old task until absolutely necessary, that with the lazy FP context 
switching would be at the next x87 (or SSE) instruction reached by the new 
task.

 Likewise I don't see why MFENCE or an atomic operation should affect the 
excecution of say FSINCOS.  Whether the results of FSINCOS arrive before 
or after MFENCE, etc. are not externally observable.

 And I'm not sure if this all affects old CPUs only -- I don't know how 
much x87 software is out there, but after all these years I'd expect quite 
some.  Sure, lots of this can be recompiled to use SSE instead, but not 
all, and even where it is feasible, that's an extra burden for people, 
beyond say a routine hardware or Linux distribution or for that matter 
lone kernel upgrade.  Therefore I think we need to be careful not to 
pessimise things for a subset of people too much and ideally at all.

 And to be clear, I am not against removing lazy FP context switching per 
se.  I am just emphasizing to be careful with that and be absolutely sure 
that it does not cause excessive harm.

 I still wonder why Intel hasn't addressed some issues around this stuff 
-- is that there are not enough people using proper IEEE 754 arithmetic on 
x86 hardware to attract interest of hardware architecture maintainers?  
After all the same issue applies to enabled IEEE 754 exceptions, a #MF/#XM 
exception isn't going to take any less than a #NM fault.  Or maybe I'm 
just missing something?

  Maciej

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-24  2:14                   ` Maciej W. Rozycki
@ 2015-02-24  2:31                     ` Andy Lutomirski
  2015-02-24 14:43                       ` Rik van Riel
  0 siblings, 1 reply; 45+ messages in thread
From: Andy Lutomirski @ 2015-02-24  2:31 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rik van Riel, Borislav Petkov, Ingo Molnar, Oleg Nesterov,
	X86 ML, linux-kernel, Linus Torvalds

On Mon, Feb 23, 2015 at 6:14 PM, Maciej W. Rozycki <macro@linux-mips.org> wrote:
> On Mon, 23 Feb 2015, Andy Lutomirski wrote:
>
>> >> After a context switch, the instructions from the old task are no
>> >> longer in the pipeline.
>> >
>> >  I'd say it's implementation-specific.  As I mentioned the i486 aborted
>> > any transcendental x87 instruction in progress upon taking an exception or
>> > interrupt.  That was a model like you refer to, but as I also mentioned it
>> > had its shortcomings.
>>
>> IRET is serializing, according to the the docs (I think) and according
>> to the Intel engineers I asked (I'm absolutely certain about this
>> part).  So FPU ops are entirely done at the end of a normal context
>> switch.
>
>  No question about the serialising property of IRET, it has been like this
> since the original Pentium implementation.  Do you have an architecture
> specification reference to back up your claim though as far as the FPU is
> concerned?  I'm asking because I am genuinely curious.
>
>  The x87 case is so special, there isn't anything there really that is
> externally observable or should be affected by IRET or any other
> synchronisation barriers apart from WAIT (or a waiting x87 instruction)
> that has been there for this purpose since forever.  And it would defeat
> some documented benefits of running the FP pipeline in the parallel.

It's plausible that this is special, but I doubt it.  Especially since
this optimization would be nuts post-SSE2.

>
>  And certainly such synchronisation didn't happen in the old days.
>
>> We also always save the FPU context on every context switch away from
>> a task that used the FPU, even in lazy mode.  This is because we might
>> switch the task back in on a different CPU, and we don't want to use
>> an IPI to move the FPU context.
>
>  That's an interesting case too, although not necessarily related.  If you
> say that we always save the FP context eagerly for the purpose of process
> migration, then sure, that invalidates any benefit we'd have from letting
> the x87 proceed.
>
>  However I can see different ways to address this case avoiding the need
> of eager FP context saving or an IPI:
>
> 1. We could bind any currently suspended process with an unsaved FP
>    context to the CPU it last executed on.

This would be insane.

>
> 2. We could mark such a process for migration next time and let it execute
>    on the CPU that holds its FP context once more, and then save the FP
>    context eagerly on the way out.

This would be worse than insane.  Now, in order to wake such a process
on a different CPU, we'd have to force a *context switch* on the
source CPU.  Now we're replacing a few hundred cycles at worse for a
transcendental function with at least 10k cycles (at a guess) and
possibly many orders of magnitude more if locks are held, plus
priority issues, plus total craziness.

>
> In some cases a lazily retained FP context would be preempted by another
> process before the process in question would resume anyway.  In this case
> any temporary binding to a CPU could be given up.
>
>> Given that we're only talking about old CPUs here, I sincerely doubt
>> that there's any relevant case in which an fxsave can usefully wait
>> for a long-running transcendental op to finish while we continue doing
>> useful work.  *Especially* since there will almost certainly be
>> several more mfences or atomic ops before the end of the context
>> switch, even if we're lucky enough to complete the context switching
>> using sysret.
>
>  I am not sure what you mean by FXSAVE usefully waiting for an op, please
> elaborate.  At the point you've reached FXSAVE and an earlier x87
> instruction hasn't completed, you've already lost.  The pipeline will be
> stalled until the x87 instruction has completed and it can be hundreds of
> cycles.  My point therefore has been about avoiding to execute FXSAVE for
> the old task until absolutely necessary, that with the lazy FP context
> switching would be at the next x87 (or SSE) instruction reached by the new
> task.
>
>  Likewise I don't see why MFENCE or an atomic operation should affect the
> excecution of say FSINCOS.  Whether the results of FSINCOS arrive before
> or after MFENCE, etc. are not externally observable.

FSINCOS; FXSAVE; MFENCE had better serialize all the way, no matter
what weird architectural crud is going on.

>
>  And I'm not sure if this all affects old CPUs only -- I don't know how
> much x87 software is out there, but after all these years I'd expect quite
> some.  Sure, lots of this can be recompiled to use SSE instead, but not
> all, and even where it is feasible, that's an extra burden for people,
> beyond say a routine hardware or Linux distribution or for that matter
> lone kernel upgrade.  Therefore I think we need to be careful not to
> pessimise things for a subset of people too much and ideally at all.
>
>  And to be clear, I am not against removing lazy FP context switching per
> se.  I am just emphasizing to be careful with that and be absolutely sure
> that it does not cause excessive harm.

We're talking about the unusual case in which we context switch within
~100 cycles of a legacy transcendental operation, and, even so,
there's *still* no regression, since we don't optimize this case
today.

--Andy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-24  2:31                     ` Andy Lutomirski
@ 2015-02-24 14:43                       ` Rik van Riel
  0 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2015-02-24 14:43 UTC (permalink / raw)
  To: Andy Lutomirski, Maciej W. Rozycki
  Cc: Borislav Petkov, Ingo Molnar, Oleg Nesterov, X86 ML,
	linux-kernel, Linus Torvalds

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/23/2015 09:31 PM, Andy Lutomirski wrote:
> On Mon, Feb 23, 2015 at 6:14 PM, Maciej W. Rozycki
> <macro@linux-mips.org> wrote:

>> That's an interesting case too, although not necessarily related.
>> If you say that we always save the FP context eagerly for the
>> purpose of process migration, then sure, that invalidates any
>> benefit we'd have from letting the x87 proceed.
>> 
>> However I can see different ways to address this case avoiding
>> the need of eager FP context saving or an IPI:
>> 
>> 1. We could bind any currently suspended process with an unsaved
>> FP context to the CPU it last executed on.
> 
> This would be insane.

The task would only be bound to the CPU as long as nothing else
ran that saved the FPU state to memory.  This means the task
would be bound to the CPU while the CPU is idle, or running a
kernel thread. When another user space thread is about to run,
the FPU state would be saved.

This sounds slightly less insane already.

Of course, once you figure in CPU power saving and CPU hot
plug, the level of insanity is far beyond what you seem to
imply above...

In other words, almost certainly not worth doing :)

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJU7I38AAoJEM553pKExN6DhKEIAKauUGBx/1sTshdYYZ1aBlLx
xY7afNUjs8PIxjCbdcwrujbtNa9CFgDlRR6TegvmzQA3prXu/0XZ3vas3O/lD2lC
ks8p3RBzIw4dECxZoCvTQ+VrULk07+LCI+AUNKSm/pNlBCSWeeo2nKqoTREh3oHU
EWzJxn5aEfIA4vZQAnFP5TwkCwR2ob5COGx/I9l54brHEwhEqiRFrPwrIP2WJerx
Lc1Wkmv2PtTN/oQkXOCVKN0hVab//eVnkUiTsY1TnfCQsZSbEMWgq6aqXlb/lhUs
hhpNToBVlWF3LsCnGm6LfCrgX+VSBY9LQpYfaY1ltEmxE+nOplbI+JHQG4Yqgag=
=ah5u
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-20 18:58 [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs Andy Lutomirski
                   ` (2 preceding siblings ...)
  2015-02-23 14:59 ` Oleg Nesterov
@ 2015-02-24 19:15 ` Denys Vlasenko
  2015-02-25  0:07   ` Andy Lutomirski
  2015-02-25 17:12 ` Some results (was: Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs) Borislav Petkov
  4 siblings, 1 reply; 45+ messages in thread
From: Denys Vlasenko @ 2015-02-24 19:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Oleg Nesterov, Rik van Riel, X86 ML, Linux Kernel Mailing List,
	Borislav Petkov

On Fri, Feb 20, 2015 at 7:58 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> We have eager and lazy fpu modes, introduced in:
>
> 304bceda6a18 x86, fpu: use non-lazy fpu restore for processors supporting xsave
>
> The result is rather messy.  There are two code paths in almost all of the
> FPU code, and only one of them (the eager case) is tested frequently, since
> most kernel developers have new enough hardware that we use eagerfpu.
>
> It seems that, on any remotely recent hardware, eagerfpu is a win:
> glibc uses SSE2, so laziness is probably overoptimistic, and, in any
> case, manipulating TS is far slower that saving and restoring the full
> state.
>
> To try to shake out any latent issues on old hardware, this changes
> the default to eager on all CPUs.  If no performance or functionality
> problems show up, a subsequent patch could remove lazy mode entirely.

I'm a big fan of simplifying things, but.

SIMD registers were growing in x86, and they are going to grow again,
this time four-fold in Intel MIC:
from sixteen 256-bit registers to thirty two 512-bit registers.

That's 2 kbytes of data. Just moving this data out to/from memory
will take some time.

And some people talk about 1024-bit registers already...

Let's not completely remove lazy FPU saving code just yet.
Maybe we'll be forced to reinstate it.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-24 19:15 ` Denys Vlasenko
@ 2015-02-25  0:07   ` Andy Lutomirski
  2015-02-25 10:37     ` Borislav Petkov
  2015-02-25 10:45     ` Ingo Molnar
  0 siblings, 2 replies; 45+ messages in thread
From: Andy Lutomirski @ 2015-02-25  0:07 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Rik van Riel, X86 ML, Linux Kernel Mailing List,
	Borislav Petkov

On Tue, Feb 24, 2015 at 11:15 AM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Fri, Feb 20, 2015 at 7:58 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> We have eager and lazy fpu modes, introduced in:
>>
>> 304bceda6a18 x86, fpu: use non-lazy fpu restore for processors supporting xsave
>>
>> The result is rather messy.  There are two code paths in almost all of the
>> FPU code, and only one of them (the eager case) is tested frequently, since
>> most kernel developers have new enough hardware that we use eagerfpu.
>>
>> It seems that, on any remotely recent hardware, eagerfpu is a win:
>> glibc uses SSE2, so laziness is probably overoptimistic, and, in any
>> case, manipulating TS is far slower that saving and restoring the full
>> state.
>>
>> To try to shake out any latent issues on old hardware, this changes
>> the default to eager on all CPUs.  If no performance or functionality
>> problems show up, a subsequent patch could remove lazy mode entirely.
>
> I'm a big fan of simplifying things, but.
>
> SIMD registers were growing in x86, and they are going to grow again,
> this time four-fold in Intel MIC:
> from sixteen 256-bit registers to thirty two 512-bit registers.
>
> That's 2 kbytes of data. Just moving this data out to/from memory
> will take some time.
>
> And some people talk about 1024-bit registers already...
>
> Let's not completely remove lazy FPU saving code just yet.
> Maybe we'll be forced to reinstate it.

I'd prefer a different partial solution: encourage everyone to clear
the xstate before making syscalls (using e.g. vzeroall).  In fact,
maybe user code should aggressively clear newly-unused xstate.

--Andy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-25  0:07   ` Andy Lutomirski
@ 2015-02-25 10:37     ` Borislav Petkov
  2015-02-25 10:50       ` Ingo Molnar
  2015-02-25 10:45     ` Ingo Molnar
  1 sibling, 1 reply; 45+ messages in thread
From: Borislav Petkov @ 2015-02-25 10:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Oleg Nesterov, Rik van Riel, X86 ML,
	Linux Kernel Mailing List

On Tue, Feb 24, 2015 at 04:07:07PM -0800, Andy Lutomirski wrote:
> I'd prefer a different partial solution: encourage everyone to clear
> the xstate before making syscalls (using e.g. vzeroall).  In fact,
> maybe user code should aggressively clear newly-unused xstate.

We don't trust userspace.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-25  0:07   ` Andy Lutomirski
  2015-02-25 10:37     ` Borislav Petkov
@ 2015-02-25 10:45     ` Ingo Molnar
  1 sibling, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2015-02-25 10:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Oleg Nesterov, Rik van Riel, X86 ML,
	Linux Kernel Mailing List, Borislav Petkov, Linus Torvalds,
	Andrew Morton, H. Peter Anvin, Thomas Gleixner


* Andy Lutomirski <luto@amacapital.net> wrote:

> > I'm a big fan of simplifying things, but.
> >
> > SIMD registers were growing in x86, and they are going 
> > to grow again, this time four-fold in Intel MIC: from 
> > sixteen 256-bit registers to thirty two 512-bit 
> > registers.
> >
> > That's 2 kbytes of data. Just moving this data out 
> > to/from memory will take some time.
> >
> > And some people talk about 1024-bit registers 
> > already...
> >
> > Let's not completely remove lazy FPU saving code just 
> > yet. Maybe we'll be forced to reinstate it.
> 
> I'd prefer a different partial solution: encourage 
> everyone to clear the xstate before making syscalls 
> (using e.g. vzeroall).  In fact, maybe user code should 
> aggressively clear newly-unused xstate.

Also, xstate has various compaction features and could grow 
new ones in the future as well, should the xsave area 
become overly sparse: see xstate_comp_*[] et al in 
arch/x86/kernel/xsave.c.

This is the better, hardware driven, synchronous 
alternative to lazy, async register state save/restore, as 
it gets us similar benefits of not saving/restoring unused 
space, but avoids any async trap overhead.

Also ... with more and wider vector CPU registers context 
switches between different sets of registers are going to 
be inevitably more expensive, no matter what.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
  2015-02-25 10:37     ` Borislav Petkov
@ 2015-02-25 10:50       ` Ingo Molnar
  0 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2015-02-25 10:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Denys Vlasenko, Oleg Nesterov, Rik van Riel,
	X86 ML, Linux Kernel Mailing List


* Borislav Petkov <bp@alien8.de> wrote:

> On Tue, Feb 24, 2015 at 04:07:07PM -0800, Andy Lutomirski wrote:
>
> > I'd prefer a different partial solution: encourage 
> > everyone to clear the xstate before making syscalls 
> > (using e.g. vzeroall).  In fact, maybe user code should 
> > aggressively clear newly-unused xstate.
> 
> We don't trust userspace.

We certainly don't, but in this case it's a performance 
optimization detail: if user-space clears unused xstate in 
a way that the CPU recognizes it (for example VZEROALL) 
then it might get faster context switches.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Some results (was: Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs)
  2015-02-20 18:58 [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs Andy Lutomirski
                   ` (3 preceding siblings ...)
  2015-02-24 19:15 ` Denys Vlasenko
@ 2015-02-25 17:12 ` Borislav Petkov
  4 siblings, 0 replies; 45+ messages in thread
From: Borislav Petkov @ 2015-02-25 17:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Oleg Nesterov, Rik van Riel, x86, linux-kernel, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 1137 bytes --]

On Fri, Feb 20, 2015 at 10:58:15AM -0800, Andy Lutomirski wrote:
> -	/* Auto enable eagerfpu for xsaveopt */
> -	if (cpu_has_xsaveopt && eagerfpu != DISABLE)
> +	/* Auto enable eagerfpu for everyone */
> +	if (eagerfpu != DISABLE)
>  		eagerfpu = ENABLE;

So Mel did run some measurements with it on an old Intel and AMD box.
Attached are netperf results from both. The Intel box is a quad core
very similar to this one:

http://ark.intel.com/products/28020/Intel-Xeon-Processor-5063-4M-Cache-3_20-GHz-1066-MHz-FSB

netperf-udp-rr shows some impact of the eagerfpu patch which is outside
of the noise level. netperf-tcp-rr not so much but eager is still a bit
behind.

The AMD box is an old K8 and results there look like eager is better :-)
Not with all though - pipetest is worse.

netperf-udp-rr is better at almost every data point and tcp-rr looks a
bit useless with those high noise levels.

As a summary, the patch has some, albeit small, impact. We would need
more benchmarks. We have one speccpu run which is currently taking
forever to finish...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

[-- Attachment #2: amd.tar.bz2 --]
[-- Type: application/octet-stream, Size: 19665 bytes --]

[-- Attachment #3: intel.tar.bz2 --]
[-- Type: application/octet-stream, Size: 19099 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2015-02-25 17:13 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-20 18:58 [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs Andy Lutomirski
2015-02-20 19:05 ` Borislav Petkov
2015-02-21  9:31 ` Ingo Molnar
2015-02-21 16:38   ` Borislav Petkov
2015-02-21 17:29     ` Borislav Petkov
2015-02-21 18:39       ` Ingo Molnar
2015-02-21 19:15         ` Borislav Petkov
2015-02-21 19:23           ` Ingo Molnar
2015-02-21 21:36             ` Borislav Petkov
2015-02-22  8:18               ` Ingo Molnar
2015-02-22  8:22                 ` Ingo Molnar
2015-02-22 10:48                 ` Borislav Petkov
2015-02-22 12:50                 ` Borislav Petkov
2015-02-22 12:57                   ` Ingo Molnar
2015-02-22 13:21                     ` Borislav Petkov
2015-02-22  0:34       ` Maciej W. Rozycki
2015-02-22  2:18         ` Andy Lutomirski
2015-02-22 11:06           ` Borislav Petkov
2015-02-23  1:45             ` Rik van Riel
2015-02-23  5:22               ` Andy Lutomirski
2015-02-23 12:51                 ` Rik van Riel
2015-02-23 15:03                   ` Borislav Petkov
2015-02-23 15:51                     ` Rik van Riel
2015-02-23 18:06                       ` Borislav Petkov
2015-02-23 21:17           ` Maciej W. Rozycki
2015-02-23 21:21             ` Rik van Riel
2015-02-23 22:14               ` Linus Torvalds
2015-02-24  0:56                 ` Maciej W. Rozycki
2015-02-24  0:59                   ` Andy Lutomirski
2015-02-23 22:27               ` Maciej W. Rozycki
2015-02-23 23:44                 ` Andy Lutomirski
2015-02-24  2:14                   ` Maciej W. Rozycki
2015-02-24  2:31                     ` Andy Lutomirski
2015-02-24 14:43                       ` Rik van Riel
2015-02-21 18:34     ` Ingo Molnar
2015-02-23 14:59 ` Oleg Nesterov
2015-02-23 15:11   ` Borislav Petkov
2015-02-23 15:53     ` Rik van Riel
2015-02-23 18:40       ` Oleg Nesterov
2015-02-24 19:15 ` Denys Vlasenko
2015-02-25  0:07   ` Andy Lutomirski
2015-02-25 10:37     ` Borislav Petkov
2015-02-25 10:50       ` Ingo Molnar
2015-02-25 10:45     ` Ingo Molnar
2015-02-25 17:12 ` Some results (was: Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs) Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).