* [Linux Kernel 5.13 GA] ESXi Performance regression
@ 2021-07-30 12:27 Abdul Anshad Azeez
  2021-07-30 13:26 ` Valentin Schneider
  0 siblings, 1 reply; 7+ messages in thread
From: Abdul Anshad Azeez @ 2021-07-30 12:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, valentin.schneider, mingo, juri.lelli, vincent.guittot,
	rostedt, Rajender M, Rahul Gopakumar

As part of VMware's performance regression testing for upstream Linux
kernel releases, we evaluated the performance of Linux kernel 5.13
against the 5.12 release. Our evaluation revealed performance
regressions of up to 3x in ESXi Compute workloads and up to 40% in
ESXi Networking workloads.

After bisecting between kernel 5.12 and 5.13, we identified the root
cause as a scheduler-related commit from Peter Zijlstra,
8a99b6833c884fa0e7919030d93fecedc69fc625 ("sched: Move SCHED_DEBUG
sysctl to debugfs"). It appears that the issue arose because this
commit changed the default value of "sched_wakeup_granularity_ns";
more details are below.

Impacted test case details:

1. Compute:
- VM Config - RHEL 8.1 - 1VM with 8vCPU & 16G Memory
- Benchmark - kernel compile
- Measures time taken to compile Linux kernel source code (Linux
kernel version used - 4.9.24)
- make -j 2xVCPU - This uses all the available CPU threads to achieve
100% CPU utilization

2. Networking:
- VM Config - RHEL 8.1 - 1VM with 8vCPU & 16G Memory and 8VM with
4vCPU & 8G Memory
- Benchmark - Netperf
- Netperf TCP_STREAM RECV, small packets (8K socket & 256B message,
TCP_NODELAY set) - Throughput (1VM)
- Netperf UDP_STREAM RECV (256K socket & 256B message) - Packet rate
(8VM)

Overall, our results indicate that the above-mentioned commit
introduced performance regressions in the kernel compile workload
(Compute) and in the Networking test cases with high packet rates.

We noticed that Peter Zijlstra's commit moved the scheduler tunables
to the debugfs file system. On closer inspection, the values of two of
these tunables differ before and after the above-mentioned commit.

1. Before:
sched_min_granularity_ns    - 10000000 (10ms)
sched_wakeup_granularity_ns - 15000000 (15ms)

2. After:
sched_min_granularity_ns    - 3000000 (3ms)
sched_wakeup_granularity_ns - 4000000 (4ms)

With further experiments, we confirmed that the value of
"sched_wakeup_granularity_ns" is what drives these performance
regressions. Setting "sched_wakeup_granularity_ns" back to "15000000"
with Peter Zijlstra's commit applied recovers the lost performance in
our Compute & Networking workloads.

Further, we collected guest scheduling stats during the kernel compile
workload and saw more involuntary context switches forced by the
scheduler when "sched_wakeup_granularity_ns" is set to "4000000".

1. "sched_wakeup_granularity_ns = 4000000" (3 iterations):
nr_involuntary_switches : 3
nr_involuntary_switches : 2
nr_involuntary_switches : 2

2. "sched_wakeup_granularity_ns = 15000000" (3 iterations):
nr_involuntary_switches : 0
nr_involuntary_switches : 0
nr_involuntary_switches : 0

So, we believe that decreasing the value of
"sched_wakeup_granularity_ns" causes more preemption of running
processes, which hurts the CPU-bound tasks: the kernel compile and
Netperf high packet rate workloads.
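
The counters above have the "name : value" shape printed by
/proc/<pid>/sched on kernels built with CONFIG_SCHED_DEBUG. The thread
does not say how the stats were collected, so the following is only a
minimal Python sketch under that assumption; the helper names
parse_sched_stats and involuntary_switches are hypothetical.

```python
import re
from pathlib import Path

# Matches lines such as "nr_involuntary_switches   :   3" or
# "se.exec_start : 12345.678"; header and separator lines do not match.
_STAT_LINE = re.compile(r"^\s*([\w.]+)\s*:\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_sched_stats(text):
    """Parse 'name : value' lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        match = _STAT_LINE.match(line)
        if match:
            stats[match.group(1)] = float(match.group(2))
    return stats


def involuntary_switches(pid):
    """Read nr_involuntary_switches for one task (assumes /proc/<pid>/sched exists)."""
    text = Path(f"/proc/{pid}/sched").read_text()
    return int(parse_sched_stats(text)["nr_involuntary_switches"])


# Hand-written sample in the same shape as the numbers quoted above.
SAMPLE = """\
nr_switches : 5
nr_voluntary_switches : 2
nr_involuntary_switches : 3
"""
```

Sampling this counter before and after a run, once per granularity
setting, gives the kind of comparison shown in the two lists above.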

Also, since the Linux 5.14-rc3 kernel was recently released, we
repeated the same experiments on 5.14-rc3 and observed the same
regressions in both areas (Compute & Networking).

We would like to understand the reason behind the change in default
values for the above two scheduler tunables. Since changing
"sched_wakeup_granularity_ns" from 15ms to 4ms forces more
involuntary switches, which in turn introduces a performance
regression, can it be changed back to 15ms?

Abdul Anshad Azeez
Performance Engineering
VMware, Inc.


* Re: [Linux Kernel 5.13 GA] ESXi Performance regression
  2021-07-30 12:27 [Linux Kernel 5.13 GA] ESXi Performance regression Abdul Anshad Azeez
@ 2021-07-30 13:26 ` Valentin Schneider
  2021-08-05 14:33   ` Rahul Gopakumar
  0 siblings, 1 reply; 7+ messages in thread
From: Valentin Schneider @ 2021-07-30 13:26 UTC (permalink / raw)
  To: Abdul Anshad Azeez, linux-kernel
  Cc: peterz, mingo, juri.lelli, vincent.guittot, rostedt, Rajender M,
	Rahul Gopakumar

On 30/07/21 12:27, Abdul Anshad Azeez wrote:
> With further experiments, we have confirmed that the value of
> "sched_wakeup_granularity_ns" is influencing these performance
> regressions. And, on setting the "sched_wakeup_granularity_ns" value
> back to "15000000" in Peter Zijlstra's commit, we are able to gain
> back the lost performance in our Compute & Networking workloads.
>

sysctl_sched_wakeup_granularity's default value hasn't been touched since
2009:

  172e082a9111 ("sched: Re-tune the scheduler latency defaults to decrease worst-case latencies")

and the automagic scaling (see kernel/sched/fair.c::update_sysctl()) hasn't
changed much either.
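
For reference, the scaling can be sketched in Python as below. The
constants and the log2 rule are written from memory of
kernel/sched/fair.c around v5.13 and should be treated as assumptions,
not a copy of the kernel code; scaling_factor and scaled_defaults are
hypothetical names.

```python
# Unscaled ("normalized") CFS defaults in nanoseconds (assumed values).
NORMALIZED_NS = {
    "sched_latency_ns": 6_000_000,
    "sched_min_granularity_ns": 750_000,
    "sched_wakeup_granularity_ns": 1_000_000,
}


def scaling_factor(online_cpus, mode="log"):
    """Factor applied to the normalized defaults; CPU count is capped at 8."""
    cpus = max(1, min(online_cpus, 8))
    if mode == "none":
        return 1
    if mode == "linear":
        return cpus
    # Default mode: 1 + ilog2(cpus); bit_length() - 1 is the integer log2.
    return 1 + (cpus.bit_length() - 1)


def scaled_defaults(online_cpus, mode="log"):
    """Return the effective default of each tunable for a given CPU count."""
    factor = scaling_factor(online_cpus, mode)
    return {name: factor * value for name, value in NORMALIZED_NS.items()}
```

Under these assumptions an 8-CPU guest gets a factor of 4, that is
3000000 and 4000000 for the two granularity tunables, which matches the
"After" values in the report.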

What's likely to happen here is that you have a service in your distro (or
somesuch) tweaking those values, and since the incriminated commit moves
those files to /sys/kernel/debug/sched/, said service doesn't do anything
anymore.
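
A tool that wants to keep working across the move could probe both
locations. The Python sketch below is only an illustration: it assumes
the debugfs files drop the "sched_" prefix (for example
/sys/kernel/debug/sched/wakeup_granularity_ns), and find_tunable and
read_tunable_ns are hypothetical helpers, not part of any existing
tool.

```python
from pathlib import Path

# Candidate locations, oldest first. The second template assumes the
# debugfs file name has no "sched_" prefix.
CANDIDATE_PATHS = (
    "/proc/sys/kernel/sched_{name}",   # before the commit (CONFIG_SCHED_DEBUG)
    "/sys/kernel/debug/sched/{name}",  # after the commit (needs debugfs mounted)
)


def find_tunable(name, root="/"):
    """Return the first existing path for a tunable such as 'wakeup_granularity_ns'."""
    for template in CANDIDATE_PATHS:
        path = Path(root) / template.format(name=name).lstrip("/")
        if path.is_file():
            return path
    return None


def read_tunable_ns(name, root="/"):
    """Return the tunable's value in nanoseconds, or None if no file exists."""
    path = find_tunable(name, root)
    if path is None:
        return None
    return int(path.read_text().strip())
```

The root parameter exists only so the lookup can be exercised against
a scratch directory; reading the real debugfs files normally requires
root privileges.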


* Re: [Linux Kernel 5.13 GA] ESXi Performance regression
  2021-07-30 13:26 ` Valentin Schneider
@ 2021-08-05 14:33   ` Rahul Gopakumar
  2021-08-05 14:58     ` Steven Rostedt
  0 siblings, 1 reply; 7+ messages in thread
From: Rahul Gopakumar @ 2021-08-05 14:33 UTC (permalink / raw)
  To: Valentin Schneider, Abdul Anshad Azeez, linux-kernel
  Cc: peterz, mingo, juri.lelli, vincent.guittot, rostedt, Rajender M

> sysctl_sched_wakeup_granularity's default value hasn't been touched since
> 2009:
> 
> 172e082a9111 ("sched: Re-tune the scheduler latency defaults to decrease worst-case latencies")
>
> and the automagic scaling (see kernel/sched/fair.c::update_sysctl()) hasn't
> changed much either.
> 
> What's likely to happen here is that you have a service in your distro (or
> somesuch) tweaking those values and since the incriminated commit moves
> those files to /sys/kernel/debug/sched/, said service doesn't do anything
> anymore.

Hi Valentin,

In our testing, we use the RHEL 8.1 distro. It looks like the tuned daemon
writes 15ms (per tuned's virtual-guest profile) to the
/proc/sys/kernel/sched_wakeup_granularity_ns file during its startup.

The tuned daemon is now unable to update the value because the commit moves
those files to debugfs, so sched_wakeup_granularity_ns stays at the
default value.

Regards,
Rahul Gopakumar,
Performance Engineering,
VMware, Inc.


* Re: [Linux Kernel 5.13 GA] ESXi Performance regression
  2021-08-05 14:33   ` Rahul Gopakumar
@ 2021-08-05 14:58     ` Steven Rostedt
  2021-08-05 15:05       ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2021-08-05 14:58 UTC (permalink / raw)
  To: Rahul Gopakumar
  Cc: Valentin Schneider, Abdul Anshad Azeez, linux-kernel, peterz,
	mingo, juri.lelli, vincent.guittot, Rajender M, Linus Torvalds

On Thu, 5 Aug 2021 14:33:52 +0000
Rahul Gopakumar <gopakumarr@vmware.com> wrote:

> In our testing, we use RHEL 8.1 distro. It looks like tuned daemon updates
> 15ms (per tuned virtual-guest profile) in /proc/sys/kernel/sched_wakeup_granularity_ns
> file during tuned's startup.
> 
> Now tuned daemon is not able to update the value as the commit moves those
> files to debugfs and thus sched_wakeup_granularity_ns file remains with the
> default value.

Hmm, is this a user space breakage?

-- Steve


* Re: [Linux Kernel 5.13 GA] ESXi Performance regression
  2021-08-05 14:58     ` Steven Rostedt
@ 2021-08-05 15:05       ` Peter Zijlstra
  2021-08-05 15:24         ` Steven Rostedt
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2021-08-05 15:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Rahul Gopakumar, Valentin Schneider, Abdul Anshad Azeez,
	linux-kernel, mingo, juri.lelli, vincent.guittot, Rajender M,
	Linus Torvalds

On Thu, Aug 05, 2021 at 10:58:53AM -0400, Steven Rostedt wrote:
> On Thu, 5 Aug 2021 14:33:52 +0000
> Rahul Gopakumar <gopakumarr@vmware.com> wrote:
> 
> > In our testing, we use RHEL 8.1 distro. It looks like tuned daemon updates
> > 15ms (per tuned virtual-guest profile) in /proc/sys/kernel/sched_wakeup_granularity_ns
> > file during tuned's startup.
> > 
> > Now tuned daemon is not able to update the value as the commit moves those
> > files to debugfs and thus sched_wakeup_granularity_ns file remains with the
> > default value.
> 
> Hmm, is this a user space breakage?

All those files were under CONFIG_SCHED_DEBUG and a !DEBUG build would
not have them to begin with.


* Re: [Linux Kernel 5.13 GA] ESXi Performance regression
  2021-08-05 15:05       ` Peter Zijlstra
@ 2021-08-05 15:24         ` Steven Rostedt
  2021-08-05 15:28           ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2021-08-05 15:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rahul Gopakumar, Valentin Schneider, Abdul Anshad Azeez,
	linux-kernel, mingo, juri.lelli, vincent.guittot, Rajender M,
	Linus Torvalds

On Thu, 5 Aug 2021 17:05:00 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > Hmm, is this a user space breakage?  
> 
> All those files were under CONFIG_SCHED_DEBUG and a !DEBUG build would
> not have them to begin with.

But you can have a config with CONFIG_DEBUG_FS disabled, and DEBUG and
SCHED_DEBUG enabled, which means that there's now configs where this
value is no longer available.

-- Steve



* Re: [Linux Kernel 5.13 GA] ESXi Performance regression
  2021-08-05 15:24         ` Steven Rostedt
@ 2021-08-05 15:28           ` Peter Zijlstra
  0 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2021-08-05 15:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Rahul Gopakumar, Valentin Schneider, Abdul Anshad Azeez,
	linux-kernel, mingo, juri.lelli, vincent.guittot, Rajender M,
	Linus Torvalds

On Thu, Aug 05, 2021 at 11:24:37AM -0400, Steven Rostedt wrote:
> On Thu, 5 Aug 2021 17:05:00 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > Hmm, is this a user space breakage?  
> > 
> > All those files were under CONFIG_SCHED_DEBUG and a !DEBUG build would
> > not have them to begin with.
> 
> But you can have a config with CONFIG_DEBUG_FS disabled, and DEBUG and
> SCHED_DEBUG enabled, which means that there's now configs where this
> value is no longer available.

You already had that, notably: CONFIG_SCHED_DEBUG=n.

These have always been debug knobs, if you touch them you get to keep
the pieces.
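
The config combinations discussed in this exchange can be summarised
in a small Python sketch. It is a simplification: it ignores other
prerequisites such as CONFIG_SYSCTL and whether debugfs is mounted, and
knob_location is a hypothetical helper.

```python
def knob_location(kernel, sched_debug, debug_fs):
    """Return the directory holding the scheduler knobs, or None if unavailable.

    kernel is a (major, minor) tuple; sched_debug and debug_fs mirror
    CONFIG_SCHED_DEBUG and CONFIG_DEBUG_FS.
    """
    if not sched_debug:
        # Without CONFIG_SCHED_DEBUG the knobs never existed.
        return None
    if kernel < (5, 13):
        return "/proc/sys/kernel/"
    # From 5.13 on the knobs live in debugfs, so CONFIG_DEBUG_FS is needed too.
    return "/sys/kernel/debug/sched/" if debug_fs else None
```

The case raised above is knob_location((5, 13), True, False), which
returns None where the same config on 5.12 would return
"/proc/sys/kernel/".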

