LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Thomas Ilsche <thomas.ilsche@tu-dresden.de>
To: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Peter Zijlstra <peterz@infradead.org>,
Linux PM <linux-pm@vger.kernel.org>,
Frederic Weisbecker <fweisbec@gmail.com>,
"Thomas Gleixner" <tglx@linutronix.de>,
Paul McKenney <paulmck@linux.vnet.ibm.com>,
Doug Smythies <dsmythies@telus.net>,
Rik van Riel <riel@surriel.com>,
"Aubrey Li" <aubrey.li@linux.intel.com>,
Mike Galbraith <mgalbraith@suse.de>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFT][PATCH v7.3 5/8] cpuidle: Return nohz hint from cpuidle_select()
Date: Tue, 10 Apr 2018 17:22:48 +0200 [thread overview]
Message-ID: <079e16b6-2e04-2518-0553-66becab226a6@tu-dresden.de> (raw)
In-Reply-To: <24561274.DadG5k7Qxl@aspire.rjw.lan>
>> However my fundamental concerns about the policy whether to disable the sched
>> tick remain:
>>
>> Mixing the precise timer and vague heuristic for the decision is
>> dangerous. The timer should not be wrong, heuristic may be.
>
> Well, I wouldn't say "dangerous". It may be suboptimal, but even that is not
> a given. Besides ->
>
>> Decisions should use actual time points rather than the generic tick
>> duration and residency time. e.g.
>> expected_interval < delta_next_us
>> vs
>> expected_interval < TICK_USEC
>
> -> the role of this check is to justify taking the overhead of stopping the
> tick and it certainly is justifiable if the governor doesn't anticipate any
> wakeups (timer or not) in the TICK_USEC range. It may be justifiable in
> other cases too, but that's a matter of some more complex checks and may not
> be worth the extra complexity at all.
I tried that change. It's just just a bit bulky because I
cache the result of ktime_to_us(delta_next) early.
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index a6eca02..cad87bf 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -296,6 +296,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
unsigned long nr_iowaiters, cpu_load;
int resume_latency = dev_pm_qos_raw_read_value(device);
ktime_t delta_next;
+ unsigned long delta_next_us;
if (data->needs_update) {
menu_update(drv, dev);
@@ -314,6 +315,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
/* determine the expected residency time, round up */
data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length(&delta_next));
+ delta_next_us = ktime_to_us(delta_next);
get_iowait_load(&nr_iowaiters, &cpu_load);
data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
@@ -364,7 +366,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
*/
if (data->predicted_us < TICK_USEC)
data->predicted_us = min_t(unsigned int, TICK_USEC,
- ktime_to_us(delta_next));
+ delta_next_us);
} else {
/*
* Use the performance multiplier and the user-configurable
@@ -412,9 +414,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
* expected idle duration is shorter than the tick period length.
*/
if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
- expected_interval < TICK_USEC) {
- unsigned int delta_next_us = ktime_to_us(delta_next);
-
+ expected_interval < delta_next_us) {
*stop_tick = false;
if (!tick_nohz_tick_stopped() && idx > 0 &&
This works as a I expect in the sense of stopping the tick more often
and allowing deeper sleep states in some cases. However, it
drastically *increases* the power consumption for some affected
workloads test system (SKL-SP).
So while I believe this generally improves the behavior - I can't
recommend it based on the practical implications. Below are some
details for the curious:
power consumption for various workload configurations with 250 Hz
kernels 4.16.0, v9, v9+delta_next patch:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v9_250_Hz_power.png
Practically v9 has the best power consumption in most cases.
The following detailed looks are with 1000 Hz kernels.
v9 with a synchronized 950 us sleep workload:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v9_poll_sync.png
v9 with a staggered 950 us sleep workload:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v9_poll_stagger.png
Both show that the sched tick is kept on and this causes more requests
to C1E than C6
v9+delta_next with a synchronized 950 us sleep workload:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v9_delta_poll_sync.png
v9+delta_next with a staggered 950 us sleep workload:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v9_delta_poll_stagger.png
No more sched tick, no more C1E requests, but increased power.
Besides:
- the hardware reports 0 residency in C6 (both core and PKG) for
both v9 and v9+delta_next_us.
- the increased power consumption comes after a ramp-up of ~200 ms
for the staggered and ~2 s for the synchronized workload.
For reference traces from an unmodified 4.16.0:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v4.16.0_poll_sync.png
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v4.16.0_poll_stagger.png
It behaves similar to the delta_next patch but does not show the
increased power consumption in this exact workload configuration.
I couldn't help to dig into the effect a bit more and am able to
reproduce it even under unmodified kernels with staggered sleep cycles
between ~1.2 ms and ~2.5 ms where power is increased by > 40 W.
Anyway, this effect seems to be beyond what the governor should
consider. It is an example where it doesn't seem possible to decide
for the optimal C state without considering the state of other cores
and such unexpected hardware behavior.
And these are only the results from one system and a limited set of
workload configurations.
>> For some cases the unmodified sched tick is not efficient as fallback.
>> Is it feasible to
>> 1) enable the sched tick when it's currently disabled instead of
>> blindly choosing a different C state?
>
> It is not "blindly" if you will. It is very much "consciously". :-)
>
> Restarting the tick from within menu_select() wouldn't work IMO and
> restarting it from cpuidle_idle_call() every time it was stopped might
> be wasteful.
>
> It could be done, but AFAICS if the CPU in deep idle is woken up by an
> occasional interrupt that doesn't set need_resched, it is more likely
> to go into deep idle again than to go into shallow idle at that point.
>
>> 2) modify the next upcoming sched tick to be better suitable as
>> fallback timer?
>
> Im not sure what you mean.
>
>> I think with the infrastructure changes it should be possible to
>> implement the policy I envisioned previously
>> (https://marc.info/?l=linux-pm&m=151384941425947&w=2), which is based
>> on the ordering of timers and the heuristically predicted idle time.
>> If the sleep_length issue is fixed and I have some mechanism for a
>> modifiable fallback timer, I'll try to demonstrate that on top of your
>> changes.
>
> Sure. I'm not against adding more complexity to this in principle, but there
> needs to be a good enough justification for it.
>
> As I said in one of the previous messages, if simple code gets the job done,
> the extra complexity may just not be worth it. That's why I went for very
> simple code here. Still, if there is a clear case for making it more complex,
> that can be done.
>
> Thanks!
>
next prev parent reply other threads:[~2018-04-10 15:23 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-03-20 15:12 [RFT][PATCH v7 0/8] sched/cpuidle: Idle loop rework Rafael J. Wysocki
2018-03-20 15:13 ` [RFT][PATCH v7 1/8] time: tick-sched: Reorganize idle tick management code Rafael J. Wysocki
2018-03-20 15:15 ` [RFT][PATCH v7 2/8] sched: idle: Do not stop the tick upfront in the idle loop Rafael J. Wysocki
2018-03-20 15:15 ` [RFT][PATCH v7 3/8] " Rafael J. Wysocki
2018-03-20 18:00 ` [Correction][RFT][PATCH v7 3/8] sched: idle: Do not stop the tick before cpuidle_idle_call() Rafael J. Wysocki
2018-03-20 15:16 ` [RFT][PATCH v7 4/8] jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC Rafael J. Wysocki
2018-03-20 15:45 ` [RFT][PATCH v7 5/8] cpuidle: Return nohz hint from cpuidle_select() Rafael J. Wysocki
2018-03-21 6:48 ` [RFT][PATCH v7.1 " Rafael J. Wysocki
2018-03-21 11:52 ` Rafael J. Wysocki
2018-03-21 13:03 ` [RFT][PATCH v7.2 " Rafael J. Wysocki
2018-03-21 14:36 ` [RFT][PATCH v7 " Rafael J. Wysocki
2018-03-21 17:59 ` Thomas Ilsche
2018-03-21 22:15 ` Rafael J. Wysocki
2018-03-22 13:18 ` Thomas Ilsche
2018-03-22 17:23 ` Rafael J. Wysocki
2018-03-22 6:24 ` Doug Smythies
2018-03-22 15:41 ` Doug Smythies
2018-03-22 17:21 ` Rafael J. Wysocki
2018-03-21 18:23 ` Doug Smythies
2018-03-22 17:40 ` [RFT][PATCH v7.3 " Rafael J. Wysocki
2018-03-28 9:14 ` Thomas Ilsche
2018-03-30 9:39 ` Rafael J. Wysocki
2018-04-10 15:22 ` Thomas Ilsche [this message]
2018-03-22 20:46 ` Doug Smythies
2018-03-20 15:45 ` [RFT][PATCH v7 6/8] sched: idle: Select idle state before stopping the tick Rafael J. Wysocki
2018-03-27 21:50 ` Thomas Ilsche
2018-03-27 22:10 ` Rafael J. Wysocki
2018-03-28 8:13 ` Rafael J. Wysocki
2018-03-28 8:38 ` Thomas Ilsche
2018-03-28 10:37 ` Rafael J. Wysocki
2018-03-28 10:56 ` Rafael J. Wysocki
2018-03-28 15:15 ` Thomas Ilsche
2018-03-28 20:41 ` Doug Smythies
2018-03-28 23:11 ` Rafael J. Wysocki
2018-03-20 15:46 ` [RFT][PATCH v7 7/8] cpuidle: menu: Refine idle state selection for running tick Rafael J. Wysocki
2018-03-20 15:47 ` [RFT][PATCH v7 8/8] cpuidle: menu: Avoid selecting shallow states with stopped tick Rafael J. Wysocki
2018-03-20 17:52 ` [RFT][PATCH v7 3/8] sched: idle: Do not stop the tick upfront in the idle loop Doug Smythies
2018-03-20 18:01 ` Rafael J. Wysocki
2018-03-21 12:31 ` [RFT][PATCH v7 0/8] sched/cpuidle: Idle loop rework Rik van Riel
2018-03-21 13:55 ` Rafael J. Wysocki
2018-03-21 14:53 ` Rik van Riel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=079e16b6-2e04-2518-0553-66becab226a6@tu-dresden.de \
--to=thomas.ilsche@tu-dresden.de \
--cc=aubrey.li@linux.intel.com \
--cc=dsmythies@telus.net \
--cc=fweisbec@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pm@vger.kernel.org \
--cc=mgalbraith@suse.de \
--cc=paulmck@linux.vnet.ibm.com \
--cc=peterz@infradead.org \
--cc=riel@surriel.com \
--cc=rjw@rjwysocki.net \
--cc=tglx@linutronix.de \
--subject='Re: [RFT][PATCH v7.3 5/8] cpuidle: Return nohz hint from cpuidle_select()' \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).