LKML Archive on
From: Daniel Lezcano <>
To: Morten Rasmussen <>
Cc: Vincent Guittot <>,
	Peter Zijlstra <>,
	Ingo Molnar <>,
	Juri Lelli <>,
	Dietmar Eggemann <>,
	Steven Rostedt <>,
	Ben Segall <>,
	linux-kernel <>,
	Qais Yousef <>,
	Valentin Schneider <>
Subject: Re: [PATCH V2] sched: fair: Use the earliest break even
Date: Tue, 17 Mar 2020 18:07:43 +0100	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On 17/03/2020 15:30, Morten Rasmussen wrote:
> On Tue, Mar 17, 2020 at 02:48:51PM +0100, Daniel Lezcano wrote:
>> Hi Morten,
>> On 17/03/2020 08:56, Morten Rasmussen wrote:
>>> Hi Daniel,
>>> First, I think letting the scheduler know about desired minimum idle
>>> times is an interesting optimization if the overhead can be kept at a
>>> minimum. I do have a few comments about the patch though.
>>> On Thu, Mar 12, 2020 at 11:04:19AM +0100, Daniel Lezcano wrote:
>>>> On 12/03/2020 09:36, Vincent Guittot wrote:
>>>>> Hi Daniel,
>>>>> On Wed, 11 Mar 2020 at 21:28, Daniel Lezcano <> wrote:
>>>>>> In the idle CPU selection process occurring in the slow path via the
>>>>>> find_idlest_group_cpu() function, we pick up in priority an idle CPU
>>>>>> with the shallowest idle state otherwise we fall back to the least
>>>>>> loaded CPU.
>>>>> The idea makes sense but this path is only used by fork and exec so
>>>>> I'm not sure about the real impact
>>>> I agree the fork / exec path is called much less often than the wake
>>>> path but it makes more sense for the decision.
>>> Looking at the flow in find_idlest_cpu(), AFAICT,
>>> find_idlest_group_cpu() is not actually making the final choice of CPU,
>>> so going through a lot of trouble there looking at idle states is
>>> pointless. Is there something I don't see?
>>> We follow sd->child until groups == CPUs, which means that
>>> find_idlest_group() actually makes the final choice, as the final group
>>> passed to find_idlest_group_cpu() is a single-CPU group. The flow has been
>>> like that for years. Even before you added the initial idle-state
>>> awareness.
>>> I agree with Vincent, if this should really make a difference it should
>>> include wake-ups of existing tasks too, although I'm aware it would be a
>>> more invasive change. As said from the beginning, the idea is fine, but
>>> the current implementation should not make any measurable difference?
>> The wake-up path is so sensitive that I'm not comfortable making any
>> changes to it. That is why the patch only changes the slow path.
> Right. I'm not against being cautious at all. It would be interesting to
> evaluate how bad it really is. The extra time-stamping cost is
> the same, so it really comes down to how much we dare to use the information in
> the fast-path and change the CPU selection policy. And of course, how
> much can be gained by the change.

If the change provided by this patch is acceptable, I can give the fast
path a try.

>>>>>> In order to be more energy efficient but without impacting the
>>>>>> performances, let's use another criteria: the break even deadline.
>>>>>> At idle time, when we store the idle state the CPU is entering in, we
>>>>>> compute the next deadline where the CPU could be woken up without
>>>>>> spending more energy to sleep.
>>> I don't follow the argument that sleeping longer should improve energy
>>> consumption. 
>> Maybe it is not explained correctly.
>> The patch is about selecting a CPU with the smallest break even deadline
>> value. In a group of idle CPUs in the same idle state, we will pick the
>> one with the smallest break-even deadline, which is the one with the
>> highest probability it already reached its target residency.
>> It is best effort.
> Indeed. I get what the patch does, I just don't see how the patch
> improves energy efficiency.

If the CPU is woken up before it reaches the break-even point, the energy
cost of entering the idle state is greater than the energy saved by sleeping.

Am I misunderstanding your point?
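To make the break-even idea concrete, here is a minimal C sketch (not the
actual kernel code; the struct and function names are illustrative
assumptions). On idle entry we pre-compute the deadline from the idle
state's target residency, as the patch description suggests, so checking it
later is a single comparison:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of an idle state: target_residency_ns is the
 * minimum sleep time needed to amortize the enter/exit energy cost. */
struct idle_state {
	uint64_t target_residency_ns;
};

/* Hypothetical per-CPU bookkeeping filled in on idle entry. */
struct cpu_idle_info {
	uint64_t idle_start_ns;  /* when the CPU entered this idle state */
	uint64_t break_even_ns;  /* idle_start + target residency */
};

static void enter_idle(struct cpu_idle_info *info,
		       const struct idle_state *state, uint64_t now_ns)
{
	info->idle_start_ns = now_ns;
	/* Waking before this deadline costs more energy than was saved. */
	info->break_even_ns = now_ns + state->target_residency_ns;
}

/* True once waking the CPU no longer wastes the transition energy. */
static int reached_break_even(const struct cpu_idle_info *info,
			      uint64_t now_ns)
{
	return now_ns >= info->break_even_ns;
}
```

The pre-computation matches the argument made later in the thread: the
deadline is inspected on every selection, so computing it once at idle
entry is cheaper than deriving it each time.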

>>> The patch doesn't affect the number of idle state
>>> enter/exit cycles, so you spend the same amount of energy on those
>>> transitions. The main change is that idle time get spread out, so CPUs
>>> are less likely to be in the process of entering an idle state when they
>>> are asked to wake back up again.
>>> Isn't it fair to say that we expect the total number of wake-ups remains
>>> unchanged? Total busy and idle times across all CPUs should remain the
>>> same too? Unless chosen idle-state is changed, which I don't think we
>>> expect either, there should be no net effect on energy? The main benefit
>>> is reduced wake-up latency I think.
>>> Regarding chosen idle state, I'm wondering how this patch affects the
>>> cpuidle governor's idle state selection. Could the spreading of wake-ups
>>> trick governor to pick a shallower idle-state for some idle CPUs because
>>> we actively spread wake-ups rather than consolidating them? Just a
>>> thought.
>> Maybe I missed the point: why are we spreading the tasks?
> Picking the CPU with the smallest break-even time-stamp means you pick
> the CPU that has been idle longest in the shallowest idle-state. If you
> periodically one-shot spawn tasks at a rate which is long enough that
> the shallowest state is the same for several CPUs, you would end up
> picking the least recently used CPU each time effectively spreading the
> wake-ups across all the CPUs in the same state.

Ok, thanks for the clarification.
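The selection policy under discussion can be sketched as follows. This is
an illustrative stand-alone loop, not the kernel's find_idlest_group_cpu();
the arrays and names are assumptions. Among idle CPUs, the shallowest
idle-state index wins, and the break-even deadline breaks ties within the
same state, which is what produces the spreading effect described above:

```c
#include <limits.h>
#include <stdint.h>

/* Pick the idle CPU in the shallowest state; among CPUs in that state,
 * prefer the one with the earliest break-even deadline (i.e. the one
 * most likely to have already reached its target residency). */
static int pick_cpu(const uint64_t *break_even_ns, const int *state_idx,
		    int nr_cpus)
{
	int best = -1;
	int shallowest = INT_MAX;
	uint64_t earliest = UINT64_MAX;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (state_idx[cpu] < shallowest ||
		    (state_idx[cpu] == shallowest &&
		     break_even_ns[cpu] < earliest)) {
			shallowest = state_idx[cpu];
			earliest = break_even_ns[cpu];
			best = cpu;
		}
	}
	return best;
}
```

With several CPUs tied in the same shallow state, the smallest break-even
deadline belongs to the CPU that entered idle longest ago, so successive
wake-ups rotate across those CPUs rather than reusing one.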

> Thinking more about it, it might not be a real problem: if one of the
> CPUs suddenly chooses a shallower idle-state, it would become the target
> of all new tasks from that point onwards.


>> We are taking the decision on the same sched domain, no?
> I'm not sure I get the relation to the sched_domain?

Never mind. I misunderstood your comment.

>>>>>> At the selection process, we use the shallowest CPU but in addition we
>>>>>> choose the one with the minimal break even deadline instead of relying
>>>>>> on the idle_timestamp. When the CPU is idle, the timestamp has less
>>>>>> meaning because the CPU could have woken up and slept again several times
>>>>>> without exiting the idle loop. In this case the break even deadline is
>>>>>> more relevant as it increases the probability of choosing a CPU which
>>>>>> reached its break even.
>>> I guess you could improve the idle time stamping without adding the
>>> break-even time; they don't have to go together?
>> Yes, we can add the idle start time when entering idle in the
>> cpuidle_enter function, which is different from idle_timestamp, which
>> reflects when the idle task was scheduled. I sent an RFC for that [1].
>> However, each time we would like to inspect the deadline, we will have
>> to compute it, so IMO it makes more sense to pre-compute it when
>> entering idle in addition to the idle start.
>> [1]
> Yes, I saw that patch too. Seems to make sense :-)
>>>>>> Tested on:
>>>>>>  - a synquacer 24 cores, 6 sched domains
>>>>>>  - a hikey960 HMP 8 cores, 2 sched domains, with the EAS and energy probe
>>>>>> sched/perf and messaging do not show a performance regression. Ran
>>>>>> 50 times schbench, adrestia and forkbench.
>>>>>> The tools described at
>>>>>>  --------------------------------------------------------------
>>>>>> | Synquacer             | With break even | Without break even |
>>>>>>  --------------------------------------------------------------
>>>>>> | schbench *99.0th      |      14844.8    |         15017.6    |
>>>>>> | adrestia / periodic   |        57.95    |              57    |
>>>>>> | adrestia / single     |         49.3    |            55.4    |
>>>>>>  --------------------------------------------------------------
>>>>> Have you got some figures or cpuidle statistics for the synquacer?
>>>> No, and we just noticed the synquacer has a bug in the firmware and
>>>> does not actually go to the idle states.
>>> I would also like some statistics to help understanding what actually
>>> changes.
>>> I did some measurements on TX2, which only has one idle-state. I don't
>>> see the same trends as you do. adrestia single seems to be most affected
>>> by the patch, but _increases_ with the break_even patch rather than
>>> decrease. I don't trust adrestia too much though as the time resolution
>>> is low on TX2.
>>> TX2			tip		break_even
>>> ----------------------------------------------------
>>> adrestia / single	5.21		5.51
>>> adrestia / periodic	5.75		5.67
>>> schbench 99.0th		45465.6		45376.0
>>> hackbench		27.9851		27.9775
>>> Notes:
>>> adrestia: Avg of 100 runs: adrestia -l 25000
>>> schbench: Avg of 10 runs: schbench -m16 -t64
>>> hackbench: Avg of 10 runs: hackbench -g 20 -T 256 -l 100000
>> Thanks for testing. Is that a Jetson TX2 from Nvidia? If that is the
>> case, IIRC, it has some kind of switcher for the CPUs in the firmware;
>> I don't know how that can interact with the testing.
> Sorry, I should have been clearer. It is a ThunderX2. 2x 32-core (128
> threads) = 256 HW threads.

Indeed, that is different :)

I have a synquacer, but we found a bug in the firmware; hopefully it
can be fixed.

 <> │ Open source software for ARM SoCs



Thread overview: 19+ messages
2020-03-11 20:26 Daniel Lezcano
2020-03-12  8:36 ` Vincent Guittot
2020-03-12 10:04   ` Daniel Lezcano
2020-03-12 12:27     ` Vincent Guittot
2020-03-13 12:15       ` Daniel Lezcano
2020-03-13 13:15         ` Vincent Guittot
2020-03-13 13:17           ` Daniel Lezcano
2020-03-13 13:21             ` Vincent Guittot
2020-03-17  7:56     ` Morten Rasmussen
2020-03-17 13:48       ` Daniel Lezcano
2020-03-17 14:30         ` Morten Rasmussen
2020-03-17 17:07           ` Daniel Lezcano [this message]
2020-03-18  8:24             ` Morten Rasmussen
2020-03-18 10:17               ` Daniel Lezcano
2020-03-18 14:38                 ` Morten Rasmussen
2020-03-18  7:51           ` Vincent Guittot
2020-03-18  8:33             ` Morten Rasmussen
2020-03-17 10:56 ` Valentin Schneider
2020-03-17 13:59   ` Daniel Lezcano
