LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC PATCH v4 00/12] Energy Aware Scheduling
@ 2018-06-28 11:40 Quentin Perret
  2018-06-28 11:40 ` [RFC PATCH v4 01/12] sched: Relocate arch_scale_cpu_capacity Quentin Perret
                   ` (11 more replies)
  0 siblings, 12 replies; 54+ messages in thread
From: Quentin Perret @ 2018-06-28 11:40 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

The Energy Aware Scheduler (EAS) based on Morten Rasmussen's posting on
LKML [1] is currently part of the AOSP Common Kernel and runs on today's
smartphones with Arm's big.LITTLE CPUs. This series implements a new and
largely simplified version of EAS based on an Energy Model (EM) of the
platform with only costs information for the active states of the CPUs.

The patch-set is organized in three main parts.
1. Patches 01-04/12 introduce a centralized and independent EM
   management framework
2. Patches 05-10/12 make use of the EM in the scheduler to bias task
   placement decisions
3. Patches 11-12/12 give an Arm64 example on how to register an Energy
   Model in the new framework


1. The Energy Model Framework

The energy consumed by the CPUs can be provided to the OS in different
ways depending on the source of information (firmware or device tree for
example). The EM framework introduced in patch 03/12 addresses this
issue by aggregating the data coming from the drivers in a standard way
and making it available to interested clients (thermal or the task
scheduler, for example).

Although this series comprises patches for the task scheduler as a user
of the EM, the framework itself as introduced in patch 03/12 is meant to
be _independent_ from any other subsystem.

The overall design of the EM framework is depicted on the diagram below
(focused on Arm drivers for the example, but applicable to any
architecture).

   +---------------+  +-----------------+  +---------+
   | Thermal (IPA) |  | Scheduler (EAS) |  | Other ? |
   +---------------+  +-----------------+  +---------+
           |                 | em_fd_energy()   |
           |                 | em_cpu_get()     |
           +-----------+     |         +--------+
                       |     |         |
                       v     v         v
                    +---------------------+
                    |                     |         +---------------+
                    |    Energy Model     |         | arch_topology |
                    |                     |<--------|    driver     |
                    |     Framework       |         +---------------+
                    |                     | em_rescale_cpu_capacity()
                    +---------------------+
                       ^       ^       ^
                       |       |       | em_register_freq_domain()
            +----------+       |       +---------+
            |                  |                 |
    +---------------+  +---------------+  +--------------+
    |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
    +---------------+  +---------------+  +--------------+
            ^                  ^                 ^
            |                  |                 |
    +--------------+   +---------------+  +--------------+
    | Device Tree  |   |   Firmware    |  |      ?       |
    +--------------+   +---------------+  +--------------+

Drivers can register data in the EM framework using the
em_register_freq_domain() API. They are expected to provide a callback

function that the EM framework can use to build energy cost tables and
store them in shared data structures. Then, clients such as the task
scheduler are allowed to read those shared structures using the
em_fd_energy() and em_cpu_get() APIs. The EM framework also provides
another API enabling to dynamically re-scale the CPU capacity values of
the EM, if needed. The rationale behind this need for dynamic re-scaling
and more details about the different APIs of the framework can be found
in patch 03/12.


2. Energy-aware task placement in the task scheduler

Patches 05-10/12 make use of the newly introduced EM in the scheduler to
bias task placement decisions. When the system is detected as
non-”overutilized”, an EM is available, and the platform has an
asymmetric CPU capacity topology (e.g. big.LITTLE), the consequences on
energy of placing a waking task on a CPU are taken into account to avoid
energy-inefficient CPUs if possible. Patches 05-07/12 modify the
scheduler topology code in order to: 1) check if all conditions for EAS
are met when the scheduling domains are built; and 2) create data
structures holding references on the EM tables that can be accessed in
latency sensitive code paths (e.g. wake-up path).

An “overutilized” flag (patch 08/12) is attached to the root domain,
and is set whenever a CPU is utilized at more than 80% of its capacity.
Patches 09-10/12 introduce the new energy-aware wake-up path which makes
use of the data structures introduced in patches 05-07/10 whenever the
system isn’t overutilized.


3. Arm example of driver modifications to register an EM

Patches 11-12/12 show an example of how drivers should be modified to
register an EM in the new framework and potentially re-scale it. The
patches target Arm drivers, as an example, but the same ideas should be
applicable for others architectures. Patch 11/12 calls the re-scaling
API of the EM framework from the arch_topology driver which changes the
CPU capacity values after CPUFreq is up and running. Patch 12/12 changes
the cpufreq-dt driver (used for testing on Hikey960, see Section 4.) to
provide estimated power values to the EM framework using coefficients
read from DT. This patch has been made simple and self-contained
intentionally in order to show an example of usage of the EM framework.


4. Test results

Two fundamentally different tests were executed. Firstly the energy test
case shows the impact on energy consumption this patch-set has using a
synthetic set of tasks. Secondly the performance test case provides the
conventional hackbench metric numbers.

The tests run on two arm64 big.LITTLE platforms: Hikey960 (4xA73 +
4xA53) and Juno r0 (2xA57 + 4xA53).

Base kernel is tip/sched/core (4.17-rc6), with some Hikey960 and Juno
specific patches, the SD_ASYM_CPUCAPACITY flag set at DIE sched domain
level for arm64 and schedutil as cpufreq governor [2].

4.1 Energy test case

10 iterations of between 10 and 50 periodic rt-app tasks (16ms period,
5% duty-cycle) for 30 seconds with energy measurement. Unit is Joules.
The goal is to save energy, so lower is better.

4.1.1 Hikey960

Energy is measured with an ACME Cape on an instrumented board. Numbers
include consumption of big and little CPUs, LPDDR memory, GPU and most
of the other small components on the board. They do not include
consumption of the radio chip (turned-off anyway) and external
connectors.

+----------+-----------------+-------------------------+
|          | Without patches | With patches            |
+----------+--------+--------+------------------+------+
| Tasks nb |  Mean  | RSD*   | Mean             | RSD* |
+----------+--------+--------+------------------+------+
|       10 | 33.15  |   1.8% |  28.36 (-14.45%) | 1.3% |
|       20 | 50.91  |   3.1% |  42.79 (-15.95%) | 1.4% |
|       30 | 63.97  |   5.0% |  57.35 (-10.35%) | 2.4% |
|       40 | 92.99  |   4.7% |  82.10 (-11.71%) | 6.3% |
|       50 | 133.08 |   5.1% | 106.89 (-19.68%) | 5.8% |
+----------+--------+--------+------------------+------+

4.1.2 Juno r0

Energy is measured with the onboard energy meter. Numbers include
consumption of big and little CPUs.

+----------+-----------------+------------------------+
|          | Without patches | With patches           |
+----------+--------+--------+-----------------+------+
| Tasks nb |  Mean  | RSD*   | Mean            | RSD* |
+----------+--------+--------+-----------------+------+

|       10 |  11.97 |   2.2% |  7.64 (-36.17%) | 2.7% |
|       20 |  20.66 |   2.9% | 13.92 (-29.74%) | 2.2% |
|       30 |  33.18 |   1.9% | 23.22 (-30.02%) | 2.7% |
|       40 |  45.71 |   0.7% | 33.99 (-25.64%) | 5.9% |
|       50 |  56.92 |   0.4% | 53.58 ( -5.87%) | 1.0% |
+----------+--------+--------+-----------------+------+

4.2 Performance test case

30 iterations of perf bench sched messaging --pipe --thread --group G
--loop L with G=[1 2 4 8] and L=50000 (Hikey960)/16000 (Juno r0).

4.2.1 Hikey960

The impact of thermal capping was mitigated thanks to a heatsink, a
fan, and a 10 sec delay between two successive executions.

+----------------+-----------------+------------------------+
|                | Without patches | With patches           |
+--------+-------+---------+-------+----------------+-------+
| Groups | Tasks | Mean    | RSD*  | Mean           | RSD*  |
+--------+-------+---------+-------+----------------+-------+
|      1 |    40 |    8.24 | 0.96% |  8.36 (+1.56%) | 1.48% |
|      2 |    80 |   15.53 | 0.74% | 15.72 (+1.23%) | 0.65% |
|      4 |   160 |   31.97 | 0.50% | 32.35 (+1.18%) | 0.44% |
|      8 |   320 |   66.31 | 0.47% | 67.11 (+1.20%) | 0.37% |
+--------+-------+---------+-------+----------------+-------+

4.2.2 Juno r0

+----------------+-----------------+------------------------+
|                | Without patches | With patches           |
+--------+-------+---------+-------+----------------+-------+
| Groups | Tasks | Mean    | RSD*  | Mean           | RSD*  |
+--------+-------+---------+-------+----------------+-------+
|      1 |    40 |    7.70 | 0.96% |  7.63 (-0.91%) | 1.14% |
|      2 |    80 |   13.94 | 1.05% | 13.88 (-0.43%) | 1.20% |
|      4 |   160 |   26.32 | 1.66% | 26.04 (-1.06%) | 1.38% |
|      8 |   320 |   52.41 | 2.86% | 52.07 (-0.65%) | 2.41% |
+--------+-------+---------+-------+----------------+-------+

*RSD: Relative Standard Deviation (std dev / mean)


5. Version history:

Changes v3[3]->v4:
- Replaced spinlock in EM framework by smp_store_release/READ_ONCE
- Fixed missing locks to protect rcu_assign_pointer in EM framework
- Fixed capacity calculation in EM framework on 32 bits system
- Fixed compilation issue for CONFIG_ENERGY_MODEL=n
- Removed cpumask from struct em_freq_domain, now dynamically allocated
- Power costs of the EM are specified in milliwatts
- Added example of CPUFreq driver modification
- Added doc/comments in the EM framework and better commit header
- Fixed integration issue with util_est in cpu_util_next()
- Changed scheduler topology code to have one freq. dom. list per rd
- Split sched topology patch in smaller patches
- Added doc/comments explaining the heuristic in the wake-up path
- Changed energy threshold for migration to from 1.5% to 6%

Changes v2[4]->v3:
- Removed the PM_OPP dependency by implementing a new EM framework
- Modified the scheduler topology code to take references on the EM data
  structures
- Simplified the overutilization mechanism into a system-wide flag
- Reworked the integration in the wake-up path using the sd_ea shortcut
- Rebased on tip/sched/core (247f2f6f3c70 "sched/core: Don't schedule
  threads on pre-empted vCPUs")

Changes v1[5]->v2:
- Reworked interface between fair.c and energy.[ch] (Remove #ifdef
  CONFIG_PM_OPP from energy.c) (Greg KH)
- Fixed licence & header issue in energy.[ch] (Greg KH)
- Reordered EAS path in select_task_rq_fair() (Joel)
- Avoid prev_cpu if not allowed in select_task_rq_fair() (Morten/Joel)
- Refactored compute_energy() (Patrick)
- Account for RT/IRQ pressure in task_fits() (Patrick)
- Use UTIL_EST and DL utilization during OPP estimation (Patrick/Juri)
- Optimize selection of CPU candidates in the energy-aware wake-up path
- Rebased on top of tip/sched/core (commit b720342849fe “sched/core:
  Update Preempt_notifier_key to modern API”)

[1] https://lkml.org/lkml/2015/7/7/754
[2] http://www.linux-arm.org/git?p=linux-qp.git;a=shortlog;h=refs/heads/upstream/eas_v4
[3] https://marc.info/?l=linux-kernel&m=152691273111941&w=2
[4] https://marc.info/?l=linux-kernel&m=152302902427143&w=2
[5] https://marc.info/?l=linux-kernel&m=152153905805048&w=2

Morten Rasmussen (1):
  sched: Add over-utilization/tipping point indicator

Quentin Perret (11):
  sched: Relocate arch_scale_cpu_capacity
  sched/cpufreq: Factor out utilization to frequency mapping
  PM: Introduce an Energy Model management framework
  PM / EM: Expose the Energy Model in sysfs
  sched/topology: Reference the Energy Model of CPUs when available
  sched/topology: Lowest energy aware balancing sched_domain level
    pointer
  sched/topology: Introduce sched_energy_present static key
  sched/fair: Introduce an energy estimation helper function
  sched/fair: Select an energy-efficient CPU on task wake-up
  OPTIONAL: arch_topology: Start Energy Aware Scheduling
  OPTIONAL: cpufreq: dt: Register an Energy Model

 drivers/base/arch_topology.c     |  15 ++
 drivers/cpufreq/cpufreq-dt.c     |  45 +++-
 include/linux/energy_model.h     | 141 ++++++++++++
 include/linux/sched/cpufreq.h    |   6 +
 include/linux/sched/topology.h   |  19 ++
 kernel/power/Kconfig             |  15 ++
 kernel/power/Makefile            |   2 +
 kernel/power/energy_model.c      | 367 +++++++++++++++++++++++++++++++
 kernel/sched/cpufreq_schedutil.c |   3 +-
 kernel/sched/fair.c              | 238 +++++++++++++++++++-
 kernel/sched/sched.h             |  64 ++++--
 kernel/sched/topology.c          | 203 +++++++++++++++++
 12 files changed, 1090 insertions(+), 28 deletions(-)
 create mode 100644 include/linux/energy_model.h
 create mode 100644 kernel/power/energy_model.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 01/12] sched: Relocate arch_scale_cpu_capacity
  2018-06-28 11:40 [RFC PATCH v4 00/12] Energy Aware Scheduling Quentin Perret
@ 2018-06-28 11:40 ` Quentin Perret
  2018-06-28 11:40 ` [RFC PATCH v4 02/12] sched/cpufreq: Factor out utilization to frequency mapping Quentin Perret
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-06-28 11:40 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

By default, arch_scale_cpu_capacity() is only visible from within the
kernel/sched folder. Relocate it to include/linux/sched/topology.h to
make it visible to other clients needing to know about the capacity of
CPUs, such as the Energy Model framework.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 include/linux/sched/topology.h | 19 +++++++++++++++++++
 kernel/sched/sched.h           | 18 ------------------
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..1e24e88bee6d 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -202,6 +202,17 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
 # define SD_INIT_NAME(type)
 #endif
 
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+		return sd->smt_gain / sd->span_weight;
+
+	return SCHED_CAPACITY_SCALE;
+}
+#endif
+
 #else /* CONFIG_SMP */
 
 struct sched_domain_attr;
@@ -217,6 +228,14 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
 
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
+{
+	return SCHED_CAPACITY_SCALE;
+}
+#endif
+
 #endif	/* !CONFIG_SMP */
 
 static inline int task_node(const struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 67702b4d9ac7..34549bca487d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1730,30 +1730,12 @@ unsigned long arch_scale_freq_capacity(int cpu)
 #ifdef CONFIG_SMP
 extern void sched_avg_update(struct rq *rq);
 
-#ifndef arch_scale_cpu_capacity
-static __always_inline
-unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
-	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
-		return sd->smt_gain / sd->span_weight;
-
-	return SCHED_CAPACITY_SCALE;
-}
-#endif
-
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta * arch_scale_freq_capacity(cpu_of(rq));
 	sched_avg_update(rq);
 }
 #else
-#ifndef arch_scale_cpu_capacity
-static __always_inline
-unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
-{
-	return SCHED_CAPACITY_SCALE;
-}
-#endif
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
 static inline void sched_avg_update(struct rq *rq) { }
 #endif
-- 
2.17.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 02/12] sched/cpufreq: Factor out utilization to frequency mapping
  2018-06-28 11:40 [RFC PATCH v4 00/12] Energy Aware Scheduling Quentin Perret
  2018-06-28 11:40 ` [RFC PATCH v4 01/12] sched: Relocate arch_scale_cpu_capacity Quentin Perret
@ 2018-06-28 11:40 ` Quentin Perret
  2018-06-28 11:40 ` [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework Quentin Perret
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-06-28 11:40 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

The schedutil governor maps utilization values to frequencies by applying
a 25% margin. Since this sort of mapping mechanism can be needed by other
users (i.e. EAS), factor the utilization-to-frequency mapping code out
of schedutil and move it to include/linux/sched/cpufreq.h to avoid code
duplication. The new map_util_freq() function is inlined to avoid
overheads.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 include/linux/sched/cpufreq.h    | 6 ++++++
 kernel/sched/cpufreq_schedutil.c | 3 ++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index 59667444669f..afa940cd50dc 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -20,6 +20,12 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
                        void (*func)(struct update_util_data *data, u64 time,
 				    unsigned int flags));
 void cpufreq_remove_update_util_hook(int cpu);
+
+static inline unsigned long map_util_freq(unsigned long util,
+					unsigned long freq, unsigned long cap)
+{
+	return (freq + (freq >> 2)) * util / cap;
+}
 #endif /* CONFIG_CPU_FREQ */
 
 #endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 2b5096b3725d..714ad11517f0 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -13,6 +13,7 @@
 
 #include "sched.h"
 
+#include <linux/sched/cpufreq.h>
 #include <trace/events/power.h>
 
 struct sugov_tunables {
@@ -163,7 +164,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
 
-	freq = (freq + (freq >> 2)) * util / max;
+	freq = map_util_freq(util, freq, max);
 
 	if (freq == sg_policy->cached_raw_freq && sg_policy->next_freq != UINT_MAX)
 		return sg_policy->next_freq;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-06-28 11:40 [RFC PATCH v4 00/12] Energy Aware Scheduling Quentin Perret
  2018-06-28 11:40 ` [RFC PATCH v4 01/12] sched: Relocate arch_scale_cpu_capacity Quentin Perret
  2018-06-28 11:40 ` [RFC PATCH v4 02/12] sched/cpufreq: Factor out utilization to frequency mapping Quentin Perret
@ 2018-06-28 11:40 ` Quentin Perret
  2018-07-05 14:31   ` Peter Zijlstra
                     ` (4 more replies)
  2018-06-28 11:40 ` [RFC PATCH v4 04/12] PM / EM: Expose the Energy Model in sysfs Quentin Perret
                   ` (8 subsequent siblings)
  11 siblings, 5 replies; 54+ messages in thread
From: Quentin Perret @ 2018-06-28 11:40 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

Several subsystems in the kernel (task scheduler and/or thermal at the
time of writing) can benefit from knowing about the energy consumed by
CPUs. Yet, this information can come from different sources (DT or
firmware for example), in different formats, hence making it hard to
exploit without a standard API.

As an attempt to address this, introduce a centralized Energy Model
(EM) management framework which aggregates the power values provided
by drivers into a table for each frequency domain in the system. The
power cost tables are made available to interested clients (e.g. task
scheduler or thermal) via platform-agnostic APIs. The overall design
is represented by the diagram below (focused on Arm-related drivers as
an example, but hopefully applicable to any architecture):

     +---------------+  +-----------------+  +---------+
     | Thermal (IPA) |  | Scheduler (EAS) |  | Other ? |
     +---------------+  +-----------------+  +---------+
             |                 | em_fd_energy()   |
             |                 | em_cpu_get()     |
             +-----------+     |         +--------+
                         |     |         |
                         v     v         v
                      +---------------------+
                      |                     |         +---------------+
                      |    Energy Model     |         | arch_topology |
                      |                     |<--------|    driver     |
                      |     Framework       |         +---------------+
                      |                     | em_rescale_cpu_capacity()
                      +---------------------+
                         ^       ^       ^
                         |       |       | em_register_freq_domain()
              +----------+       |       +---------+
              |                  |                 |
      +---------------+  +---------------+  +--------------+
      |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
      +---------------+  +---------------+  +--------------+
              ^                  ^                 ^
              |                  |                 |
      +--------------+   +---------------+  +--------------+
      | Device Tree  |   |   Firmware    |  |      ?       |
      +--------------+   +---------------+  +--------------+

Drivers (typically, but not limited to, CPUFreq drivers) can register
data in the EM framework using the em_register_freq_domain() API. The
calling driver must provide a callback function with a standardized
signature that will be used by the EM framework to build the power
cost tables of the frequency domain. This design should offer a lot of
flexibility to calling drivers which are free of reading information
from any location and to use any technique to compute power costs.
Moreover, the capacity states registered by drivers in the EM framework
are not required to match real performance states of the target. This
is particularly important on targets where the performance states are
not known by the OS.

On the client side, the EM framework offers APIs to access the power
cost tables of a CPU (em_cpu_get()), and to estimate the energy
consumed by the CPUs of a frequency domain (em_fd_energy()). Clients
such as the task scheduler can then use these APIs to access the shared
data structures holding the Energy Model of CPUs.

The EM framework also provides an API (em_rescale_cpu_capacity()) to
re-scale the capacity values of the model asynchronously, after it has
been created. This is required for architectures where the capacity
scale factor of CPUs can change at run-time. This is the case for
Arm/Arm64 for example where the arch_topology driver recomputes the
capacity scale factors of the CPUs after the maximum frequency of all
CPUs has been discovered. Although complex, the process of creating and
re-scaling the EM has to be kept in two separate steps to fulfill the
needs of the different users. The thermal subsystem doesn't use the
capacity values and shouldn't have dependencies on subsystems providing
them. On the other hand, the task scheduler needs the capacity values,
and it will benefit from seeing them up-to-date when applicable.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 include/linux/energy_model.h | 140 ++++++++++++++++++
 kernel/power/Kconfig         |  15 ++
 kernel/power/Makefile        |   2 +
 kernel/power/energy_model.c  | 273 +++++++++++++++++++++++++++++++++++
 4 files changed, 430 insertions(+)
 create mode 100644 include/linux/energy_model.h
 create mode 100644 kernel/power/energy_model.c

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
new file mode 100644
index 000000000000..88c2f0b9bcb3
--- /dev/null
+++ b/include/linux/energy_model.h
@@ -0,0 +1,140 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_ENERGY_MODEL_H
+#define _LINUX_ENERGY_MODEL_H
+#include <linux/cpumask.h>
+#include <linux/jump_label.h>
+#include <linux/kobject.h>
+#include <linux/rcupdate.h>
+#include <linux/sched/cpufreq.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_ENERGY_MODEL
+struct em_cap_state {
+	unsigned long capacity;
+	unsigned long frequency; /* Kilo-hertz */
+	unsigned long power; /* Milli-watts */
+};
+
+struct em_cs_table {
+	struct em_cap_state *state; /* Capacity states, in ascending order. */
+	int nr_cap_states;
+	struct rcu_head rcu;
+};
+
+struct em_freq_domain {
+	struct em_cs_table *cs_table; /* Capacity state table, RCU-protected */
+	unsigned long cpus[0]; /* CPUs of the frequency domain. */
+};
+
+#define EM_CPU_MAX_POWER 0xFFFF
+
+struct em_data_callback {
+	/**
+	 * active_power() - Provide power at the next capacity state of a CPU
+	 * @power	: Active power at the capacity state in mW (modified)
+	 * @freq	: Frequency at the capacity state in kHz (modified)
+	 * @cpu		: CPU for which we do this operation
+	 *
+	 * active_power() must find the lowest capacity state of 'cpu' above
+	 * 'freq' and update 'power' and 'freq' to the matching active power
+	 * and frequency.
+	 *
+	 * The power is the one of a single CPU in the domain, expressed in
+	 * milli-watts. It is expected to fit in the [0, EM_CPU_MAX_POWER]
+	 * range.
+	 *
+	 * Return 0 on success.
+	 */
+	int (*active_power) (unsigned long *power, unsigned long *freq, int cpu);
+};
+#define EM_DATA_CB(_active_power_cb) { .active_power = &_active_power_cb }
+
+void em_rescale_cpu_capacity(void);
+struct em_freq_domain *em_cpu_get(int cpu);
+int em_register_freq_domain(cpumask_t *span, unsigned int nr_states,
+						struct em_data_callback *cb);
+
+/**
+ * em_fd_energy() - Estimates the energy consumed by the CPUs of a freq. domain
+ * @fd		: frequency domain for which energy has to be estimated
+ * @max_util	: highest utilization among CPUs of the domain
+ * @sum_util	: sum of the utilization of all CPUs in the domain
+ *
+ * em_fd_energy() dereferences the capacity state table of the frequency
+ * domain, so it must be called under RCU read lock.
+ *
+ * Return: the sum of the energy consumed by the CPUs of the domain assuming
+ * a capacity state satisfying the max utilization of the domain.
+ */
+static inline unsigned long em_fd_energy(struct em_freq_domain *fd,
+				unsigned long max_util, unsigned long sum_util)
+{
+	struct em_cs_table *cs_table;
+	struct em_cap_state *cs;
+	unsigned long freq;
+	int i;
+
+	cs_table = rcu_dereference(fd->cs_table);
+	if (!cs_table)
+		return 0;
+
+	/* Map the utilization value to a frequency */
+	cs = &cs_table->state[cs_table->nr_cap_states - 1];
+	freq = map_util_freq(max_util, cs->frequency, cs->capacity);
+
+	/* Find the lowest capacity state above this frequency */
+	for (i = 0; i < cs_table->nr_cap_states; i++) {
+		cs = &cs_table->state[i];
+		if (cs->frequency >= freq)
+			break;
+	}
+
+	return cs->power * sum_util / cs->capacity;
+}
+
+/**
+ * em_fd_nr_cap_states() - Get the number of capacity states of a freq. domain
+ * @fd		: frequency domain for which want to do this
+ *
+ * Return: the number of capacity state in the frequency domain table
+ */
+static inline int em_fd_nr_cap_states(struct em_freq_domain *fd)
+{
+	struct em_cs_table *table;
+	int nr_states;
+
+	rcu_read_lock();
+	table = rcu_dereference(fd->cs_table);
+	nr_states = table->nr_cap_states;
+	rcu_read_unlock();
+
+	return nr_states;
+}
+
+#else
+struct em_freq_domain {};
+struct em_data_callback {};
+#define EM_DATA_CB(_active_power_cb) { }
+
+static inline int em_register_freq_domain(cpumask_t *span,
+			unsigned int nr_states, struct em_data_callback *cb)
+{
+	return -EINVAL;
+}
+static inline struct em_freq_domain *em_cpu_get(int cpu)
+{
+	return NULL;
+}
+static inline unsigned long em_fd_energy(struct em_freq_domain *fd,
+			unsigned long max_util, unsigned long sum_util)
+{
+	return 0;
+}
+static inline int em_fd_nr_cap_states(struct em_freq_domain *fd)
+{
+	return 0;
+}
+static inline void em_rescale_cpu_capacity(void) { }
+#endif
+
+#endif
diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig
index e880ca22c5a5..6f6db452dc7d 100644
--- a/kernel/power/Kconfig
+++ b/kernel/power/Kconfig
@@ -297,3 +297,18 @@ config PM_GENERIC_DOMAINS_OF
 
 config CPU_PM
 	bool
+
+config ENERGY_MODEL
+	bool "Energy Model for CPUs"
+	depends on SMP
+	depends on CPU_FREQ
+	default n
+	help
+	  Several subsystems (thermal and/or the task scheduler for example)
+	  can leverage information about the energy consumed by CPUs to make
+	  smarter decisions. This config option enables the framework from
+	  which subsystems can access the energy models.
+
+	  The exact usage of the energy model is subsystem-dependent.
+
+	  If in doubt, say N.
diff --git a/kernel/power/Makefile b/kernel/power/Makefile
index a3f79f0eef36..e7e47d9be1e5 100644
--- a/kernel/power/Makefile
+++ b/kernel/power/Makefile
@@ -15,3 +15,5 @@ obj-$(CONFIG_PM_AUTOSLEEP)	+= autosleep.o
 obj-$(CONFIG_PM_WAKELOCKS)	+= wakelock.o
 
 obj-$(CONFIG_MAGIC_SYSRQ)	+= poweroff.o
+
+obj-$(CONFIG_ENERGY_MODEL)	+= energy_model.o
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
new file mode 100644
index 000000000000..08ce4035a6d6
--- /dev/null
+++ b/kernel/power/energy_model.c
@@ -0,0 +1,273 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Energy Model of CPUs
+ *
+ * Copyright (c) 2018, Arm ltd.
+ * Written by: Quentin Perret, Arm ltd.
+ */
+
+#define pr_fmt(fmt) "energy_model: " fmt
+
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/energy_model.h>
+#include <linux/sched/topology.h>
+#include <linux/slab.h>
+
+/* Mapping of each CPU to the frequency domain to which it belongs. */
+static DEFINE_PER_CPU(struct em_freq_domain *, em_data);
+
+/*
+ * Mutex serializing the registrations of frequency domains and letting
+ * callbacks defined by drivers sleep.
+ */
+static DEFINE_MUTEX(em_fd_mutex);
+
+static struct em_cs_table *alloc_cs_table(int nr_cs)
+{
+	struct em_cs_table *cs_table;
+
+	cs_table = kzalloc(sizeof(*cs_table), GFP_KERNEL);
+	if (!cs_table)
+		return NULL;
+
+	cs_table->state = kcalloc(nr_cs, sizeof(*cs_table->state), GFP_KERNEL);
+	if (!cs_table->state) {
+		kfree(cs_table);
+		return NULL;
+	}
+
+	cs_table->nr_cap_states = nr_cs;
+
+	return cs_table;
+}
+
+static void free_cs_table(struct em_cs_table *table)
+{
+	if (table) {
+		kfree(table->state);
+		kfree(table);
+	}
+}
+
+/* fd_update_cs_table() - Computes the capacity values of a cs_table
+ *
+ * This assumes a linear relation between capacity and frequency. As such,
+ * the capacity of a CPU at the n^th capacity state is computed as:
+ *           capactity(n) = max_capacity * freq(n) / freq_max
+ */
+static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
+{
+	unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
+	int max_cap_state = cs_table->nr_cap_states - 1;
+	unsigned long fmax = cs_table->state[max_cap_state].frequency;
+	int i;
+
+	for (i = 0; i < cs_table->nr_cap_states; i++) {
+		u64 cap = (u64)cmax * cs_table->state[i].frequency;
+		do_div(cap, fmax);
+		cs_table->state[i].capacity = (unsigned long)cap;
+	}
+}
+
+static struct em_freq_domain *em_create_fd(cpumask_t *span, int nr_states,
+						struct em_data_callback *cb)
+{
+	unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
+	unsigned long power, freq, prev_freq = 0;
+	int i, ret, cpu = cpumask_first(span);
+	struct em_cs_table *cs_table;
+	struct em_freq_domain *fd;
+
+	if (!cb->active_power)
+		return NULL;
+
+	fd = kzalloc(sizeof(*fd) + cpumask_size(), GFP_KERNEL);
+	if (!fd)
+		return NULL;
+
+	cs_table = alloc_cs_table(nr_states);
+	if (!cs_table)
+		goto free_fd;
+
+	/* Build the list of capacity states for this freq domain */
+	for (i = 0, freq = 0; i < nr_states; i++, freq++) {
+		/*
+		 * active_power() is a driver callback which ceils 'freq' to
+		 * lowest capacity state of 'cpu' above 'freq' and update
+		 * 'power' and 'freq' accordingly.
+		 */
+		ret = cb->active_power(&power, &freq, cpu);
+		if (ret) {
+			pr_err("fd%d: invalid cap. state: %d\n", cpu, ret);
+			goto free_cs_table;
+		}
+
+		/*
+		 * We expect the driver callback to increase the frequency for
+		 * higher capacity states.
+		 */
+		if (freq <= prev_freq) {
+			pr_err("fd%d: non-increasing freq: %lu\n", cpu, freq);
+			goto free_cs_table;
+		}
+
+		/*
+		 * The power returned by active_state() is expected to be in
+		 * milli-watts, and to fit in 16 bits.
+		 */
+		if (power > EM_CPU_MAX_POWER) {
+			pr_err("fd%d: power out of scale: %lu\n", cpu, power);
+			goto free_cs_table;
+		}
+
+		cs_table->state[i].power = power;
+		cs_table->state[i].frequency = prev_freq = freq;
+
+		/*
+		 * The hertz/watts efficiency ratio should decrease as the
+		 * frequency grows on sane platforms. But this isn't always
+		 * true in practice so warn the user if some of the high
+		 * OPPs are more power efficient than some of the lower ones.
+		 */
+		opp_eff = freq / power;
+		if (opp_eff >= prev_opp_eff)
+			pr_warn("fd%d: hertz/watts ratio non-monotonically "
+				"decreasing: OPP%d >= OPP%d\n", cpu, i, i - 1);
+		prev_opp_eff = opp_eff;
+	}
+	fd_update_cs_table(cs_table, cpu);
+	rcu_assign_pointer(fd->cs_table, cs_table);
+
+	/* Copy the span of the frequency domain */
+	cpumask_copy(to_cpumask(fd->cpus), span);
+
+	return fd;
+
+free_cs_table:
+	free_cs_table(cs_table);
+free_fd:
+	kfree(fd);
+
+	return NULL;
+}
+
+static void rcu_free_cs_table(struct rcu_head *rp)
+{
+	struct em_cs_table *table;
+
+	table = container_of(rp, struct em_cs_table, rcu);
+	free_cs_table(table);
+}
+
+/**
+ * em_rescale_cpu_capacity() - Re-scale capacity values of the Energy Model
+ *
+ * This re-scales the capacity values for all capacity states of all frequency
+ * domains of the Energy Model. This should be used when the capacity values
+ * of the CPUs are updated at run-time, after the EM was registered.
+ */
+void em_rescale_cpu_capacity(void)
+{
+	struct em_cs_table *old_table, *new_table;
+	struct em_freq_domain *fd;
+	int nr_states, cpu;
+
+	mutex_lock(&em_fd_mutex);
+	rcu_read_lock();
+	for_each_possible_cpu(cpu) {
+		/* Re-scale only once per frequency domain. */
+		fd = READ_ONCE(per_cpu(em_data, cpu));
+		if (!fd || cpu != cpumask_first(to_cpumask(fd->cpus)))
+			continue;
+
+		/* Copy the existing table. */
+		old_table = rcu_dereference(fd->cs_table);
+		nr_states = old_table->nr_cap_states;
+		new_table = alloc_cs_table(nr_states);
+		if (!new_table)
+			goto out;
+		memcpy(new_table->state, old_table->state,
+					nr_states * sizeof(*new_table->state));
+
+		/* Re-scale the capacity values of the copy. */
+		fd_update_cs_table(new_table,
+					cpumask_first(to_cpumask(fd->cpus)));
+
+		/* Replace the fd table with the re-scaled version. */
+		rcu_assign_pointer(fd->cs_table, new_table);
+		call_rcu(&old_table->rcu, rcu_free_cs_table);
+	}
+out:
+	rcu_read_unlock();
+	mutex_unlock(&em_fd_mutex);
+}
+EXPORT_SYMBOL_GPL(em_rescale_cpu_capacity);
+
+/**
+ * em_cpu_get() - Return the frequency domain for a CPU
+ * @cpu : CPU to find the frequency domain for
+ *
+ * Return: the frequency domain to which 'cpu' belongs, or NULL if it doesn't
+ * exist.
+ */
+struct em_freq_domain *em_cpu_get(int cpu)
+{
+	return READ_ONCE(per_cpu(em_data, cpu));
+}
+EXPORT_SYMBOL_GPL(em_cpu_get);
+
+/**
+ * em_register_freq_domain() - Register the Energy Model of a frequency domain
+ * @span	: Mask of CPUs in the frequency domain
+ * @nr_states	: Number of capacity states to register
+ * @cb		: Callback functions providing the data of the Energy Model
+ *
+ * Create Energy Model tables for a frequency domain using the callbacks
+ * defined in cb.
+ *
+ * If multiple clients register the same frequency domain, all but the first
+ * registration will be ignored.
+ *
+ * Return 0 on success
+ */
+int em_register_freq_domain(cpumask_t *span, unsigned int nr_states,
+						struct em_data_callback *cb)
+{
+	struct em_freq_domain *fd;
+	int cpu, ret = 0;
+
+	if (!span || !nr_states || !cb)
+		return -EINVAL;
+
+	/*
+	 * Registration of frequency domains needs to be serialized. Since
+	 * em_create_fd() calls into the driver-defined callback functions
+	 * which might sleep, we use a mutex.
+	 */
+	mutex_lock(&em_fd_mutex);
+
+	/* Make sure we don't register again an existing domain. */
+	for_each_cpu(cpu, span) {
+		if (READ_ONCE(per_cpu(em_data, cpu))) {
+			ret = -EEXIST;
+			goto unlock;
+		}
+	}
+
+	/* Create the frequency domain and add it to the Energy Model. */
+	fd = em_create_fd(span, nr_states, cb);
+	if (!fd) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	for_each_cpu(cpu, span)
+		smp_store_release(per_cpu_ptr(&em_data, cpu), fd);
+
+	pr_debug("Created freq domain %*pbl\n", cpumask_pr_args(span));
+unlock:
+	mutex_unlock(&em_fd_mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(em_register_freq_domain);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 04/12] PM / EM: Expose the Energy Model in sysfs
  2018-06-28 11:40 [RFC PATCH v4 00/12] Energy Aware Scheduling Quentin Perret
                   ` (2 preceding siblings ...)
  2018-06-28 11:40 ` [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework Quentin Perret
@ 2018-06-28 11:40 ` Quentin Perret
  2018-06-28 11:40 ` [RFC PATCH v4 05/12] sched/topology: Reference the Energy Model of CPUs when available Quentin Perret
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-06-28 11:40 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

Expose the Energy Model (read-only) of all frequency domains in sysfs
for convenience. To do so, add a kobject to the CPU subsystem under the
umbrella of which a kobject for each frequency domain is attached.

The resulting hierarchy is as follows for a platform with two frequency
domains for example:

   /sys/devices/system/cpu/energy_model
   ├── fd0
   │   ├── capacity
   │   ├── cpus
   │   ├── frequency
   │   └── power
   └── fd4
       ├── capacity
       ├── cpus
       ├── frequency
       └── power

In this implementation, the kobject abstraction is only used as a
convenient way of exposing data to sysfs. However, it could also be
used in the future to allocate and release frequency domains in a more
dynamic way using reference counting.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 include/linux/energy_model.h |  1 +
 kernel/power/energy_model.c  | 94 ++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 88c2f0b9bcb3..0f109cefed62 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -23,6 +23,7 @@ struct em_cs_table {
 
 struct em_freq_domain {
 	struct em_cs_table *cs_table; /* Capacity state table, RCU-protected */
+	struct kobject kobj;
 	unsigned long cpus[0]; /* CPUs of the frequency domain. */
 };
 
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 08ce4035a6d6..53456147e656 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -23,6 +23,86 @@ static DEFINE_PER_CPU(struct em_freq_domain *, em_data);
  */
 static DEFINE_MUTEX(em_fd_mutex);
 
+static struct kobject *em_kobject;
+
+/* Getters for the attributes of em_freq_domain objects */
+struct em_fd_attr {
+	struct attribute attr;
+	ssize_t (*show)(struct em_freq_domain *fd, char *buf);
+	ssize_t (*store)(struct em_freq_domain *fd, const char *buf, size_t s);
+};
+
+#define EM_ATTR_LEN 13
+#define show_table_attr(_attr) \
+static ssize_t show_##_attr(struct em_freq_domain *fd, char *buf) \
+{ \
+	ssize_t cnt = 0; \
+	int i; \
+	struct em_cs_table *table; \
+	rcu_read_lock(); \
+	table = rcu_dereference(fd->cs_table);\
+	for (i = 0; i < table->nr_cap_states; i++) { \
+		if (cnt >= (ssize_t) (PAGE_SIZE / sizeof(char) \
+				      - (EM_ATTR_LEN + 2))) \
+			goto out; \
+		cnt += scnprintf(&buf[cnt], EM_ATTR_LEN + 1, "%lu ", \
+				 table->state[i]._attr); \
+	} \
+out: \
+	rcu_read_unlock(); \
+	cnt += sprintf(&buf[cnt], "\n"); \
+	return cnt; \
+}
+
+show_table_attr(power);
+show_table_attr(frequency);
+show_table_attr(capacity);
+
+static ssize_t show_cpus(struct em_freq_domain *fd, char *buf)
+{
+	return sprintf(buf, "%*pbl\n", cpumask_pr_args(to_cpumask(fd->cpus)));
+}
+
+#define fd_attr(_name) em_fd_##_name##_attr
+#define define_fd_attr(_name) static struct em_fd_attr fd_attr(_name) = \
+		__ATTR(_name, 0444, show_##_name, NULL)
+
+define_fd_attr(power);
+define_fd_attr(frequency);
+define_fd_attr(capacity);
+define_fd_attr(cpus);
+
+static struct attribute *em_fd_default_attrs[] = {
+	&fd_attr(power).attr,
+	&fd_attr(frequency).attr,
+	&fd_attr(capacity).attr,
+	&fd_attr(cpus).attr,
+	NULL
+};
+
+#define to_fd(k) container_of(k, struct em_freq_domain, kobj)
+#define to_fd_attr(a) container_of(a, struct em_fd_attr, attr)
+
+static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
+{
+	struct em_freq_domain *fd = to_fd(kobj);
+	struct em_fd_attr *fd_attr = to_fd_attr(attr);
+	ssize_t ret;
+
+	ret = fd_attr->show(fd, buf);
+
+	return ret;
+}
+
+static const struct sysfs_ops em_fd_sysfs_ops = {
+	.show	= show,
+};
+
+static struct kobj_type ktype_em_fd = {
+	.sysfs_ops	= &em_fd_sysfs_ops,
+	.default_attrs	= em_fd_default_attrs,
+};
+
 static struct em_cs_table *alloc_cs_table(int nr_cs)
 {
 	struct em_cs_table *cs_table;
@@ -142,6 +222,11 @@ static struct em_freq_domain *em_create_fd(cpumask_t *span, int nr_states,
 	/* Copy the span of the frequency domain */
 	cpumask_copy(to_cpumask(fd->cpus), span);
 
+	ret = kobject_init_and_add(&fd->kobj, &ktype_em_fd, em_kobject,
+				   "fd%u", cpu);
+	if (ret)
+		pr_err("fd%d: failed kobject_init_and_add(): %d\n", cpu, ret);
+
 	return fd;
 
 free_cs_table:
@@ -247,6 +332,15 @@ int em_register_freq_domain(cpumask_t *span, unsigned int nr_states,
 	 */
 	mutex_lock(&em_fd_mutex);
 
+	if (!em_kobject) {
+		em_kobject = kobject_create_and_add("energy_model",
+						&cpu_subsys.dev_root->kobj);
+		if (!em_kobject) {
+			ret = -ENODEV;
+			goto unlock;
+		}
+	}
+
 	/* Make sure we don't register again an existing domain. */
 	for_each_cpu(cpu, span) {
 		if (READ_ONCE(per_cpu(em_data, cpu))) {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 05/12] sched/topology: Reference the Energy Model of CPUs when available
  2018-06-28 11:40 [RFC PATCH v4 00/12] Energy Aware Scheduling Quentin Perret
                   ` (3 preceding siblings ...)
  2018-06-28 11:40 ` [RFC PATCH v4 04/12] PM / EM: Expose the Energy Model in sysfs Quentin Perret
@ 2018-06-28 11:40 ` Quentin Perret
  2018-07-05 17:29   ` Peter Zijlstra
  2018-07-05 17:33   ` Peter Zijlstra
  2018-06-28 11:40 ` [RFC PATCH v4 06/12] sched/topology: Lowest energy aware balancing sched_domain level pointer Quentin Perret
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 54+ messages in thread
From: Quentin Perret @ 2018-06-28 11:40 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

The existing scheduling domain hierarchy is defined to map to the cache
topology of the system. However, Energy Aware Scheduling (EAS) requires
more knowledge about the platform, and specifically needs to know about
the span of Frequency Domains (FD), which do not always align with
caches.

To address this issue, use the Energy Model (EM) of the system to extend
the scheduler topology code with a representation of the FDs, alongside
the scheduling domains. More specifically, a linked list of FDs is
attached to each root domain. When multiple root domains are in use,
each list contains only the FDs covering the CPUs of its root domain. If
a FD spans over CPUs of two different root domains, it will be
duplicated in both lists.

The lists are fully maintained by the scheduler from
partition_sched_domains() in order to cope with hotplug and cpuset
changes. As for scheduling domains, the list are protected by RCU to
ensure safe concurrent updates.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/sched.h    |  23 +++++++
 kernel/sched/topology.c | 129 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 152 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 34549bca487d..7038df9fb713 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -44,6 +44,7 @@
 #include <linux/ctype.h>
 #include <linux/debugfs.h>
 #include <linux/delayacct.h>
+#include <linux/energy_model.h>
 #include <linux/init_task.h>
 #include <linux/kprobes.h>
 #include <linux/kthread.h>
@@ -673,6 +674,12 @@ static inline bool sched_asym_prefer(int a, int b)
 	return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
 }
 
+struct freq_domain {
+	struct em_freq_domain *obj;
+	struct freq_domain *next;
+	struct rcu_head rcu;
+};
+
 /*
  * We add the notion of a root-domain which will be used to define per-domain
  * variables. Each exclusive cpuset essentially defines an island domain by
@@ -721,6 +728,14 @@ struct root_domain {
 	struct cpupri		cpupri;
 
 	unsigned long		max_cpu_capacity;
+
+#ifdef CONFIG_ENERGY_MODEL
+	/*
+	 * NULL-terminated list of frequency domains intersecting with the
+	 * CPUs of the rd. Protected by RCU.
+	 */
+	struct freq_domain *fd;
+#endif
 };
 
 extern struct root_domain def_root_domain;
@@ -2169,3 +2184,11 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
 	return util;
 }
 #endif
+
+#ifdef CONFIG_SMP
+#ifdef CONFIG_ENERGY_MODEL
+#define freq_domain_span(fd) (to_cpumask(((fd)->obj->cpus)))
+#else
+#define freq_domain_span(fd) NULL
+#endif
+#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 61a1125c1ae4..357eff55c1a7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -201,6 +201,19 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 	return 1;
 }
 
+#ifdef CONFIG_ENERGY_MODEL
+static void free_fd(struct freq_domain *fd)
+{
+	struct freq_domain *tmp;
+
+	while (fd) {
+		tmp = fd->next;
+		kfree(fd);
+		fd = tmp;
+	}
+}
+#endif
+
 static void free_rootdomain(struct rcu_head *rcu)
 {
 	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
@@ -211,6 +224,9 @@ static void free_rootdomain(struct rcu_head *rcu)
 	free_cpumask_var(rd->rto_mask);
 	free_cpumask_var(rd->online);
 	free_cpumask_var(rd->span);
+#ifdef CONFIG_ENERGY_MODEL
+	free_fd(rd->fd);
+#endif
 	kfree(rd);
 }
 
@@ -1635,6 +1651,104 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 	return sd;
 }
 
+#ifdef CONFIG_ENERGY_MODEL
+static struct freq_domain *find_fd(struct freq_domain *fd, int cpu)
+{
+	while (fd) {
+		if (cpumask_test_cpu(cpu, freq_domain_span(fd)))
+			return fd;
+		fd = fd->next;
+	}
+
+	return NULL;
+}
+
+static struct freq_domain *fd_init(int cpu)
+{
+	struct em_freq_domain *obj = em_cpu_get(cpu);
+	struct freq_domain *fd;
+
+	if (!obj) {
+		if (sched_debug())
+			pr_info("%s: no EM found for CPU%d\n", __func__, cpu);
+		return NULL;
+	}
+
+	fd = kzalloc(sizeof(*fd), GFP_KERNEL);
+	if (!fd)
+		return NULL;
+	fd->obj = obj;
+
+	return fd;
+}
+
+static void sched_energy_fd_debug(const struct cpumask * cpu_map,
+						struct freq_domain *fd)
+{
+	if (!sched_debug() || !fd)
+		return;
+
+	printk(KERN_DEBUG "root_domain %*pbl: fd:", cpumask_pr_args(cpu_map));
+
+	while (fd) {
+		printk(KERN_CONT " { fd%d cpus=%*pbl nr_cstate=%d }",
+				cpumask_first(freq_domain_span(fd)),
+				cpumask_pr_args(freq_domain_span(fd)),
+				em_fd_nr_cap_states(fd->obj));
+		fd = fd->next;
+	}
+
+	printk(KERN_CONT "\n");
+}
+
+static void destroy_freq_domain_rcu(struct rcu_head *rp)
+{
+	struct freq_domain *fd;
+
+	fd = container_of(rp, struct freq_domain, rcu);
+	free_fd(fd);
+}
+
+static void build_freq_domains(const struct cpumask *cpu_map)
+{
+	struct freq_domain *fd = NULL, *tmp;
+	int cpu = cpumask_first(cpu_map);
+	struct root_domain *rd = cpu_rq(cpu)->rd;
+	int i;
+
+	for_each_cpu(i, cpu_map) {
+		/* Skip already covered CPUs. */
+		if (find_fd(fd, i))
+			continue;
+
+		/* Create the new fd and add it to the local list. */
+		tmp = fd_init(i);
+		if (!tmp)
+			goto free;
+		tmp->next = fd;
+		fd = tmp;
+	}
+
+	sched_energy_fd_debug(cpu_map, fd);
+
+	/* Attach the new list of frequency domains to the root domain. */
+	tmp = rd->fd;
+	rcu_assign_pointer(rd->fd, fd);
+	if (tmp)
+		call_rcu(&tmp->rcu, destroy_freq_domain_rcu);
+
+	return;
+
+free:
+	free_fd(fd);
+	tmp = rd->fd;
+	rcu_assign_pointer(rd->fd, NULL);
+	if (tmp)
+		call_rcu(&tmp->rcu, destroy_freq_domain_rcu);
+}
+#endif
+
+
 /*
  * Build sched domains for a given set of CPUs and attach the sched domains
  * to the individual CPUs
@@ -1913,6 +2027,21 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 		;
 	}
 
+#ifdef CONFIG_ENERGY_MODEL
+	/* Build freq domains: */
+	for (i = 0; i < ndoms_new; i++) {
+		for (j = 0; j < n; j++) {
+			if (cpumask_equal(doms_new[i], doms_cur[j])
+			    && cpu_rq(cpumask_first(doms_cur[j]))->rd->fd)
+			       goto match3;
+		}
+		/* No match - add freq domains for a new rd */
+		build_freq_domains(doms_new[i]);
+match3:
+		;
+	}
+#endif
+
 	/* Remember the new sched domains: */
 	if (doms_cur != &fallback_doms)
 		free_sched_domains(doms_cur, ndoms_cur);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 06/12] sched/topology: Lowest energy aware balancing sched_domain level pointer
  2018-06-28 11:40 [RFC PATCH v4 00/12] Energy Aware Scheduling Quentin Perret
                   ` (4 preceding siblings ...)
  2018-06-28 11:40 ` [RFC PATCH v4 05/12] sched/topology: Reference the Energy Model of CPUs when available Quentin Perret
@ 2018-06-28 11:40 ` Quentin Perret
  2018-06-28 11:40 ` [RFC PATCH v4 07/12] sched/topology: Introduce sched_energy_present static key Quentin Perret
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-06-28 11:40 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_ea, points to the lowest level at which energy
aware scheduling should be used.

Generally speaking, the largest opportunity to save energy via scheduling
comes from a smarter exploitation of heterogeneous platforms (i.e.
big.LITTLE). Consequently, the sd_ea shortcut is wired to the lowest
scheduling domain at which the SD_ASYM_CPUCAPACITY flag is set. For
example, it is possible to apply energy-aware scheduling within a socket
on a multi-socket system, as long as each socket has an asymmetric
topology. Cross-sockets wake-up balancing will only happen when the
system is over-utilized, or this_cpu and prev_cpu are in different
sockets.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/sched.h    | 1 +
 kernel/sched/topology.c | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7038df9fb713..1c32714aed0d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1168,6 +1168,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(struct sched_domain *, sd_ea);
 
 struct sched_group_capacity {
 	atomic_t		ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 357eff55c1a7..368669fbf0f0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -414,6 +414,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_PER_CPU(struct sched_domain *, sd_ea);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -439,6 +440,9 @@ static void update_top_cache_domain(int cpu)
 
 	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
 	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
+
+	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
+	rcu_assign_pointer(per_cpu(sd_ea, cpu), sd);
 }
 
 /*
-- 
2.17.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 07/12] sched/topology: Introduce sched_energy_present static key
  2018-06-28 11:40 [RFC PATCH v4 00/12] Energy Aware Scheduling Quentin Perret
                   ` (5 preceding siblings ...)
  2018-06-28 11:40 ` [RFC PATCH v4 06/12] sched/topology: Lowest energy aware balancing sched_domain level pointer Quentin Perret
@ 2018-06-28 11:40 ` Quentin Perret
  2018-06-28 11:40 ` [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator Quentin Perret
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-06-28 11:40 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

In order to ensure a minimal performance impact on non-energy-aware
systems, introduce a static_key guarding the access to Energy-Aware
Scheduling (EAS) code.

The static key is set iff all the following conditions are met for at
least one root domain:
  1. all online CPUs of the root domain are covered by the Energy
     Model (EM);
  2. the complexity of the root domain's EM is low enough to keep
     scheduling overheads low;
  3. the root domain has an asymmetric CPU capacity topology (detected
     by looking for the SD_ASYM_CPUCAPACITY flag in the sched_domain
     hierarchy).

The static key is checked in the rd_freq_domain() function which returns
the frequency domains of a root domain when they are available. As EAS
cannot be enabled with CONFIG_ENERGY_MODEL=n, rd_freq_domain() is
stubbed to 'NULL' to let the compiler remove the unused EAS code by
constant propagation.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com
---
 kernel/sched/sched.h    | 17 ++++++++++
 kernel/sched/topology.c | 72 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1c32714aed0d..5d4e9bbe9e53 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2188,8 +2188,25 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
 
 #ifdef CONFIG_SMP
 #ifdef CONFIG_ENERGY_MODEL
+extern struct static_key_false sched_energy_present;
+/**
+ * rd_freq_domain - Get the frequency domains of a root domain.
+ *
+ * Must be called from a RCU read-side critical section.
+ */
+static inline struct freq_domain *rd_freq_domain(struct root_domain *rd)
+{
+	if (!static_branch_unlikely(&sched_energy_present))
+		return NULL;
+
+	return rcu_dereference(rd->fd);
+}
 #define freq_domain_span(fd) (to_cpumask(((fd)->obj->cpus)))
 #else
+static inline struct freq_domain *rd_freq_domain(struct root_domain *rd)
+{
+	return NULL;
+}
 #define freq_domain_span(fd) NULL
 #endif
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 368669fbf0f0..cb5f0cfe2a7d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1713,12 +1713,31 @@ static void destroy_freq_domain_rcu(struct rcu_head *rp)
 	free_fd(fd);
 }
 
+/*
+ * The complexity of the Energy Model is defined as: nr_fd * (nr_cpus + nr_cs)
+ * with: 'nr_fd' the number of frequency domains; 'nr_cpus' the number of CPUs;
+ * and 'nr_cs' the sum of the capacity states numbers of all frequency domains.
+ *
+ * It is generally not a good idea to use such a model in the wake-up path on
+ * very complex platforms because of the associated scheduling overheads. The
+ * arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs
+ * with per-CPU DVFS and less than 8 capacity states each, for example.
+ */
+#define EM_MAX_COMPLEXITY 2048
+
 static void build_freq_domains(const struct cpumask *cpu_map)
 {
+	int i, nr_fd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
 	struct freq_domain *fd = NULL, *tmp;
 	int cpu = cpumask_first(cpu_map);
 	struct root_domain *rd = cpu_rq(cpu)->rd;
-	int i;
+
+	/* EAS is enabled for asymmetric CPU capacity topologies. */
+	if (!per_cpu(sd_ea, cpu)) {
+		if (sched_debug())
+			pr_info("rd %*pbl: !sd_ea\n", cpumask_pr_args(cpu_map));
+		goto free;
+	}
 
 	for_each_cpu(i, cpu_map) {
 		/* Skip already covered CPUs. */
@@ -1731,6 +1750,18 @@ static void build_freq_domains(const struct cpumask *cpu_map)
 			goto free;
 		tmp->next = fd;
 		fd = tmp;
+
+		/* Count freq. doms and perf states for the complexity check. */
+		nr_fd++;
+		nr_cs += em_fd_nr_cap_states(fd->obj);
+	}
+
+	/* Bail out if the Energy Model complexity is too high. */
+	if (nr_fd * (nr_cs + nr_cpus) > EM_MAX_COMPLEXITY) {
+		if (sched_debug())
+			pr_info("rd %*pbl: EM complexity is too high\n ",
+						cpumask_pr_args(cpu_map));
+		goto free;
 	}
 
 	sched_energy_fd_debug(cpu_map, fd);
@@ -1750,6 +1781,44 @@ static void build_freq_domains(const struct cpumask *cpu_map)
 	if (tmp)
 		call_rcu(&tmp->rcu, destroy_freq_domain_rcu);
 }
+
+/*
+ * This static_key is set if at least one root domain meets all the following
+ * conditions:
+ *    1. all CPUs of the root domain are covered by the EM;
+ *    2. the EM complexity is low enough to keep scheduling overheads low;
+ *    3. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
+ */
+DEFINE_STATIC_KEY_FALSE(sched_energy_present);
+
+static void sched_energy_start(int ndoms_new, cpumask_var_t doms_new[])
+{
+	/*
+	 * The conditions for EAS to start are checked during the creation of
+	 * root domains. If one of them meets all conditions, it will have a
+	 * non-null list of frequency domains.
+	 */
+	while (ndoms_new) {
+		if (cpu_rq(cpumask_first(doms_new[ndoms_new - 1]))->rd->fd)
+			goto enable;
+		ndoms_new--;
+	}
+
+	if (static_branch_unlikely(&sched_energy_present)) {
+		if (sched_debug())
+			pr_info("%s: stopping EAS\n", __func__);
+		static_branch_disable_cpuslocked(&sched_energy_present);
+	}
+
+	return;
+
+enable:
+	if (!static_branch_unlikely(&sched_energy_present)) {
+		if (sched_debug())
+			pr_info("%s: starting EAS\n", __func__);
+		static_branch_enable_cpuslocked(&sched_energy_present);
+	}
+}
 #endif
 
 
@@ -2044,6 +2113,7 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 match3:
 		;
 	}
+	sched_energy_start(ndoms_new, doms_new);
 #endif
 
 	/* Remember the new sched domains: */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator
  2018-06-28 11:40 [RFC PATCH v4 00/12] Energy Aware Scheduling Quentin Perret
                   ` (6 preceding siblings ...)
  2018-06-28 11:40 ` [RFC PATCH v4 07/12] sched/topology: Introduce sched_energy_present static key Quentin Perret
@ 2018-06-28 11:40 ` Quentin Perret
  2018-07-06 11:32   ` Peter Zijlstra
                     ` (2 more replies)
  2018-06-28 11:40 ` [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function Quentin Perret
                   ` (3 subsequent siblings)
  11 siblings, 3 replies; 54+ messages in thread
From: Quentin Perret @ 2018-06-28 11:40 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

From: Morten Rasmussen <morten.rasmussen@arm.com>

Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point task placement is done the traditional way based
on load_avg, spreading the tasks across as many cpus as possible based
on priority scaled load to preserve smp_nice. Below the tipping point we
want to use util_avg instead. We need to define a criteria for when we
make the switch.

The util_avg for each cpu converges towards 100% (1024) regardless of
how many task additional task we may put on it. If we define
over-utilized as:

sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)

some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.

For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
getting their own as we 1.5*n_cpus tasks in total and 55%+55% is less
over-utilized than 55%+60% for those cpus that have to be shared. The
system utilization is only 85% of the system capacity, but we are
breaking smp_nice.

To be sure not to break smp_nice, we have defined over-utilization
conservatively as when any cpu in the system is fully utilized at it's
highest frequency instead:

cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity

IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority to preserve smp_nice.

With this definition, we can skip periodic load-balance as no cpu has an
always-running task when the system is not over-utilized. All tasks will
be periodic and we can balance them at wake-up. This conservative
condition does however mean that some scenarios that could benefit from
energy-aware decisions even if one cpu is fully utilized would not get
those benefits.

For system where some cpus might have reduced capacity on some cpus
(RT-pressure and/or big.LITTLE), we want periodic load-balance checks as
soon a just a single cpu is fully utilized as it might one of those with
reduced capacity and in that case we want to migrate it.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/fair.c  | 48 +++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h |  3 +++
 2 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05aab7f..4f74c6d0a79e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5374,6 +5374,24 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+#ifdef CONFIG_SMP
+static inline unsigned long cpu_util(int cpu);
+static unsigned long capacity_of(int cpu);
+
+static inline bool cpu_overutilized(int cpu)
+{
+	return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
+}
+
+static inline void update_overutilized_status(struct rq *rq)
+{
+	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+		WRITE_ONCE(rq->rd->overutilized, 1);
+}
+#else
+static inline void update_overutilized_status(struct rq *rq) { }
+#endif
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5384,6 +5402,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	int task_new = !(flags & ENQUEUE_WAKEUP);
 
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
@@ -5431,8 +5450,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_group(se);
 	}
 
-	if (!se)
+	if (!se) {
 		add_nr_running(rq, 1);
+		if (!task_new)
+			update_overutilized_status(rq);
+
+	}
 
 	hrtick_update(rq);
 }
@@ -8114,11 +8137,12 @@ static bool update_nohz_stats(struct rq *rq, bool force)
  * @local_group: Does group contain this_cpu.
  * @sgs: variable to hold the statistics for this group.
  * @overload: Indicate more than one runnable task for any CPU.
+ * @overutilized: Indicate overutilization for any CPU.
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs,
-			bool *overload)
+			bool *overload, int *overutilized)
 {
 	unsigned long load;
 	int i, nr_running;
@@ -8145,6 +8169,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (nr_running > 1)
 			*overload = true;
 
+		if (cpu_overutilized(i))
+			*overutilized = 1;
+
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
 		sgs->nr_preferred_running += rq->nr_preferred_running;
@@ -8282,6 +8309,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
 	bool overload = false;
+	int overutilized = 0;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
@@ -8308,7 +8336,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		}
 
 		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
-						&overload);
+						&overload, &overutilized);
 
 		if (local_group)
 			goto next_group;
@@ -8360,6 +8388,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* update overload indicator if we are at root domain */
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
+
+		/* Update over-utilization (tipping point, U >= 0) indicator */
+		if (READ_ONCE(env->dst_rq->rd->overutilized) != overutilized)
+			WRITE_ONCE(env->dst_rq->rd->overutilized, overutilized);
+	} else {
+		if (!READ_ONCE(env->dst_rq->rd->overutilized) && overutilized)
+			WRITE_ONCE(env->dst_rq->rd->overutilized, 1);
 	}
 }
 
@@ -8579,6 +8614,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 * this level.
 	 */
 	update_sd_lb_stats(env, &sds);
+
+	if (rd_freq_domain(env->dst_rq->rd) &&
+				!READ_ONCE(env->dst_rq->rd->overutilized))
+		goto out_balanced;
+
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
@@ -9936,6 +9976,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
+
+	update_overutilized_status(task_rq(curr));
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5d4e9bbe9e53..4e6548e4afc5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -698,6 +698,9 @@ struct root_domain {
 	/* Indicate more than one runnable task for any CPU */
 	bool			overload;
 
+	/* Indicate one or more cpus over-utilized (tipping point) */
+	int			overutilized;
+
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
 	 * than one runnable -deadline task (as it is below for RT tasks).
-- 
2.17.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function
  2018-06-28 11:40 [RFC PATCH v4 00/12] Energy Aware Scheduling Quentin Perret
                   ` (7 preceding siblings ...)
  2018-06-28 11:40 ` [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator Quentin Perret
@ 2018-06-28 11:40 ` Quentin Perret
  2018-07-06 13:12   ` Peter Zijlstra
  2018-06-28 11:40 ` [RFC PATCH v4 10/12] sched/fair: Select an energy-efficient CPU on task wake-up Quentin Perret
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2018-06-28 11:40 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

In preparation for the definition of an energy-aware wakeup path,
introduce a helper function to estimate the consequence on system energy
when a specific task wakes-up on a specific CPU. compute_energy()
estimates the capacity state to be reached by all frequency domains and
estimates the consumption of each online CPU according to its Energy
Model and its percentage of busy time.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/fair.c  | 69 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 +-
 2 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f74c6d0a79e..f50c4e83a488 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6621,6 +6621,75 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	return min_cap * 1024 < task_util(p) * capacity_margin;
 }
 
+/*
+ * Predicts what cpu_util(@cpu) would return if @p was migrated (and enqueued)
+ * to @dst_cpu.
+ */
+static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
+{
+	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+	unsigned long util_est, util = READ_ONCE(cfs_rq->avg.util_avg);
+
+	/*
+	 * If @p migrates from @cpu to another, remove its contribution. Or,
+	 * if @p migrates from another CPU to @cpu, add its contribution. In
+	 * the other cases, @cpu is not impacted by the migration, so the
+	 * util_avg should already be what we want.
+	 */
+	if (task_cpu(p) == cpu && dst_cpu != cpu)
+		util = max_t(long, util - task_util(p), 0);
+	else if (task_cpu(p) != cpu && dst_cpu == cpu)
+		util += task_util(p);
+
+	if (sched_feat(UTIL_EST)) {
+		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+
+		/*
+		 * During wake-up, the task isn't enqueued yet and doesn't
+		 * appear in the cfs_rq->avg.util_est.enqueued of any rq,
+		 * so just add it (if needed) to "simulate" what will be
+		 * cpu_util() after the task has been enqueued.
+		 */
+		if (dst_cpu == cpu)
+			util_est += _task_util_est(p);
+
+		util = max(util, util_est);
+	}
+
+	return min_t(unsigned long, util, capacity_orig_of(cpu));
+}
+
+/*
+ * Estimates the total energy that would be consumed by all online CPUs if @p
+ * was migrated to @dst_cpu. compute_energy() predicts what will be the
+ * utilization landscape of all CPUs after the task migration, and uses the
+ * Energy Model to compute what would be the system energy if we decided to
+ * actually migrate that task.
+ */
+static long compute_energy(struct task_struct *p, int dst_cpu,
+							struct freq_domain *fd)
+{
+	long util, max_util, sum_util, energy = 0;
+	int cpu;
+
+	while (fd) {
+		max_util = sum_util = 0;
+		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
+			util = cpu_util_next(cpu, p, dst_cpu);
+			util += cpu_util_dl(cpu_rq(cpu));
+			/* XXX: add RT util_avg when available. */
+
+			max_util = max(util, max_util);
+			sum_util += util;
+		}
+
+		energy += em_fd_energy(fd->obj, max_util, sum_util);
+		fd = fd->next;
+	}
+
+	return energy;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4e6548e4afc5..009e5fc9a375 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2170,7 +2170,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 # define arch_scale_freq_invariant()	false
 #endif
 
-#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+#ifdef CONFIG_SMP
 static inline unsigned long cpu_util_dl(struct rq *rq)
 {
 	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 10/12] sched/fair: Select an energy-efficient CPU on task wake-up
  2018-06-28 11:40 [RFC PATCH v4 00/12] Energy Aware Scheduling Quentin Perret
                   ` (8 preceding siblings ...)
  2018-06-28 11:40 ` [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function Quentin Perret
@ 2018-06-28 11:40 ` Quentin Perret
  2018-06-28 11:40 ` [RFC PATCH v4 11/12] OPTIONAL: arch_topology: Start Energy Aware Scheduling Quentin Perret
  2018-06-28 11:40 ` [RFC PATCH v4 12/12] OPTIONAL: cpufreq: dt: Register an Energy Model Quentin Perret
  11 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-06-28 11:40 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

If an Energy Model (EM) is available and if the system isn't
overutilized, re-route waking tasks into an energy-aware placement
algorithm. The selection of an energy-efficient CPU for a task
is achieved by estimating the impact on system-level active energy
resulting from the placement of the task on the CPU with the highest
spare capacity in each frequency domain. This strategy spreads tasks in
a frequency domain and avoids overly aggressive task packing. The best
CPU energy-wise is then selected if it saves a large enough amount of
energy with respect to prev_cpu.

Although it has already shown significant benefits on some existing
targets, this approach cannot scale to platforms with numerous CPUs.
This is an attempt to do something useful as writing a fast heuristic
that performs reasonably well on a broad spectrum of architectures isn't
an easy task. As such, the scope of usability of the energy-aware
wake-up path is restricted to systems with the SD_ASYM_CPUCAPACITY flag
set, and where the EM isn't too complex.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/fair.c | 121 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 117 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f50c4e83a488..e85fc017cc22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6690,6 +6690,110 @@ static long compute_energy(struct task_struct *p, int dst_cpu,
 	return energy;
 }
 
+/*
+ * find_energy_efficient_cpu(): Find most energy-efficient target CPU for the
+ * waking task. find_energy_efficient_cpu() looks for the CPU with maximum
+ * spare capacity in each frequency domain and uses it as a potential
+ * candidate to execute the task. Then, it uses the Energy Model to figure
+ * out which of the CPU candidates is the most energy-efficient.
+ *
+ * The rationale for this heuristic is as follows. In a frequency domain,
+ * all the most energy efficient CPU candidates (according to the Energy
+ * Model) are those for which we'll request a low frequency. When there are
+ * several CPUs for which the frequency request will be the same, we don't
+ * have enough data to break the tie between them, because the Energy Model
+ * only includes active power costs. With this model, if we assume that
+ * frequency requests follow utilization (e.g. using schedutil), the CPU with
+ * the maximum spare capacity in a frequency domain is guaranteed to be among
+ * the best candidates of the frequency domain.
+ *
+ * In practice, it could be preferable from an energy standpoint to pack
+ * small tasks on a CPU in order to let other CPUs go in deeper idle states,
+ * but that could also hurt our chances to go cluster idle, and we have no
+ * ways to tell with the current Energy Model if this is actually a good
+ * idea or not. So, find_energy_efficient_cpu() basically favors
+ * cluster-packing, and spreading inside a cluster. That should at least be
+ * a good thing for latency, and this is consistent with the idea that most
+ * of the energy savings of EAS come from the asymmetry of the system, and
+ * not so much from breaking the tie between identical CPUs. That's also the
+ * reason why EAS is enabled in the topology code only for systems where
+ * SD_ASYM_CPUCAPACITY is set.
+ */
+static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu,
+							struct freq_domain *fd)
+{
+	unsigned long prev_energy = ULONG_MAX, best_energy = ULONG_MAX;
+	int cpu, best_energy_cpu = prev_cpu;
+	struct freq_domain *head = fd;
+	unsigned long cpu_cap, util;
+	struct sched_domain *sd;
+
+	sync_entity_load_avg(&p->se);
+
+	if (!task_util_est(p))
+		return prev_cpu;
+
+	/*
+	 * Energy-aware wake-up happens on the lowest sched_domain starting
+	 * from sd_ea spanning over this_cpu and prev_cpu.
+	 */
+	sd = rcu_dereference(*this_cpu_ptr(&sd_ea));
+	while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
+		sd = sd->parent;
+	if (!sd)
+		return prev_cpu;
+
+	while (fd) {
+		unsigned long cur_energy, spare_cap, max_spare_cap = 0;
+		int max_spare_cap_cpu = -1;
+
+		for_each_cpu_and(cpu, freq_domain_span(fd), sched_domain_span(sd)) {
+			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
+				continue;
+
+			/* Skip CPUs that will be overutilized. */
+			util = cpu_util_next(cpu, p, cpu);
+			cpu_cap = capacity_of(cpu);
+			if (cpu_cap * 1024 < util * capacity_margin)
+				continue;
+
+			/* Always use prev_cpu as a candidate. */
+			if (cpu == prev_cpu) {
+				prev_energy = compute_energy(p, prev_cpu, head);
+				if (prev_energy < best_energy)
+					best_energy = prev_energy;
+				continue;
+			}
+
+			/* Find the CPU with the max spare cap the fd. */
+			spare_cap = cpu_cap - util;
+			if (spare_cap > max_spare_cap) {
+				max_spare_cap = spare_cap;
+				max_spare_cap_cpu = cpu;
+			}
+		}
+
+		/* Evaluate the energy impact of using this CPU. */
+		if (max_spare_cap_cpu >= 0) {
+			cur_energy = compute_energy(p, max_spare_cap_cpu, head);
+			if (cur_energy < best_energy) {
+				best_energy = cur_energy;
+				best_energy_cpu = max_spare_cap_cpu;
+			}
+		}
+		fd = fd->next;
+	}
+
+	/*
+	 * We pick the best CPU only if it saves at least 6% of the
+	 * energy used by prev_cpu.
+	 */
+	if ((prev_energy - best_energy) > (prev_energy >> 4))
+		return best_energy_cpu;
+
+	return prev_cpu;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6706,18 +6810,26 @@ static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *sd = NULL;
+	struct freq_domain *fd;
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
-	int want_affine = 0;
+	int want_affine = 0, want_energy = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	rcu_read_lock();
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
-		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
-			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
+		fd = rd_freq_domain(cpu_rq(cpu)->rd);
+		want_energy = fd && !READ_ONCE(cpu_rq(cpu)->rd->overutilized);
+		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
+			      cpumask_test_cpu(cpu, &p->cpus_allowed);
+	}
+
+	if (want_energy) {
+		new_cpu = find_energy_efficient_cpu(p, prev_cpu, fd);
+		goto unlock;
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;
@@ -6752,6 +6864,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (want_affine)
 			current->recent_used_cpu = cpu;
 	}
+unlock:
 	rcu_read_unlock();
 
 	return new_cpu;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 11/12] OPTIONAL: arch_topology: Start Energy Aware Scheduling
  2018-06-28 11:40 [RFC PATCH v4 00/12] Energy Aware Scheduling Quentin Perret
                   ` (9 preceding siblings ...)
  2018-06-28 11:40 ` [RFC PATCH v4 10/12] sched/fair: Select an energy-efficient CPU on task wake-up Quentin Perret
@ 2018-06-28 11:40 ` Quentin Perret
  2018-06-28 11:40 ` [RFC PATCH v4 12/12] OPTIONAL: cpufreq: dt: Register an Energy Model Quentin Perret
  11 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-06-28 11:40 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

Energy Aware Scheduling (EAS) starts when the scheduling domains are
built if the Energy Model (EM) is present. However, in the typical case
of Arm/Arm64 systems, the EM is provided after the scheduling domains
are first built at boot time, which results in EAS staying disabled.

Fix this issue by re-building the scheduling domain from the arch
topology driver, once CPUfreq is up and running and when the CPU
capacities have been updated to their final value.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 drivers/base/arch_topology.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e7cb0c6ade81..2022174f5ac5 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -10,7 +10,9 @@
 #include <linux/arch_topology.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
+#include <linux/cpuset.h>
 #include <linux/device.h>
+#include <linux/energy_model.h>
 #include <linux/of.h>
 #include <linux/slab.h>
 #include <linux/string.h>
@@ -173,6 +175,9 @@ static cpumask_var_t cpus_to_visit;
 static void parsing_done_workfn(struct work_struct *work);
 static DECLARE_WORK(parsing_done_work, parsing_done_workfn);
 
+static void start_eas_workfn(struct work_struct *work);
+static DECLARE_WORK(start_eas_work, start_eas_workfn);
+
 static int
 init_cpu_capacity_callback(struct notifier_block *nb,
 			   unsigned long val,
@@ -204,6 +209,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
 		free_raw_capacity();
 		pr_debug("cpu_capacity: parsing done\n");
 		schedule_work(&parsing_done_work);
+		schedule_work(&start_eas_work);
 	}
 
 	return 0;
@@ -249,6 +255,15 @@ static void parsing_done_workfn(struct work_struct *work)
 	free_cpumask_var(cpus_to_visit);
 }
 
+static void start_eas_workfn(struct work_struct *work)
+{
+	/* Make sure the EM knows about the updated CPU capacities. */
+	em_rescale_cpu_capacity();
+
+	/* Inform the scheduler about the EM availability. */
+	rebuild_sched_domains();
+}
+
 #else
 core_initcall(free_raw_capacity);
 #endif
-- 
2.17.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 12/12] OPTIONAL: cpufreq: dt: Register an Energy Model
  2018-06-28 11:40 [RFC PATCH v4 00/12] Energy Aware Scheduling Quentin Perret
                   ` (10 preceding siblings ...)
  2018-06-28 11:40 ` [RFC PATCH v4 11/12] OPTIONAL: arch_topology: Start Energy Aware Scheduling Quentin Perret
@ 2018-06-28 11:40 ` Quentin Perret
  2018-07-06 10:10   ` Vincent Guittot
  11 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2018-06-28 11:40 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

*******************************************************************
* This patch illustrates the usage of the newly introduced Energy *
* Model framework and isn't supposed to be merged as-is.          *
*******************************************************************

The Energy Model framework provides an API to register the active power
of CPUs. Call this API from the cpufreq-dt driver with an estimation
of the power as P = C * V^2 * f with C, V, and f respectively the
capacitance of the CPU and the voltage and frequency of the OPP.

The CPU capacitance is read from the "dynamic-power-coefficient" DT
binding (originally introduced for thermal/IPA), and the voltage and
frequency values from PM_OPP.

Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 drivers/cpufreq/cpufreq-dt.c | 45 +++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/cpufreq-dt.c b/drivers/cpufreq/cpufreq-dt.c
index 190ea0dccb79..5a0747f73121 100644
--- a/drivers/cpufreq/cpufreq-dt.c
+++ b/drivers/cpufreq/cpufreq-dt.c
@@ -16,6 +16,7 @@
 #include <linux/cpu_cooling.h>
 #include <linux/cpufreq.h>
 #include <linux/cpumask.h>
+#include <linux/energy_model.h>
 #include <linux/err.h>
 #include <linux/module.h>
 #include <linux/of.h>
@@ -149,8 +150,47 @@ static int resources_available(void)
 	return 0;
 }
 
+static int of_est_power(unsigned long *mW, unsigned long *KHz, int cpu)
+{
+	unsigned long mV, Hz, MHz;
+	struct device *cpu_dev;
+	struct dev_pm_opp *opp;
+	struct device_node *np;
+	u32 cap;
+	u64 tmp;
+
+	cpu_dev = get_cpu_device(cpu);
+	if (!cpu_dev)
+		return -ENODEV;
+
+	np = of_node_get(cpu_dev->of_node);
+	if (!np)
+		return -EINVAL;
+
+	if (of_property_read_u32(np, "dynamic-power-coefficient", &cap))
+		return -EINVAL;
+
+	Hz = *KHz * 1000;
+	opp = dev_pm_opp_find_freq_ceil(cpu_dev, &Hz);
+	if (IS_ERR(opp))
+		return -EINVAL;
+
+	mV = dev_pm_opp_get_voltage(opp) / 1000;
+	dev_pm_opp_put(opp);
+
+	MHz = Hz / 1000000;
+	tmp = (u64)cap * mV * mV * MHz;
+	do_div(tmp, 1000000000);
+
+	*mW = (unsigned long)tmp;
+	*KHz = Hz / 1000;
+
+	return 0;
+}
+
 static int cpufreq_init(struct cpufreq_policy *policy)
 {
+	struct em_data_callback em_cb = EM_DATA_CB(of_est_power);
 	struct cpufreq_frequency_table *freq_table;
 	struct opp_table *opp_table = NULL;
 	struct private_data *priv;
@@ -159,7 +199,7 @@ static int cpufreq_init(struct cpufreq_policy *policy)
 	unsigned int transition_latency;
 	bool fallback = false;
 	const char *name;
-	int ret;
+	int ret, nr_opp;
 
 	cpu_dev = get_cpu_device(policy->cpu);
 	if (!cpu_dev) {
@@ -226,6 +266,7 @@ static int cpufreq_init(struct cpufreq_policy *policy)
 		ret = -EPROBE_DEFER;
 		goto out_free_opp;
 	}
+	nr_opp = ret;
 
 	if (fallback) {
 		cpumask_setall(policy->cpus);
@@ -278,6 +319,8 @@ static int cpufreq_init(struct cpufreq_policy *policy)
 	policy->cpuinfo.transition_latency = transition_latency;
 	policy->dvfs_possible_from_any_cpu = true;
 
+	em_register_freq_domain(policy->cpus, nr_opp, &em_cb);
+
 	return 0;
 
 out_free_cpufreq_table:
-- 
2.17.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-06-28 11:40 ` [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework Quentin Perret
@ 2018-07-05 14:31   ` Peter Zijlstra
  2018-07-05 15:24     ` Quentin Perret
  2018-07-05 14:39   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-05 14:31 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thu, Jun 28, 2018 at 12:40:34PM +0100, Quentin Perret wrote:
> +/**
> + * em_fd_energy() - Estimates the energy consumed by the CPUs of a freq. domain
> + * @fd		: frequency domain for which energy has to be estimated
> + * @max_util	: highest utilization among CPUs of the domain
> + * @sum_util	: sum of the utilization of all CPUs in the domain
> + *
> + * em_fd_energy() dereferences the capacity state table of the frequency
> + * domain, so it must be called under RCU read lock.
> + *
> + * Return: the sum of the energy consumed by the CPUs of the domain assuming
> + * a capacity state satisfying the max utilization of the domain.
> + */
> +static inline unsigned long em_fd_energy(struct em_freq_domain *fd,
> +				unsigned long max_util, unsigned long sum_util)
> +{
> +	struct em_cs_table *cs_table;
> +	struct em_cap_state *cs;
> +	unsigned long freq;
> +	int i;
> +
> +	cs_table = rcu_dereference(fd->cs_table);
> +	if (!cs_table)
> +		return 0;
> +
> +	/* Map the utilization value to a frequency */
> +	cs = &cs_table->state[cs_table->nr_cap_states - 1];
> +	freq = map_util_freq(max_util, cs->frequency, cs->capacity);
> +
> +	/* Find the lowest capacity state above this frequency */
> +	for (i = 0; i < cs_table->nr_cap_states; i++) {
> +		cs = &cs_table->state[i];
> +		if (cs->frequency >= freq)
> +			break;
> +	}
> +
> +	return cs->power * sum_util / cs->capacity;
> +}

I keep reading @max_util as the highest possible util (iow capacity) of
the freq domain, instead of the current highest instant capacity across
the CPUs in the domain.

At the same time I'm struggling for a better name. Maybe just @util? But
that would then maybe confuse against @sum_util. argh..

And I'm confused by what exactly this function computes; are you
(through sum_util) effectively setting idle power at 0?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-06-28 11:40 ` [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework Quentin Perret
  2018-07-05 14:31   ` Peter Zijlstra
@ 2018-07-05 14:39   ` Peter Zijlstra
  2018-07-05 15:09     ` Quentin Perret
  2018-07-05 15:06   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-05 14:39 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thu, Jun 28, 2018 at 12:40:34PM +0100, Quentin Perret wrote:
> +/**
> + * em_fd_nr_cap_states() - Get the number of capacity states of a freq. domain
> + * @fd		: frequency domain for which want to do this
> + *
> + * Return: the number of capacity state in the frequency domain table
> + */
> +static inline int em_fd_nr_cap_states(struct em_freq_domain *fd)
> +{
> +	struct em_cs_table *table;
> +	int nr_states;
> +
> +	rcu_read_lock();
> +	table = rcu_dereference(fd->cs_table);
> +	nr_states = table->nr_cap_states;
> +	rcu_read_unlock();

So right here, we can continue to free @table...

> +
> +	return nr_states;
> +}

and then what does the value we return mean?


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-06-28 11:40 ` [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework Quentin Perret
  2018-07-05 14:31   ` Peter Zijlstra
  2018-07-05 14:39   ` Peter Zijlstra
@ 2018-07-05 15:06   ` Peter Zijlstra
  2018-07-05 15:32     ` Quentin Perret
  2018-07-06  9:57   ` Vincent Guittot
  2018-07-09 18:07   ` Dietmar Eggemann
  4 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-05 15:06 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thu, Jun 28, 2018 at 12:40:34PM +0100, Quentin Perret wrote:
> +/* fd_update_cs_table() - Computes the capacity values of a cs_table
> + *
> + * This assumes a linear relation between capacity and frequency. As such,
> + * the capacity of a CPU at the n^th capacity state is computed as:
> + *           capactity(n) = max_capacity * freq(n) / freq_max
> + */

Broken comment style

> +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> +{
> +	unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> +	int max_cap_state = cs_table->nr_cap_states - 1;
> +	unsigned long fmax = cs_table->state[max_cap_state].frequency;
> +	int i;
> +
> +	for (i = 0; i < cs_table->nr_cap_states; i++) {
> +		u64 cap = (u64)cmax * cs_table->state[i].frequency;
> +		do_div(cap, fmax);
> +		cs_table->state[i].capacity = (unsigned long)cap;

I prefer div64_*() over do_div(), the calling convention the latter
always confuses the heck out of me.

So this then becomes something like:

		cs_table->state[i].capacity =
			div64_u64(cmax * cs_table->state[i].frequency, fmax);

> +	}
> +}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-07-05 14:39   ` Peter Zijlstra
@ 2018-07-05 15:09     ` Quentin Perret
  0 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-07-05 15:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thursday 05 Jul 2018 at 16:39:01 (+0200), Peter Zijlstra wrote:
> On Thu, Jun 28, 2018 at 12:40:34PM +0100, Quentin Perret wrote:
> > +/**
> > + * em_fd_nr_cap_states() - Get the number of capacity states of a freq. domain
> > + * @fd		: frequency domain for which want to do this
> > + *
> > + * Return: the number of capacity state in the frequency domain table
> > + */
> > +static inline int em_fd_nr_cap_states(struct em_freq_domain *fd)
> > +{
> > +	struct em_cs_table *table;
> > +	int nr_states;
> > +
> > +	rcu_read_lock();
> > +	table = rcu_dereference(fd->cs_table);
> > +	nr_states = table->nr_cap_states;
> > +	rcu_read_unlock();
> 
> So right here, we can continue to free @table...
> 
> > +
> > +	return nr_states;
> > +}
> 
> and then what does the value we return mean?

Hmm that's a good point. In practice the number of entries in the table
never changes, so that makes no sense to keep that value in the
RCU-protected struct. I can just move nr_cap_states in struct
freq_domain directly and access it without RCU protection, it is simply
not needed.

I'll change that for the next version.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-07-05 14:31   ` Peter Zijlstra
@ 2018-07-05 15:24     ` Quentin Perret
  0 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-07-05 15:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thursday 05 Jul 2018 at 16:31:51 (+0200), Peter Zijlstra wrote:
> On Thu, Jun 28, 2018 at 12:40:34PM +0100, Quentin Perret wrote:
> > +/**
> > + * em_fd_energy() - Estimates the energy consumed by the CPUs of a freq. domain
> > + * @fd		: frequency domain for which energy has to be estimated
> > + * @max_util	: highest utilization among CPUs of the domain
> > + * @sum_util	: sum of the utilization of all CPUs in the domain
> > + *
> > + * em_fd_energy() dereferences the capacity state table of the frequency
> > + * domain, so it must be called under RCU read lock.
> > + *
> > + * Return: the sum of the energy consumed by the CPUs of the domain assuming
> > + * a capacity state satisfying the max utilization of the domain.
> > + */
> > +static inline unsigned long em_fd_energy(struct em_freq_domain *fd,
> > +				unsigned long max_util, unsigned long sum_util)
> > +{
> > +	struct em_cs_table *cs_table;
> > +	struct em_cap_state *cs;
> > +	unsigned long freq;
> > +	int i;
> > +
> > +	cs_table = rcu_dereference(fd->cs_table);
> > +	if (!cs_table)
> > +		return 0;
> > +
> > +	/* Map the utilization value to a frequency */
> > +	cs = &cs_table->state[cs_table->nr_cap_states - 1];
> > +	freq = map_util_freq(max_util, cs->frequency, cs->capacity);
> > +
> > +	/* Find the lowest capacity state above this frequency */
> > +	for (i = 0; i < cs_table->nr_cap_states; i++) {
> > +		cs = &cs_table->state[i];
> > +		if (cs->frequency >= freq)
> > +			break;
> > +	}
> > +
> > +	return cs->power * sum_util / cs->capacity;
> > +}
> 
> I keep reading @max_util as the highest possible util (iow capacity) of
> the freq domain, instead of the current highest instant capacity across
> the CPUs in the domain.
> 
> At the same time I'm struggling for a better name. Maybe just @util? But
> that would then maybe confuse against @sum_util. argh..

Right, I see your point ... I'm happy to change to 'util' if you prefer.
One can argue that this is more consistent with the 'util' used in
sugov_next_freq_shared() ...

> 
> And I'm confused by what exactly this function computes; are you
> (through sum_util) effectively setting idle power at 0?

Yes, the EM only includes costs for active states, at least for now, so
we don't take idle time into consideration. Extending the model should
be doable later I suppose, as a second step (if proven useful ...).

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-07-05 15:06   ` Peter Zijlstra
@ 2018-07-05 15:32     ` Quentin Perret
  0 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-07-05 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thursday 05 Jul 2018 at 17:06:29 (+0200), Peter Zijlstra wrote:
> On Thu, Jun 28, 2018 at 12:40:34PM +0100, Quentin Perret wrote:
> > +/* fd_update_cs_table() - Computes the capacity values of a cs_table
> > + *
> > + * This assumes a linear relation between capacity and frequency. As such,
> > + * the capacity of a CPU at the n^th capacity state is computed as:
> > + *           capactity(n) = max_capacity * freq(n) / freq_max
> > + */
> 
> Broken comment style
> 
> > +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> > +{
> > +	unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> > +	int max_cap_state = cs_table->nr_cap_states - 1;
> > +	unsigned long fmax = cs_table->state[max_cap_state].frequency;
> > +	int i;
> > +
> > +	for (i = 0; i < cs_table->nr_cap_states; i++) {
> > +		u64 cap = (u64)cmax * cs_table->state[i].frequency;
> > +		do_div(cap, fmax);
> > +		cs_table->state[i].capacity = (unsigned long)cap;
> 
> I prefer div64_*() over do_div(), the calling convention the latter
> always confuses the heck out of me.
> 
> So this then becomes something like:
> 
> 		cs_table->state[i].capacity =
> 			div64_u64(cmax * cs_table->state[i].frequency, fmax);

Ok, I'll change both.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 05/12] sched/topology: Reference the Energy Model of CPUs when available
  2018-06-28 11:40 ` [RFC PATCH v4 05/12] sched/topology: Reference the Energy Model of CPUs when available Quentin Perret
@ 2018-07-05 17:29   ` Peter Zijlstra
  2018-07-05 17:48     ` Quentin Perret
  2018-07-05 17:33   ` Peter Zijlstra
  1 sibling, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-05 17:29 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thu, Jun 28, 2018 at 12:40:36PM +0100, Quentin Perret wrote:
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 61a1125c1ae4..357eff55c1a7 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -201,6 +201,19 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
>  	return 1;
>  }
>  
> +#ifdef CONFIG_ENERGY_MODEL
> +static void free_fd(struct freq_domain *fd)
> +{
> +	struct freq_domain *tmp;
> +
> +	while (fd) {
> +		tmp = fd->next;
> +		kfree(fd);
> +		fd = tmp;
> +	}
> +}
> +#endif
> +
>  static void free_rootdomain(struct rcu_head *rcu)
>  {
>  	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
> @@ -211,6 +224,9 @@ static void free_rootdomain(struct rcu_head *rcu)
>  	free_cpumask_var(rd->rto_mask);
>  	free_cpumask_var(rd->online);
>  	free_cpumask_var(rd->span);
> +#ifdef CONFIG_ENERGY_MODEL
> +	free_fd(rd->fd);
> +#endif

If you provide a stub function, you can reduce #ifdef.

>  	kfree(rd);
>  }
>  
> @@ -1635,6 +1651,104 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
>  	return sd;
>  }
>  
> +#ifdef CONFIG_ENERGY_MODEL

  < snip content >

> +#endif

And is there any reason this #ifdef cannot be merged with the one above?

That is, try and do a pass of #ifdef reduction on this file.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 05/12] sched/topology: Reference the Energy Model of CPUs when available
  2018-06-28 11:40 ` [RFC PATCH v4 05/12] sched/topology: Reference the Energy Model of CPUs when available Quentin Perret
  2018-07-05 17:29   ` Peter Zijlstra
@ 2018-07-05 17:33   ` Peter Zijlstra
  2018-07-05 17:50     ` Quentin Perret
  1 sibling, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-05 17:33 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thu, Jun 28, 2018 at 12:40:36PM +0100, Quentin Perret wrote:
> +			if (cpumask_equal(doms_new[i], doms_cur[j])
> +			    && cpu_rq(cpumask_first(doms_cur[j]))->rd->fd)

Please put the operator at the end of the previous line, not the begin
of the next line when splitting.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 05/12] sched/topology: Reference the Energy Model of CPUs when available
  2018-07-05 17:29   ` Peter Zijlstra
@ 2018-07-05 17:48     ` Quentin Perret
  0 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-07-05 17:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thursday 05 Jul 2018 at 19:29:22 (+0200), Peter Zijlstra wrote:
> On Thu, Jun 28, 2018 at 12:40:36PM +0100, Quentin Perret wrote:
> >  static void free_rootdomain(struct rcu_head *rcu)
> >  {
> >  	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
> > @@ -211,6 +224,9 @@ static void free_rootdomain(struct rcu_head *rcu)
> >  	free_cpumask_var(rd->rto_mask);
> >  	free_cpumask_var(rd->online);
> >  	free_cpumask_var(rd->span);
> > +#ifdef CONFIG_ENERGY_MODEL
> > +	free_fd(rd->fd);
> > +#endif
> 
> If you provide a stub function, you can reduce #ifdef.

I cannot just stub free_fd since the 'fd' member of the root_domain is
also defined only for CONFIG_ENERGY_MODEL=y.

But I can introduce a free_rd_fd(fd) function that will be stubbed if
you think that's better. Or make sure to have 'fd' attached to the
root_domain even for CONFIG_ENERGY_MODEL=n.

> 
> >  	kfree(rd);
> >  }
> >  
> > @@ -1635,6 +1651,104 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
> >  	return sd;
> >  }
> >  
> > +#ifdef CONFIG_ENERGY_MODEL
> 
>   < snip content >
> 
> > +#endif
> 
> And is there any reason this #ifdef cannot be merged with the one above?

Not really, it was just easier to keep build_freq_domains() close to
partition_sched_domains() while writing/reading the code but that is
a very weak argument, I agree. I'll move all of that to the ifdef above.

> That is, try and do a pass of #ifdef reduction on this file.

Ok I'll look into that.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 05/12] sched/topology: Reference the Energy Model of CPUs when available
  2018-07-05 17:33   ` Peter Zijlstra
@ 2018-07-05 17:50     ` Quentin Perret
  2018-07-05 18:14       ` Peter Zijlstra
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2018-07-05 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thursday 05 Jul 2018 at 19:33:21 (+0200), Peter Zijlstra wrote:
> On Thu, Jun 28, 2018 at 12:40:36PM +0100, Quentin Perret wrote:
> > +			if (cpumask_equal(doms_new[i], doms_cur[j])
> > +			    && cpu_rq(cpumask_first(doms_cur[j]))->rd->fd)
> 
> Please put the operator at the end of the previous line, not the begin
> of the next line when splitting.

I did that to be consistent with the to similar loops above, but n/p,
I'll change it :-)

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 05/12] sched/topology: Reference the Energy Model of CPUs when available
  2018-07-05 17:50     ` Quentin Perret
@ 2018-07-05 18:14       ` Peter Zijlstra
  0 siblings, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-05 18:14 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thu, Jul 05, 2018 at 06:50:22PM +0100, Quentin Perret wrote:
> On Thursday 05 Jul 2018 at 19:33:21 (+0200), Peter Zijlstra wrote:
> > On Thu, Jun 28, 2018 at 12:40:36PM +0100, Quentin Perret wrote:
> > > +			if (cpumask_equal(doms_new[i], doms_cur[j])
> > > +			    && cpu_rq(cpumask_first(doms_cur[j]))->rd->fd)
> > 
> > Please put the operator at the end of the previous line, not the begin
> > of the next line when splitting.
> 
> I did that to be consistent with the to similar loops above, but n/p,
> I'll change it :-)

In that case, feel free to fix those too ;-)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-06-28 11:40 ` [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework Quentin Perret
                     ` (2 preceding siblings ...)
  2018-07-05 15:06   ` Peter Zijlstra
@ 2018-07-06  9:57   ` Vincent Guittot
  2018-07-06 10:03     ` Peter Zijlstra
  2018-07-06 10:05     ` Quentin Perret
  2018-07-09 18:07   ` Dietmar Eggemann
  4 siblings, 2 replies; 54+ messages in thread
From: Vincent Guittot @ 2018-07-06  9:57 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thu, 28 Jun 2018 at 13:41, Quentin Perret <quentin.perret@arm.com> wrote:
>
> Several subsystems in the kernel (task scheduler and/or thermal at the
> time of writing) can benefit from knowing about the energy consumed by
> CPUs. Yet, this information can come from different sources (DT or
> firmware for example), in different formats, hence making it hard to
> exploit without a standard API.
>
> As an attempt to address this, introduce a centralized Energy Model
> (EM) management framework which aggregates the power values provided
> by drivers into a table for each frequency domain in the system. The
> power cost tables are made available to interested clients (e.g. task
> scheduler or thermal) via platform-agnostic APIs. The overall design
> is represented by the diagram below (focused on Arm-related drivers as
> an example, but hopefully applicable to any architecture):
>
>      +---------------+  +-----------------+  +---------+
>      | Thermal (IPA) |  | Scheduler (EAS) |  | Other ? |
>      +---------------+  +-----------------+  +---------+
>              |                 | em_fd_energy()   |
>              |                 | em_cpu_get()     |
>              +-----------+     |         +--------+
>                          |     |         |
>                          v     v         v
>                       +---------------------+
>                       |                     |         +---------------+
>                       |    Energy Model     |         | arch_topology |
>                       |                     |<--------|    driver     |
>                       |     Framework       |         +---------------+
>                       |                     | em_rescale_cpu_capacity()
>                       +---------------------+
>                          ^       ^       ^
>                          |       |       | em_register_freq_domain()
>               +----------+       |       +---------+
>               |                  |                 |
>       +---------------+  +---------------+  +--------------+
>       |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
>       +---------------+  +---------------+  +--------------+
>               ^                  ^                 ^
>               |                  |                 |
>       +--------------+   +---------------+  +--------------+
>       | Device Tree  |   |   Firmware    |  |      ?       |
>       +--------------+   +---------------+  +--------------+
>
> Drivers (typically, but not limited to, CPUFreq drivers) can register
> data in the EM framework using the em_register_freq_domain() API. The
> calling driver must provide a callback function with a standardized
> signature that will be used by the EM framework to build the power
> cost tables of the frequency domain. This design should offer a lot of
> flexibility to calling drivers which are free of reading information
> from any location and to use any technique to compute power costs.
> Moreover, the capacity states registered by drivers in the EM framework
> are not required to match real performance states of the target. This
> is particularly important on targets where the performance states are
> not known by the OS.
>
> On the client side, the EM framework offers APIs to access the power
> cost tables of a CPU (em_cpu_get()), and to estimate the energy
> consumed by the CPUs of a frequency domain (em_fd_energy()). Clients
> such as the task scheduler can then use these APIs to access the shared
> data structures holding the Energy Model of CPUs.
>
> The EM framework also provides an API (em_rescale_cpu_capacity()) to
> re-scale the capacity values of the model asynchronously, after it has
> been created. This is required for architectures where the capacity
> scale factor of CPUs can change at run-time. This is the case for
> Arm/Arm64 for example where the arch_topology driver recomputes the
> capacity scale factors of the CPUs after the maximum frequency of all
> CPUs has been discovered. Although complex, the process of creating and
> re-scaling the EM has to be kept in two separate steps to fulfill the
> needs of the different users. The thermal subsystem doesn't use the
> capacity values and shouldn't have dependencies on subsystems providing
> them. On the other hand, the task scheduler needs the capacity values,
> and it will benefit from seeing them up-to-date when applicable.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> ---
>  include/linux/energy_model.h | 140 ++++++++++++++++++
>  kernel/power/Kconfig         |  15 ++
>  kernel/power/Makefile        |   2 +
>  kernel/power/energy_model.c  | 273 +++++++++++++++++++++++++++++++++++
>  4 files changed, 430 insertions(+)
>  create mode 100644 include/linux/energy_model.h
>  create mode 100644 kernel/power/energy_model.c
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> new file mode 100644
> index 000000000000..88c2f0b9bcb3
> --- /dev/null
> +++ b/include/linux/energy_model.h
> @@ -0,0 +1,140 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_ENERGY_MODEL_H
> +#define _LINUX_ENERGY_MODEL_H
> +#include <linux/cpumask.h>
> +#include <linux/jump_label.h>
> +#include <linux/kobject.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched/cpufreq.h>
> +#include <linux/types.h>
> +
> +#ifdef CONFIG_ENERGY_MODEL
> +struct em_cap_state {
> +       unsigned long capacity;
> +       unsigned long frequency; /* Kilo-hertz */
> +       unsigned long power; /* Milli-watts */
> +};
> +
> +struct em_cs_table {
> +       struct em_cap_state *state; /* Capacity states, in ascending order. */
> +       int nr_cap_states;
> +       struct rcu_head rcu;
> +};
> +
> +struct em_freq_domain {
> +       struct em_cs_table *cs_table; /* Capacity state table, RCU-protected */
> +       unsigned long cpus[0]; /* CPUs of the frequency domain. */
> +};
> +
> +#define EM_CPU_MAX_POWER 0xFFFF
> +
> +struct em_data_callback {
> +       /**
> +        * active_power() - Provide power at the next capacity state of a CPU
> +        * @power       : Active power at the capacity state in mW (modified)
> +        * @freq        : Frequency at the capacity state in kHz (modified)
> +        * @cpu         : CPU for which we do this operation
> +        *
> +        * active_power() must find the lowest capacity state of 'cpu' above
> +        * 'freq' and update 'power' and 'freq' to the matching active power
> +        * and frequency.
> +        *
> +        * The power is the one of a single CPU in the domain, expressed in
> +        * milli-watts. It is expected to fit in the [0, EM_CPU_MAX_POWER]
> +        * range.
> +        *
> +        * Return 0 on success.
> +        */
> +       int (*active_power) (unsigned long *power, unsigned long *freq, int cpu);
> +};
> +#define EM_DATA_CB(_active_power_cb) { .active_power = &_active_power_cb }
> +
> +void em_rescale_cpu_capacity(void);
> +struct em_freq_domain *em_cpu_get(int cpu);
> +int em_register_freq_domain(cpumask_t *span, unsigned int nr_states,
> +                                               struct em_data_callback *cb);
> +
> +/**
> + * em_fd_energy() - Estimates the energy consumed by the CPUs of a freq. domain
> + * @fd         : frequency domain for which energy has to be estimated
> + * @max_util   : highest utilization among CPUs of the domain
> + * @sum_util   : sum of the utilization of all CPUs in the domain
> + *
> + * em_fd_energy() dereferences the capacity state table of the frequency
> + * domain, so it must be called under RCU read lock.
> + *
> + * Return: the sum of the energy consumed by the CPUs of the domain assuming
> + * a capacity state satisfying the max utilization of the domain.
> + */
> +static inline unsigned long em_fd_energy(struct em_freq_domain *fd,
> +                               unsigned long max_util, unsigned long sum_util)
> +{
> +       struct em_cs_table *cs_table;
> +       struct em_cap_state *cs;
> +       unsigned long freq;
> +       int i;
> +
> +       cs_table = rcu_dereference(fd->cs_table);
> +       if (!cs_table)
> +               return 0;
> +
> +       /* Map the utilization value to a frequency */
> +       cs = &cs_table->state[cs_table->nr_cap_states - 1];
> +       freq = map_util_freq(max_util, cs->frequency, cs->capacity);

The 2 lines above deserve more explanation:
1st, you get the max capacity of the freq domain
Then, you estimate what will be the selected frequency according to
the max_utilization.
Might worth to mention that we must keep sync how sched_util and EM
select a freq for a given capacity which is the reason of patch 02

> +
> +       /* Find the lowest capacity state above this frequency */
> +       for (i = 0; i < cs_table->nr_cap_states; i++) {
> +               cs = &cs_table->state[i];
> +               if (cs->frequency >= freq)
> +                       break;
> +       }
> +
> +       return cs->power * sum_util / cs->capacity;

IIUC the formula above, you consider that all CPUs in a frequency
domain has the same capacity. This sounds a reasonable assumption but
it would be good to write that somewhere

> +}
> +

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-07-06  9:57   ` Vincent Guittot
@ 2018-07-06 10:03     ` Peter Zijlstra
  2018-07-06 10:06       ` Quentin Perret
  2018-07-06 10:05     ` Quentin Perret
  1 sibling, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-06 10:03 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Quentin Perret, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Fri, Jul 06, 2018 at 11:57:37AM +0200, Vincent Guittot wrote:
> 
> IIUC the formula above, you consider that all CPUs in a frequency
> domain has the same capacity. This sounds a reasonable assumption but
> it would be good to write that somewhere

Yes, I think there used to be a WARN and kill EM for that, but I cannot
seem to locate that in the current code.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-07-06  9:57   ` Vincent Guittot
  2018-07-06 10:03     ` Peter Zijlstra
@ 2018-07-06 10:05     ` Quentin Perret
  1 sibling, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-07-06 10:05 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

Hi Vincent,

On Friday 06 Jul 2018 at 11:57:37 (+0200), Vincent Guittot wrote:
> On Thu, 28 Jun 2018 at 13:41, Quentin Perret <quentin.perret@arm.com> wrote:
> > +static inline unsigned long em_fd_energy(struct em_freq_domain *fd,
> > +                               unsigned long max_util, unsigned long sum_util)
> > +{
> > +       struct em_cs_table *cs_table;
> > +       struct em_cap_state *cs;
> > +       unsigned long freq;
> > +       int i;
> > +
> > +       cs_table = rcu_dereference(fd->cs_table);
> > +       if (!cs_table)
> > +               return 0;
> > +
> > +       /* Map the utilization value to a frequency */
> > +       cs = &cs_table->state[cs_table->nr_cap_states - 1];
> > +       freq = map_util_freq(max_util, cs->frequency, cs->capacity);
> 
> The 2 lines above deserve more explanation:
> 1st, you get the max capacity of the freq domain
> Then, you estimate what will be the selected frequency according to
> the max_utilization.
> Might worth to mention that we must keep sync how sched_util and EM
> select a freq for a given capacity which is the reason of patch 02

Agreed, this could benefit from more explanations. I'll comment that
better in the next version.

> 
> > +
> > +       /* Find the lowest capacity state above this frequency */
> > +       for (i = 0; i < cs_table->nr_cap_states; i++) {
> > +               cs = &cs_table->state[i];
> > +               if (cs->frequency >= freq)
> > +                       break;
> > +       }
> > +
> > +       return cs->power * sum_util / cs->capacity;
> 
> IIUC the formula above, you consider that all CPUs in a frequency
> domain has the same capacity. This sounds a reasonable assumption but
> it would be good to write that somewhere

That's correct, we agreed on the assumption that CPUs in the same freq
domain must have the same micro-arch, and consequently the same capacity.

But you're right, that needs at the very least to be documented. Or even
better, I should check that when the EM is created and bail out with an
error message if that's not the case. That shouldn't be too hard to
implement, I think.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-07-06 10:03     ` Peter Zijlstra
@ 2018-07-06 10:06       ` Quentin Perret
  0 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-07-06 10:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Friday 06 Jul 2018 at 12:03:53 (+0200), Peter Zijlstra wrote:
> On Fri, Jul 06, 2018 at 11:57:37AM +0200, Vincent Guittot wrote:
> > 
> > IIUC the formula above, you consider that all CPUs in a frequency
> > domain has the same capacity. This sounds a reasonable assumption but
> > it would be good to write that somewhere
> 
> Yes, I think there used to be a WARN and kill EM for that, but I cannot
> seem to locate that in the current code.

There is something if the EM is too complex to be used by the scheduler.
But nothing about that assumption (yet).

I'll fix that in v5

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 12/12] OPTIONAL: cpufreq: dt: Register an Energy Model
  2018-06-28 11:40 ` [RFC PATCH v4 12/12] OPTIONAL: cpufreq: dt: Register an Energy Model Quentin Perret
@ 2018-07-06 10:10   ` Vincent Guittot
  2018-07-06 10:18     ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Vincent Guittot @ 2018-07-06 10:10 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thu, 28 Jun 2018 at 13:41, Quentin Perret <quentin.perret@arm.com> wrote:
>
> *******************************************************************
> * This patch illustrates the usage of the newly introduced Energy *
> * Model framework and isn't supposed to be merged as-is.          *
> *******************************************************************
>
> The Energy Model framework provides an API to register the active power
> of CPUs. Call this API from the cpufreq-dt driver with an estimation
> of the power as P = C * V^2 * f with C, V, and f respectively the
> capacitance of the CPU and the voltage and frequency of the OPP.
>
> The CPU capacitance is read from the "dynamic-power-coefficient" DT
> binding (originally introduced for thermal/IPA), and the voltage and
> frequency values from PM_OPP.
>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> ---
>  drivers/cpufreq/cpufreq-dt.c | 45 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/cpufreq-dt.c b/drivers/cpufreq/cpufreq-dt.c
> index 190ea0dccb79..5a0747f73121 100644
> --- a/drivers/cpufreq/cpufreq-dt.c
> +++ b/drivers/cpufreq/cpufreq-dt.c
> @@ -16,6 +16,7 @@
>  #include <linux/cpu_cooling.h>
>  #include <linux/cpufreq.h>
>  #include <linux/cpumask.h>
> +#include <linux/energy_model.h>
>  #include <linux/err.h>
>  #include <linux/module.h>
>  #include <linux/of.h>
> @@ -149,8 +150,47 @@ static int resources_available(void)
>         return 0;
>  }
>
> +static int of_est_power(unsigned long *mW, unsigned long *KHz, int cpu)
> +{
> +       unsigned long mV, Hz, MHz;
> +       struct device *cpu_dev;
> +       struct dev_pm_opp *opp;
> +       struct device_node *np;
> +       u32 cap;
> +       u64 tmp;
> +
> +       cpu_dev = get_cpu_device(cpu);
> +       if (!cpu_dev)
> +               return -ENODEV;
> +
> +       np = of_node_get(cpu_dev->of_node);
> +       if (!np)
> +               return -EINVAL;
> +
> +       if (of_property_read_u32(np, "dynamic-power-coefficient", &cap))
> +               return -EINVAL;
> +
> +       Hz = *KHz * 1000;
> +       opp = dev_pm_opp_find_freq_ceil(cpu_dev, &Hz);
> +       if (IS_ERR(opp))
> +               return -EINVAL;
> +
> +       mV = dev_pm_opp_get_voltage(opp) / 1000;
> +       dev_pm_opp_put(opp);
> +
> +       MHz = Hz / 1000000;
> +       tmp = (u64)cap * mV * mV * MHz;
> +       do_div(tmp, 1000000000);

Could you explain the formula above ? and especially the 1000000000 it
seems related to the use of mV and mW instead of uV and uW ...

Can't you just optimize that into
tmp = (u64)cap * mV * mV * Hz;
do_div(tmp, 1000);

> +
> +       *mW = (unsigned long)tmp;
> +       *KHz = Hz / 1000;
> +
> +       return 0;
> +}
> +
>  static int cpufreq_init(struct cpufreq_policy *policy)
>  {
> +       struct em_data_callback em_cb = EM_DATA_CB(of_est_power);
>         struct cpufreq_frequency_table *freq_table;
>         struct opp_table *opp_table = NULL;
>         struct private_data *priv;
> @@ -159,7 +199,7 @@ static int cpufreq_init(struct cpufreq_policy *policy)
>         unsigned int transition_latency;
>         bool fallback = false;
>         const char *name;
> -       int ret;
> +       int ret, nr_opp;
>
>         cpu_dev = get_cpu_device(policy->cpu);
>         if (!cpu_dev) {
> @@ -226,6 +266,7 @@ static int cpufreq_init(struct cpufreq_policy *policy)
>                 ret = -EPROBE_DEFER;
>                 goto out_free_opp;
>         }
> +       nr_opp = ret;
>
>         if (fallback) {
>                 cpumask_setall(policy->cpus);
> @@ -278,6 +319,8 @@ static int cpufreq_init(struct cpufreq_policy *policy)
>         policy->cpuinfo.transition_latency = transition_latency;
>         policy->dvfs_possible_from_any_cpu = true;
>
> +       em_register_freq_domain(policy->cpus, nr_opp, &em_cb);
> +
>         return 0;
>
>  out_free_cpufreq_table:
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 12/12] OPTIONAL: cpufreq: dt: Register an Energy Model
  2018-07-06 10:10   ` Vincent Guittot
@ 2018-07-06 10:18     ` Quentin Perret
  2018-07-30 15:53       ` Vincent Guittot
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2018-07-06 10:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Friday 06 Jul 2018 at 12:10:02 (+0200), Vincent Guittot wrote:
> On Thu, 28 Jun 2018 at 13:41, Quentin Perret <quentin.perret@arm.com> wrote:
> > +static int of_est_power(unsigned long *mW, unsigned long *KHz, int cpu)
> > +{
> > +       unsigned long mV, Hz, MHz;
> > +       struct device *cpu_dev;
> > +       struct dev_pm_opp *opp;
> > +       struct device_node *np;
> > +       u32 cap;
> > +       u64 tmp;
> > +
> > +       cpu_dev = get_cpu_device(cpu);
> > +       if (!cpu_dev)
> > +               return -ENODEV;
> > +
> > +       np = of_node_get(cpu_dev->of_node);
> > +       if (!np)
> > +               return -EINVAL;
> > +
> > +       if (of_property_read_u32(np, "dynamic-power-coefficient", &cap))
> > +               return -EINVAL;
> > +
> > +       Hz = *KHz * 1000;
> > +       opp = dev_pm_opp_find_freq_ceil(cpu_dev, &Hz);
> > +       if (IS_ERR(opp))
> > +               return -EINVAL;
> > +
> > +       mV = dev_pm_opp_get_voltage(opp) / 1000;
> > +       dev_pm_opp_put(opp);
> > +
> > +       MHz = Hz / 1000000;
> > +       tmp = (u64)cap * mV * mV * MHz;
> > +       do_div(tmp, 1000000000);
> 
> Could you explain the formula above ? and especially the 1000000000 it
> seems related to the use of mV and mW instead of uV and uW ...

That's correct, this is just a matter of units. I simply tried to
reproduce here what is currently done for IPA:

https://elixir.bootlin.com/linux/latest/source/drivers/thermal/cpu_cooling.c#L252

> 
> Can't you just optimize that into
> tmp = (u64)cap * mV * mV * Hz;
> do_div(tmp, 1000);

I don't think that'll work. If you use Hz instead of MHz you'll need to
divide tmp by 1000000000000000 to get milli-watts, as specified by the
EM framework.

FYI, I don't consider this patch to be ready to be merged. I kept it
self-contained and simple for the example but I actually think that this
of_est_power function should probably be factored-out somewhere else.
scpi-cpufreq could use it as well. Anyway, I guess we can discuss that
later once the APIs of the EM framework are agreed :-)

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator
  2018-06-28 11:40 ` [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator Quentin Perret
@ 2018-07-06 11:32   ` Peter Zijlstra
  2018-07-06 13:20     ` Quentin Perret
  2018-07-06 11:39   ` Peter Zijlstra
  2018-07-06 11:41   ` Peter Zijlstra
  2 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-06 11:32 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thu, Jun 28, 2018 at 12:40:39PM +0100, Quentin Perret wrote:
> @@ -5384,6 +5402,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
> +	int task_new = !(flags & ENQUEUE_WAKEUP);
>  
>  	/*
>  	 * The code below (indirectly) updates schedutil which looks at
> @@ -5431,8 +5450,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		update_cfs_group(se);
>  	}
>  
> -	if (!se)
> +	if (!se) {
>  		add_nr_running(rq, 1);
> +		if (!task_new)
> +			update_overutilized_status(rq);

I'm confused... why only for !ENQUEUE_WAKEUP and why use a local
variable for something that's used only once?

> +	}
>  
>  	hrtick_update(rq);
>  }

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator
  2018-06-28 11:40 ` [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator Quentin Perret
  2018-07-06 11:32   ` Peter Zijlstra
@ 2018-07-06 11:39   ` Peter Zijlstra
  2018-07-06 11:41   ` Peter Zijlstra
  2 siblings, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-06 11:39 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thu, Jun 28, 2018 at 12:40:39PM +0100, Quentin Perret wrote:
>  static inline void update_sg_lb_stats(struct lb_env *env,
>  			struct sched_group *group, int load_idx,
>  			int local_group, struct sg_lb_stats *sgs,
> -			bool *overload)
> +			bool *overload, int *overutilized)

> @@ -8308,7 +8336,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		}
>  
>  		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
> -						&overload);
> +						&overload, &overutilized);
>  
>  		if (local_group)
>  			goto next_group;

I'm thinking maybe we should reduce arguments here a little.

@load_idx and @local_group look to be trivially computable..

and possibly we could frob overload and overutilized in a single
variable with some flags?


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 321cd5dcf2e8..091859069620 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8114,10 +8114,12 @@ static bool update_nohz_stats(struct rq *rq, bool force)
  * @overload: Indicate more than one runnable task for any CPU.
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
-			struct sched_group *group, int load_idx,
-			int local_group, struct sg_lb_stats *sgs,
+			struct sched_group *group,
+			struct sg_lb_stats *sgs,
 			bool *overload)
 {
+	int local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(sd));
+	int load_idx = get_sd_load_idx(env->sd, env->idle);
 	unsigned long load;
 	int i, nr_running;
 
@@ -8278,7 +8280,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
-	int load_idx, prefer_sibling = 0;
+	int prefer_sibling = 0;
 	bool overload = false;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
@@ -8289,8 +8291,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		env->flags |= LBF_NOHZ_STATS;
 #endif
 
-	load_idx = get_sd_load_idx(env->sd, env->idle);
-
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
 		int local_group;
@@ -8305,8 +8305,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 				update_group_capacity(env->sd, env->dst_cpu);
 		}
 
-		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
-						&overload);
+		update_sg_lb_stats(env, sg, sgs, &overload);
 
 		if (local_group)
 			goto next_group;

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator
  2018-06-28 11:40 ` [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator Quentin Perret
  2018-07-06 11:32   ` Peter Zijlstra
  2018-07-06 11:39   ` Peter Zijlstra
@ 2018-07-06 11:41   ` Peter Zijlstra
  2018-07-06 11:49     ` Valentin Schneider
  2 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-06 11:41 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thu, Jun 28, 2018 at 12:40:39PM +0100, Quentin Perret wrote:
> @@ -698,6 +698,9 @@ struct root_domain {
>  	/* Indicate more than one runnable task for any CPU */
>  	bool			overload;

While there, make that an int.

> +	/* Indicate one or more cpus over-utilized (tipping point) */
> +	int			overutilized;
> +
>  	/*
>  	 * The bit corresponding to a CPU gets set here if such CPU has more
>  	 * than one runnable -deadline task (as it is below for RT tasks).
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator
  2018-07-06 11:41   ` Peter Zijlstra
@ 2018-07-06 11:49     ` Valentin Schneider
  0 siblings, 0 replies; 54+ messages in thread
From: Valentin Schneider @ 2018-07-06 11:49 UTC (permalink / raw)
  To: Peter Zijlstra, Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	vincent.guittot, thara.gopinath, viresh.kumar, tkjos, joel,
	smuckle, adharmap, skannan, pkondeti, juri.lelli, edubezval,
	srinivas.pandruvada, currojerez, javi.merino

On 06/07/18 12:41, Peter Zijlstra wrote:
> On Thu, Jun 28, 2018 at 12:40:39PM +0100, Quentin Perret wrote:
>> @@ -698,6 +698,9 @@ struct root_domain {
>>  	/* Indicate more than one runnable task for any CPU */
>>  	bool			overload;
> 
> While there, make that an int.

That's what I've done at [1] - I guess that's the fun part of having those
two patch sets in flight at the same time...

[1]: https://lore.kernel.org/lkml/1530699470-29808-8-git-send-email-morten.rasmussen@arm.com/

> 
>> +	/* Indicate one or more cpus over-utilized (tipping point) */
>> +	int			overutilized;
>> +
>>  	/*
>>  	 * The bit corresponding to a CPU gets set here if such CPU has more
>>  	 * than one runnable -deadline task (as it is below for RT tasks).
>> -- 
>> 2.17.1
>>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function
  2018-06-28 11:40 ` [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function Quentin Perret
@ 2018-07-06 13:12   ` Peter Zijlstra
  2018-07-06 15:12     ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-06 13:12 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thu, Jun 28, 2018 at 12:40:40PM +0100, Quentin Perret wrote:
> +static long 
   compute_energy(struct task_struct *p, int dst_cpu, struct freq_domain *fd)
> +{
> +	long util, max_util, sum_util, energy = 0;
> +	int cpu;
> +
> +	while (fd) {
> +		max_util = sum_util = 0;
> +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {

I had a wee little wtf moment when I realized this crosses root_domain
boundaries but that that is actually desired. A comment might be in
order.

> +			util = cpu_util_next(cpu, p, dst_cpu);
> +			util += cpu_util_dl(cpu_rq(cpu));
> +			/* XXX: add RT util_avg when available. */
> +
> +			max_util = max(util, max_util);
> +			sum_util += util;

Did you want to use sugov_get_util() here? There is no way we're going
to duplicate all that.

ISTR we already established the whole EAS was going to require using
schedutil anyway.

> +		}
> +
> +		energy += em_fd_energy(fd->obj, max_util, sum_util);
> +		fd = fd->next;
> +	}
> +
> +	return energy;
> +}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator
  2018-07-06 11:32   ` Peter Zijlstra
@ 2018-07-06 13:20     ` Quentin Perret
  2018-07-06 13:24       ` Peter Zijlstra
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2018-07-06 13:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Friday 06 Jul 2018 at 13:32:08 (+0200), Peter Zijlstra wrote:
> On Thu, Jun 28, 2018 at 12:40:39PM +0100, Quentin Perret wrote:
> > @@ -5384,6 +5402,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  {
> >  	struct cfs_rq *cfs_rq;
> >  	struct sched_entity *se = &p->se;
> > +	int task_new = !(flags & ENQUEUE_WAKEUP);
> >  
> >  	/*
> >  	 * The code below (indirectly) updates schedutil which looks at
> > @@ -5431,8 +5450,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  		update_cfs_group(se);
> >  	}
> >  
> > -	if (!se)
> > +	if (!se) {
> >  		add_nr_running(rq, 1);
> > +		if (!task_new)
> > +			update_overutilized_status(rq);
> 
> I'm confused... why only for !ENQUEUE_WAKEUP

So, what we try to do here is to _not_ set the overutilized flag based
on the utilization of a new task, because its utilization is 'wrong'.
That should help avoiding spurious transitions above the overutilized
threshold.

Was there also something else Morten ?

> and why use a local
> variable for something that's used only once?

No good reasons. An older version of this patch used 'task_new' in two
different locations. This is no longer true, I'll clean that up for v5.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator
  2018-07-06 13:20     ` Quentin Perret
@ 2018-07-06 13:24       ` Peter Zijlstra
  2018-07-06 13:40         ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-06 13:24 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Fri, Jul 06, 2018 at 02:20:57PM +0100, Quentin Perret wrote:
> On Friday 06 Jul 2018 at 13:32:08 (+0200), Peter Zijlstra wrote:
> > On Thu, Jun 28, 2018 at 12:40:39PM +0100, Quentin Perret wrote:
> > > @@ -5384,6 +5402,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  {
> > >  	struct cfs_rq *cfs_rq;
> > >  	struct sched_entity *se = &p->se;
> > > +	int task_new = !(flags & ENQUEUE_WAKEUP);
> > >  
> > >  	/*
> > >  	 * The code below (indirectly) updates schedutil which looks at
> > > @@ -5431,8 +5450,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  		update_cfs_group(se);
> > >  	}
> > >  
> > > -	if (!se)
> > > +	if (!se) {
> > >  		add_nr_running(rq, 1);
> > > +		if (!task_new)
> > > +			update_overutilized_status(rq);
> > 
> > I'm confused... why only for !ENQUEUE_WAKEUP
> 
> So, what we try to do here is to _not_ set the overutilized flag based
> on the utilization of a new task, because its utilization is 'wrong'.
> That should help avoiding spurious transitions above the overutilized
> threshold.

That most certainly deserves a comment, also didn't util_est fix some of
that wrong-ness?


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator
  2018-07-06 13:24       ` Peter Zijlstra
@ 2018-07-06 13:40         ` Quentin Perret
  0 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-07-06 13:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Friday 06 Jul 2018 at 15:24:28 (+0200), Peter Zijlstra wrote:
> On Fri, Jul 06, 2018 at 02:20:57PM +0100, Quentin Perret wrote:
> > So, what we try to do here is to _not_ set the overutilized flag based
> > on the utilization of a new task, because its utilization is 'wrong'.
> > That should help avoiding spurious transitions above the overutilized
> > threshold.
> 
> That most certainly deserves a comment,

Ok, will do.

> also didn't util_est fix some of that wrong-ness?

When a new task is enqueued, we don't have util_est samples yet, so the
task's util_est is equal to its util_avg. So util_est won't help here,
unfortunately.

But it does help in the wake-up path to predict what OPP will be
selected (see cpu_util_next() in patch 09/12).

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function
  2018-07-06 13:12   ` Peter Zijlstra
@ 2018-07-06 15:12     ` Quentin Perret
  2018-07-06 15:49       ` Peter Zijlstra
  2018-07-06 15:50       ` Peter Zijlstra
  0 siblings, 2 replies; 54+ messages in thread
From: Quentin Perret @ 2018-07-06 15:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Friday 06 Jul 2018 at 15:12:43 (+0200), Peter Zijlstra wrote:
> On Thu, Jun 28, 2018 at 12:40:40PM +0100, Quentin Perret wrote:
> > +static long 
>    compute_energy(struct task_struct *p, int dst_cpu, struct freq_domain *fd)
> > +{
> > +	long util, max_util, sum_util, energy = 0;
> > +	int cpu;
> > +
> > +	while (fd) {
> > +		max_util = sum_util = 0;
> > +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
> 
> I had a wee little wtf moment when I realized this crosses root_domain
> boundaries but that that is actually desired. A comment might be in
> order.

Indeed, the selection of CPU candidates is guaranteed to be within the
root_domain, but the boundaries for the energy calculation can be
different. The OPP in a FD can be driven by a CPU outside of the
root_domain we're currently looking at ... I'll add a comment about
that.

I guess it is also worth mentioning that in the case mentioned above the
scheduling decisions made inside one root domain can indirectly be
impacted by the scheduling decisions made in another root domain. Some
CPUs inside one root domain might look expensive to use simply because
they run at a high OPP because of other CPUs of the same FD being
over-used in another root domain.

This is arguably not desirable, but I'm not sure if this is a problem.
It would be wise for users not to split a frequency domain in multiple
root domains IMO but we can't really prevent them from doing so ...

Another solution would be to ignore the CPUs outside of the root_domain
and to accept that this can lead to sub-optimal scheduling decisions
from an energy standpoint.

> 
> > +			util = cpu_util_next(cpu, p, dst_cpu);
> > +			util += cpu_util_dl(cpu_rq(cpu));
> > +			/* XXX: add RT util_avg when available. */
> > +
> > +			max_util = max(util, max_util);
> > +			sum_util += util;
> 
> Did you want to use sugov_get_util() here? There is no way we're going
> to duplicate all that.

I need to look into how we can do that ... Sugov looks at the current
util landscape while EAS tries to predict the _future_ util landscape.
Merging the two means I need to add a task and a dst_cpu as parameters
of sugov_get_util() and call cpu_util_next() from there, which doesn't
feel so clean ...

Also, if we merge sugov_get_util() and sugov_aggregate_util() with
Vincent's patch-set I'll need to make sure to return two values with
sugov_get_util(): 1) the sum of the util of all classes; and 2) the util
that will be used to request an OPP. 1) should be used in sum_util and
2) could (but I don't think it's is a good idea) be used for max_util.

This probably won't be pretty :/

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function
  2018-07-06 15:12     ` Quentin Perret
@ 2018-07-06 15:49       ` Peter Zijlstra
  2018-07-06 17:04         ` Quentin Perret
  2018-07-06 15:50       ` Peter Zijlstra
  1 sibling, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-06 15:49 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Fri, Jul 06, 2018 at 04:12:12PM +0100, Quentin Perret wrote:
> On Friday 06 Jul 2018 at 15:12:43 (+0200), Peter Zijlstra wrote:

> > Did you want to use sugov_get_util() here? There is no way we're going
> > to duplicate all that.
> 
> I need to look into how we can do that ... Sugov looks at the current
> util landscape while EAS tries to predict the _future_ util landscape.
> Merging the two means I need to add a task and a dst_cpu as parameters
> of sugov_get_util() and call cpu_util_next() from there, which doesn't
> feel so clean ...

Just pass in the util_cfs as computed by cpu_util_next(), then schedutil
will pass in cpu_util_cfs(), the rest is all the same I think.

See below.

> Also, if we merge sugov_get_util() and sugov_aggregate_util() with
> Vincent's patch-set I'll need to make sure to return two values with
> sugov_get_util(): 1) the sum of the util of all classes; and 2) the util
> that will be used to request an OPP. 1) should be used in sum_util and
> 2) could (but I don't think it's is a good idea) be used for max_util.

I am confused, the max/sum thing is composed of the same values, just a
different operator. Both take 'util':


+	util = schedutil_get_util(cpu, cpu_util_next(cpu, p, dst_cpu))
+	max_util = max(util, max_util);
+	sum_util += util;


unsigned long schedutil_get_util(int cpu, unsigned long util_cfs)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long util, irq, max;

	max = arch_scale_cpu_capacity(NULL, cpu);

	if (rt_rq_is_runnable(&rq->rt))
		return max;

	/*
	 * Early check to see if IRQ/steal time saturates the CPU, can be
	 * because of inaccuracies in how we track these -- see
	 * update_irq_load_avg().
	 */
	irq = cpu_util_irq(rq);
	if (unlikely(irq >= max))
		return max;

	/*
	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
	 * CFS tasks and we use the same metric to track the effective
	 * utilization (PELT windows are synchronized) we can directly add them
	 * to obtain the CPU's actual utilization.
	 */
	util = util_cfs;
	util += cpu_util_rt(rq);

	/*
	 * We do not make cpu_util_dl() a permanent part of this sum because we
	 * want to use cpu_bw_dl() later on, but we need to check if the
	 * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
	 * f_max when there is no idle time.
	 *
	 * NOTE: numerical errors or stop class might cause us to not quite hit
	 * saturation when we should -- something for later.
	 */
	if ((util + cpu_util_dl(rq)) >= max)
		return max;

	/*
	 * There is still idle time; further improve the number by using the
	 * irq metric. Because IRQ/steal time is hidden from the task clock we
	 * need to scale the task numbers:
	 *
	 *              1 - irq
	 *   U' = irq + ------- * U
	 *                max
	 */
	util *= (max - irq);
	util /= max;
	util += irq;

	/*
	 * Bandwidth required by DEADLINE must always be granted while, for
	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
	 * to gracefully reduce the frequency when no tasks show up for longer
	 * periods of time.
	 *
	 * Ideally we would like to set bw_dl as min/guaranteed freq and util +
	 * bw_dl as requested freq. However, cpufreq is not yet ready for such
	 * an interface. So, we only do the latter for now.
	 */
	return min(max, util + cpu_bw_dl(rq));
}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function
  2018-07-06 15:12     ` Quentin Perret
  2018-07-06 15:49       ` Peter Zijlstra
@ 2018-07-06 15:50       ` Peter Zijlstra
  1 sibling, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-06 15:50 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Fri, Jul 06, 2018 at 04:12:12PM +0100, Quentin Perret wrote:

> I guess it is also worth mentioning that in the case mentioned above the
> scheduling decisions made inside one root domain can indirectly be
> impacted by the scheduling decisions made in another root domain. Some
> CPUs inside one root domain might look expensive to use simply because
> they run at a high OPP because of other CPUs of the same FD being
> over-used in another root domain.

That's simply how it is. There's nothing you can do about it, they share
a clock, teh end.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function
  2018-07-06 15:49       ` Peter Zijlstra
@ 2018-07-06 17:04         ` Quentin Perret
  2018-07-09 12:01           ` Peter Zijlstra
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2018-07-06 17:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Friday 06 Jul 2018 at 17:49:49 (+0200), Peter Zijlstra wrote:
> On Fri, Jul 06, 2018 at 04:12:12PM +0100, Quentin Perret wrote:
> > On Friday 06 Jul 2018 at 15:12:43 (+0200), Peter Zijlstra wrote:
> 
> > > Did you want to use sugov_get_util() here? There is no way we're going
> > > to duplicate all that.
> > 
> > I need to look into how we can do that ... Sugov looks at the current
> > util landscape while EAS tries to predict the _future_ util landscape.
> > Merging the two means I need to add a task and a dst_cpu as parameters
> > of sugov_get_util() and call cpu_util_next() from there, which doesn't
> > feel so clean ...
> 
> Just pass in the util_cfs as computed by cpu_util_next(), then schedutil
> will pass in cpu_util_cfs(), the rest is all the same I think.
> 
> See below.
> 
> > Also, if we merge sugov_get_util() and sugov_aggregate_util() with
> > Vincent's patch-set I'll need to make sure to return two values with
> > sugov_get_util(): 1) the sum of the util of all classes; and 2) the util
> > that will be used to request an OPP. 1) should be used in sum_util and
> > 2) could (but I don't think it's is a good idea) be used for max_util.
> 
> I am confused, the max/sum thing is composed of the same values, just a
> different operator. Both take 'util':
> 
> 
> +	util = schedutil_get_util(cpu, cpu_util_next(cpu, p, dst_cpu))
> +	max_util = max(util, max_util);
> +	sum_util += util;

'max_util' is basically the util we use to request an OPP. 'sum_util' is
how long the CPUs will be running. For now it's the same thing because I
just used the util of the different classes and assumed that OPPs follow
utilization, but if we start using sugov_get_util() things will become
more complex, especially because of RT tasks.

In your example, schedutil_get_util() will return arch_scale_cpu_capacity()
if a RT task is running. We're going to select an OPP with that so
that's what you want to put in max_util. But we also need to know how
long we are going to run at that OPP to compute the energy, and in this
case arch_scale_cpu_capacity is probably _not_ what you want to account in
sum_util ...

Now, I'm not sure if accounting the RT-go-to-max thing here is really
going to help us. This is really an ON/OFF thing, so depending on 'luck',
the OPP landscape that you'll observe in compute_energy() can be very
different depending on whether or not a RT task is running in a FD. And
that will change abruptly when the RT task goes to sleep.

What I'm proposing is to predict in compute_energy() what is the OPP at
which CFS tasks run (cfs_util + dl_util +rt_util, to make it simple),
and ignore the oddity of having an RT task running. But that makes it
harder to factorize things with schedutil more than they are, that is
true ...

I hope that makes sense

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function
  2018-07-06 17:04         ` Quentin Perret
@ 2018-07-09 12:01           ` Peter Zijlstra
  2018-07-09 15:28             ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-09 12:01 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Fri, Jul 06, 2018 at 06:04:44PM +0100, Quentin Perret wrote:
> 'max_util' is basically the util we use to request an OPP. 'sum_util' is
> how long the CPUs will be running.

A indeed. But yes bit of a mess, but it wants doing I think.
Perhaps a little something like the below... The alternative is two
separate but closely related functions, not sure which is better.


enum schedutil_type {
	frequency_util,
	energy_util,
};

unsigned long
schedutil_freq_util(int cpu, unsigned long util_cfs, enum schedutil_type type)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long util, irq, max;

	max = arch_scale_cpu_capacity(NULL, cpu);

	if (type == frequency_util && rt_rq_is_runnable(&rq->rt))
		return max;

	/*
	 * Early check to see if IRQ/steal time saturates the CPU, can be
	 * because of inaccuracies in how we track these -- see
	 * update_irq_load_avg().
	 */
	irq = cpu_util_irq(rq);
	if (unlikely(irq >= max))
		return max;

	/*
	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
	 * CFS tasks and we use the same metric to track the effective
	 * utilization (PELT windows are synchronized) we can directly add them
	 * to obtain the CPU's actual utilization.
	 */
	util = util_cfs;
	util += cpu_util_rt(rq);

	if (type == frequency_util) {
		/*
		 * For frequency selection we do not make cpu_util_dl()
		 * a permanent part of this sum because we want to use
		 * cpu_bw_dl() later on, but we need to check if the
		 * CFS+RT+DL sum is saturated (ie. no idle time) such
		 * that we select f_max when there is no idle time.
		 *
		 * NOTE: numerical errors or stop class might cause us
		 * to not quite hit saturation when we should --
		 * something for later.
		 */
		if ((util + cpu_util_dl(rq)) >= max)
			return max;
	} else {
		/*
		 * OTOH, for energy computation we need the estimated
		 * running time, so include util_dl and ignore dl_bw.
		 */
		util += cpu_util_dl(rq);
		if (util >= max)
			return max;
	}

	/*
	 * There is still idle time; further improve the number by using the
	 * irq metric. Because IRQ/steal time is hidden from the task clock we
	 * need to scale the task numbers:
	 *
	 *              1 - irq
	 *   U' = irq + ------- * U
	 *                max
	 */
	util *= (max - irq);
	util /= max;
	util += irq;

	if (type == frequency_util) {
		/*
		 * Bandwidth required by DEADLINE must always be granted
		 * while, for FAIR and RT, we use blocked utilization of
		 * IDLE CPUs as a mechanism to gracefully reduce the
		 * frequency when no tasks show up for longer periods of
		 * time.
		 *
		 * Ideally we would like to set bw_dl as min/guaranteed
		 * freq and util + bw_dl as requested freq. However,
		 * cpufreq is not yet ready for such an interface. So,
		 * we only do the latter for now.
		 */
		util += cpu_bw_dl(rq);
	}

	return min(max, util);
}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function
  2018-07-09 12:01           ` Peter Zijlstra
@ 2018-07-09 15:28             ` Quentin Perret
  2018-07-09 15:42               ` Peter Zijlstra
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2018-07-09 15:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Monday 09 Jul 2018 at 14:01:38 (+0200), Peter Zijlstra wrote:
> On Fri, Jul 06, 2018 at 06:04:44PM +0100, Quentin Perret wrote:
> > 'max_util' is basically the util we use to request an OPP. 'sum_util' is
> > how long the CPUs will be running.
> 
> A indeed. But yes bit of a mess, but it wants doing I think.

OK, understood.

> Perhaps a little something like the below... The alternative is two
> separate but closely related functions, not sure which is better.
> 
> 
> enum schedutil_type {
> 	frequency_util,
> 	energy_util,
> };

Hmm, this should work ... Should we also try to avoid the code
duplication in scale_rt_capacity() ? This doesn't want dependencies on
schedutil I assume so a little bit of extra work is required.

What about moving schedutil_freq_util() to sched.h and give it a more
generic name (total_cpu_util() ?) with no dependencies on
CONFIG_CPU_FREQ_GOV_SCHEDUTIL ? Also, 'energy_util' would have to be
renamed I guess (busy_time_util ?).

Then scale_rt_capacity() can be turned into something like (totally
untested):

static unsigned long scale_rt_capacity(int cpu)
{
	unsigned long util = total_cpu_util(cpu, 0, busy_time_util);

	return arch_scale_cpu_capacity(NULL, cpu) - util;
}

Thoughts ?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function
  2018-07-09 15:28             ` Quentin Perret
@ 2018-07-09 15:42               ` Peter Zijlstra
  2018-07-09 16:07                 ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2018-07-09 15:42 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Mon, Jul 09, 2018 at 04:28:26PM +0100, Quentin Perret wrote:
> What about moving schedutil_freq_util() to sched.h and give it a more
> generic name (total_cpu_util() ?) with no dependencies on
> CONFIG_CPU_FREQ_GOV_SCHEDUTIL ? Also, 'energy_util' would have to be
> renamed I guess (busy_time_util ?).

Bit big for an inline, and I don't think there's much point using all
this without schedutil. You already rely on it anyhow.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function
  2018-07-09 15:42               ` Peter Zijlstra
@ 2018-07-09 16:07                 ` Quentin Perret
  0 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-07-09 16:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Monday 09 Jul 2018 at 17:42:26 (+0200), Peter Zijlstra wrote:
> On Mon, Jul 09, 2018 at 04:28:26PM +0100, Quentin Perret wrote:
> > What about moving schedutil_freq_util() to sched.h and give it a more
> > generic name (total_cpu_util() ?) with no dependencies on
> > CONFIG_CPU_FREQ_GOV_SCHEDUTIL ? Also, 'energy_util' would have to be
> > renamed I guess (busy_time_util ?).
> 
> Bit big for an inline,

True, but that will be used in the wake-up path ...

> and I don't think there's much point using all
> this without schedutil.

For that total_cpu_util() function I was suggesting, the main point
would be to avoid code duplication with scale_rt_capacity(). Not sure
if that matters ?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-06-28 11:40 ` [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework Quentin Perret
                     ` (3 preceding siblings ...)
  2018-07-06  9:57   ` Vincent Guittot
@ 2018-07-09 18:07   ` Dietmar Eggemann
  2018-07-10  8:32     ` Quentin Perret
  4 siblings, 1 reply; 54+ messages in thread
From: Dietmar Eggemann @ 2018-07-09 18:07 UTC (permalink / raw)
  To: Quentin Perret, peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On 06/28/2018 01:40 PM, Quentin Perret wrote:

[...]

> +/**
> + * em_rescale_cpu_capacity() - Re-scale capacity values of the Energy Model
> + *
> + * This re-scales the capacity values for all capacity states of all frequency
> + * domains of the Energy Model. This should be used when the capacity values
> + * of the CPUs are updated at run-time, after the EM was registered.
> + */
> +void em_rescale_cpu_capacity(void)
> +{
> +	struct em_cs_table *old_table, *new_table;
> +	struct em_freq_domain *fd;
> +	int nr_states, cpu;
> +
> +	mutex_lock(&em_fd_mutex);
> +	rcu_read_lock();
> +	for_each_possible_cpu(cpu) {
> +		/* Re-scale only once per frequency domain. */
> +		fd = READ_ONCE(per_cpu(em_data, cpu));
> +		if (!fd || cpu != cpumask_first(to_cpumask(fd->cpus)))
> +			continue;
> +
> +		/* Copy the existing table. */
> +		old_table = rcu_dereference(fd->cs_table);
> +		nr_states = old_table->nr_cap_states;
> +		new_table = alloc_cs_table(nr_states);
> +		if (!new_table)
> +			goto out;
> +		memcpy(new_table->state, old_table->state,
> +					nr_states * sizeof(*new_table->state));
> +
> +		/* Re-scale the capacity values of the copy. */
> +		fd_update_cs_table(new_table,
> +					cpumask_first(to_cpumask(fd->cpus)));
> +
> +		/* Replace the fd table with the re-scaled version. */
> +		rcu_assign_pointer(fd->cs_table, new_table);
> +		call_rcu(&old_table->rcu, rcu_free_cs_table);
> +	}
> +out:
> +	rcu_read_unlock();
> +	mutex_unlock(&em_fd_mutex);
> +}
> +EXPORT_SYMBOL_GPL(em_rescale_cpu_capacity);

This em_rescale_cpu_capacity() function is still very much specific to 
systems with asymmetric cpu capacity (Arm big.Little/DynamIQ). Only 
after cpufreq is up we can determine the capacity of a CPU, hence we 
need this one to set the CPU capacity values for the individual 
performance states.

Can you not calculate capacity 'on the fly' just using freq and max freq 
as well as arch_scale_cpu_capacity() which gives you max capacity?

capacity = arch_scale_cpu_capacity() * freq / max_freq

In this case we could get rid of the 'ugly' EM rescaling infrastructure.

[...]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-07-09 18:07   ` Dietmar Eggemann
@ 2018-07-10  8:32     ` Quentin Perret
  2018-07-16 10:29       ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2018-07-10  8:32 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, rjw, linux-kernel, linux-pm, gregkh, mingo,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Monday 09 Jul 2018 at 20:07:31 (+0200), Dietmar Eggemann wrote:
> On 06/28/2018 01:40 PM, Quentin Perret wrote:
> This em_rescale_cpu_capacity() function is still very much specific to
> systems with asymmetric cpu capacity (Arm big.Little/DynamIQ). Only after
> cpufreq is up we can determine the capacity of a CPU, hence we need this one
> to set the CPU capacity values for the individual performance states.

The abstraction is that this function is needed by all systems where the
capacities of CPUs are discovered late, or can be changed at run-time.
But yeah, AFAICT this applies mainly to Arm big.LITTLE and/or DynamIQ
systems, at least for now.

> 
> Can you not calculate capacity 'on the fly' just using freq and max freq as
> well as arch_scale_cpu_capacity() which gives you max capacity?
> 
> capacity = arch_scale_cpu_capacity() * freq / max_freq
> 
> In this case we could get rid of the 'ugly' EM rescaling infrastructure.

Indeed, having 'capacity' values in the EM framework is just an
optimization for the scheduler, so that it doesn't need to compute them
in the wake-up path. I could get rid of the whole
em_rescale_cpu_capacity() mess (and by the same occasion the RCU
protection of the tables ...) if I removed the 'capacity' values from
the EM. But that means a slightly higher complexity on the scheduler side.

As you said, the capacity of a CPU at a specific OPP is:

    cap(opp) = freq(opp) * scale_cpu / max_freq

Now, we estimate the energy consumed by this CPU as:

    nrg = power(opp) * util / cap(opp)

because 'util / cap(opp)' represents its percentage of busy time. If I
inject the first equation in the second, I get:

    nrg =  power(opp) * util * max_freq / (scale_cpu * freq(opp))

and this can be re-arranged as:

    nrg = (power(opp) * max_freq / freq(opp)) * (util / scale_cpu)

In the above equation, the first term between () is static so I could
pre-compute it as a 'cost' for that OPP and store it in the EM table:

    cost(opp) = power(opp) * max_freq / freq(opp)

And then the energy calculation would be something like:

   nrg = cost(opp) * util / scale_cpu

If 'scale_cpu' was static, I could fold it into 'cost' and avoid the cost
of the division. But it's not really, so I can either re-do the division
all the time on the scheduler side, or play with RCU to cache the result
of the division in the EM framework.

(I now realize that the current implementation of em_fd_energy() does
'cs->power * sum_util / cs->capacity' but the power / capacity ratio is
constant so, if we decide to keep the capacity values in the EM, I
should still cache 'power / capacity' in the EM tables and actually save
the division ...)

This is really a performance vs. code complexity trade-off. I made the
choice of performance since we're talking about the scheduler here, but
I'm not sure how much we really save by saving this division TBH.

Thoughts ?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-07-10  8:32     ` Quentin Perret
@ 2018-07-16 10:29       ` Quentin Perret
  2018-07-17  8:57         ` Dietmar Eggemann
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2018-07-16 10:29 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, rjw, linux-kernel, linux-pm, gregkh, mingo,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Tuesday 10 Jul 2018 at 09:32:02 (+0100), Quentin Perret wrote:
> Indeed, having 'capacity' values in the EM framework is just an
> optimization for the scheduler, so that it doesn't need to compute them
> in the wake-up path. I could get rid of the whole
> em_rescale_cpu_capacity() mess (and by the same occasion the RCU
> protection of the tables ...) if I removed the 'capacity' values from
> the EM.

Another thing to take into consideration here is basically that the
thermal subsystem (IPA) will be impacted by the RCU protection on the
cs_table. However, since the 'frequency' and 'power' fields do not change
at run-time, and since IPA doesn't need the 'capacity' value, there is no
good reason to have IPA do rcu_read_lock() all over the place, so
arguably something needs to be fixed here.

One possibility is to remove entirely the following struct:
	struct em_cap_state {
		unsigned long capacity;
		unsigned long frequency; /* Kilo-hertz */
		unsigned long power; /* Milli-watts */
	};

and to have three independent vectors (of the same size) for frequency,
power and capacity. That way only the 'capacity' vector would be RCU
protected, and IPA could use 'frequency' and 'power' directly, without
further protections.

A second possibility is to remove the capacity values from the EM
altogether (as I suggested in my previous message) and to get rid of the
need for RCU protection at the same occasion.

The second option simplifies the code of the EM framework significantly
(no more em_rescale_cpu_capacity()) and shouldn't introduce massive
overheads on the scheduler side (the energy calculation already
requires one multiplication and one division, so nothing new on that
side). At the same time, that would make it a whole lot easier to
interface the EM framework with IPA without having to deal with RCU all
over the place.

So, if there are no objections, I'll try to explore that possibility for
v5.

I hope that makes sense

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-07-16 10:29       ` Quentin Perret
@ 2018-07-17  8:57         ` Dietmar Eggemann
  2018-07-17 14:19           ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Dietmar Eggemann @ 2018-07-17  8:57 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, rjw, linux-kernel, linux-pm, gregkh, mingo,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On 07/16/2018 12:29 PM, Quentin Perret wrote:
> On Tuesday 10 Jul 2018 at 09:32:02 (+0100), Quentin Perret wrote:

[...]

> Another thing to take into consideration here is basically that the
> thermal subsystem (IPA) will be impacted by the RCU protection on the
> cs_table. However, since the 'frequency' and 'power' fields do not change
> at run-time, and since IPA doesn't need the 'capacity' value, there is no
> good reason to have IPA do rcu_read_lock() all over the place, so
> arguably something needs to be fixed here.
> 
> One possibility is to remove entirely the following struct:
> 	struct em_cap_state {
> 		unsigned long capacity;
> 		unsigned long frequency; /* Kilo-hertz */
> 		unsigned long power; /* Milli-watts */
> 	};
> 
> and to have three independent vectors (of the same size) for frequency,
> power and capacity. That way only the 'capacity' vector would be RCU
> protected, and IPA could use 'frequency' and 'power' directly, without
> further protections.
> 
> A second possibility is to remove the capacity values from the EM
> altogether (as I suggested in my previous message) and to get rid of the
> need for RCU protection at the same occasion.

I see an impact of 'calculating capacity on the fly' in 
compute_energy()->em_fd_energy(). Running the first energy test case (# 
task equal 10) on the Juno r0 board with function profiling gives me:

v4:

   Function              Hit    Time            Avg             s^2
A53 - cpu [0,3-5]
   compute_energy      14620    30790.86 us     2.106 us        8.421 us
   compute_energy        682    1512.960 us     2.218 us        0.154 us
   compute_energy       1207    2627.820 us     2.177 us        0.172 us
   compute_energy         93    206.720 us      2.222 us        0.151 us
A57 - cpu [1-2]
   compute_energy        153    160.100 us      1.046 us        0.190 us
   compute_energy        136    130.760 us      0.961 us        0.077 us


v4 + 'calculating capacity on the fly':

   Function              Hit    Time            Avg             s^2
A53 - cpu [0,3-5]
   compute_energy      11623    26941.12 us     2.317 us        12.203 us
   compute_energy       5062    11771.48 us     2.325 us        0.819 us
   compute_energy       4391    10396.78 us     2.367 us        1.753 us
   compute_energy       2234    5265.640 us     2.357 us        0.955 us
A57 - cpu [1-2]
   compute_energy         59    66.020 us       1.118 us        0.112 us
   compute_energy        229    234.880 us      1.025 us        0.135 us

The code is not optimized, I just replaced cs->capacity with 
arch_scale_cpu_capacity(NULL, cpu) (max_cap) and 'max_cap * 
cs->frequency / max_freq' respectively.
There are 3 compute_energy() calls per wake-up on a system with 2 
frequency domains.

> The second option simplifies the code of the EM framework significantly
> (no more em_rescale_cpu_capacity()) and shouldn't introduce massive
> overheads on the scheduler side (the energy calculation already
> requires one multiplication and one division, so nothing new on that
> side). At the same time, that would make it a whole lot easier to
> interface the EM framework with IPA without having to deal with RCU all
> over the place.

IMO, em_rescale_cpu_capacity() is just the capacity related example what 
the EM needs if its values can be changed at runtime. There might be 
other use cases in the future like changing power values depending on 
temperature.
So maybe it's a good idea to not have this 'EM values can change at 
runtime' feature in the first version of the EM and emphasize on 
simplicity of the code instead (if we can eliminate the extra runtime 
overhead).

[...]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-07-17  8:57         ` Dietmar Eggemann
@ 2018-07-17 14:19           ` Quentin Perret
  2018-07-17 16:00             ` Dietmar Eggemann
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2018-07-17 14:19 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, rjw, linux-kernel, linux-pm, gregkh, mingo,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

Hi Dietmar,

On Tuesday 17 Jul 2018 at 10:57:13 (+0200), Dietmar Eggemann wrote:
> On 07/16/2018 12:29 PM, Quentin Perret wrote:
> I see an impact of 'calculating capacity on the fly' in
> compute_energy()->em_fd_energy(). Running the first energy test case (# task
> equal 10) on the Juno r0 board with function profiling gives me:
> 
> v4:
> 
>   Function              Hit    Time            Avg             s^2
> A53 - cpu [0,3-5]
>   compute_energy      14620    30790.86 us     2.106 us        8.421 us
>   compute_energy        682    1512.960 us     2.218 us        0.154 us
>   compute_energy       1207    2627.820 us     2.177 us        0.172 us
>   compute_energy         93    206.720 us      2.222 us        0.151 us
> A57 - cpu [1-2]
>   compute_energy        153    160.100 us      1.046 us        0.190 us
>   compute_energy        136    130.760 us      0.961 us        0.077 us
> 
> 
> v4 + 'calculating capacity on the fly':
> 
>   Function              Hit    Time            Avg             s^2
> A53 - cpu [0,3-5]
>   compute_energy      11623    26941.12 us     2.317 us        12.203 us
>   compute_energy       5062    11771.48 us     2.325 us        0.819 us
>   compute_energy       4391    10396.78 us     2.367 us        1.753 us
>   compute_energy       2234    5265.640 us     2.357 us        0.955 us
> A57 - cpu [1-2]
>   compute_energy         59    66.020 us       1.118 us        0.112 us
>   compute_energy        229    234.880 us      1.025 us        0.135 us
> 
> The code is not optimized, I just replaced cs->capacity with
> arch_scale_cpu_capacity(NULL, cpu) (max_cap) and 'max_cap * cs->frequency /
> max_freq' respectively.
> There are 3 compute_energy() calls per wake-up on a system with 2 frequency
> domains.

First, thank you very much for looking into this :-)

So, I guess you see this overhead because of the extra division involved
by computing 'cap = max_cap * cs->frequency / max_freq'. However, I
think there is an opportunity to optimize things a bit and avoid that
overhead entirely. My suggestion is to remove the 'capacity' field from
the em_cap_state struct and to add a 'cost' parameter instead:

struct em_cap_state {
	unsigned long frequency;
	unsigned long power;
	unsigned long cost;
};

I define the 'cost' of a capacity state as:

	cost = power * max_freq / freq;

Since 'power', 'max_freq' and 'freq' do not change at run-time (as opposed
to 'capacity'), this coefficient is static and computed when the table is
first created. Then, based on this, you can implement em_fd_energy() as
follows:

static inline unsigned long em_fd_energy(struct em_freq_domain *fd,
				unsigned long max_util, unsigned long sum_util)
{
	unsigned long freq, scale_cpu;
	struct em_cap_state *cs;
	int i, cpu;

	/* Map the utilization value to a frequency */
	cpu = cpumask_first(to_cpumask(fd->cpus));
	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
	cs = &fd->table[fd->nr_cap_states - 1];
	freq = map_util_freq(max_util, cs->frequency, scale_cpu);

	/* Find the lowest capacity state above this frequency */
	for (i = 0; i < fd->nr_cap_states; i++) {
		cs = &fd->table[i];
		if (cs->frequency >= freq)
			break;
	}

	/*
	 * The capacity of a CPU at a specific performance state is defined as:
	 *
	 *     cap = freq * scale_cpu / max_freq
	 *
	 * The energy consumed by this CPU can be estimated as:
	 *
	 *     nrg = power * util / cap
	 *
	 * because (util / cap) represents the percentage of busy time of the
	 * CPU. Based on those definitions, we have:
	 *
	 *     nrg = power * util * max_freq / (scale_cpu * freq)
	 *
	 * which can be re-arranged as a product of two terms:
	 *
	 *     nrg = (power * max_freq / freq) * (util / scale_cpu)
	 *
	 * The first term is static, and is stored in the em_cap_state struct
	 * as 'cost'. The parameters of the second term change at run-time.
	 */
	return cs->cost * sum_util / scale_cpu;
}

With the above implementation, there is no additional division in
em_fd_energy() compared to v4, so I would expect to see no significant
difference in computation time.

I tried to reproduce your test case and I get the following numbers on
my Juno r0 (using the performance governor):

v4:
***
  Function           Hit    Time            Avg             s^2
A53 - cpu [0,3-5]
  compute_energy    1796    12685.66 us     7.063 us        0.039 us
  compute_energy    4214    28060.02 us     6.658 us        0.919 us
  compute_energy    2743    20167.86 us     7.352 us        0.067 us
  compute_energy   13958    97122.68 us     6.958 us        9.255 us
A57 - cpu [1-2]
  compute_energy      86    448.800 us      5.218 us        0.106 us
  compute_energy     163    847.600 us      5.200 us        0.128 us


'v5' (with 'cost'):
*******************
  Function           Hit    Time            Avg             s^2
A53 - cpu [0,3-5]
  compute_energy    1695    11153.54 us     6.580 us        0.022 us
  compute_energy   16823    113709.5 us     6.759 us        27.109 us
  compute_energy     677    4490.060 us     6.632 us        0.028 us
  compute_energy    1959    13595.66 us     6.940 us        0.029 us
A57 - cpu [1-2]
  compute_energy     211    1089.860 us     5.165 us        0.122 us
  compute_energy      83    420.860 us      5.070 us        0.075 us


So I don't observe any obvious regression with my optimization applied.
The v4 branch I used is the one mentioned in the cover letter:
http://www.linux-arm.org/git?p=linux-qp.git;a=shortlog;h=refs/heads/upstream/eas_v4

And I just pushed the WiP branch I used to compare against:
http://www.linux-arm.org/git?p=linux-qp.git;a=shortlog;h=refs/heads/upstream/eas_v5-WiP-compute_energy_profiling

Is this also fixing the regression on your side ?

> 
> > The second option simplifies the code of the EM framework significantly
> > (no more em_rescale_cpu_capacity()) and shouldn't introduce massive
> > overheads on the scheduler side (the energy calculation already
> > requires one multiplication and one division, so nothing new on that
> > side). At the same time, that would make it a whole lot easier to
> > interface the EM framework with IPA without having to deal with RCU all
> > over the place.
> 
> IMO, em_rescale_cpu_capacity() is just the capacity related example what the
> EM needs if its values can be changed at runtime. There might be other use
> cases in the future like changing power values depending on temperature.
> So maybe it's a good idea to not have this 'EM values can change at runtime'
> feature in the first version of the EM and emphasize on simplicity of the
> code instead (if we can eliminate the extra runtime overhead).

I agree that it would be nice to keep it simple in the beginning. If
there is strong and demonstrated use-case for updating the EM at
run-time later, then we can re-introduce the RCU protection. But until
then, we can avoid the complex implementation at no obvious cost (given
my results above) so that sounds like a good trade-off to me :-)

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework
  2018-07-17 14:19           ` Quentin Perret
@ 2018-07-17 16:00             ` Dietmar Eggemann
  0 siblings, 0 replies; 54+ messages in thread
From: Dietmar Eggemann @ 2018-07-17 16:00 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, rjw, linux-kernel, linux-pm, gregkh, mingo,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On 07/17/2018 04:19 PM, Quentin Perret wrote:
> Hi Dietmar,
> 
> On Tuesday 17 Jul 2018 at 10:57:13 (+0200), Dietmar Eggemann wrote:
>> On 07/16/2018 12:29 PM, Quentin Perret wrote:

[...]

> So, I guess you see this overhead because of the extra division involved
> by computing 'cap = max_cap * cs->frequency / max_freq'. However, I
> think there is an opportunity to optimize things a bit and avoid that
> overhead entirely. My suggestion is to remove the 'capacity' field from
> the em_cap_state struct and to add a 'cost' parameter instead:
> 
> struct em_cap_state {
> 	unsigned long frequency;
> 	unsigned long power;
> 	unsigned long cost;
> };
> 
> I define the 'cost' of a capacity state as:
> 
> 	cost = power * max_freq / freq;
> 
> Since 'power', 'max_freq' and 'freq' do not change at run-time (as opposed
> to 'capacity'), this coefficient is static and computed when the table is
> first created. Then, based on this, you can implement em_fd_energy() as
> follows:
> 
> static inline unsigned long em_fd_energy(struct em_freq_domain *fd,
> 				unsigned long max_util, unsigned long sum_util)
> {
> 	unsigned long freq, scale_cpu;
> 	struct em_cap_state *cs;
> 	int i, cpu;
> 
> 	/* Map the utilization value to a frequency */
> 	cpu = cpumask_first(to_cpumask(fd->cpus));
> 	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> 	cs = &fd->table[fd->nr_cap_states - 1];
> 	freq = map_util_freq(max_util, cs->frequency, scale_cpu);
> 
> 	/* Find the lowest capacity state above this frequency */
> 	for (i = 0; i < fd->nr_cap_states; i++) {
> 		cs = &fd->table[i];
> 		if (cs->frequency >= freq)
> 			break;
> 	}
> 
> 	/*
> 	 * The capacity of a CPU at a specific performance state is defined as:
> 	 *
> 	 *     cap = freq * scale_cpu / max_freq
> 	 *
> 	 * The energy consumed by this CPU can be estimated as:
> 	 *
> 	 *     nrg = power * util / cap
> 	 *
> 	 * because (util / cap) represents the percentage of busy time of the
> 	 * CPU. Based on those definitions, we have:
> 	 *
> 	 *     nrg = power * util * max_freq / (scale_cpu * freq)
> 	 *
> 	 * which can be re-arranged as a product of two terms:
> 	 *
> 	 *     nrg = (power * max_freq / freq) * (util / scale_cpu)
> 	 *
> 	 * The first term is static, and is stored in the em_cap_state struct
> 	 * as 'cost'. The parameters of the second term change at run-time.
> 	 */
> 	return cs->cost * sum_util / scale_cpu;
> }
> 
> With the above implementation, there is no additional division in
> em_fd_energy() compared to v4, so I would expect to see no significant
> difference in computation time.
> 
> I tried to reproduce your test case and I get the following numbers on
> my Juno r0 (using the performance governor):
> 
> v4:
> ***
>    Function           Hit    Time            Avg             s^2
> A53 - cpu [0,3-5]
>    compute_energy    1796    12685.66 us     7.063 us        0.039 us
>    compute_energy    4214    28060.02 us     6.658 us        0.919 us
>    compute_energy    2743    20167.86 us     7.352 us        0.067 us
>    compute_energy   13958    97122.68 us     6.958 us        9.255 us
> A57 - cpu [1-2]
>    compute_energy      86    448.800 us      5.218 us        0.106 us
>    compute_energy     163    847.600 us      5.200 us        0.128 us
> 
> 
> 'v5' (with 'cost'):
> *******************
>    Function           Hit    Time            Avg             s^2
> A53 - cpu [0,3-5]
>    compute_energy    1695    11153.54 us     6.580 us        0.022 us
>    compute_energy   16823    113709.5 us     6.759 us        27.109 us
>    compute_energy     677    4490.060 us     6.632 us        0.028 us
>    compute_energy    1959    13595.66 us     6.940 us        0.029 us
> A57 - cpu [1-2]
>    compute_energy     211    1089.860 us     5.165 us        0.122 us
>    compute_energy      83    420.860 us      5.070 us        0.075 us
> 
> 
> So I don't observe any obvious regression with my optimization applied.
> The v4 branch I used is the one mentioned in the cover letter:
> http://www.linux-arm.org/git?p=linux-qp.git;a=shortlog;h=refs/heads/upstream/eas_v4

Yeah, just realized that I used the wrong eas_v4 branch. With the one 
you mentioned here I still get ~0.2-0.3us diff with the non-optimized 
approach but at least values in the same ballpark as yours (performance 
governor to keep s^2 low):

v4:

    Function             Hit    Time            Avg             s^2
A53 - cpu [0,3-5]
   compute_energy        233    1455.140 us     6.245 us        0.022 us
...
A57 - cpu [1-2]
   compute_energy        130    602.980 us      4.638 us        0.043 us

v4 + '(naive) calculating capacity on the fly':

   Function              Hit    Time            Avg             s^2
A53 - cpu [0,3-5]
   compute_energy        531    3460.200 us     6.516 us        0.044 us
A57 - cpu [1-2]
   compute_energy        141    700.220 us      4.966 us        0.106 us

> And I just pushed the WiP branch I used to compare against:
> http://www.linux-arm.org/git?p=linux-qp.git;a=shortlog;h=refs/heads/upstream/eas_v5-WiP-compute_energy_profiling
> 
> Is this also fixing the regression on your side ?

I assume with you 'unsigned long cost' optimization I will get the same 
test result than you so I guess that's the optimization which assures 
that we don't have to pay the simplification of the EM with scheduler 
runtime.

[...]

>> IMO, em_rescale_cpu_capacity() is just the capacity related example what the
>> EM needs if its values can be changed at runtime. There might be other use
>> cases in the future like changing power values depending on temperature.
>> So maybe it's a good idea to not have this 'EM values can change at runtime'
>> feature in the first version of the EM and emphasize on simplicity of the
>> code instead (if we can eliminate the extra runtime overhead).
> 
> I agree that it would be nice to keep it simple in the beginning. If
> there is strong and demonstrated use-case for updating the EM at
> run-time later, then we can re-introduce the RCU protection. But until
> then, we can avoid the complex implementation at no obvious cost (given
> my results above) so that sounds like a good trade-off to me :-)

Agreed.

[...]


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 12/12] OPTIONAL: cpufreq: dt: Register an Energy Model
  2018-07-06 10:18     ` Quentin Perret
@ 2018-07-30 15:53       ` Vincent Guittot
  2018-07-30 16:20         ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Vincent Guittot @ 2018-07-30 15:53 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

Hi Quentin,

I have noticed the new version but prefer to continue on this thread
to keep the history of the discussion.

On Fri, 6 Jul 2018 at 12:19, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Friday 06 Jul 2018 at 12:10:02 (+0200), Vincent Guittot wrote:
> > On Thu, 28 Jun 2018 at 13:41, Quentin Perret <quentin.perret@arm.com> wrote:
> > > +static int of_est_power(unsigned long *mW, unsigned long *KHz, int cpu)
> > > +{
> > > +       unsigned long mV, Hz, MHz;
> > > +       struct device *cpu_dev;
> > > +       struct dev_pm_opp *opp;
> > > +       struct device_node *np;
> > > +       u32 cap;
> > > +       u64 tmp;
> > > +
> > > +       cpu_dev = get_cpu_device(cpu);
> > > +       if (!cpu_dev)
> > > +               return -ENODEV;
> > > +
> > > +       np = of_node_get(cpu_dev->of_node);
> > > +       if (!np)
> > > +               return -EINVAL;
> > > +
> > > +       if (of_property_read_u32(np, "dynamic-power-coefficient", &cap))
> > > +               return -EINVAL;
> > > +
> > > +       Hz = *KHz * 1000;
> > > +       opp = dev_pm_opp_find_freq_ceil(cpu_dev, &Hz);
> > > +       if (IS_ERR(opp))
> > > +               return -EINVAL;
> > > +
> > > +       mV = dev_pm_opp_get_voltage(opp) / 1000;
> > > +       dev_pm_opp_put(opp);
> > > +
> > > +       MHz = Hz / 1000000;
> > > +       tmp = (u64)cap * mV * mV * MHz;
> > > +       do_div(tmp, 1000000000);
> >
> > Could you explain the formula above ? and especially the 1000000000 it
> > seems related to the use of mV and mW instead of uV and uW ...
>
> That's correct, this is just a matter of units. I simply tried to
> reproduce here what is currently done for IPA:
>
> https://elixir.bootlin.com/linux/latest/source/drivers/thermal/cpu_cooling.c#L252

ok, so you copy/paste what is done in cpu cooling device ?

Nevertheless I still have some concerns with the formula used here and
in cpu cooling device:

https://elixir.bootlin.com/linux/latest/source/Documentation/devicetree/bindings/arm/cpus.txt#L273
The documentation says
Pdyn = dynamic-power-coefficient * V^2 * f where voltage is in uV,
frequency is in MHz.
 and  dynamic power coefficient in units of mW/MHz/uV^2.

This implies that the results is in mW and we can summarize the formula with :

mW = C * (uV)^2 * Mhz with C = dynamic-power-coefficient
then uV = 1000*mV
mW = C * (1000*mV)^2 * Mhz
mW = C * 1000000 * mV * mV * Mhz

Which is different from what is used here and in cpu cooling device.
I may have done something wrong with the formula but i can't catch
it... My brain is not yet fully back from vacations

Mhz = Hz / 1000000

mW = C * mV * mV * 1000000 * Hz / 1000000
mW = C * mV * mV * Hz

Let take a simple example with
C = 1
Voltage = 1000 uV = 1 mV
freq = 2 Mhz = 2000000 Hz

mW = C * (uV)^2 * Mhz = 1 * (1000)^2 * 2 = 2000000
mW = C * mV * mV * Hz = 1 * 1 * 1 * 2000000 = 2000000
mW = C * mV * mV * MHz / 1000000000 = 1 * 1 * 1 * 2 / 1000000000 = 0

so there is a mismatch between the formula in the documentation and
the formula used in cpu_cooling.c (and in this patch).

That being said, I can understand that we can easily overflow the
result if we use some kind of realistic values like the one for
hikey960:
dynamic-power-coefficient  for cpu4 is : 550
for highest OPP of cpu4:
opp-hz = /bits/ 64 <2362000000>;
opp-microvolt = <1100000>;

Which gives with the formula in the documentation:
mW = 550 * (1100000)^2 * 2362 = 1,571911×10^18 which is quite huge
even for a big core running at max freq
and with the formula in cpu_cooling.c:
mW = 550 * 1100 * 1100 * 2362 / 1000000000 = 1571 which seems to be a
bit more reasonable

So it seems that the dynamic-power-coefficient unit is not mW/MHz/uV^2
but mW/Ghz/V^2
And given that we use integer, we provide Voltage and frequency by
(mV/1000) and (Mhz/1000) and factorize all /1000

May be Javi who added the formula for cpu cooling can confirm on this ?

>
> >
> > Can't you just optimize that into
> > tmp = (u64)cap * mV * mV * Hz;
> > do_div(tmp, 1000);

In fact the do_div is not needed, I thought that dyn coef unit was
uW/Mhz/uV^2 whereas it's written mW/MHz/uV^2

>
> I don't think that'll work. If you use Hz instead of MHz you'll need to
> divide tmp by 1000000000000000 to get milli-watts, as specified by the
> EM framework.
>
> FYI, I don't consider this patch to be ready to be merged. I kept it
> self-contained and simple for the example but I actually think that this
> of_est_power function should probably be factored-out somewhere else.
> scpi-cpufreq could use it as well. Anyway, I guess we can discuss that
> later once the APIs of the EM framework are agreed :-)

It's just that the tests results that you provided in the cover letter
has used this patch and running tests with wrong formula (it seems
that's the documentation is wrong) doesn't give much confidence on the
results

Thanks
Vincent
>
> Thanks,
> Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 12/12] OPTIONAL: cpufreq: dt: Register an Energy Model
  2018-07-30 15:53       ` Vincent Guittot
@ 2018-07-30 16:20         ` Quentin Perret
  0 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2018-07-30 16:20 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

Hi Vincent,

On Monday 30 Jul 2018 at 17:53:23 (+0200), Vincent Guittot wrote:
[...]
> ok, so you copy/paste what is done in cpu cooling device ?
> 
> Nevertheless I still have some concerns with the formula used here and
> in cpu cooling device:
> 
> https://elixir.bootlin.com/linux/latest/source/Documentation/devicetree/bindings/arm/cpus.txt#L273
> The documentation says
> Pdyn = dynamic-power-coefficient * V^2 * f where voltage is in uV,
> frequency is in MHz.
>  and  dynamic power coefficient in units of mW/MHz/uV^2.
> 
> This implies that the results is in mW and we can summarize the formula with :
> 
> mW = C * (uV)^2 * Mhz with C = dynamic-power-coefficient
> then uV = 1000*mV
> mW = C * (1000*mV)^2 * Mhz
> mW = C * 1000000 * mV * mV * Mhz
> 
> Which is different from what is used here and in cpu cooling device.
> I may have done something wrong with the formula but i can't catch
> it... My brain is not yet fully back from vacations

I think your brain works fine ;) The doc does seem to be incorrect ...
I also believe that this issue has been discussed recently in another
thread:

http://archive.armlinux.org.uk/lurker/message/20180720.141530.2bf75d5c.en.html

> Which gives with the formula in the documentation:
> mW = 550 * (1100000)^2 * 2362 = 1,571911×10^18 which is quite huge
> even for a big core running at max freq
> and with the formula in cpu_cooling.c:
> mW = 550 * 1100 * 1100 * 2362 / 1000000000 = 1571 which seems to be a
> bit more reasonable

Indeed, 1.5W for a big CPU at max OPP is in the right ballpark I
believe.

FYI, this is the power values I have in the EM for all OPPs on H960:

        +--------+-------+--------+--------+
        |      A53s      |       A73s      |
        +--------+-------+--------+--------+
        |  MHz   |  mW   |  MHz   |  mW    |
        +--------+-------+--------+--------+
        |  533   |  28   |  903   |  243   |
        |  999   |  70   |  1421  |  500   |
        |  1402  |  124  |  1805  |  804   |
        |  1709  |  187  |  2112  |  1161  |
        |  1844  |  245  |  2362  |  1571  |
        +--------+-------+--------+--------+

[...]

> It's just that the tests results that you provided in the cover letter
> has used this patch and running tests with wrong formula (it seems
> that's the documentation is wrong) doesn't give much confidence on the
> results

I can understand that the mismatch between the code and the
documentation can be slightly confusing. However, as you said yourself
the code _is_ correct, so the test results are valid.

Moreover, the actual unit for the power costs doesn't make any
difference with EAS as long as that doesn't cause precision issues. EAS
works in relative terms, so even if the unit was in, say, uW instead of
mW, the test results would still be valid. And again, that's not the
case, the code does the right thing here.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2018-07-30 16:20 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-28 11:40 [RFC PATCH v4 00/12] Energy Aware Scheduling Quentin Perret
2018-06-28 11:40 ` [RFC PATCH v4 01/12] sched: Relocate arch_scale_cpu_capacity Quentin Perret
2018-06-28 11:40 ` [RFC PATCH v4 02/12] sched/cpufreq: Factor out utilization to frequency mapping Quentin Perret
2018-06-28 11:40 ` [RFC PATCH v4 03/12] PM: Introduce an Energy Model management framework Quentin Perret
2018-07-05 14:31   ` Peter Zijlstra
2018-07-05 15:24     ` Quentin Perret
2018-07-05 14:39   ` Peter Zijlstra
2018-07-05 15:09     ` Quentin Perret
2018-07-05 15:06   ` Peter Zijlstra
2018-07-05 15:32     ` Quentin Perret
2018-07-06  9:57   ` Vincent Guittot
2018-07-06 10:03     ` Peter Zijlstra
2018-07-06 10:06       ` Quentin Perret
2018-07-06 10:05     ` Quentin Perret
2018-07-09 18:07   ` Dietmar Eggemann
2018-07-10  8:32     ` Quentin Perret
2018-07-16 10:29       ` Quentin Perret
2018-07-17  8:57         ` Dietmar Eggemann
2018-07-17 14:19           ` Quentin Perret
2018-07-17 16:00             ` Dietmar Eggemann
2018-06-28 11:40 ` [RFC PATCH v4 04/12] PM / EM: Expose the Energy Model in sysfs Quentin Perret
2018-06-28 11:40 ` [RFC PATCH v4 05/12] sched/topology: Reference the Energy Model of CPUs when available Quentin Perret
2018-07-05 17:29   ` Peter Zijlstra
2018-07-05 17:48     ` Quentin Perret
2018-07-05 17:33   ` Peter Zijlstra
2018-07-05 17:50     ` Quentin Perret
2018-07-05 18:14       ` Peter Zijlstra
2018-06-28 11:40 ` [RFC PATCH v4 06/12] sched/topology: Lowest energy aware balancing sched_domain level pointer Quentin Perret
2018-06-28 11:40 ` [RFC PATCH v4 07/12] sched/topology: Introduce sched_energy_present static key Quentin Perret
2018-06-28 11:40 ` [RFC PATCH v4 08/12] sched: Add over-utilization/tipping point indicator Quentin Perret
2018-07-06 11:32   ` Peter Zijlstra
2018-07-06 13:20     ` Quentin Perret
2018-07-06 13:24       ` Peter Zijlstra
2018-07-06 13:40         ` Quentin Perret
2018-07-06 11:39   ` Peter Zijlstra
2018-07-06 11:41   ` Peter Zijlstra
2018-07-06 11:49     ` Valentin Schneider
2018-06-28 11:40 ` [RFC PATCH v4 09/12] sched/fair: Introduce an energy estimation helper function Quentin Perret
2018-07-06 13:12   ` Peter Zijlstra
2018-07-06 15:12     ` Quentin Perret
2018-07-06 15:49       ` Peter Zijlstra
2018-07-06 17:04         ` Quentin Perret
2018-07-09 12:01           ` Peter Zijlstra
2018-07-09 15:28             ` Quentin Perret
2018-07-09 15:42               ` Peter Zijlstra
2018-07-09 16:07                 ` Quentin Perret
2018-07-06 15:50       ` Peter Zijlstra
2018-06-28 11:40 ` [RFC PATCH v4 10/12] sched/fair: Select an energy-efficient CPU on task wake-up Quentin Perret
2018-06-28 11:40 ` [RFC PATCH v4 11/12] OPTIONAL: arch_topology: Start Energy Aware Scheduling Quentin Perret
2018-06-28 11:40 ` [RFC PATCH v4 12/12] OPTIONAL: cpufreq: dt: Register an Energy Model Quentin Perret
2018-07-06 10:10   ` Vincent Guittot
2018-07-06 10:18     ` Quentin Perret
2018-07-30 15:53       ` Vincent Guittot
2018-07-30 16:20         ` Quentin Perret

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).