LKML Archive on lore.kernel.org
* [PATCH v3 0/7] sched: support schedstats for RT sched class
@ 2021-08-24 11:29 Yafang Shao
  2021-08-24 11:29 ` [PATCH v3 1/7] sched, fair: use __schedstat_set() in set_next_entity() Yafang Shao
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Yafang Shao @ 2021-08-24 11:29 UTC
  To: mingo, peterz, mgorman, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, bristot, achaiken
  Cc: lkp, linux-kernel, linux-rt-users, Yafang Shao

Hi Ingo, Peter,

This feature is useful to trace the sched details of RT tasks. Hopefully
you can give some feedback on it.

We want to measure the latency of RT tasks in our production
environment with the schedstats facility, but schedstats is currently
supported only for the fair sched class. To support it for other sched
classes, we should make it independent of the fair sched class. The struct
sched_statistics holds the scheduler statistics of a task_struct or a
task_group, both of which are independent of sched class. So we can move
struct sched_statistics into struct task_struct and struct task_group to
achieve the goal.

After the patchset, schedstats are organized as follows:
struct task_struct {
    ...
    struct sched_statistics stats;
    ...
    struct sched_entity se;
    struct sched_rt_entity rt;
    ...
};

struct task_group {                    |---> stats[0] : of CPU0
    ...                                |
    struct sched_statistics **stats; --|---> stats[1] : of CPU1
    ...                                |
                                       |---> stats[n] : of CPUn
 #ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity **se;
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
    struct sched_rt_entity  **rt_se;
 #endif
    ...
};
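
For illustration, the per-CPU layout of the group statistics can be
sketched in user space as below. This is only a sketch under stated
assumptions: every demo_* name is hypothetical, calloc()/free() stand in
for the kernel allocators, and the CPU count is passed in explicitly. It
keeps the "return 1 on success, 0 on failure" convention used by the
group allocation helpers in this series.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for struct sched_statistics. */
struct demo_stats {
	uint64_t wait_sum;
};

/* Hypothetical stand-in for struct task_group: one stats block per CPU. */
struct demo_group {
	struct demo_stats **stats;
	unsigned int nr_cpus;
};

/* Free every per-CPU block, then the pointer array itself. */
static void demo_free_group_stats(struct demo_group *g)
{
	assert(g != NULL);
	if (g->stats) {
		for (unsigned int i = 0; i < g->nr_cpus; i++)
			free(g->stats[i]);
	}
	free(g->stats);
	g->stats = NULL;
}

/* Allocate a zeroed stats block per CPU; returns 1 on success, 0 on failure. */
static int demo_alloc_group_stats(struct demo_group *g, unsigned int nr_cpus)
{
	assert(g != NULL);
	g->nr_cpus = nr_cpus;
	g->stats = calloc(nr_cpus, sizeof(*g->stats));
	if (!g->stats)
		return 0;

	for (unsigned int i = 0; i < nr_cpus; i++) {
		g->stats[i] = calloc(1, sizeof(**g->stats));
		if (!g->stats[i]) {
			demo_free_group_stats(g);
			return 0;
		}
	}
	return 1;
}

/* Sum one counter over all CPUs, as a group-wide reader would. */
static uint64_t demo_group_wait_sum(const struct demo_group *g)
{
	uint64_t sum = 0;

	for (unsigned int i = 0; i < g->nr_cpus; i++)
		sum += g->stats[i]->wait_sum;
	return sum;
}
```

A reader of the group-wide value walks all CPUs and adds up the per-CPU
counters, which is why each CPU can update its own block without
touching the others.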

The sched_statistics members may be modified frequently when schedstats is
enabled. To avoid disturbing unrelated data that may share a cacheline
with them, struct sched_statistics is defined as cacheline aligned.
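
The effect of the alignment can be checked with a small user-space
sketch. It is not the kernel definition: the 64-byte line size, the
demo_* names and the three-member struct are assumptions made for the
example, and C11 alignas() stands in for the kernel's alignment
attribute.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; the real value depends on the architecture. */
#define DEMO_CACHELINE 64

/* Hypothetical, trimmed-down stand-in for struct sched_statistics. */
struct demo_stats {
	alignas(DEMO_CACHELINE) uint64_t wait_start;
	uint64_t wait_max;
	uint64_t wait_sum;
};

/* Hypothetical task: members before and after the stats block. */
struct demo_task {
	int prio;
	struct demo_stats stats;
	int neighbour;
};

/* Aligning the first member aligns the whole struct and pads its size. */
static_assert(alignof(struct demo_stats) == DEMO_CACHELINE,
	      "stats block starts on a cache line boundary");
static_assert(sizeof(struct demo_stats) % DEMO_CACHELINE == 0,
	      "stats block occupies whole cache lines");

/* Returns 1 when two byte offsets fall into the same cache line. */
static int demo_same_line(size_t a, size_t b)
{
	return a / DEMO_CACHELINE == b / DEMO_CACHELINE;
}
```

With this layout neither prio nor neighbour shares a line with the
statistics, at the cost of padding around the block.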

Then we can use schedstats to trace RT tasks as well, for example,
                    Interface File
 task schedstats :  /proc/[pid]/sched
 group schedstats:  /proc/sched_debug
 tracepoints     :  sched:sched_stat_{runtime, wait, sleep, iowait, blocked}

As PATCH #2 and #3 change the core structs in the scheduler, I ran
'perf bench sched pipe' to measure the sched performance before and after
the change, as suggested by Mel. Below is the data, all in usecs/op.

                                  Before               After
      kernel.sched_schedstats=0  ~5.6                 ~5.6
      kernel.sched_schedstats=1  ~5.7                 ~5.7
    [These data differ slightly from the previous version because my
     old test machine was destroyed, so I had to use a different test
     machine.]
No obvious difference after the change.

Changes since v2:
- Fix the output format in /proc/[pid]/sched
- Rebase on the latest code
- Redo the performance test

Changes since v1:
- Fix the build failure reported by kernel test robot.
- Add the performance data with 'perf bench sched pipe', suggested by
  Mel.
- Make the struct sched_statistics cacheline aligned.
- Introduce task block time in schedstats

Changes since RFC:
- Improve the schedstats helpers, per Mel.
- Make struct sched_statistics independent of fair sched class


Yafang Shao (7):
  sched, fair: use __schedstat_set() in set_next_entity()
  sched: make struct sched_statistics independent of fair sched class
  sched: make schedstats helpers independent of fair sched class
  sched: make the output of schedstats independent of fair sched class
  sched: introduce task block time in schedstats
  sched, rt: support sched_stat_runtime tracepoint for RT sched class
  sched, rt: support schedstats for RT sched class

 include/linux/sched.h    |   7 +-
 kernel/sched/core.c      |  24 +++--
 kernel/sched/deadline.c  |   4 +-
 kernel/sched/debug.c     | 136 +++++++++++++++----------
 kernel/sched/fair.c      | 210 ++++++++++++++++-----------------------
 kernel/sched/rt.c        | 147 ++++++++++++++++++++++++++-
 kernel/sched/sched.h     |   3 +
 kernel/sched/stats.c     | 104 +++++++++++++++++++
 kernel/sched/stats.h     |  89 +++++++++++++++++
 kernel/sched/stop_task.c |   4 +-
 10 files changed, 531 insertions(+), 197 deletions(-)

-- 
2.18.2



* [PATCH v3 1/7] sched, fair: use __schedstat_set() in set_next_entity()
  2021-08-24 11:29 [PATCH v3 0/7] sched: support schedstats for RT sched class Yafang Shao
@ 2021-08-24 11:29 ` Yafang Shao
  2021-08-24 11:29 ` [PATCH v3 2/7] sched: make struct sched_statistics independent of fair sched class Yafang Shao
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2021-08-24 11:29 UTC
  To: mingo, peterz, mgorman, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, bristot, achaiken
  Cc: lkp, linux-kernel, linux-rt-users, Yafang Shao

schedstat_enabled() has already been checked at this point, so we can
use __schedstat_set() directly.
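
The difference between the two setters can be sketched in user space as
follows. The macro bodies follow the shape of the helpers in
kernel/sched/stats.h, but the plain flag replacing the static branch,
the check counter and all demo_* names are assumptions made for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Plain flag instead of the kernel's static branch (assumption). */
static bool demo_schedstats_on;
static unsigned int demo_enabled_checks;

static bool demo_schedstat_enabled(void)
{
	demo_enabled_checks++;	/* count how often the flag is tested */
	return demo_schedstats_on;
}

/* Unchecked setter: the caller has already tested the flag. */
#define demo_raw_schedstat_set(var, val)	do { (var) = (val); } while (0)

/* Checked setter: tests the flag on every call. */
#define demo_schedstat_set(var, val) \
	do { if (demo_schedstat_enabled()) { (var) = (val); } } while (0)

/* Same shape as the hunk: one explicit check, then the unchecked setter. */
static void demo_update_slice_max(uint64_t *slice_max, uint64_t delta)
{
	assert(slice_max != NULL);
	if (demo_schedstat_enabled())
		demo_raw_schedstat_set(*slice_max,
				       *slice_max > delta ? *slice_max : delta);
}
```

Using the checked setter inside the already guarded branch would test
the flag a second time for no benefit.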

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Alison Chaiken <achaiken@aurora.tech>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5aa3cfd15a2e..422426768b84 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4502,7 +4502,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 */
 	if (schedstat_enabled() &&
 	    rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
-		schedstat_set(se->statistics.slice_max,
+		__schedstat_set(se->statistics.slice_max,
 			max((u64)schedstat_val(se->statistics.slice_max),
 			    se->sum_exec_runtime - se->prev_sum_exec_runtime));
 	}
-- 
2.18.2



* [PATCH v3 2/7] sched: make struct sched_statistics independent of fair sched class
  2021-08-24 11:29 [PATCH v3 0/7] sched: support schedstats for RT sched class Yafang Shao
  2021-08-24 11:29 ` [PATCH v3 1/7] sched, fair: use __schedstat_set() in set_next_entity() Yafang Shao
@ 2021-08-24 11:29 ` Yafang Shao
  2021-08-31 10:14   ` Peter Zijlstra
  2021-08-31 10:19   ` Peter Zijlstra
  2021-08-24 11:29 ` [PATCH v3 3/7] sched: make schedstats helpers " Yafang Shao
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 20+ messages in thread
From: Yafang Shao @ 2021-08-24 11:29 UTC
  To: mingo, peterz, mgorman, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, bristot, achaiken
  Cc: lkp, linux-kernel, linux-rt-users, Yafang Shao

If we want to use the schedstats facility to trace other sched classes, we
should make it independent of the fair sched class. The struct
sched_statistics holds the scheduler statistics of a task_struct or a
task_group. So we can move it into struct task_struct and struct
task_group to achieve the goal.

After the patch, schedstats are organized as follows:

struct task_struct {
    ...
    struct sched_statistics stats;
    ...
    struct sched_entity se;
    struct sched_rt_entity rt;
    ...
};

struct task_group {                    |---> stats[0] : of CPU0
    ...                                |
    struct sched_statistics **stats; --|---> stats[1] : of CPU1
    ...                                |
                                       |---> stats[n] : of CPUn
 #ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity **se;
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
    struct sched_rt_entity  **rt_se;
 #endif
    ...
};

The sched_statistics members may be modified frequently when schedstats is
enabled. To avoid disturbing unrelated data that may share a cacheline
with them, struct sched_statistics is defined as cacheline aligned.

As this patch changes the core structs of the scheduler, I measured its
impact on scheduler performance with 'perf bench sched pipe', as
suggested by Mel. Below is the result, in which all the values are in
usecs/op.
                                  Before               After
      kernel.sched_schedstats=0  ~5.6                 ~5.6
      kernel.sched_schedstats=1  ~5.7                 ~5.7
[These data differ slightly from the previous version because my old
 test machine was destroyed, so I had to use a different test machine.]

Almost no impact on the sched performance.

No functional change.

[lkp@intel.com: reported build failure in earlier version]

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: kernel test robot <lkp@intel.com>
Cc: Alison Chaiken <achaiken@aurora.tech>
---
 include/linux/sched.h    |   5 +-
 kernel/sched/core.c      |  24 ++++----
 kernel/sched/deadline.c  |   4 +-
 kernel/sched/debug.c     |  90 +++++++++++++++--------------
 kernel/sched/fair.c      | 121 ++++++++++++++++++++++++++++-----------
 kernel/sched/rt.c        |   4 +-
 kernel/sched/sched.h     |   3 +
 kernel/sched/stats.h     |  55 ++++++++++++++++++
 kernel/sched/stop_task.c |   4 +-
 9 files changed, 212 insertions(+), 98 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f43fb7a32a9c..39c29eae1af9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -521,7 +521,7 @@ struct sched_statistics {
 	u64				nr_wakeups_passive;
 	u64				nr_wakeups_idle;
 #endif
-};
+} ____cacheline_aligned;
 
 struct sched_entity {
 	/* For load-balancing: */
@@ -537,8 +537,6 @@ struct sched_entity {
 
 	u64				nr_migrations;
 
-	struct sched_statistics		statistics;
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int				depth;
 	struct sched_entity		*parent;
@@ -775,6 +773,7 @@ struct task_struct {
 	unsigned int			rt_priority;
 
 	const struct sched_class	*sched_class;
+	struct sched_statistics         stats;
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
 	struct sched_dl_entity		dl;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 21d633971fcf..38bb7afb396c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3489,11 +3489,11 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 #ifdef CONFIG_SMP
 	if (cpu == rq->cpu) {
 		__schedstat_inc(rq->ttwu_local);
-		__schedstat_inc(p->se.statistics.nr_wakeups_local);
+		__schedstat_inc(p->stats.nr_wakeups_local);
 	} else {
 		struct sched_domain *sd;
 
-		__schedstat_inc(p->se.statistics.nr_wakeups_remote);
+		__schedstat_inc(p->stats.nr_wakeups_remote);
 		rcu_read_lock();
 		for_each_domain(rq->cpu, sd) {
 			if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
@@ -3505,14 +3505,14 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 	}
 
 	if (wake_flags & WF_MIGRATED)
-		__schedstat_inc(p->se.statistics.nr_wakeups_migrate);
+		__schedstat_inc(p->stats.nr_wakeups_migrate);
 #endif /* CONFIG_SMP */
 
 	__schedstat_inc(rq->ttwu_count);
-	__schedstat_inc(p->se.statistics.nr_wakeups);
+	__schedstat_inc(p->stats.nr_wakeups);
 
 	if (wake_flags & WF_SYNC)
-		__schedstat_inc(p->se.statistics.nr_wakeups_sync);
+		__schedstat_inc(p->stats.nr_wakeups_sync);
 }
 
 /*
@@ -4196,7 +4196,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 #ifdef CONFIG_SCHEDSTATS
 	/* Even if schedstat is disabled, there should not be garbage */
-	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
+	memset(&p->stats, 0, sizeof(p->stats));
 #endif
 
 	RB_CLEAR_NODE(&p->dl.rb_node);
@@ -9608,9 +9608,9 @@ void normalize_rt_tasks(void)
 			continue;
 
 		p->se.exec_start = 0;
-		schedstat_set(p->se.statistics.wait_start,  0);
-		schedstat_set(p->se.statistics.sleep_start, 0);
-		schedstat_set(p->se.statistics.block_start, 0);
+		schedstat_set(p->stats.wait_start,  0);
+		schedstat_set(p->stats.sleep_start, 0);
+		schedstat_set(p->stats.block_start, 0);
 
 		if (!dl_task(p) && !rt_task(p)) {
 			/*
@@ -9700,6 +9700,7 @@ static void sched_free_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
+	free_tg_schedstats(tg);
 	autogroup_free(tg);
 	kmem_cache_free(task_group_cache, tg);
 }
@@ -9719,6 +9720,9 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	if (!alloc_tg_schedstats(tg))
+		goto err;
+
 	alloc_uclamp_sched_group(tg, parent);
 
 	return tg;
@@ -10456,7 +10460,7 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
 		int i;
 
 		for_each_possible_cpu(i)
-			ws += schedstat_val(tg->se[i]->statistics.wait_sum);
+			ws += schedstat_val(tg->stats[i]->wait_sum);
 
 		seq_printf(sf, "wait_sum %llu\n", ws);
 	}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e94314633b39..51dd30990042 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1265,8 +1265,8 @@ static void update_curr_dl(struct rq *rq)
 		return;
 	}
 
-	schedstat_set(curr->se.statistics.exec_max,
-		      max(curr->se.statistics.exec_max, delta_exec));
+	schedstat_set(curr->stats.exec_max,
+		      max(curr->stats.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 49716228efb4..4cfee2aa1a2d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -442,9 +442,11 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	struct sched_entity *se = tg->se[cpu];
 
 #define P(F)		SEQ_printf(m, "  .%-30s: %lld\n",	#F, (long long)F)
-#define P_SCHEDSTAT(F)	SEQ_printf(m, "  .%-30s: %lld\n",	#F, (long long)schedstat_val(F))
+#define P_SCHEDSTAT(F)	SEQ_printf(m, "  .%-30s: %lld\n",	\
+		"se->statistics."#F, (long long)schedstat_val(tg->stats[cpu]->F))
 #define PN(F)		SEQ_printf(m, "  .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))
-#define PN_SCHEDSTAT(F)	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)schedstat_val(F)))
+#define PN_SCHEDSTAT(F)	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", \
+		"se->statistics."#F, SPLIT_NS((long long)schedstat_val(tg->stats[cpu]->F)))
 
 	if (!se)
 		return;
@@ -454,16 +456,16 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	PN(se->sum_exec_runtime);
 
 	if (schedstat_enabled()) {
-		PN_SCHEDSTAT(se->statistics.wait_start);
-		PN_SCHEDSTAT(se->statistics.sleep_start);
-		PN_SCHEDSTAT(se->statistics.block_start);
-		PN_SCHEDSTAT(se->statistics.sleep_max);
-		PN_SCHEDSTAT(se->statistics.block_max);
-		PN_SCHEDSTAT(se->statistics.exec_max);
-		PN_SCHEDSTAT(se->statistics.slice_max);
-		PN_SCHEDSTAT(se->statistics.wait_max);
-		PN_SCHEDSTAT(se->statistics.wait_sum);
-		P_SCHEDSTAT(se->statistics.wait_count);
+		PN_SCHEDSTAT(wait_start);
+		PN_SCHEDSTAT(sleep_start);
+		PN_SCHEDSTAT(block_start);
+		PN_SCHEDSTAT(sleep_max);
+		PN_SCHEDSTAT(block_max);
+		PN_SCHEDSTAT(exec_max);
+		PN_SCHEDSTAT(slice_max);
+		PN_SCHEDSTAT(wait_max);
+		PN_SCHEDSTAT(wait_sum);
+		P_SCHEDSTAT(wait_count);
 	}
 
 	P(se->load.weight);
@@ -530,9 +532,9 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 		p->prio);
 
 	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
-		SPLIT_NS(schedstat_val_or_zero(p->se.statistics.wait_sum)),
+		SPLIT_NS(schedstat_val_or_zero(p->stats.wait_sum)),
 		SPLIT_NS(p->se.sum_exec_runtime),
-		SPLIT_NS(schedstat_val_or_zero(p->se.statistics.sum_sleep_runtime)));
+		SPLIT_NS(schedstat_val_or_zero(p->stats.sum_sleep_runtime)));
 
 #ifdef CONFIG_NUMA_BALANCING
 	SEQ_printf(m, " %d %d", task_node(p), task_numa_group_id(p));
@@ -948,8 +950,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		"---------------------------------------------------------"
 		"----------\n");
 
-#define P_SCHEDSTAT(F)  __PS(#F, schedstat_val(p->F))
-#define PN_SCHEDSTAT(F) __PSN(#F, schedstat_val(p->F))
+#define P_SCHEDSTAT(F)  __PS("se.statistics."#F, schedstat_val(p->stats.F))
+#define PN_SCHEDSTAT(F) __PSN("se.statistics."#F, schedstat_val(p->stats.F))
 
 	PN(se.exec_start);
 	PN(se.vruntime);
@@ -962,33 +964,33 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	if (schedstat_enabled()) {
 		u64 avg_atom, avg_per_cpu;
 
-		PN_SCHEDSTAT(se.statistics.sum_sleep_runtime);
-		PN_SCHEDSTAT(se.statistics.wait_start);
-		PN_SCHEDSTAT(se.statistics.sleep_start);
-		PN_SCHEDSTAT(se.statistics.block_start);
-		PN_SCHEDSTAT(se.statistics.sleep_max);
-		PN_SCHEDSTAT(se.statistics.block_max);
-		PN_SCHEDSTAT(se.statistics.exec_max);
-		PN_SCHEDSTAT(se.statistics.slice_max);
-		PN_SCHEDSTAT(se.statistics.wait_max);
-		PN_SCHEDSTAT(se.statistics.wait_sum);
-		P_SCHEDSTAT(se.statistics.wait_count);
-		PN_SCHEDSTAT(se.statistics.iowait_sum);
-		P_SCHEDSTAT(se.statistics.iowait_count);
-		P_SCHEDSTAT(se.statistics.nr_migrations_cold);
-		P_SCHEDSTAT(se.statistics.nr_failed_migrations_affine);
-		P_SCHEDSTAT(se.statistics.nr_failed_migrations_running);
-		P_SCHEDSTAT(se.statistics.nr_failed_migrations_hot);
-		P_SCHEDSTAT(se.statistics.nr_forced_migrations);
-		P_SCHEDSTAT(se.statistics.nr_wakeups);
-		P_SCHEDSTAT(se.statistics.nr_wakeups_sync);
-		P_SCHEDSTAT(se.statistics.nr_wakeups_migrate);
-		P_SCHEDSTAT(se.statistics.nr_wakeups_local);
-		P_SCHEDSTAT(se.statistics.nr_wakeups_remote);
-		P_SCHEDSTAT(se.statistics.nr_wakeups_affine);
-		P_SCHEDSTAT(se.statistics.nr_wakeups_affine_attempts);
-		P_SCHEDSTAT(se.statistics.nr_wakeups_passive);
-		P_SCHEDSTAT(se.statistics.nr_wakeups_idle);
+		PN_SCHEDSTAT(sum_sleep_runtime);
+		PN_SCHEDSTAT(wait_start);
+		PN_SCHEDSTAT(sleep_start);
+		PN_SCHEDSTAT(block_start);
+		PN_SCHEDSTAT(sleep_max);
+		PN_SCHEDSTAT(block_max);
+		PN_SCHEDSTAT(exec_max);
+		PN_SCHEDSTAT(slice_max);
+		PN_SCHEDSTAT(wait_max);
+		PN_SCHEDSTAT(wait_sum);
+		P_SCHEDSTAT(wait_count);
+		PN_SCHEDSTAT(iowait_sum);
+		P_SCHEDSTAT(iowait_count);
+		P_SCHEDSTAT(nr_migrations_cold);
+		P_SCHEDSTAT(nr_failed_migrations_affine);
+		P_SCHEDSTAT(nr_failed_migrations_running);
+		P_SCHEDSTAT(nr_failed_migrations_hot);
+		P_SCHEDSTAT(nr_forced_migrations);
+		P_SCHEDSTAT(nr_wakeups);
+		P_SCHEDSTAT(nr_wakeups_sync);
+		P_SCHEDSTAT(nr_wakeups_migrate);
+		P_SCHEDSTAT(nr_wakeups_local);
+		P_SCHEDSTAT(nr_wakeups_remote);
+		P_SCHEDSTAT(nr_wakeups_affine);
+		P_SCHEDSTAT(nr_wakeups_affine_attempts);
+		P_SCHEDSTAT(nr_wakeups_passive);
+		P_SCHEDSTAT(nr_wakeups_idle);
 
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
@@ -1054,7 +1056,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 void proc_sched_set_task(struct task_struct *p)
 {
 #ifdef CONFIG_SCHEDSTATS
-	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
+	memset(&p->stats, 0, sizeof(p->stats));
 #endif
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 422426768b84..7cb802431cfe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -819,6 +819,41 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline void
+__schedstats_from_sched_entity(struct sched_entity *se,
+			      struct sched_statistics **stats)
+{
+	struct task_group *tg;
+	struct task_struct *p;
+	struct cfs_rq *cfs;
+	int cpu;
+
+	if (entity_is_task(se)) {
+		p = task_of(se);
+		*stats = &p->stats;
+	} else {
+		cfs = group_cfs_rq(se);
+		tg = cfs->tg;
+		cpu = cpu_of(rq_of(cfs));
+		*stats = tg->stats[cpu];
+	}
+}
+
+#else
+
+static inline void
+__schedstats_from_sched_entity(struct sched_entity *se,
+			      struct sched_statistics **stats)
+{
+	struct task_struct *p;
+
+	p = task_of(se);
+	*stats = &p->stats;
+}
+
+#endif
+
 /*
  * Update the current task's runtime statistics.
  */
@@ -826,6 +861,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
 	u64 now = rq_clock_task(rq_of(cfs_rq));
+	struct sched_statistics *stats = NULL;
 	u64 delta_exec;
 
 	if (unlikely(!curr))
@@ -837,8 +873,11 @@ static void update_curr(struct cfs_rq *cfs_rq)
 
 	curr->exec_start = now;
 
-	schedstat_set(curr->statistics.exec_max,
-		      max(delta_exec, curr->statistics.exec_max));
+	if (schedstat_enabled()) {
+		__schedstats_from_sched_entity(curr, &stats);
+		__schedstat_set(stats->exec_max,
+				max(delta_exec, stats->exec_max));
+	}
 
 	curr->sum_exec_runtime += delta_exec;
 	schedstat_add(cfs_rq->exec_clock, delta_exec);
@@ -865,40 +904,46 @@ static void update_curr_fair(struct rq *rq)
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	struct sched_statistics *stats = NULL;
 	u64 wait_start, prev_wait_start;
 
 	if (!schedstat_enabled())
 		return;
 
+	__schedstats_from_sched_entity(se, &stats);
+
 	wait_start = rq_clock(rq_of(cfs_rq));
-	prev_wait_start = schedstat_val(se->statistics.wait_start);
+	prev_wait_start = schedstat_val(stats->wait_start);
 
 	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
 	    likely(wait_start > prev_wait_start))
 		wait_start -= prev_wait_start;
 
-	__schedstat_set(se->statistics.wait_start, wait_start);
+	__schedstat_set(stats->wait_start, wait_start);
 }
 
 static inline void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	struct task_struct *p;
+	struct sched_statistics *stats = NULL;
+	struct task_struct *p = NULL;
 	u64 delta;
 
 	if (!schedstat_enabled())
 		return;
 
+	__schedstats_from_sched_entity(se, &stats);
+
 	/*
 	 * When the sched_schedstat changes from 0 to 1, some sched se
 	 * maybe already in the runqueue, the se->statistics.wait_start
 	 * will be 0.So it will let the delta wrong. We need to avoid this
 	 * scenario.
 	 */
-	if (unlikely(!schedstat_val(se->statistics.wait_start)))
+	if (unlikely(!schedstat_val(stats->wait_start)))
 		return;
 
-	delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(se->statistics.wait_start);
+	delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(stats->wait_start);
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
@@ -908,30 +953,33 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 			 * time stamp can be adjusted to accumulate wait time
 			 * prior to migration.
 			 */
-			__schedstat_set(se->statistics.wait_start, delta);
+			__schedstat_set(stats->wait_start, delta);
 			return;
 		}
 		trace_sched_stat_wait(p, delta);
 	}
 
-	__schedstat_set(se->statistics.wait_max,
-		      max(schedstat_val(se->statistics.wait_max), delta));
-	__schedstat_inc(se->statistics.wait_count);
-	__schedstat_add(se->statistics.wait_sum, delta);
-	__schedstat_set(se->statistics.wait_start, 0);
+	__schedstat_set(stats->wait_max,
+		      max(schedstat_val(stats->wait_max), delta));
+	__schedstat_inc(stats->wait_count);
+	__schedstat_add(stats->wait_sum, delta);
+	__schedstat_set(stats->wait_start, 0);
 }
 
 static inline void
 update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	struct sched_statistics *stats = NULL;
 	struct task_struct *tsk = NULL;
 	u64 sleep_start, block_start;
 
 	if (!schedstat_enabled())
 		return;
 
-	sleep_start = schedstat_val(se->statistics.sleep_start);
-	block_start = schedstat_val(se->statistics.block_start);
+	__schedstats_from_sched_entity(se, &stats);
+
+	sleep_start = schedstat_val(stats->sleep_start);
+	block_start = schedstat_val(stats->block_start);
 
 	if (entity_is_task(se))
 		tsk = task_of(se);
@@ -942,11 +990,11 @@ update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		if ((s64)delta < 0)
 			delta = 0;
 
-		if (unlikely(delta > schedstat_val(se->statistics.sleep_max)))
-			__schedstat_set(se->statistics.sleep_max, delta);
+		if (unlikely(delta > schedstat_val(stats->sleep_max)))
+			__schedstat_set(stats->sleep_max, delta);
 
-		__schedstat_set(se->statistics.sleep_start, 0);
-		__schedstat_add(se->statistics.sum_sleep_runtime, delta);
+		__schedstat_set(stats->sleep_start, 0);
+		__schedstat_add(stats->sum_sleep_runtime, delta);
 
 		if (tsk) {
 			account_scheduler_latency(tsk, delta >> 10, 1);
@@ -959,16 +1007,16 @@ update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		if ((s64)delta < 0)
 			delta = 0;
 
-		if (unlikely(delta > schedstat_val(se->statistics.block_max)))
-			__schedstat_set(se->statistics.block_max, delta);
+		if (unlikely(delta > schedstat_val(stats->block_max)))
+			__schedstat_set(stats->block_max, delta);
 
-		__schedstat_set(se->statistics.block_start, 0);
-		__schedstat_add(se->statistics.sum_sleep_runtime, delta);
+		__schedstat_set(stats->block_start, 0);
+		__schedstat_add(stats->sum_sleep_runtime, delta);
 
 		if (tsk) {
 			if (tsk->in_iowait) {
-				__schedstat_add(se->statistics.iowait_sum, delta);
-				__schedstat_inc(se->statistics.iowait_count);
+				__schedstat_add(stats->iowait_sum, delta);
+				__schedstat_inc(stats->iowait_count);
 				trace_sched_stat_iowait(tsk, delta);
 			}
 
@@ -1030,10 +1078,10 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		/* XXX racy against TTWU */
 		state = READ_ONCE(tsk->__state);
 		if (state & TASK_INTERRUPTIBLE)
-			__schedstat_set(se->statistics.sleep_start,
+			__schedstat_set(tsk->stats.sleep_start,
 				      rq_clock(rq_of(cfs_rq)));
 		if (state & TASK_UNINTERRUPTIBLE)
-			__schedstat_set(se->statistics.block_start,
+			__schedstat_set(tsk->stats.block_start,
 				      rq_clock(rq_of(cfs_rq)));
 	}
 }
@@ -4478,6 +4526,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 static void
 set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	struct sched_statistics *stats = NULL;
+
 	clear_buddies(cfs_rq, se);
 
 	/* 'current' is not kept within the tree. */
@@ -4502,8 +4552,9 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 */
 	if (schedstat_enabled() &&
 	    rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
-		__schedstat_set(se->statistics.slice_max,
-			max((u64)schedstat_val(se->statistics.slice_max),
+		__schedstats_from_sched_entity(se, &stats);
+		__schedstat_set(stats->slice_max,
+			max((u64)schedstat_val(stats->slice_max),
 			    se->sum_exec_runtime - se->prev_sum_exec_runtime));
 	}
 
@@ -5993,12 +6044,12 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
-	schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
+	schedstat_inc(p->stats.nr_wakeups_affine_attempts);
 	if (target == nr_cpumask_bits)
 		return prev_cpu;
 
 	schedstat_inc(sd->ttwu_move_affine);
-	schedstat_inc(p->se.statistics.nr_wakeups_affine);
+	schedstat_inc(p->stats.nr_wakeups_affine);
 	return target;
 }
 
@@ -7802,7 +7853,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
 		int cpu;
 
-		schedstat_inc(p->se.statistics.nr_failed_migrations_affine);
+		schedstat_inc(p->stats.nr_failed_migrations_affine);
 
 		env->flags |= LBF_SOME_PINNED;
 
@@ -7836,7 +7887,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	env->flags &= ~LBF_ALL_PINNED;
 
 	if (task_running(env->src_rq, p)) {
-		schedstat_inc(p->se.statistics.nr_failed_migrations_running);
+		schedstat_inc(p->stats.nr_failed_migrations_running);
 		return 0;
 	}
 
@@ -7858,12 +7909,12 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 		if (tsk_cache_hot == 1) {
 			schedstat_inc(env->sd->lb_hot_gained[env->idle]);
-			schedstat_inc(p->se.statistics.nr_forced_migrations);
+			schedstat_inc(p->stats.nr_forced_migrations);
 		}
 		return 1;
 	}
 
-	schedstat_inc(p->se.statistics.nr_failed_migrations_hot);
+	schedstat_inc(p->stats.nr_failed_migrations_hot);
 	return 0;
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3daf42a0f462..95a7c3ad2dc3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1009,8 +1009,8 @@ static void update_curr_rt(struct rq *rq)
 	if (unlikely((s64)delta_exec <= 0))
 		return;
 
-	schedstat_set(curr->se.statistics.exec_max,
-		      max(curr->se.statistics.exec_max, delta_exec));
+	schedstat_set(curr->stats.exec_max,
+		      max(curr->stats.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e6347c88c467..6a4541d7d659 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -389,6 +389,9 @@ struct cfs_bandwidth {
 struct task_group {
 	struct cgroup_subsys_state css;
 
+	/* schedstats of this group on each CPU */
+	struct sched_statistics **stats;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* schedulable entities of this group on each CPU */
 	struct sched_entity	**se;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index d8f8eb0c655b..e6905e369c5d 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -41,6 +41,7 @@ rq_sched_info_dequeue(struct rq *rq, unsigned long long delta)
 #define   schedstat_val_or_zero(var)	((schedstat_enabled()) ? (var) : 0)
 
 #else /* !CONFIG_SCHEDSTATS: */
+
 static inline void rq_sched_info_arrive  (struct rq *rq, unsigned long long delta) { }
 static inline void rq_sched_info_dequeue(struct rq *rq, unsigned long long delta) { }
 static inline void rq_sched_info_depart  (struct rq *rq, unsigned long long delta) { }
@@ -53,8 +54,62 @@ static inline void rq_sched_info_depart  (struct rq *rq, unsigned long long delt
 # define   schedstat_set(var, val)	do { } while (0)
 # define   schedstat_val(var)		0
 # define   schedstat_val_or_zero(var)	0
+
 #endif /* CONFIG_SCHEDSTATS */
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SCHEDSTATS)
+static inline void free_tg_schedstats(struct task_group *tg)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		if (tg->stats)
+			kfree(tg->stats[i]);
+	}
+
+	kfree(tg->stats);
+}
+
+static inline int alloc_tg_schedstats(struct task_group *tg)
+{
+	struct sched_statistics *stats;
+	int i;
+
+	/*
+	 * This memory should be allocated whatever schedstat_enabled() or
+	 * not.
+	 */
+	tg->stats = kcalloc(nr_cpu_ids, sizeof(stats), GFP_KERNEL);
+	if (!tg->stats)
+		return 0;
+
+	for_each_possible_cpu(i) {
+		stats = kzalloc_node(sizeof(struct sched_statistics),
+				     GFP_KERNEL, cpu_to_node(i));
+		if (!stats)
+			return 0;
+
+		tg->stats[i] = stats;
+	}
+
+	return 1;
+}
+
+#else
+
+static inline void free_tg_schedstats(struct task_group *tg)
+{
+
+}
+
+static inline int alloc_tg_schedstats(struct task_group *tg)
+{
+	return 1;
+}
+
+#endif
+
+
 #ifdef CONFIG_PSI
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index f988ebe3febb..0b165a25f22f 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -78,8 +78,8 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 	if (unlikely((s64)delta_exec < 0))
 		delta_exec = 0;
 
-	schedstat_set(curr->se.statistics.exec_max,
-			max(curr->se.statistics.exec_max, delta_exec));
+	schedstat_set(curr->stats.exec_max,
+		      max(curr->stats.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
-- 
2.18.2



* [PATCH v3 3/7] sched: make schedstats helpers independent of fair sched class
  2021-08-24 11:29 [PATCH v3 0/7] sched: support schedstats for RT sched class Yafang Shao
  2021-08-24 11:29 ` [PATCH v3 1/7] sched, fair: use __schedstat_set() in set_next_entity() Yafang Shao
  2021-08-24 11:29 ` [PATCH v3 2/7] sched: make struct sched_statistics independent of fair sched class Yafang Shao
@ 2021-08-24 11:29 ` Yafang Shao
  2021-08-31 11:07   ` Peter Zijlstra
  2021-08-24 11:29 ` [PATCH v3 4/7] sched: make the output of schedstats " Yafang Shao
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Yafang Shao @ 2021-08-24 11:29 UTC (permalink / raw)
  To: mingo, peterz, mgorman, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, bristot, achaiken
  Cc: lkp, linux-kernel, linux-rt-users, Yafang Shao

The original prototypes of the schedstats helpers are

update_stats_wait_*(struct cfs_rq *cfs_rq, struct sched_entity *se)

The cfs_rq in these helpers is used to get the rq_clock, and the se is
used to get the struct sched_statistics and the struct task_struct. In
order to make these helpers available to all sched classes, we can pass
the rq, sched_statistics and task_struct directly.

Then the new helpers are

update_stats_wait_*(struct rq *rq, struct task_struct *p,
                    struct sched_statistics *stats)

which are independent of fair sched class.
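
As a rough user-space illustration of this split (not the kernel code
itself), the sketch below keeps the accounting in class-independent
helpers that only see a clock value and a stats pointer, while thin
per-class wrappers merely locate the stats. All model_* names are
hypothetical, and the migration handling is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the wait fields of struct sched_statistics. */
struct model_stats {
	uint64_t wait_start;
	uint64_t wait_max;
	uint64_t wait_count;
	uint64_t wait_sum;
};

/* Class-independent helpers: they see only a clock value and the stats. */
static inline void model_wait_start(uint64_t now, struct model_stats *stats)
{
	assert(stats);
	stats->wait_start = now;
}

static inline void model_wait_end(uint64_t now, struct model_stats *stats)
{
	uint64_t delta;

	assert(stats);
	delta = now - stats->wait_start;

	if (delta > stats->wait_max)
		stats->wait_max = delta;
	stats->wait_count++;
	stats->wait_sum += delta;
	stats->wait_start = 0;
}

/* Hypothetical per-class entities; each wrapper only locates the stats. */
struct model_fair_entity { struct model_stats stats; };
struct model_rt_entity   { struct model_stats stats; };

static inline void model_wait_start_fair(uint64_t now, struct model_fair_entity *se)
{
	model_wait_start(now, &se->stats);
}

static inline void model_wait_end_fair(uint64_t now, struct model_fair_entity *se)
{
	model_wait_end(now, &se->stats);
}

static inline void model_wait_start_rt(uint64_t now, struct model_rt_entity *rt_se)
{
	model_wait_start(now, &rt_se->stats);
}

static inline void model_wait_end_rt(uint64_t now, struct model_rt_entity *rt_se)
{
	model_wait_end(now, &rt_se->stats);
}
```

Both wrappers end up in the same accounting code, which is the point of
passing the rq, the task and the stats instead of cfs_rq and se.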

To avoid vmlinux growing too large or introducing overhead when
!schedstat_enabled(), some new helpers, called only after the
schedstat_enabled() check, are also introduced, as suggested by Mel.
These helpers are in sched/stats.c,

__update_stats_wait_*(struct rq *rq, struct task_struct *p,
                      struct sched_statistics *stats)

The size of vmlinux is as follows,
                      Before          After
  Size of vmlinux     826308552       826304640
The size is a little smaller as some functions are no longer inlined
after the change.

I also compared the sched performance with 'perf bench sched pipe',
as suggested by Mel. The results are as follows,
                             Before               After
  kernel.sched_schedstats=0  ~5.6                 ~5.6
  kernel.sched_schedstats=1  ~5.7                 ~5.7
[These data differ slightly from the previous version because my old
test machine was destroyed, so I had to use a different test
machine.]
Almost no difference.

No functional change.

[lkp@intel.com: reported build failure in prev version]

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: kernel test robot <lkp@intel.com>
Cc: Alison Chaiken <achaiken@aurora.tech>
---
 kernel/sched/fair.c  | 133 +++++++------------------------------------
 kernel/sched/stats.c | 103 +++++++++++++++++++++++++++++++++
 kernel/sched/stats.h |  34 +++++++++++
 3 files changed, 156 insertions(+), 114 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7cb802431cfe..1324000c78bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -902,32 +902,28 @@ static void update_curr_fair(struct rq *rq)
 }
 
 static inline void
-update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_wait_start_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct sched_statistics *stats = NULL;
-	u64 wait_start, prev_wait_start;
+	struct task_struct *p = NULL;
 
 	if (!schedstat_enabled())
 		return;
 
 	__schedstats_from_sched_entity(se, &stats);
 
-	wait_start = rq_clock(rq_of(cfs_rq));
-	prev_wait_start = schedstat_val(stats->wait_start);
+	if (entity_is_task(se))
+		p = task_of(se);
 
-	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
-	    likely(wait_start > prev_wait_start))
-		wait_start -= prev_wait_start;
+	__update_stats_wait_start(rq_of(cfs_rq), p, stats);
 
-	__schedstat_set(stats->wait_start, wait_start);
 }
 
 static inline void
-update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_wait_end_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct sched_statistics *stats = NULL;
 	struct task_struct *p = NULL;
-	u64 delta;
 
 	if (!schedstat_enabled())
 		return;
@@ -943,105 +939,34 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (unlikely(!schedstat_val(stats->wait_start)))
 		return;
 
-	delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(stats->wait_start);
-
-	if (entity_is_task(se)) {
+	if (entity_is_task(se))
 		p = task_of(se);
-		if (task_on_rq_migrating(p)) {
-			/*
-			 * Preserve migrating task's wait time so wait_start
-			 * time stamp can be adjusted to accumulate wait time
-			 * prior to migration.
-			 */
-			__schedstat_set(stats->wait_start, delta);
-			return;
-		}
-		trace_sched_stat_wait(p, delta);
-	}
 
-	__schedstat_set(stats->wait_max,
-		      max(schedstat_val(stats->wait_max), delta));
-	__schedstat_inc(stats->wait_count);
-	__schedstat_add(stats->wait_sum, delta);
-	__schedstat_set(stats->wait_start, 0);
+	__update_stats_wait_end(rq_of(cfs_rq), p, stats);
 }
 
 static inline void
-update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_enqueue_sleeper_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct sched_statistics *stats = NULL;
 	struct task_struct *tsk = NULL;
-	u64 sleep_start, block_start;
 
 	if (!schedstat_enabled())
 		return;
 
 	__schedstats_from_sched_entity(se, &stats);
 
-	sleep_start = schedstat_val(stats->sleep_start);
-	block_start = schedstat_val(stats->block_start);
-
 	if (entity_is_task(se))
 		tsk = task_of(se);
 
-	if (sleep_start) {
-		u64 delta = rq_clock(rq_of(cfs_rq)) - sleep_start;
-
-		if ((s64)delta < 0)
-			delta = 0;
-
-		if (unlikely(delta > schedstat_val(stats->sleep_max)))
-			__schedstat_set(stats->sleep_max, delta);
-
-		__schedstat_set(stats->sleep_start, 0);
-		__schedstat_add(stats->sum_sleep_runtime, delta);
-
-		if (tsk) {
-			account_scheduler_latency(tsk, delta >> 10, 1);
-			trace_sched_stat_sleep(tsk, delta);
-		}
-	}
-	if (block_start) {
-		u64 delta = rq_clock(rq_of(cfs_rq)) - block_start;
-
-		if ((s64)delta < 0)
-			delta = 0;
-
-		if (unlikely(delta > schedstat_val(stats->block_max)))
-			__schedstat_set(stats->block_max, delta);
-
-		__schedstat_set(stats->block_start, 0);
-		__schedstat_add(stats->sum_sleep_runtime, delta);
-
-		if (tsk) {
-			if (tsk->in_iowait) {
-				__schedstat_add(stats->iowait_sum, delta);
-				__schedstat_inc(stats->iowait_count);
-				trace_sched_stat_iowait(tsk, delta);
-			}
-
-			trace_sched_stat_blocked(tsk, delta);
-
-			/*
-			 * Blocking time is in units of nanosecs, so shift by
-			 * 20 to get a milliseconds-range estimation of the
-			 * amount of time that the task spent sleeping:
-			 */
-			if (unlikely(prof_on == SLEEP_PROFILING)) {
-				profile_hits(SLEEP_PROFILING,
-						(void *)get_wchan(tsk),
-						delta >> 20);
-			}
-			account_scheduler_latency(tsk, delta >> 10, 0);
-		}
-	}
+	__update_stats_enqueue_sleeper(rq_of(cfs_rq), tsk, stats);
 }
 
 /*
  * Task is being enqueued - update stats:
  */
 static inline void
-update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+update_stats_enqueue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	if (!schedstat_enabled())
 		return;
@@ -1051,14 +976,14 @@ update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * a dequeue/enqueue event is a NOP)
 	 */
 	if (se != cfs_rq->curr)
-		update_stats_wait_start(cfs_rq, se);
+		update_stats_wait_start_fair(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP)
-		update_stats_enqueue_sleeper(cfs_rq, se);
+		update_stats_enqueue_sleeper_fair(cfs_rq, se);
 }
 
 static inline void
-update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 
 	if (!schedstat_enabled())
@@ -1069,7 +994,7 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * waiting task:
 	 */
 	if (se != cfs_rq->curr)
-		update_stats_wait_end(cfs_rq, se);
+		update_stats_wait_end_fair(cfs_rq, se);
 
 	if ((flags & DEQUEUE_SLEEP) && entity_is_task(se)) {
 		struct task_struct *tsk = task_of(se);
@@ -4273,26 +4198,6 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
 
-static inline void check_schedstat_required(void)
-{
-#ifdef CONFIG_SCHEDSTATS
-	if (schedstat_enabled())
-		return;
-
-	/* Force schedstat enabled if a dependent tracepoint is active */
-	if (trace_sched_stat_wait_enabled()    ||
-			trace_sched_stat_sleep_enabled()   ||
-			trace_sched_stat_iowait_enabled()  ||
-			trace_sched_stat_blocked_enabled() ||
-			trace_sched_stat_runtime_enabled())  {
-		printk_deferred_once("Scheduler tracepoints stat_sleep, stat_iowait, "
-			     "stat_blocked and stat_runtime require the "
-			     "kernel parameter schedstats=enable or "
-			     "kernel.sched_schedstats=1\n");
-	}
-#endif
-}
-
 static inline bool cfs_bandwidth_used(void);
 
 /*
@@ -4366,7 +4271,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		place_entity(cfs_rq, se, 0);
 
 	check_schedstat_required();
-	update_stats_enqueue(cfs_rq, se, flags);
+	update_stats_enqueue_fair(cfs_rq, se, flags);
 	check_spread(cfs_rq, se);
 	if (!curr)
 		__enqueue_entity(cfs_rq, se);
@@ -4450,7 +4355,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	update_load_avg(cfs_rq, se, UPDATE_TG);
 	se_update_runnable(se);
 
-	update_stats_dequeue(cfs_rq, se, flags);
+	update_stats_dequeue_fair(cfs_rq, se, flags);
 
 	clear_buddies(cfs_rq, se);
 
@@ -4537,7 +4442,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 * a CPU. So account for the time it spent waiting on the
 		 * runqueue.
 		 */
-		update_stats_wait_end(cfs_rq, se);
+		update_stats_wait_end_fair(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 	}
@@ -4637,7 +4542,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 	check_spread(cfs_rq, prev);
 
 	if (prev->on_rq) {
-		update_stats_wait_start(cfs_rq, prev);
+		update_stats_wait_start_fair(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 3f93fc3b5648..b2542f4d3192 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -4,6 +4,109 @@
  */
 #include "sched.h"
 
+void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
+			       struct sched_statistics *stats)
+{
+	u64 wait_start, prev_wait_start;
+
+	wait_start = rq_clock(rq);
+	prev_wait_start = schedstat_val(stats->wait_start);
+
+	if (p && likely(wait_start > prev_wait_start))
+		wait_start -= prev_wait_start;
+
+	__schedstat_set(stats->wait_start, wait_start);
+}
+
+void __update_stats_wait_end(struct rq *rq, struct task_struct *p,
+			     struct sched_statistics *stats)
+{
+	u64 delta = rq_clock(rq) - schedstat_val(stats->wait_start);
+
+	if (p) {
+		if (task_on_rq_migrating(p)) {
+			/*
+			 * Preserve migrating task's wait time so wait_start
+			 * time stamp can be adjusted to accumulate wait time
+			 * prior to migration.
+			 */
+			__schedstat_set(stats->wait_start, delta);
+
+			return;
+		}
+
+		trace_sched_stat_wait(p, delta);
+	}
+
+	__schedstat_set(stats->wait_max,
+			max(schedstat_val(stats->wait_max), delta));
+	__schedstat_inc(stats->wait_count);
+	__schedstat_add(stats->wait_sum, delta);
+	__schedstat_set(stats->wait_start, 0);
+}
+
+void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
+				    struct sched_statistics *stats)
+{
+	u64 sleep_start, block_start;
+
+	sleep_start = schedstat_val(stats->sleep_start);
+	block_start = schedstat_val(stats->block_start);
+
+	if (sleep_start) {
+		u64 delta = rq_clock(rq) - sleep_start;
+
+		if ((s64)delta < 0)
+			delta = 0;
+
+		if (unlikely(delta > schedstat_val(stats->sleep_max)))
+			__schedstat_set(stats->sleep_max, delta);
+
+		__schedstat_set(stats->sleep_start, 0);
+		__schedstat_add(stats->sum_sleep_runtime, delta);
+
+		if (p) {
+			account_scheduler_latency(p, delta >> 10, 1);
+			trace_sched_stat_sleep(p, delta);
+		}
+	}
+
+	if (block_start) {
+		u64 delta = rq_clock(rq) - block_start;
+
+		if ((s64)delta < 0)
+			delta = 0;
+
+		if (unlikely(delta > schedstat_val(stats->block_max)))
+			__schedstat_set(stats->block_max, delta);
+
+		__schedstat_set(stats->block_start, 0);
+		__schedstat_add(stats->sum_sleep_runtime, delta);
+
+		if (p) {
+			if (p->in_iowait) {
+				__schedstat_add(stats->iowait_sum, delta);
+				__schedstat_inc(stats->iowait_count);
+				trace_sched_stat_iowait(p, delta);
+			}
+
+			trace_sched_stat_blocked(p, delta);
+
+			/*
+			 * Blocking time is in units of nanosecs, so shift by
+			 * 20 to get a milliseconds-range estimation of the
+			 * amount of time that the task spent sleeping:
+			 */
+			if (unlikely(prof_on == SLEEP_PROFILING)) {
+				profile_hits(SLEEP_PROFILING,
+					     (void *)get_wchan(p),
+					     delta >> 20);
+			}
+			account_scheduler_latency(p, delta >> 10, 0);
+		}
+	}
+}
+
 /*
  * Current schedstat API version.
  *
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index e6905e369c5d..9ecd81b91f26 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -2,6 +2,8 @@
 
 #ifdef CONFIG_SCHEDSTATS
 
+extern struct static_key_false sched_schedstats;
+
 /*
  * Expects runqueue lock to be held for atomicity of update
  */
@@ -40,6 +42,33 @@ rq_sched_info_dequeue(struct rq *rq, unsigned long long delta)
 #define   schedstat_val(var)		(var)
 #define   schedstat_val_or_zero(var)	((schedstat_enabled()) ? (var) : 0)
 
+void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
+			       struct sched_statistics *stats);
+
+void __update_stats_wait_end(struct rq *rq, struct task_struct *p,
+			     struct sched_statistics *stats);
+void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
+				    struct sched_statistics *stats);
+
+static inline void
+check_schedstat_required(void)
+{
+	if (schedstat_enabled())
+		return;
+
+	/* Force schedstat enabled if a dependent tracepoint is active */
+	if (trace_sched_stat_wait_enabled()    ||
+		trace_sched_stat_sleep_enabled()   ||
+		trace_sched_stat_iowait_enabled()  ||
+		trace_sched_stat_blocked_enabled() ||
+		trace_sched_stat_runtime_enabled())  {
+		printk_deferred_once("Scheduler tracepoints stat_sleep, stat_iowait, "
+					"stat_blocked and stat_runtime require the "
+					"kernel parameter schedstats=enable or "
+					"kernel.sched_schedstats=1\n");
+	}
+}
+
 #else /* !CONFIG_SCHEDSTATS: */
 
 static inline void rq_sched_info_arrive  (struct rq *rq, unsigned long long delta) { }
@@ -55,6 +84,11 @@ static inline void rq_sched_info_depart  (struct rq *rq, unsigned long long delt
 # define   schedstat_val(var)		0
 # define   schedstat_val_or_zero(var)	0
 
+# define __update_stats_wait_start(rq, p, stats)       do { } while (0)
+# define __update_stats_wait_end(rq, p, stats)         do { } while (0)
+# define __update_stats_enqueue_sleeper(rq, p, stats)  do { } while (0)
+# define check_schedstat_required()                    do { } while (0)
+
 #endif /* CONFIG_SCHEDSTATS */
 
 #if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SCHEDSTATS)
-- 
2.18.2



* [PATCH v3 4/7] sched: make the output of schedstats independent of fair sched class
  2021-08-24 11:29 [PATCH v3 0/7] sched: support schedstats for RT sched class Yafang Shao
                   ` (2 preceding siblings ...)
  2021-08-24 11:29 ` [PATCH v3 3/7] sched: make schedstats helpers " Yafang Shao
@ 2021-08-24 11:29 ` Yafang Shao
  2021-08-31 11:08   ` Peter Zijlstra
  2021-08-24 11:29 ` [PATCH v3 5/7] sched: introduce task block time in schedstats Yafang Shao
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Yafang Shao @ 2021-08-24 11:29 UTC (permalink / raw)
  To: mingo, peterz, mgorman, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, bristot, achaiken
  Cc: lkp, linux-kernel, linux-rt-users, Yafang Shao

The per-CPU stats can be shown with /proc/sched_debug, which includes
the per-CPU schedstats of each task group. Currently these per-CPU
schedstats are only shown for the fair sched class. If we want to
support other sched classes, we have to make this output independent of
the fair sched class.
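
For readers unfamiliar with the output format, the sketch below models
how a nanosecond counter ends up as "<msec>.<6 digits>" in this file. It
assumes that SPLIT_NS() splits the value at 1000000, which matches the
"%lld.%06ld" format string used by the PN macros; model_format_ns() is a
hypothetical user-space helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a nanosecond value the way the PN-style
 * macros print it, i.e. milliseconds, a dot, then the remaining
 * nanoseconds zero-padded to six digits.
 */
static inline int model_format_ns(char *buf, size_t len, long long nsec)
{
	long long high;
	long low;

	assert(buf);

	high = nsec / 1000000;		/* whole milliseconds */
	low = (long)(nsec % 1000000);	/* leftover nanoseconds */
	if (low < 0)
		low = -low;		/* the fractional part is printed unsigned */

	return snprintf(buf, len, "%lld.%06ld", high, low);
}
```

With this split, a wait_sum of 81741944704 ns reads as 81741.944704,
that is, roughly 81.7 seconds.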

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alison Chaiken <achaiken@aurora.tech>
---
 kernel/sched/debug.c | 70 +++++++++++++++++++++++++++++++-------------
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4cfee2aa1a2d..705987aed658 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -442,11 +442,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	struct sched_entity *se = tg->se[cpu];
 
 #define P(F)		SEQ_printf(m, "  .%-30s: %lld\n",	#F, (long long)F)
-#define P_SCHEDSTAT(F)	SEQ_printf(m, "  .%-30s: %lld\n",	\
-		"se->statistics."#F, (long long)schedstat_val(tg->stats[cpu]->F))
 #define PN(F)		SEQ_printf(m, "  .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))
-#define PN_SCHEDSTAT(F)	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", \
-		"se->statistics."#F, SPLIT_NS((long long)schedstat_val(tg->stats[cpu]->F)))
 
 	if (!se)
 		return;
@@ -454,20 +450,6 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	PN(se->exec_start);
 	PN(se->vruntime);
 	PN(se->sum_exec_runtime);
-
-	if (schedstat_enabled()) {
-		PN_SCHEDSTAT(wait_start);
-		PN_SCHEDSTAT(sleep_start);
-		PN_SCHEDSTAT(block_start);
-		PN_SCHEDSTAT(sleep_max);
-		PN_SCHEDSTAT(block_max);
-		PN_SCHEDSTAT(exec_max);
-		PN_SCHEDSTAT(slice_max);
-		PN_SCHEDSTAT(wait_max);
-		PN_SCHEDSTAT(wait_sum);
-		P_SCHEDSTAT(wait_count);
-	}
-
 	P(se->load.weight);
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
@@ -475,13 +457,60 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->avg.runnable_avg);
 #endif
 
-#undef PN_SCHEDSTAT
 #undef PN
-#undef P_SCHEDSTAT
 #undef P
 }
 #endif
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED)
+struct tg_schedstats {
+	struct seq_file *m;
+	int cpu;
+};
+
+static int tg_show_schedstats(struct task_group *tg, void *data)
+{
+	struct tg_schedstats *p = data;
+	struct seq_file *m = p->m;
+	int cpu = p->cpu;
+
+#define P_SCHEDSTAT(F)	SEQ_printf(m, "  .%-30s: %lld\n",	\
+		"se->statistics."#F, (long long)schedstat_val(tg->stats[cpu]->F))
+#define PN_SCHEDSTAT(F)	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", \
+		"se->statistics."#F, SPLIT_NS((long long)schedstat_val(tg->stats[cpu]->F)))
+
+	PN_SCHEDSTAT(wait_start);
+	PN_SCHEDSTAT(sleep_start);
+	PN_SCHEDSTAT(block_start);
+	PN_SCHEDSTAT(sleep_max);
+	PN_SCHEDSTAT(block_max);
+	PN_SCHEDSTAT(exec_max);
+	PN_SCHEDSTAT(slice_max);
+	PN_SCHEDSTAT(wait_max);
+	PN_SCHEDSTAT(wait_sum);
+	P_SCHEDSTAT(wait_count);
+
+#undef P_SCHEDSTAT
+#undef PN_SCHEDSTAT
+
+	return 0;
+}
+
+static void print_task_group_stats(struct seq_file *m, int cpu)
+{
+	struct tg_schedstats data = {
+		.m = m,
+		.cpu = cpu,
+	};
+
+	if (!schedstat_enabled())
+		return;
+
+	walk_tg_tree(tg_show_schedstats, tg_nop, &data);
+}
+#endif
+
+
 #ifdef CONFIG_CGROUP_SCHED
 static DEFINE_SPINLOCK(sched_debug_lock);
 static char group_path[PATH_MAX];
@@ -756,6 +785,7 @@ do {									\
 	print_cfs_stats(m, cpu);
 	print_rt_stats(m, cpu);
 	print_dl_stats(m, cpu);
+	print_task_group_stats(m, cpu);
 
 	print_rq(m, rq, cpu);
 	SEQ_printf(m, "\n");
-- 
2.18.2



* [PATCH v3 5/7] sched: introduce task block time in schedstats
  2021-08-24 11:29 [PATCH v3 0/7] sched: support schedstats for RT sched class Yafang Shao
                   ` (3 preceding siblings ...)
  2021-08-24 11:29 ` [PATCH v3 4/7] sched: make the output of schedstats " Yafang Shao
@ 2021-08-24 11:29 ` Yafang Shao
  2021-08-24 11:29 ` [PATCH v3 6/7] sched, rt: support sched_stat_runtime tracepoint for RT sched class Yafang Shao
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2021-08-24 11:29 UTC (permalink / raw)
  To: mingo, peterz, mgorman, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, bristot, achaiken
  Cc: lkp, linux-kernel, linux-rt-users, Yafang Shao

Currently in schedstats we have sum_sleep_runtime and iowait_sum, but
there's no metric to show how long the task is in D state. Once a task
is in D state, it is blocked in the kernel; for example, the task may
be waiting for a mutex. The D state is more frequent than iowait, and
it is more critical than the S state. So it is worth adding a metric to
measure it.
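
A simplified user-space model of the resulting accounting is sketched
below: time spent in S state goes to sum_sleep_runtime only, while time
spent in D state goes to both sum_sleep_runtime and the new
sum_block_runtime. The struct and function names are hypothetical, and
tracepoints and iowait accounting are omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the sleep/block fields of struct sched_statistics. */
struct model_sleep_stats {
	uint64_t sleep_start;
	uint64_t sleep_max;
	uint64_t block_start;
	uint64_t block_max;
	int64_t sum_sleep_runtime;
	int64_t sum_block_runtime;
};

/* Called at wakeup with the current clock value. */
static inline void model_enqueue_sleeper(uint64_t now,
					 struct model_sleep_stats *stats)
{
	assert(stats);

	if (stats->sleep_start) {
		uint64_t delta = now - stats->sleep_start;

		if ((int64_t)delta < 0)
			delta = 0;
		if (delta > stats->sleep_max)
			stats->sleep_max = delta;

		stats->sleep_start = 0;
		stats->sum_sleep_runtime += delta;
	}

	if (stats->block_start) {
		uint64_t delta = now - stats->block_start;

		if ((int64_t)delta < 0)
			delta = 0;
		if (delta > stats->block_max)
			stats->block_max = delta;

		stats->block_start = 0;
		stats->sum_sleep_runtime += delta;
		/* The new metric: only D-state time is accumulated here. */
		stats->sum_block_runtime += delta;
	}
}
```

Since sum_sleep_runtime covers both states, the S-state share can be
derived as sum_sleep_runtime - sum_block_runtime.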

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alison Chaiken <achaiken@aurora.tech>
---
 include/linux/sched.h | 2 ++
 kernel/sched/debug.c  | 6 ++++--
 kernel/sched/stats.c  | 1 +
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 39c29eae1af9..7888ad8384ba 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -502,6 +502,8 @@ struct sched_statistics {
 
 	u64				block_start;
 	u64				block_max;
+	s64				sum_block_runtime;
+
 	u64				exec_max;
 	u64				slice_max;
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 705987aed658..5c6bc3f373f0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -560,10 +560,11 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 
-	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
+	SEQ_printf(m, "%9lld.%06ld %9lld.%06ld %9lld.%06ld %9lld.%06ld",
 		SPLIT_NS(schedstat_val_or_zero(p->stats.wait_sum)),
 		SPLIT_NS(p->se.sum_exec_runtime),
-		SPLIT_NS(schedstat_val_or_zero(p->stats.sum_sleep_runtime)));
+		SPLIT_NS(schedstat_val_or_zero(p->stats.sum_sleep_runtime)),
+		SPLIT_NS(schedstat_val_or_zero(p->stats.sum_block_runtime)));
 
 #ifdef CONFIG_NUMA_BALANCING
 	SEQ_printf(m, " %d %d", task_node(p), task_numa_group_id(p));
@@ -995,6 +996,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		u64 avg_atom, avg_per_cpu;
 
 		PN_SCHEDSTAT(sum_sleep_runtime);
+		PN_SCHEDSTAT(sum_block_runtime);
 		PN_SCHEDSTAT(wait_start);
 		PN_SCHEDSTAT(sleep_start);
 		PN_SCHEDSTAT(block_start);
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index b2542f4d3192..21fae41c06f5 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -82,6 +82,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
 
 		__schedstat_set(stats->block_start, 0);
 		__schedstat_add(stats->sum_sleep_runtime, delta);
+		__schedstat_add(stats->sum_block_runtime, delta);
 
 		if (p) {
 			if (p->in_iowait) {
-- 
2.18.2



* [PATCH v3 6/7] sched, rt: support sched_stat_runtime tracepoint for RT sched class
  2021-08-24 11:29 [PATCH v3 0/7] sched: support schedstats for RT sched class Yafang Shao
                   ` (4 preceding siblings ...)
  2021-08-24 11:29 ` [PATCH v3 5/7] sched: introduce task block time in schedstats Yafang Shao
@ 2021-08-24 11:29 ` Yafang Shao
  2021-08-24 11:29 ` [PATCH v3 7/7] sched, rt: support schedstats " Yafang Shao
  2021-08-31 10:08 ` [PATCH v3 0/7] sched: " Peter Zijlstra
  7 siblings, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2021-08-24 11:29 UTC (permalink / raw)
  To: mingo, peterz, mgorman, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, bristot, achaiken
  Cc: lkp, linux-kernel, linux-rt-users, Yafang Shao

The runtime of an RT task is already tracked, so we only need to add a
tracepoint.

One difference between a fair task and an RT task is that an RT task has
no vruntime. To reuse the sched_stat_runtime tracepoint, '0' is passed
as the vruntime for an RT task.

The output of this tracepoint for an RT task is as follows,
          stress-9748    [039] d.h.   113.519352: sched_stat_runtime: comm=stress pid=9748 runtime=997573 [ns] vruntime=0 [ns]
          stress-9748    [039] d.h.   113.520352: sched_stat_runtime: comm=stress pid=9748 runtime=997627 [ns] vruntime=0 [ns]
          stress-9748    [039] d.h.   113.521352: sched_stat_runtime: comm=stress pid=9748 runtime=998203 [ns] vruntime=0 [ns]

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alison Chaiken <achaiken@aurora.tech>
---
 kernel/sched/rt.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 95a7c3ad2dc3..5d251112e51c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1012,6 +1012,8 @@ static void update_curr_rt(struct rq *rq)
 	schedstat_set(curr->stats.exec_max,
 		      max(curr->stats.exec_max, delta_exec));
 
+	trace_sched_stat_runtime(curr, delta_exec, 0);
+
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
 
-- 
2.18.2



* [PATCH v3 7/7] sched, rt: support schedstats for RT sched class
  2021-08-24 11:29 [PATCH v3 0/7] sched: support schedstats for RT sched class Yafang Shao
                   ` (5 preceding siblings ...)
  2021-08-24 11:29 ` [PATCH v3 6/7] sched, rt: support sched_stat_runtime tracepoint for RT sched class Yafang Shao
@ 2021-08-24 11:29 ` Yafang Shao
  2021-08-31 10:08 ` [PATCH v3 0/7] sched: " Peter Zijlstra
  7 siblings, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2021-08-24 11:29 UTC (permalink / raw)
  To: mingo, peterz, mgorman, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, bristot, achaiken
  Cc: lkp, linux-kernel, linux-rt-users, Yafang Shao

We want to measure the latency of RT tasks in our production
environment with schedstats facility, but currently schedstats is only
supported for fair sched class. This patch enables it for RT sched class
as well.

After we make the struct sched_statistics and its helpers independent
of fair sched class, we can easily use the schedstats facility for RT
sched class.

The schedstat usage in RT sched class is similar to that in fair sched
class, for example,
                fair                          RT
enqueue         update_stats_enqueue_fair     update_stats_enqueue_rt
dequeue         update_stats_dequeue_fair     update_stats_dequeue_rt
put_prev_task   update_stats_wait_start_fair  update_stats_wait_start_rt
set_next_task   update_stats_wait_end_fair    update_stats_wait_end_rt
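
The dequeue row of this table can be modelled in user space roughly as
follows: when a task is dequeued because it goes to sleep, its state
decides whether sleep_start or block_start is stamped, and the wakeup
path later turns that stamp into a delta. The MODEL_* constants and the
function name are hypothetical stand-ins for the kernel's task state
flags and for update_stats_dequeue_rt().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for TASK_INTERRUPTIBLE / TASK_UNINTERRUPTIBLE. */
#define MODEL_TASK_INTERRUPTIBLE	0x1u
#define MODEL_TASK_UNINTERRUPTIBLE	0x2u

/* Hypothetical subset of struct sched_statistics. */
struct model_start_stamps {
	uint64_t sleep_start;
	uint64_t block_start;
};

/*
 * Stamp the start of a sleep (S state) or a block (D state) period.
 * 'going_to_sleep' plays the role of the DEQUEUE_SLEEP flag.
 */
static inline void model_dequeue(uint64_t now, unsigned int state,
				 int going_to_sleep,
				 struct model_start_stamps *stats)
{
	assert(stats);

	if (!going_to_sleep)
		return;

	if (state & MODEL_TASK_INTERRUPTIBLE)
		stats->sleep_start = now;

	if (state & MODEL_TASK_UNINTERRUPTIBLE)
		stats->block_start = now;
}
```

Only task entities are stamped this way in the patch; group entities are
skipped on the dequeue side.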

The user can get the schedstats information in the same way as for the
fair sched class. For example,
       fair                            RT
       /proc/[pid]/sched               /proc/[pid]/sched

The output of an RT task's schedstats is as follows,
$ cat /proc/227408/sched
...
se.statistics.sum_sleep_runtime              :        402284.476088
se.statistics.sum_block_runtime              :        402272.475254
se.statistics.wait_start                     :             0.000000
se.statistics.sleep_start                    :             0.000000
se.statistics.block_start                    :      46903176.965093
se.statistics.sleep_max                      :            12.000834
se.statistics.block_max                      :          1446.963040
se.statistics.exec_max                       :             0.463806
se.statistics.slice_max                      :             0.000000
se.statistics.wait_max                       :           146.656326
se.statistics.wait_sum                       :         81741.944704
se.statistics.wait_count                     :                 1004
se.statistics.iowait_sum                     :         77875.399958
se.statistics.iowait_count                   :                  142
se.statistics.nr_migrations_cold             :                    0
se.statistics.nr_failed_migrations_affine    :                    0
se.statistics.nr_failed_migrations_running   :                    0
se.statistics.nr_failed_migrations_hot       :                    0
se.statistics.nr_forced_migrations           :                    0
se.statistics.nr_wakeups                     :                 1003
se.statistics.nr_wakeups_sync                :                    0
se.statistics.nr_wakeups_migrate             :                    0
se.statistics.nr_wakeups_local               :                  351
se.statistics.nr_wakeups_remote              :                  652
se.statistics.nr_wakeups_affine              :                    0
se.statistics.nr_wakeups_affine_attempts     :                    0
se.statistics.nr_wakeups_passive             :                    0
se.statistics.nr_wakeups_idle                :                    0
...

The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can
be used to trace RT tasks as well. The output of these tracepoints for
an RT task is as follows,

- blocked
  kworker/u113:0-230817  [000] d... 47197.452940: sched_stat_blocked: comm=stress pid=227408 delay=4096 [ns]

- iowait
     kworker/3:1-222921  [003] d... 47492.211521: sched_stat_iowait: comm=stress pid=227408 delay=905187613 [ns]

- wait
          stress-227400  [003] d... 47202.283021: sched_stat_wait: comm=stress pid=227408 delay=67958890 [ns]

- runtime
          stress-227408  [003] d... 47202.283027: sched_stat_runtime: comm=stress pid=227408 runtime=7815 [ns] vruntime=0 [ns]

- sleep
           sleep-244868  [022] dN.. 50070.614833: sched_stat_sleep: comm=sleep.sh pid=244300 delay=1001131165 [ns]
           sleep-244869  [022] dN.. 50071.616222: sched_stat_sleep: comm=sleep.sh pid=244300 delay=1001100486 [ns]
           sleep-244879  [022] dN.. 50072.617628: sched_stat_sleep: comm=sleep.sh pid=244300 delay=1001137198 [ns]
           [ sleep.sh sleeps for 1 sec each time. ]

[lkp@intel.com: reported build failure in earlier version]

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alison Chaiken <achaiken@aurora.tech>
---
 kernel/sched/rt.c | 141 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 141 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 5d251112e51c..446164597232 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1273,6 +1273,129 @@ static void __delist_rt_entity(struct sched_rt_entity *rt_se, struct rt_prio_arr
 	rt_se->on_list = 0;
 }
 
+#ifdef CONFIG_RT_GROUP_SCHED
+static inline void
+__schedstats_from_sched_rt_entity(struct sched_rt_entity *rt_se,
+				  struct sched_statistics **stats)
+{
+	struct task_struct *p;
+	struct task_group *tg;
+	struct rt_rq *rt_rq;
+	int cpu;
+
+	if (rt_entity_is_task(rt_se)) {
+		p = rt_task_of(rt_se);
+		*stats = &p->stats;
+	} else {
+		rt_rq = group_rt_rq(rt_se);
+		tg = rt_rq->tg;
+		cpu = cpu_of(rq_of_rt_rq(rt_rq));
+		*stats = tg->stats[cpu];
+	}
+}
+
+#else
+
+static inline void
+__schedstats_from_sched_rt_entity(struct sched_rt_entity *rt_se,
+				  struct sched_statistics **stats)
+{
+	struct task_struct *p;
+
+	p = rt_task_of(rt_se);
+	*stats = &p->stats;
+}
+
+#endif
+
+static inline void
+update_stats_wait_start_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
+{
+	struct sched_statistics *stats = NULL;
+	struct task_struct *p = NULL;
+
+	if (!schedstat_enabled())
+		return;
+
+	if (rt_entity_is_task(rt_se))
+		p = rt_task_of(rt_se);
+
+	__schedstats_from_sched_rt_entity(rt_se, &stats);
+
+	__update_stats_wait_start(rq_of_rt_rq(rt_rq), p, stats);
+}
+
+static inline void
+update_stats_enqueue_sleeper_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
+{
+	struct sched_statistics *stats = NULL;
+	struct task_struct *p = NULL;
+
+	if (!schedstat_enabled())
+		return;
+
+	if (rt_entity_is_task(rt_se))
+		p = rt_task_of(rt_se);
+
+	__schedstats_from_sched_rt_entity(rt_se, &stats);
+
+	__update_stats_enqueue_sleeper(rq_of_rt_rq(rt_rq), p, stats);
+}
+
+static inline void
+update_stats_enqueue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
+			int flags)
+{
+	if (!schedstat_enabled())
+		return;
+
+	if (flags & ENQUEUE_WAKEUP)
+		update_stats_enqueue_sleeper_rt(rt_rq, rt_se);
+}
+
+static inline void
+update_stats_wait_end_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
+{
+	struct sched_statistics *stats = NULL;
+	struct task_struct *p = NULL;
+
+	if (!schedstat_enabled())
+		return;
+
+	if (rt_entity_is_task(rt_se))
+		p = rt_task_of(rt_se);
+
+	__schedstats_from_sched_rt_entity(rt_se, &stats);
+
+	__update_stats_wait_end(rq_of_rt_rq(rt_rq), p, stats);
+}
+
+static inline void
+update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
+			int flags)
+{
+	struct task_struct *p = NULL;
+
+	if (!schedstat_enabled())
+		return;
+
+	if (rt_entity_is_task(rt_se))
+		p = rt_task_of(rt_se);
+
+	if ((flags & DEQUEUE_SLEEP) && p) {
+		unsigned int state;
+
+		state = READ_ONCE(p->__state);
+		if (state & TASK_INTERRUPTIBLE)
+			__schedstat_set(p->stats.sleep_start,
+					rq_clock(rq_of_rt_rq(rt_rq)));
+
+		if (state & TASK_UNINTERRUPTIBLE)
+			__schedstat_set(p->stats.block_start,
+					rq_clock(rq_of_rt_rq(rt_rq)));
+	}
+}
+
 static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
 {
 	struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
@@ -1346,6 +1469,8 @@ static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
 {
 	struct rq *rq = rq_of_rt_se(rt_se);
 
+	update_stats_enqueue_rt(rt_rq_of_se(rt_se), rt_se, flags);
+
 	dequeue_rt_stack(rt_se, flags);
 	for_each_sched_rt_entity(rt_se)
 		__enqueue_rt_entity(rt_se, flags);
@@ -1356,6 +1481,8 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
 {
 	struct rq *rq = rq_of_rt_se(rt_se);
 
+	update_stats_dequeue_rt(rt_rq_of_se(rt_se), rt_se, flags);
+
 	dequeue_rt_stack(rt_se, flags);
 
 	for_each_sched_rt_entity(rt_se) {
@@ -1378,6 +1505,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	if (flags & ENQUEUE_WAKEUP)
 		rt_se->timeout = 0;
 
+	check_schedstat_required();
+	update_stats_wait_start_rt(rt_rq_of_se(rt_se), rt_se);
+
 	enqueue_rt_entity(rt_se, flags);
 
 	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
@@ -1578,7 +1708,12 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag
 
 static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first)
 {
+	struct sched_rt_entity *rt_se = &p->rt;
+	struct rt_rq *rt_rq = &rq->rt;
+
 	p->se.exec_start = rq_clock_task(rq);
+	if (on_rt_rq(&p->rt))
+		update_stats_wait_end_rt(rt_rq, rt_se);
 
 	/* The running task is never eligible for pushing */
 	dequeue_pushable_task(rq, p);
@@ -1652,6 +1787,12 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
 
 static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
+	struct sched_rt_entity *rt_se = &p->rt;
+	struct rt_rq *rt_rq = &rq->rt;
+
+	if (on_rt_rq(&p->rt))
+		update_stats_wait_start_rt(rt_rq, rt_se);
+
 	update_curr_rt(rq);
 
 	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
-- 
2.18.2
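
The RT hooks in this patch stamp wait_start when a task is enqueued or put back (enqueue_task_rt, put_prev_task_rt) and close the interval when it is picked (set_next_task_rt). Below is a minimal userspace sketch of that pairing, modeled on the fair-class update_stats_wait_end() quoted later in this thread. The struct and function names (wait_stats, mark_wait_start, mark_wait_end) are hypothetical, the clock is a plain integer, and the migration adjustment is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the wait-related fields of sched_statistics. */
struct wait_stats {
	uint64_t wait_start;	/* timestamp of the pending wait, 0 if none */
	uint64_t wait_max;	/* longest single wait seen so far */
	uint64_t wait_count;	/* number of completed waits */
	uint64_t wait_sum;	/* total time spent waiting */
};

/* Called when the entity becomes runnable but is not running. */
static void mark_wait_start(struct wait_stats *s, uint64_t now)
{
	s->wait_start = now;
}

/* Called when the entity is picked to run. */
static void mark_wait_end(struct wait_stats *s, uint64_t now)
{
	uint64_t delta;

	/*
	 * If statistics were switched on while the entity was already
	 * queued, no start stamp exists; skip rather than account a
	 * bogus delta.
	 */
	if (!s->wait_start)
		return;

	assert(now >= s->wait_start);
	delta = now - s->wait_start;

	if (delta > s->wait_max)
		s->wait_max = delta;
	s->wait_count++;
	s->wait_sum += delta;
	s->wait_start = 0;
}
```

The zero check mirrors the guard in the fair-class helper: an entity that was queued before schedstats was enabled contributes nothing instead of a huge delta.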



* Re: [PATCH v3 0/7] sched: support schedstats for RT sched class
  2021-08-24 11:29 [PATCH v3 0/7] sched: support schedstats for RT sched class Yafang Shao
                   ` (6 preceding siblings ...)
  2021-08-24 11:29 ` [PATCH v3 7/7] sched, rt: support schedstats " Yafang Shao
@ 2021-08-31 10:08 ` Peter Zijlstra
  2021-08-31 10:44   ` Peter Zijlstra
  2021-08-31 12:57   ` Yafang Shao
  7 siblings, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2021-08-31 10:08 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, mgorman, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, bristot, achaiken, lkp, linux-kernel,
	linux-rt-users

On Tue, Aug 24, 2021 at 11:29:39AM +0000, Yafang Shao wrote:
> Hi Ingo, Peter,
> 
> This feature is useful to trace the sched details of RT tasks. Hopefully
> you can give some feedback on it.
> 
> We want to measure the latency of RT tasks in our production
> environment with schedstats facility, but currently schedstats is only
> supported for fair sched class. In order to support if for other sched
> classes, we should make it independent of fair sched class. The struct
> sched_statistics is the schedular statistics of a task_struct or a
> task_group, both of which are independent of sched class. So we can move
> struct sched_statistics into struct task_struct and struct task_group to
> achieve the goal.

Do you really want schedstats or do you want the tracepoints? In general
I really want to cut back on the built-in statistics crud we carry,
there's too much and it seems to keep growing forever :-(

(as is the case here, you're extending it as well)

That said; making schedstats cover the other classes can be seen as
fixing an inconsistency, but then you forgot deadline.

> After the patchset, schestats are orgnized as follows,
> struct task_struct {
>     ...
>     struct sched_statistics statistics;
>     ...
>     struct sched_entity *se;
>     struct sched_rt_entity *rt;
>     ...
> };
> 
> struct task_group {                    |---> stats[0] : of CPU0
>     ...                                |
>     struct sched_statistics **stats; --|---> stats[1] : of CPU1
>     ...                                |
>                                        |---> stats[n] : of CPUn
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>     struct sched_entity **se;
>  #endif
>  #ifdef CONFIG_RT_GROUP_SCHED
>     struct sched_rt_entity  **rt_se;
>  #endif
>     ...
> };

Yeah, this seems to give a terrible mess, let me see if I can come up
with anything less horrible.


* Re: [PATCH v3 2/7] sched: make struct sched_statistics independent of fair sched class
  2021-08-24 11:29 ` [PATCH v3 2/7] sched: make struct sched_statistics independent of fair sched class Yafang Shao
@ 2021-08-31 10:14   ` Peter Zijlstra
  2021-08-31 13:25     ` Yafang Shao
  2021-08-31 10:19   ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2021-08-31 10:14 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, mgorman, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, bristot, achaiken, lkp, linux-kernel,
	linux-rt-users

On Tue, Aug 24, 2021 at 11:29:41AM +0000, Yafang Shao wrote:
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 49716228efb4..4cfee2aa1a2d 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c

> @@ -442,9 +442,11 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>  	struct sched_entity *se = tg->se[cpu];
>  
>  #define P(F)		SEQ_printf(m, "  .%-30s: %lld\n",	#F, (long long)F)
> -#define P_SCHEDSTAT(F)	SEQ_printf(m, "  .%-30s: %lld\n",	#F, (long long)schedstat_val(F))
> +#define P_SCHEDSTAT(F)	SEQ_printf(m, "  .%-30s: %lld\n",	\
> +		"se->statistics."#F, (long long)schedstat_val(tg->stats[cpu]->F))
>  #define PN(F)		SEQ_printf(m, "  .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))
> -#define PN_SCHEDSTAT(F)	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)schedstat_val(F)))
> +#define PN_SCHEDSTAT(F)	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", \
> +		"se->statistics."#F, SPLIT_NS((long long)schedstat_val(tg->stats[cpu]->F)))
>  
>  	if (!se)
>  		return;

> @@ -948,8 +950,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>  		"---------------------------------------------------------"
>  		"----------\n");
>  
> -#define P_SCHEDSTAT(F)  __PS(#F, schedstat_val(p->F))
> -#define PN_SCHEDSTAT(F) __PSN(#F, schedstat_val(p->F))
> +#define P_SCHEDSTAT(F)  __PS("se.statistics."#F, schedstat_val(p->stats.F))
> +#define PN_SCHEDSTAT(F) __PSN("se.statistics."#F, schedstat_val(p->stats.F))
>  
>  	PN(se.exec_start);
>  	PN(se.vruntime);

That's sad... can't we keep it #F, this is all SCHED_DEBUG code anyway.


* Re: [PATCH v3 2/7] sched: make struct sched_statistics independent of fair sched class
  2021-08-24 11:29 ` [PATCH v3 2/7] sched: make struct sched_statistics independent of fair sched class Yafang Shao
  2021-08-31 10:14   ` Peter Zijlstra
@ 2021-08-31 10:19   ` Peter Zijlstra
  2021-08-31 13:25     ` Yafang Shao
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2021-08-31 10:19 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, mgorman, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, bristot, achaiken, lkp, linux-kernel,
	linux-rt-users

On Tue, Aug 24, 2021 at 11:29:41AM +0000, Yafang Shao wrote:

> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static inline void
> +__schedstats_from_sched_entity(struct sched_entity *se,
> +			      struct sched_statistics **stats)
> +{
> +	struct task_group *tg;
> +	struct task_struct *p;
> +	struct cfs_rq *cfs;
> +	int cpu;
> +
> +	if (entity_is_task(se)) {
> +		p = task_of(se);
> +		*stats = &p->stats;
> +	} else {
> +		cfs = group_cfs_rq(se);
> +		tg = cfs->tg;
> +		cpu = cpu_of(rq_of(cfs));
> +		*stats = tg->stats[cpu];
> +	}
> +}
> +
> +#else
> +
> +static inline void
> +__schedstats_from_sched_entity(struct sched_entity *se,
> +			      struct sched_statistics **stats)
> +{
> +	struct task_struct *p;
> +
> +	p = task_of(se);
> +	*stats = &p->stats;
> +}
> +
> +#endif
> +
>  /*
>   * Update the current task's runtime statistics.
>   */
> @@ -826,6 +861,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
>  {
>  	struct sched_entity *curr = cfs_rq->curr;
>  	u64 now = rq_clock_task(rq_of(cfs_rq));
> +	struct sched_statistics *stats = NULL;
>  	u64 delta_exec;
>  
>  	if (unlikely(!curr))
> @@ -837,8 +873,11 @@ static void update_curr(struct cfs_rq *cfs_rq)
>  
>  	curr->exec_start = now;
>  
> -	schedstat_set(curr->statistics.exec_max,
> -		      max(delta_exec, curr->statistics.exec_max));
> +	if (schedstat_enabled()) {
> +		__schedstats_from_sched_entity(curr, &stats);
> +		__schedstat_set(stats->exec_max,
> +				max(delta_exec, stats->exec_max));
> +	}
>  
>  	curr->sum_exec_runtime += delta_exec;
>  	schedstat_add(cfs_rq->exec_clock, delta_exec);


That's just really odd style; what's wrong with something like:

static inline struct sched_statistics *
__schedstats_from_se(struct sched_entity *se)
{
	...
}

	if (schedstat_enabled()) {
		struct sched_statistics *stats = __schedstats_from_se(curr);
		__schedstat_set(stats->exec_max, max(stats->exec_max, delta_exec));
	}
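
To make the two styles easy to compare outside the kernel tree, here is a small userspace sketch. The types and names (struct stats, struct entity, stats_from_entity_out, stats_from_entity, update_exec_max) are hypothetical stand-ins, not kernel API; the only point is that returning the pointer lets the caller declare and initialize the variable in one statement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for sched_statistics / sched_entity. */
struct stats {
	uint64_t exec_max;
};

struct entity {
	struct stats stats;
};

/* Out-parameter style, as in the posted patch. */
static inline void stats_from_entity_out(struct entity *e, struct stats **out)
{
	*out = &e->stats;
}

/* Return-value style, as suggested in the review. */
static inline struct stats *stats_from_entity(struct entity *e)
{
	return &e->stats;
}

/* Track the largest runtime delta seen, using the return-value helper. */
static inline void update_exec_max(struct entity *e, uint64_t delta_exec)
{
	struct stats *s;

	assert(e != NULL);
	s = stats_from_entity(e);

	if (delta_exec > s->exec_max)
		s->exec_max = delta_exec;
}
```

Both helpers yield the same pointer; the second simply avoids the NULL-initialized local and the address-of at every call site.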




* Re: [PATCH v3 0/7] sched: support schedstats for RT sched class
  2021-08-31 10:08 ` [PATCH v3 0/7] sched: " Peter Zijlstra
@ 2021-08-31 10:44   ` Peter Zijlstra
  2021-08-31 13:21     ` Yafang Shao
  2021-08-31 12:57   ` Yafang Shao
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2021-08-31 10:44 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, mgorman, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, bristot, achaiken, lkp, linux-kernel,
	linux-rt-users

On Tue, Aug 31, 2021 at 12:08:15PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 24, 2021 at 11:29:39AM +0000, Yafang Shao wrote:

> > After the patchset, schestats are orgnized as follows,
> > struct task_struct {
> >     ...
> >     struct sched_statistics statistics;
> >     ...
> >     struct sched_entity *se;
> >     struct sched_rt_entity *rt;
> >     ...
> > };
> > 
> > struct task_group {                    |---> stats[0] : of CPU0
> >     ...                                |
> >     struct sched_statistics **stats; --|---> stats[1] : of CPU1
> >     ...                                |
> >                                        |---> stats[n] : of CPUn
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >     struct sched_entity **se;
> >  #endif
> >  #ifdef CONFIG_RT_GROUP_SCHED
> >     struct sched_rt_entity  **rt_se;
> >  #endif
> >     ...
> > };
> 
> Yeah, this seems to give a terrible mess, let me see if I can come up
> with anything less horrible.

Here, isn't this *MUCH* saner ?

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -521,7 +521,7 @@ struct sched_statistics {
 	u64				nr_wakeups_passive;
 	u64				nr_wakeups_idle;
 #endif
-};
+} ____cacheline_aligned;
 
 struct sched_entity {
 	/* For load-balancing: */
@@ -537,8 +537,6 @@ struct sched_entity {
 
 	u64				nr_migrations;
 
-	struct sched_statistics		statistics;
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int				depth;
 	struct sched_entity		*parent;
@@ -802,6 +800,8 @@ struct task_struct {
 	struct uclamp_se		uclamp[UCLAMP_CNT];
 #endif
 
+	struct sched_statistics         stats;
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* List of struct preempt_notifier: */
 	struct hlist_head		preempt_notifiers;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3489,11 +3489,11 @@ ttwu_stat(struct task_struct *p, int cpu
 #ifdef CONFIG_SMP
 	if (cpu == rq->cpu) {
 		__schedstat_inc(rq->ttwu_local);
-		__schedstat_inc(p->se.statistics.nr_wakeups_local);
+		__schedstat_inc(p->stats.nr_wakeups_local);
 	} else {
 		struct sched_domain *sd;
 
-		__schedstat_inc(p->se.statistics.nr_wakeups_remote);
+		__schedstat_inc(p->stats.nr_wakeups_remote);
 		rcu_read_lock();
 		for_each_domain(rq->cpu, sd) {
 			if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
@@ -3505,14 +3505,14 @@ ttwu_stat(struct task_struct *p, int cpu
 	}
 
 	if (wake_flags & WF_MIGRATED)
-		__schedstat_inc(p->se.statistics.nr_wakeups_migrate);
+		__schedstat_inc(p->stats.nr_wakeups_migrate);
 #endif /* CONFIG_SMP */
 
 	__schedstat_inc(rq->ttwu_count);
-	__schedstat_inc(p->se.statistics.nr_wakeups);
+	__schedstat_inc(p->stats.nr_wakeups);
 
 	if (wake_flags & WF_SYNC)
-		__schedstat_inc(p->se.statistics.nr_wakeups_sync);
+		__schedstat_inc(p->stats.nr_wakeups_sync);
 }
 
 /*
@@ -4196,7 +4196,7 @@ static void __sched_fork(unsigned long c
 
 #ifdef CONFIG_SCHEDSTATS
 	/* Even if schedstat is disabled, there should not be garbage */
-	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
+	memset(&p->stats, 0, sizeof(p->stats));
 #endif
 
 	RB_CLEAR_NODE(&p->dl.rb_node);
@@ -9619,9 +9619,9 @@ void normalize_rt_tasks(void)
 			continue;
 
 		p->se.exec_start = 0;
-		schedstat_set(p->se.statistics.wait_start,  0);
-		schedstat_set(p->se.statistics.sleep_start, 0);
-		schedstat_set(p->se.statistics.block_start, 0);
+		schedstat_set(p->stats.wait_start,  0);
+		schedstat_set(p->stats.sleep_start, 0);
+		schedstat_set(p->stats.block_start, 0);
 
 		if (!dl_task(p) && !rt_task(p)) {
 			/*
@@ -10467,7 +10467,7 @@ static int cpu_cfs_stat_show(struct seq_
 		int i;
 
 		for_each_possible_cpu(i)
-			ws += schedstat_val(tg->se[i]->statistics.wait_sum);
+			ws += schedstat_val(tg->stats[i]->wait_sum);
 
 		seq_printf(sf, "wait_sum %llu\n", ws);
 	}
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1265,8 +1265,8 @@ static void update_curr_dl(struct rq *rq
 		return;
 	}
 
-	schedstat_set(curr->se.statistics.exec_max,
-		      max(curr->se.statistics.exec_max, delta_exec));
+	schedstat_set(curr->stats.exec_max,
+		      max(curr->stats.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -819,6 +819,21 @@ static void update_tg_load_avg(struct cf
 }
 #endif /* CONFIG_SMP */
 
+struct sched_entity_stats {
+	struct sched_entity	se;
+	struct sched_statistics	stats;
+} __no_randomize_layout;
+
+static inline struct sched_statistics *
+__schedstats_from_se(struct sched_entity *se)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (!entity_is_task(se))
+		return &container_of(se, struct sched_entity_stats, se)->stats;
+#endif
+	return &task_of(se)->stats;
+}
+
 /*
  * Update the current task's runtime statistics.
  */
@@ -837,8 +852,10 @@ static void update_curr(struct cfs_rq *c
 
 	curr->exec_start = now;
 
-	schedstat_set(curr->statistics.exec_max,
-		      max(delta_exec, curr->statistics.exec_max));
+	if (schedstat_enabled()) {
+		struct sched_statistics *stats = __schedstats_from_se(curr);
+		__schedstat_set(stats->exec_max, max(delta_exec, stats->exec_max));
+	}
 
 	curr->sum_exec_runtime += delta_exec;
 	schedstat_add(cfs_rq->exec_clock, delta_exec);
@@ -866,39 +883,45 @@ static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	u64 wait_start, prev_wait_start;
+	struct sched_statistics *stats;
 
 	if (!schedstat_enabled())
 		return;
 
+	stats = __schedstats_from_se(se);
+
 	wait_start = rq_clock(rq_of(cfs_rq));
-	prev_wait_start = schedstat_val(se->statistics.wait_start);
+	prev_wait_start = schedstat_val(stats->wait_start);
 
 	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
 	    likely(wait_start > prev_wait_start))
 		wait_start -= prev_wait_start;
 
-	__schedstat_set(se->statistics.wait_start, wait_start);
+	__schedstat_set(stats->wait_start, wait_start);
 }
 
 static inline void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	struct task_struct *p;
+	struct sched_statistics *stats;
+	struct task_struct *p = NULL;
 	u64 delta;
 
 	if (!schedstat_enabled())
 		return;
 
+	stats = __schedstats_from_se(se);
+
 	/*
 	 * When the sched_schedstat changes from 0 to 1, some sched se
 	 * maybe already in the runqueue, the se->statistics.wait_start
 	 * will be 0.So it will let the delta wrong. We need to avoid this
 	 * scenario.
 	 */
-	if (unlikely(!schedstat_val(se->statistics.wait_start)))
+	if (unlikely(!schedstat_val(stats->wait_start)))
 		return;
 
-	delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(se->statistics.wait_start);
+	delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(stats->wait_start);
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
@@ -908,30 +931,33 @@ update_stats_wait_end(struct cfs_rq *cfs
 			 * time stamp can be adjusted to accumulate wait time
 			 * prior to migration.
 			 */
-			__schedstat_set(se->statistics.wait_start, delta);
+			__schedstat_set(stats->wait_start, delta);
 			return;
 		}
 		trace_sched_stat_wait(p, delta);
 	}
 
-	__schedstat_set(se->statistics.wait_max,
-		      max(schedstat_val(se->statistics.wait_max), delta));
-	__schedstat_inc(se->statistics.wait_count);
-	__schedstat_add(se->statistics.wait_sum, delta);
-	__schedstat_set(se->statistics.wait_start, 0);
+	__schedstat_set(stats->wait_max,
+		      max(schedstat_val(stats->wait_max), delta));
+	__schedstat_inc(stats->wait_count);
+	__schedstat_add(stats->wait_sum, delta);
+	__schedstat_set(stats->wait_start, 0);
 }
 
 static inline void
 update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	struct sched_statistics *stats;
 	struct task_struct *tsk = NULL;
 	u64 sleep_start, block_start;
 
 	if (!schedstat_enabled())
 		return;
 
-	sleep_start = schedstat_val(se->statistics.sleep_start);
-	block_start = schedstat_val(se->statistics.block_start);
+	stats = __schedstats_from_se(se);
+
+	sleep_start = schedstat_val(stats->sleep_start);
+	block_start = schedstat_val(stats->block_start);
 
 	if (entity_is_task(se))
 		tsk = task_of(se);
@@ -942,11 +968,11 @@ update_stats_enqueue_sleeper(struct cfs_
 		if ((s64)delta < 0)
 			delta = 0;
 
-		if (unlikely(delta > schedstat_val(se->statistics.sleep_max)))
-			__schedstat_set(se->statistics.sleep_max, delta);
+		if (unlikely(delta > schedstat_val(stats->sleep_max)))
+			__schedstat_set(stats->sleep_max, delta);
 
-		__schedstat_set(se->statistics.sleep_start, 0);
-		__schedstat_add(se->statistics.sum_sleep_runtime, delta);
+		__schedstat_set(stats->sleep_start, 0);
+		__schedstat_add(stats->sum_sleep_runtime, delta);
 
 		if (tsk) {
 			account_scheduler_latency(tsk, delta >> 10, 1);
@@ -959,16 +985,16 @@ update_stats_enqueue_sleeper(struct cfs_
 		if ((s64)delta < 0)
 			delta = 0;
 
-		if (unlikely(delta > schedstat_val(se->statistics.block_max)))
-			__schedstat_set(se->statistics.block_max, delta);
+		if (unlikely(delta > schedstat_val(stats->block_max)))
+			__schedstat_set(stats->block_max, delta);
 
-		__schedstat_set(se->statistics.block_start, 0);
-		__schedstat_add(se->statistics.sum_sleep_runtime, delta);
+		__schedstat_set(stats->block_start, 0);
+		__schedstat_add(stats->sum_sleep_runtime, delta);
 
 		if (tsk) {
 			if (tsk->in_iowait) {
-				__schedstat_add(se->statistics.iowait_sum, delta);
-				__schedstat_inc(se->statistics.iowait_count);
+				__schedstat_add(stats->iowait_sum, delta);
+				__schedstat_inc(stats->iowait_count);
 				trace_sched_stat_iowait(tsk, delta);
 			}
 
@@ -1030,10 +1056,10 @@ update_stats_dequeue(struct cfs_rq *cfs_
 		/* XXX racy against TTWU */
 		state = READ_ONCE(tsk->__state);
 		if (state & TASK_INTERRUPTIBLE)
-			__schedstat_set(se->statistics.sleep_start,
+			__schedstat_set(tsk->stats.sleep_start,
 				      rq_clock(rq_of(cfs_rq)));
 		if (state & TASK_UNINTERRUPTIBLE)
-			__schedstat_set(se->statistics.block_start,
+			__schedstat_set(tsk->stats.block_start,
 				      rq_clock(rq_of(cfs_rq)));
 	}
 }
@@ -4502,9 +4528,10 @@ set_next_entity(struct cfs_rq *cfs_rq, s
 	 */
 	if (schedstat_enabled() &&
 	    rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
-		__schedstat_set(se->statistics.slice_max,
-			max((u64)schedstat_val(se->statistics.slice_max),
-			    se->sum_exec_runtime - se->prev_sum_exec_runtime));
+		struct sched_statistics *stats = __schedstats_from_se(se);
+		__schedstat_set(stats->slice_max,
+				max((u64)stats->slice_max,
+				    se->sum_exec_runtime - se->prev_sum_exec_runtime));
 	}
 
 	se->prev_sum_exec_runtime = se->sum_exec_runtime;
@@ -5993,12 +6020,12 @@ static int wake_affine(struct sched_doma
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
-	schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
+	schedstat_inc(p->stats.nr_wakeups_affine_attempts);
 	if (target == nr_cpumask_bits)
 		return prev_cpu;
 
 	schedstat_inc(sd->ttwu_move_affine);
-	schedstat_inc(p->se.statistics.nr_wakeups_affine);
+	schedstat_inc(p->stats.nr_wakeups_affine);
 	return target;
 }
 
@@ -7802,7 +7829,7 @@ int can_migrate_task(struct task_struct
 	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
 		int cpu;
 
-		schedstat_inc(p->se.statistics.nr_failed_migrations_affine);
+		schedstat_inc(p->stats.nr_failed_migrations_affine);
 
 		env->flags |= LBF_SOME_PINNED;
 
@@ -7836,7 +7863,7 @@ int can_migrate_task(struct task_struct
 	env->flags &= ~LBF_ALL_PINNED;
 
 	if (task_running(env->src_rq, p)) {
-		schedstat_inc(p->se.statistics.nr_failed_migrations_running);
+		schedstat_inc(p->stats.nr_failed_migrations_running);
 		return 0;
 	}
 
@@ -7858,12 +7885,12 @@ int can_migrate_task(struct task_struct
 	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 		if (tsk_cache_hot == 1) {
 			schedstat_inc(env->sd->lb_hot_gained[env->idle]);
-			schedstat_inc(p->se.statistics.nr_forced_migrations);
+			schedstat_inc(p->stats.nr_forced_migrations);
 		}
 		return 1;
 	}
 
-	schedstat_inc(p->se.statistics.nr_failed_migrations_hot);
+	schedstat_inc(p->stats.nr_failed_migrations_hot);
 	return 0;
 }
 
@@ -11390,7 +11417,7 @@ int alloc_fair_sched_group(struct task_g
 		if (!cfs_rq)
 			goto err;
 
-		se = kzalloc_node(sizeof(struct sched_entity),
+		se = kzalloc_node(sizeof(struct sched_entity_stats),
 				  GFP_KERNEL, cpu_to_node(i));
 		if (!se)
 			goto err_free_rq;
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1009,8 +1009,8 @@ static void update_curr_rt(struct rq *rq
 	if (unlikely((s64)delta_exec <= 0))
 		return;
 
-	schedstat_set(curr->se.statistics.exec_max,
-		      max(curr->se.statistics.exec_max, delta_exec));
+	schedstat_set(curr->stats.exec_max,
+		      max(curr->stats.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -41,6 +41,7 @@ rq_sched_info_dequeue(struct rq *rq, uns
 #define   schedstat_val_or_zero(var)	((schedstat_enabled()) ? (var) : 0)
 
 #else /* !CONFIG_SCHEDSTATS: */
+
 static inline void rq_sched_info_arrive  (struct rq *rq, unsigned long long delta) { }
 static inline void rq_sched_info_dequeue(struct rq *rq, unsigned long long delta) { }
 static inline void rq_sched_info_depart  (struct rq *rq, unsigned long long delta) { }
@@ -53,6 +54,7 @@ static inline void rq_sched_info_depart
 # define   schedstat_set(var, val)	do { } while (0)
 # define   schedstat_val(var)		0
 # define   schedstat_val_or_zero(var)	0
+
 #endif /* CONFIG_SCHEDSTATS */
 
 #ifdef CONFIG_PSI
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -78,8 +78,8 @@ static void put_prev_task_stop(struct rq
 	if (unlikely((s64)delta_exec < 0))
 		delta_exec = 0;
 
-	schedstat_set(curr->se.statistics.exec_max,
-			max(curr->se.statistics.exec_max, delta_exec));
+	schedstat_set(curr->stats.exec_max,
+		      max(curr->stats.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
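
The core of the diff above is the sched_entity_stats wrapper: group entities are allocated with the statistics placed directly behind the entity, so container_of() can recover them, while task entities keep using the stats embedded in the task. A minimal userspace sketch of that layout trick follows. All type and function names here (struct entity, struct task, struct entity_stats, stats_from_entity, alloc_group_entity) are hypothetical, and the container_of macro is a simplified version without the kernel's type checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified userspace container_of(); the kernel version adds type checks. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct stats {
	uint64_t wait_sum;
};

struct entity {
	int is_task;		/* stand-in for entity_is_task() */
};

/* A task embeds both its entity and its statistics. */
struct task {
	struct entity se;
	struct stats stats;
};

/* A group entity is allocated together with its statistics. */
struct entity_stats {
	struct entity se;
	struct stats stats;
};

/* Map an entity to its statistics without storing an extra pointer. */
static struct stats *stats_from_entity(struct entity *se)
{
	assert(se != NULL);

	if (!se->is_task)
		return &container_of(se, struct entity_stats, se)->stats;
	return &container_of(se, struct task, se)->stats;
}

/* Allocate a zeroed group entity; the caller only ever sees the entity. */
static struct entity *alloc_group_entity(void)
{
	struct entity_stats *es = calloc(1, sizeof(*es));

	return es ? &es->se : NULL;
}
```

The sketch only works because every non-task entity really is allocated as the wrapper type, which is why the diff changes the kzalloc_node() size in alloc_fair_sched_group().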


* Re: [PATCH v3 3/7] sched: make schedstats helpers independent of fair sched class
  2021-08-24 11:29 ` [PATCH v3 3/7] sched: make schedstats helpers " Yafang Shao
@ 2021-08-31 11:07   ` Peter Zijlstra
  2021-08-31 13:27     ` Yafang Shao
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2021-08-31 11:07 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, mgorman, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, bristot, achaiken, lkp, linux-kernel,
	linux-rt-users

On Tue, Aug 24, 2021 at 11:29:42AM +0000, Yafang Shao wrote:
> diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
> index 3f93fc3b5648..b2542f4d3192 100644
> --- a/kernel/sched/stats.c
> +++ b/kernel/sched/stats.c
> @@ -4,6 +4,109 @@
>   */
>  #include "sched.h"
>  
> +void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
> +			       struct sched_statistics *stats)
> +{
> +u64 wait_start, prev_wait_start;

indent fail...

> +
> +	wait_start = rq_clock(rq);
> +	prev_wait_start = schedstat_val(stats->wait_start);
> +
> +	if (p && likely(wait_start > prev_wait_start))
> +		wait_start -= prev_wait_start;
> +
> +	__schedstat_set(stats->wait_start, wait_start);
> +}

> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index e6905e369c5d..9ecd81b91f26 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h

> @@ -40,6 +42,33 @@ rq_sched_info_dequeue(struct rq *rq, unsigned long long delta)
>  #define   schedstat_val(var)		(var)
>  #define   schedstat_val_or_zero(var)	((schedstat_enabled()) ? (var) : 0)
>  
> +void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
> +			       struct sched_statistics *stats);
> +
> +void __update_stats_wait_end(struct rq *rq, struct task_struct *p,
> +			     struct sched_statistics *stats);
> +void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
> +				    struct sched_statistics *stats);
> +
> +static inline void
> +check_schedstat_required(void)
> +{
> +	if (schedstat_enabled())
> +		return;
> +
> +	/* Force schedstat enabled if a dependent tracepoint is active */
> +	if (trace_sched_stat_wait_enabled()    ||
> +		trace_sched_stat_sleep_enabled()   ||
> +		trace_sched_stat_iowait_enabled()  ||
> +		trace_sched_stat_blocked_enabled() ||
> +		trace_sched_stat_runtime_enabled())  {
> +		printk_deferred_once("Scheduler tracepoints stat_sleep, stat_iowait, "
> +					"stat_blocked and stat_runtime require the "
> +					"kernel parameter schedstats=enable or "
> +					"kernel.sched_schedstats=1\n");
> +	}
> +}

If you're moving this, you might as well reflow it to not have broken
indentation.


* Re: [PATCH v3 4/7] sched: make the output of schedstats independent of fair sched class
  2021-08-24 11:29 ` [PATCH v3 4/7] sched: make the output of schedstats " Yafang Shao
@ 2021-08-31 11:08   ` Peter Zijlstra
  2021-08-31 13:27     ` Yafang Shao
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2021-08-31 11:08 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, mgorman, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, bristot, achaiken, lkp, linux-kernel,
	linux-rt-users

On Tue, Aug 24, 2021 at 11:29:43AM +0000, Yafang Shao wrote:
> The per cpu stats can be show with /proc/sched_debug, which includes the
> per cpu schedstats of each task group. Currently these per cpu
> schedstats only show for the fair sched class. If we want to support
> other sched classes, we have to make these output independent of fair
> sched class.

Arguably the whole rt group stuff needs to die, please don't enable it
further.


* Re: [PATCH v3 0/7] sched: support schedstats for RT sched class
  2021-08-31 10:08 ` [PATCH v3 0/7] sched: " Peter Zijlstra
  2021-08-31 10:44   ` Peter Zijlstra
@ 2021-08-31 12:57   ` Yafang Shao
  1 sibling, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2021-08-31 12:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mel Gorman, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall,
	Daniel Bristot de Oliveira, Alison Chaiken, kbuild test robot,
	LKML, linux-rt-users

On Tue, Aug 31, 2021 at 6:08 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 24, 2021 at 11:29:39AM +0000, Yafang Shao wrote:
> > Hi Ingo, Peter,
> >
> > This feature is useful to trace the sched details of RT tasks. Hopefully
> > you can give some feedback on it.
> >
> > We want to measure the latency of RT tasks in our production
> > environment with schedstats facility, but currently schedstats is only
> > supported for fair sched class. In order to support if for other sched
> > classes, we should make it independent of fair sched class. The struct
> > sched_statistics is the schedular statistics of a task_struct or a
> > task_group, both of which are independent of sched class. So we can move
> > struct sched_statistics into struct task_struct and struct task_group to
> > achieve the goal.
>
> Do you really want schedstats or do you want the tracepoints?

I really want schedstats; they are very helpful for profiling
thread-level latency.
The tracepoints are a bonus.

> In general
> I really want to cut back on the built-in statistics crud we carry,

Please don't.
There are real use cases for these statistics.
Our use case is as follows:

Userspace Code Scope         Profiler

{
    user_func_abc();   <----      uprobe_begin() get the start statistics
    ...
    user_func_xyz();   <----       uprobe_end()  get the end statistics
}

With this profiler we can easily see what happened in this scope
and why its latency was high:
    scope_latency = Wait + Sleep + Blocked [1]  + Run (stime + utime)

Without schedstats, we would have to trace the much heavier sched::sched_switch.

[1]. With patch #5, and with sum_block_runtime not included in sum_sleep_runtime
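
As a rough illustration of the arithmetic, the profiler takes one snapshot at scope entry and one at scope exit and subtracts them. The sketch below is userspace C with hypothetical names (struct sched_snapshot, struct scope_latency, scope_latency_between); it assumes all four counters are cumulative, use the same time unit, and that sleep and blocked time are accounted separately as described in [1].

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-thread snapshot taken by the profiler. */
struct sched_snapshot {
	uint64_t wait_ns;	/* runnable, waiting for a CPU */
	uint64_t sleep_ns;	/* interruptible sleep */
	uint64_t blocked_ns;	/* uninterruptible sleep */
	uint64_t run_ns;	/* stime + utime */
};

/* Per-scope breakdown derived from two snapshots. */
struct scope_latency {
	uint64_t wait_ns;
	uint64_t sleep_ns;
	uint64_t blocked_ns;
	uint64_t run_ns;
	uint64_t total_ns;
};

/* scope_latency = Wait + Sleep + Blocked + Run, each as end minus begin. */
static struct scope_latency
scope_latency_between(const struct sched_snapshot *begin,
		      const struct sched_snapshot *end)
{
	struct scope_latency d;

	/* Cumulative counters must not go backwards. */
	assert(end->wait_ns >= begin->wait_ns);
	assert(end->sleep_ns >= begin->sleep_ns);
	assert(end->blocked_ns >= begin->blocked_ns);
	assert(end->run_ns >= begin->run_ns);

	d.wait_ns = end->wait_ns - begin->wait_ns;
	d.sleep_ns = end->sleep_ns - begin->sleep_ns;
	d.blocked_ns = end->blocked_ns - begin->blocked_ns;
	d.run_ns = end->run_ns - begin->run_ns;
	d.total_ns = d.wait_ns + d.sleep_ns + d.blocked_ns + d.run_ns;

	return d;
}
```

If blocked time were still folded into the sleep counter, adding both terms would count it twice, which is why the footnote matters for this breakdown.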

> there's too much and it seems to keep growing forever :-(
>
> (as is the case here, you're extending it as well)
>
> That said; making schedstats cover the other classes can be seen as
> fixing an inconsistency, but then you forgot deadline.
>

There are no deadline tasks on our servers, so I didn't add support
for deadline. But with this patchset, it is easy to extend schedstats
to deadline and any other sched class.


> > After the patchset, schestats are orgnized as follows,
> > struct task_struct {
> >     ...
> >     struct sched_statistics statistics;
> >     ...
> >     struct sched_entity *se;
> >     struct sched_rt_entity *rt;
> >     ...
> > };
> >
> > struct task_group {                    |---> stats[0] : of CPU0
> >     ...                                |
> >     struct sched_statistics **stats; --|---> stats[1] : of CPU1
> >     ...                                |
> >                                        |---> stats[n] : of CPUn
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >     struct sched_entity **se;
> >  #endif
> >  #ifdef CONFIG_RT_GROUP_SCHED
> >     struct sched_rt_entity  **rt_se;
> >  #endif
> >     ...
> > };
>
> Yeah, this seems to give a terrible mess, let me see if I can come up
> with anything less horrible.



-- 
Thanks
Yafang


* Re: [PATCH v3 0/7] sched: support schedstats for RT sched class
  2021-08-31 10:44   ` Peter Zijlstra
@ 2021-08-31 13:21     ` Yafang Shao
  0 siblings, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2021-08-31 13:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mel Gorman, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall,
	Daniel Bristot de Oliveira, Alison Chaiken, kbuild test robot,
	LKML, linux-rt-users

On Tue, Aug 31, 2021 at 6:44 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 31, 2021 at 12:08:15PM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 24, 2021 at 11:29:39AM +0000, Yafang Shao wrote:
>
> > > After the patchset, schestats are orgnized as follows,
> > > struct task_struct {
> > >     ...
> > >     struct sched_statistics statistics;
> > >     ...
> > >     struct sched_entity *se;
> > >     struct sched_rt_entity *rt;
> > >     ...
> > > };
> > >
> > > struct task_group {                    |---> stats[0] : of CPU0
> > >     ...                                |
> > >     struct sched_statistics **stats; --|---> stats[1] : of CPU1
> > >     ...                                |
> > >                                        |---> stats[n] : of CPUn
> > >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > >     struct sched_entity **se;
> > >  #endif
> > >  #ifdef CONFIG_RT_GROUP_SCHED
> > >     struct sched_rt_entity  **rt_se;
> > >  #endif
> > >     ...
> > > };
> >
> > Yeah, this seems to give a terrible mess, let me see if I can come up
> > with anything less horrible.
>
> Here, isn't this *MUCH* saner ?
>

Seems like a good idea.
I will verify it.


> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -521,7 +521,7 @@ struct sched_statistics {
>         u64                             nr_wakeups_passive;
>         u64                             nr_wakeups_idle;
>  #endif
> -};
> +} ____cacheline_aligned;
>
>  struct sched_entity {
>         /* For load-balancing: */
> @@ -537,8 +537,6 @@ struct sched_entity {
>
>         u64                             nr_migrations;
>
> -       struct sched_statistics         statistics;
> -
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         int                             depth;
>         struct sched_entity             *parent;
> @@ -802,6 +800,8 @@ struct task_struct {
>         struct uclamp_se                uclamp[UCLAMP_CNT];
>  #endif
>
> +       struct sched_statistics         stats;
> +

I kept the stats close to 'struct sched_entity se' before because I
didn't want to change the original layout of 'struct task_struct' too
much, in case the change affects cache-line behaviour.
I'm not sure whether this is the right place for it; I will verify it.
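
To make the layout concern concrete, here is a minimal user-space sketch (not kernel code). It assumes 64-byte cache lines and uses C11 alignas as a stand-in for ____cacheline_aligned; all struct and helper names are hypothetical.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define DEMO_CACHELINE 64

/*
 * Stand-in for struct sched_statistics. Aligning the first member
 * aligns the whole struct and pads its size to a cache-line multiple.
 */
struct layout_stats {
	alignas(DEMO_CACHELINE) uint64_t wait_start;
	uint64_t wait_sum;
	uint64_t exec_max;
};

/* Stand-in for struct task_struct with unrelated neighbouring fields. */
struct layout_task {
	int prio;
	struct layout_stats stats;
	int on_rq;
};

/* Return non-zero when two byte offsets fall in the same cache line. */
static int same_cacheline(size_t a, size_t b)
{
	return a / DEMO_CACHELINE == b / DEMO_CACHELINE;
}
```

With the alignment in place, neither neighbour shares a line with the stats, wherever the member sits in the enclosing struct. The price is padding before and after it.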

>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>         /* List of struct preempt_notifier: */
>         struct hlist_head               preempt_notifiers;
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3489,11 +3489,11 @@ ttwu_stat(struct task_struct *p, int cpu
>  #ifdef CONFIG_SMP
>         if (cpu == rq->cpu) {
>                 __schedstat_inc(rq->ttwu_local);
> -               __schedstat_inc(p->se.statistics.nr_wakeups_local);
> +               __schedstat_inc(p->stats.nr_wakeups_local);
>         } else {
>                 struct sched_domain *sd;
>
> -               __schedstat_inc(p->se.statistics.nr_wakeups_remote);
> +               __schedstat_inc(p->stats.nr_wakeups_remote);
>                 rcu_read_lock();
>                 for_each_domain(rq->cpu, sd) {
>                         if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
> @@ -3505,14 +3505,14 @@ ttwu_stat(struct task_struct *p, int cpu
>         }
>
>         if (wake_flags & WF_MIGRATED)
> -               __schedstat_inc(p->se.statistics.nr_wakeups_migrate);
> +               __schedstat_inc(p->stats.nr_wakeups_migrate);
>  #endif /* CONFIG_SMP */
>
>         __schedstat_inc(rq->ttwu_count);
> -       __schedstat_inc(p->se.statistics.nr_wakeups);
> +       __schedstat_inc(p->stats.nr_wakeups);
>
>         if (wake_flags & WF_SYNC)
> -               __schedstat_inc(p->se.statistics.nr_wakeups_sync);
> +               __schedstat_inc(p->stats.nr_wakeups_sync);
>  }
>
>  /*
> @@ -4196,7 +4196,7 @@ static void __sched_fork(unsigned long c
>
>  #ifdef CONFIG_SCHEDSTATS
>         /* Even if schedstat is disabled, there should not be garbage */
> -       memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> +       memset(&p->stats, 0, sizeof(p->stats));
>  #endif
>
>         RB_CLEAR_NODE(&p->dl.rb_node);
> @@ -9619,9 +9619,9 @@ void normalize_rt_tasks(void)
>                         continue;
>
>                 p->se.exec_start = 0;
> -               schedstat_set(p->se.statistics.wait_start,  0);
> -               schedstat_set(p->se.statistics.sleep_start, 0);
> -               schedstat_set(p->se.statistics.block_start, 0);
> +               schedstat_set(p->stats.wait_start,  0);
> +               schedstat_set(p->stats.sleep_start, 0);
> +               schedstat_set(p->stats.block_start, 0);
>
>                 if (!dl_task(p) && !rt_task(p)) {
>                         /*
> @@ -10467,7 +10467,7 @@ static int cpu_cfs_stat_show(struct seq_
>                 int i;
>
>                 for_each_possible_cpu(i)
> -                       ws += schedstat_val(tg->se[i]->statistics.wait_sum);
> +                       ws += schedstat_val(tg->stats[i]->wait_sum);
>
>                 seq_printf(sf, "wait_sum %llu\n", ws);
>         }
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1265,8 +1265,8 @@ static void update_curr_dl(struct rq *rq
>                 return;
>         }
>
> -       schedstat_set(curr->se.statistics.exec_max,
> -                     max(curr->se.statistics.exec_max, delta_exec));
> +       schedstat_set(curr->stats.exec_max,
> +                     max(curr->stats.exec_max, delta_exec));
>
>         curr->se.sum_exec_runtime += delta_exec;
>         account_group_exec_runtime(curr, delta_exec);
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -819,6 +819,21 @@ static void update_tg_load_avg(struct cf
>  }
>  #endif /* CONFIG_SMP */
>
> +struct sched_entity_stats {
> +       struct sched_entity     se;
> +       struct sched_statistics stats;
> +} __no_randomize_layout;
> +
> +static inline struct sched_statistics *
> +__schedstats_from_se(struct sched_entity *se)
> +{
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +       if (!entity_is_task(se))
> +               return &container_of(se, struct sched_entity_stats, se)->stats;
> +#endif
> +       return &task_of(se)->stats;
> +}
> +
>  /*
>   * Update the current task's runtime statistics.
>   */
> @@ -837,8 +852,10 @@ static void update_curr(struct cfs_rq *c
>
>         curr->exec_start = now;
>
> -       schedstat_set(curr->statistics.exec_max,
> -                     max(delta_exec, curr->statistics.exec_max));
> +       if (schedstat_enabled()) {
> +               struct sched_statistics *stats = __schedstats_from_se(curr);
> +               __schedstat_set(stats->exec_max, max(delta_exec, stats->exec_max));
> +       }
>
>         curr->sum_exec_runtime += delta_exec;
>         schedstat_add(cfs_rq->exec_clock, delta_exec);
> @@ -866,39 +883,45 @@ static inline void
>  update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         u64 wait_start, prev_wait_start;
> +       struct sched_statistics *stats;
>
>         if (!schedstat_enabled())
>                 return;
>
> +       stats = __schedstats_from_se(se);
> +
>         wait_start = rq_clock(rq_of(cfs_rq));
> -       prev_wait_start = schedstat_val(se->statistics.wait_start);
> +       prev_wait_start = schedstat_val(stats->wait_start);
>
>         if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
>             likely(wait_start > prev_wait_start))
>                 wait_start -= prev_wait_start;
>
> -       __schedstat_set(se->statistics.wait_start, wait_start);
> +       __schedstat_set(stats->wait_start, wait_start);
>  }
>
>  static inline void
>  update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       struct task_struct *p;
> +       struct sched_statistics *stats;
> +       struct task_struct *p = NULL;
>         u64 delta;
>
>         if (!schedstat_enabled())
>                 return;
>
> +       stats = __schedstats_from_se(se);
> +
>         /*
>          * When the sched_schedstat changes from 0 to 1, some sched se
>          * maybe already in the runqueue, the se->statistics.wait_start
>          * will be 0.So it will let the delta wrong. We need to avoid this
>          * scenario.
>          */
> -       if (unlikely(!schedstat_val(se->statistics.wait_start)))
> +       if (unlikely(!schedstat_val(stats->wait_start)))
>                 return;
>
> -       delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(se->statistics.wait_start);
> +       delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(stats->wait_start);
>
>         if (entity_is_task(se)) {
>                 p = task_of(se);
> @@ -908,30 +931,33 @@ update_stats_wait_end(struct cfs_rq *cfs
>                          * time stamp can be adjusted to accumulate wait time
>                          * prior to migration.
>                          */
> -                       __schedstat_set(se->statistics.wait_start, delta);
> +                       __schedstat_set(stats->wait_start, delta);
>                         return;
>                 }
>                 trace_sched_stat_wait(p, delta);
>         }
>
> -       __schedstat_set(se->statistics.wait_max,
> -                     max(schedstat_val(se->statistics.wait_max), delta));
> -       __schedstat_inc(se->statistics.wait_count);
> -       __schedstat_add(se->statistics.wait_sum, delta);
> -       __schedstat_set(se->statistics.wait_start, 0);
> +       __schedstat_set(stats->wait_max,
> +                     max(schedstat_val(stats->wait_max), delta));
> +       __schedstat_inc(stats->wait_count);
> +       __schedstat_add(stats->wait_sum, delta);
> +       __schedstat_set(stats->wait_start, 0);
>  }
>
>  static inline void
>  update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +       struct sched_statistics *stats;
>         struct task_struct *tsk = NULL;
>         u64 sleep_start, block_start;
>
>         if (!schedstat_enabled())
>                 return;
>
> -       sleep_start = schedstat_val(se->statistics.sleep_start);
> -       block_start = schedstat_val(se->statistics.block_start);
> +       stats = __schedstats_from_se(se);
> +
> +       sleep_start = schedstat_val(stats->sleep_start);
> +       block_start = schedstat_val(stats->block_start);
>
>         if (entity_is_task(se))
>                 tsk = task_of(se);
> @@ -942,11 +968,11 @@ update_stats_enqueue_sleeper(struct cfs_
>                 if ((s64)delta < 0)
>                         delta = 0;
>
> -               if (unlikely(delta > schedstat_val(se->statistics.sleep_max)))
> -                       __schedstat_set(se->statistics.sleep_max, delta);
> +               if (unlikely(delta > schedstat_val(stats->sleep_max)))
> +                       __schedstat_set(stats->sleep_max, delta);
>
> -               __schedstat_set(se->statistics.sleep_start, 0);
> -               __schedstat_add(se->statistics.sum_sleep_runtime, delta);
> +               __schedstat_set(stats->sleep_start, 0);
> +               __schedstat_add(stats->sum_sleep_runtime, delta);
>
>                 if (tsk) {
>                         account_scheduler_latency(tsk, delta >> 10, 1);
> @@ -959,16 +985,16 @@ update_stats_enqueue_sleeper(struct cfs_
>                 if ((s64)delta < 0)
>                         delta = 0;
>
> -               if (unlikely(delta > schedstat_val(se->statistics.block_max)))
> -                       __schedstat_set(se->statistics.block_max, delta);
> +               if (unlikely(delta > schedstat_val(stats->block_max)))
> +                       __schedstat_set(stats->block_max, delta);
>
> -               __schedstat_set(se->statistics.block_start, 0);
> -               __schedstat_add(se->statistics.sum_sleep_runtime, delta);
> +               __schedstat_set(stats->block_start, 0);
> +               __schedstat_add(stats->sum_sleep_runtime, delta);
>
>                 if (tsk) {
>                         if (tsk->in_iowait) {
> -                               __schedstat_add(se->statistics.iowait_sum, delta);
> -                               __schedstat_inc(se->statistics.iowait_count);
> +                               __schedstat_add(stats->iowait_sum, delta);
> +                               __schedstat_inc(stats->iowait_count);
>                                 trace_sched_stat_iowait(tsk, delta);
>                         }
>
> @@ -1030,10 +1056,10 @@ update_stats_dequeue(struct cfs_rq *cfs_
>                 /* XXX racy against TTWU */
>                 state = READ_ONCE(tsk->__state);
>                 if (state & TASK_INTERRUPTIBLE)
> -                       __schedstat_set(se->statistics.sleep_start,
> +                       __schedstat_set(tsk->stats.sleep_start,
>                                       rq_clock(rq_of(cfs_rq)));
>                 if (state & TASK_UNINTERRUPTIBLE)
> -                       __schedstat_set(se->statistics.block_start,
> +                       __schedstat_set(tsk->stats.block_start,
>                                       rq_clock(rq_of(cfs_rq)));
>         }
>  }
> @@ -4502,9 +4528,10 @@ set_next_entity(struct cfs_rq *cfs_rq, s
>          */
>         if (schedstat_enabled() &&
>             rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
> -               __schedstat_set(se->statistics.slice_max,
> -                       max((u64)schedstat_val(se->statistics.slice_max),
> -                           se->sum_exec_runtime - se->prev_sum_exec_runtime));
> +               struct sched_statistics *stats = __schedstats_from_se(se);
> +               __schedstat_set(stats->slice_max,
> +                               max((u64)stats->slice_max,
> +                                   se->sum_exec_runtime - se->prev_sum_exec_runtime));
>         }
>
>         se->prev_sum_exec_runtime = se->sum_exec_runtime;
> @@ -5993,12 +6020,12 @@ static int wake_affine(struct sched_doma
>         if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
>                 target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
>
> -       schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
> +       schedstat_inc(p->stats.nr_wakeups_affine_attempts);
>         if (target == nr_cpumask_bits)
>                 return prev_cpu;
>
>         schedstat_inc(sd->ttwu_move_affine);
> -       schedstat_inc(p->se.statistics.nr_wakeups_affine);
> +       schedstat_inc(p->stats.nr_wakeups_affine);
>         return target;
>  }
>
> @@ -7802,7 +7829,7 @@ int can_migrate_task(struct task_struct
>         if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
>                 int cpu;
>
> -               schedstat_inc(p->se.statistics.nr_failed_migrations_affine);
> +               schedstat_inc(p->stats.nr_failed_migrations_affine);
>
>                 env->flags |= LBF_SOME_PINNED;
>
> @@ -7836,7 +7863,7 @@ int can_migrate_task(struct task_struct
>         env->flags &= ~LBF_ALL_PINNED;
>
>         if (task_running(env->src_rq, p)) {
> -               schedstat_inc(p->se.statistics.nr_failed_migrations_running);
> +               schedstat_inc(p->stats.nr_failed_migrations_running);
>                 return 0;
>         }
>
> @@ -7858,12 +7885,12 @@ int can_migrate_task(struct task_struct
>             env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
>                 if (tsk_cache_hot == 1) {
>                         schedstat_inc(env->sd->lb_hot_gained[env->idle]);
> -                       schedstat_inc(p->se.statistics.nr_forced_migrations);
> +                       schedstat_inc(p->stats.nr_forced_migrations);
>                 }
>                 return 1;
>         }
>
> -       schedstat_inc(p->se.statistics.nr_failed_migrations_hot);
> +       schedstat_inc(p->stats.nr_failed_migrations_hot);
>         return 0;
>  }
>
> @@ -11390,7 +11417,7 @@ int alloc_fair_sched_group(struct task_g
>                 if (!cfs_rq)
>                         goto err;
>
> -               se = kzalloc_node(sizeof(struct sched_entity),
> +               se = kzalloc_node(sizeof(struct sched_entity_stats),
>                                   GFP_KERNEL, cpu_to_node(i));
>                 if (!se)
>                         goto err_free_rq;
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1009,8 +1009,8 @@ static void update_curr_rt(struct rq *rq
>         if (unlikely((s64)delta_exec <= 0))
>                 return;
>
> -       schedstat_set(curr->se.statistics.exec_max,
> -                     max(curr->se.statistics.exec_max, delta_exec));
> +       schedstat_set(curr->stats.exec_max,
> +                     max(curr->stats.exec_max, delta_exec));
>
>         curr->se.sum_exec_runtime += delta_exec;
>         account_group_exec_runtime(curr, delta_exec);
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h
> @@ -41,6 +41,7 @@ rq_sched_info_dequeue(struct rq *rq, uns
>  #define   schedstat_val_or_zero(var)   ((schedstat_enabled()) ? (var) : 0)
>
>  #else /* !CONFIG_SCHEDSTATS: */
> +
>  static inline void rq_sched_info_arrive  (struct rq *rq, unsigned long long delta) { }
>  static inline void rq_sched_info_dequeue(struct rq *rq, unsigned long long delta) { }
>  static inline void rq_sched_info_depart  (struct rq *rq, unsigned long long delta) { }
> @@ -53,6 +54,7 @@ static inline void rq_sched_info_depart
>  # define   schedstat_set(var, val)     do { } while (0)
>  # define   schedstat_val(var)          0
>  # define   schedstat_val_or_zero(var)  0
> +
>  #endif /* CONFIG_SCHEDSTATS */
>
>  #ifdef CONFIG_PSI
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -78,8 +78,8 @@ static void put_prev_task_stop(struct rq
>         if (unlikely((s64)delta_exec < 0))
>                 delta_exec = 0;
>
> -       schedstat_set(curr->se.statistics.exec_max,
> -                       max(curr->se.statistics.exec_max, delta_exec));
> +       schedstat_set(curr->stats.exec_max,
> +                     max(curr->stats.exec_max, delta_exec));
>
>         curr->se.sum_exec_runtime += delta_exec;
>         account_group_exec_runtime(curr, delta_exec);
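
For readers following along, the sched_entity_stats trick in the quoted patch can be modelled in user space as below. This is only a sketch: the demo_* names are hypothetical and the macro is a simplified stand-in for the kernel's container_of(), without its type checking.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_se {
	unsigned long weight;
};

struct demo_stats {
	unsigned long long wait_sum;
};

/* The entity and its statistics are allocated together as one object. */
struct demo_se_stats {
	struct demo_se se;
	struct demo_stats stats;
};

/*
 * Recover the statistics from a pointer to the embedded entity.
 * Only valid when @se really lives inside a struct demo_se_stats.
 */
static struct demo_stats *demo_stats_from_se(struct demo_se *se)
{
	return &demo_container_of(se, struct demo_se_stats, se)->stats;
}
```

This is why the quoted alloc_fair_sched_group() hunk allocates sizeof(struct sched_entity_stats): the lookup is only safe for entities that were allocated as part of the wrapper.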



-- 
Thanks
Yafang


* Re: [PATCH v3 2/7] sched: make struct sched_statistics independent of fair sched class
  2021-08-31 10:14   ` Peter Zijlstra
@ 2021-08-31 13:25     ` Yafang Shao
  0 siblings, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2021-08-31 13:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mel Gorman, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall,
	Daniel Bristot de Oliveira, Alison Chaiken, kbuild test robot,
	LKML, linux-rt-users

On Tue, Aug 31, 2021 at 6:15 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 24, 2021 at 11:29:41AM +0000, Yafang Shao wrote:
> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > index 49716228efb4..4cfee2aa1a2d 100644
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
>
> > @@ -442,9 +442,11 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
> >       struct sched_entity *se = tg->se[cpu];
> >
> >  #define P(F)         SEQ_printf(m, "  .%-30s: %lld\n",       #F, (long long)F)
> > -#define P_SCHEDSTAT(F)       SEQ_printf(m, "  .%-30s: %lld\n",       #F, (long long)schedstat_val(F))
> > +#define P_SCHEDSTAT(F)       SEQ_printf(m, "  .%-30s: %lld\n",       \
> > +             "se->statistics."#F, (long long)schedstat_val(tg->stats[cpu]->F))
> >  #define PN(F)                SEQ_printf(m, "  .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))
> > -#define PN_SCHEDSTAT(F)      SEQ_printf(m, "  .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)schedstat_val(F)))
> > +#define PN_SCHEDSTAT(F)      SEQ_printf(m, "  .%-30s: %lld.%06ld\n", \
> > +             "se->statistics."#F, SPLIT_NS((long long)schedstat_val(tg->stats[cpu]->F)))
> >
> >       if (!se)
> >               return;
>
> > @@ -948,8 +950,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
> >               "---------------------------------------------------------"
> >               "----------\n");
> >
> > -#define P_SCHEDSTAT(F)  __PS(#F, schedstat_val(p->F))
> > -#define PN_SCHEDSTAT(F) __PSN(#F, schedstat_val(p->F))
> > +#define P_SCHEDSTAT(F)  __PS("se.statistics."#F, schedstat_val(p->stats.F))
> > +#define PN_SCHEDSTAT(F) __PSN("se.statistics."#F, schedstat_val(p->stats.F))
> >
> >       PN(se.exec_start);
> >       PN(se.vruntime);
>
> That's sad... can't we keep it #F, this is all SCHED_DEBUG code anyway.

Sure, I will change it.
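
A small user-space illustration of what keeping #F means: the preprocessor turns the macro argument into the printed label, so the label always follows the field name passed in. The macro and struct below are hypothetical and write into a buffer with snprintf() instead of using SEQ_printf().

```c
#include <stdio.h>

struct demo_stats {
	long long wait_sum;
	long long wait_count;
};

/*
 * #F stringifies the macro argument, so the label is derived from the
 * field name and never needs a hand-written prefix.
 */
#define DEMO_P_SCHEDSTAT(buf, len, st, F) \
	snprintf((buf), (len), "  .%-30s: %lld\n", #F, (long long)(st)->F)
```

Calling DEMO_P_SCHEDSTAT(buf, sizeof(buf), &st, wait_sum) prints a line labelled ".wait_sum".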

-- 
Thanks
Yafang


* Re: [PATCH v3 2/7] sched: make struct sched_statistics independent of fair sched class
  2021-08-31 10:19   ` Peter Zijlstra
@ 2021-08-31 13:25     ` Yafang Shao
  0 siblings, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2021-08-31 13:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mel Gorman, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall,
	Daniel Bristot de Oliveira, Alison Chaiken, kbuild test robot,
	LKML, linux-rt-users

On Tue, Aug 31, 2021 at 6:19 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 24, 2021 at 11:29:41AM +0000, Yafang Shao wrote:
>
> > +#ifdef CONFIG_FAIR_GROUP_SCHED
> > +static inline void
> > +__schedstats_from_sched_entity(struct sched_entity *se,
> > +                           struct sched_statistics **stats)
> > +{
> > +     struct task_group *tg;
> > +     struct task_struct *p;
> > +     struct cfs_rq *cfs;
> > +     int cpu;
> > +
> > +     if (entity_is_task(se)) {
> > +             p = task_of(se);
> > +             *stats = &p->stats;
> > +     } else {
> > +             cfs = group_cfs_rq(se);
> > +             tg = cfs->tg;
> > +             cpu = cpu_of(rq_of(cfs));
> > +             *stats = tg->stats[cpu];
> > +     }
> > +}
> > +
> > +#else
> > +
> > +static inline void
> > +__schedstats_from_sched_entity(struct sched_entity *se,
> > +                           struct sched_statistics **stats)
> > +{
> > +     struct task_struct *p;
> > +
> > +     p = task_of(se);
> > +     *stats = &p->stats;
> > +}
> > +
> > +#endif
> > +
> >  /*
> >   * Update the current task's runtime statistics.
> >   */
> > @@ -826,6 +861,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
> >  {
> >       struct sched_entity *curr = cfs_rq->curr;
> >       u64 now = rq_clock_task(rq_of(cfs_rq));
> > +     struct sched_statistics *stats = NULL;
> >       u64 delta_exec;
> >
> >       if (unlikely(!curr))
> > @@ -837,8 +873,11 @@ static void update_curr(struct cfs_rq *cfs_rq)
> >
> >       curr->exec_start = now;
> >
> > -     schedstat_set(curr->statistics.exec_max,
> > -                   max(delta_exec, curr->statistics.exec_max));
> > +     if (schedstat_enabled()) {
> > +             __schedstats_from_sched_entity(curr, &stats);
> > +             __schedstat_set(stats->exec_max,
> > +                             max(delta_exec, stats->exec_max));
> > +     }
> >
> >       curr->sum_exec_runtime += delta_exec;
> >       schedstat_add(cfs_rq->exec_clock, delta_exec);
>
>
> That's just really odd style; what's wrong with something like:
>

I will change it.

> static inline struct sched_statistics *
> __schedstats_from_se(struct sched_entity *se)
> {
>         ...
> }
>
>         if (schedstat_enabled()) {
>                 struct sched_statistics *stats = __schedstats_from_se(curr);
>                 __schedstat_set(stats->exec_max, max(stats->exec_max, delta_exec));
>         }
>
>
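
The suggested style, sketched in self-contained user-space C: the helper returns the pointer, and the caller declares it in the narrow scope where it is used. The demo_* types and the is_task flag are illustrative stand-ins for the kernel structures and entity_is_task().

```c
#include <stddef.h>
#include <stdint.h>

struct demo_stats {
	uint64_t exec_max;
};

struct demo_entity {
	int is_task;                    /* stands in for entity_is_task() */
	struct demo_stats task_stats;   /* used when is_task is set */
	struct demo_stats *group_stats; /* used otherwise */
};

/* Return the statistics for an entity instead of filling an out-parameter. */
static inline struct demo_stats *
demo_stats_from_entity(struct demo_entity *e)
{
	return e->is_task ? &e->task_stats : e->group_stats;
}

/* Track the largest execution delta seen, but only when stats are enabled. */
static void demo_update_exec_max(struct demo_entity *e, uint64_t delta_exec,
				 int enabled)
{
	if (enabled) {
		struct demo_stats *stats = demo_stats_from_entity(e);

		if (delta_exec > stats->exec_max)
			stats->exec_max = delta_exec;
	}
}
```

Compared with the out-parameter version, there is no function-scope pointer that has to be initialised to NULL.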


-- 
Thanks
Yafang


* Re: [PATCH v3 3/7] sched: make schedstats helpers independent of fair sched class
  2021-08-31 11:07   ` Peter Zijlstra
@ 2021-08-31 13:27     ` Yafang Shao
  0 siblings, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2021-08-31 13:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mel Gorman, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall,
	Daniel Bristot de Oliveira, Alison Chaiken, kbuild test robot,
	LKML, linux-rt-users

On Tue, Aug 31, 2021 at 7:08 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 24, 2021 at 11:29:42AM +0000, Yafang Shao wrote:
> > diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
> > index 3f93fc3b5648..b2542f4d3192 100644
> > --- a/kernel/sched/stats.c
> > +++ b/kernel/sched/stats.c
> > @@ -4,6 +4,109 @@
> >   */
> >  #include "sched.h"
> >
> > +void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
> > +                            struct sched_statistics *stats)
> > +{
> > +u64 wait_start, prev_wait_start;
>
> indent fail...
>

Sorry about that.
It is strange that checkpatch.pl didn't catch this issue...

> > +
> > +     wait_start = rq_clock(rq);
> > +     prev_wait_start = schedstat_val(stats->wait_start);
> > +
> > +     if (p && likely(wait_start > prev_wait_start))
> > +             wait_start -= prev_wait_start;
> > +
> > +     __schedstat_set(stats->wait_start, wait_start);
> > +}
>
> > diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> > index e6905e369c5d..9ecd81b91f26 100644
> > --- a/kernel/sched/stats.h
> > +++ b/kernel/sched/stats.h
>
> > @@ -40,6 +42,33 @@ rq_sched_info_dequeue(struct rq *rq, unsigned long long delta)
> >  #define   schedstat_val(var)         (var)
> >  #define   schedstat_val_or_zero(var) ((schedstat_enabled()) ? (var) : 0)
> >
> > +void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
> > +                            struct sched_statistics *stats);
> > +
> > +void __update_stats_wait_end(struct rq *rq, struct task_struct *p,
> > +                          struct sched_statistics *stats);
> > +void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
> > +                                 struct sched_statistics *stats);
> > +
> > +static inline void
> > +check_schedstat_required(void)
> > +{
> > +     if (schedstat_enabled())
> > +             return;
> > +
> > +     /* Force schedstat enabled if a dependent tracepoint is active */
> > +     if (trace_sched_stat_wait_enabled()    ||
> > +             trace_sched_stat_sleep_enabled()   ||
> > +             trace_sched_stat_iowait_enabled()  ||
> > +             trace_sched_stat_blocked_enabled() ||
> > +             trace_sched_stat_runtime_enabled())  {
> > +             printk_deferred_once("Scheduler tracepoints stat_sleep, stat_iowait, "
> > +                                     "stat_blocked and stat_runtime require the "
> > +                                     "kernel parameter schedstats=enable or "
> > +                                     "kernel.sched_schedstats=1\n");
> > +     }
> > +}
>
> If you're moving this, you might as well reflow it to not have broken
> indentation.

Sure.
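
For reference, the wait-time bookkeeping in the quoted __update_stats_wait_start() can be modelled in user space as below. This is a sketch only: it follows the fair.c variant with an explicit migrating flag, and the demo_* names are hypothetical. Because each CPU's runqueue clock is independent, wait_start doubles as storage for the time already waited while the task is in flight between CPUs.

```c
#include <stdint.h>

struct demo_wait_stats {
	uint64_t wait_start;
	uint64_t wait_sum;
	uint64_t wait_max;
	uint64_t wait_count;
};

/* The entity starts waiting on a runqueue whose clock reads @now. */
static void demo_wait_start(struct demo_wait_stats *st, uint64_t now,
			    int migrating)
{
	uint64_t prev = st->wait_start;

	/*
	 * After a migration, prev holds the wait time accumulated on the
	 * old CPU, so back-date the start by that amount.
	 */
	if (migrating && now > prev)
		now -= prev;

	st->wait_start = now;
}

/* The entity stops waiting: it is either picked to run or migrated away. */
static void demo_wait_end(struct demo_wait_stats *st, uint64_t now,
			  int migrating)
{
	uint64_t delta;

	/* Stats were enabled after the entity was enqueued: nothing to do. */
	if (!st->wait_start)
		return;

	delta = now - st->wait_start;

	if (migrating) {
		/* Stash the accumulated wait for the destination CPU. */
		st->wait_start = delta;
		return;
	}

	if (delta > st->wait_max)
		st->wait_max = delta;
	st->wait_count++;
	st->wait_sum += delta;
	st->wait_start = 0;
}
```

For example, a task that waits 300 ticks on one CPU, is migrated, and then waits another 200 ticks on a CPU whose clock is unrelated ends up with a single wait of 500 in wait_sum.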


-- 
Thanks
Yafang


* Re: [PATCH v3 4/7] sched: make the output of schedstats independent of fair sched class
  2021-08-31 11:08   ` Peter Zijlstra
@ 2021-08-31 13:27     ` Yafang Shao
  0 siblings, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2021-08-31 13:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mel Gorman, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Benjamin Segall,
	Daniel Bristot de Oliveira, Alison Chaiken, kbuild test robot,
	LKML, linux-rt-users

On Tue, Aug 31, 2021 at 7:09 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 24, 2021 at 11:29:43AM +0000, Yafang Shao wrote:
> > The per cpu stats can be show with /proc/sched_debug, which includes the
> > per cpu schedstats of each task group. Currently these per cpu
> > schedstats only show for the fair sched class. If we want to support
> > other sched classes, we have to make these output independent of fair
> > sched class.
>
> Arguably the whole rt group stuff needs to die, please don't enable it
> further.

Sure.

-- 
Thanks
Yafang


Thread overview: 20+ messages
2021-08-24 11:29 [PATCH v3 0/7] sched: support schedstats for RT sched class Yafang Shao
2021-08-24 11:29 ` [PATCH v3 1/7] sched, fair: use __schedstat_set() in set_next_entity() Yafang Shao
2021-08-24 11:29 ` [PATCH v3 2/7] sched: make struct sched_statistics independent of fair sched class Yafang Shao
2021-08-31 10:14   ` Peter Zijlstra
2021-08-31 13:25     ` Yafang Shao
2021-08-31 10:19   ` Peter Zijlstra
2021-08-31 13:25     ` Yafang Shao
2021-08-24 11:29 ` [PATCH v3 3/7] sched: make schedstats helpers " Yafang Shao
2021-08-31 11:07   ` Peter Zijlstra
2021-08-31 13:27     ` Yafang Shao
2021-08-24 11:29 ` [PATCH v3 4/7] sched: make the output of schedstats " Yafang Shao
2021-08-31 11:08   ` Peter Zijlstra
2021-08-31 13:27     ` Yafang Shao
2021-08-24 11:29 ` [PATCH v3 5/7] sched: introduce task block time in schedstats Yafang Shao
2021-08-24 11:29 ` [PATCH v3 6/7] sched, rt: support sched_stat_runtime tracepoint for RT sched class Yafang Shao
2021-08-24 11:29 ` [PATCH v3 7/7] sched, rt: support schedstats " Yafang Shao
2021-08-31 10:08 ` [PATCH v3 0/7] sched: " Peter Zijlstra
2021-08-31 10:44   ` Peter Zijlstra
2021-08-31 13:21     ` Yafang Shao
2021-08-31 12:57   ` Yafang Shao
