LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
@ 2021-09-24  8:51 Barry Song
  2021-09-24  8:51 ` [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die Barry Song
                   ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Barry Song @ 2021-09-24  8:51 UTC (permalink / raw)
  To: dietmar.eggemann, linux-kernel, mingo, peterz, vincent.guittot
  Cc: aubrey.li, bp, bristot, bsegall, catalin.marinas, gregkh,
	guodong.xu, hpa, jonathan.cameron, juri.lelli, lenb, linux-acpi,
	linux-arm-kernel, linuxarm, mark.rutland, mgorman, msys.mizuma,
	prime.zeng, rjw, rostedt, song.bao.hua, sudeep.holla, tglx,
	rafael, tim.c.chen, valentin.schneider, will, x86, yangyicong

From: Barry Song <song.bao.hua@hisilicon.com>

ARM64 machines like kunpeng920 and x86 machines like Jacobsville have a
level of hardware topology in which some CPU cores, typically 4 cores,
share L3 tags or L2 cache.

That means spreading those tasks between clusters will bring more memory
bandwidth and decrease cache contention. But packing tasks might help
decrease the latency of cache synchronization.

We have three series to bring up cluster level scheduler in kernel.
This is the first series.

1st series(this one): make kernel aware of cluster, expose cluster to sysfs
ABI and add SCHED_CLUSTER which can make load balance between clusters to
benefit lots of workload.
Testing shows this can hugely boost the performance, for example, this
can increase 25.1% of SPECrate mcf on Jacobsville and 13.574% of mcf
on kunpeng920.

2nd series(wake_affine): modify the wake_affine and let kernel select CPUs
within cluster first before scanning the whole LLC so that we can benefit
from the lower latency of cache coherence within one single cluster. This
series is much more tricky. so we would like to send it after we build
the base of cluster by the 1st series. Prototype for 2nd series is here:
https://op-lists.linaro.org/pipermail/linaro-open-discussions/2021-June/000219.html

3rd series: a sysctl to permit users to enable or disable cluster scheduler
from Tim Chen. Prototype here:
Add run time sysctl to enable/disable cluster scheduling
https://op-lists.linaro.org/pipermail/linaro-open-discussions/2021-July/000258.html

This series is resent and rebased on 5.15-rc2.

-V1:
 differences with RFC v6 
 * removed wake_affine path modifcation, which will be separately second series
 * cluster_id is gotten by detecting valid ID before falling back to use offset
 * lots of benchmark data from both x86 Jacobsville and ARM64 kunpeng920

-RFC v6:
https://lore.kernel.org/lkml/20210420001844.9116-1-song.bao.hua@hisilicon.com/

Barry Song (1):
  scheduler: Add cluster scheduler level in core and related Kconfig for
    ARM64

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die

Tim Chen (1):
  scheduler: Add cluster scheduler level for x86

 Documentation/ABI/stable/sysfs-devices-system-cpu | 15 +++++
 Documentation/admin-guide/cputopology.rst         | 12 ++--
 arch/arm64/Kconfig                                |  7 +++
 arch/arm64/kernel/topology.c                      |  2 +
 arch/x86/Kconfig                                  |  8 +++
 arch/x86/include/asm/smp.h                        |  7 +++
 arch/x86/include/asm/topology.h                   |  3 +
 arch/x86/kernel/cpu/cacheinfo.c                   |  1 +
 arch/x86/kernel/cpu/common.c                      |  3 +
 arch/x86/kernel/smpboot.c                         | 44 ++++++++++++++-
 drivers/acpi/pptt.c                               | 67 +++++++++++++++++++++++
 drivers/base/arch_topology.c                      | 14 +++++
 drivers/base/topology.c                           | 10 ++++
 include/linux/acpi.h                              |  5 ++
 include/linux/arch_topology.h                     |  5 ++
 include/linux/sched/topology.h                    |  7 +++
 include/linux/topology.h                          | 13 +++++
 kernel/sched/topology.c                           |  5 ++
 18 files changed, 223 insertions(+), 5 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die
  2021-09-24  8:51 [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters Barry Song
@ 2021-09-24  8:51 ` Barry Song
  2021-10-05 16:33   ` Valentin Schneider
  2021-10-15  9:44   ` [tip: sched/core] " tip-bot2 for Jonathan Cameron
  2021-09-24  8:51 ` [PATCH RESEND 2/3] scheduler: Add cluster scheduler level in core and related Kconfig for ARM64 Barry Song
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 46+ messages in thread
From: Barry Song @ 2021-09-24  8:51 UTC (permalink / raw)
  To: dietmar.eggemann, linux-kernel, mingo, peterz, vincent.guittot
  Cc: aubrey.li, bp, bristot, bsegall, catalin.marinas, gregkh,
	guodong.xu, hpa, jonathan.cameron, juri.lelli, lenb, linux-acpi,
	linux-arm-kernel, linuxarm, mark.rutland, mgorman, msys.mizuma,
	prime.zeng, rjw, rostedt, song.bao.hua, sudeep.holla, tglx,
	rafael, tim.c.chen, valentin.schneider, will, x86, yangyicong,
	Jonathan Cameron, Tian Tao

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Both ACPI and DT provide the ability to describe additional layers of
topology between that of individual cores and higher level constructs
such as the level at which the last level cache is shared.
In ACPI this can be represented in PPTT as a Processor Hierarchy
Node Structure [1] that is the parent of the CPU cores and in turn
has a parent Processor Hierarchy Nodes Structure representing
a higher level of topology.

For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data, but each cluster
has local L3 tag. On the other hand, each clusters will share some
internal system bus.

+-----------------------------------+                          +---------+
|  +------+    +------+            +---------------------------+         |
|  | CPU0 |    | cpu1 |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+   cluster   |    |    tag    |         |         |
|  | CPU2 |    | CPU3 |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   |    |    L3     |         |         |
|  +------+    +------+             +----+    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |   L3    |
                                                               |   data  |
+-----------------------------------+                          |         |
|  +------+    +------+             |    +-----------+         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             +----+    L3     |         |         |
|                                   |    |    tag    |         |         |
|  +------+    +------+             |    |           |         |         |
|  |      |    |      |            ++    +-----------+         |         |
|  +------+    +------+            |---------------------------+         |
+-----------------------------------|                          |         |
+-----------------------------------|                          |         |
|  +------+    +------+            +---------------------------+         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+             |    |    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |   +-----------+          |         |
|  +------+    +------+             |   |           |          |         |
|                                   |   |    L3     |          |         |
|  +------+    +------+             +---+    tag    |          |         |
|  |      |    |      |             |   |           |          |         |
|  +------+    +------+             |   +-----------+          |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                         ++         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |  +-----------+           |         |
|  +------+    +------+             |  |           |           |         |
|                                   |  |    L3     |           |         |
|  +------+    +------+             +--+    tag    |           |         |
|  |      |    |      |             |  |           |           |         |
|  +------+    +------+             |  +-----------+           |         |
|                                   |                          +---------+
+-----------------------------------+

That means spreading tasks among clusters will bring more bandwidth
while packing tasks within one cluster will lead to smaller cache
synchronization latency. So both kernel and userspace will have
a chance to leverage this topology to deploy tasks accordingly to
achieve either smaller cache latency within one cluster or an even
distribution of load among clusters for higher throughput.

This patch exposes cluster topology to both kernel and userspace.
Libraried like hwloc will know cluster by cluster_cpus and related
sysfs attributes. PoC of HWLOC support at [2].

Note this patch only handle the ACPI case.

Special consideration is needed for SMT processors, where it is
necessary to move 2 levels up the hierarchy from the leaf nodes
(thus skipping the processor core level).

Note that arm64 / ACPI does not provide any means of identifying
a die level in the topology but that may be unrelate to the cluster
level.

[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
    structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 Documentation/ABI/stable/sysfs-devices-system-cpu | 15 +++++
 Documentation/admin-guide/cputopology.rst         | 12 ++--
 arch/arm64/kernel/topology.c                      |  2 +
 drivers/acpi/pptt.c                               | 67 +++++++++++++++++++++++
 drivers/base/arch_topology.c                      | 14 +++++
 drivers/base/topology.c                           | 10 ++++
 include/linux/acpi.h                              |  5 ++
 include/linux/arch_topology.h                     |  5 ++
 include/linux/topology.h                          |  6 ++
 9 files changed, 132 insertions(+), 4 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-system-cpu b/Documentation/ABI/stable/sysfs-devices-system-cpu
index 516dafe..3965ce5 100644
--- a/Documentation/ABI/stable/sysfs-devices-system-cpu
+++ b/Documentation/ABI/stable/sysfs-devices-system-cpu
@@ -42,6 +42,12 @@ Description:    the CPU core ID of cpuX. Typically it is the hardware platform's
                 architecture and platform dependent.
 Values:         integer
 
+What:           /sys/devices/system/cpu/cpuX/topology/cluster_id
+Description:    the cluster ID of cpuX.  Typically it is the hardware platform's
+                identifier (rather than the kernel's). The actual value is
+                architecture and platform dependent.
+Values:         integer
+
 What:           /sys/devices/system/cpu/cpuX/topology/book_id
 Description:    the book ID of cpuX. Typically it is the hardware platform's
                 identifier (rather than the kernel's). The actual value is
@@ -85,6 +91,15 @@ Description:    human-readable list of CPUs within the same die.
                 The format is like 0-3, 8-11, 14,17.
 Values:         decimal list.
 
+What:           /sys/devices/system/cpu/cpuX/topology/cluster_cpus
+Description:    internal kernel map of CPUs within the same cluster.
+Values:         hexadecimal bitmask.
+
+What:           /sys/devices/system/cpu/cpuX/topology/cluster_cpus_list
+Description:    human-readable list of CPUs within the same cluster.
+                The format is like 0-3, 8-11, 14,17.
+Values:         decimal list.
+
 What:           /sys/devices/system/cpu/cpuX/topology/book_siblings
 Description:    internal kernel map of cpuX's hardware threads within the same
                 book_id. it's only used on s390.
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
index b085dba..6b62e18 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -19,11 +19,13 @@ these macros in include/asm-XXX/topology.h::
 
 	#define topology_physical_package_id(cpu)
 	#define topology_die_id(cpu)
+	#define topology_cluster_id(cpu)
 	#define topology_core_id(cpu)
 	#define topology_book_id(cpu)
 	#define topology_drawer_id(cpu)
 	#define topology_sibling_cpumask(cpu)
 	#define topology_core_cpumask(cpu)
+	#define topology_cluster_cpumask(cpu)
 	#define topology_die_cpumask(cpu)
 	#define topology_book_cpumask(cpu)
 	#define topology_drawer_cpumask(cpu)
@@ -39,10 +41,12 @@ not defined by include/asm-XXX/topology.h:
 
 1) topology_physical_package_id: -1
 2) topology_die_id: -1
-3) topology_core_id: 0
-4) topology_sibling_cpumask: just the given CPU
-5) topology_core_cpumask: just the given CPU
-6) topology_die_cpumask: just the given CPU
+3) topology_cluster_id: -1
+4) topology_core_id: 0
+5) topology_sibling_cpumask: just the given CPU
+6) topology_core_cpumask: just the given CPU
+7) topology_cluster_cpumask: just the given CPU
+8) topology_die_cpumask: just the given CPU
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 4dd14a6..9ab78ad 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -103,6 +103,8 @@ int __init parse_acpi_topology(void)
 			cpu_topology[cpu].thread_id  = -1;
 			cpu_topology[cpu].core_id    = topology_id;
 		}
+		topology_id = find_acpi_cpu_topology_cluster(cpu);
+		cpu_topology[cpu].cluster_id = topology_id;
 		topology_id = find_acpi_cpu_topology_package(cpu);
 		cpu_topology[cpu].package_id = topology_id;
 
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
index fe69dc5..701f61c 100644
--- a/drivers/acpi/pptt.c
+++ b/drivers/acpi/pptt.c
@@ -747,6 +747,73 @@ int find_acpi_cpu_topology_package(unsigned int cpu)
 }
 
 /**
+ * find_acpi_cpu_topology_cluster() - Determine a unique CPU cluster value
+ * @cpu: Kernel logical CPU number
+ *
+ * Determine a topology unique cluster ID for the given CPU/thread.
+ * This ID can then be used to group peers, which will have matching ids.
+ *
+ * The cluster, if present is the level of topology above CPUs. In a
+ * multi-thread CPU, it will be the level above the CPU, not the thread.
+ * It may not exist in single CPU systems. In simple multi-CPU systems,
+ * it may be equal to the package topology level.
+ *
+ * Return: -ENOENT if the PPTT doesn't exist, the CPU cannot be found
+ * or there is no toplogy level above the CPU..
+ * Otherwise returns a value which represents the package for this CPU.
+ */
+
+int find_acpi_cpu_topology_cluster(unsigned int cpu)
+{
+	struct acpi_table_header *table;
+	acpi_status status;
+	struct acpi_pptt_processor *cpu_node, *cluster_node;
+	u32 acpi_cpu_id;
+	int retval;
+	int is_thread;
+
+	status = acpi_get_table(ACPI_SIG_PPTT, 0, &table);
+	if (ACPI_FAILURE(status)) {
+		acpi_pptt_warn_missing();
+		return -ENOENT;
+	}
+
+	acpi_cpu_id = get_acpi_id_for_cpu(cpu);
+	cpu_node = acpi_find_processor_node(table, acpi_cpu_id);
+	if (cpu_node == NULL || !cpu_node->parent) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+
+	is_thread = cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_IS_THREAD;
+	cluster_node = fetch_pptt_node(table, cpu_node->parent);
+	if (cluster_node == NULL) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+	if (is_thread) {
+		if (!cluster_node->parent) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+		cluster_node = fetch_pptt_node(table, cluster_node->parent);
+		if (cluster_node == NULL) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+	}
+	if (cluster_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID)
+		retval = cluster_node->acpi_processor_id;
+	else
+		retval = ACPI_PTR_DIFF(cluster_node, table);
+
+put_table:
+	acpi_put_table(table);
+
+	return retval;
+}
+
+/**
  * find_acpi_cpu_topology_hetero_id() - Get a core architecture tag
  * @cpu: Kernel logical CPU number
  *
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 4340766..7cb31d9 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -600,6 +600,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return core_mask;
 }
 
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	return &cpu_topology[cpu].cluster_sibling;
+}
+
 void update_siblings_masks(unsigned int cpuid)
 {
 	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -617,6 +622,11 @@ void update_siblings_masks(unsigned int cpuid)
 		if (cpuid_topo->package_id != cpu_topo->package_id)
 			continue;
 
+		if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
+			cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
+			cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
+		}
+
 		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
 		cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
 
@@ -635,6 +645,9 @@ static void clear_cpu_topology(int cpu)
 	cpumask_clear(&cpu_topo->llc_sibling);
 	cpumask_set_cpu(cpu, &cpu_topo->llc_sibling);
 
+	cpumask_clear(&cpu_topo->cluster_sibling);
+	cpumask_set_cpu(cpu, &cpu_topo->cluster_sibling);
+
 	cpumask_clear(&cpu_topo->core_sibling);
 	cpumask_set_cpu(cpu, &cpu_topo->core_sibling);
 	cpumask_clear(&cpu_topo->thread_sibling);
@@ -650,6 +663,7 @@ void __init reset_cpu_topology(void)
 
 		cpu_topo->thread_id = -1;
 		cpu_topo->core_id = -1;
+		cpu_topo->cluster_id = -1;
 		cpu_topo->package_id = -1;
 		cpu_topo->llc_id = -1;
 
diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index 43c0940..8f2b641 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -48,6 +48,9 @@
 define_id_show_func(die_id);
 static DEVICE_ATTR_RO(die_id);
 
+define_id_show_func(cluster_id);
+static DEVICE_ATTR_RO(cluster_id);
+
 define_id_show_func(core_id);
 static DEVICE_ATTR_RO(core_id);
 
@@ -63,6 +66,10 @@
 static BIN_ATTR_RO(core_siblings, 0);
 static BIN_ATTR_RO(core_siblings_list, 0);
 
+define_siblings_read_func(cluster_cpus, cluster_cpumask);
+static BIN_ATTR_RO(cluster_cpus, 0);
+static BIN_ATTR_RO(cluster_cpus_list, 0);
+
 define_siblings_read_func(die_cpus, die_cpumask);
 static BIN_ATTR_RO(die_cpus, 0);
 static BIN_ATTR_RO(die_cpus_list, 0);
@@ -94,6 +101,8 @@
 	&bin_attr_thread_siblings_list,
 	&bin_attr_core_siblings,
 	&bin_attr_core_siblings_list,
+	&bin_attr_cluster_cpus,
+	&bin_attr_cluster_cpus_list,
 	&bin_attr_die_cpus,
 	&bin_attr_die_cpus_list,
 	&bin_attr_package_cpus,
@@ -112,6 +121,7 @@
 static struct attribute *default_attrs[] = {
 	&dev_attr_physical_package_id.attr,
 	&dev_attr_die_id.attr,
+	&dev_attr_cluster_id.attr,
 	&dev_attr_core_id.attr,
 #ifdef CONFIG_SCHED_BOOK
 	&dev_attr_book_id.attr,
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 974d497..fbc2146 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -1353,6 +1353,7 @@ static inline int lpit_read_residency_count_address(u64 *address)
 #ifdef CONFIG_ACPI_PPTT
 int acpi_pptt_cpu_is_thread(unsigned int cpu);
 int find_acpi_cpu_topology(unsigned int cpu, int level);
+int find_acpi_cpu_topology_cluster(unsigned int cpu);
 int find_acpi_cpu_topology_package(unsigned int cpu);
 int find_acpi_cpu_topology_hetero_id(unsigned int cpu);
 int find_acpi_cpu_cache_topology(unsigned int cpu, int level);
@@ -1365,6 +1366,10 @@ static inline int find_acpi_cpu_topology(unsigned int cpu, int level)
 {
 	return -EINVAL;
 }
+static inline int find_acpi_cpu_topology_cluster(unsigned int cpu)
+{
+	return -EINVAL;
+}
 static inline int find_acpi_cpu_topology_package(unsigned int cpu)
 {
 	return -EINVAL;
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index f180240..b97cea8 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -62,10 +62,12 @@ void topology_set_thermal_pressure(const struct cpumask *cpus,
 struct cpu_topology {
 	int thread_id;
 	int core_id;
+	int cluster_id;
 	int package_id;
 	int llc_id;
 	cpumask_t thread_sibling;
 	cpumask_t core_sibling;
+	cpumask_t cluster_sibling;
 	cpumask_t llc_sibling;
 };
 
@@ -73,13 +75,16 @@ struct cpu_topology {
 extern struct cpu_topology cpu_topology[NR_CPUS];
 
 #define topology_physical_package_id(cpu)	(cpu_topology[cpu].package_id)
+#define topology_cluster_id(cpu)	(cpu_topology[cpu].cluster_id)
 #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
 #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
 #define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
+#define topology_cluster_cpumask(cpu)	(&cpu_topology[cpu].cluster_sibling)
 #define topology_llc_cpumask(cpu)	(&cpu_topology[cpu].llc_sibling)
 void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
+const struct cpumask *cpu_clustergroup_mask(int cpu);
 void update_siblings_masks(unsigned int cpu);
 void remove_cpu_topology(unsigned int cpuid);
 void reset_cpu_topology(void);
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7634cd7..80d27d7 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -186,6 +186,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_die_id
 #define topology_die_id(cpu)			((void)(cpu), -1)
 #endif
+#ifndef topology_cluster_id
+#define topology_cluster_id(cpu)		((void)(cpu), -1)
+#endif
 #ifndef topology_core_id
 #define topology_core_id(cpu)			((void)(cpu), 0)
 #endif
@@ -195,6 +198,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu)		cpumask_of(cpu)
 #endif
+#ifndef topology_cluster_cpumask
+#define topology_cluster_cpumask(cpu)		cpumask_of(cpu)
+#endif
 #ifndef topology_die_cpumask
 #define topology_die_cpumask(cpu)		cpumask_of(cpu)
 #endif
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH RESEND 2/3] scheduler: Add cluster scheduler level in core and related Kconfig for ARM64
  2021-09-24  8:51 [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters Barry Song
  2021-09-24  8:51 ` [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die Barry Song
@ 2021-09-24  8:51 ` Barry Song
  2021-10-05  7:35   ` Peter Zijlstra
  2021-09-24  8:51 ` [PATCH RESEND 3/3] scheduler: Add cluster scheduler level for x86 Barry Song
  2021-10-01 10:32 ` [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters Barry Song
  3 siblings, 1 reply; 46+ messages in thread
From: Barry Song @ 2021-09-24  8:51 UTC (permalink / raw)
  To: dietmar.eggemann, linux-kernel, mingo, peterz, vincent.guittot
  Cc: aubrey.li, bp, bristot, bsegall, catalin.marinas, gregkh,
	guodong.xu, hpa, jonathan.cameron, juri.lelli, lenb, linux-acpi,
	linux-arm-kernel, linuxarm, mark.rutland, mgorman, msys.mizuma,
	prime.zeng, rjw, rostedt, song.bao.hua, sudeep.holla, tglx,
	rafael, tim.c.chen, valentin.schneider, will, x86, yangyicong,
	Yicong Yang

From: Barry Song <song.bao.hua@hisilicon.com>

This patch adds scheduler level for clusters and automatically enables
the load balance among clusters. It will directly benefit a lot of
workload which loves more resources such as memory bandwidth, caches.

Testing has widely been done in two different hardware configurations of
Kunpeng920:

 24 cores in one NUMA(6 clusters in each NUMA node);
 32 cores in one NUMA(8 clusters in each NUMA node)

Workload is running on either one NUMA node or four NUMA nodes, thus,
this can estimate the effect of cluster spreading w/ and w/o NUMA load
balance.

* Stream benchmark:

4threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     29929.64 (   0.00%)    32932.68 (  10.03%)
MB/sec scale    29861.10 (   0.00%)    32710.58 (   9.54%)
MB/sec add      27034.42 (   0.00%)    32400.68 (  19.85%)
MB/sec triad    27225.26 (   0.00%)    31965.36 (  17.41%)

6threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     40330.24 (   0.00%)    42377.68 (   5.08%)
MB/sec scale    40196.42 (   0.00%)    42197.90 (   4.98%)
MB/sec add      37427.00 (   0.00%)    41960.78 (  12.11%)
MB/sec triad    37841.36 (   0.00%)    42513.64 (  12.35%)

12threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     52639.82 (   0.00%)    53818.04 (   2.24%)
MB/sec scale    52350.30 (   0.00%)    53253.38 (   1.73%)
MB/sec add      53607.68 (   0.00%)    55198.82 (   2.97%)
MB/sec triad    54776.66 (   0.00%)    56360.40 (   2.89%)

Thus, it could help memory-bound workload especially under medium load.
Similar improvement is also seen in lkp-pbzip2:

* lkp-pbzip2 benchmark

2-96 threads (on 4NUMA * 24cores = 96cores)
                  lkp-pbzip2              lkp-pbzip2
                  w/o patch               w/ patch
Hmean     tput-2   11062841.57 (   0.00%)  11341817.51 *   2.52%*
Hmean     tput-5   26815503.70 (   0.00%)  27412872.65 *   2.23%*
Hmean     tput-8   41873782.21 (   0.00%)  43326212.92 *   3.47%*
Hmean     tput-12  61875980.48 (   0.00%)  64578337.51 *   4.37%*
Hmean     tput-21 105814963.07 (   0.00%) 111381851.01 *   5.26%*
Hmean     tput-30 150349470.98 (   0.00%) 156507070.73 *   4.10%*
Hmean     tput-48 237195937.69 (   0.00%) 242353597.17 *   2.17%*
Hmean     tput-79 360252509.37 (   0.00%) 362635169.23 *   0.66%*
Hmean     tput-96 394571737.90 (   0.00%) 400952978.48 *   1.62%*

2-24 threads (on 1NUMA * 24cores = 24cores)
                 lkp-pbzip2               lkp-pbzip2
                 w/o patch                w/ patch
Hmean     tput-2   11071705.49 (   0.00%)  11296869.10 *   2.03%*
Hmean     tput-4   20782165.19 (   0.00%)  21949232.15 *   5.62%*
Hmean     tput-6   30489565.14 (   0.00%)  33023026.96 *   8.31%*
Hmean     tput-8   40376495.80 (   0.00%)  42779286.27 *   5.95%*
Hmean     tput-12  61264033.85 (   0.00%)  62995632.78 *   2.83%*
Hmean     tput-18  86697139.39 (   0.00%)  86461545.74 (  -0.27%)
Hmean     tput-24 104854637.04 (   0.00%) 104522649.46 *  -0.32%*

In the case of 6 threads and 8 threads, we see the greatest performance
improvement.

Similar improvement can be seen on lkp-pixz though the improvement is
smaller:

* lkp-pixz benchmark

2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores)
                  lkp-pixz               lkp-pixz
                  w/o patch              w/ patch
Hmean     tput-2   6486981.16 (   0.00%)  6561515.98 *   1.15%*
Hmean     tput-4  11645766.38 (   0.00%) 11614628.43 (  -0.27%)
Hmean     tput-6  15429943.96 (   0.00%) 15957350.76 *   3.42%*
Hmean     tput-8  19974087.63 (   0.00%) 20413746.98 *   2.20%*
Hmean     tput-12 28172068.18 (   0.00%) 28751997.06 *   2.06%*
Hmean     tput-18 39413409.54 (   0.00%) 39896830.55 *   1.23%*
Hmean     tput-24 49101815.85 (   0.00%) 49418141.47 *   0.64%*

* SPECrate benchmark

4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores)
		Base     	 	Base
		Run Time   	 	Rate
		-------  	 	---------
4 Copies	w/o 580 (w/ 570)       	w/o 11.1 (w/ 11.3)
8 Copies	w/o 647 (w/ 605)       	w/o 20.0 (w/ 21.4, +7%)
16 Copies	w/o 844 (w/ 844)       	w/o 30.6 (w/ 30.6)

32 Copies(on 4NUMA * 32 cores = 128cores)
[w/o patch]
                 Base     Base        Base
Benchmarks       Copies  Run Time     Rate
--------------- -------  ---------  ---------
500.perlbench_r      32        584       87.2  *
502.gcc_r            32        503       90.2  *
505.mcf_r            32        745       69.4  *
520.omnetpp_r        32       1031       40.7  *
523.xalancbmk_r      32        597       56.6  *
525.x264_r            1         --            CE
531.deepsjeng_r      32        336      109    *
541.leela_r          32        556       95.4  *
548.exchange2_r      32        513      163    *
557.xz_r             32        530       65.2  *
 Est. SPECrate2017_int_base              80.3

[w/ patch]
                  Base     Base        Base
Benchmarks       Copies  Run Time     Rate
--------------- -------  ---------  ---------
500.perlbench_r      32        580      87.8 (+0.688%)  *
502.gcc_r            32        477      95.1 (+5.432%)  *
505.mcf_r            32        644      80.3 (+13.574%) *
520.omnetpp_r        32        942      44.6 (+9.58%)   *
523.xalancbmk_r      32        560      60.4 (+6.714%%) *
525.x264_r            1         --           CE
531.deepsjeng_r      32        337      109  (+0.000%) *
541.leela_r          32        554      95.6 (+0.210%) *
548.exchange2_r      32        515      163  (+0.000%) *
557.xz_r             32        524      66.0 (+1.227%) *
 Est. SPECrate2017_int_base              83.7 (+4.062%)

On the other hand, it is slightly helpful to CPU-bound tasks like
kernbench:

* 24-96 threads kernbench (on 4NUMA * 24cores = 96cores)
                     kernbench              kernbench
                     w/o cluster            w/ cluster
Min       user-24    12054.67 (   0.00%)    12024.19 (   0.25%)
Min       syst-24     1751.51 (   0.00%)     1731.68 (   1.13%)
Min       elsp-24      600.46 (   0.00%)      598.64 (   0.30%)
Min       user-48    12361.93 (   0.00%)    12315.32 (   0.38%)
Min       syst-48     1917.66 (   0.00%)     1892.73 (   1.30%)
Min       elsp-48      333.96 (   0.00%)      332.57 (   0.42%)
Min       user-96    12922.40 (   0.00%)    12921.17 (   0.01%)
Min       syst-96     2143.94 (   0.00%)     2110.39 (   1.56%)
Min       elsp-96      211.22 (   0.00%)      210.47 (   0.36%)
Amean     user-24    12063.99 (   0.00%)    12030.78 *   0.28%*
Amean     syst-24     1755.20 (   0.00%)     1735.53 *   1.12%*
Amean     elsp-24      601.60 (   0.00%)      600.19 (   0.23%)
Amean     user-48    12362.62 (   0.00%)    12315.56 *   0.38%*
Amean     syst-48     1921.59 (   0.00%)     1894.95 *   1.39%*
Amean     elsp-48      334.10 (   0.00%)      332.82 *   0.38%*
Amean     user-96    12925.27 (   0.00%)    12922.63 (   0.02%)
Amean     syst-96     2146.66 (   0.00%)     2122.20 *   1.14%*
Amean     elsp-96      211.96 (   0.00%)      211.79 (   0.08%)

Note this patch isn't an universal win, it might hurt those workload
which can benefit from packing. Though tasks which want to take
advantages of lower communication latency of one cluster won't
necessarily been packed in one cluster while kernel is not aware of
clusters, they have some chance to be randomly packed. But this
patch will make them more likely spread.

Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 arch/arm64/Kconfig             | 7 +++++++
 include/linux/sched/topology.h | 7 +++++++
 include/linux/topology.h       | 7 +++++++
 kernel/sched/topology.c        | 5 +++++
 4 files changed, 26 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5c7ae4c..7e4651a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -989,6 +989,13 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config SCHED_CLUSTER
+	bool "Cluster scheduler support"
+	help
+	  Cluster scheduler support improves the CPU scheduler's decision
+	  making when dealing with machines that have clusters(sharing internal
+	  bus or sharing LLC cache tag). If unsure say N here.
+
 config SCHED_SMT
 	bool "SMT scheduler support"
 	help
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778..2f9166f 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -42,6 +42,13 @@ static inline int cpu_smt_flags(void)
 }
 #endif
 
+#ifdef CONFIG_SCHED_CLUSTER
+static inline int cpu_cluster_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+#endif
+
 #ifdef CONFIG_SCHED_MC
 static inline int cpu_core_flags(void)
 {
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 80d27d7..0b3704a 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -212,6 +212,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
 }
 #endif
 
+#if defined(CONFIG_SCHED_CLUSTER) && !defined(cpu_cluster_mask)
+static inline const struct cpumask *cpu_cluster_mask(int cpu)
+{
+	return topology_cluster_cpumask(cpu);
+}
+#endif
+
 static inline const struct cpumask *cpu_cpu_mask(int cpu)
 {
 	return cpumask_of_node(cpu_to_node(cpu));
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 4e8698e..7d27559 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1627,6 +1627,11 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH RESEND 3/3] scheduler: Add cluster scheduler level for x86
  2021-09-24  8:51 [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters Barry Song
  2021-09-24  8:51 ` [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die Barry Song
  2021-09-24  8:51 ` [PATCH RESEND 2/3] scheduler: Add cluster scheduler level in core and related Kconfig for ARM64 Barry Song
@ 2021-09-24  8:51 ` Barry Song
  2021-10-15  9:44   ` [tip: sched/core] sched: " tip-bot2 for Tim Chen
  2021-10-01 10:32 ` [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters Barry Song
  3 siblings, 1 reply; 46+ messages in thread
From: Barry Song @ 2021-09-24  8:51 UTC (permalink / raw)
  To: dietmar.eggemann, linux-kernel, mingo, peterz, vincent.guittot
  Cc: aubrey.li, bp, bristot, bsegall, catalin.marinas, gregkh,
	guodong.xu, hpa, jonathan.cameron, juri.lelli, lenb, linux-acpi,
	linux-arm-kernel, linuxarm, mark.rutland, mgorman, msys.mizuma,
	prime.zeng, rjw, rostedt, song.bao.hua, sudeep.holla, tglx,
	rafael, tim.c.chen, valentin.schneider, will, x86, yangyicong

From: Tim Chen <tim.c.chen@linux.intel.com>

There are x86 CPU architectures (e.g. Jacobsville) where L2 cahce is
shared among a cluster of cores instead of being exclusive to one
single core.
To prevent oversubscription of L2 cache, load should be balanced
between such L2 clusters, especially for tasks with no shared data.
On benchmark such as SPECrate mcf test, this change provides a
boost to performance especially on medium load system on Jacobsville.
on a Jacobsville that has 24 Atom cores, arranged into 6 clusters
of 4 cores each, the benchmark number is as follow:

 Improvement over baseline kernel for mcf_r
 copies		run time	base rate
 1		-0.1%		-0.2%
 6		25.1%		25.1%
 12		18.8%		19.0%
 24		0.3%		0.3%

So this looks pretty good. In terms of the system's task distribution,
some pretty bad clumping can be seen for the vanilla kernel without
the L2 cluster domain for the 6 and 12 copies case. With the extra
domain for cluster, the load does get evened out between the clusters.

Note this patch isn't an universal win as spreading isn't necessarily
a win, particually for those workload who can benefit from packing.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 arch/x86/Kconfig                |  8 ++++++++
 arch/x86/include/asm/smp.h      |  7 +++++++
 arch/x86/include/asm/topology.h |  3 +++
 arch/x86/kernel/cpu/cacheinfo.c |  1 +
 arch/x86/kernel/cpu/common.c    |  3 +++
 arch/x86/kernel/smpboot.c       | 44 ++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index dad7f85..bd27b1c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1001,6 +1001,14 @@ config NR_CPUS
 	  This is purely to save memory: each supported CPU adds about 8KB
 	  to the kernel image.
 
+config SCHED_CLUSTER
+	bool "Cluster scheduler support"
+	default n
+	help
+	 Cluster scheduler support improves the CPU scheduler's decision
+	 making when dealing with machines that have clusters of CPUs
+	 sharing L2 cache. If unsure say N here.
+
 config SCHED_SMT
 	def_bool y if SMP
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 630ff08..08b0e90 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -16,7 +16,9 @@
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
+DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
+DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id);
 DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
 
 static inline struct cpumask *cpu_llc_shared_mask(int cpu)
@@ -24,6 +26,11 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
 	return per_cpu(cpu_llc_shared_map, cpu);
 }
 
+static inline struct cpumask *cpu_l2c_shared_mask(int cpu)
+{
+	return per_cpu(cpu_l2c_shared_map, cpu);
+}
+
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u32, x86_cpu_to_acpiid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9239399..2548d82 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -103,17 +103,20 @@ static inline void setup_node_to_cpumask_map(void) { }
 #include <asm-generic/topology.h>
 
 extern const struct cpumask *cpu_coregroup_mask(int cpu);
+extern const struct cpumask *cpu_clustergroup_mask(int cpu);
 
 #define topology_logical_package_id(cpu)	(cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)	(cpu_data(cpu).phys_proc_id)
 #define topology_logical_die_id(cpu)		(cpu_data(cpu).logical_die_id)
 #define topology_die_id(cpu)			(cpu_data(cpu).cpu_die_id)
+#define topology_cluster_id(cpu)		(per_cpu(cpu_l2c_id, cpu))
 #define topology_core_id(cpu)			(cpu_data(cpu).cpu_core_id)
 
 extern unsigned int __max_die_per_package;
 
 #ifdef CONFIG_SMP
 #define topology_die_cpumask(cpu)		(per_cpu(cpu_die_map, cpu))
+#define topology_cluster_cpumask(cpu)		(cpu_clustergroup_mask(cpu))
 #define topology_core_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)		(per_cpu(cpu_sibling_map, cpu))
 
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index b5e36bd..fe98a14 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -846,6 +846,7 @@ void init_intel_cacheinfo(struct cpuinfo_x86 *c)
 		l2 = new_l2;
 #ifdef CONFIG_SMP
 		per_cpu(cpu_llc_id, cpu) = l2_id;
+		per_cpu(cpu_l2c_id, cpu) = l2_id;
 #endif
 	}
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 0f88859..1c7897c 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -85,6 +85,9 @@ u16 get_llc_id(unsigned int cpu)
 }
 EXPORT_SYMBOL_GPL(get_llc_id);
 
+/* L2 cache ID of each logical CPU */
+DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id) = BAD_APICID;
+
 /* correctly size the local cpu masks */
 void __init setup_cpu_local_masks(void)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 85f6e24..5094ab0 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,8 @@
 
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
+
 /* Per CPU bogomips and other parameters */
 DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
@@ -464,6 +466,21 @@ static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 	return false;
 }
 
+static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+	/* Do not match if we do not have a valid APICID for cpu: */
+	if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
+		return false;
+
+	/* Do not match if L2 cache id does not match: */
+	if (per_cpu(cpu_l2c_id, cpu1) != per_cpu(cpu_l2c_id, cpu2))
+		return false;
+
+	return topology_sane(c, o, "l2c");
+}
+
 /*
  * Unlike the other levels, we do not enforce keeping a
  * multicore group inside a NUMA node.  If this happens, we will
@@ -523,7 +540,7 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 }
 
 
-#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_MC)
+#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_CLUSTER) || defined(CONFIG_SCHED_MC)
 static inline int x86_sched_itmt_flags(void)
 {
 	return sysctl_sched_itmt_enabled ? SD_ASYM_PACKING : 0;
@@ -541,12 +558,21 @@ static int x86_smt_flags(void)
 	return cpu_smt_flags() | x86_sched_itmt_flags();
 }
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+static int x86_cluster_flags(void)
+{
+	return cpu_cluster_flags() | x86_sched_itmt_flags();
+}
+#endif
 #endif
 
 static struct sched_domain_topology_level x86_numa_in_package_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
@@ -557,6 +583,9 @@ static int x86_smt_flags(void)
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
@@ -584,6 +613,7 @@ void set_cpu_sibling_map(int cpu)
 	if (!has_mp) {
 		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
 		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
+		cpumask_set_cpu(cpu, cpu_l2c_shared_mask(cpu));
 		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
 		cpumask_set_cpu(cpu, topology_die_cpumask(cpu));
 		c->booted_cores = 1;
@@ -602,6 +632,9 @@ void set_cpu_sibling_map(int cpu)
 		if ((i == cpu) || (has_mp && match_llc(c, o)))
 			link_mask(cpu_llc_shared_mask, cpu, i);
 
+		if ((i == cpu) || (has_mp && match_l2c(c, o)))
+			link_mask(cpu_l2c_shared_mask, cpu, i);
+
 		if ((i == cpu) || (has_mp && match_die(c, o)))
 			link_mask(topology_die_cpumask, cpu, i);
 	}
@@ -652,6 +685,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return cpu_llc_shared_mask(cpu);
 }
 
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	return cpu_l2c_shared_mask(cpu);
+}
+
 static void impress_friends(void)
 {
 	int cpu;
@@ -1335,6 +1373,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 		zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_die_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
+		zalloc_cpumask_var(&per_cpu(cpu_l2c_shared_map, i), GFP_KERNEL);
 	}
 
 	/*
@@ -1564,7 +1603,10 @@ static void remove_siblinginfo(int cpu)
 
 	for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
 		cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
+	for_each_cpu(sibling, cpu_l2c_shared_mask(cpu))
+		cpumask_clear_cpu(cpu, cpu_l2c_shared_mask(sibling));
 	cpumask_clear(cpu_llc_shared_mask(cpu));
+	cpumask_clear(cpu_l2c_shared_mask(cpu));
 	cpumask_clear(topology_sibling_cpumask(cpu));
 	cpumask_clear(topology_core_cpumask(cpu));
 	cpumask_clear(topology_die_cpumask(cpu));
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-09-24  8:51 [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters Barry Song
                   ` (2 preceding siblings ...)
  2021-09-24  8:51 ` [PATCH RESEND 3/3] scheduler: Add cluster scheduler level for x86 Barry Song
@ 2021-10-01 10:32 ` Barry Song
  2021-10-01 10:39   ` Vincent Guittot
  3 siblings, 1 reply; 46+ messages in thread
From: Barry Song @ 2021-10-01 10:32 UTC (permalink / raw)
  To: Dietmar Eggemann, LKML, Ingo Molnar, Peter Zijlstra, Vincent Guittot
  Cc: aubrey.li, Borislav Petkov, Daniel Bristot de Oliveira,
	Ben Segall, Catalin Marinas, Greg Kroah-Hartman, Guodong Xu,
	H. Peter Anvin, Jonathan Cameron, Juri Lelli, lenb, linux-acpi,
	linux-arm-kernel, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, prime.zeng, rjw, Steven Rostedt, Barry Song,
	Sudeep Holla, Thomas Gleixner, rafael, Tim Chen,
	Valentin Schneider, will, x86, yangyicong

Hi Vincent, Dietmar, Peter, Ingo,
Do you have any comment on this first series which exposes cluster topology
of ARM64 kunpeng 920 & x86 Jacobsville and supports load balance only for
the 1st stage?
I will be very grateful for your comments so that things can move forward in the
right direction. I think Tim also looks forward to bringing up cluster
support in
Jacobsville.

Best Regards
Barry

On Fri, Sep 24, 2021 at 8:51 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <song.bao.hua@hisilicon.com>
>
> ARM64 machines like kunpeng920 and x86 machines like Jacobsville have a
> level of hardware topology in which some CPU cores, typically 4 cores,
> share L3 tags or L2 cache.
>
> That means spreading those tasks between clusters will bring more memory
> bandwidth and decrease cache contention. But packing tasks might help
> decrease the latency of cache synchronization.
>
> We have three series to bring up cluster level scheduler in kernel.
> This is the first series.
>
> 1st series(this one): make kernel aware of cluster, expose cluster to sysfs
> ABI and add SCHED_CLUSTER which can make load balance between clusters to
> benefit lots of workload.
> Testing shows this can hugely boost the performance, for example, this
> can increase 25.1% of SPECrate mcf on Jacobsville and 13.574% of mcf
> on kunpeng920.
>
> 2nd series(wake_affine): modify the wake_affine and let kernel select CPUs
> within cluster first before scanning the whole LLC so that we can benefit
> from the lower latency of cache coherence within one single cluster. This
> series is much more tricky. so we would like to send it after we build
> the base of cluster by the 1st series. Prototype for 2nd series is here:
> https://op-lists.linaro.org/pipermail/linaro-open-discussions/2021-June/000219.html
>
> 3rd series: a sysctl to permit users to enable or disable cluster scheduler
> from Tim Chen. Prototype here:
> Add run time sysctl to enable/disable cluster scheduling
> https://op-lists.linaro.org/pipermail/linaro-open-discussions/2021-July/000258.html
>
> This series is resent and rebased on 5.15-rc2.
>
> -V1:
>  differences with RFC v6
>  * removed wake_affine path modifcation, which will be separately second series
>  * cluster_id is gotten by detecting valid ID before falling back to use offset
>  * lots of benchmark data from both x86 Jacobsville and ARM64 kunpeng920
>
> -RFC v6:
> https://lore.kernel.org/lkml/20210420001844.9116-1-song.bao.hua@hisilicon.com/
>
> Barry Song (1):
>   scheduler: Add cluster scheduler level in core and related Kconfig for
>     ARM64
>
> Jonathan Cameron (1):
>   topology: Represent clusters of CPUs within a die
>
> Tim Chen (1):
>   scheduler: Add cluster scheduler level for x86
>
>  Documentation/ABI/stable/sysfs-devices-system-cpu | 15 +++++
>  Documentation/admin-guide/cputopology.rst         | 12 ++--
>  arch/arm64/Kconfig                                |  7 +++
>  arch/arm64/kernel/topology.c                      |  2 +
>  arch/x86/Kconfig                                  |  8 +++
>  arch/x86/include/asm/smp.h                        |  7 +++
>  arch/x86/include/asm/topology.h                   |  3 +
>  arch/x86/kernel/cpu/cacheinfo.c                   |  1 +
>  arch/x86/kernel/cpu/common.c                      |  3 +
>  arch/x86/kernel/smpboot.c                         | 44 ++++++++++++++-
>  drivers/acpi/pptt.c                               | 67 +++++++++++++++++++++++
>  drivers/base/arch_topology.c                      | 14 +++++
>  drivers/base/topology.c                           | 10 ++++
>  include/linux/acpi.h                              |  5 ++
>  include/linux/arch_topology.h                     |  5 ++
>  include/linux/sched/topology.h                    |  7 +++
>  include/linux/topology.h                          | 13 +++++
>  kernel/sched/topology.c                           |  5 ++
>  18 files changed, 223 insertions(+), 5 deletions(-)
>
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-10-01 10:32 ` [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters Barry Song
@ 2021-10-01 10:39   ` Vincent Guittot
  2021-10-01 14:57     ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2021-10-01 10:39 UTC (permalink / raw)
  To: Barry Song
  Cc: Dietmar Eggemann, LKML, Ingo Molnar, Peter Zijlstra, Aubrey Li,
	Borislav Petkov, Daniel Bristot de Oliveira, Ben Segall,
	Catalin Marinas, Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin,
	Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Tim Chen, Valentin Schneider,
	Will Deacon, x86, yangyicong

Hi Barry,

On Fri, 1 Oct 2021 at 12:32, Barry Song <21cnbao@gmail.com> wrote:
>
> Hi Vincent, Dietmar, Peter, Ingo,
> Do you have any comment on this first series which exposes cluster topology
> of ARM64 kunpeng 920 & x86 Jacobsville and supports load balance only for
> the 1st stage?
> I will be very grateful for your comments so that things can move forward in the
> right direction. I think Tim also looks forward to bringing up cluster
> support in
> Jacobsville.

This patchset makes sense to me and the addition of a new scheduling
level to better reflect the HW topology goes in the right direction.

>
> Best Regards
> Barry
>
> On Fri, Sep 24, 2021 at 8:51 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <song.bao.hua@hisilicon.com>
> >
> > ARM64 machines like kunpeng920 and x86 machines like Jacobsville have a
> > level of hardware topology in which some CPU cores, typically 4 cores,
> > share L3 tags or L2 cache.
> >
> > That means spreading those tasks between clusters will bring more memory
> > bandwidth and decrease cache contention. But packing tasks might help
> > decrease the latency of cache synchronization.
> >
> > We have three series to bring up cluster level scheduler in kernel.
> > This is the first series.
> >
> > 1st series(this one): make kernel aware of cluster, expose cluster to sysfs
> > ABI and add SCHED_CLUSTER which can make load balance between clusters to
> > benefit lots of workload.
> > Testing shows this can hugely boost the performance, for example, this
> > can increase 25.1% of SPECrate mcf on Jacobsville and 13.574% of mcf
> > on kunpeng920.
> >
> > 2nd series(wake_affine): modify the wake_affine and let kernel select CPUs
> > within cluster first before scanning the whole LLC so that we can benefit
> > from the lower latency of cache coherence within one single cluster. This
> > series is much more tricky. so we would like to send it after we build
> > the base of cluster by the 1st series. Prototype for 2nd series is here:
> > https://op-lists.linaro.org/pipermail/linaro-open-discussions/2021-June/000219.html
> >
> > 3rd series: a sysctl to permit users to enable or disable cluster scheduler
> > from Tim Chen. Prototype here:
> > Add run time sysctl to enable/disable cluster scheduling
> > https://op-lists.linaro.org/pipermail/linaro-open-discussions/2021-July/000258.html
> >
> > This series is resent and rebased on 5.15-rc2.
> >
> > -V1:
> >  differences with RFC v6
> >  * removed wake_affine path modifcation, which will be separately second series
> >  * cluster_id is gotten by detecting valid ID before falling back to use offset
> >  * lots of benchmark data from both x86 Jacobsville and ARM64 kunpeng920
> >
> > -RFC v6:
> > https://lore.kernel.org/lkml/20210420001844.9116-1-song.bao.hua@hisilicon.com/
> >
> > Barry Song (1):
> >   scheduler: Add cluster scheduler level in core and related Kconfig for
> >     ARM64
> >
> > Jonathan Cameron (1):
> >   topology: Represent clusters of CPUs within a die
> >
> > Tim Chen (1):
> >   scheduler: Add cluster scheduler level for x86
> >
> >  Documentation/ABI/stable/sysfs-devices-system-cpu | 15 +++++
> >  Documentation/admin-guide/cputopology.rst         | 12 ++--
> >  arch/arm64/Kconfig                                |  7 +++
> >  arch/arm64/kernel/topology.c                      |  2 +
> >  arch/x86/Kconfig                                  |  8 +++
> >  arch/x86/include/asm/smp.h                        |  7 +++
> >  arch/x86/include/asm/topology.h                   |  3 +
> >  arch/x86/kernel/cpu/cacheinfo.c                   |  1 +
> >  arch/x86/kernel/cpu/common.c                      |  3 +
> >  arch/x86/kernel/smpboot.c                         | 44 ++++++++++++++-
> >  drivers/acpi/pptt.c                               | 67 +++++++++++++++++++++++
> >  drivers/base/arch_topology.c                      | 14 +++++
> >  drivers/base/topology.c                           | 10 ++++
> >  include/linux/acpi.h                              |  5 ++
> >  include/linux/arch_topology.h                     |  5 ++
> >  include/linux/sched/topology.h                    |  7 +++
> >  include/linux/topology.h                          | 13 +++++
> >  kernel/sched/topology.c                           |  5 ++
> >  18 files changed, 223 insertions(+), 5 deletions(-)
> >
> > --
> > 1.8.3.1
> >

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-10-01 10:39   ` Vincent Guittot
@ 2021-10-01 14:57     ` Peter Zijlstra
  2021-10-01 23:22       ` Tim Chen
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-01 14:57 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Barry Song, Dietmar Eggemann, LKML, Ingo Molnar, Aubrey Li,
	Borislav Petkov, Daniel Bristot de Oliveira, Ben Segall,
	Catalin Marinas, Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin,
	Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Tim Chen, Valentin Schneider,
	Will Deacon, x86, yangyicong

On Fri, Oct 01, 2021 at 12:39:56PM +0200, Vincent Guittot wrote:
> Hi Barry,
> 
> On Fri, 1 Oct 2021 at 12:32, Barry Song <21cnbao@gmail.com> wrote:
> >
> > Hi Vincent, Dietmar, Peter, Ingo,
> > Do you have any comment on this first series which exposes cluster topology
> > of ARM64 kunpeng 920 & x86 Jacobsville and supports load balance only for
> > the 1st stage?
> > I will be very grateful for your comments so that things can move forward in the
> > right direction. I think Tim also looks forward to bringing up cluster
> > support in
> > Jacobsville.
> 
> This patchset makes sense to me and the addition of a new scheduling
> level to better reflect the HW topology goes in the right direction.

So I had a look, dreading the selecti-idle-sibling changes, and was
pleasantly surprised they're gone :-)

As is, this does indeed look like something mergable without too much
hassle.

The one questino I have is, do we want default y?

The one nit I have is the Kconfig text, I'm not really sure that's
clarifying what a cluster is.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-10-01 14:57     ` Peter Zijlstra
@ 2021-10-01 23:22       ` Tim Chen
  2021-10-02  7:09         ` Barry Song
  2021-10-05  7:50         ` Peter Zijlstra
  0 siblings, 2 replies; 46+ messages in thread
From: Tim Chen @ 2021-10-01 23:22 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot
  Cc: Barry Song, Dietmar Eggemann, LKML, Ingo Molnar, Aubrey Li,
	Borislav Petkov, Daniel Bristot de Oliveira, Ben Segall,
	Catalin Marinas, Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin,
	Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Valentin Schneider,
	Will Deacon, x86, yangyicong

On Fri, 2021-10-01 at 16:57 +0200, Peter Zijlstra wrote:
> On Fri, Oct 01, 2021 at 12:39:56PM +0200, Vincent Guittot wrote:
> > Hi Barry,
> > 
> > On Fri, 1 Oct 2021 at 12:32, Barry Song <21cnbao@gmail.com> wrote:
> > > Hi Vincent, Dietmar, Peter, Ingo,
> > > Do you have any comment on this first series which exposes
> > > cluster topology
> > > of ARM64 kunpeng 920 & x86 Jacobsville and supports load balance
> > > only for
> > > the 1st stage?
> > > I will be very grateful for your comments so that things can move
> > > forward in the
> > > right direction. I think Tim also looks forward to bringing up
> > > cluster
> > > support in
> > > Jacobsville.
> > 
> > This patchset makes sense to me and the addition of a new
> > scheduling
> > level to better reflect the HW topology goes in the right
> > direction.
> 
> So I had a look, dreading the selecti-idle-sibling changes, and was
> pleasantly surprised they're gone :-)
> 
> As is, this does indeed look like something mergable without too much
> hassle.
> 
> The one questino I have is, do we want default y?

I also agree that default y is preferable.

> 
> The one nit I have is the Kconfig text, I'm not really sure that's
> clarifying what a cluster is.

Do you have a preference of a different name other than cluster?
Or simply better documentation on what a cluster is for ARM64
and x86 in Kconfig?

Thanks.

Tim


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-10-01 23:22       ` Tim Chen
@ 2021-10-02  7:09         ` Barry Song
  2021-10-04 22:54           ` Tim Chen
  2021-10-05  8:04           ` Peter Zijlstra
  2021-10-05  7:50         ` Peter Zijlstra
  1 sibling, 2 replies; 46+ messages in thread
From: Barry Song @ 2021-10-02  7:09 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, LKML,
	Ingo Molnar, Aubrey Li, Borislav Petkov,
	Daniel Bristot de Oliveira, Ben Segall, Catalin Marinas,
	Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin, Jonathan Cameron,
	Juri Lelli, Cc: Len Brown, ACPI Devel Maling List, LAK, Linuxarm,
	Mark Rutland, Mel Gorman, msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Valentin Schneider,
	Will Deacon, x86, yangyicong

On Sat, Oct 2, 2021 at 12:22 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Fri, 2021-10-01 at 16:57 +0200, Peter Zijlstra wrote:
> > On Fri, Oct 01, 2021 at 12:39:56PM +0200, Vincent Guittot wrote:
> > > Hi Barry,
> > >
> > > On Fri, 1 Oct 2021 at 12:32, Barry Song <21cnbao@gmail.com> wrote:
> > > > Hi Vincent, Dietmar, Peter, Ingo,
> > > > Do you have any comment on this first series which exposes
> > > > cluster topology
> > > > of ARM64 kunpeng 920 & x86 Jacobsville and supports load balance
> > > > only for
> > > > the 1st stage?
> > > > I will be very grateful for your comments so that things can move
> > > > forward in the
> > > > right direction. I think Tim also looks forward to bringing up
> > > > cluster
> > > > support in
> > > > Jacobsville.
> > >
> > > This patchset makes sense to me and the addition of a new
> > > scheduling
> > > level to better reflect the HW topology goes in the right
> > > direction.
> >
> > So I had a look, dreading the selecti-idle-sibling changes, and was
> > pleasantly surprised they're gone :-)

Thanks, Peter and Vincent for reviewing.

My tiny scheduler team is still hardly working on the
select-idle-sibling changes.
And that one will be sent as a separate series as an improvement to this series.
I promise the wake-affine series won't be that scary when you see it
next time :-)

> >
> > As is, this does indeed look like something mergable without too much
> > hassle.
> >
> > The one questino I have is, do we want default y?
>
> I also agree that default y is preferable.

Thanks, Tim, for your comments.
I am ok to make it default "Y" for x86 after having a better doc as below:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bd27b1cdac34..940eb1fe0abb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1002,12 +1002,17 @@ config NR_CPUS
          to the kernel image.

 config SCHED_CLUSTER
-       bool "Cluster scheduler support"
-       default n
+       def_bool y
+       prompt "Cluster scheduler support"
        help
         Cluster scheduler support improves the CPU scheduler's decision
-        making when dealing with machines that have clusters of CPUs
-        sharing L2 cache. If unsure say N here.
+        making when dealing with machines that have clusters of CPUs.
+        Cluster usually means a couple of CPUs which are placed closely
+        by sharing mid-level caches, last-level cache tags or internal
+        busses. For example, on x86 Jacobsville, each 4 CPUs share one
+        L2 cache. This feature isn't a universal win because it can bring
+        a cost of slightly increased overhead in some places. If unsure
+        say N here.

This also aligns well with SCHED_MC and SCHED_SMT in arch/x86/kconfig:
config SCHED_MC
    def_bool y
    prompt "Multi-core scheduler support"

config SCHED_SMT
    def_bool y if SMP

But ARM64 is running in a different tradition, arch/arm64/Kconfig has
SCHED_MC and SCHED_SMT as below:
   config SCHED_MC
   bool "Multi-core scheduler support"
   help
    ...

config SCHED_SMT
  bool "SMT scheduler support"
  help
  ...

I don't want to be an odd man :-)  So for ARM64, I vote keeping the
Kconfig file as is.  And I am planning to modify arch/arm64/defconfig
in second patchset(select-idle-sibling) by adding
CONFIG_SCHED_CLUSTR=y
as load-balance plus wake-affine changes seem to make cluster
scheduler much more widely win on kunpeng920 while doing load-
balance only can sometimes hurt. so I don't mind holding "N" for
a while on the ARM64 platform.

>
> >
> > The one nit I have is the Kconfig text, I'm not really sure that's
> > clarifying what a cluster is.
>
> Do you have a preference of a different name other than cluster?
> Or simply better documentation on what a cluster is for ARM64
> and x86 in Kconfig?

Anyway, naming is really a hard thing. cluster seems not a bad name for
ARM SoCs as besides kunpeng, some other ARM SoCs are also using this
name in specifications, for example, neoverse-n1, phytium etc.

Will we use the same name between x86 and ARM and just refine the document
as below? Does the below doc explain what is "cluster" better?

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7e4651a1aaf4..86821e83b935 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -993,8 +993,13 @@ config SCHED_CLUSTER
        bool "Cluster scheduler support"
        help
          Cluster scheduler support improves the CPU scheduler's decision
-         making when dealing with machines that have clusters(sharing internal
-         bus or sharing LLC cache tag). If unsure say N here.
+         making when dealing with machines that have clusters of CPUs.
+         Cluster usually means a couple of CPUs which are placed closely
+         by sharing mid-level caches, last-level cache tags or internal
+         busses. For example, on Hisilicon Kunpeng920, each 4 CPUs share
+         LLC cache tags. This feature isn't a universal win because it
+         can bring a cost of slightly increased overhead in some places.
+         If unsure say N here.

 config SCHED_SMT
        bool "SMT scheduler support"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bd27b1cdac34..940eb1fe0abb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1002,12 +1002,17 @@ config NR_CPUS
          to the kernel image.

 config SCHED_CLUSTER
-       bool "Cluster scheduler support"
-       default n
+       def_bool y
+       prompt "Cluster scheduler support"
        help
         Cluster scheduler support improves the CPU scheduler's decision
-        making when dealing with machines that have clusters of CPUs
-        sharing L2 cache. If unsure say N here.
+        making when dealing with machines that have clusters of CPUs.
+        Cluster usually means a couple of CPUs which are placed closely
+        by sharing mid-level caches, last-level cache tags or internal
+        busses. For example, on x86 Jacobsville, each 4 CPUs share one
+        L2 cache. This feature isn't a universal win because it can bring
+        a cost of slightly increased overhead in some places. If unsure
+        say N here.

 config SCHED_SMT
        def_bool y if SMP


>
> Thanks.
>
> Tim
>

Thanks
barry

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-10-02  7:09         ` Barry Song
@ 2021-10-04 22:54           ` Tim Chen
  2021-10-05  7:54             ` Peter Zijlstra
  2021-10-05  8:04           ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Tim Chen @ 2021-10-04 22:54 UTC (permalink / raw)
  To: Barry Song
  Cc: Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, LKML,
	Ingo Molnar, Aubrey Li, Borislav Petkov,
	Daniel Bristot de Oliveira, Ben Segall, Catalin Marinas,
	Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin, Jonathan Cameron,
	Juri Lelli, Cc: Len Brown, ACPI Devel Maling List, LAK, Linuxarm,
	Mark Rutland, Mel Gorman, msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Valentin Schneider,
	Will Deacon, x86, yangyicong

On Sat, 2021-10-02 at 20:09 +1300, Barry Song wrote:
> 
> 
> Thanks, Tim, for your comments.
> I am ok to make it default "Y" for x86 after having a better doc as
> below:
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index bd27b1cdac34..940eb1fe0abb 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1002,12 +1002,17 @@ config NR_CPUS
>           to the kernel image.
> 
>  config SCHED_CLUSTER
> -       bool "Cluster scheduler support"
> -       default n
> +       def_bool y
> +       prompt "Cluster scheduler support"
>         help
>          Cluster scheduler support improves the CPU scheduler's
> decision
> -        making when dealing with machines that have clusters of CPUs
> -        sharing L2 cache. If unsure say N here.
> +        making when dealing with machines that have clusters of
> CPUs.
> +        Cluster usually means a couple of CPUs which are placed
> closely
> +        by sharing mid-level caches, last-level cache tags or
> internal
> +        busses. For example, on x86 Jacobsville, each 4 CPUs share
> one
> +        L2 cache. This feature isn't a universal win because it can
> bring
> +        a cost of slightly increased overhead in some places. If
> unsure
> +        say N here.
> 
> This also aligns well with SCHED_MC and SCHED_SMT in
> arch/x86/kconfig:
> config SCHED_MC
>     def_bool y
>     prompt "Multi-core scheduler support"
> 
> config SCHED_SMT
>     def_bool y if SMP
> 
> But ARM64 is running in a different tradition, arch/arm64/Kconfig has
> SCHED_MC and SCHED_SMT as below:
>    config SCHED_MC
>    bool "Multi-core scheduler support"
>    help
>     ...
> 
> config SCHED_SMT
>   bool "SMT scheduler support"
>   help
>   ...
> 
> 

Barry,

Found one minor fix to the x86 patch to take care of compile error for
!CONFIG_SMP.

Thanks.

Tim

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 2548d824f103..cc164777e661 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -109,12 +109,12 @@ extern const struct cpumask *cpu_clustergroup_mask(int cpu);
 #define topology_physical_package_id(cpu)	(cpu_data(cpu).phys_proc_id)
 #define topology_logical_die_id(cpu)		(cpu_data(cpu).logical_die_id)
 #define topology_die_id(cpu)			(cpu_data(cpu).cpu_die_id)
-#define topology_cluster_id(cpu)		(per_cpu(cpu_l2c_id, cpu))
 #define topology_core_id(cpu)			(cpu_data(cpu).cpu_core_id)
 
 extern unsigned int __max_die_per_package;
 
 #ifdef CONFIG_SMP
+#define topology_cluster_id(cpu)		(per_cpu(cpu_l2c_id, cpu))
 #define topology_die_cpumask(cpu)		(per_cpu(cpu_die_map, cpu))
 #define topology_cluster_cpumask(cpu)		(cpu_clustergroup_mask(cpu))
 #define topology_core_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 2/3] scheduler: Add cluster scheduler level in core and related Kconfig for ARM64
  2021-09-24  8:51 ` [PATCH RESEND 2/3] scheduler: Add cluster scheduler level in core and related Kconfig for ARM64 Barry Song
@ 2021-10-05  7:35   ` Peter Zijlstra
  2021-10-05  9:01     ` Barry Song
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-05  7:35 UTC (permalink / raw)
  To: Barry Song
  Cc: dietmar.eggemann, linux-kernel, mingo, vincent.guittot,
	aubrey.li, bp, bristot, bsegall, catalin.marinas, gregkh,
	guodong.xu, hpa, jonathan.cameron, juri.lelli, lenb, linux-acpi,
	linux-arm-kernel, linuxarm, mark.rutland, mgorman, msys.mizuma,
	prime.zeng, rjw, rostedt, song.bao.hua, sudeep.holla, tglx,
	rafael, tim.c.chen, valentin.schneider, will, x86, yangyicong,
	Yicong Yang

On Fri, Sep 24, 2021 at 08:51:03PM +1200, Barry Song wrote:
> From: Barry Song <song.bao.hua@hisilicon.com>
> 

> 
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>

That is not a valid SoB pattern. The first SoB should be the author, the
last SoB should be the one sending the patch, in this case both are
Barry. What's Yicong's contribution in all this?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-10-01 23:22       ` Tim Chen
  2021-10-02  7:09         ` Barry Song
@ 2021-10-05  7:50         ` Peter Zijlstra
  2021-10-05  9:15           ` Barry Song
  2021-10-05 13:42           ` Valentin Schneider
  1 sibling, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-05  7:50 UTC (permalink / raw)
  To: Tim Chen
  Cc: Vincent Guittot, Barry Song, Dietmar Eggemann, LKML, Ingo Molnar,
	Aubrey Li, Borislav Petkov, Daniel Bristot de Oliveira,
	Ben Segall, Catalin Marinas, Greg Kroah-Hartman, Guodong Xu,
	H. Peter Anvin, Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Valentin Schneider,
	Will Deacon, x86, yangyicong

On Fri, Oct 01, 2021 at 04:22:46PM -0700, Tim Chen wrote:
> On Fri, 2021-10-01 at 16:57 +0200, Peter Zijlstra wrote:

> > The one questino I have is, do we want default y?
> 
> I also agree that default y is preferable.

I'll change at least the x86 one to:

	default y
	depends on SMP

> > The one nit I have is the Kconfig text, I'm not really sure that's
> > clarifying what a cluster is.
> 
> Do you have a preference of a different name other than cluster?
> Or simply better documentation on what a cluster is for ARM64
> and x86 in Kconfig?

Yes, better wording as to what a cluster is. Currently the x86 and arm64
ones actually differ:

x86:
	help
	 Cluster scheduler support improves the CPU scheduler's decision
	 making when dealing with machines that have clusters of CPUs
	 sharing L2 cache. If unsure say N here.

arm64:
	help
	  Cluster scheduler support improves the CPU scheduler's decision
	  making when dealing with machines that have clusters(sharing internal
	  bus or sharing LLC cache tag). If unsure say N here.


(also, all this stuff being replicated across arch/*/Kconfig seems
unfortunate)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-10-04 22:54           ` Tim Chen
@ 2021-10-05  7:54             ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-05  7:54 UTC (permalink / raw)
  To: Tim Chen
  Cc: Barry Song, Vincent Guittot, Dietmar Eggemann, LKML, Ingo Molnar,
	Aubrey Li, Borislav Petkov, Daniel Bristot de Oliveira,
	Ben Segall, Catalin Marinas, Greg Kroah-Hartman, Guodong Xu,
	H. Peter Anvin, Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Valentin Schneider,
	Will Deacon, x86, yangyicong

On Mon, Oct 04, 2021 at 03:54:50PM -0700, Tim Chen wrote:
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 2548d824f103..cc164777e661 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -109,12 +109,12 @@ extern const struct cpumask *cpu_clustergroup_mask(int cpu);
>  #define topology_physical_package_id(cpu)	(cpu_data(cpu).phys_proc_id)
>  #define topology_logical_die_id(cpu)		(cpu_data(cpu).logical_die_id)
>  #define topology_die_id(cpu)			(cpu_data(cpu).cpu_die_id)
> -#define topology_cluster_id(cpu)		(per_cpu(cpu_l2c_id, cpu))
>  #define topology_core_id(cpu)			(cpu_data(cpu).cpu_core_id)
>  
>  extern unsigned int __max_die_per_package;
>  
>  #ifdef CONFIG_SMP
> +#define topology_cluster_id(cpu)		(per_cpu(cpu_l2c_id, cpu))
>  #define topology_die_cpumask(cpu)		(per_cpu(cpu_die_map, cpu))
>  #define topology_cluster_cpumask(cpu)		(cpu_clustergroup_mask(cpu))
>  #define topology_core_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))
> 

Fixed.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-10-02  7:09         ` Barry Song
  2021-10-04 22:54           ` Tim Chen
@ 2021-10-05  8:04           ` Peter Zijlstra
  2021-10-05  9:06             ` Barry Song
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-05  8:04 UTC (permalink / raw)
  To: Barry Song
  Cc: Tim Chen, Vincent Guittot, Dietmar Eggemann, LKML, Ingo Molnar,
	Aubrey Li, Borislav Petkov, Daniel Bristot de Oliveira,
	Ben Segall, Catalin Marinas, Greg Kroah-Hartman, Guodong Xu,
	H. Peter Anvin, Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Valentin Schneider,
	Will Deacon, x86, yangyicong

On Sat, Oct 02, 2021 at 08:09:58PM +1300, Barry Song wrote:

> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7e4651a1aaf4..86821e83b935 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -993,8 +993,13 @@ config SCHED_CLUSTER
>         bool "Cluster scheduler support"
>         help
>           Cluster scheduler support improves the CPU scheduler's decision
> +         making when dealing with machines that have clusters of CPUs.
> +         Cluster usually means a couple of CPUs which are placed closely
> +         by sharing mid-level caches, last-level cache tags or internal
> +         busses. For example, on Hisilicon Kunpeng920, each 4 CPUs share
> +         LLC cache tags. This feature isn't a universal win because it
> +         can bring a cost of slightly increased overhead in some places.
> +         If unsure say N here.
> 
>  config SCHED_SMT
>         bool "SMT scheduler support"
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index bd27b1cdac34..940eb1fe0abb 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1002,12 +1002,17 @@ config NR_CPUS
>           to the kernel image.
> 
>  config SCHED_CLUSTER
> +       def_bool y
> +       prompt "Cluster scheduler support"
>         help
>          Cluster scheduler support improves the CPU scheduler's decision
> +        making when dealing with machines that have clusters of CPUs.
> +        Cluster usually means a couple of CPUs which are placed closely
> +        by sharing mid-level caches, last-level cache tags or internal
> +        busses. For example, on x86 Jacobsville, each 4 CPUs share one
> +        L2 cache. 

			This feature isn't a universal win because it can bring
> +        a cost of slightly increased overhead in some places. If unsure
> +        say N here.

That is a really odd addition to a default-y feature.

How about I make both:

	help
	  Cluster scheduler support improves the CPU scheduler's decision
	  making when dealing with machines that have clusters of CPUs.
	  Cluster usually means a couple of CPUs which are placed closely
	  by sharing mid-level caches, last-level cache tags or internal
	  busses.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 2/3] scheduler: Add cluster scheduler level in core and related Kconfig for ARM64
  2021-10-05  7:35   ` Peter Zijlstra
@ 2021-10-05  9:01     ` Barry Song
  2021-10-05 10:40       ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Barry Song @ 2021-10-05  9:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, LKML, Ingo Molnar, Vincent Guittot, Aubrey Li,
	Borislav Petkov, Daniel Bristot de Oliveira, Ben Segall,
	Catalin Marinas, Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin,
	Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Tim Chen, Valentin Schneider,
	Will Deacon, x86, yangyicong, Yicong Yang

On Tue, Oct 5, 2021 at 8:35 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Sep 24, 2021 at 08:51:03PM +1200, Barry Song wrote:
> > From: Barry Song <song.bao.hua@hisilicon.com>
> >
>
> >
> > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
>
> That is not a valid SoB pattern. The first SoB should be the author, the
> last SoB should be the one sending the patch, in this case both are
> Barry. What's Yicong's contribution in all this?

Yicong made the benchmarks of SPECrate and collected the result data which
is in the commit log. I am not quite sure what is the best tag for
him. Is Tested-by
appropriate in this case?

Thanks
barry

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-10-05  8:04           ` Peter Zijlstra
@ 2021-10-05  9:06             ` Barry Song
  0 siblings, 0 replies; 46+ messages in thread
From: Barry Song @ 2021-10-05  9:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim Chen, Vincent Guittot, Dietmar Eggemann, LKML, Ingo Molnar,
	Aubrey Li, Borislav Petkov, Daniel Bristot de Oliveira,
	Ben Segall, Catalin Marinas, Greg Kroah-Hartman, Guodong Xu,
	H. Peter Anvin, Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Valentin Schneider,
	Will Deacon, x86, yangyicong

On Tue, Oct 5, 2021 at 9:05 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sat, Oct 02, 2021 at 08:09:58PM +1300, Barry Song wrote:
>
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 7e4651a1aaf4..86821e83b935 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -993,8 +993,13 @@ config SCHED_CLUSTER
> >         bool "Cluster scheduler support"
> >         help
> >           Cluster scheduler support improves the CPU scheduler's decision
> > +         making when dealing with machines that have clusters of CPUs.
> > +         Cluster usually means a couple of CPUs which are placed closely
> > +         by sharing mid-level caches, last-level cache tags or internal
> > +         busses. For example, on Hisilicon Kunpeng920, each 4 CPUs share
> > +         LLC cache tags. This feature isn't a universal win because it
> > +         can bring a cost of slightly increased overhead in some places.
> > +         If unsure say N here.
> >
> >  config SCHED_SMT
> >         bool "SMT scheduler support"
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index bd27b1cdac34..940eb1fe0abb 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1002,12 +1002,17 @@ config NR_CPUS
> >           to the kernel image.
> >
> >  config SCHED_CLUSTER
> > +       def_bool y
> > +       prompt "Cluster scheduler support"
> >         help
> >          Cluster scheduler support improves the CPU scheduler's decision
> > +        making when dealing with machines that have clusters of CPUs.
> > +        Cluster usually means a couple of CPUs which are placed closely
> > +        by sharing mid-level caches, last-level cache tags or internal
> > +        busses. For example, on x86 Jacobsville, each 4 CPUs share one
> > +        L2 cache.
>
>                         This feature isn't a universal win because it can bring
> > +        a cost of slightly increased overhead in some places. If unsure
> > +        say N here.
>
> That is a really odd addition to a default-y feature.
>
> How about I make both:
>
>         help
>           Cluster scheduler support improves the CPU scheduler's decision
>           making when dealing with machines that have clusters of CPUs.
>           Cluster usually means a couple of CPUs which are placed closely
>           by sharing mid-level caches, last-level cache tags or internal
>           busses.

looks good to me. thanks!

barry

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-10-05  7:50         ` Peter Zijlstra
@ 2021-10-05  9:15           ` Barry Song
  2021-10-05 10:58             ` Peter Zijlstra
  2021-10-05 13:42           ` Valentin Schneider
  1 sibling, 1 reply; 46+ messages in thread
From: Barry Song @ 2021-10-05  9:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim Chen, Vincent Guittot, Dietmar Eggemann, LKML, Ingo Molnar,
	Aubrey Li, Borislav Petkov, Daniel Bristot de Oliveira,
	Ben Segall, Catalin Marinas, Greg Kroah-Hartman, Guodong Xu,
	H. Peter Anvin, Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Valentin Schneider,
	Will Deacon, x86, yangyicong

On Tue, Oct 5, 2021 at 8:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Oct 01, 2021 at 04:22:46PM -0700, Tim Chen wrote:
> > On Fri, 2021-10-01 at 16:57 +0200, Peter Zijlstra wrote:
>
> > > The one questino I have is, do we want default y?
> >
> > I also agree that default y is preferable.
>
> I'll change at least the x86 one to:
>
>         default y
>         depends on SMP
>
> > > The one nit I have is the Kconfig text, I'm not really sure that's
> > > clarifying what a cluster is.
> >
> > Do you have a preference of a different name other than cluster?
> > Or simply better documentation on what a cluster is for ARM64
> > and x86 in Kconfig?
>
> Yes, better wording as to what a cluster is. Currently the x86 and arm64
> ones actually differ:
>
> x86:
>         help
>          Cluster scheduler support improves the CPU scheduler's decision
>          making when dealing with machines that have clusters of CPUs
>          sharing L2 cache. If unsure say N here.
>
> arm64:
>         help
>           Cluster scheduler support improves the CPU scheduler's decision
>           making when dealing with machines that have clusters(sharing internal
>           bus or sharing LLC cache tag). If unsure say N here.
>
>
> (also, all this stuff being replicated across arch/*/Kconfig seems
> unfortunate)

perhaps worth a separate patchset to do some cleanup so that SCHED_MC,
SCHED_SMT etc
won't be replicated in different architectures. Right now, this kind
of Kconfig option is copied
everywhere. I am seeing SCHED_SMT in all of
arch/arm/Kconfig
arch/arm64/Kconfig
arch/ia64/Kconfig
arch/mips/Kconfig
arch/powerpc/Kconfig
arch/s390/Kconfig
arch/sparc/Kconfig
arch/x86/Kconfig
...

Is it a better way to move them to a common Kconfig and let the architectures to
declare things like ARCH_HAVE_SMT?

Thanks
Barry

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 2/3] scheduler: Add cluster scheduler level in core and related Kconfig for ARM64
  2021-10-05  9:01     ` Barry Song
@ 2021-10-05 10:40       ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-05 10:40 UTC (permalink / raw)
  To: Barry Song
  Cc: Dietmar Eggemann, LKML, Ingo Molnar, Vincent Guittot, Aubrey Li,
	Borislav Petkov, Daniel Bristot de Oliveira, Ben Segall,
	Catalin Marinas, Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin,
	Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Tim Chen, Valentin Schneider,
	Will Deacon, x86, yangyicong, Yicong Yang

On Tue, Oct 05, 2021 at 10:01:09PM +1300, Barry Song wrote:
> On Tue, Oct 5, 2021 at 8:35 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Sep 24, 2021 at 08:51:03PM +1200, Barry Song wrote:
> > > From: Barry Song <song.bao.hua@hisilicon.com>
> > >
> >
> > >
> > > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> >
> > That is not a valid SoB pattern. The first SoB should be the author, the
> > last SoB should be the one sending the patch, in this case both are
> > Barry. What's Yicong's contribution in all this?
> 
> Yicong made the benchmarks of SPECrate and collected the result data which
> is in the commit log. I am not quite sure what is the best tag for
> him. Is Tested-by
> appropriate in this case?

Yeah, that works for me, changed it.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-10-05  9:15           ` Barry Song
@ 2021-10-05 10:58             ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-05 10:58 UTC (permalink / raw)
  To: Barry Song
  Cc: Tim Chen, Vincent Guittot, Dietmar Eggemann, LKML, Ingo Molnar,
	Aubrey Li, Borislav Petkov, Daniel Bristot de Oliveira,
	Ben Segall, Catalin Marinas, Greg Kroah-Hartman, Guodong Xu,
	H. Peter Anvin, Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Valentin Schneider,
	Will Deacon, x86, yangyicong

On Tue, Oct 05, 2021 at 10:15:39PM +1300, Barry Song wrote:

> > (also, all this stuff being replicated across arch/*/Kconfig seems
> > unfortunate)
> 
> perhaps worth a separate patchset to do some cleanup so that SCHED_MC,
> SCHED_SMT etc
> won't be replicated in different architectures. Right now, this kind
> of Kconfig option is copied
> everywhere. I am seeing SCHED_SMT in all of
> arch/arm/Kconfig
> arch/arm64/Kconfig
> arch/ia64/Kconfig
> arch/mips/Kconfig
> arch/powerpc/Kconfig
> arch/s390/Kconfig
> arch/sparc/Kconfig
> arch/x86/Kconfig
> ...
> 
> Is it a better way to move them to a common Kconfig and let the architectures to
> declare things like ARCH_HAVE_SMT?

Dunno, it's all a bit of a mess :/ I can't quickly see a sane
pattern there.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-10-05  7:50         ` Peter Zijlstra
  2021-10-05  9:15           ` Barry Song
@ 2021-10-05 13:42           ` Valentin Schneider
  2021-10-05 13:50             ` Valentin Schneider
  1 sibling, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-10-05 13:42 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Vincent Guittot, Barry Song, Dietmar Eggemann, LKML, Ingo Molnar,
	Aubrey Li, Borislav Petkov, Daniel Bristot de Oliveira,
	Ben Segall, Catalin Marinas, Greg Kroah-Hartman, Guodong Xu,
	H. Peter Anvin, Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Will Deacon, x86, yangyicong

On 05/10/21 09:50, Peter Zijlstra wrote:
> On Fri, Oct 01, 2021 at 04:22:46PM -0700, Tim Chen wrote:
>> On Fri, 2021-10-01 at 16:57 +0200, Peter Zijlstra wrote:
>
>> > The one questino I have is, do we want default y?
>>
>> I also agree that default y is preferable.
>
> I'll change at least the x86 one to:
>
>       default y
>       depends on SMP
>

Huh, so the arm64 SCHED_{SMT,MC} configs are defaultless (I added SCHED_SMT
to arm64's defconfig not so long ago), but x86 has them default y, which
I'm thinking is a tad better, and would be nice to harmonize. Unfortunately
different architectures have their own dependency requirements - arm has
ARM_CPU_TOPOLOGY, parisc has PARISC_CPU_TOPOLOGY...

Would you hate making SCHED_* a "generic" config, with a common default and
help text, and punt the arch specific stuff to an ARCH_SUPPORTS_* knob?

Something like:

arch/arm/Kconfig:
  select ARCH_SUPPORTS_SCHED_MC if ARM_CPU_TOPOLOGY

init/Kconfig:
  config SCHED_MC
    def_bool y
    depends on ARCH_SUPPORTS_SCHED_MC && SMP

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters
  2021-10-05 13:42           ` Valentin Schneider
@ 2021-10-05 13:50             ` Valentin Schneider
  0 siblings, 0 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-10-05 13:50 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Vincent Guittot, Barry Song, Dietmar Eggemann, LKML, Ingo Molnar,
	Aubrey Li, Borislav Petkov, Daniel Bristot de Oliveira,
	Ben Segall, Catalin Marinas, Greg Kroah-Hartman, Guodong Xu,
	H. Peter Anvin, Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Will Deacon, x86, yangyicong

On 05/10/21 14:42, Valentin Schneider wrote:
> On 05/10/21 09:50, Peter Zijlstra wrote:
>> On Fri, Oct 01, 2021 at 04:22:46PM -0700, Tim Chen wrote:
>>> On Fri, 2021-10-01 at 16:57 +0200, Peter Zijlstra wrote:
>>
>>> > The one questino I have is, do we want default y?
>>>
>>> I also agree that default y is preferable.
>>
>> I'll change at least the x86 one to:
>>
>>       default y
>>       depends on SMP
>>
>
> Huh, so the arm64 SCHED_{SMT,MC} configs are defaultless (I added SCHED_SMT
> to arm64's defconfig not so long ago), but x86 has them default y, which
> I'm thinking is a tad better, and would be nice to harmonize. Unfortunately
> different architectures have their own dependency requirements - arm has
> ARM_CPU_TOPOLOGY, parisc has PARISC_CPU_TOPOLOGY...
>
> Would you hate making SCHED_* a "generic" config, with a common default and
> help text, and punt the arch specific stuff to an ARCH_SUPPORTS_* knob?
>
> Something like:
>
> arch/arm/Kconfig:
>   select ARCH_SUPPORTS_SCHED_MC if ARM_CPU_TOPOLOGY
>
> init/Kconfig:
>   config SCHED_MC
>     def_bool y
>     depends on ARCH_SUPPORTS_SCHED_MC && SMP

... Which is essentially what Barry suggested somewhere else in the thread
(I TL; DR'd).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die
  2021-09-24  8:51 ` [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die Barry Song
@ 2021-10-05 16:33   ` Valentin Schneider
  2021-10-05 20:43     ` Barry Song
  2021-10-15  9:44   ` [tip: sched/core] " tip-bot2 for Jonathan Cameron
  1 sibling, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-10-05 16:33 UTC (permalink / raw)
  To: Barry Song, dietmar.eggemann, linux-kernel, mingo, peterz,
	vincent.guittot
  Cc: aubrey.li, bp, bristot, bsegall, catalin.marinas, gregkh,
	guodong.xu, hpa, jonathan.cameron, juri.lelli, lenb, linux-acpi,
	linux-arm-kernel, linuxarm, mark.rutland, mgorman, msys.mizuma,
	prime.zeng, rjw, rostedt, song.bao.hua, sudeep.holla, tglx,
	rafael, tim.c.chen, will, x86, yangyicong, Jonathan Cameron,
	Tian Tao

On 24/09/21 20:51, Barry Song wrote:
>  void update_siblings_masks(unsigned int cpuid)
>  {
>       struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> @@ -617,6 +622,11 @@ void update_siblings_masks(unsigned int cpuid)
>               if (cpuid_topo->package_id != cpu_topo->package_id)
>                       continue;
>
> +		if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
> +			cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
> +			cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
> +		}
> +

Hm so without cluster information (e.g. DT system), we have
->cluster_id=-1, we'll essentially copy the package mask into the cluster
mask.

The exposed cluster mask is still <= package mask which is sensible. Are we
fine with that, or do we need/want the mask to be empty in the -1 case? I'm
guessing userspace tools should check for either id!=-1 or if the exclusive
disjucntion of cluster vs package masks is non-empty.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die
  2021-10-05 16:33   ` Valentin Schneider
@ 2021-10-05 20:43     ` Barry Song
  2021-10-06 10:50       ` Barry Song
  2021-10-06 13:49       ` Valentin Schneider
  0 siblings, 2 replies; 46+ messages in thread
From: Barry Song @ 2021-10-05 20:43 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Dietmar Eggemann, LKML, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Aubrey Li, Borislav Petkov,
	Daniel Bristot de Oliveira, Ben Segall, Catalin Marinas,
	Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin, Jonathan Cameron,
	Juri Lelli, Cc: Len Brown, ACPI Devel Maling List, LAK, Linuxarm,
	Mark Rutland, Mel Gorman, msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Tim Chen, Will Deacon, x86,
	yangyicong, Tian Tao

On Wed, Oct 6, 2021 at 5:34 AM Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 24/09/21 20:51, Barry Song wrote:
> >  void update_siblings_masks(unsigned int cpuid)
> >  {
> >       struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> > @@ -617,6 +622,11 @@ void update_siblings_masks(unsigned int cpuid)
> >               if (cpuid_topo->package_id != cpu_topo->package_id)
> >                       continue;
> >
> > +             if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
> > +                     cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
> > +                     cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
> > +             }
> > +
>
> Hm so without cluster information (e.g. DT system), we have
> ->cluster_id=-1, we'll essentially copy the package mask into the cluster
> mask.
>
> The exposed cluster mask is still <= package mask which is sensible. Are we
> fine with that, or do we need/want the mask to be empty in the -1 case? I'm
> guessing userspace tools should check for either id!=-1 or if the exclusive
> disjucntion of cluster vs package masks is non-empty.

Hi Valentin,
Yep, this is a very good question. I'd like change the code to:
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 7cb31d959f33..fc0836f460fb 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -622,7 +622,8 @@ void update_siblings_masks(unsigned int cpuid)
                if (cpuid_topo->package_id != cpu_topo->package_id)
                        continue;

-               if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
+               if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
+                   cpuid_topo->cluster_id != -1) {
                        cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
                        cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
                }

This should be consistent with Tim's patch3/3 for x86 in case
id is BAD_APICID:
static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
        ...
        /* Do not match if we do not have a valid APICID for cpu: */
        if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
                return false;
        ...
}

Thanks
Barry

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die
  2021-10-05 20:43     ` Barry Song
@ 2021-10-06 10:50       ` Barry Song
  2021-10-06 12:18         ` Peter Zijlstra
  2021-10-06 13:49       ` Valentin Schneider
  1 sibling, 1 reply; 46+ messages in thread
From: Barry Song @ 2021-10-06 10:50 UTC (permalink / raw)
  To: Valentin Schneider, Peter Zijlstra
  Cc: Dietmar Eggemann, LKML, Ingo Molnar, Vincent Guittot, Aubrey Li,
	Borislav Petkov, Daniel Bristot de Oliveira, Ben Segall,
	Catalin Marinas, Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin,
	Jonathan Cameron, Juri Lelli, Cc: Len Brown,
	ACPI Devel Maling List, LAK, Linuxarm, Mark Rutland, Mel Gorman,
	msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Tim Chen, Will Deacon, x86,
	yangyicong, Tian Tao

On Wed, Oct 6, 2021 at 9:43 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 6, 2021 at 5:34 AM Valentin Schneider
> <valentin.schneider@arm.com> wrote:
> >
> > On 24/09/21 20:51, Barry Song wrote:
> > >  void update_siblings_masks(unsigned int cpuid)
> > >  {
> > >       struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> > > @@ -617,6 +622,11 @@ void update_siblings_masks(unsigned int cpuid)
> > >               if (cpuid_topo->package_id != cpu_topo->package_id)
> > >                       continue;
> > >
> > > +             if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
> > > +                     cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
> > > +                     cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
> > > +             }
> > > +
> >
> > Hm so without cluster information (e.g. DT system), we have
> > ->cluster_id=-1, we'll essentially copy the package mask into the cluster
> > mask.
> >
> > The exposed cluster mask is still <= package mask which is sensible. Are we
> > fine with that, or do we need/want the mask to be empty in the -1 case? I'm
> > guessing userspace tools should check for either id!=-1 or if the exclusive
> > disjucntion of cluster vs package masks is non-empty.
>
> Hi Valentin,
> Yep, this is a very good question. I'd like change the code to:
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 7cb31d959f33..fc0836f460fb 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -622,7 +622,8 @@ void update_siblings_masks(unsigned int cpuid)
>                 if (cpuid_topo->package_id != cpu_topo->package_id)
>                         continue;
>
> -               if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
> +               if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
> +                   cpuid_topo->cluster_id != -1) {
>                         cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
>                         cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
>                 }
>

Hi Peter,
Would you like to change this line in your tree? Or do you want me to send
a new patchset with this small change?

> This should be consistent with Tim's patch3/3 for x86 in case
> id is BAD_APICID:
> static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
> {
>         ...
>         /* Do not match if we do not have a valid APICID for cpu: */
>         if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
>                 return false;
>         ...
> }

Thanks
Barry

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die
  2021-10-06 10:50       ` Barry Song
@ 2021-10-06 12:18         ` Peter Zijlstra
  2021-10-06 12:50           ` Barry Song
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-06 12:18 UTC (permalink / raw)
  To: Barry Song
  Cc: Valentin Schneider, Dietmar Eggemann, LKML, Ingo Molnar,
	Vincent Guittot, Aubrey Li, Borislav Petkov,
	Daniel Bristot de Oliveira, Ben Segall, Catalin Marinas,
	Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin, Jonathan Cameron,
	Juri Lelli, Cc: Len Brown, ACPI Devel Maling List, LAK, Linuxarm,
	Mark Rutland, Mel Gorman, msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Tim Chen, Will Deacon, x86,
	yangyicong, Tian Tao

On Wed, Oct 06, 2021 at 11:50:35PM +1300, Barry Song wrote:

> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index 7cb31d959f33..fc0836f460fb 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -622,7 +622,8 @@ void update_siblings_masks(unsigned int cpuid)
> >                 if (cpuid_topo->package_id != cpu_topo->package_id)
> >                         continue;
> >
> > -               if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
> > +               if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
> > +                   cpuid_topo->cluster_id != -1) {
> >                         cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
> >                         cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
> >                 }
> >
> 
> Hi Peter,
> Would you like to change this line in your tree?

Can you please double check:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/next



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die
  2021-10-06 12:18         ` Peter Zijlstra
@ 2021-10-06 12:50           ` Barry Song
  2021-10-06 13:55             ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Barry Song @ 2021-10-06 12:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Dietmar Eggemann, LKML, Ingo Molnar,
	Vincent Guittot, Aubrey Li, Borislav Petkov,
	Daniel Bristot de Oliveira, Ben Segall, Catalin Marinas,
	Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin, Jonathan Cameron,
	Juri Lelli, Cc: Len Brown, ACPI Devel Maling List, LAK, Linuxarm,
	Mark Rutland, Mel Gorman, msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Tim Chen, Will Deacon, x86,
	yangyicong, Tian Tao

On Thu, Oct 7, 2021 at 1:20 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 06, 2021 at 11:50:35PM +1300, Barry Song wrote:
>
> > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > > index 7cb31d959f33..fc0836f460fb 100644
> > > --- a/drivers/base/arch_topology.c
> > > +++ b/drivers/base/arch_topology.c
> > > @@ -622,7 +622,8 @@ void update_siblings_masks(unsigned int cpuid)
> > >                 if (cpuid_topo->package_id != cpu_topo->package_id)
> > >                         continue;
> > >
> > > -               if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
> > > +               if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
> > > +                   cpuid_topo->cluster_id != -1) {
> > >                         cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
> > >                         cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
> > >                 }
> > >
> >
> > Hi Peter,
> > Would you like to change this line in your tree?
>
> Can you please double check:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/next

yes. It is correct for patch 1/3, thanks!

BTW, patch2/3  is missing some benchmark data and tested-by/SOB tags, i guess
it is because you are still editing?

barry

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die
  2021-10-05 20:43     ` Barry Song
  2021-10-06 10:50       ` Barry Song
@ 2021-10-06 13:49       ` Valentin Schneider
  1 sibling, 0 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-10-06 13:49 UTC (permalink / raw)
  To: Barry Song
  Cc: Dietmar Eggemann, LKML, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Aubrey Li, Borislav Petkov,
	Daniel Bristot de Oliveira, Ben Segall, Catalin Marinas,
	Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin, Jonathan Cameron,
	Juri Lelli, Cc: Len Brown, ACPI Devel Maling List, LAK, Linuxarm,
	Mark Rutland, Mel Gorman, msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Tim Chen, Will Deacon, x86,
	yangyicong, Tian Tao

On 06/10/21 09:43, Barry Song wrote:
>
> Hi Valentin,
> Yep, this is a very good question. I'd like change the code to:
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 7cb31d959f33..fc0836f460fb 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -622,7 +622,8 @@ void update_siblings_masks(unsigned int cpuid)
>                 if (cpuid_topo->package_id != cpu_topo->package_id)
>                         continue;
>
> -               if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
> +               if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
> +                   cpuid_topo->cluster_id != -1) {
>                         cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
>                         cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
>                 }
>
> This should be consistent with Tim's patch3/3 for x86 in case
> id is BAD_APICID:
> static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
> {
>         ...
>         /* Do not match if we do not have a valid APICID for cpu: */
>         if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
>                 return false;
>         ...
> }
>

LGTM.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die
  2021-10-06 12:50           ` Barry Song
@ 2021-10-06 13:55             ` Peter Zijlstra
  2021-10-07 10:30               ` Barry Song
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-06 13:55 UTC (permalink / raw)
  To: Barry Song
  Cc: Valentin Schneider, Dietmar Eggemann, LKML, Ingo Molnar,
	Vincent Guittot, Aubrey Li, Borislav Petkov,
	Daniel Bristot de Oliveira, Ben Segall, Catalin Marinas,
	Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin, Jonathan Cameron,
	Juri Lelli, Cc: Len Brown, ACPI Devel Maling List, LAK, Linuxarm,
	Mark Rutland, Mel Gorman, msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Tim Chen, Will Deacon, x86,
	yangyicong, Tian Tao

On Thu, Oct 07, 2021 at 01:50:43AM +1300, Barry Song wrote:
> On Thu, Oct 7, 2021 at 1:20 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Oct 06, 2021 at 11:50:35PM +1300, Barry Song wrote:
> >
> > > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > > > index 7cb31d959f33..fc0836f460fb 100644
> > > > --- a/drivers/base/arch_topology.c
> > > > +++ b/drivers/base/arch_topology.c
> > > > @@ -622,7 +622,8 @@ void update_siblings_masks(unsigned int cpuid)
> > > >                 if (cpuid_topo->package_id != cpu_topo->package_id)
> > > >                         continue;
> > > >
> > > > -               if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
> > > > +               if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
> > > > +                   cpuid_topo->cluster_id != -1) {
> > > >                         cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
> > > >                         cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
> > > >                 }
> > > >
> > >
> > > Hi Peter,
> > > Would you like to change this line in your tree?
> >
> > Can you please double check:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/next
> 
> yes. It is correct for patch 1/3, thanks!
> 
> BTW, patch2/3  is missing some benchmark data and tested-by/SOB tags, i guess
> it is because you are still editing?

Urgh, no, that's my script thinking one of the many

--------------

lines you got in there was a terminator. Fixed it, should be pushed out
again in a few minutes.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die
  2021-10-06 13:55             ` Peter Zijlstra
@ 2021-10-07 10:30               ` Barry Song
  2021-10-07 10:35                 ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Barry Song @ 2021-10-07 10:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Dietmar Eggemann, LKML, Ingo Molnar,
	Vincent Guittot, Aubrey Li, Borislav Petkov,
	Daniel Bristot de Oliveira, Ben Segall, Catalin Marinas,
	Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin, Jonathan Cameron,
	Juri Lelli, Cc: Len Brown, ACPI Devel Maling List, LAK, Linuxarm,
	Mark Rutland, Mel Gorman, msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Tim Chen, Will Deacon, x86,
	yangyicong, Tian Tao

On Thu, Oct 7, 2021 at 2:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 07, 2021 at 01:50:43AM +1300, Barry Song wrote:
> > On Thu, Oct 7, 2021 at 1:20 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Oct 06, 2021 at 11:50:35PM +1300, Barry Song wrote:
> > >
> > > > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > > > > index 7cb31d959f33..fc0836f460fb 100644
> > > > > --- a/drivers/base/arch_topology.c
> > > > > +++ b/drivers/base/arch_topology.c
> > > > > @@ -622,7 +622,8 @@ void update_siblings_masks(unsigned int cpuid)
> > > > >                 if (cpuid_topo->package_id != cpu_topo->package_id)
> > > > >                         continue;
> > > > >
> > > > > -               if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
> > > > > +               if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
> > > > > +                   cpuid_topo->cluster_id != -1) {
> > > > >                         cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
> > > > >                         cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
> > > > >                 }
> > > > >
> > > >
> > > > Hi Peter,
> > > > Would you like to change this line in your tree?
> > >
> > > Can you please double check:
> > >
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/next
> >
> > yes. It is correct for patch 1/3, thanks!

oops, there is a typo there:
+ if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
+ cpuid_topo->clister_id != -1) {

clister should be cluster.

> >
> > BTW, patch2/3  is missing some benchmark data and tested-by/SOB tags, i guess
> > it is because you are still editing?
>
> Urgh, no, that's my script thinking one of the many
>
> --------------
>
> lines you got in there was a terminator. Fixed it, should be pushed out
> again in a few minutes.

Thanks
barry

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die
  2021-10-07 10:30               ` Barry Song
@ 2021-10-07 10:35                 ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-07 10:35 UTC (permalink / raw)
  To: Barry Song
  Cc: Valentin Schneider, Dietmar Eggemann, LKML, Ingo Molnar,
	Vincent Guittot, Aubrey Li, Borislav Petkov,
	Daniel Bristot de Oliveira, Ben Segall, Catalin Marinas,
	Greg Kroah-Hartman, Guodong Xu, H. Peter Anvin, Jonathan Cameron,
	Juri Lelli, Cc: Len Brown, ACPI Devel Maling List, LAK, Linuxarm,
	Mark Rutland, Mel Gorman, msys.mizuma, Zengtao (B),
	Rafael J. Wysocki, Steven Rostedt, Barry Song, Sudeep Holla,
	Thomas Gleixner, Rafael J. Wysocki, Tim Chen, Will Deacon, x86,
	yangyicong, Tian Tao

On Thu, Oct 07, 2021 at 11:30:36PM +1300, Barry Song wrote:
> On Thu, Oct 7, 2021 at 2:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Oct 07, 2021 at 01:50:43AM +1300, Barry Song wrote:
> > > On Thu, Oct 7, 2021 at 1:20 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Wed, Oct 06, 2021 at 11:50:35PM +1300, Barry Song wrote:
> > > >
> > > > > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > > > > > index 7cb31d959f33..fc0836f460fb 100644
> > > > > > --- a/drivers/base/arch_topology.c
> > > > > > +++ b/drivers/base/arch_topology.c
> > > > > > @@ -622,7 +622,8 @@ void update_siblings_masks(unsigned int cpuid)
> > > > > >                 if (cpuid_topo->package_id != cpu_topo->package_id)
> > > > > >                         continue;
> > > > > >
> > > > > > -               if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
> > > > > > +               if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
> > > > > > +                   cpuid_topo->cluster_id != -1) {
> > > > > >                         cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
> > > > > >                         cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
> > > > > >                 }
> > > > > >
> > > > >
> > > > > Hi Peter,
> > > > > Would you like to change this line in your tree?
> > > >
> > > > Can you please double check:
> > > >
> > > >   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/next
> > >
> > > yes. It is correct for patch 1/3, thanks!
> 
> oops, there is a typo there:
> + if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
> + cpuid_topo->clister_id != -1) {
> 
> clister should be cluster.

Yeah, my bad, typing so hard... :-) Already fixed.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-09-24  8:51 ` [PATCH RESEND 3/3] scheduler: Add cluster scheduler level for x86 Barry Song
@ 2021-10-15  9:44   ` tip-bot2 for Tim Chen
  2021-10-20 13:12     ` Tom Lendacky
  0 siblings, 1 reply; 46+ messages in thread
From: tip-bot2 for Tim Chen @ 2021-10-15  9:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Tim Chen, Barry Song, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     66558b730f2533cc2bf2b74d51f5f80b81e2bad0
Gitweb:        https://git.kernel.org/tip/66558b730f2533cc2bf2b74d51f5f80b81e2bad0
Author:        Tim Chen <tim.c.chen@linux.intel.com>
AuthorDate:    Fri, 24 Sep 2021 20:51:04 +12:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 15 Oct 2021 11:25:16 +02:00

sched: Add cluster scheduler level for x86

There are x86 CPU architectures (e.g. Jacobsville) where L2 cahce is
shared among a cluster of cores instead of being exclusive to one
single core.

To prevent oversubscription of L2 cache, load should be balanced
between such L2 clusters, especially for tasks with no shared data.
On benchmark such as SPECrate mcf test, this change provides a boost
to performance especially on medium load system on Jacobsville.  on a
Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4
cores each, the benchmark number is as follow:

 Improvement over baseline kernel for mcf_r
 copies		run time	base rate
 1		-0.1%		-0.2%
 6		25.1%		25.1%
 12		18.8%		19.0%
 24		0.3%		0.3%

So this looks pretty good. In terms of the system's task distribution,
some pretty bad clumping can be seen for the vanilla kernel without
the L2 cluster domain for the 6 and 12 copies case. With the extra
domain for cluster, the load does get evened out between the clusters.

Note this patch isn't an universal win as spreading isn't necessarily
a win, particually for those workload who can benefit from packing.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210924085104.44806-4-21cnbao@gmail.com
---
 arch/x86/Kconfig                | 11 ++++++++-
 arch/x86/include/asm/smp.h      |  7 +++++-
 arch/x86/include/asm/topology.h |  3 ++-
 arch/x86/kernel/cpu/cacheinfo.c |  1 +-
 arch/x86/kernel/cpu/common.c    |  3 ++-
 arch/x86/kernel/smpboot.c       | 44 +++++++++++++++++++++++++++++++-
 6 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ab83c22..349e59b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1001,6 +1001,17 @@ config NR_CPUS
 	  This is purely to save memory: each supported CPU adds about 8KB
 	  to the kernel image.
 
+config SCHED_CLUSTER
+	bool "Cluster scheduler support"
+	depends on SMP
+	default y
+	help
+	  Cluster scheduler support improves the CPU scheduler's decision
+	  making when dealing with machines that have clusters of CPUs.
+	  Cluster usually means a couple of CPUs which are placed closely
+	  by sharing mid-level caches, last-level cache tags or internal
+	  busses.
+
 config SCHED_SMT
 	def_bool y if SMP
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 630ff08..08b0e90 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -16,7 +16,9 @@ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
+DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
+DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id);
 DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
 
 static inline struct cpumask *cpu_llc_shared_mask(int cpu)
@@ -24,6 +26,11 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
 	return per_cpu(cpu_llc_shared_map, cpu);
 }
 
+static inline struct cpumask *cpu_l2c_shared_mask(int cpu)
+{
+	return per_cpu(cpu_l2c_shared_map, cpu);
+}
+
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u32, x86_cpu_to_acpiid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9239399..cc16477 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -103,6 +103,7 @@ static inline void setup_node_to_cpumask_map(void) { }
 #include <asm-generic/topology.h>
 
 extern const struct cpumask *cpu_coregroup_mask(int cpu);
+extern const struct cpumask *cpu_clustergroup_mask(int cpu);
 
 #define topology_logical_package_id(cpu)	(cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)	(cpu_data(cpu).phys_proc_id)
@@ -113,7 +114,9 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 extern unsigned int __max_die_per_package;
 
 #ifdef CONFIG_SMP
+#define topology_cluster_id(cpu)		(per_cpu(cpu_l2c_id, cpu))
 #define topology_die_cpumask(cpu)		(per_cpu(cpu_die_map, cpu))
+#define topology_cluster_cpumask(cpu)		(cpu_clustergroup_mask(cpu))
 #define topology_core_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)		(per_cpu(cpu_sibling_map, cpu))
 
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index b5e36bd..fe98a14 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -846,6 +846,7 @@ void init_intel_cacheinfo(struct cpuinfo_x86 *c)
 		l2 = new_l2;
 #ifdef CONFIG_SMP
 		per_cpu(cpu_llc_id, cpu) = l2_id;
+		per_cpu(cpu_l2c_id, cpu) = l2_id;
 #endif
 	}
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 0f88859..1c7897c 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -85,6 +85,9 @@ u16 get_llc_id(unsigned int cpu)
 }
 EXPORT_SYMBOL_GPL(get_llc_id);
 
+/* L2 cache ID of each logical CPU */
+DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id) = BAD_APICID;
+
 /* correctly size the local cpu masks */
 void __init setup_cpu_local_masks(void)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 85f6e24..5094ab0 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,8 @@ EXPORT_PER_CPU_SYMBOL(cpu_die_map);
 
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
+
 /* Per CPU bogomips and other parameters */
 DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
@@ -464,6 +466,21 @@ static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 	return false;
 }
 
+static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+	/* Do not match if we do not have a valid APICID for cpu: */
+	if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
+		return false;
+
+	/* Do not match if L2 cache id does not match: */
+	if (per_cpu(cpu_l2c_id, cpu1) != per_cpu(cpu_l2c_id, cpu2))
+		return false;
+
+	return topology_sane(c, o, "l2c");
+}
+
 /*
  * Unlike the other levels, we do not enforce keeping a
  * multicore group inside a NUMA node.  If this happens, we will
@@ -523,7 +540,7 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 }
 
 
-#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_MC)
+#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_CLUSTER) || defined(CONFIG_SCHED_MC)
 static inline int x86_sched_itmt_flags(void)
 {
 	return sysctl_sched_itmt_enabled ? SD_ASYM_PACKING : 0;
@@ -541,12 +558,21 @@ static int x86_smt_flags(void)
 	return cpu_smt_flags() | x86_sched_itmt_flags();
 }
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+static int x86_cluster_flags(void)
+{
+	return cpu_cluster_flags() | x86_sched_itmt_flags();
+}
+#endif
 #endif
 
 static struct sched_domain_topology_level x86_numa_in_package_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
@@ -557,6 +583,9 @@ static struct sched_domain_topology_level x86_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
@@ -584,6 +613,7 @@ void set_cpu_sibling_map(int cpu)
 	if (!has_mp) {
 		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
 		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
+		cpumask_set_cpu(cpu, cpu_l2c_shared_mask(cpu));
 		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
 		cpumask_set_cpu(cpu, topology_die_cpumask(cpu));
 		c->booted_cores = 1;
@@ -602,6 +632,9 @@ void set_cpu_sibling_map(int cpu)
 		if ((i == cpu) || (has_mp && match_llc(c, o)))
 			link_mask(cpu_llc_shared_mask, cpu, i);
 
+		if ((i == cpu) || (has_mp && match_l2c(c, o)))
+			link_mask(cpu_l2c_shared_mask, cpu, i);
+
 		if ((i == cpu) || (has_mp && match_die(c, o)))
 			link_mask(topology_die_cpumask, cpu, i);
 	}
@@ -652,6 +685,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return cpu_llc_shared_mask(cpu);
 }
 
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	return cpu_l2c_shared_mask(cpu);
+}
+
 static void impress_friends(void)
 {
 	int cpu;
@@ -1335,6 +1373,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 		zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_die_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
+		zalloc_cpumask_var(&per_cpu(cpu_l2c_shared_map, i), GFP_KERNEL);
 	}
 
 	/*
@@ -1564,7 +1603,10 @@ static void remove_siblinginfo(int cpu)
 
 	for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
 		cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
+	for_each_cpu(sibling, cpu_l2c_shared_mask(cpu))
+		cpumask_clear_cpu(cpu, cpu_l2c_shared_mask(sibling));
 	cpumask_clear(cpu_llc_shared_mask(cpu));
+	cpumask_clear(cpu_l2c_shared_mask(cpu));
 	cpumask_clear(topology_sibling_cpumask(cpu));
 	cpumask_clear(topology_core_cpumask(cpu));
 	cpumask_clear(topology_die_cpumask(cpu));

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [tip: sched/core] topology: Represent clusters of CPUs within a die
  2021-09-24  8:51 ` [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die Barry Song
  2021-10-05 16:33   ` Valentin Schneider
@ 2021-10-15  9:44   ` tip-bot2 for Jonathan Cameron
  1 sibling, 0 replies; 46+ messages in thread
From: tip-bot2 for Jonathan Cameron @ 2021-10-15  9:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Jonathan Cameron, Tian Tao, Barry Song, Peter Zijlstra (Intel),
	x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     c5e22feffdd736cb02b98b0f5b375c8ebc858dd4
Gitweb:        https://git.kernel.org/tip/c5e22feffdd736cb02b98b0f5b375c8ebc858dd4
Author:        Jonathan Cameron <Jonathan.Cameron@huawei.com>
AuthorDate:    Fri, 24 Sep 2021 20:51:02 +12:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 15 Oct 2021 11:25:15 +02:00

topology: Represent clusters of CPUs within a die

Both ACPI and DT provide the ability to describe additional layers of
topology between that of individual cores and higher level constructs
such as the level at which the last level cache is shared.
In ACPI this can be represented in PPTT as a Processor Hierarchy
Node Structure [1] that is the parent of the CPU cores and in turn
has a parent Processor Hierarchy Nodes Structure representing
a higher level of topology.

For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data, but each cluster
has local L3 tag. On the other hand, each clusters will share some
internal system bus.

+-----------------------------------+                          +---------+
|  +------+    +------+             +--------------------------+         |
|  | CPU0 |    | cpu1 |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+   cluster   |    |    tag    |         |         |
|  | CPU2 |    | CPU3 |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   |    |    L3     |         |         |
|  +------+    +------+             +----+    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |   L3    |
                                                               |   data  |
+-----------------------------------+                          |         |
|  +------+    +------+             |    +-----------+         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             +----+    L3     |         |         |
|                                   |    |    tag    |         |         |
|  +------+    +------+             |    |           |         |         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             +--------------------------+         |
+-----------------------------------|                          |         |
+-----------------------------------|                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+             |    |    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |   +-----------+          |         |
|  +------+    +------+             |   |           |          |         |
|                                   |   |    L3     |          |         |
|  +------+    +------+             +---+    tag    |          |         |
|  |      |    |      |             |   |           |          |         |
|  +------+    +------+             |   +-----------+          |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |  +-----------+           |         |
|  +------+    +------+             |  |           |           |         |
|                                   |  |    L3     |           |         |
|  +------+    +------+             +--+    tag    |           |         |
|  |      |    |      |             |  |           |           |         |
|  +------+    +------+             |  +-----------+           |         |
|                                   |                          +---------+
+-----------------------------------+

That means spreading tasks among clusters will bring more bandwidth
while packing tasks within one cluster will lead to smaller cache
synchronization latency. So both kernel and userspace will have
a chance to leverage this topology to deploy tasks accordingly to
achieve either smaller cache latency within one cluster or an even
distribution of load among clusters for higher throughput.

This patch exposes cluster topology to both kernel and userspace.
Libraried like hwloc will know cluster by cluster_cpus and related
sysfs attributes. PoC of HWLOC support at [2].

Note this patch only handle the ACPI case.

Special consideration is needed for SMT processors, where it is
necessary to move 2 levels up the hierarchy from the leaf nodes
(thus skipping the processor core level).

Note that arm64 / ACPI does not provide any means of identifying
a die level in the topology but that may be unrelate to the cluster
level.

[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
    structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com
---
 Documentation/ABI/stable/sysfs-devices-system-cpu | 15 +++-
 Documentation/admin-guide/cputopology.rst         | 12 ++-
 arch/arm64/kernel/topology.c                      |  2 +-
 drivers/acpi/pptt.c                               | 67 ++++++++++++++-
 drivers/base/arch_topology.c                      | 15 +++-
 drivers/base/topology.c                           | 10 ++-
 include/linux/acpi.h                              |  5 +-
 include/linux/arch_topology.h                     |  5 +-
 include/linux/topology.h                          |  6 +-
 9 files changed, 133 insertions(+), 4 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-system-cpu b/Documentation/ABI/stable/sysfs-devices-system-cpu
index 516dafe..3965ce5 100644
--- a/Documentation/ABI/stable/sysfs-devices-system-cpu
+++ b/Documentation/ABI/stable/sysfs-devices-system-cpu
@@ -42,6 +42,12 @@ Description:    the CPU core ID of cpuX. Typically it is the hardware platform's
                 architecture and platform dependent.
 Values:         integer
 
+What:           /sys/devices/system/cpu/cpuX/topology/cluster_id
+Description:    the cluster ID of cpuX.  Typically it is the hardware platform's
+                identifier (rather than the kernel's). The actual value is
+                architecture and platform dependent.
+Values:         integer
+
 What:           /sys/devices/system/cpu/cpuX/topology/book_id
 Description:    the book ID of cpuX. Typically it is the hardware platform's
                 identifier (rather than the kernel's). The actual value is
@@ -85,6 +91,15 @@ Description:    human-readable list of CPUs within the same die.
                 The format is like 0-3, 8-11, 14,17.
 Values:         decimal list.
 
+What:           /sys/devices/system/cpu/cpuX/topology/cluster_cpus
+Description:    internal kernel map of CPUs within the same cluster.
+Values:         hexadecimal bitmask.
+
+What:           /sys/devices/system/cpu/cpuX/topology/cluster_cpus_list
+Description:    human-readable list of CPUs within the same cluster.
+                The format is like 0-3, 8-11, 14,17.
+Values:         decimal list.
+
 What:           /sys/devices/system/cpu/cpuX/topology/book_siblings
 Description:    internal kernel map of cpuX's hardware threads within the same
                 book_id. it's only used on s390.
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
index b085dba..6b62e18 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -19,11 +19,13 @@ these macros in include/asm-XXX/topology.h::
 
 	#define topology_physical_package_id(cpu)
 	#define topology_die_id(cpu)
+	#define topology_cluster_id(cpu)
 	#define topology_core_id(cpu)
 	#define topology_book_id(cpu)
 	#define topology_drawer_id(cpu)
 	#define topology_sibling_cpumask(cpu)
 	#define topology_core_cpumask(cpu)
+	#define topology_cluster_cpumask(cpu)
 	#define topology_die_cpumask(cpu)
 	#define topology_book_cpumask(cpu)
 	#define topology_drawer_cpumask(cpu)
@@ -39,10 +41,12 @@ not defined by include/asm-XXX/topology.h:
 
 1) topology_physical_package_id: -1
 2) topology_die_id: -1
-3) topology_core_id: 0
-4) topology_sibling_cpumask: just the given CPU
-5) topology_core_cpumask: just the given CPU
-6) topology_die_cpumask: just the given CPU
+3) topology_cluster_id: -1
+4) topology_core_id: 0
+5) topology_sibling_cpumask: just the given CPU
+6) topology_core_cpumask: just the given CPU
+7) topology_cluster_cpumask: just the given CPU
+8) topology_die_cpumask: just the given CPU
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 4dd14a6..9ab78ad 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -103,6 +103,8 @@ int __init parse_acpi_topology(void)
 			cpu_topology[cpu].thread_id  = -1;
 			cpu_topology[cpu].core_id    = topology_id;
 		}
+		topology_id = find_acpi_cpu_topology_cluster(cpu);
+		cpu_topology[cpu].cluster_id = topology_id;
 		topology_id = find_acpi_cpu_topology_package(cpu);
 		cpu_topology[cpu].package_id = topology_id;
 
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
index fe69dc5..701f61c 100644
--- a/drivers/acpi/pptt.c
+++ b/drivers/acpi/pptt.c
@@ -747,6 +747,73 @@ int find_acpi_cpu_topology_package(unsigned int cpu)
 }
 
 /**
+ * find_acpi_cpu_topology_cluster() - Determine a unique CPU cluster value
+ * @cpu: Kernel logical CPU number
+ *
+ * Determine a topology unique cluster ID for the given CPU/thread.
+ * This ID can then be used to group peers, which will have matching ids.
+ *
+ * The cluster, if present is the level of topology above CPUs. In a
+ * multi-thread CPU, it will be the level above the CPU, not the thread.
+ * It may not exist in single CPU systems. In simple multi-CPU systems,
+ * it may be equal to the package topology level.
+ *
+ * Return: -ENOENT if the PPTT doesn't exist, the CPU cannot be found
+ * or there is no toplogy level above the CPU..
+ * Otherwise returns a value which represents the package for this CPU.
+ */
+
+int find_acpi_cpu_topology_cluster(unsigned int cpu)
+{
+	struct acpi_table_header *table;
+	acpi_status status;
+	struct acpi_pptt_processor *cpu_node, *cluster_node;
+	u32 acpi_cpu_id;
+	int retval;
+	int is_thread;
+
+	status = acpi_get_table(ACPI_SIG_PPTT, 0, &table);
+	if (ACPI_FAILURE(status)) {
+		acpi_pptt_warn_missing();
+		return -ENOENT;
+	}
+
+	acpi_cpu_id = get_acpi_id_for_cpu(cpu);
+	cpu_node = acpi_find_processor_node(table, acpi_cpu_id);
+	if (cpu_node == NULL || !cpu_node->parent) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+
+	is_thread = cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_IS_THREAD;
+	cluster_node = fetch_pptt_node(table, cpu_node->parent);
+	if (cluster_node == NULL) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+	if (is_thread) {
+		if (!cluster_node->parent) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+		cluster_node = fetch_pptt_node(table, cluster_node->parent);
+		if (cluster_node == NULL) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+	}
+	if (cluster_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID)
+		retval = cluster_node->acpi_processor_id;
+	else
+		retval = ACPI_PTR_DIFF(cluster_node, table);
+
+put_table:
+	acpi_put_table(table);
+
+	return retval;
+}
+
+/**
  * find_acpi_cpu_topology_hetero_id() - Get a core architecture tag
  * @cpu: Kernel logical CPU number
  *
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 4340766..fc0836f 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -600,6 +600,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return core_mask;
 }
 
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	return &cpu_topology[cpu].cluster_sibling;
+}
+
 void update_siblings_masks(unsigned int cpuid)
 {
 	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -617,6 +622,12 @@ void update_siblings_masks(unsigned int cpuid)
 		if (cpuid_topo->package_id != cpu_topo->package_id)
 			continue;
 
+		if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
+		    cpuid_topo->cluster_id != -1) {
+			cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
+			cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
+		}
+
 		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
 		cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
 
@@ -635,6 +646,9 @@ static void clear_cpu_topology(int cpu)
 	cpumask_clear(&cpu_topo->llc_sibling);
 	cpumask_set_cpu(cpu, &cpu_topo->llc_sibling);
 
+	cpumask_clear(&cpu_topo->cluster_sibling);
+	cpumask_set_cpu(cpu, &cpu_topo->cluster_sibling);
+
 	cpumask_clear(&cpu_topo->core_sibling);
 	cpumask_set_cpu(cpu, &cpu_topo->core_sibling);
 	cpumask_clear(&cpu_topo->thread_sibling);
@@ -650,6 +664,7 @@ void __init reset_cpu_topology(void)
 
 		cpu_topo->thread_id = -1;
 		cpu_topo->core_id = -1;
+		cpu_topo->cluster_id = -1;
 		cpu_topo->package_id = -1;
 		cpu_topo->llc_id = -1;
 
diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index 43c0940..8f2b641 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -48,6 +48,9 @@ static DEVICE_ATTR_RO(physical_package_id);
 define_id_show_func(die_id);
 static DEVICE_ATTR_RO(die_id);
 
+define_id_show_func(cluster_id);
+static DEVICE_ATTR_RO(cluster_id);
+
 define_id_show_func(core_id);
 static DEVICE_ATTR_RO(core_id);
 
@@ -63,6 +66,10 @@ define_siblings_read_func(core_siblings, core_cpumask);
 static BIN_ATTR_RO(core_siblings, 0);
 static BIN_ATTR_RO(core_siblings_list, 0);
 
+define_siblings_read_func(cluster_cpus, cluster_cpumask);
+static BIN_ATTR_RO(cluster_cpus, 0);
+static BIN_ATTR_RO(cluster_cpus_list, 0);
+
 define_siblings_read_func(die_cpus, die_cpumask);
 static BIN_ATTR_RO(die_cpus, 0);
 static BIN_ATTR_RO(die_cpus_list, 0);
@@ -94,6 +101,8 @@ static struct bin_attribute *bin_attrs[] = {
 	&bin_attr_thread_siblings_list,
 	&bin_attr_core_siblings,
 	&bin_attr_core_siblings_list,
+	&bin_attr_cluster_cpus,
+	&bin_attr_cluster_cpus_list,
 	&bin_attr_die_cpus,
 	&bin_attr_die_cpus_list,
 	&bin_attr_package_cpus,
@@ -112,6 +121,7 @@ static struct bin_attribute *bin_attrs[] = {
 static struct attribute *default_attrs[] = {
 	&dev_attr_physical_package_id.attr,
 	&dev_attr_die_id.attr,
+	&dev_attr_cluster_id.attr,
 	&dev_attr_core_id.attr,
 #ifdef CONFIG_SCHED_BOOK
 	&dev_attr_book_id.attr,
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 974d497..fbc2146 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -1353,6 +1353,7 @@ static inline int lpit_read_residency_count_address(u64 *address)
 #ifdef CONFIG_ACPI_PPTT
 int acpi_pptt_cpu_is_thread(unsigned int cpu);
 int find_acpi_cpu_topology(unsigned int cpu, int level);
+int find_acpi_cpu_topology_cluster(unsigned int cpu);
 int find_acpi_cpu_topology_package(unsigned int cpu);
 int find_acpi_cpu_topology_hetero_id(unsigned int cpu);
 int find_acpi_cpu_cache_topology(unsigned int cpu, int level);
@@ -1365,6 +1366,10 @@ static inline int find_acpi_cpu_topology(unsigned int cpu, int level)
 {
 	return -EINVAL;
 }
+static inline int find_acpi_cpu_topology_cluster(unsigned int cpu)
+{
+	return -EINVAL;
+}
 static inline int find_acpi_cpu_topology_package(unsigned int cpu)
 {
 	return -EINVAL;
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index f180240..b97cea8 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -62,10 +62,12 @@ void topology_set_thermal_pressure(const struct cpumask *cpus,
 struct cpu_topology {
 	int thread_id;
 	int core_id;
+	int cluster_id;
 	int package_id;
 	int llc_id;
 	cpumask_t thread_sibling;
 	cpumask_t core_sibling;
+	cpumask_t cluster_sibling;
 	cpumask_t llc_sibling;
 };
 
@@ -73,13 +75,16 @@ struct cpu_topology {
 extern struct cpu_topology cpu_topology[NR_CPUS];
 
 #define topology_physical_package_id(cpu)	(cpu_topology[cpu].package_id)
+#define topology_cluster_id(cpu)	(cpu_topology[cpu].cluster_id)
 #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
 #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
 #define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
+#define topology_cluster_cpumask(cpu)	(&cpu_topology[cpu].cluster_sibling)
 #define topology_llc_cpumask(cpu)	(&cpu_topology[cpu].llc_sibling)
 void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
+const struct cpumask *cpu_clustergroup_mask(int cpu);
 void update_siblings_masks(unsigned int cpu);
 void remove_cpu_topology(unsigned int cpuid);
 void reset_cpu_topology(void);
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7634cd7..80d27d7 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -186,6 +186,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_die_id
 #define topology_die_id(cpu)			((void)(cpu), -1)
 #endif
+#ifndef topology_cluster_id
+#define topology_cluster_id(cpu)		((void)(cpu), -1)
+#endif
 #ifndef topology_core_id
 #define topology_core_id(cpu)			((void)(cpu), 0)
 #endif
@@ -195,6 +198,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu)		cpumask_of(cpu)
 #endif
+#ifndef topology_cluster_cpumask
+#define topology_cluster_cpumask(cpu)		cpumask_of(cpu)
+#endif
 #ifndef topology_die_cpumask
 #define topology_die_cpumask(cpu)		cpumask_of(cpu)
 #endif

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-15  9:44   ` [tip: sched/core] sched: " tip-bot2 for Tim Chen
@ 2021-10-20 13:12     ` Tom Lendacky
  2021-10-20 19:51       ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Tom Lendacky @ 2021-10-20 13:12 UTC (permalink / raw)
  To: linux-kernel, linux-tip-commits
  Cc: Tim Chen, Barry Song, Peter Zijlstra (Intel), x86

On 10/15/21 4:44 AM, tip-bot2 for Tim Chen wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     66558b730f2533cc2bf2b74d51f5f80b81e2bad0
> Gitweb:        https://git.kernel.org/tip/66558b730f2533cc2bf2b74d51f5f80b81e2bad0
> Author:        Tim Chen <tim.c.chen@linux.intel.com>
> AuthorDate:    Fri, 24 Sep 2021 20:51:04 +12:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Fri, 15 Oct 2021 11:25:16 +02:00
> 
> sched: Add cluster scheduler level for x86
> 
> There are x86 CPU architectures (e.g. Jacobsville) where L2 cahce is
> shared among a cluster of cores instead of being exclusive to one
> single core.
> 
> To prevent oversubscription of L2 cache, load should be balanced
> between such L2 clusters, especially for tasks with no shared data.
> On benchmark such as SPECrate mcf test, this change provides a boost
> to performance especially on medium load system on Jacobsville.  on a
> Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4
> cores each, the benchmark number is as follow:
> 
>   Improvement over baseline kernel for mcf_r
>   copies		run time	base rate
>   1		-0.1%		-0.2%
>   6		25.1%		25.1%
>   12		18.8%		19.0%
>   24		0.3%		0.3%
> 
> So this looks pretty good. In terms of the system's task distribution,
> some pretty bad clumping can be seen for the vanilla kernel without
> the L2 cluster domain for the 6 and 12 copies case. With the extra
> domain for cluster, the load does get evened out between the clusters.
> 
> Note this patch isn't an universal win as spreading isn't necessarily
> a win, particually for those workload who can benefit from packing.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lore.kernel.org/r/20210924085104.44806-4-21cnbao@gmail.com

I've bisected to this patch which now results in my EPYC systems issuing a 
lot of:

[    4.788480] BUG: arch topology borken
[    4.789578]      the SMT domain not a subset of the CLS domain

messages (one for each CPU in the system).

I haven't had a chance to dig deeper and understand everything, does 
anyone have some quick insights/ideas?

Thanks,
Tom


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-20 13:12     ` Tom Lendacky
@ 2021-10-20 19:51       ` Peter Zijlstra
  2021-10-20 20:08         ` Tom Lendacky
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-20 19:51 UTC (permalink / raw)
  To: Tom Lendacky; +Cc: linux-kernel, linux-tip-commits, Tim Chen, Barry Song, x86

On Wed, Oct 20, 2021 at 08:12:51AM -0500, Tom Lendacky wrote:
> On 10/15/21 4:44 AM, tip-bot2 for Tim Chen wrote:
> > The following commit has been merged into the sched/core branch of tip:
> > 
> > Commit-ID:     66558b730f2533cc2bf2b74d51f5f80b81e2bad0
> > Gitweb:        https://git.kernel.org/tip/66558b730f2533cc2bf2b74d51f5f80b81e2bad0
> > Author:        Tim Chen <tim.c.chen@linux.intel.com>
> > AuthorDate:    Fri, 24 Sep 2021 20:51:04 +12:00
> > Committer:     Peter Zijlstra <peterz@infradead.org>
> > CommitterDate: Fri, 15 Oct 2021 11:25:16 +02:00
> > 
> > sched: Add cluster scheduler level for x86
> > 
> > There are x86 CPU architectures (e.g. Jacobsville) where L2 cahce is
> > shared among a cluster of cores instead of being exclusive to one
> > single core.
> > 
> > To prevent oversubscription of L2 cache, load should be balanced
> > between such L2 clusters, especially for tasks with no shared data.
> > On benchmark such as SPECrate mcf test, this change provides a boost
> > to performance especially on medium load system on Jacobsville.  on a
> > Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4
> > cores each, the benchmark number is as follow:
> > 
> >   Improvement over baseline kernel for mcf_r
> >   copies		run time	base rate
> >   1		-0.1%		-0.2%
> >   6		25.1%		25.1%
> >   12		18.8%		19.0%
> >   24		0.3%		0.3%
> > 
> > So this looks pretty good. In terms of the system's task distribution,
> > some pretty bad clumping can be seen for the vanilla kernel without
> > the L2 cluster domain for the 6 and 12 copies case. With the extra
> > domain for cluster, the load does get evened out between the clusters.
> > 
> > Note this patch isn't an universal win as spreading isn't necessarily
> > a win, particually for those workload who can benefit from packing.
> > 
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Link: https://lore.kernel.org/r/20210924085104.44806-4-21cnbao@gmail.com
> 
> I've bisected to this patch which now results in my EPYC systems issuing a
> lot of:
> 
> [    4.788480] BUG: arch topology borken
> [    4.789578]      the SMT domain not a subset of the CLS domain
> 
> messages (one for each CPU in the system).
> 
> I haven't had a chance to dig deeper and understand everything, does anyone
> have some quick insights/ideas?

Urgh, sorry about that. So this stuff uses cpu_l2c_id to build 'clusters',
basically CPUs that share L2, as a subset of the 'multi-core' topology,
which is all CPUs that share LLC (L3 typically).

Your EPYC seems to think the SMT group is not a strict subset of the L2.
The implication is that you have threads with different L2, which would
franky be quite 'exotic' if true :-)


If it does boot, what does something like:

  for i in /sys/devices/system/cpu/cpu*/topology/*{_id,_list}; do echo -n "${i}: " ; cat $i; done

produce?

I'll try and figure out how AMD sets l2c_id, that stuff is always a bit
of a maze.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-20 19:51       ` Peter Zijlstra
@ 2021-10-20 20:08         ` Tom Lendacky
  2021-10-20 20:25           ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Tom Lendacky @ 2021-10-20 20:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, linux-tip-commits, Tim Chen, Barry Song, x86

On 10/20/21 2:51 PM, Peter Zijlstra wrote:
> On Wed, Oct 20, 2021 at 08:12:51AM -0500, Tom Lendacky wrote:
>> On 10/15/21 4:44 AM, tip-bot2 for Tim Chen wrote:
>>> The following commit has been merged into the sched/core branch of tip:

> 
> If it does boot, what does something like:
> 
>    for i in /sys/devices/system/cpu/cpu*/topology/*{_id,_list}; do echo -n "${i}: " ; cat $i; done
> 
> produce?

The output is about 160K in size, I'll email it to you off-list.

> 
> I'll try and figure out how AMD sets l2c_id, that stuff is always a bit
> of a maze.

Thanks!

Tom

> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-20 20:08         ` Tom Lendacky
@ 2021-10-20 20:25           ` Peter Zijlstra
  2021-10-20 20:36             ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-20 20:25 UTC (permalink / raw)
  To: Tom Lendacky; +Cc: linux-kernel, linux-tip-commits, Tim Chen, Barry Song, x86

On Wed, Oct 20, 2021 at 03:08:41PM -0500, Tom Lendacky wrote:
> On 10/20/21 2:51 PM, Peter Zijlstra wrote:
> > On Wed, Oct 20, 2021 at 08:12:51AM -0500, Tom Lendacky wrote:
> > > On 10/15/21 4:44 AM, tip-bot2 for Tim Chen wrote:
> > > > The following commit has been merged into the sched/core branch of tip:
> 
> > 
> > If it does boot, what does something like:
> > 
> >    for i in /sys/devices/system/cpu/cpu*/topology/*{_id,_list}; do echo -n "${i}: " ; cat $i; done
> > 
> > produce?
> 
> The output is about 160K in size, I'll email it to you off-list.

/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list: 0
/sys/devices/system/cpu/cpu0/topology/core_cpus_list: 0,128

/sys/devices/system/cpu/cpu128/topology/cluster_cpus_list: 128
/sys/devices/system/cpu/cpu128/topology/core_cpus_list: 0,128

So for some reason that thing thinks each SMT thread has it's own L2,
which seems rather unlikely. Or SMT has started to mean something
radically different than it used to be :-)

Let me continue trying to make sense of cacheinfo.c


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-20 20:25           ` Peter Zijlstra
@ 2021-10-20 20:36             ` Peter Zijlstra
  2021-10-20 20:40               ` Tom Lendacky
  2021-10-20 20:40               ` Peter Zijlstra
  0 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-20 20:36 UTC (permalink / raw)
  To: Tom Lendacky; +Cc: linux-kernel, linux-tip-commits, Tim Chen, Barry Song, x86

On Wed, Oct 20, 2021 at 10:25:42PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 20, 2021 at 03:08:41PM -0500, Tom Lendacky wrote:
> > On 10/20/21 2:51 PM, Peter Zijlstra wrote:
> > > On Wed, Oct 20, 2021 at 08:12:51AM -0500, Tom Lendacky wrote:
> > > > On 10/15/21 4:44 AM, tip-bot2 for Tim Chen wrote:
> > > > > The following commit has been merged into the sched/core branch of tip:
> > 
> > > 
> > > If it does boot, what does something like:
> > > 
> > >    for i in /sys/devices/system/cpu/cpu*/topology/*{_id,_list}; do echo -n "${i}: " ; cat $i; done
> > > 
> > > produce?
> > 
> > The output is about 160K in size, I'll email it to you off-list.
> 
> /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list: 0
> /sys/devices/system/cpu/cpu0/topology/core_cpus_list: 0,128
> 
> /sys/devices/system/cpu/cpu128/topology/cluster_cpus_list: 128
> /sys/devices/system/cpu/cpu128/topology/core_cpus_list: 0,128
> 
> So for some reason that thing thinks each SMT thread has it's own L2,
> which seems rather unlikely. Or SMT has started to mean something
> radically different than it used to be :-)
> 
> Let me continue trying to make sense of cacheinfo.c

OK, I think I see what's happening.

AFAICT cacheinfo.c does *NOT* set l2c_id on AMD/Hygon hardware, this
means it's set to BAD_APICID.

This then results in match_l2c() to never match. And as a direct
consequence set_cpu_sibling_map() will generate cpu_l2c_shared_mask with
just the one CPU set.

And we have the above result and things come unstuck if we assume:
  SMT <= L2 <= LLC

Now, the big question, how to fix this... Does AMD have means of
actually setting l2c_id or should we fall back to using match_smt() for
l2c_id == BAD_APICID ?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-20 20:36             ` Peter Zijlstra
@ 2021-10-20 20:40               ` Tom Lendacky
  2021-10-20 20:40               ` Peter Zijlstra
  1 sibling, 0 replies; 46+ messages in thread
From: Tom Lendacky @ 2021-10-20 20:40 UTC (permalink / raw)
  To: Peter Zijlstra, Suravee Suthikulpanit
  Cc: linux-kernel, linux-tip-commits, Tim Chen, Barry Song, x86

On 10/20/21 3:36 PM, Peter Zijlstra wrote:
> On Wed, Oct 20, 2021 at 10:25:42PM +0200, Peter Zijlstra wrote:
>> On Wed, Oct 20, 2021 at 03:08:41PM -0500, Tom Lendacky wrote:
>>> On 10/20/21 2:51 PM, Peter Zijlstra wrote:
>>>> On Wed, Oct 20, 2021 at 08:12:51AM -0500, Tom Lendacky wrote:
>>>>> On 10/15/21 4:44 AM, tip-bot2 for Tim Chen wrote:
>>>>>> The following commit has been merged into the sched/core branch of tip:
>>>
>>>>
>>>> If it does boot, what does something like:
>>>>
>>>>     for i in /sys/devices/system/cpu/cpu*/topology/*{_id,_list}; do echo -n "${i}: " ; cat $i; done
>>>>
>>>> produce?
>>>
>>> The output is about 160K in size, I'll email it to you off-list.
>>
>> /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list: 0
>> /sys/devices/system/cpu/cpu0/topology/core_cpus_list: 0,128
>>
>> /sys/devices/system/cpu/cpu128/topology/cluster_cpus_list: 128
>> /sys/devices/system/cpu/cpu128/topology/core_cpus_list: 0,128
>>
>> So for some reason that thing thinks each SMT thread has it's own L2,
>> which seems rather unlikely. Or SMT has started to mean something
>> radically different than it used to be :-)
>>
>> Let me continue trying to make sense of cacheinfo.c
> 
> OK, I think I see what's happening.
> 
> AFAICT cacheinfo.c does *NOT* set l2c_id on AMD/Hygon hardware, this
> means it's set to BAD_APICID.
> 
> This then results in match_l2c() to never match. And as a direct
> consequence set_cpu_sibling_map() will generate cpu_l2c_shared_mask with
> just the one CPU set.
> 
> And we have the above result and things come unstuck if we assume:
>    SMT <= L2 <= LLC
> 
> Now, the big question, how to fix this... Does AMD have means of
> actually setting l2c_id or should we fall back to using match_smt() for
> l2c_id == BAD_APICID ?

Let me include Suravee, who has more experience with our topology and 
cache information.

Thanks,
Tom

> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-20 20:36             ` Peter Zijlstra
  2021-10-20 20:40               ` Tom Lendacky
@ 2021-10-20 20:40               ` Peter Zijlstra
  2021-10-20 20:51                 ` Tom Lendacky
                                   ` (2 more replies)
  1 sibling, 3 replies; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-20 20:40 UTC (permalink / raw)
  To: Tom Lendacky; +Cc: linux-kernel, linux-tip-commits, Tim Chen, Barry Song, x86

On Wed, Oct 20, 2021 at 10:36:19PM +0200, Peter Zijlstra wrote:

> OK, I think I see what's happening.
> 
> AFAICT cacheinfo.c does *NOT* set l2c_id on AMD/Hygon hardware, this
> means it's set to BAD_APICID.
> 
> This then results in match_l2c() to never match. And as a direct
> consequence set_cpu_sibling_map() will generate cpu_l2c_shared_mask with
> just the one CPU set.
> 
> And we have the above result and things come unstuck if we assume:
>   SMT <= L2 <= LLC
> 
> Now, the big question, how to fix this... Does AMD have means of
> actually setting l2c_id or should we fall back to using match_smt() for
> l2c_id == BAD_APICID ?

The latter looks something like the below and ought to make EPYC at
least function as it did before.


---
 arch/x86/kernel/smpboot.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 849159797101..c2671b2333d1 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -472,7 +472,7 @@ static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 
 	/* Do not match if we do not have a valid APICID for cpu: */
 	if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
-		return false;
+		return match_smt(c, o); /* assume at least SMT shares L2 */
 
 	/* Do not match if L2 cache id does not match: */
 	if (per_cpu(cpu_l2c_id, cpu1) != per_cpu(cpu_l2c_id, cpu2))

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-20 20:40               ` Peter Zijlstra
@ 2021-10-20 20:51                 ` Tom Lendacky
  2021-10-21 10:32                 ` Barry Song
  2021-10-22 13:31                 ` Tom Lendacky
  2 siblings, 0 replies; 46+ messages in thread
From: Tom Lendacky @ 2021-10-20 20:51 UTC (permalink / raw)
  To: Peter Zijlstra, Suravee Suthikulpanit
  Cc: linux-kernel, linux-tip-commits, Tim Chen, Barry Song, x86

On 10/20/21 3:40 PM, Peter Zijlstra wrote:
> On Wed, Oct 20, 2021 at 10:36:19PM +0200, Peter Zijlstra wrote:
> 
>> OK, I think I see what's happening.
>>
>> AFAICT cacheinfo.c does *NOT* set l2c_id on AMD/Hygon hardware, this
>> means it's set to BAD_APICID.
>>
>> This then results in match_l2c() to never match. And as a direct
>> consequence set_cpu_sibling_map() will generate cpu_l2c_shared_mask with
>> just the one CPU set.
>>
>> And we have the above result and things come unstuck if we assume:
>>    SMT <= L2 <= LLC
>>
>> Now, the big question, how to fix this... Does AMD have means of
>> actually setting l2c_id or should we fall back to using match_smt() for
>> l2c_id == BAD_APICID ?
> 
> The latter looks something like the below and ought to make EPYC at
> least function as it did before.
> 
> 
> ---
>   arch/x86/kernel/smpboot.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 849159797101..c2671b2333d1 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -472,7 +472,7 @@ static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
>   
>   	/* Do not match if we do not have a valid APICID for cpu: */
>   	if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
> -		return false;
> +		return match_smt(c, o); /* assume at least SMT shares L2 */
>   
>   	/* Do not match if L2 cache id does not match: */
>   	if (per_cpu(cpu_l2c_id, cpu1) != per_cpu(cpu_l2c_id, cpu2))
> 

Adding Suravee.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-20 20:40               ` Peter Zijlstra
  2021-10-20 20:51                 ` Tom Lendacky
@ 2021-10-21 10:32                 ` Barry Song
  2021-10-21 10:54                   ` Barry Song
  2021-10-21 13:22                   ` Peter Zijlstra
  2021-10-22 13:31                 ` Tom Lendacky
  2 siblings, 2 replies; 46+ messages in thread
From: Barry Song @ 2021-10-21 10:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tom Lendacky, LKML, linux-tip-commits, Tim Chen, Barry Song, x86

On Thu, Oct 21, 2021 at 9:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 20, 2021 at 10:36:19PM +0200, Peter Zijlstra wrote:
>
> > OK, I think I see what's happening.
> >
> > AFAICT cacheinfo.c does *NOT* set l2c_id on AMD/Hygon hardware, this
> > means it's set to BAD_APICID.
> >
> > This then results in match_l2c() to never match. And as a direct
> > consequence set_cpu_sibling_map() will generate cpu_l2c_shared_mask with
> > just the one CPU set.
> >
> > And we have the above result and things come unstuck if we assume:
> >   SMT <= L2 <= LLC
> >
> > Now, the big question, how to fix this... Does AMD have means of
> > actually setting l2c_id or should we fall back to using match_smt() for
> > l2c_id == BAD_APICID ?
>
> The latter looks something like the below and ought to make EPYC at
> least function as it did before.
>
>
> ---
>  arch/x86/kernel/smpboot.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 849159797101..c2671b2333d1 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -472,7 +472,7 @@ static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
>
>         /* Do not match if we do not have a valid APICID for cpu: */
>         if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
> -               return false;
> +               return match_smt(c, o); /* assume at least SMT shares L2 */

Rather than making a fake cluster_cpus and cluster_cpus_list which
will expose to userspace
through /sys/devices/cpus/cpux/topology, could we just fix the
sched_domain mask by the
below?
It will be odd to users that a cpu has BAD cluster_id but has
"meaningful" cluster_cpus and
cluster_cpus_list in sys.

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5094ab0bae58..0f9d706a7507 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -687,6 +687,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)

 const struct cpumask *cpu_clustergroup_mask(int cpu)
 {
+       /*
+        * if L2(cluster) is not represented, make cluster sched_domain
+        * same with smt domain, so that this redundant sched_domain can
+        * be dropped and we can avoid the complaint "the SMT domain not
+        * a subset of the cluster domain"
+        */
+       if (cpumask_subset(cpu_l2c_shared_mask(cpu), cpu_smt_mask(cpu)))
+               return cpu_smt_mask(cpu);
+
        return cpu_l2c_shared_mask(cpu);
 }



>
>         /* Do not match if L2 cache id does not match: */
>         if (per_cpu(cpu_l2c_id, cpu1) != per_cpu(cpu_l2c_id, cpu2))

Thanks
Barry

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-21 10:32                 ` Barry Song
@ 2021-10-21 10:54                   ` Barry Song
  2021-10-21 13:22                   ` Peter Zijlstra
  1 sibling, 0 replies; 46+ messages in thread
From: Barry Song @ 2021-10-21 10:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tom Lendacky, LKML, linux-tip-commits, Tim Chen, Barry Song, x86

On Thu, Oct 21, 2021 at 11:32 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Oct 21, 2021 at 9:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Oct 20, 2021 at 10:36:19PM +0200, Peter Zijlstra wrote:
> >
> > > OK, I think I see what's happening.
> > >
> > > AFAICT cacheinfo.c does *NOT* set l2c_id on AMD/Hygon hardware, this
> > > means it's set to BAD_APICID.
> > >
> > > This then results in match_l2c() to never match. And as a direct
> > > consequence set_cpu_sibling_map() will generate cpu_l2c_shared_mask with
> > > just the one CPU set.
> > >
> > > And we have the above result and things come unstuck if we assume:
> > >   SMT <= L2 <= LLC
> > >
> > > Now, the big question, how to fix this... Does AMD have means of
> > > actually setting l2c_id or should we fall back to using match_smt() for
> > > l2c_id == BAD_APICID ?
> >
> > The latter looks something like the below and ought to make EPYC at
> > least function as it did before.
> >
> >
> > ---
> >  arch/x86/kernel/smpboot.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 849159797101..c2671b2333d1 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -472,7 +472,7 @@ static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
> >
> >         /* Do not match if we do not have a valid APICID for cpu: */
> >         if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
> > -               return false;
> > +               return match_smt(c, o); /* assume at least SMT shares L2 */
>
> Rather than making a fake cluster_cpus and cluster_cpus_list which
> will expose to userspace
> through /sys/devices/cpus/cpux/topology, could we just fix the
> sched_domain mask by the
> below?
> It will be odd to users that a cpu has BAD cluster_id but has
> "meaningful" cluster_cpus and
> cluster_cpus_list in sys.
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 5094ab0bae58..0f9d706a7507 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -687,6 +687,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>
>  const struct cpumask *cpu_clustergroup_mask(int cpu)
>  {
> +       /*
> +        * if L2(cluster) is not represented, make cluster sched_domain
> +        * same with smt domain, so that this redundant sched_domain can
> +        * be dropped and we can avoid the complaint "the SMT domain not
> +        * a subset of the cluster domain"
> +        */
> +       if (cpumask_subset(cpu_l2c_shared_mask(cpu), cpu_smt_mask(cpu)))
> +               return cpu_smt_mask(cpu);
> +
>         return cpu_l2c_shared_mask(cpu);
>  }

needs to change this line as well, otherwise, sys is still using the
cpu_clustergroup_mask:

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index cc164777e661..c8cd4e9732bd 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -116,7 +116,7 @@ extern unsigned int __max_die_per_package;
 #ifdef CONFIG_SMP
 #define topology_cluster_id(cpu)               (per_cpu(cpu_l2c_id, cpu))
 #define topology_die_cpumask(cpu)              (per_cpu(cpu_die_map, cpu))
-#define topology_cluster_cpumask(cpu)          (cpu_clustergroup_mask(cpu))
+#define topology_cluster_cpumask(cpu)          (cpu_l2c_shared_mask(cpu))
 #define topology_core_cpumask(cpu)             (per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)          (per_cpu(cpu_sibling_map, cpu))

so sys will use topology_cluster_cpumask which describes hardware as it is;
cpu_clustergroup_mask() describes what is given to the scheduler.

>
>
>
> >
> >         /* Do not match if L2 cache id does not match: */
> >         if (per_cpu(cpu_l2c_id, cpu1) != per_cpu(cpu_l2c_id, cpu2))
>
> Thanks
> Barry

Thanks
barry

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-21 10:32                 ` Barry Song
  2021-10-21 10:54                   ` Barry Song
@ 2021-10-21 13:22                   ` Peter Zijlstra
  2021-10-21 22:23                     ` Song Bao Hua (Barry Song)
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-21 13:22 UTC (permalink / raw)
  To: Barry Song
  Cc: Tom Lendacky, LKML, linux-tip-commits, Tim Chen, Barry Song, x86

On Thu, Oct 21, 2021 at 11:32:36PM +1300, Barry Song wrote:
> On Thu, Oct 21, 2021 at 9:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Oct 20, 2021 at 10:36:19PM +0200, Peter Zijlstra wrote:
> >
> > > OK, I think I see what's happening.
> > >
> > > AFAICT cacheinfo.c does *NOT* set l2c_id on AMD/Hygon hardware, this
> > > means it's set to BAD_APICID.
> > >
> > > This then results in match_l2c() to never match. And as a direct
> > > consequence set_cpu_sibling_map() will generate cpu_l2c_shared_mask with
> > > just the one CPU set.
> > >
> > > And we have the above result and things come unstuck if we assume:
> > >   SMT <= L2 <= LLC
> > >
> > > Now, the big question, how to fix this... Does AMD have means of
> > > actually setting l2c_id or should we fall back to using match_smt() for
> > > l2c_id == BAD_APICID ?
> >
> > The latter looks something like the below and ought to make EPYC at
> > least function as it did before.
> >
> >
> > ---
> >  arch/x86/kernel/smpboot.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 849159797101..c2671b2333d1 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -472,7 +472,7 @@ static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
> >
> >         /* Do not match if we do not have a valid APICID for cpu: */
> >         if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
> > -               return false;
> > +               return match_smt(c, o); /* assume at least SMT shares L2 */
> 
> Rather than making a fake cluster_cpus and cluster_cpus_list which
> will expose to userspace
> through /sys/devices/cpus/cpux/topology, could we just fix the
> sched_domain mask by the
> below?

I don't think it's fake; SMT fundamentally has to share all cache
levels. And having the sched domains differ in setup from the reported
(nonsensical) topology also isn't appealing.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-21 13:22                   ` Peter Zijlstra
@ 2021-10-21 22:23                     ` Song Bao Hua (Barry Song)
  0 siblings, 0 replies; 46+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-10-21 22:23 UTC (permalink / raw)
  To: Peter Zijlstra, Barry Song
  Cc: Tom Lendacky, LKML, linux-tip-commits, Tim Chen, x86



> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Friday, October 22, 2021 2:23 AM
> To: Barry Song <21cnbao@gmail.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>; LKML
> <linux-kernel@vger.kernel.org>; linux-tip-commits@vger.kernel.org; Tim Chen
> <tim.c.chen@linux.intel.com>; Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com>; x86 <x86@kernel.org>
> Subject: Re: [tip: sched/core] sched: Add cluster scheduler level for x86
> 
> On Thu, Oct 21, 2021 at 11:32:36PM +1300, Barry Song wrote:
> > On Thu, Oct 21, 2021 at 9:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Oct 20, 2021 at 10:36:19PM +0200, Peter Zijlstra wrote:
> > >
> > > > OK, I think I see what's happening.
> > > >
> > > > AFAICT cacheinfo.c does *NOT* set l2c_id on AMD/Hygon hardware, this
> > > > means it's set to BAD_APICID.
> > > >
> > > > This then results in match_l2c() to never match. And as a direct
> > > > consequence set_cpu_sibling_map() will generate cpu_l2c_shared_mask with
> > > > just the one CPU set.
> > > >
> > > > And we have the above result and things come unstuck if we assume:
> > > >   SMT <= L2 <= LLC
> > > >
> > > > Now, the big question, how to fix this... Does AMD have means of
> > > > actually setting l2c_id or should we fall back to using match_smt() for
> > > > l2c_id == BAD_APICID ?
> > >
> > > The latter looks something like the below and ought to make EPYC at
> > > least function as it did before.
> > >
> > >
> > > ---
> > >  arch/x86/kernel/smpboot.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > > index 849159797101..c2671b2333d1 100644
> > > --- a/arch/x86/kernel/smpboot.c
> > > +++ b/arch/x86/kernel/smpboot.c
> > > @@ -472,7 +472,7 @@ static bool match_l2c(struct cpuinfo_x86 *c, struct
> cpuinfo_x86 *o)
> > >
> > >         /* Do not match if we do not have a valid APICID for cpu: */
> > >         if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
> > > -               return false;
> > > +               return match_smt(c, o); /* assume at least SMT shares L2 */
> >
> > Rather than making a fake cluster_cpus and cluster_cpus_list which
> > will expose to userspace
> > through /sys/devices/cpus/cpux/topology, could we just fix the
> > sched_domain mask by the
> > below?
> 
> I don't think it's fake; SMT fundamentally has to share all cache
> levels. And having the sched domains differ in setup from the reported
> (nonsensical) topology also isn't appealing.

Fair enough. I was actually inspired by cpu_coregroup_mask() which
is a combination of a couple of cpumask set:
drivers/base/arch_topology.c

const struct cpumask *cpu_coregroup_mask(int cpu)
{
	const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));

	/* Find the smaller of NUMA, core or LLC siblings */
	if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
		/* not numa in package, lets use the package siblings */
		core_mask = &cpu_topology[cpu].core_sibling;
	}
	if (cpu_topology[cpu].llc_id != -1) {
		if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
			core_mask = &cpu_topology[cpu].llc_sibling;
	}

	return core_mask;
}

Thanks
Barry


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-20 20:40               ` Peter Zijlstra
  2021-10-20 20:51                 ` Tom Lendacky
  2021-10-21 10:32                 ` Barry Song
@ 2021-10-22 13:31                 ` Tom Lendacky
  2021-10-22 13:36                   ` Peter Zijlstra
  2 siblings, 1 reply; 46+ messages in thread
From: Tom Lendacky @ 2021-10-22 13:31 UTC (permalink / raw)
  To: Peter Zijlstra, Suthikulpanit, Suravee
  Cc: linux-kernel, linux-tip-commits, Tim Chen, Barry Song, x86

On 10/20/21 3:40 PM, Peter Zijlstra wrote:
> On Wed, Oct 20, 2021 at 10:36:19PM +0200, Peter Zijlstra wrote:
> 
>> OK, I think I see what's happening.
>>
>> AFAICT cacheinfo.c does *NOT* set l2c_id on AMD/Hygon hardware, this
>> means it's set to BAD_APICID.
>>
>> This then results in match_l2c() to never match. And as a direct
>> consequence set_cpu_sibling_map() will generate cpu_l2c_shared_mask with
>> just the one CPU set.
>>
>> And we have the above result and things come unstuck if we assume:
>>    SMT <= L2 <= LLC
>>
>> Now, the big question, how to fix this... Does AMD have means of
>> actually setting l2c_id or should we fall back to using match_smt() for
>> l2c_id == BAD_APICID ?
> 
> The latter looks something like the below and ought to make EPYC at
> least function as it did before.
> 
> 
> ---
>   arch/x86/kernel/smpboot.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 849159797101..c2671b2333d1 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -472,7 +472,7 @@ static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
>   
>   	/* Do not match if we do not have a valid APICID for cpu: */
>   	if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
> -		return false;
> +		return match_smt(c, o); /* assume at least SMT shares L2 */

This does eliminate the message and seems like an appropriate thing to do, 
in general, if the l2c id is not set.

We're looking into setting the l2c id for AMD platforms, but need to test 
against some older platforms. We'll let you know the results next week.

In the mean time, it is probably best to at least apply your above patch.

Thanks,
Tom

>   
>   	/* Do not match if L2 cache id does not match: */
>   	if (per_cpu(cpu_l2c_id, cpu1) != per_cpu(cpu_l2c_id, cpu2))
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [tip: sched/core] sched: Add cluster scheduler level for x86
  2021-10-22 13:31                 ` Tom Lendacky
@ 2021-10-22 13:36                   ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2021-10-22 13:36 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Suthikulpanit, Suravee, linux-kernel, linux-tip-commits,
	Tim Chen, Barry Song, x86

On Fri, Oct 22, 2021 at 08:31:40AM -0500, Tom Lendacky wrote:
> On 10/20/21 3:40 PM, Peter Zijlstra wrote:
> >   arch/x86/kernel/smpboot.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 849159797101..c2671b2333d1 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -472,7 +472,7 @@ static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
> >   	/* Do not match if we do not have a valid APICID for cpu: */
> >   	if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
> > -		return false;
> > +		return match_smt(c, o); /* assume at least SMT shares L2 */
> 
> This does eliminate the message and seems like an appropriate thing to do,
> in general, if the l2c id is not set.
> 
> We're looking into setting the l2c id for AMD platforms, but need to test
> against some older platforms. We'll let you know the results next week.
> 
> In the mean time, it is probably best to at least apply your above patch.

OK, I'll go write up a Changelog and stick on your Reported- and
Tested-by tags.

Thanks!

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2021-10-22 13:38 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-24  8:51 [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters Barry Song
2021-09-24  8:51 ` [PATCH RESEND 1/3] topology: Represent clusters of CPUs within a die Barry Song
2021-10-05 16:33   ` Valentin Schneider
2021-10-05 20:43     ` Barry Song
2021-10-06 10:50       ` Barry Song
2021-10-06 12:18         ` Peter Zijlstra
2021-10-06 12:50           ` Barry Song
2021-10-06 13:55             ` Peter Zijlstra
2021-10-07 10:30               ` Barry Song
2021-10-07 10:35                 ` Peter Zijlstra
2021-10-06 13:49       ` Valentin Schneider
2021-10-15  9:44   ` [tip: sched/core] " tip-bot2 for Jonathan Cameron
2021-09-24  8:51 ` [PATCH RESEND 2/3] scheduler: Add cluster scheduler level in core and related Kconfig for ARM64 Barry Song
2021-10-05  7:35   ` Peter Zijlstra
2021-10-05  9:01     ` Barry Song
2021-10-05 10:40       ` Peter Zijlstra
2021-09-24  8:51 ` [PATCH RESEND 3/3] scheduler: Add cluster scheduler level for x86 Barry Song
2021-10-15  9:44   ` [tip: sched/core] sched: " tip-bot2 for Tim Chen
2021-10-20 13:12     ` Tom Lendacky
2021-10-20 19:51       ` Peter Zijlstra
2021-10-20 20:08         ` Tom Lendacky
2021-10-20 20:25           ` Peter Zijlstra
2021-10-20 20:36             ` Peter Zijlstra
2021-10-20 20:40               ` Tom Lendacky
2021-10-20 20:40               ` Peter Zijlstra
2021-10-20 20:51                 ` Tom Lendacky
2021-10-21 10:32                 ` Barry Song
2021-10-21 10:54                   ` Barry Song
2021-10-21 13:22                   ` Peter Zijlstra
2021-10-21 22:23                     ` Song Bao Hua (Barry Song)
2021-10-22 13:31                 ` Tom Lendacky
2021-10-22 13:36                   ` Peter Zijlstra
2021-10-01 10:32 ` [PATCH RESEND 0/3] Represent cluster topology and enable load balance between clusters Barry Song
2021-10-01 10:39   ` Vincent Guittot
2021-10-01 14:57     ` Peter Zijlstra
2021-10-01 23:22       ` Tim Chen
2021-10-02  7:09         ` Barry Song
2021-10-04 22:54           ` Tim Chen
2021-10-05  7:54             ` Peter Zijlstra
2021-10-05  8:04           ` Peter Zijlstra
2021-10-05  9:06             ` Barry Song
2021-10-05  7:50         ` Peter Zijlstra
2021-10-05  9:15           ` Barry Song
2021-10-05 10:58             ` Peter Zijlstra
2021-10-05 13:42           ` Valentin Schneider
2021-10-05 13:50             ` Valentin Schneider

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).